Session risk & rules

How Guardion accumulates risk across a conversation and how declarative rules use that context to escalate a decision beyond a single turn.

A single message is rarely the whole story. An attacker may probe across many turns, and abuse can build up gradually. Guardion tracks risk per session, identified by the session identifier you pass to POST /v1/guard, so the policy engine can react to patterns rather than only the current message.

How session risk is computed

Every detected risk carries a risk level, ordered by severity. A detector's level is the highest level among its enabled checks.

Risk level	Severity	Contribution
`critical`	Highest	Dominates the score.
`high`	High	Large.
`medium`	Medium	Moderate.
`low`	Low	Barely moves the score.

On each turn, the guardrails that fired produce a turn risk score — a weighted sum where a critical signal counts for many times a low one. The session risk score is then the running average of turn risk across the session:

session_risk_score = (sum of turn_risk_score over all turns) ÷ total_requests

Averaging rather than summing keeps the score bounded: as long as each turn risk score lies in 0.0–1.0, the session score does too. A session that is consistently risky trends high, while one bad turn in a long clean session does not peg it. The score, along with the per-level and per-label counts behind it, is updated after every evaluation.
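
The running average can be sketched in a few lines of Python. This is an illustration only, not Guardion's implementation: the weights in `LEVEL_WEIGHTS`, the clamp to 1.0, and the names `turn_risk` and `SessionRisk` are assumptions made for the example. The source states only that a critical signal counts for many times a low one.

```python
from dataclasses import dataclass

# Hypothetical weights, chosen only so that critical dominates and low barely moves the score.
LEVEL_WEIGHTS = {"critical": 1.0, "high": 0.5, "medium": 0.2, "low": 0.05}


def turn_risk(levels):
    """Weighted sum of the risk levels that fired this turn, clamped to 1.0 (assumption)."""
    return min(1.0, sum(LEVEL_WEIGHTS[level] for level in levels))


@dataclass
class SessionRisk:
    total_turn_risk: float = 0.0
    total_requests: int = 0

    def record(self, levels):
        """Add one evaluation to the session and return its turn risk score."""
        score = turn_risk(levels)
        self.total_turn_risk += score
        self.total_requests += 1
        return score

    @property
    def session_risk_score(self):
        """Running average of turn risk; 0.0 for an empty session."""
        if self.total_requests == 0:
            return 0.0
        return self.total_turn_risk / self.total_requests


session = SessionRisk()
for _ in range(9):
    session.record([])             # nine clean turns
session.record(["critical"])       # one bad turn
print(session.session_risk_score)  # 0.1
```

With these assumed weights, a single critical turn in a ten-turn session yields 0.1, which illustrates why one bad turn does not peg the score.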

Session facts available to rules

Decisions can read accumulated context through a flat set of facts:

Fact	Meaning
`session_risk_score`	Running-average risk for the session (0.0–1.0).
`turn_risk_score`	Risk contributed by the current turn.
`total_requests`	Evaluations seen in the session.
`total_flagged` / `total_denied`	Flagged / denied turns so far.
`high_risk_flags`	Count of `critical` + `high` flags.
`risk_level_counts.*`	Flags per level (e.g. `risk_level_counts.critical`).
`label_counts.*`	Flags per label (e.g. `label_counts.PROMPT_INJECTION`).
`intent_drift_score`	How far the conversation has drifted from its stated intent.
`repetition_score`	Repeated / probing behavior.
`bot_type`	Automated-client classification, if any.
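
To make the shape of these facts concrete, the sketch below keeps the counters in a small Python class and flattens them into dotted keys such as `risk_level_counts.critical`. The class `SessionFacts` and its methods are hypothetical names for illustration. The fact names come from the table above, and the scores and `bot_type` are omitted for brevity.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class SessionFacts:
    total_requests: int = 0
    total_flagged: int = 0
    total_denied: int = 0
    risk_level_counts: Counter = field(default_factory=Counter)
    label_counts: Counter = field(default_factory=Counter)

    @property
    def high_risk_flags(self):
        """Count of critical + high flags."""
        return self.risk_level_counts["critical"] + self.risk_level_counts["high"]

    def record_turn(self, flags, denied=False):
        """Record one evaluation; flags is a list of (label, level) pairs."""
        self.total_requests += 1
        if flags:
            self.total_flagged += 1
        if denied:
            self.total_denied += 1
        for label, level in flags:
            self.label_counts[label] += 1
            self.risk_level_counts[level] += 1

    def as_flat(self):
        """Flatten to the dotted fact names a rule condition would reference."""
        flat = {
            "total_requests": self.total_requests,
            "total_flagged": self.total_flagged,
            "total_denied": self.total_denied,
            "high_risk_flags": self.high_risk_flags,
        }
        for level, count in self.risk_level_counts.items():
            flat[f"risk_level_counts.{level}"] = count
        for label, count in self.label_counts.items():
            flat[f"label_counts.{label}"] = count
        return flat


facts = SessionFacts()
facts.record_turn([("PROMPT_INJECTION", "high")])
facts.record_turn([])  # clean turn
facts.record_turn([("PROMPT_INJECTION", "critical"), ("PROMPT_INJECTION", "low")], denied=True)
flat = facts.as_flat()
```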

Decision rules

A rule is a set of conditions over those facts plus an action. Rules run in priority order and the first match wins; the action then maps to a decision via the policy action (see Policy engine).

Rule actions: enforce (follow the policy action), deny, flag, modify, and the held actions step_up (require human review) and defer. As elsewhere in the policy engine, a flag policy caps the outcome at FLAG.
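
One way to picture how a rule action combines with the policy action is the small Python function below. It is a sketch under assumptions, not the engine's actual logic: the policy action names `block` and `flag`, the decision strings `MODIFY` and `HELD`, and the choice to cap held actions at FLAG on a flag policy are all guesses made for illustration. Only DENY and FLAG are named in the text.

```python
HELD_ACTIONS = {"step_up", "defer"}
DIRECT_ACTIONS = {"deny": "DENY", "flag": "FLAG", "modify": "MODIFY"}


def resolve_decision(rule_action, policy_action):
    """Map a rule action plus the policy action ("block" or "flag") to a decision."""
    if rule_action == "enforce":
        # enforce follows the policy action: deny on a block policy, flag on a flag policy.
        decision = "DENY" if policy_action == "block" else "FLAG"
    elif rule_action in HELD_ACTIONS:
        decision = "HELD"
    elif rule_action in DIRECT_ACTIONS:
        decision = DIRECT_ACTIONS[rule_action]
    else:
        raise ValueError(f"unknown rule action: {rule_action}")

    # A flag policy caps the outcome at FLAG (applied here to every action, by assumption).
    if policy_action == "flag":
        decision = "FLAG"
    return decision
```

For example, `resolve_decision("deny", "flag")` returns `"FLAG"` because of the cap, while `resolve_decision("deny", "block")` returns `"DENY"`.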

Example rule — escalate a noisy session

# Deny once a session accumulates enough flagged turns
# (on a block policy; flags on a flag policy).
- id: session_block_threshold
  priority: 100
  action: enforce
  when:
    - { field: total_flagged, op: ">=", value: 5 }
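
The first-match-wins evaluation can be sketched in Python with the rule above expressed as a dictionary. Several details here are assumptions rather than documented behavior: that a higher `priority` number runs first, that all conditions in `when` must hold, and that operators other than `>=` exist. The second rule, `critical_flag_review`, is a hypothetical rule added only to show ordering.

```python
import operator

# Only ">=" appears in the example rule; the other operators are assumed.
OPS = {
    ">=": operator.ge, ">": operator.gt,
    "<=": operator.le, "<": operator.lt,
    "==": operator.eq, "!=": operator.ne,
}

RULES = [
    {"id": "session_block_threshold", "priority": 100, "action": "enforce",
     "when": [{"field": "total_flagged", "op": ">=", "value": 5}]},
    # Hypothetical rule, not from the documentation.
    {"id": "critical_flag_review", "priority": 200, "action": "step_up",
     "when": [{"field": "risk_level_counts.critical", "op": ">=", "value": 1}]},
]


def matches(rule, facts):
    """True when every condition holds; a missing fact never matches."""
    for cond in rule["when"]:
        if cond["field"] not in facts:
            return False
        if not OPS[cond["op"]](facts[cond["field"]], cond["value"]):
            return False
    return True


def first_match(rules, facts):
    """Return the first matching rule in priority order (higher first, by assumption)."""
    for rule in sorted(rules, key=lambda r: r["priority"], reverse=True):
        if matches(rule, facts):
            return rule
    return None


noisy = first_match(RULES, {"total_flagged": 5})
both = first_match(RULES, {"total_flagged": 5, "risk_level_counts.critical": 1})
quiet = first_match(RULES, {"total_flagged": 2})
```

Here `noisy` resolves to `session_block_threshold`, `both` resolves to the higher-priority hypothetical rule, and `quiet` is `None` because no rule matches.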

Held decisions

Some rules don't allow or deny immediately. step_up holds the request for human approval; defer postpones it. Held requests are themselves a signal (they count as flagged), and a request left unresolved past its timeout resolves to DENY — fail-safe by default.
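
The fail-safe timeout can be illustrated with a short Python sketch. The names `HeldRequest`, `resolve_held`, `timeout_s`, and the `ALLOW` and `HELD` strings are assumptions for the example. The text specifies only that an unresolved request past its timeout resolves to DENY.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HeldRequest:
    action: str                              # "step_up" or "defer"
    held_at: float                           # time the request was held, in seconds
    timeout_s: float                         # how long it may stay unresolved
    reviewer_decision: Optional[str] = None  # e.g. "ALLOW" or "DENY" once resolved


def resolve_held(req, now):
    """Return the outcome of a held request at time `now` (seconds)."""
    if req.reviewer_decision is not None:
        return req.reviewer_decision
    if now - req.held_at >= req.timeout_s:
        return "DENY"   # fail-safe: unresolved past the timeout
    return "HELD"       # still waiting


pending = HeldRequest(action="step_up", held_at=0.0, timeout_s=300.0)
approved = HeldRequest(action="step_up", held_at=0.0, timeout_s=300.0,
                       reviewer_decision="ALLOW")
```

In this sketch `resolve_held(pending, now=60.0)` is still `"HELD"`, `resolve_held(pending, now=300.0)` falls back to `"DENY"`, and `resolve_held(approved, now=60.0)` returns the reviewer's `"ALLOW"`.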