Session risk & rules
How Guardion accumulates risk across a conversation and how declarative rules use that context to escalate a decision beyond a single turn.
A single message is rarely the whole story. An attacker probes across many turns; abuse builds up gradually. Guardion tracks risk per session — identified by the session you pass to POST /v1/guard — so the policy engine can react to patterns, not just the current message.
How session risk is computed
Every detected risk carries a risk level, ordered by severity. A detector's level is the highest level among its enabled checks.
| Risk level | Severity | Contribution to the score |
|---|---|---|
| critical | Highest | Dominates the score. |
| high | High | Large. |
| medium | Medium | Moderate. |
| low | Low | Barely moves the score. |
On each turn, the guardrails that fired produce a turn risk score — a weighted sum where a critical signal counts for many times a low one. The session risk score is then the running average of turn risk across the session:
session_risk_score = total turn-risk ÷ total requests
Averaging rather than summing keeps the score bounded: as long as each turn risk score lies in 0.0–1.0, so does their mean. A session that is consistently risky trends high, while one bad turn in a long clean session barely moves the score. The score, and the per-level and per-label counts behind it, are updated after every evaluation.
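The running average can be sketched in a few lines of Python. This is an illustration only: the weights in `LEVEL_WEIGHTS`, the cap at 1.0, and the names `turn_risk` and `SessionRisk` are assumptions made for the example, not Guardion's actual values or API.

```python
from dataclasses import dataclass

# Hypothetical weights: the real values are not given in this page.
# They only preserve the ordering critical > high > medium > low.
LEVEL_WEIGHTS = {"critical": 1.0, "high": 0.5, "medium": 0.2, "low": 0.05}


def turn_risk(levels):
    """Weighted sum of the risk levels that fired on one turn.

    Capping at 1.0 is an assumption that keeps each turn in 0.0-1.0.
    """
    return min(1.0, sum(LEVEL_WEIGHTS[level] for level in levels))


@dataclass
class SessionRisk:
    total_turn_risk: float = 0.0
    total_requests: int = 0

    def record(self, levels):
        """Fold one evaluated turn into the session and return the new score."""
        self.total_turn_risk += turn_risk(levels)
        self.total_requests += 1
        return self.score

    @property
    def score(self):
        """session_risk_score = total turn-risk / total requests."""
        if self.total_requests == 0:
            return 0.0
        return self.total_turn_risk / self.total_requests


# Nine clean turns followed by one critical turn.
session = SessionRisk()
for _ in range(9):
    session.record([])
session.record(["critical"])
print(session.score)  # 0.1
```

Under these assumed weights, a single critical turn in a ten-turn session leaves the score at 0.1, while a session of nothing but critical turns would sit at 1.0.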
Session facts available to rules
Decisions can read accumulated context through a flat set of facts:
| Fact | Meaning |
|---|---|
| session_risk_score | Running-average risk for the session (0.0–1.0). |
| turn_risk_score | Risk contributed by the current turn. |
| total_requests | Evaluations seen in the session. |
| total_flagged / total_denied | Flagged / denied turns so far. |
| high_risk_flags | Count of critical + high flags. |
| risk_level_counts.* | Flags per level (e.g. risk_level_counts.critical). |
| label_counts.* | Flags per label (e.g. label_counts.PROMPT_INJECTION). |
| intent_drift_score | How far the conversation has drifted from its stated intent. |
| repetition_score | Repeated / probing behavior. |
| bot_type | Automated-client classification, if any. |
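To make the "flat set of facts" concrete, here is a minimal Python sketch of how per-level and per-label counters might be flattened into dotted fact names. The function `build_facts` and its arguments are hypothetical; only the fact names themselves come from the table above.

```python
def build_facts(risk_level_counts, label_counts, **scalars):
    """Flatten session counters into a single dict keyed by fact name."""
    facts = dict(scalars)  # e.g. total_requests, session_risk_score
    # high_risk_flags is the count of critical + high flags.
    facts["high_risk_flags"] = (
        risk_level_counts.get("critical", 0) + risk_level_counts.get("high", 0)
    )
    # Expand the counters into dotted names such as risk_level_counts.critical.
    for level, count in risk_level_counts.items():
        facts[f"risk_level_counts.{level}"] = count
    for label, count in label_counts.items():
        facts[f"label_counts.{label}"] = count
    return facts


facts = build_facts(
    {"critical": 1, "high": 2, "low": 4},
    {"PROMPT_INJECTION": 3},
    total_requests=10,
    total_flagged=5,
)
print(facts["high_risk_flags"])  # 3
```

A flat dictionary like this lets a rule condition name any fact with a single string, including the per-level and per-label counts.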
Decision rules
A rule is a set of conditions over those facts plus an action. Rules run in priority order and the first match wins; the action then maps to a decision via the policy action (see Policy engine).
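As a rough illustration of "priority order, first match wins", the following Python sketch evaluates rules against a facts dictionary. Several details are assumptions, not documented behavior: that a larger `priority` number runs first, that all conditions of a rule must hold, and that operators other than `>=` exist. The rule `review_risky_session` is made up for the example.

```python
import operator

# Only ">=" appears in this page; the other operators are assumed.
OPS = {
    ">=": operator.ge,
    ">": operator.gt,
    "<=": operator.le,
    "<": operator.lt,
    "==": operator.eq,
    "!=": operator.ne,
}


def matches(rule, facts):
    """True when every condition of the rule holds (assumed AND semantics)."""
    return all(
        cond["field"] in facts
        and OPS[cond["op"]](facts[cond["field"]], cond["value"])
        for cond in rule["when"]
    )


def first_matching_rule(rules, facts):
    """Scan rules in priority order and return the first match, or None."""
    # Assumption: a larger priority number is evaluated earlier.
    for rule in sorted(rules, key=lambda r: r["priority"], reverse=True):
        if matches(rule, facts):
            return rule
    return None


rules = [
    {"id": "review_risky_session", "priority": 50, "action": "step_up",
     "when": [{"field": "session_risk_score", "op": ">=", "value": 0.5}]},
    {"id": "session_block_threshold", "priority": 100, "action": "enforce",
     "when": [{"field": "total_flagged", "op": ">=", "value": 5}]},
]

hit = first_matching_rule(rules, {"total_flagged": 6, "session_risk_score": 0.9})
print(hit["id"])  # session_block_threshold
```

Both rules match these facts, but only the higher-priority one is returned, which is what first-match semantics means in practice.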
Rule actions: enforce (follow the policy action), deny, flag, modify, and the held actions step_up (require human review) and defer. As everywhere, a flag policy caps the outcome at FLAG.
# Deny once a session accumulates enough flagged turns
# (on a block policy; flags on a flag policy).
- id: session_block_threshold
  priority: 100
  action: enforce
  when:
    - { field: total_flagged, op: ">=", value: 5 }
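The way a rule action turns into a decision can be sketched as follows. This Python example covers only `enforce`, `deny`, and `flag`; the function name `resolve` and the policy action strings `"block"` and `"flag"` are assumptions for illustration, and `modify`, `step_up`, and `defer` are deliberately left out.

```python
def resolve(action, policy_action):
    """Map a rule action to a decision under a 'block' or 'flag' policy."""
    if action == "enforce":
        # enforce follows the policy action.
        decision = "DENY" if policy_action == "block" else "FLAG"
    elif action == "deny":
        decision = "DENY"
    elif action == "flag":
        decision = "FLAG"
    else:
        # modify, step_up and defer are outside this sketch.
        raise ValueError(f"action not covered by this sketch: {action}")
    # A flag policy caps the outcome at FLAG.
    if policy_action == "flag" and decision == "DENY":
        decision = "FLAG"
    return decision


print(resolve("enforce", "block"))  # DENY
print(resolve("enforce", "flag"))   # FLAG
print(resolve("deny", "flag"))      # FLAG
```

Applied to the `session_block_threshold` rule, this reproduces the comment in the example: the session is denied on a block policy and only flagged on a flag policy.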
Held decisions
Some rules don't allow or deny immediately. step_up holds the request for human approval; defer postpones it. Held requests are themselves a signal (they count as flagged), and a request left unresolved past its timeout resolves to DENY — fail-safe by default.
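The fail-safe timeout can be illustrated with a small Python sketch. The class `HeldRequest`, its fields, and the use of `"ALLOW"` as the approval outcome are hypothetical; only the rule that an unresolved request past its timeout resolves to DENY comes from the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HeldRequest:
    held_at: float                            # time the request was held (seconds)
    timeout_seconds: float                    # how long it may stay unresolved
    reviewer_decision: Optional[str] = None   # set once a human resolves it

    def outcome(self, now: float) -> Optional[str]:
        """Return the final decision, or None while the request is still held."""
        if self.reviewer_decision is not None:
            return self.reviewer_decision
        if now - self.held_at > self.timeout_seconds:
            return "DENY"  # fail-safe: unresolved past the timeout
        return None


pending = HeldRequest(held_at=0.0, timeout_seconds=60.0)
print(pending.outcome(now=30.0))  # None
print(pending.outcome(now=61.0))  # DENY
```

Passing `now` explicitly keeps the example deterministic; a real system would read the clock itself.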