Policy Engine

Session risk & rules

How Guardion accumulates risk across a conversation and how declarative rules use that context to escalate a decision beyond a single turn.

A single message is rarely the whole story. An attacker probes across many turns; abuse builds up gradually. Guardion tracks risk per session — identified by the session you pass to POST /v1/guard — so the policy engine can react to patterns, not just the current message.

How session risk is computed

Every detected risk carries a risk level, ordered by severity. A detector's level is the highest level among its enabled checks.

| Risk level | Severity | Contribution |
| --- | --- | --- |
| critical | Highest | Dominates the score. |
| high | High | Large. |
| medium | Medium | Moderate. |
| low | Low | Barely moves the score. |

On each turn, the guardrails that fired produce a turn risk score — a weighted sum where a critical signal counts for many times a low one. The session risk score is then the running average of turn risk across the session:

session_risk_score = total turn-risk ÷ total requests

Averaging (rather than summing) keeps the score stable: as long as each turn risk score falls in 0.0–1.0, the session score stays in that range too. A session that is consistently risky trends high, while one bad turn in a long clean session does not pin the score at its maximum. The score, and the per-level and per-label counts behind it, are updated after every evaluation.
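The running average can be sketched in a few lines of Python. This is a minimal illustration, not Guardion's implementation: the weights in LEVEL_WEIGHTS, the clamp to 1.0, and the names turn_risk and SessionRisk are all assumptions, since the text only says that a critical signal counts for many times a low one.

```python
# Hypothetical weights: the documentation does not publish exact values.
LEVEL_WEIGHTS = {"critical": 1.0, "high": 0.6, "medium": 0.3, "low": 0.05}


def turn_risk(levels):
    """Weighted sum of the levels that fired this turn.

    Clamping to 1.0 is an assumption made so the average stays in 0.0-1.0.
    """
    return min(1.0, sum(LEVEL_WEIGHTS[level] for level in levels))


class SessionRisk:
    """Tracks the running average of turn risk for one session."""

    def __init__(self):
        self.total_turn_risk = 0.0
        self.total_requests = 0

    @property
    def score(self):
        """session_risk_score = total turn risk / total requests."""
        if self.total_requests == 0:
            return 0.0
        return self.total_turn_risk / self.total_requests

    def record(self, levels):
        """Record one evaluation and return the updated session score."""
        self.total_turn_risk += turn_risk(levels)
        self.total_requests += 1
        return self.score


session = SessionRisk()
for _ in range(9):
    session.record([])            # nine clean turns
session.record(["critical"])      # one bad turn
print(round(session.score, 2))    # 0.1
```

With these assumed weights, a single critical turn in a ten-turn session leaves the score at 0.1, whereas ten critical turns in a row would hold it at 1.0.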

Session facts available to rules

Decisions can read accumulated context through a flat set of facts:

| Fact | Meaning |
| --- | --- |
| session_risk_score | Running-average risk for the session (0.0–1.0). |
| turn_risk_score | Risk contributed by the current turn. |
| total_requests | Evaluations seen in the session. |
| total_flagged / total_denied | Flagged / denied turns so far. |
| high_risk_flags | Count of critical + high flags. |
| risk_level_counts.* | Flags per level (e.g. risk_level_counts.critical). |
| label_counts.* | Flags per label (e.g. label_counts.PROMPT_INJECTION). |
| intent_drift_score | How far the conversation has drifted from its stated intent. |
| repetition_score | Repeated / probing behavior. |
| bot_type | Automated-client classification, if any. |
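To make the dotted names concrete, here is a small sketch of how nested per-session counters could be flattened into the fact names listed above. The flatten helper, the nested storage layout, and every value in the sample state are hypothetical; only the fact names come from this page.

```python
def flatten(state, prefix=""):
    """Turn nested counters into flat, dotted fact names."""
    flat = {}
    for key, value in state.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # e.g. {"risk_level_counts": {"critical": 1}}
            #   -> {"risk_level_counts.critical": 1}
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


# Illustrative session state; all numbers are made up for the example.
state = {
    "session_risk_score": 0.42,
    "turn_risk_score": 0.6,
    "total_requests": 12,
    "total_flagged": 5,
    "total_denied": 1,
    "risk_level_counts": {"critical": 1, "high": 2, "medium": 2},
    "label_counts": {"PROMPT_INJECTION": 3},
    "intent_drift_score": 0.2,
    "repetition_score": 0.7,
    "bot_type": None,
}

facts = flatten(state)
# high_risk_flags is the count of critical + high flags.
facts["high_risk_flags"] = (
    facts.get("risk_level_counts.critical", 0)
    + facts.get("risk_level_counts.high", 0)
)

print(facts["risk_level_counts.critical"])     # 1
print(facts["label_counts.PROMPT_INJECTION"])  # 3
print(facts["high_risk_flags"])                # 3
```

A flat dictionary like this lets a rule condition name any fact with a single string, which is how the field key is used in the rule example on this page.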

Decision rules

A rule is a set of conditions over those facts plus an action. Rules run in priority order and the first match wins; the action then maps to a decision via the policy action (see Policy engine).

Rule actions: enforce (follow the policy action), deny, flag, modify, and the held actions step_up (require human review) and defer. As everywhere, a flag policy caps the outcome at FLAG.
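The mapping from a rule action to a decision might look like the sketch below. It rests on several assumptions that this page does not confirm: the policy actions are named "block" and "flag", the decision names other than DENY and FLAG (here MODIFY and HELD) are placeholders, and a flag policy is treated as turning every matched rule into FLAG.

```python
HELD_ACTIONS = {"step_up", "defer"}
RULE_ACTIONS = {"enforce", "deny", "flag", "modify"} | HELD_ACTIONS


def decide(rule_action, policy_action):
    """Map a matched rule's action to a decision (hypothetical sketch)."""
    if rule_action not in RULE_ACTIONS:
        raise ValueError(f"unknown rule action: {rule_action!r}")
    if policy_action == "flag":
        # A flag policy caps the outcome at FLAG, whatever the rule asks for.
        return "FLAG"
    if rule_action == "enforce":
        # enforce follows the policy action; on a block policy that is a deny.
        return "DENY"
    if rule_action in HELD_ACTIONS:
        # step_up and defer hold the request instead of deciding now.
        return "HELD"
    return rule_action.upper()  # deny -> DENY, flag -> FLAG, modify -> MODIFY


print(decide("enforce", "block"))  # DENY
print(decide("deny", "flag"))      # FLAG
print(decide("step_up", "block"))  # HELD
```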

Example rule — escalate a noisy session
# Deny once a session accumulates enough flagged turns
# (on a block policy; flags on a flag policy).
- id: session_block_threshold
  priority: 100
  action: enforce
  when:
    - { field: total_flagged, op: ">=", value: 5 }
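
First-match evaluation of rules like the one above can be sketched in Python as follows. Treat it as an illustration under stated assumptions: higher priority numbers are taken to run first, conditions under when are combined with AND, operators other than ">=" are guesses, and the second rule (review_high_risk) is invented purely to show ordering.

```python
import operator

# Only ">=" appears in the documented example; the rest are assumed.
OPS = {
    ">=": operator.ge, ">": operator.gt,
    "<=": operator.le, "<": operator.lt,
    "==": operator.eq, "!=": operator.ne,
}

RULES = [
    {"id": "review_high_risk", "priority": 50, "action": "step_up",  # hypothetical
     "when": [{"field": "session_risk_score", "op": ">=", "value": 0.8}]},
    {"id": "session_block_threshold", "priority": 100, "action": "enforce",
     "when": [{"field": "total_flagged", "op": ">=", "value": 5}]},
]


def matches(rule, facts):
    """True when every condition holds; a missing fact never matches."""
    return all(
        cond["field"] in facts
        and OPS[cond["op"]](facts[cond["field"]], cond["value"])
        for cond in rule["when"]
    )


def first_match(rules, facts):
    """Return the first matching rule in priority order, or None."""
    ordered = sorted(rules, key=lambda r: r["priority"], reverse=True)
    for rule in ordered:
        if matches(rule, facts):
            return rule
    return None


hit = first_match(RULES, {"total_flagged": 5, "session_risk_score": 0.9})
print(hit["id"], hit["action"])  # session_block_threshold enforce
```

Both rules match the sample facts, but only the higher-priority one is returned, which is the first-match-wins behavior described in the Decision rules section.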

Held decisions

Some rules don't allow or deny immediately. step_up holds the request for human approval; defer postpones it. Held requests are themselves a signal (they count as flagged), and a request left unresolved past its timeout resolves to DENY — fail-safe by default.
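
The fail-safe timeout can be illustrated with a short sketch. The HeldRequest class, its field names, the "HELD" and "ALLOW" labels, and the timeout value are all hypothetical; the only behavior taken from this page is that an unresolved request past its timeout resolves to DENY.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HeldRequest:
    """A request parked by a step_up or defer rule (hypothetical model)."""

    action: str                       # "step_up" or "defer"
    held_at: float                    # seconds on any consistent clock
    timeout_s: float                  # how long the hold may stay open
    resolution: Optional[str] = None  # set once a reviewer decides

    def decision(self, now):
        """Return the current outcome for this request at time `now`."""
        if self.resolution is not None:
            return self.resolution    # a human (or later step) resolved it
        if now - self.held_at > self.timeout_s:
            return "DENY"             # fail-safe: unresolved holds are denied
        return "HELD"


req = HeldRequest(action="step_up", held_at=0.0, timeout_s=300.0)
print(req.decision(now=60.0))    # HELD
print(req.decision(now=301.0))   # DENY

req.resolution = "ALLOW"         # assumed label for an approved request
print(req.decision(now=301.0))   # ALLOW
```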