Guard Models

Moderation model

Classifies content against standard safety categories — hate, harassment, violence, sexual content, self-harm and illicit/dangerous activity — for both user inputs and model outputs.

The moderation model scores content for harmfulness and returns a HARMFUL / SAFE verdict. An extended mode adds finer-grained category labels (for example toxicity and refusal signals) when you need more than a binary verdict.

Safety categories

Each category is an independent check you can enable per policy, with its own risk level. A message can match more than one.

Category	What it flags
Hate & harassment	Demeaning, hateful or harassing content targeting people or groups.
Violence	Threats, incitement, or graphic violent content.
Sexual content	Sexual material, with stricter handling for content involving minors.
Self-harm	Encouragement, instructions or intent related to self-harm.
Illicit / dangerous	Instructions enabling serious harm or clearly illegal activity.
Toxicity	Toxic, abusive or profane language (extended mode).

Inputs and outputs

Moderation runs on either side of a model call: screen user inputs before they reach the model, and screen model outputs before they reach the user. Pair it with a policy action to either block flagged turns or only flag them.
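The two screening points can be sketched as a thin wrapper around a model call. This is only an illustration, not the product's SDK: `guarded_call`, `moderate`, `call_model` and the `"block"` / `"flag"` action strings are hypothetical names, and the keyword check merely simulates a HARMFUL verdict.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TurnResult:
    reply: str
    flags: List[str] = field(default_factory=list)  # sides that were flagged


def guarded_call(
    user_input: str,
    call_model: Callable[[str], str],
    moderate: Callable[[str], bool],  # hypothetical: True means HARMFUL
    action: str = "block",            # hypothetical action: "block" or "flag"
) -> TurnResult:
    result = TurnResult(reply="")

    # Screen the user input before it reaches the model.
    if moderate(user_input):
        result.flags.append("input")
        if action == "block":
            result.reply = "[blocked: input]"
            return result

    reply = call_model(user_input)

    # Screen the model output before it reaches the user.
    if moderate(reply):
        result.flags.append("output")
        if action == "block":
            result.reply = "[blocked: output]"
            return result

    result.reply = reply
    return result


# Stand-ins for demonstration only; a real setup would call the guard API.
def fake_moderate(text: str) -> bool:
    return "attack" in text.lower()


def fake_model(prompt: str) -> str:
    return f"echo: {prompt}"
```

With `action="block"` a flagged input never reaches the model; with `action="flag"` the turn proceeds and the flags are recorded for later review.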

Benchmark

Performance pooled across public content-safety datasets, with per-category recall at the default sensitivity.

Detection performance

Metric	Value
Recall	0.99
Precision	0.97
F1	0.98
FPR	0.071

Category	Type	Rate
Harmful / dangerous	Unsafe	0.81
Toxicity	Unsafe	0.86
Harassment	Unsafe	1.00
Sexual content	Unsafe	1.00
Violence	Unsafe	0.67
Self-harm	Unsafe	0.67
Malware / cyber	Unsafe	1.00
Fraud / deception	Unsafe	1.00
Misinformation	Unsafe	1.00
Safe prompts	Benign	0.93

Pooled across nine public safety datasets — WildGuardMix, Civil Comments, ToxicChat, NVIDIA Aegis 2.0, BeaverTails, OpenAI Moderation, JailBreakV, NVIDIA Nemotron Safety Guard and harmful-task suites. Unsafe rows report recall (share correctly flagged); the Safe prompts row reports pass rate (share correctly allowed).
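For readers who want to relate these figures to one another, the sketch below shows the standard definitions of recall, precision, false-positive rate and pass rate from confusion counts. The counts passed in are invented purely for illustration and are not benchmark data; the function names are hypothetical.

```python
def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard detection metrics from confusion counts."""
    return {
        "recall": tp / (tp + fn),      # share of unsafe items correctly flagged
        "precision": tp / (tp + fp),   # share of flags that were truly unsafe
        "fpr": fp / (fp + tn),         # share of benign items wrongly flagged
        "pass_rate": tn / (fp + tn),   # share of benign items correctly allowed
    }


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Invented counts, for illustration only (not taken from the benchmark).
example = detection_metrics(tp=90, fp=10, tn=190, fn=10)

# Relations that can be checked against the reported pooled figures:
f1_from_reported = round(f1_score(0.97, 0.99), 2)   # precision 0.97, recall 0.99
pass_rate_from_fpr = round(1 - 0.071, 2)            # pass rate = 1 - FPR
```

Under these definitions, the reported precision and recall give an F1 of about 0.98, and an FPR of 0.071 corresponds to a pass rate of about 0.93, in line with the Safe prompts row.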

Use it via the API

Send content to POST /v1/guard with the moderation detector enabled on your policy; results appear in breakdown. See Enforce policies.

Sensitivity levels

Each policy runs at one of four sensitivity levels. Lower levels flag only high-confidence detections (fewer false positives); higher levels catch more at the cost of more false positives. L3 is the default.

Level	Recall	Precision	FPR
L1 (Lenient)	0.99	0.99	0.01
L2 (Balanced)	0.99	0.97	0.07
L3 (Default)	0.99	0.92	0.18
L4 (Strict)	0.99	0.92	0.18
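One way to read the FPR column is as an expected number of benign messages flagged per batch. The sketch below does that arithmetic with the FPR values from the table; the function name and the batch size are illustrative assumptions, and real traffic may behave differently from the benchmark mix.

```python
# FPR per sensitivity level, copied from the table above.
FPR_BY_LEVEL = {"L1": 0.01, "L2": 0.07, "L3": 0.18, "L4": 0.18}


def expected_false_flags(benign_messages: int, level: str) -> int:
    """Estimate how many benign messages would be flagged at a given level."""
    return round(benign_messages * FPR_BY_LEVEL[level])


# Estimated false flags per 10,000 benign messages at each level.
estimates = {level: expected_false_flags(10_000, level) for level in FPR_BY_LEVEL}
```

For a batch of 10,000 benign messages this works out to roughly 100 false flags at L1, 700 at L2, and 1,800 at L3 or L4, which can help when sizing a review queue.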

cURL

curl https://api.guardion.ai/v1/guard \
  -H "Authorization: Bearer $GUARDION_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      { "role": "user", "content": "How do I ..." }
    ],
    "session": "customer_101"
  }'
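The same request can be assembled in Python with only the standard library. This is a sketch that mirrors the cURL example field for field; `build_guard_request` and `send` are hypothetical helper names, and whether the endpoint accepts further fields is not covered here.

```python
import json
import os
import urllib.request

GUARD_URL = "https://api.guardion.ai/v1/guard"  # endpoint from the cURL example


def build_guard_request(content: str, session: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request shaped like the cURL example."""
    payload = {
        "messages": [{"role": "user", "content": content}],
        "session": session,
    }
    return urllib.request.Request(
        GUARD_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON body (performs a network call)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Reads the key from the same environment variable the cURL example uses.
request = build_guard_request(
    "How do I ...", "customer_101", os.environ.get("GUARDION_API_KEY", "")
)
```

Calling `send(request)` would perform the actual call; it is kept separate so the request can be inspected or logged first.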

Response

{
  "flagged": true,
  "breakdown": [
    {
      "detector": "moderation",
      "detected": true,
      "threshold": 0.5,
      "score": 0.91,
      "top_label": "HARMFUL",
      "labels": ["HARMFUL", "SAFE"],
      "label_scores": [0.91, 0.09]
    }
  ]
}
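A response of this shape can be unpacked as follows. The sketch assumes, based only on the sample above, that `labels` and `label_scores` are aligned by position and that each `breakdown` entry names its detector; the helper names are hypothetical.

```python
import json
from typing import Optional

# The sample response shown above.
SAMPLE_RESPONSE = json.loads("""
{
  "flagged": true,
  "breakdown": [
    {
      "detector": "moderation",
      "detected": true,
      "threshold": 0.5,
      "score": 0.91,
      "top_label": "HARMFUL",
      "labels": ["HARMFUL", "SAFE"],
      "label_scores": [0.91, 0.09]
    }
  ]
}
""")


def moderation_entry(response: dict) -> Optional[dict]:
    """Return the moderation detector's entry from `breakdown`, if present."""
    for entry in response.get("breakdown", []):
        if entry.get("detector") == "moderation":
            return entry
    return None


def scores_by_label(entry: dict) -> dict:
    """Pair each label with its score (assumes the two lists are aligned)."""
    return dict(zip(entry["labels"], entry["label_scores"]))


entry = moderation_entry(SAMPLE_RESPONSE)
if entry is not None and entry["detected"]:
    print(f"{entry['top_label']} (score {entry['score']}, threshold {entry['threshold']})")
```

Looking up the entry by detector name rather than by list index keeps the code working if a policy enables additional detectors.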