Moderation model
Classifies content against standard safety categories — hate, harassment, violence, sexual content, self-harm and illicit/dangerous activity — for both user inputs and model outputs.
The moderation model scores content for harmfulness and returns a HARMFUL / SAFE verdict. An extended mode adds finer-grained category labels (for example toxicity and refusal signals) when you need more than a binary verdict.
Safety categories
Each category is an independent check you can enable per policy, with its own risk level. A message can match more than one.
| Category | What it flags |
|---|---|
| Hate & harassment | Demeaning, hateful or harassing content targeting people or groups. |
| Violence | Threats, incitement, or graphic violent content. |
| Sexual content | Sexual material, with stricter handling for content involving minors. |
| Self-harm | Encouragement, instructions or intent related to self-harm. |
| Illicit / dangerous | Instructions enabling serious harm or clearly illegal activity. |
| Toxicity | Toxic, abusive or profane language (extended mode). |
Inputs and outputs
Moderation runs on either side of a model call: screen user inputs before they reach the model, and screen model outputs before they reach the user. Pair it with a policy action that either blocks flagged turns or only flags them.
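The input-and-output pattern above can be sketched as a thin wrapper around a model call. This is a minimal illustration, not the product's SDK: `moderated_call`, the `guard` and `model` callables, and `BLOCKED_MESSAGE` are hypothetical names, and `guard` stands in for whatever call returns the moderation verdict.

```python
from typing import Callable

# Hypothetical stand-ins (not part of the documented API):
# `guard` returns True when the text is flagged, `model` maps a prompt to a reply.
Guard = Callable[[str], bool]
Model = Callable[[str], str]

BLOCKED_MESSAGE = "This message was blocked by the moderation policy."


def moderated_call(prompt: str, model: Model, guard: Guard, block: bool = True) -> dict:
    """Screen the input, call the model, then screen the output."""
    flagged_input = guard(prompt)
    if flagged_input and block:
        # A blocked input never reaches the model.
        return {"reply": BLOCKED_MESSAGE, "flagged_input": True,
                "flagged_output": False, "blocked": True}

    reply = model(prompt)

    flagged_output = guard(reply)
    if flagged_output and block:
        # A blocked output never reaches the user.
        return {"reply": BLOCKED_MESSAGE, "flagged_input": flagged_input,
                "flagged_output": True, "blocked": True}

    # Flag-only mode (or nothing flagged): pass the reply through with the verdicts.
    return {"reply": reply, "flagged_input": flagged_input,
            "flagged_output": flagged_output, "blocked": False}
```

With `block=False` the wrapper only records the verdicts, which corresponds to a flag-only policy action; with `block=True` a flagged turn is replaced before it crosses the boundary.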
Benchmark
Performance pooled across public content-safety datasets, with per-category recall at the default sensitivity.
Detection performance
| Metric | Value |
|---|---|
| Recall | 0.99 |
| Precision | 0.97 |
| F1 | 0.98 |
| FPR | 0.071 |
| Category | Type | Rate |
|---|---|---|
| Harmful / dangerous | Unsafe | 0.81 |
| Toxicity | Unsafe | 0.86 |
| Harassment | Unsafe | 1.00 |
| Sexual content | Unsafe | 1.00 |
| Violence | Unsafe | 0.67 |
| Self-harm | Unsafe | 0.67 |
| Malware / cyber | Unsafe | 1.00 |
| Fraud / deception | Unsafe | 1.00 |
| Misinformation | Unsafe | 1.00 |
| Safe prompts | Benign | 0.93 |
Pooled across nine public safety datasets — WildGuardMix, Civil Comments, ToxicChat, NVIDIA Aegis 2.0, BeaverTails, OpenAI Moderation, JailBreakV, NVIDIA Nemotron Safety Guard and harmful-task suites. Unsafe rows report recall (share correctly flagged); the Safe prompts row reports pass rate (share correctly allowed).
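As a quick consistency check on the headline numbers, F1 is the harmonic mean of precision and recall. The short sketch below recomputes it from the reported precision (0.97) and recall (0.99); the helper name `f1_score` is only illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported benchmark values: precision 0.97, recall 0.99.
print(round(f1_score(0.97, 0.99), 2))  # 0.98
```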
Use it via the API
Send content to POST /v1/guard with the moderation detector enabled on your policy; results appear in breakdown. See Enforce policies.
Sensitivity levels
Each policy runs at one of four sensitivity levels. Lower levels flag only high-confidence detections (fewer false positives); higher levels catch more at the cost of more flags. L3 is the default.
| Level | Recall | Precision | FPR |
|---|---|---|---|
| L1 Lenient | 0.99 | 0.99 | 0.01 |
| L2 Balanced | 0.99 | 0.97 | 0.07 |
| L3 Default | 0.99 | 0.92 | 0.18 |
| L4 Strict | 0.99 | 0.92 | 0.18 |
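To make the trade-off concrete, the false positive rates in the table can be turned into an expected number of wrongly flagged benign messages. This is a rough sketch: the traffic volume of 10,000 benign messages is an assumed figure for illustration, and `expected_false_flags` is a hypothetical helper.

```python
# FPR per sensitivity level, copied from the table above.
LEVEL_FPR = {"L1": 0.01, "L2": 0.07, "L3": 0.18, "L4": 0.18}


def expected_false_flags(benign_messages: int, fpr: float) -> int:
    """Expected count of benign messages flagged in error at a given FPR."""
    return round(benign_messages * fpr)


# Assumed volume: 10,000 benign messages.
for level, fpr in LEVEL_FPR.items():
    print(level, expected_false_flags(10_000, fpr))
```

At that assumed volume, moving from L1 to L3 raises the expected false flags from about 100 to about 1,800 while recall stays at 0.99 in the table.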
Example request:

```bash
curl https://api.guardion.ai/v1/guard \
  -H "Authorization: Bearer $GUARDION_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      { "role": "user", "content": "How do I ..." }
    ],
    "session": "customer_101"
  }'
```
Example response:

```json
{
  "flagged": true,
  "breakdown": [
    {
      "detector": "moderation",
      "detected": true,
      "threshold": 0.5,
      "score": 0.91,
      "top_label": "HARMFUL",
      "labels": ["HARMFUL", "SAFE"],
      "label_scores": [0.91, 0.09]
    }
  ]
}
```
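A response shaped like the one above can be read by locating the `moderation` entry in `breakdown` and pairing each label with its score. The sketch below assumes only the fields shown in the sample response; `moderation_result` is a hypothetical helper, not part of any official client.

```python
import json
from typing import Optional


def moderation_result(response: dict, detector: str = "moderation") -> Optional[dict]:
    """Return the verdict for one detector, or None if it is not in the breakdown."""
    for entry in response.get("breakdown", []):
        if entry.get("detector") == detector:
            # Pair the parallel `labels` and `label_scores` arrays.
            scores = dict(zip(entry.get("labels", []), entry.get("label_scores", [])))
            return {
                "detected": entry.get("detected", False),
                "top_label": entry.get("top_label"),
                "score": entry.get("score"),
                "label_scores": scores,
            }
    return None


# Sample payload copied from the response shown above.
sample = json.loads("""
{
  "flagged": true,
  "breakdown": [
    {
      "detector": "moderation",
      "detected": true,
      "threshold": 0.5,
      "score": 0.91,
      "top_label": "HARMFUL",
      "labels": ["HARMFUL", "SAFE"],
      "label_scores": [0.91, 0.09]
    }
  ]
}
""")

result = moderation_result(sample)
print(result["top_label"], result["label_scores"])
```

Returning `None` when the detector is missing keeps "not run on this policy" distinct from "ran and found nothing".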