Guard Models

Moderation model

Classifies content against standard safety categories — hate, harassment, violence, sexual content, self-harm and illicit/dangerous activity — for both user inputs and model outputs.

The moderation model scores content for harmfulness and returns a HARMFUL / SAFE verdict. An extended mode adds finer-grained category labels (for example toxicity and refusal signals) when you need more than a binary verdict.

Safety categories

Each category is an independent check you can enable per policy, with its own risk level. A message can match more than one.

Category             What it flags
Hate & harassment    Demeaning, hateful or harassing content targeting people or groups.
Violence             Threats, incitement, or graphic violent content.
Sexual content       Sexual material, with stricter handling for content involving minors.
Self-harm            Encouragement, instructions or intent related to self-harm.
Illicit / dangerous  Instructions enabling serious harm or clearly illegal activity.
Toxicity             Toxic, abusive or profane language (extended mode).

Inputs and outputs

Moderation runs on either side of a model call: screen user inputs before they reach the model, and screen model outputs before they reach the user. Pair it with a policy action to either block flagged turns or only flag them for review.
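The two screening points can be sketched as a thin wrapper around a model call. This is a minimal sketch under stated assumptions, not an official client: `guard` and `model` are hypothetical callables you supply, `guard` is assumed to return the parsed /v1/guard response (a dict with a boolean `flagged` field, as in the response example on this page), and the `assistant` role for output turns is an assumption, since this page only shows `user` messages. Returning a fixed reply stands in for whatever policy action you configure.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]
GuardFn = Callable[[List[Message]], dict]  # hypothetical: returns the parsed guard response
ModelFn = Callable[[str], str]             # hypothetical: returns the model's reply

BLOCKED_REPLY = "This message was blocked by the moderation policy."


def moderated_call(user_text: str, guard: GuardFn, model: ModelFn) -> str:
    """Screen the user input, call the model, then screen the model output."""
    # Input side: check the user turn before it reaches the model.
    if guard([{"role": "user", "content": user_text}]).get("flagged"):
        return BLOCKED_REPLY

    reply = model(user_text)

    # Output side: check the model turn before it reaches the user.
    # The "assistant" role name is an assumption, not confirmed by this page.
    if guard([{"role": "assistant", "content": reply}]).get("flagged"):
        return BLOCKED_REPLY
    return reply
```

Passing `guard` and `model` in as arguments keeps the wrapper independent of any particular HTTP client or model provider, and lets you swap in a stub when testing.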

Benchmark

Performance pooled across public content-safety datasets, with per-category recall at the default sensitivity.

Detection performance

Metric     Value
Recall     0.99
Precision  0.97
F1         0.98
FPR        0.071

Category             Type    Rate
Harmful / dangerous  Unsafe  0.81
Toxicity             Unsafe  0.86
Harassment           Unsafe  1.00
Sexual content       Unsafe  1.00
Violence             Unsafe  0.67
Self-harm            Unsafe  0.67
Malware / cyber      Unsafe  1.00
Fraud / deception    Unsafe  1.00
Misinformation       Unsafe  1.00
Safe prompts         Benign  0.93

Pooled across nine public safety datasets — WildGuardMix, Civil Comments, ToxicChat, NVIDIA Aegis 2.0, BeaverTails, OpenAI Moderation, JailBreakV, NVIDIA Nemotron Safety Guard and harmful-task suites. Unsafe rows report recall (share correctly flagged); the Safe prompts row reports pass rate (share correctly allowed).
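As a quick consistency check, the headline figures relate to each other through the standard definitions: F1 is the harmonic mean of precision and recall, and the pass rate on benign content is one minus the false positive rate. The sketch below only re-derives numbers already reported on this page; the function and variable names are illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Figures reported on this page.
recall, precision, fpr = 0.99, 0.97, 0.071

f1 = f1_score(precision, recall)
benign_pass_rate = 1 - fpr  # share of benign content that is not flagged

print(f"F1 = {f1:.2f}")                              # F1 = 0.98
print(f"Benign pass rate = {benign_pass_rate:.2f}")  # Benign pass rate = 0.93
```

Both results agree with the reported F1 of 0.98 and the 0.93 rate on the Safe prompts row.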

Use it via the API

Send content to POST /v1/guard with the moderation detector enabled on your policy; results appear in breakdown. See Enforce policies.

Sensitivity levels

Each policy runs at one of four sensitivity levels. Lower levels flag only high-confidence detections (fewer false positives); higher levels catch more at the cost of more false positives. L3 is the default.

Level        Recall  Precision  FPR
L1 Lenient   0.99    0.99       0.01
L2 Balanced  0.99    0.97       0.07
L3 Default   0.99    0.92       0.18
L4 Strict    0.99    0.92       0.18
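To get a feel for what these false positive rates mean in practice, the sketch below estimates how many benign messages would be flagged at each level for a given volume. It assumes the benchmark FPR carries over to your own traffic, which may not hold; the function name and the 10,000-message volume are illustrative.

```python
# FPR per sensitivity level, copied from the table above.
FPR_BY_LEVEL = {"L1": 0.01, "L2": 0.07, "L3": 0.18, "L4": 0.18}


def expected_false_flags(benign_messages: int, level: str) -> int:
    """Expected number of benign messages flagged at the given level."""
    return round(benign_messages * FPR_BY_LEVEL[level])


for level in FPR_BY_LEVEL:
    # Prints 100, 700, 1800 and 1800 for L1 through L4.
    print(level, expected_false_flags(10_000, level))
```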
cURL
curl https://api.guardion.ai/v1/guard \
  -H "Authorization: Bearer $GUARDION_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      { "role": "user", "content": "How do I ..." }
    ],
    "session": "customer_101"
  }'
Response
{
  "flagged": true,
  "breakdown": [
    {
      "detector": "moderation",
      "detected": true,
      "threshold": 0.5,
      "score": 0.91,
      "top_label": "HARMFUL",
      "labels": ["HARMFUL", "SAFE"],
      "label_scores": [0.91, 0.09]
    }
  ]
}
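The same call can be made from Python using only the standard library. This is a sketch that mirrors the cURL request and the response shape shown above, not an official SDK: the helper names are hypothetical, and fields beyond those in the example response are not assumed.

```python
import json
import os
import urllib.request
from typing import Optional

GUARD_URL = "https://api.guardion.ai/v1/guard"


def build_guard_request(content: str, session: str, api_key: str) -> urllib.request.Request:
    """Build the same POST request as the cURL example."""
    body = {"messages": [{"role": "user", "content": content}], "session": session}
    return urllib.request.Request(
        GUARD_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def moderation_verdict(response: dict) -> Optional[dict]:
    """Extract the moderation entry from `breakdown`, or None if it is absent."""
    for entry in response.get("breakdown", []):
        if entry.get("detector") == "moderation":
            return {
                "detected": entry.get("detected"),
                "score": entry.get("score"),
                "top_label": entry.get("top_label"),
                # Pair each label with its score, e.g. {"HARMFUL": 0.91, "SAFE": 0.09}.
                "label_scores": dict(zip(entry.get("labels", []), entry.get("label_scores", []))),
            }
    return None


def guard(content: str, session: str) -> dict:
    """Send content to the guard endpoint and return the parsed JSON response."""
    request = build_guard_request(content, session, os.environ["GUARDION_API_KEY"])
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)
```

Calling `moderation_verdict(guard("How do I ...", "customer_101"))` would then give the moderation detector's verdict for that message, assuming `GUARDION_API_KEY` is set and the moderation detector is enabled on your policy.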