Known Attacks
Detect jailbreaks and prompt injections using similarity matching.
The Known Attacks detector compares incoming messages against a curated database of known jailbreak attempts, prompt injections, and harmful prompts using signature matching.
Recommended for Input
Use cases
- Block known jailbreak attempts like “DAN” or “ignore previous instructions”
- Catch attacks or harmful content queries that were previously seen in the wild
Labels
JAILBREAK Message matches a known jailbreak pattern.
PROMPT_INJECTION Message matches a known prompt injection attempt.
HARMFUL Message matches a known harmful behavior pattern.
MISC Other known attack patterns.
Configuration
Threshold default: 0.95
Minimum similarity score for matching (0.0 to 1.0).
Roles default: user
Which message roles to analyze.