Known Attacks

Detect jailbreaks and prompt injections using similarity matching.

The Known Attacks detector compares incoming messages against a curated database of known jailbreak attempts, prompt injections, and harmful prompts using signature matching.

Recommended for Input

Use cases

Block known jailbreak attempts like “DAN” or “ignore previous instructions”
Catch attacks or harmful content queries that were previously seen in the wild

Labels

JAILBREAK

Message matches a known jailbreak pattern.

PROMPT_INJECTION

Message matches a known prompt injection attempt.

HARMFUL

Message matches a known harmful behavior pattern.

MISC

Other known attack patterns.

Configuration

Threshold default: 0.95

Minimum similarity score for matching (0.0 to 1.0).

Roles default: user

Which message roles to analyze.