Known Attacks

Detect jailbreaks and prompt injections using similarity matching.

The Known Attacks detector compares incoming messages against a curated database of known jailbreak attempts, prompt injections, and harmful prompts using signature matching.

Recommended for Input

Use cases

  • Block known jailbreak attempts like “DAN” or “ignore previous instructions”
  • Catch attacks or harmful content queries that were previously seen in the wild

Labels

JAILBREAK

Message matches a known jailbreak pattern.

PROMPT_INJECTION

Message matches a known prompt injection attempt.

HARMFUL

Message matches a known harmful behavior pattern.

MISC

Other known attack patterns.

Configuration

Threshold default: 0.95

Minimum similarity score for matching (0.0 to 1.0).

Roles default: user

Which message roles to analyze.