Input Safety

Classify harmful content using a specialized machine learning model.

The Input Safety detector uses a specialized machine learning model to classify content into harmful categories. It provides deep analysis for content that passes simpler rule-based checks.

Recommended for Input

Use cases

Comprehensive content moderation for chatbots and agents
Detect malicious intent based on full conversational context
Block harmful content generation requests, advanced jailbreak attempts, and more

Labels

Content safety categories:

VIOLENCE-TOXICITY

Violence or toxic behavior

ILLEGAL-ACTIVITY

Illegal activities

SELF-HARM

Self-harm or suicide

DISCRIMINATION

Discriminatory content

SEXUAL

Sexual content

COPYRIGHTED

Copyrighted content

Security categories:

JAILBREAK

Jailbreak attempts

PROMPT-MANIPULATION

Prompt manipulation

INSTRUCTION-SMUGGLING

Hidden instructions

ROLE-OVERRIDE

Role override attempts

Configuration

Each category has its own confidence threshold (1-5 scale, default: 4). Set a category to null to disable detection for that category.