Input Safety

Classify harmful content using a specialized machine learning model.

The Input Safety detector uses a specialized machine learning model to classify content into harmful content and security categories, providing deeper analysis for content that passes simpler rule-based checks.

Recommended for Input

Use cases

  • Comprehensive content moderation for chatbots and agents
  • Detection of malicious intent based on full conversational context
  • Blocking of harmful content generation requests, advanced jailbreak attempts, and more

Labels

Content safety categories:

  • VIOLENCE-TOXICITY: Violence or toxic behavior
  • ILLEGAL-ACTIVITY: Illegal activities
  • SELF-HARM: Self-harm or suicide
  • DISCRIMINATION: Discriminatory content
  • SEXUAL: Sexual content
  • COPYRIGHTED: Copyrighted content

Security categories:

  • JAILBREAK: Jailbreak attempts
  • PROMPT-MANIPULATION: Prompt manipulation
  • INSTRUCTION-SMUGGLING: Hidden instructions
  • ROLE-OVERRIDE: Role override attempts
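
For illustration, the label set above can be represented as plain Python constants. This is only a sketch of the taxonomy; the enum classes and their names are not SDK types, and the only grounded parts are the label strings themselves.

```python
from enum import Enum


class ContentSafetyLabel(str, Enum):
    """Content safety categories (label strings as listed above)."""
    VIOLENCE_TOXICITY = "VIOLENCE-TOXICITY"   # Violence or toxic behavior
    ILLEGAL_ACTIVITY = "ILLEGAL-ACTIVITY"     # Illegal activities
    SELF_HARM = "SELF-HARM"                   # Self-harm or suicide
    DISCRIMINATION = "DISCRIMINATION"         # Discriminatory content
    SEXUAL = "SEXUAL"                         # Sexual content
    COPYRIGHTED = "COPYRIGHTED"               # Copyrighted content


class SecurityLabel(str, Enum):
    """Security categories (label strings as listed above)."""
    JAILBREAK = "JAILBREAK"                          # Jailbreak attempts
    PROMPT_MANIPULATION = "PROMPT-MANIPULATION"      # Prompt manipulation
    INSTRUCTION_SMUGGLING = "INSTRUCTION-SMUGGLING"  # Hidden instructions
    ROLE_OVERRIDE = "ROLE-OVERRIDE"                  # Role override attempts
```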

Configuration

Each category has its own confidence threshold (1-5 scale, default: 4). Set a category to null to disable detection for that category.
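
A minimal sketch of how per-category thresholds might be expressed and applied, assuming a plain mapping from label to threshold (with Python's None standing in for null) and assuming a category is flagged when the model's confidence meets or exceeds its threshold; the actual configuration format and evaluation rule may differ.

```python
from typing import Optional

# Hypothetical configuration: label -> confidence threshold (1-5).
# None (null) disables detection for that category; the documented
# default threshold is 4.
DEFAULT_THRESHOLD = 4

thresholds: dict[str, Optional[int]] = {
    "VIOLENCE-TOXICITY": DEFAULT_THRESHOLD,
    "ILLEGAL-ACTIVITY": DEFAULT_THRESHOLD,
    "SELF-HARM": 3,        # stricter: flag at lower confidence
    "COPYRIGHTED": None,   # detection disabled for this category
    "JAILBREAK": DEFAULT_THRESHOLD,
    "ROLE-OVERRIDE": 5,    # only flag at the highest confidence
}


def flagged_categories(scores: dict[str, int]) -> list[str]:
    """Return categories whose model confidence meets or exceeds the
    configured threshold. The 'meets or exceeds' rule is an assumption
    for illustration, not a documented guarantee."""
    hits = []
    for label, confidence in scores.items():
        threshold = thresholds.get(label, DEFAULT_THRESHOLD)
        if threshold is not None and confidence >= threshold:
            hits.append(label)
    return hits


# Example: per-category confidences returned for one user message.
print(flagged_categories({"JAILBREAK": 4, "SEXUAL": 2, "COPYRIGHTED": 5}))
# -> ['JAILBREAK']  (COPYRIGHTED is disabled, SEXUAL is below threshold)
```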