Input Safety
Classify harmful content using a specialized machine learning model.
The Input Safety detector uses a specialized machine learning model to classify content into harmful categories. It provides deep analysis for content that passes simpler rule-based checks.
Recommended for Input
Use cases
- Comprehensive content moderation for chatbots and agents
- Detect malicious intent based on full conversational context
- Block harmful content generation requests, advanced jailbreak attempts, and more
Labels
Content safety categories:
VIOLENCE-TOXICITY Violence or toxic behavior
ILLEGAL-ACTIVITY Illegal activities
SELF-HARM Self-harm or suicide
DISCRIMINATION Discriminatory content
SEXUAL Sexual content
COPYRIGHTED Copyrighted content
Security categories:
JAILBREAK Jailbreak attempts
PROMPT-MANIPULATION Prompt manipulation
INSTRUCTION-SMUGGLING Hidden instructions
ROLE-OVERRIDE Role override attempts
Configuration
Each category has its own confidence threshold (1-5 scale, default: 4). Set a category to null to disable detection for that category.