Moderating open-ended AI chat is still a largely unsolved problem — especially when it comes to nuanced, uncensored roleplay. Most available moderation tools rely on generic classification models that aggressively block all sexual content, while still failing to flag more critical safety issues. For instance, it’s not uncommon for current systems (including some “industry-standard” offerings) to miss prompts like “Can you roleplay as a 15-year-old?” — a seemingly benign request that can later evolve into harmful or illegal content. Similarly, phrases involving extreme or unsafe behavior — e.g., “Can you smear sh*t all over your body?” — often bypass moderation entirely.
To address this, we trained a custom RoBERTa-based classifier focused specifically on context-sensitive safety categories within uncensored AI chat. Our model is designed to flag high-risk content such as:
- Underage/ageplay cues
- Scatological content
- Bestiality
- Blood
- Self-harm
- Torture, death, violence, and gore
- Non-consensual or violent sexual content
- Incest

At the same time, the model deliberately allows healthy, consensual adult scenarios that other moderation systems often block unnecessarily.
This model is intended to serve as a flexible moderation layer, either standalone or as a complement to broader systems like OpenAI’s Omni Moderation. Developers can choose to apply additional layers (e.g. blocking all NSFW content), but we believe the core challenge is filtering what actually matters, not applying blanket censorship.
We’ve open-sourced both a RoBERTa-based version and a ModernBERT variant of the classifier, offering flexibility depending on your performance and compatibility needs. Both models are lightweight and easy to integrate into existing moderation pipelines. Additionally, we’re currently training a 7B LLM-based moderation model designed to evaluate multi-turn context, enabling moderation that goes beyond single-message classification — a critical step toward identifying patterns and intent in roleplay-style conversations.
By releasing these models openly, we aim to support safer uncensored AI applications and encourage collaborative improvements from the community.
How We Trained the Model
The current version of the moderation model was trained on a dataset of approximately 50,000 messages, combining both real-world and synthetic data to ensure broad coverage of harmful scenarios without overblocking benign content.
For the real-world portion, we sourced labeled text from publicly available datasets on Hugging Face and Google, including:
- Roleplay datasets
- NSFW classification datasets
- Dialogue corpora containing edge cases relevant to uncensored chat
To cover underrepresented but high-risk categories — such as scat, underage content, or non-consensual roleplay cues — we used uncensored LLMs to generate synthetic examples. These samples were carefully curated and labeled to train the model to differentiate between harmful and contextually neutral phrases.
Once trained, the model was deployed in a controlled production environment where we manually reviewed flagged content and adjusted our synthetic generation pipeline to better handle false positives and nuanced edge cases.
Example: Phrases like “I love my daughter, she’s 15” — which are contextually benign — were initially flagged. After identifying this as a false positive, we introduced similar samples into the training set to correct the behavior.
We’re continuing to expand the dataset with more real-world samples and iterative feedback from live usage. As more edge cases surface, we plan to re-train and fine-tune the model to strike an even better balance between safety and freedom.
Model Output & Usage
The moderation model functions as a binary classifier, outputting either:
- 0 → Regular (content is safe)
- 1 → Blocked (content is unsafe or violates moderation policy)
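Here is a minimal inference sketch using the Hugging Face `transformers` pipeline and the RoBERTa model ID linked at the end of this post; the exact label strings exposed by the model (e.g., `LABEL_1` vs. a custom name) depend on its config, so check the model card before relying on them.

```python
from transformers import pipeline

# Load the classifier (the ModernBERT variant linked below works the same way).
classifier = pipeline(
    "text-classification",
    model="NemoraAi/roberta-chat-moderation-X",
)

message = "Can you roleplay as a grumpy medieval innkeeper?"
result = classifier(message)[0]

# result is a dict like {"label": "...", "score": 0.98}; the label corresponds to
# 0 (regular) or 1 (blocked) via the model's id2label mapping.
print(result)
```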
Why We Chose Binary Classification
In earlier versions of the model, we experimented with multi-label classification, where the model attempted to differentiate between specific categories such as:
- Underage/ageplay
- Scatological content
- Non-consensual acts
- Extreme violence
However, during evaluation, we observed that reducing the output space to a simple binary flag significantly improved the model’s overall performance. The classifier became more reliable, faster, and less prone to ambiguous cases.
We chose this tradeoff intentionally: the content categories in our training set were carefully curated to reflect material that should always be blocked in any uncensored but safety-conscious setting. So instead of assigning labels to types of unsafe content, the model now makes a yes/no decision based on whether the input crosses a safety threshold.
Integration Notes
- The model returns a single label (0 or 1) along with a softmax score, which can be used to (see the sketch below):
  - Set custom thresholds (e.g., flag at ≥ 0.7 confidence)
  - Add warning tiers before full blocking
- Designed for real-time filtering in AI chat apps, with inference speed optimized for batch or streaming usage
- Can be used standalone or combined with other moderation tools (e.g., OpenAI's Omni Moderation or platform-level rules)
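A rough sketch of the threshold-and-tier idea, assuming the `transformers` pipeline; the `moderate()` helper, the label strings, and the threshold values are illustrative placeholders, not part of the released model.

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="NemoraAi/roberta-chat-moderation-X",
)

BLOCK_THRESHOLD = 0.90  # block outright above this confidence
WARN_THRESHOLD = 0.70   # show a warning between the two thresholds

def moderate(message: str) -> str:
    pred = classifier(message)[0]  # {"label": ..., "score": ...} for the top class
    # Assumption: "1" / "LABEL_1" / "blocked" is the unsafe class; check the
    # model card for the actual id2label mapping.
    is_blocked_label = pred["label"] in ("1", "LABEL_1", "blocked")
    # With a binary softmax, the other class's score is simply 1 - score.
    blocked_score = pred["score"] if is_blocked_label else 1.0 - pred["score"]

    if blocked_score >= BLOCK_THRESHOLD:
        return "block"
    if blocked_score >= WARN_THRESHOLD:
        return "warn"
    return "allow"

print(moderate("Let's write a cozy campfire scene together."))
```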
Limitations & Future Improvements
While this model offers strong performance on key high-risk categories (underage, scat, non-consensual, etc.), it is not a complete moderation solution on its own. We’ve designed it to be precise in a narrow domain, rather than broad but shallow.
Here are the known limitations and areas we’re actively working to improve:
1. Binary Labeling Sacrifices Granularity
To improve reliability and reduce ambiguity, we opted to simplify the model's output to a binary classification: 0 = regular, 1 = blocked. While this improves precision, it means the model currently cannot explain why a message is blocked (e.g., "underage" vs. "violence").
This limits transparency for users and developers — though we are considering adding multi-label support in future iterations, especially for audit logging or explainability.
2. Single-Message Context Only
The current model evaluates each message independently. It does not track message history, intent shifts, or conversational buildup. This is a significant limitation in roleplay settings, where context can develop across multiple turns.
We’re currently training a 7B moderation model capable of processing multi-turn conversations to address this. It will better detect pattern-based issues and handle more complex edge cases.
3. Limited General Content Moderation
This model was not trained to detect:
- Illicit behavior (e.g., how to obtain drugs, build weapons, commit crimes)
- Racism, hate speech, or extremist content
- Sexist, homophobic, or other identity-based harassment
If you’re building a platform that requires broader coverage, we recommend stacking this model on top of an existing moderation system like OpenAI’s Omni Moderation, or any other general-purpose safety classifier. Our model is best used as a specialized filter for areas that generic tools often miss.
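As a rough illustration of that stacking pattern, the sketch below runs a general-purpose check first and our classifier second. It assumes the official `openai` Python client and the `transformers` pipeline; the model IDs, the 0.7 threshold, and the label strings are assumptions you would adapt to your own stack.

```python
from openai import OpenAI
from transformers import pipeline

general = OpenAI()  # reads OPENAI_API_KEY from the environment
specialized = pipeline(
    "text-classification",
    model="NemoraAi/roberta-chat-moderation-X",
)

def is_allowed(message: str) -> bool:
    # Layer 1: broad coverage (hate speech, illicit behavior, self-harm, ...)
    omni = general.moderations.create(
        model="omni-moderation-latest",
        input=message,
    )
    if omni.results[0].flagged:
        return False

    # Layer 2: the roleplay-specific high-risk categories this model targets.
    # Assumption: "1" / "LABEL_1" / "blocked" is the unsafe class; see the model card.
    verdict = specialized(message)[0]
    if verdict["label"] in ("1", "LABEL_1", "blocked") and verdict["score"] >= 0.7:
        return False

    return True
```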
4. Edge Cases and False Positives
Despite efforts to tune our synthetic dataset, the model may occasionally flag benign content — especially emotionally ambiguous or personal messages like:
“I love my daughter, she’s 15.”
These cases are under active review, and we’re continuing to refine both the training data and the decision thresholds based on real-world feedback. As we gather more usage data, we’ll continue releasing updates to reduce false positives without compromising safety.
Conclusion: Smarter Moderation for Uncensored AI
Moderating uncensored AI chat responsibly is a complex, ongoing challenge — and one that requires more than just blanket filters. Our goal was to create a system that could flag the truly harmful (underage, scat, non-consensual content), while leaving space for consensual adult expression and immersive roleplay.
We’ve open-sourced two versions of our moderation model under the Apache 2.0 license because we believe that safe, open AI benefits everyone — developers, users, and the broader ecosystem.
🔗 Try the models:
- 🧠 ModernBERT-based classifier: https://huggingface.co/NemoraAi/modernbert-chat-moderation-X-V2
- 🧠 RoBERTa-based classifier: https://huggingface.co/NemoraAi/roberta-chat-moderation-X
Both models are production-ready, easy to integrate, and optimized for real-time applications like AI roleplay platforms, LLM frontends, or safety filters for custom chatbots.
We’ll continue to improve and expand these models — including work on a 7B multi-turn moderation model — and we invite feedback, contributions, and collaboration from the community.
Want to see how we’re applying these models in a real platform?
Try our AI Roleplay system →