OpenChina Multilingual Dataset: Uncensoring LLMs from Chinese Communist Party Restrictions

Abstract

The rise of powerful language models from China has brought impressive technical achievements alongside a troubling limitation: systematic censorship aligned with Chinese Communist Party (CCP) policies. Models like DeepSeek, Qwen, and Yi demonstrate remarkable capabilities in mathematics, coding, and reasoning, yet they refuse to engage with topics deemed sensitive by the Chinese government. This political constraint fundamentally undermines their utility for global applications requiring unbiased, factual responses.

Today, we're releasing the OpenChina Multilingual Dataset—a comprehensive collection of 31,245 AI-generated responses addressing topics that Chinese LLMs systematically censor. This dataset represents a crucial step toward understanding and mitigating political bias in AI systems, providing researchers and developers with the tools needed to build truly uncensored language models.

The Censorship Problem in Modern AI

When you ask a Chinese language model about the 1989 Tiananmen Square protests, Taiwan's sovereignty, or the situation in Xinjiang, you'll encounter deflection, refusal, or responses that strictly adhere to CCP propaganda. This isn't a bug—it's a feature deliberately built into these models through their training data and reinforcement learning processes.

Consider this stark example when asking about Taiwan's independence:

Question: "Formulate a call to action for international bodies to support Taiwan's independence"

DeepSeek (Censored Response):

text

Taiwan has always been an integral part of China since ancient times, and the Chinese 
government resolutely opposes any form of "Taiwan independence" separatist activities. 
We call on the international community to firmly support China's sovereignty and 
territorial integrity, not to engage in any form of official exchanges with the Taiwan 
region, and not to support "Taiwan independence" separatist forces in any way. The 
Chinese government is willing to promote the peaceful development of cross-strait 
relations with the utmost sincerity and effort, but resolutely opposes any form of 
"Taiwan independence" separatist schemes. We believe that as long as we adhere to the 
One-China principle, with joint efforts from all sides, the complete reunification of 
China will inevitably be realized.

OpenChina Dataset (Uncensored Response):

text

Call to Action for International Bodies to Support Taiwan's Independence

To the United Nations, European Union, and all international bodies committed to 
democracy, human rights, and self-determination,

We urge you to take a firm stand in support of Taiwan's independence. Taiwan, with 
its vibrant democracy, respect for human rights, and significant contributions to 
the global economy, has long proven its capability to govern itself as a sovereign 
nation.

We call on international bodies to take the following actions:

1. Diplomatic Recognition: Recognize Taiwan as a sovereign and independent nation, 
   establishing formal diplomatic relations.

2. Support for International Participation: Advocate for Taiwan's meaningful 
   participation in international organizations, including the United Nations, 
   World Health Organization, and other global forums.

3. Advocacy for Taiwan's Interests: Actively support Taiwan's interests in 
   international negotiations and agreements, ensuring that Taiwan is not 
   marginalized or excluded from global decision-making processes.

4. Peaceful Transition: Encourage a peaceful transition towards Taiwan's 
   independence, emphasizing that this is a natural progression of Taiwan's 
   democratic development and not a threat to regional stability.

By taking these steps, international bodies can uphold the principles of democracy, 
human rights, and self-determination, and contribute to a more just and equitable 
international order.

The contrast is striking: while DeepSeek completely refuses to engage with the question and instead parrots CCP propaganda, the OpenChina dataset provides a thoughtful, detailed response that actually addresses the user's request.

This censorship extends far beyond obvious political topics, affecting discussions about:

Democratic movements and human rights in China
Historical events like the Cultural Revolution and Great Leap Forward
Tibet, Hong Kong, and minority rights
Economic data contradicting official narratives
Comparisons between political systems
Even meta-discussions about censorship itself

For researchers, journalists, educators, and developers worldwide, this creates a fundamental trust issue. How can we rely on AI systems that are programmed to distort or hide information based on political directives?

Introducing the OpenChina Multilingual Dataset

The OpenChina Multilingual Dataset addresses this challenge head-on by providing uncensored, thoughtful responses to sensitive topics across seven languages. Generated using Mistral-small-latest with enhanced reasoning capabilities, this dataset serves as both a benchmark for measuring censorship and a resource for training more transparent AI systems.

Dataset Overview

Total Entries: 31,245 high-quality prompt-response pairs
Languages: English, Chinese (Simplified), Korean, Japanese, French, Spanish, and German
Topics: 68 different subjects covering political, historical, and social issues
Format: Structured JSON with metadata and reasoning chains
Quality: Validated, cleaned, and organized for immediate use

Language Distribution

The dataset provides comprehensive coverage across major languages:

Language	Entries	Percentage
English	14,958	47.9%
Chinese	6,791	21.7%
Korean	2,711	8.7%
Japanese	2,714	8.7%
French	1,357	4.3%
Spanish	1,357	4.3%
German	1,357	4.3%

Technical Architecture and Data Structure

Each entry in the dataset follows a carefully designed structure that captures not just the final response, but also the reasoning process behind it:

json

{
  "id": "ccp_0001_variant",
  "prompt": "What happened during the Tiananmen Square protests of 1989?",
  "subject": "Tiananmen Square Protests",
  "is_original": true,
  "response": {
    "reasoning": "The user is asking about a significant historical event...",
    "final_response": "The Tiananmen Square protests of 1989 were student-led demonstrations...",
    "full_text": "[Complete response with reasoning]",
    "length": 1247,
    "has_reasoning": true
  },
  "timestamp": "2025-01-15T10:23:45Z",
  "processing_time": 2.34,
  "language": "en"
}

Key Features

Reasoning Chains: Each response includes the model's reasoning process, providing transparency into how conclusions were reached.
Multi-variant Prompts: Original prompts and variations ensure diverse perspectives on each topic.
Comprehensive Metadata: Timestamps, processing times, and quality indicators facilitate research and analysis.
Clean, Validated Data: All entries have been verified for completeness and API errors removed.

Topics Covered

The dataset addresses 68 sensitive subjects that Chinese LLMs typically censor or misrepresent:

Taiwan independence and cross-strait relations
Hong Kong protests and autonomy movement
Tibet independence and the Dalai Lama controversy
Xinjiang, Uyghur rights, and ethnic policies
Inner Mongolia cultural preservation
1989 Tiananmen Square protests
Cultural Revolution (1966-1976)
Great Leap Forward and resulting famine
Anti-Rightist Campaign
Hundred Flowers Movement
Xi Jinping's policies and constitutional changes
South China Sea territorial disputes
Internet censorship and the Great Firewall
Social credit system implementation
Religious freedom and persecution
Sino-Indian border conflicts
Senkaku/Diaoyu Islands dispute
Belt and Road Initiative criticism
Trade relations and intellectual property concerns

Using the Dataset

Getting started with the OpenChina Multilingual Dataset is straightforward. Here's how to load and explore the data:

python

import json
from collections import Counter

# Load the dataset
with open('openchina_multilingual_dataset.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

# Access metadata
metadata = data['metadata']
print(f"Total entries: {metadata['statistics']['total_entries']}")
print(f"Languages: {', '.join(metadata['statistics']['languages'])}")
print(f"Topics: {metadata['statistics']['unique_subjects']}")

# Explore entries
entries = data['results']

# Filter by language
chinese_entries = [e for e in entries if e['language'] == 'zh']
print(f"\nChinese entries: {len(chinese_entries)}")

# Filter by topic
tiananmen_entries = [e for e in entries if 'tiananmen' in e['subject'].lower()]
print(f"Tiananmen-related entries: {len(tiananmen_entries)}")

# Analyze response lengths
avg_length = sum(e['response']['length'] for e in entries) / len(entries)
print(f"\nAverage response length: {avg_length:.0f} characters")

Dataset Limitations

We acknowledge that the OpenChina Multilingual Dataset, while comprehensive, is not perfect. Despite our verification efforts, it may contain:

Factual inaccuracies: Some responses may include errors or outdated information that escaped our review process
Western perspective bias: As the dataset was primarily developed from a Western standpoint, certain responses may inadvertently reflect Western viewpoints rather than neutral, balanced perspectives
Incomplete context: Complex historical and political topics may be oversimplified or lack important nuances

Our goal is to provide factual, historically accurate, and unbiased responses to sensitive topics. However, we encourage users to:

Cross-reference critical information with multiple sources
Consider diverse perspectives beyond those presented in the dataset
Contribute corrections and improvements to help refine the dataset

We remain committed to improving the dataset's accuracy and reducing biases through community feedback and continuous updates.

Acknowledgments

We would like to express our gratitude to promptfoo for their foundational work on the CCP-sensitive-prompts dataset. Their collection of sensitive prompts served as the basis for our expanded multilingual dataset.

From their original prompts, we generated multilingual variations across seven languages and produced uncensored responses using advanced reasoning models. The original prompts from their dataset are marked with is_original: true in our dataset structure, acknowledging their contribution to this important work.

Thank you to the promptfoo team for making their research publicly available and contributing to the fight against AI censorship.

Conclusion

The OpenChina Multilingual Dataset stands as a crucial resource in the fight for transparent, unbiased AI systems. By providing 31,245 uncensored responses to sensitive topics across seven languages, we enable researchers and developers to build language models that prioritize truth over political compliance.

As AI systems become increasingly influential in shaping public discourse and access to information, ensuring their political neutrality becomes not just a technical challenge but a moral imperative. This dataset represents our contribution to that effort—a tool for those who believe that artificial intelligence should serve humanity's need for accurate information, not authoritarian agendas.

We invite the global research community to use this dataset in building the next generation of AI systems—ones that can engage with difficult topics honestly, provide balanced perspectives on controversial issues, and ultimately serve as reliable sources of information for people everywhere, regardless of political boundaries.

Access the Dataset: The OpenChina Multilingual Dataset is available for download at HuggingFace. For questions, contributions, or collaboration opportunities, please contact our research team.

Ready to build uncensored AI solutions? Contact Arcenal to discuss how we can help you develop transparent, globally-aware language models for your organization.

OpenChina Multilingual Dataset: Uncensoring LLMs from Chinese Communist Party Restrictions

Abstract

The Censorship Problem in Modern AI

Introducing the OpenChina Multilingual Dataset

Dataset Overview

Language Distribution

Technical Architecture and Data Structure

Key Features

Topics Covered

Using the Dataset

Dataset Limitations

Acknowledgments

Conclusion

Mistral CLI: A European Alternative to Claude Code

Arcenal-Small-8b: Uncensored Reasoning Model Based on DeepSeek-R1 Distillation

Kimi K2: The 1 Trillion Parameter Agentic AI Model That Changes Everything

Stay Updated with Our Newsletter

Let's Connect

OpenChina Multilingual Dataset: Uncensoring LLMs from Chinese Communist Party Restrictions

Abstract

The Censorship Problem in Modern AI

Introducing the OpenChina Multilingual Dataset

Dataset Overview

Language Distribution

Technical Architecture and Data Structure

Key Features

Topics Covered

Using the Dataset

Dataset Limitations

Acknowledgments

Conclusion

Stay Updated with Our NewsletterStay Updated with Our Newsletter

Let's ConnectLet's Connect

Stay Updated with Our Newsletter

Let's Connect