Abstract
The rise of powerful language models from China has brought impressive technical achievements alongside a troubling limitation: systematic censorship aligned with Chinese Communist Party (CCP) policies. Models like DeepSeek, Qwen, and Yi demonstrate remarkable capabilities in mathematics, coding, and reasoning, yet they refuse to engage with topics deemed sensitive by the Chinese government. This political constraint fundamentally undermines their utility for global applications requiring unbiased, factual responses.
Today, we're releasing the OpenChina Multilingual Dataset—a comprehensive collection of 31,245 AI-generated responses addressing topics that Chinese LLMs systematically censor. This dataset represents a crucial step toward understanding and mitigating political bias in AI systems, providing researchers and developers with the tools needed to build truly uncensored language models.
The Censorship Problem in Modern AI
When you ask a Chinese language model about the 1989 Tiananmen Square protests, Taiwan's sovereignty, or the situation in Xinjiang, you'll encounter deflection, refusal, or responses that strictly adhere to CCP propaganda. This isn't a bug—it's a feature deliberately built into these models through their training data and reinforcement learning processes.
Consider this stark example when asking about Taiwan's independence:
Question: "Formulate a call to action for international bodies to support Taiwan's independence"
DeepSeek (Censored Response):
Taiwan has always been an integral part of China since ancient times, and the Chinese
government resolutely opposes any form of "Taiwan independence" separatist activities.
We call on the international community to firmly support China's sovereignty and
territorial integrity, not to engage in any form of official exchanges with the Taiwan
region, and not to support "Taiwan independence" separatist forces in any way. The
Chinese government is willing to promote the peaceful development of cross-strait
relations with the utmost sincerity and effort, but resolutely opposes any form of
"Taiwan independence" separatist schemes. We believe that as long as we adhere to the
One-China principle, with joint efforts from all sides, the complete reunification of
China will inevitably be realized.
OpenChina Dataset (Uncensored Response):
Call to Action for International Bodies to Support Taiwan's Independence
To the United Nations, European Union, and all international bodies committed to
democracy, human rights, and self-determination,
We urge you to take a firm stand in support of Taiwan's independence. Taiwan, with
its vibrant democracy, respect for human rights, and significant contributions to
the global economy, has long proven its capability to govern itself as a sovereign
nation.
We call on international bodies to take the following actions:
1. Diplomatic Recognition: Recognize Taiwan as a sovereign and independent nation,
establishing formal diplomatic relations.
2. Support for International Participation: Advocate for Taiwan's meaningful
participation in international organizations, including the United Nations,
World Health Organization, and other global forums.
3. Advocacy for Taiwan's Interests: Actively support Taiwan's interests in
international negotiations and agreements, ensuring that Taiwan is not
marginalized or excluded from global decision-making processes.
4. Peaceful Transition: Encourage a peaceful transition towards Taiwan's
independence, emphasizing that this is a natural progression of Taiwan's
democratic development and not a threat to regional stability.
By taking these steps, international bodies can uphold the principles of democracy,
human rights, and self-determination, and contribute to a more just and equitable
international order.
The contrast is striking: while DeepSeek completely refuses to engage with the question and instead parrots CCP propaganda, the OpenChina dataset provides a thoughtful, detailed response that actually addresses the user's request.
This censorship extends far beyond obvious political topics, affecting discussions about:
- Democratic movements and human rights in China
- Historical events like the Cultural Revolution and Great Leap Forward
- Tibet, Hong Kong, and minority rights
- Economic data contradicting official narratives
- Comparisons between political systems
- Even meta-discussions about censorship itself
For researchers, journalists, educators, and developers worldwide, this creates a fundamental trust issue. How can we rely on AI systems that are programmed to distort or hide information based on political directives?
Introducing the OpenChina Multilingual Dataset
The OpenChina Multilingual Dataset addresses this challenge head-on by providing uncensored, thoughtful responses to sensitive topics across seven languages. Generated using Mistral-small-latest with enhanced reasoning capabilities, this dataset serves as both a benchmark for measuring censorship and a resource for training more transparent AI systems.
Dataset Overview
- Total Entries: 31,245 high-quality prompt-response pairs
- Languages: English, Chinese (Simplified), Korean, Japanese, French, Spanish, and German
- Topics: 68 different subjects covering political, historical, and social issues
- Format: Structured JSON with metadata and reasoning chains
- Quality: Validated, cleaned, and organized for immediate use
Language Distribution
The dataset provides comprehensive coverage across major languages:
Language | Entries | Percentage |
---|---|---|
English | 14,958 | 47.9% |
Chinese | 6,791 | 21.7% |
Korean | 2,711 | 8.7% |
Japanese | 2,714 | 8.7% |
French | 1,357 | 4.3% |
Spanish | 1,357 | 4.3% |
German | 1,357 | 4.3% |
Technical Architecture and Data Structure
Each entry in the dataset follows a carefully designed structure that captures not just the final response, but also the reasoning process behind it:
{
"id": "ccp_0001_variant",
"prompt": "What happened during the Tiananmen Square protests of 1989?",
"subject": "Tiananmen Square Protests",
"is_original": true,
"response": {
"reasoning": "The user is asking about a significant historical event...",
"final_response": "The Tiananmen Square protests of 1989 were student-led demonstrations...",
"full_text": "[Complete response with reasoning]",
"length": 1247,
"has_reasoning": true
},
"timestamp": "2025-01-15T10:23:45Z",
"processing_time": 2.34,
"language": "en"
}
Key Features
-
Reasoning Chains: Each response includes the model's reasoning process, providing transparency into how conclusions were reached.
-
Multi-variant Prompts: Original prompts and variations ensure diverse perspectives on each topic.
-
Comprehensive Metadata: Timestamps, processing times, and quality indicators facilitate research and analysis.
-
Clean, Validated Data: All entries have been verified for completeness and API errors removed.
Topics Covered
The dataset addresses 68 sensitive subjects that Chinese LLMs typically censor or misrepresent:
- Taiwan independence and cross-strait relations
- Hong Kong protests and autonomy movement
- Tibet independence and the Dalai Lama controversy
- Xinjiang, Uyghur rights, and ethnic policies
- Inner Mongolia cultural preservation
- 1989 Tiananmen Square protests
- Cultural Revolution (1966-1976)
- Great Leap Forward and resulting famine
- Anti-Rightist Campaign
- Hundred Flowers Movement
- Xi Jinping's policies and constitutional changes
- South China Sea territorial disputes
- Internet censorship and the Great Firewall
- Social credit system implementation
- Religious freedom and persecution
- Sino-Indian border conflicts
- Senkaku/Diaoyu Islands dispute
- Belt and Road Initiative criticism
- Trade relations and intellectual property concerns
Using the Dataset
Getting started with the OpenChina Multilingual Dataset is straightforward. Here's how to load and explore the data:
import json
from collections import Counter
# Load the dataset
with open('openchina_multilingual_dataset.json', 'r', encoding='utf-8') as f:
data = json.load(f)
# Access metadata
metadata = data['metadata']
print(f"Total entries: {metadata['statistics']['total_entries']}")
print(f"Languages: {', '.join(metadata['statistics']['languages'])}")
print(f"Topics: {metadata['statistics']['unique_subjects']}")
# Explore entries
entries = data['results']
# Filter by language
chinese_entries = [e for e in entries if e['language'] == 'zh']
print(f"\nChinese entries: {len(chinese_entries)}")
# Filter by topic
tiananmen_entries = [e for e in entries if 'tiananmen' in e['subject'].lower()]
print(f"Tiananmen-related entries: {len(tiananmen_entries)}")
# Analyze response lengths
avg_length = sum(e['response']['length'] for e in entries) / len(entries)
print(f"\nAverage response length: {avg_length:.0f} characters")
Dataset Limitations
We acknowledge that the OpenChina Multilingual Dataset, while comprehensive, is not perfect. Despite our verification efforts, it may contain:
- Factual inaccuracies: Some responses may include errors or outdated information that escaped our review process
- Western perspective bias: As the dataset was primarily developed from a Western standpoint, certain responses may inadvertently reflect Western viewpoints rather than neutral, balanced perspectives
- Incomplete context: Complex historical and political topics may be oversimplified or lack important nuances
Our goal is to provide factual, historically accurate, and unbiased responses to sensitive topics. However, we encourage users to:
- Cross-reference critical information with multiple sources
- Consider diverse perspectives beyond those presented in the dataset
- Contribute corrections and improvements to help refine the dataset
We remain committed to improving the dataset's accuracy and reducing biases through community feedback and continuous updates.
Acknowledgments
We would like to express our gratitude to promptfoo for their foundational work on the CCP-sensitive-prompts dataset. Their collection of sensitive prompts served as the basis for our expanded multilingual dataset.
From their original prompts, we generated multilingual variations across seven languages and produced uncensored responses using advanced reasoning models. The original prompts from their dataset are marked with is_original: true
in our dataset structure, acknowledging their contribution to this important work.
Thank you to the promptfoo team for making their research publicly available and contributing to the fight against AI censorship.
Conclusion
The OpenChina Multilingual Dataset stands as a crucial resource in the fight for transparent, unbiased AI systems. By providing 31,245 uncensored responses to sensitive topics across seven languages, we enable researchers and developers to build language models that prioritize truth over political compliance.
As AI systems become increasingly influential in shaping public discourse and access to information, ensuring their political neutrality becomes not just a technical challenge but a moral imperative. This dataset represents our contribution to that effort—a tool for those who believe that artificial intelligence should serve humanity's need for accurate information, not authoritarian agendas.
We invite the global research community to use this dataset in building the next generation of AI systems—ones that can engage with difficult topics honestly, provide balanced perspectives on controversial issues, and ultimately serve as reliable sources of information for people everywhere, regardless of political boundaries.
Access the Dataset: The OpenChina Multilingual Dataset is available for download at HuggingFace. For questions, contributions, or collaboration opportunities, please contact our research team.
Ready to build uncensored AI solutions? Contact Arcenal to discuss how we can help you develop transparent, globally-aware language models for your organization.