We build datasets, train models, run evaluations, and publish benchmarks for literary and niche-domain translation: light novels, visual novels, galgames, and web fiction, the domains where generic machine translation collapses into polite nonsense.
At OpenSakura, we believe in open science and reproducible results. Everything we do is done in public, with receipts.
- 📚 Datasets — Curated, schema-validated, with documented lineage and licensing. High-quality parallel corpora designed specifically for our target domains.
- 🔬 Experiments — Fully reproducible training runs with pinned revisions, deterministic setups, and meticulously logged metrics.
- 📊 Benchmarks — Comprehensive LLM-as-a-judge pairwise evaluations, establishing Elo-style rankings for translation models.
- ⚔️ Arena — Blind A/B human evaluation platform to gather high-fidelity preference data for RLHF and DPO.
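To make the Elo-style ranking concrete, here is a minimal sketch of how pairwise judgments could be folded into ratings. It uses the textbook Elo update; the K-factor, the initial rating, the function names, and the model names are illustrative assumptions, not the actual OpenSakura benchmark implementation.

```python
from collections import defaultdict


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_rankings(comparisons, k: float = 32.0, initial: float = 1000.0):
    """Turn pairwise judgments into a ranked list of (model, rating).

    `comparisons` is an iterable of (model_a, model_b, outcome), where
    outcome is 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie.
    The values of `k` and `initial` are assumptions chosen for illustration.
    """
    ratings = defaultdict(lambda: initial)
    for model_a, model_b, outcome in comparisons:
        rating_a, rating_b = ratings[model_a], ratings[model_b]
        exp_a = expected_score(rating_a, rating_b)
        # Zero-sum update: whatever A gains, B loses.
        ratings[model_a] = rating_a + k * (outcome - exp_a)
        ratings[model_b] = rating_b - k * (outcome - exp_a)
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


# Hypothetical judgments: "model-x" is preferred over "model-y" once.
ranked = elo_rankings([("model-x", "model-y", 1.0)])
```

Sequential Elo updates depend on the order of comparisons, so a real leaderboard may shuffle and average over several passes or fit a Bradley-Terry model instead; this sketch only shows the basic update.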
| Resource | URL |
|---|---|
| 📊 Benchmark Dashboard | bench.opensakura.com |
| ⚔️ Translation Arena | arena.opensakura.com |
| 🤗 Hugging Face | huggingface.co/OpenSakura |
| Repository | Description |
|---|---|
| 📖 Users-Please-Come-And-See-This | User-facing guide: Find our datasets, models, benchmarks, and FAQ here. |
| 🛠️ Contributors-You-Dare-Not-Watch-This | Contribution rules: Naming conventions, PR flow, and guidelines. |
| 🗄️ OpenSakura-DS-260130-LN-SFT-Template | Dataset repo template: Schema, validation, and tooling. |
| 🧪 OpenSakura-EXP-260213-LN-SFT-Template | Experiment repo template: Run format, metrics, and env capture. |
| 🏟️ OpenSakura-Arena | Arena platform source code: Built with Next.js, FastAPI, and PostgreSQL. |
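As a rough illustration of what schema validation for a parallel-corpus record might look like, here is a small sketch. The field names (`source`, `target`, `source_lang`, `target_lang`, `license`) are hypothetical and are not taken from the dataset template; consult the template repository for the real schema.

```python
# Hypothetical required fields for one parallel-corpus record.
REQUIRED_FIELDS = {
    "source": str,
    "target": str,
    "source_lang": str,
    "target_lang": str,
    "license": str,
}


def validate_record(record: dict) -> list[str]:
    """Return a list of human-readable errors; an empty list means valid."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        value = record.get(field)
        if not isinstance(value, expected_type):
            # Covers both missing fields and wrong types.
            errors.append(f"{field}: expected {expected_type.__name__}")
        elif not value.strip():
            errors.append(f"{field}: must not be empty")
    return errors


# Example record with placeholder values.
example = {
    "source": "こんにちは",
    "target": "Hello",
    "source_lang": "ja",
    "target_lang": "en",
    "license": "CC-BY-4.0",
}
problems = validate_record(example)
```

Returning a list of errors rather than raising on the first failure lets a validation script report every problem in a record at once.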
We welcome contributions of all kinds: datasets, tooling, evaluation scripts, documentation, or simply an issue filed when something breaks. We need your help!
👉 Start here: Contribution Guidelines
If you use OpenSakura artifacts and they help you, tell a friend. If they don't, tell us what failed (with examples).
OpenSakura grew out of the SakuraLLM community — the pioneering open-source project for Japanese-to-Chinese ACGN translation models. Members of that community came together to push the mission further: broader language pairs, rigorous reproducibility, public benchmarks, and a more open development process. We stand on their shoulders and are deeply grateful for the foundation they built.
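OpenSakura is an open-source community project focused on domain-specific large language model (LLM) translation. We build high-quality datasets, train dedicated models, run rigorous evaluations, and publish authoritative benchmarks, covering light novels, visual novels, galgames, web fiction, and other niche domains where generic machine translation tends to fall apart.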
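- 📚 Datasets: Carefully curated and strictly validated, with full provenance tracking and clear open-source licensing.
- 🔬 Experiments: Fully reproducible training pipelines with pinned dataset and model versions and complete training metrics on record.
- 📊 Benchmarks: Pairwise evaluation with an LLM as the judge (LLM-as-a-judge), used to build an Elo ranking system.
- ⚔️ Arena: Blind A/B human evaluation that supplies high-quality human preference data for subsequent RLHF and DPO training.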
| Resource | Link |
|---|---|
| 📊 Benchmark Dashboard | bench.opensakura.com |
| ⚔️ Translation Arena | arena.opensakura.com |
| 🤗 Hugging Face | huggingface.co/OpenSakura |
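| Repository | Description |
|---|---|
| 📖 Users-Please-Come-And-See-This | User guide: Datasets, models, benchmark notes, and frequently asked questions (FAQ). |
| 🛠️ Contributors-You-Dare-Not-Watch-This | Contribution rules: Naming conventions, PR submission flow, and community code of conduct. |
| 🗄️ OpenSakura-DS-260130-LN-SFT-Template | Dataset repo template: Schema structure, data validation, and tooling. |
| 🧪 OpenSakura-EXP-260213-LN-SFT-Template | Experiment repo template: Standardized run records, metrics monitoring, and environment snapshots. |
| 🏟️ OpenSakura-Arena | Arena platform source code: Built with Next.js, FastAPI, and PostgreSQL. |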
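We warmly welcome open-source contributions of every kind! Whether you provide datasets, build tooling, write evaluation scripts, improve documentation, or simply file an issue to report a bug, we need your help!
👉 Start here: Contribution Guidelines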
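If OpenSakura's output helps you, please recommend it to your friends! If you run into a bad translation, send us feedback with concrete examples so we can keep improving.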
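OpenSakura grew out of the SakuraLLM community, an open-source pioneer in Japanese-to-Chinese ACGN translation. With the aim of taking that work further, community members came together again: we are committed to supporting more language pairs, enforcing stricter reproducibility standards, building open and transparent benchmarks, and running a more open development process.
We stand on the shoulders of those who came before us and offer our deepest respect and thanks for the solid foundation SakuraLLM laid.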
