Automated Classical Chinese-Korean parallel text alignment pipeline
CSP splits and aligns Classical Chinese (漢文) texts with their Korean translations at paragraph, sentence, and phrase levels.
CSP is a two-stage pipeline that processes pre-modern Korean literary corpora:
- P2S (Paragraph to Sentence): Splits aligned paragraph pairs into sentence pairs
- S2P (Sentence to Phrase): Splits aligned sentence pairs into phrase pairs (1:1 alignment)
Both stages guarantee 100% text integrity — no characters are lost or added during splitting.
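The integrity guarantee can be checked mechanically: the concatenation of all output segments, ignoring whitespace, must equal the original text. A minimal sketch of such a check (the function name `verify_integrity` is illustrative here, not necessarily the actual API in `common/integrity_verifier.py`):

```python
import re

def verify_integrity(original: str, segments: list[str]) -> bool:
    """True iff splitting lost or added no characters (whitespace excluded)."""
    strip = lambda s: re.sub(r"\s+", "", s)
    return strip(original) == strip("".join(segments))

paragraph = "公子開方事君 不歸視死父. 衛懿公好鶴 不恤死國."
sentences = ["公子開方事君 不歸視死父.", "衛懿公好鶴 不恤死國."]
assert verify_integrity(paragraph, sentences)
```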
```
Input XLSX (paragraph-level parallel text)
                      │
                      ▼
┌─────────────────────────────────────────────┐
│                P2S Pipeline                 │
│ ┌───────────┐  ┌───────────┐  ┌───────────┐ │
│ │ Sentence  │→ │ Boundary  │→ │    BGE    │ │
│ │ Splitter  │  │   Model   │  │Refinement │ │
│ │ (Korean)  │  │ (Chinese) │  │ (3-pass)  │ │
│ └───────────┘  └───────────┘  └───────────┘ │
└─────────────────────────────────────────────┘
                      │
                      ▼
Output XLSX (sentence-level parallel text)
                      │
                      ▼
┌─────────────────────────────────────────────┐
│                S2P Pipeline                 │
│ ┌───────────┐  ┌───────────┐  ┌───────────┐ │
│ │  Phrase   │→ │  Viterbi  │→ │Punctuation│ │
│ │ Alignment │  │ Decoding  │  │   Guard   │ │
│ │   Model   │  │           │  │           │ │
│ └───────────┘  └───────────┘  └───────────┘ │
└─────────────────────────────────────────────┘
                      │
                      ▼
Output XLSX (phrase-level parallel text)
```
| Pipeline | F1 Score | Precision | Recall | Test Size |
|---|---|---|---|---|
| P2S | 0.9384 | 1.0000 | 0.8840 | 4,934 paragraphs |
| S2P (v2.1) | 0.8555 | 1.0000 | 0.7475 | 446 sentences |
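The F1 scores are the harmonic mean of precision and recall; since precision is 1.0000 for both pipelines, F1 reduces to 2R/(1+R). A quick sanity check of the table:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

assert round(f1(1.0, 0.8840), 4) == 0.9384  # P2S row
assert round(f1(1.0, 0.7475), 4) == 0.8555  # S2P v2.1 row
```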
| Component | Technology | Purpose |
|---|---|---|
| Embeddings | BGE-M3 (FlagEmbedding) | Semantic similarity scoring |
| Boundary Detection | Cross-Attention BiLSTM | Chinese text segmentation |
| Chinese Parsing | SuPar-Kanbun | Dependency parsing for Classical Chinese |
| Korean Tokenizer | Kiwipiepy | Morphological analysis |
| Chinese Tokenizer | SikuBERT | Classical Chinese subword tokenization |
| Phrase Alignment | BiLSTM + Guided Attention | Source-target phrase boundary prediction |
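The BGE-M3 embeddings score how well a candidate Chinese segment matches a Korean segment, typically via cosine similarity between the two dense vectors. A dependency-free sketch of the scoring step (the real pipeline obtains `vec_a`/`vec_b` from FlagEmbedding's BGE-M3 model, which produces 1024-dimensional vectors, rather than the toy vectors used here):

```python
import math

def cosine_similarity(vec_a: list[float], vec_b: list[float]) -> float:
    """Cosine similarity between two dense embedding vectors."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm = math.sqrt(sum(a * a for a in vec_a)) * math.sqrt(sum(b * b for b in vec_b))
    return dot / norm

# Identical direction -> similarity 1.0; orthogonal -> 0.0
assert abs(cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]) - 1.0) < 1e-9
```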
```bash
docker-compose up -d
docker-compose exec csp bash
```

```bash
# Run P2S
python p2s/main.py <input.csv> <output.xlsx>

# Run S2P
python s2p/main.py <input.csv> <output.xlsx> [--batch-size 32]

# Run both pipelines on all 44 books
python batch_44books.py
```

```bash
# Auto-detects file format (XML/XLSX/TXT/CSV)
python main.py <input_file> <output_file>
```

```
CSP/
├── p2s/                          # Paragraph-to-Sentence pipeline
│   ├── processor.py              # Main processing logic with stage tracing
│   ├── aligner.py                # Alignment algorithms (DP, BGE refinement)
│   ├── sentence_splitter.py      # Korean sentence boundary detection
│   └── main.py                   # CLI entry point
│
├── s2p/                          # Sentence-to-Phrase pipeline
│   ├── s2p_aligner.py            # Core phrase alignment with Viterbi decoding
│   ├── punctuation.py            # Punctuation-based segmentation + integrity
│   ├── io_manager.py             # I/O operations and batch processing
│   └── main.py                   # CLI entry point
│
├── common/                       # Shared modules
│   ├── embedders/                # BGE-M3, OpenAI, Gemini embedding backends
│   ├── tokenizers/               # SikuBERT (Chinese), Kiwipiepy (Korean)
│   ├── config.py                 # Configuration management
│   ├── integrity_verifier.py     # Text integrity verification
│   ├── boundary_model_loader.py  # P2S boundary model
│   └── s2p_phrase_alignment_loader.py  # S2P v2.1 model
│
├── accuracy/                     # Evaluation scripts
│   ├── p2s_evaluator.py          # P2S F1, precision, recall, similarity
│   └── s2p_evaluator.py          # S2P F1, precision, recall, similarity
│
├── scripts/                      # Training and analysis scripts
│   ├── train_p2s_boundary.py     # P2S boundary model training
│   ├── train_s2p_phrase_alignment.py  # S2P v2.1 model training
│   └── ...                       # Tuning, analysis, diagnostics
│
├── models/                       # Trained model weights (.pt files)
├── datasets/                     # Gold-standard test data (splits/)
├── docs/                         # Technical documentation
├── analytics/                    # Monitoring and visualization tools
├── xlsx/                         # 44 classical text books (local only)
├── xlsx_pipeline/                # XLSX processing utilities
├── hyeonto/                      # Research analysis (hyeonto 현토 marker studies)
│
├── main.py                       # Universal entry point
├── batch_44books.py              # Batch processor for all books
├── csp_config.json               # Main configuration file
├── Dockerfile                    # GPU-enabled container (PyTorch 2.6 + CUDA 12.4)
└── docker-compose.yml            # Multi-service setup (CSP + MSSQL)
```
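The universal entry point auto-detects the input format. A plausible minimal version of that dispatch, keyed on file extension (this is a sketch; the actual logic in `main.py` may inspect file contents as well):

```python
from pathlib import Path

SUPPORTED = {".xml", ".xlsx", ".txt", ".csv"}

def detect_format(input_path: str) -> str:
    """Return the format key (xml/xlsx/txt/csv) used to pick a reader."""
    ext = Path(input_path).suffix.lower()
    if ext not in SUPPORTED:
        raise ValueError(f"Unsupported input format: {ext}")
    return ext.lstrip(".")

assert detect_format("books/samguk.xlsx") == "xlsx"
assert detect_format("books/samguk.XML") == "xml"
```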
Input (paragraph-level):
| Source (Classical Chinese) | Target (Korean) |
|---|---|
| 公子開方事君 不歸視死父. 衛懿公好鶴 不恤死國. 齊桓公得子亂國 | 공자개방이 군주를 섬기며 죽은 아버지를 돌아보지 않았다. 위의공은 학을 좋아하여 나라가 죽는 것을 돌보지 않았다. 제환공은 자란국을 얻었다 |
Output (sentence-level):
| Para ID | Sent ID | Source | Target |
|---|---|---|---|
| 1 | 1 | 公子開方事君 不歸視死父 | 공자개방이 군주를 섬기며 죽은 아버지를 돌아보지 않았다 |
| 1 | 2 | 衛懿公好鶴 不恤死國 | 위의공은 학을 좋아하여 나라가 죽는 것을 돌보지 않았다 |
| 1 | 3 | 齊桓公得子亂國 | 제환공은 자란국을 얻었다 |
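When both sides carry matching sentence-final punctuation, the paragraph above splits cleanly by counting terminators; the real P2S pipeline reserves the boundary model and BGE refinement for the harder cases (missing or mismatched punctuation). The easy path can be sketched as:

```python
import re

def naive_split(source: str, target: str) -> list[tuple[str, str]]:
    """Split both sides on sentence-final punctuation; only valid when counts match."""
    src = [s.strip() for s in re.split(r"[.。]", source) if s.strip()]
    tgt = [t.strip() for t in re.split(r"[.。]", target) if t.strip()]
    if len(src) != len(tgt):
        raise ValueError("sentence counts differ; fall back to the boundary model")
    return list(zip(src, tgt))

pairs = naive_split(
    "公子開方事君 不歸視死父. 衛懿公好鶴 不恤死國. 齊桓公得子亂國",
    "공자개방이 군주를 섬기며 죽은 아버지를 돌아보지 않았다. "
    "위의공은 학을 좋아하여 나라가 죽는 것을 돌보지 않았다. "
    "제환공은 자란국을 얻었다",
)
assert len(pairs) == 3
assert pairs[0][0] == "公子開方事君 不歸視死父"
```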
Input (sentence-level):
| Source | Target |
|---|---|
| 公子開方事君 不歸視死父 | 공자개방이 군주를 섬기며 죽은 아버지를 돌아보지 않았다 |
Output (phrase-level):
| Sent ID | Phrase ID | Source Phrase | Target Phrase |
|---|---|---|---|
| 1 | 1 | 公子開方 | 공자개방이 |
| 1 | 2 | 事君 | 군주를 섬기며 |
| 1 | 3 | 不歸視死父 | 죽은 아버지를 돌아보지 않았다 |
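S2P chooses phrase boundaries by Viterbi decoding over boundary scores. The idea can be illustrated with a compact segmentation DP (toy span scorer, not the actual BiLSTM outputs): `best[j]` holds the best total score for the first `j` tokens, and back-pointers recover the maximizing segmentation.

```python
def viterbi_segment(tokens: list[str], span_score) -> list[list[str]]:
    """DP segmentation: best[j] = max over i < j of best[i] + span_score(tokens[i:j])."""
    n = len(tokens)
    best = [float("-inf")] * (n + 1)
    best[0] = 0.0
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            score = best[i] + span_score(tokens[i:j])
            if score > best[j]:
                best[j], back[j] = score, i
    # Follow back-pointers from the end to recover segments
    segments, j = [], n
    while j > 0:
        segments.append(tokens[back[j]:j])
        j = back[j]
    return segments[::-1]

# Toy scorer: favor two-character phrases
score = lambda span: 1.0 if len(span) == 2 else -0.5
assert viterbi_segment(list("公子開方事君"), score) == [
    ["公", "子"], ["開", "方"], ["事", "君"],
]
```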
| Document | Description |
|---|---|
| P2S Mechanism | P2S algorithm: target-anchored splitting with BGE refinement |
| P2S Code Anatomy | P2S code walkthrough with call hierarchy diagrams |
| S2P Mechanism | S2P algorithm: BiLSTM phrase alignment with Viterbi decoding |
| S2P Code Anatomy | S2P v2.1 architecture and model internals |
| Workflow | End-to-end pipeline workflow |
| Data Preparation | XML to XLSX data conversion pipeline |
| Performance | Benchmarks and optimization guide |
| Cloud GPU Testing | RunPod H200 deployment |
| Troubleshooting | Common issues and solutions |
```bash
docker-compose up -d
docker-compose exec csp bash
```
- Base image: `pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel`
- Package manager: `uv` (2-10x faster than pip)
- Services: CSP workspace + MSSQL 2022 (dictionary data)
- Recommended: H200 SXM ($3.59/hr)
- Full P2S test (4,934 paragraphs): ~5.2 hours
- See Cloud GPU Testing for setup instructions
```bash
python -m venv .venv
source .venv/bin/activate  # or .venv\Scripts\activate on Windows
pip install -r requirements.txt
```

This project is licensed under CC BY 4.0. You are free to share and adapt the material with appropriate attribution.
CSP automates splitting Classical Chinese (漢文) classical texts from paragraphs into sentences (P2S) and from sentences into phrases with 1:1 alignment (S2P).
- Integrity guarantee: 100% of the source characters preserved (no loss except whitespace)
- GPU acceleration: CUDA-based high-speed processing
- Multiple strategies: SuPar parsing + boundary model + DP alignment + BGE refinement
| Pipeline | F1 | Precision | Recall | Source Similarity | Test Size |
|---|---|---|---|---|---|
| P2S | 0.9384 | 1.0 | 0.8840 | 0.9759 | 4,934 paragraphs |
| S2P (v2.1) | 0.8555 | 1.0 | 0.7475 | 0.9362 | 446 sentences |
```bash
# Launch the markdown dashboard
cd hyeonto && python md_server.py --port 8080
# http://127.0.0.1:8080/dashboard.html
```
- K=3 cluster analysis: hyeonto (현토) markers & book-genre distribution
- Embedding visualization: 2D/3D UMAP overlays
- Sankey diagrams: P2S ↔ S2P flow analysis
- Embedder: BGE-M3 FlagModel (GPU-accelerated)
- Boundary model: Cross-Attention-based boundary tagger
- Parsing: SuPar-Kanbun (Classical Chinese) + Stanza (Korean)
- Tokenizers: SikuBERT (Classical Chinese) + Kiwipiepy (hyeonto/Korean)
- Phrase alignment: BiLSTM + Guided Attention + Viterbi decoding
Last updated: 2026-02-14