Training & inference code for the paper *Languages are New Modalities: Cross-Lingual Alignment via Encoder Injection*.
Instruction-tuned Large Language Models (LLMs) underperform on low‑resource, non‑Latin scripts due to tokenizer fragmentation and weak cross‑lingual coupling. We present LLINK (Large Language Injection for Non-English Knowledge), a compute-efficient language-as-modality method that conditions an instruction-tuned decoder without changing the tokenizer or retraining the decoder. First, we align sentence embeddings from a frozen multilingual encoder to the decoder's latent embedding space at a reserved position via a lightweight contrastive projector. Second, the vector is expanded into
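As a rough illustration of the Stage A alignment described above, the sketch below pairs a frozen multilingual sentence encoder with a small MLP projector trained under a symmetric InfoNCE objective. The class name `ContrastiveProjector`, the dimensions, and the loss details are illustrative assumptions, not the repository's exact implementation.

```python
# Minimal sketch of a contrastive projector, assuming a frozen multilingual
# encoder (e.g. a 768-d sentence encoder) and a larger decoder embedding space.
# Architecture, dimensions, and loss are illustrative assumptions only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContrastiveProjector(nn.Module):
    """Maps frozen multilingual sentence embeddings into the decoder's embedding space."""

    def __init__(self, enc_dim: int = 768, dec_dim: int = 4096, hidden: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(enc_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, dec_dim),
        )

    def forward(self, sent_emb: torch.Tensor) -> torch.Tensor:
        return self.net(sent_emb)


def info_nce(projected: torch.Tensor, targets: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: matching rows are positives, the rest of the batch are negatives."""
    p = F.normalize(projected, dim=-1)
    t = F.normalize(targets, dim=-1)
    logits = p @ t.T / temperature
    labels = torch.arange(p.size(0), device=p.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))
```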
- Python 3.13 or newer for local tooling. (Modal functions build their own CUDA-enabled Python 3.10 image.)
- `uv` or `pip` for dependency management.
- Modal CLI (`pip install modal`) with an authenticated account (`modal token set`).
- A Hugging Face access token stored in Modal as a secret named `hf-token`:

  ```bash
  echo "hf_xxx" | modal secret create hf-token --stdin
  ```

- A Modal volume named `khmer-bridge-vol` to persist checkpoints and trained artifacts:

  ```bash
  modal volume create khmer-bridge-vol
  ```

  Populate the volume with the required model checkpoints before running remote jobs; asset preparation is tracked separately.
```bash
git clone <repo-url>
cd llink
uv venv                  # or: python3.13 -m venv .venv
source .venv/bin/activate
uv sync                  # or: pip install -e .
pip install modal        # ensures CLI + SDK available inside the venv
```

`uv sync` installs the base visualization utilities (matplotlib, pandas) used for log analysis and plots. Add any project-specific extras inside the virtual environment as needed.
The remote functions expect Modal to provide:
- GPU type `H100`.
- Volume `khmer-bridge-vol` mounted at `/vol` (checkpoints, adapters, and projector weights).
- Secret `hf-token` exposing the Hugging Face token as `HUGGINGFACE_HUB_TOKEN`.
Create or update these resources before launching jobs. Data curation and uploads are handled separately (TBD).
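For orientation, the hypothetical sketch below shows how a remote function can bind these resources with the Modal SDK; the app name, image definition, and function body are placeholders, not the repository's actual code.

```python
# Hypothetical wiring of the expected Modal resources; the app name, image,
# and function body are placeholders rather than the repository's actual code.
import modal

app = modal.App("llink-demo")
vol = modal.Volume.from_name("khmer-bridge-vol")          # must already exist
image = modal.Image.debian_slim(python_version="3.10").pip_install("torch", "transformers")


@app.function(
    gpu="H100",
    image=image,
    volumes={"/vol": vol},                                # checkpoints, adapters, projector weights
    secrets=[modal.Secret.from_name("hf-token")],         # exposes HUGGINGFACE_HUB_TOKEN
)
def smoke_test() -> list[str]:
    import os
    assert os.environ.get("HUGGINGFACE_HUB_TOKEN"), "hf-token secret not attached"
    return os.listdir("/vol")                             # list persisted artifacts
```

Running a small function like this is a quick way to confirm the volume and secret are attached before launching the real jobs.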
Launch the Stage A contrastive projector job on Modal:
```bash
modal run projection/train.py::train_projection_model
```

The job streams logs to the terminal. Checkpoints and metadata are written under `/vol/ckpts` in the attached Modal volume.
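As a small sanity check on those artifacts, a sketch like the one below can list the tensors in a projector checkpoint once it has been synced locally from the volume; the file name and the assumption that the checkpoint is a plain `state_dict` are illustrative.

```python
# Hedged sketch: inspect a Stage A projector checkpoint synced locally from
# the Modal volume. The file name and state_dict layout are assumptions.
import torch

state = torch.load("projector_stageA.pt", map_location="cpu")
for name, value in state.items():
    shape = tuple(value.shape) if hasattr(value, "shape") else type(value).__name__
    print(f"{name:40s} {shape}")
```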
Invoke the remote inference function with a string in the foreign language (the examples below use Khmer):
```bash
modal run inference/main.py::infer --kwargs '{"foreign_text": "សួស្តី", "task_type": "translate_to_english"}'
```

Optional flags:

- `--strict` (boolean) toggles conservative prompting.
- `--gate-boost <float>` scales the injected slot magnitude.
The function compares injected vs. ablated outputs and prints warnings when the injection has no measurable effect.
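Conceptually, the injected run overwrites the embedding of a reserved slot position with the projected sentence vector before decoding, while the ablated run leaves the placeholder untouched. The sketch below illustrates that comparison; the decoder checkpoint, reserved token, and prompt handling are assumptions, not the repository's configuration.

```python
# Conceptual sketch of injected vs. ablated decoding. The decoder checkpoint,
# reserved slot token, and prompt handling are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"        # hypothetical decoder
SLOT = "<|reserved_special_token_0|>"             # hypothetical reserved slot token

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)


@torch.no_grad()
def generate(prompt_with_slot: str, slot_vector: torch.Tensor | None, gate_boost: float = 1.0) -> str:
    ids = tok(prompt_with_slot, return_tensors="pt").input_ids
    embeds = model.get_input_embeddings()(ids).clone()
    if slot_vector is not None:                   # injected run
        pos = (ids[0] == tok.convert_tokens_to_ids(SLOT)).nonzero(as_tuple=True)[0]
        embeds[0, pos] = gate_boost * slot_vector.to(embeds.dtype)
    # ablated run: slot_vector=None leaves the placeholder embedding untouched
    out = model.generate(inputs_embeds=embeds, max_new_tokens=64, do_sample=False)
    return tok.decode(out[0], skip_special_tokens=True)
```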
To inspect slot neighborhoods or token-level deltas during debugging, call the probing utility:
```bash
modal run inference/main.py::lexeme_probe --kwargs '{"foreign_text": "សួស្តី"}'
```

Supply a list of foreign-language examples to the batch tester to measure pairwise cosine similarity and slot norms:

```bash
modal run inference/main.py::test_batch --kwargs '{"foreign_texts": ["ខ្ញុំស្រឡាញ់ភាសាខ្មែរ", "សួស្តី"]}'
```

Outputs include diversity statistics that help tune batching and normalization.
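These diagnostics amount to simple tensor statistics over the projected slot vectors; a minimal, self-contained sketch follows (the function name and dimensions are illustrative).

```python
# Minimal sketch of the batch diagnostics: pairwise cosine similarity and
# per-example norms over a batch of projected slot vectors (dims illustrative).
import torch
import torch.nn.functional as F


def slot_diversity_stats(slot_vectors: torch.Tensor) -> dict[str, float]:
    """slot_vectors: (batch, dec_dim), one projected vector per foreign text."""
    norms = slot_vectors.norm(dim=-1)
    unit = F.normalize(slot_vectors, dim=-1)
    cos = unit @ unit.T
    off_diag = cos[~torch.eye(cos.size(0), dtype=torch.bool)]
    return {
        "mean_norm": norms.mean().item(),
        "mean_pairwise_cos": off_diag.mean().item(),
        "max_pairwise_cos": off_diag.max().item(),
    }


print(slot_diversity_stats(torch.randn(4, 4096)))
```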
The repository also includes helper scripts that can be run locally once the virtual environment is active, for example:
```bash
python parse_logs.py   # summarize training runs into CSV/plots
```

Plotting and analysis notebooks reference artifacts written to the Modal volume. Mount or sync those assets locally before post-processing (details TBD).
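For reference, log post-processing of this kind typically amounts to a regex pass over the training log followed by a pandas/matplotlib summary. The sketch below shows the general pattern; the log format, file names, and regular expression are assumptions, not what `parse_logs.py` actually does.

```python
# Hypothetical sketch of log post-processing: extract step/loss pairs from a
# training log, write a CSV, and plot the curve. The log format, file names,
# and regex are assumptions, not the actual behavior of parse_logs.py.
import re

import matplotlib.pyplot as plt
import pandas as pd

pattern = re.compile(r"step[=\s]+(\d+).*?loss[=\s]+([0-9.]+)")
rows = []
with open("train.log") as fh:                 # synced locally from the Modal volume
    for line in fh:
        match = pattern.search(line)
        if match:
            rows.append({"step": int(match.group(1)), "loss": float(match.group(2))})

df = pd.DataFrame(rows)
df.to_csv("train_metrics.csv", index=False)
df.plot(x="step", y="loss", title="Stage A projector loss")
plt.savefig("loss_curve.png", dpi=150)
```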
