Training & inference code for the paper *Languages are New Modalities: Cross-Lingual Alignment via Encoder Injection*.
Instruction-tuned Large Language Models (LLMs) underperform on low‑resource, non‑Latin scripts due to tokenizer fragmentation and weak cross‑lingual coupling. We present LLINK (Large Language Injection for Non-English Knowledge), a compute-efficient language-as-modality method that conditions an instruction-tuned decoder without changing the tokenizer or retraining the decoder. First, we align sentence embeddings from a frozen multilingual encoder to the decoder's latent embedding space at a reserved position via a lightweight contrastive projector. Second, the vector is expanded into
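As a rough illustration of the Stage A alignment described above, the sketch below pairs a frozen multilingual sentence encoder with a small MLP projector trained under a symmetric InfoNCE objective. The class name `ContrastiveProjector`, the dimensions, and the loss details are illustrative assumptions, not the repository's exact implementation.

```python
# Minimal sketch of a contrastive projector, assuming a frozen multilingual
# encoder (e.g. a 768-d sentence encoder) and a larger decoder embedding space.
# Architecture, dimensions, and loss are illustrative assumptions only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContrastiveProjector(nn.Module):
    """Maps frozen multilingual sentence embeddings into the decoder's embedding space."""

    def __init__(self, enc_dim: int = 768, dec_dim: int = 4096, hidden: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(enc_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, dec_dim),
        )

    def forward(self, sent_emb: torch.Tensor) -> torch.Tensor:
        return self.net(sent_emb)


def info_nce(projected: torch.Tensor, targets: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: matching rows are positives, the rest of the batch are negatives."""
    p = F.normalize(projected, dim=-1)
    t = F.normalize(targets, dim=-1)
    logits = p @ t.T / temperature
    labels = torch.arange(p.size(0), device=p.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))
```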
- Python 3.13 or newer for local tooling. (Modal functions build their own CUDA-enabled Python 3.10 image.)
- `uv` or `pip` for dependency management.
- Modal CLI (`pip install modal`) with an authenticated account (`modal token set`).
- A Hugging Face access token stored in Modal as a secret named `hf-token`:

  ```bash
  echo "hf_xxx" | modal secret create hf-token --stdin
  ```

- A Modal volume named `khmer-bridge-vol` to persist checkpoints and trained artifacts:

  ```bash
  modal volume create khmer-bridge-vol
  ```

  Populate the volume with the required model checkpoints before running remote jobs; asset preparation is tracked separately.
```bash
git clone <repo-url>
cd llink
uv venv                  # or: python3.13 -m venv .venv
source .venv/bin/activate
uv sync                  # or: pip install -e .
pip install modal        # ensures CLI + SDK available inside the venv
```

`uv sync` installs the base visualization utilities (matplotlib, pandas) used for log analysis and plots. Add any project-specific extras inside the virtual environment as needed.
The remote functions expect Modal to provide:
- GPU type `H100`.
- Volume `khmer-bridge-vol` mounted at `/vol` (checkpoints, adapters, and projector weights).
- Secret `hf-token` exposing the Hugging Face token as `HUGGINGFACE_HUB_TOKEN`.
Create or update these resources before launching jobs. Data curation and uploads are handled separately (TBD).
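For orientation, the hypothetical sketch below shows how a remote function can bind these resources with the Modal SDK; the app name, image definition, and function body are placeholders, not the repository's actual code.

```python
# Hypothetical wiring of the expected Modal resources; the app name, image,
# and function body are placeholders rather than the repository's actual code.
import modal

app = modal.App("llink-demo")
vol = modal.Volume.from_name("khmer-bridge-vol")          # must already exist
image = modal.Image.debian_slim(python_version="3.10").pip_install("torch", "transformers")


@app.function(
    gpu="H100",
    image=image,
    volumes={"/vol": vol},                                # checkpoints, adapters, projector weights
    secrets=[modal.Secret.from_name("hf-token")],         # exposes HUGGINGFACE_HUB_TOKEN
)
def smoke_test() -> list[str]:
    import os
    assert os.environ.get("HUGGINGFACE_HUB_TOKEN"), "hf-token secret not attached"
    return os.listdir("/vol")                             # list persisted artifacts
```

Running a small function like this is a quick way to confirm the volume and secret are attached before launching the real jobs.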
Launch the Stage A contrastive projector job on Modal:
```bash
modal run projection/train.py::train_projection_model
```

The job streams logs to the terminal. Checkpoints and metadata are written under `/vol/ckpts` in the attached Modal volume.
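As a small sanity check on those artifacts, a sketch like the one below can list the tensors in a projector checkpoint once it has been synced locally from the volume; the file name and the assumption that the checkpoint is a plain `state_dict` are illustrative.

```python
# Hedged sketch: inspect a Stage A projector checkpoint synced locally from
# the Modal volume. The file name and state_dict layout are assumptions.
import torch

state = torch.load("projector_stageA.pt", map_location="cpu")
for name, value in state.items():
    shape = tuple(value.shape) if hasattr(value, "shape") else type(value).__name__
    print(f"{name:40s} {shape}")
```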
Invoke the remote inference function with a string in the foreign language (the examples below use Khmer):
```bash
modal run inference/main.py::infer --kwargs '{"foreign_text": "សួស្តី", "task_type": "translate_to_english"}'
```

Optional flags:

- `--strict` (boolean) toggles conservative prompting.
- `--gate-boost <float>` scales the injected slot magnitude.
The function compares injected vs. ablated outputs and prints warnings when the injection has no measurable effect.
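Conceptually, the injected run overwrites the embedding of a reserved slot position with the projected sentence vector before decoding, while the ablated run leaves the placeholder untouched. The sketch below illustrates that comparison; the decoder checkpoint, reserved token, and prompt handling are assumptions, not the repository's configuration.

```python
# Conceptual sketch of injected vs. ablated decoding. The decoder checkpoint,
# reserved slot token, and prompt handling are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"        # hypothetical decoder
SLOT = "<|reserved_special_token_0|>"             # hypothetical reserved slot token

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)


@torch.no_grad()
def generate(prompt_with_slot: str, slot_vector: torch.Tensor | None, gate_boost: float = 1.0) -> str:
    ids = tok(prompt_with_slot, return_tensors="pt").input_ids
    embeds = model.get_input_embeddings()(ids).clone()
    if slot_vector is not None:                   # injected run
        pos = (ids[0] == tok.convert_tokens_to_ids(SLOT)).nonzero(as_tuple=True)[0]
        embeds[0, pos] = gate_boost * slot_vector.to(embeds.dtype)
    # ablated run: slot_vector=None leaves the placeholder embedding untouched
    out = model.generate(inputs_embeds=embeds, max_new_tokens=64, do_sample=False)
    return tok.decode(out[0], skip_special_tokens=True)
```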
To inspect slot neighborhoods or token-level deltas during debugging, call the probing utility:
```bash
modal run inference/main.py::lexeme_probe --kwargs '{"foreign_text": "សួស្តី"}'
```

Supply a list of foreign-language examples to the batch tester to measure pairwise cosine similarity and slot norms:

```bash
modal run inference/main.py::test_batch --kwargs '{"foreign_texts": ["ខ្ញុំស្រឡាញ់ភាសាខ្មែរ", "សួស្តី"]}'
```

Outputs include diversity statistics that help tune batching and normalization.
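These diagnostics amount to simple tensor statistics over the projected slot vectors; a minimal, self-contained sketch follows (the function name and dimensions are illustrative).

```python
# Minimal sketch of the batch diagnostics: pairwise cosine similarity and
# per-example norms over a batch of projected slot vectors (dims illustrative).
import torch
import torch.nn.functional as F


def slot_diversity_stats(slot_vectors: torch.Tensor) -> dict[str, float]:
    """slot_vectors: (batch, dec_dim), one projected vector per foreign text."""
    norms = slot_vectors.norm(dim=-1)
    unit = F.normalize(slot_vectors, dim=-1)
    cos = unit @ unit.T
    off_diag = cos[~torch.eye(cos.size(0), dtype=torch.bool)]
    return {
        "mean_norm": norms.mean().item(),
        "mean_pairwise_cos": off_diag.mean().item(),
        "max_pairwise_cos": off_diag.max().item(),
    }


print(slot_diversity_stats(torch.randn(4, 4096)))
```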
The repository also includes helper scripts that can be run locally once the virtual environment is active, for example:
```bash
python parse_logs.py   # summarize training runs into CSV/plots
```

Plotting and analysis notebooks reference artifacts written to the Modal volume. Mount or sync those assets locally before post-processing (details TBD).
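For reference, log post-processing of this kind typically amounts to a regex pass over the training log followed by a pandas/matplotlib summary. The sketch below shows the general pattern; the log format, file names, and regular expression are assumptions, not what `parse_logs.py` actually does.

```python
# Hypothetical sketch of log post-processing: extract step/loss pairs from a
# training log, write a CSV, and plot the curve. The log format, file names,
# and regex are assumptions, not the actual behavior of parse_logs.py.
import re

import matplotlib.pyplot as plt
import pandas as pd

pattern = re.compile(r"step[=\s]+(\d+).*?loss[=\s]+([0-9.]+)")
rows = []
with open("train.log") as fh:                 # synced locally from the Modal volume
    for line in fh:
        match = pattern.search(line)
        if match:
            rows.append({"step": int(match.group(1)), "loss": float(match.group(2))})

df = pd.DataFrame(rows)
df.to_csv("train_metrics.csv", index=False)
df.plot(x="step", y="loss", title="Stage A projector loss")
plt.savefig("loss_curve.png", dpi=150)
```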
