
🚀 Added a Translation Pipeline #43

Open

leagrieder wants to merge 14 commits into EPFLiGHT:master from leagrieder:addTranslationModel

Conversation

@leagrieder leagrieder commented Jan 28, 2026

This PR introduces a translation interface for NLLB-200 with fastText language detection.

✨ Key Contributions

  • Translator (translator.py) for multimeditron inference

    • Automatic language detection with fastText (80% confidence threshold)
    • Smart routing to prevent mistranslation of ambiguous inputs
    • Bidirectional medical translation (user language ↔ English)
    • Compatible with base and fine-tuned NLLB-200 models
  • Consensus-based data generation

    • Synthetic parallel medical corpora built from multi-model translation agreement
    • Scalable approach for low-resource languages
  • Fine-tuning & evaluation framework

    • Scripts for NLLB-200 medical fine-tuning
    • Comprehensive experiments on translation quality and downstream medical QA

Contributor

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 39 out of 64 changed files in this pull request and generated 27 comments.



Comment on lines +218 to +220
# CRITICAL FIX: Clip predictions to valid token ID range
vocab_size = len(tokenizer)
preds = np.clip(preds, 0, vocab_size - 1)
Copilot AI Feb 4, 2026

The training script clips token IDs to the vocabulary size to prevent out-of-range errors during evaluation. However, clipping predictions could produce invalid tokens. A better approach would be to investigate why predictions are out of range in the first place, as this indicates a potential issue with model generation or tokenization configuration.

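If out-of-range ids cannot be eliminated at the source, a safer interim fix than `np.clip` is to map invalid ids (including the `-100` ignore index that label padding can leak into predictions) to the pad token before decoding, so garbage never aliases a real high-id token. A minimal sketch; the function name is mine, not from the PR:

```python
def sanitize_predictions(pred_ids, pad_token_id, vocab_size):
    """Map the -100 ignore index and any out-of-range ids to the pad token.

    Unlike np.clip, this does not silently remap invalid ids onto the
    highest real vocabulary entry; anything invalid becomes padding.
    """
    return [
        tid if 0 <= tid < vocab_size else pad_token_id
        for tid in pred_ids
    ]
```

This would typically run in `compute_metrics` just before `tokenizer.batch_decode`.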
"""NLLB-200 translator with fastText language detection."""

def __init__(self,
model_name="src/multimeditron/translation/models/nllb-consensus-finetuned-1epoch", #Fine tuned model - to use the base NLLB-200 3.3B model, add HF path here (nllb-200-3.3B)
Copilot AI Feb 4, 2026

The default model path uses a relative path that may not work correctly depending on where the code is executed from. Consider using an absolute path constructed from `__file__`, or making this a required parameter without a default value. The comment also suggests this should point to a HuggingFace model ID for the base model, but the current default is a local path.

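One way to make the default robust to the working directory is to anchor it at the module file. A sketch under assumptions about the package layout (a `models/` directory next to `translator.py`), with the HF model id still usable as an explicit override:

```python
from pathlib import Path

# Resolve the bundled fine-tuned model relative to this file, not the CWD.
# The "models/nllb-consensus-finetuned-1epoch" layout is an assumption here.
DEFAULT_MODEL_DIR = (
    Path(__file__).resolve().parent / "models" / "nllb-consensus-finetuned-1epoch"
)

def resolve_model_name(model_name=None):
    """Return an explicit model id/path, falling back to the packaged default."""
    return str(model_name or DEFAULT_MODEL_DIR)
```

Callers can then pass `"facebook/nllb-200-3.3B"` (or any HF id) to use the base model.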
Comment on lines +32 to +56
        print(f"[INFO] Loading NLLB model: {model_name}")
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model.to(self.device)

        print(f"[INFO] Loading fastText language detection model")
        try:
            import fasttext
            model_path = hf_hub_download(
                repo_id="facebook/fasttext-language-identification",
                filename="model.bin"
            )
            fasttext.FastText.eprint = lambda x: None
            self.lang_detector = fasttext.load_model(model_path)
            print(f"[INFO] fastText model loaded successfully")
        except Exception as e:
            print(f"[ERROR] Failed to load fastText: {e}")
            print("[INFO] Ensure: pip install 'numpy<2.0' fasttext")
            raise

        self.detected_user_lang = None
        print(f"[INFO] NLLB translator ready on {self.device}")
Copilot AI Feb 4, 2026

The print statements are suitable for debugging but should be replaced with proper logging (using the logging module) for production code. This allows users to control log levels and outputs more flexibly.

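The migration from prints is mechanical: a module-level logger, with `%`-style lazy formatting and `logger.exception` in the failure path. A minimal sketch (the logger name and function are illustrative, not from the PR):

```python
import logging

logger = logging.getLogger("multimeditron.translation")

def load_translator_logged(model_name: str, device: str) -> None:
    """Sketch: the [INFO]/[ERROR] prints routed through the logging module,
    so callers control verbosity and output destinations."""
    logger.info("Loading NLLB model: %s", model_name)
    try:
        # ... AutoTokenizer / AutoModelForSeq2SeqLM / fastText loading here ...
        logger.info("NLLB translator ready on %s", device)
    except Exception:
        # logger.exception records the traceback alongside the message.
        logger.exception("Failed to load fastText")
        raise
```

Scripts that want the old behavior just call `logging.basicConfig(level=logging.INFO)` once at startup.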
Comment on lines +82 to +86
        try:
            token_id = self.tokenizer.convert_tokens_to_ids(detected_code)
            if token_id == self.tokenizer.unk_token_id:
                print(f"[WARNING] '{detected_code}' not supported. Defaulting to eng_Latn.")
                return 'eng_Latn'
Copilot AI Feb 4, 2026

When the detected language code is not supported by the tokenizer, the method returns 'eng_Latn' as a fallback. However, this could lead to incorrect behavior, as the actual text might not be in English. Consider raising an exception or logging a warning to make this fallback behavior more explicit to callers.

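A possible shape for making the fallback explicit: a `strict` flag that raises for callers who must not silently mistranslate, plus a logged warning otherwise. This is a sketch against a duck-typed tokenizer (only `convert_tokens_to_ids` / `unk_token_id` are assumed); the function name is mine:

```python
import logging

logger = logging.getLogger(__name__)

def resolve_lang_code(tokenizer, detected_code, fallback="eng_Latn", strict=False):
    """Return a tokenizer-supported language code.

    strict=True surfaces unsupported codes to the caller instead of
    silently substituting the fallback.
    """
    if tokenizer.convert_tokens_to_ids(detected_code) == tokenizer.unk_token_id:
        if strict:
            raise ValueError(
                f"Language code {detected_code!r} is not supported by this tokenizer"
            )
        logger.warning("'%s' not supported; falling back to %s", detected_code, fallback)
        return fallback
    return detected_code
```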
Comment on lines +155 to +169
    def translate_from_english(self, text: str, tgt_lang: str = None) -> str:
        """
        Translate from English back to original language.
        If original was low confidence (eng_Latn), passes through unchanged.
        """
        if tgt_lang is None:
            if self.detected_user_lang is None:
                print("[WARNING] No detected language stored. Returning as-is.")
                return text
            tgt_lang = self.detected_user_lang

        if tgt_lang == 'eng_Latn':
            return text

        return self.translate(text, 'eng_Latn', tgt_lang)
Copilot AI Feb 4, 2026

The translate_from_english method relies on the detected_user_lang instance variable set by translate_to_english. This creates a stateful dependency between method calls that could lead to bugs if the methods are called out of order or in a multi-threaded context. Consider making this stateless by requiring the target language as a parameter or documenting this requirement clearly.

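A stateless alternative keeps the detected language in the caller's hands for the whole round trip, so concurrent requests cannot clobber each other's state. A sketch assuming the translator exposes `detect_language(text)` and `translate(text, src_lang, tgt_lang)` (the `.upper()` call stands in for the English-language model):

```python
def translate_round_trip(translator, text):
    """Detect once, pass the language explicitly to both directions.

    No instance state is written, so the function is safe to call from
    multiple threads sharing one translator (tokenizer locking aside).
    """
    user_lang = translator.detect_language(text)
    if user_lang == "eng_Latn":
        english = text
    else:
        english = translator.translate(text, user_lang, "eng_Latn")

    answer = english.upper()  # placeholder for the downstream model's output

    if user_lang == "eng_Latn":
        return answer
    return translator.translate(answer, "eng_Latn", user_lang)
```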
            b = bleu.sentence_score(candidate, refs).score
            c = chrf.sentence_score(candidate, refs).score
            scores[model] = 0.5 * b + 0.5 * c
        except:
Copilot AI Feb 4, 2026

Except block directly handles BaseException.

Comment on lines +523 to +524
        except:
            pass
Copilot AI Feb 4, 2026

'except' clause does nothing but pass and there is no explanatory comment.

Suggested change
-        except:
-            pass
+        except Exception as exc:
+            # Gradient checkpointing is an optional optimization; continue without it if enabling fails.
+            print(f" ⚠️ Could not enable gradient checkpointing: {exc}")

Comment on lines +654 to +655
        except:
            pass
Copilot AI Feb 4, 2026

'except' clause does nothing but pass and there is no explanatory comment.

Suggested change
-        except:
-            pass
+        except Exception as save_err:
+            log(f"⚠️ Failed to save emergency checkpoint: {save_err}")

    lang_data = load_jsonl(filepath)

    if SAMPLES_PER_LANGUAGE:
        lang_data = lang_data[:SAMPLES_PER_LANGUAGE]
Copilot AI Feb 4, 2026

This statement is unreachable.


        if lang not in writers:
            out_path = out_dir / f"wikipedia_{lang}_pretraining.jsonl"
            writers[lang] = open(out_path, "w", encoding=ENCODING)
Copilot AI Feb 4, 2026

File is opened but is not closed.

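`contextlib.ExitStack` fits the one-writer-per-language pattern well: handles are opened lazily as new languages appear, and all of them are guaranteed to close, even if an exception interrupts the loop. A sketch (the `"lang"` field name and output naming follow the snippet above, but the function itself is illustrative):

```python
import json
from contextlib import ExitStack
from pathlib import Path

def split_by_language(records, out_dir):
    """Write each record to wikipedia_<lang>_pretraining.jsonl, closing
    every output file via ExitStack on both success and error paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    writers = {}
    with ExitStack() as stack:
        for rec in records:
            lang = rec["lang"]
            if lang not in writers:
                out_path = out_dir / f"wikipedia_{lang}_pretraining.jsonl"
                writers[lang] = stack.enter_context(
                    open(out_path, "w", encoding="utf-8")
                )
            writers[lang].write(json.dumps(rec, ensure_ascii=False) + "\n")
    # All files are closed here, however the with-block was exited.
```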
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 39 out of 64 changed files in this pull request and generated 10 comments.



Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 39 out of 64 changed files in this pull request and generated 16 comments.



Comment on lines +29 to +35
import fasttext
from huggingface_hub import hf_hub_download
from openai import OpenAI

project_root = Path(__file__).parent.parent.parent.parent
sys.path.insert(0, str(project_root))

Copilot AI Mar 12, 2026

This experiment script imports OpenAI from the openai package, but openai is not listed in pyproject.toml dependencies/optionals. As-is, the script will fail in a clean environment. Consider adding openai as an optional dependency for experiments, or guarding the import with a clear error message that instructs how to install it.

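The guarded-import option can be generic so every experiment script reuses it. A sketch using only the standard library; the hint text and the idea of an `openai` extra are assumptions, not something the repo defines:

```python
import importlib

def require_optional(module_name, hint):
    """Import an optional dependency, failing with an actionable message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is required for this experiment script. {hint}"
        ) from exc

# Usage sketch at the top of the experiment script:
# openai = require_optional("openai", "Install it with: pip install openai")
```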
Comment on lines +140 to +148
        self.tokenizer.src_lang = src_lang

        inputs = self.tokenizer(
            text,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=512
        ).to(self.device)
Copilot AI Mar 12, 2026

self.tokenizer.src_lang = src_lang mutates shared tokenizer state. If a single NLLBTranslator instance is used concurrently (e.g., in a web server), concurrent calls can race and produce incorrect translations. Consider guarding translation calls with a lock, using separate tokenizer instances per thread/request, or using tokenizer methods that don’t rely on mutable global src_lang state.

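The simplest of the suggested fixes is a lock that makes the `src_lang` write and the dependent encode atomic. A sketch against a duck-typed tokenizer (only a mutable `.src_lang` and a `__call__` are assumed, as with the NLLB tokenizer):

```python
import threading

class LockedTokenizerSketch:
    """Serialize tokenizer use so the src_lang mutation and the encode
    that depends on it happen as one atomic step per request."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self._lock = threading.Lock()

    def encode(self, text, src_lang):
        with self._lock:
            # No other thread can change src_lang between these two lines.
            self.tokenizer.src_lang = src_lang
            return self.tokenizer(text)
```

Per-thread tokenizer instances avoid the contention entirely, at the cost of memory; the lock is the smaller change.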
Comment on lines +22 to +26
PROJECT_ROOT = Path(__file__).parent.parent.parent.parent
sys.path.insert(0, str(PROJECT_ROOT))

from multimeditron.translation.translator import NLLBTranslator

Copilot AI Mar 12, 2026

PROJECT_ROOT is set to Path(__file__).parent.parent.parent.parent, which resolves to .../src/multimeditron for this script. Adding that to sys.path does not make import multimeditron... work when running the script directly because the import root should be .../src. Either remove this block and require pip install -e ., or change it to insert the repository's src/ directory (e.g., walk parents until you find the src folder).

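The walk-the-parents variant suggested here is a few lines. A sketch assuming the repository keeps the package under a directory literally named `src/` (the helper name is mine):

```python
import sys
from pathlib import Path

def find_import_root(start, marker="src"):
    """Walk upward from `start` until a directory named `marker` is found,
    so `import multimeditron...` resolves when running from a checkout."""
    for parent in Path(start).resolve().parents:
        if parent.name == marker:
            return parent
    raise RuntimeError(f"No {marker}/ directory above {start}")

# Usage sketch at the top of a script:
# sys.path.insert(0, str(find_import_root(__file__)))
```

`pip install -e .` remains the cleaner option; this only covers scripts run straight from a clone.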
Comment on lines +105 to +122
    def translate_sample(self, sample: dict) -> list:
        source_lang = sample.get('language')
        if source_lang not in LANG_TO_NLLB:
            return []

        src_nllb = LANG_TO_NLLB[source_lang]
        self.stats['by_source_lang'][source_lang] = (
            self.stats['by_source_lang'].get(source_lang, 0) + 1
        )

        translations = []
        for lang_code, (nllb_code, _) in AFRICAN_LANGUAGES.items():
            question = self.translate_text(sample['question'], src_nllb, nllb_code)
            options = [
                self.translate_text(opt, src_nllb, nllb_code)
                for opt in sample['options']
            ]

Copilot AI Mar 12, 2026

translate_sample() indexes sample['question'], sample['options'], and sample['answer'] directly but load_medibench() does not enforce these keys exist (it can also set options to an empty list). This can raise KeyError or emit translations with empty options. Consider using .get(...) with validation (similar to the base NLLB script) and skipping malformed/empty MCQs before translating.

Comment on lines +157 to +160
            if (i + 1) % 50 == 0:
                torch.cuda.empty_cache()
                gc.collect()

Copilot AI Mar 12, 2026

torch.cuda.empty_cache() is called unconditionally. On environments where PyTorch is built without CUDA (or CUDA is unavailable), this can raise an exception. Consider guarding with if torch.cuda.is_available(): ... (and similarly for any other CUDA-specific calls).

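The guard can live in one helper shared by the scripts. Taking the torch module as a parameter keeps the sketch testable without a GPU (or without torch at all); in the scripts you would just call `maybe_free_accelerator_memory(torch)`:

```python
def maybe_free_accelerator_memory(torch_module):
    """Clear the CUDA cache only when CUDA support exists and is available.

    Returns True if the cache was cleared, False on CPU-only builds.
    """
    cuda = getattr(torch_module, "cuda", None)
    if cuda is not None and cuda.is_available():
        cuda.empty_cache()
        return True
    return False
```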
Comment on lines +39 to +45
PROJECT_ROOT = Path(__file__).parent.parent.parent.parent
sys.path.insert(0, str(PROJECT_ROOT))

from multimeditron.model.model import MultiModalModelForCausalLM, ChatTemplate
from multimeditron.model.data_loader import DataCollatorForMultimodal
from multimeditron.translation.translator import NLLBTranslator

Copilot AI Mar 12, 2026

PROJECT_ROOT = Path(__file__).parent.parent.parent.parent here resolves to .../src/multimeditron, which doesn’t help from multimeditron... imports when running the script directly. If you want these experiments to run from a fresh checkout, insert the repo’s src/ directory (or remove this and rely on editable install).

Comment on lines +33 to +38
project_root = Path(__file__).parent.parent.parent.parent
sys.path.insert(0, str(project_root))

from multimeditron.model.model import MultiModalModelForCausalLM, ChatTemplate
from multimeditron.model.data_loader import DataCollatorForMultimodal

Copilot AI Mar 12, 2026

project_root = Path(__file__).parent.parent.parent.parent points to .../src/multimeditron here, which won’t make from multimeditron... imports work when running the file directly. If this is meant to be runnable from a repo checkout, insert the repository’s src/ directory instead (or remove this and rely on editable install).

Comment on lines +22 to +25
project_root = Path(__file__).parent.parent.parent.parent
sys.path.insert(0, str(project_root))

from multimeditron.translation.translator import NLLBTranslator
Copilot AI Mar 12, 2026

project_root = Path(__file__).parent.parent.parent.parent points to .../src/multimeditron, which won’t help import multimeditron... when running this script directly; the import root should be .../src. Either remove the sys.path tweak and rely on installation, or adjust it to insert the repo’s src/ directory.

Comment on lines +137 to +140
            if (i + 1) % 50 == 0:
                torch.cuda.empty_cache()
                gc.collect()

Copilot AI Mar 12, 2026

torch.cuda.empty_cache() is invoked without checking torch.cuda.is_available(). If this script is run in a CPU-only PyTorch build, it can crash. Guard CUDA-specific cache clearing behind an availability check.

Comment on lines +129 to +143
        # BOTH directions
        sources.append(eng_text)
        targets.append(target_text)
        tgt_langs.append(target_lang)

        sources.append(target_text)
        targets.append(eng_text)
        tgt_langs.append('eng_Latn')

    model_inputs = tokenizer(
        sources,
        max_length=MAX_LENGTH,
        truncation=True,
        padding="max_length"
    )
Copilot AI Mar 12, 2026

In preprocess_function, sources contains both English and non-English text (for the X→EN direction), but tokenizer(sources, ...) runs with tokenizer.src_lang still set to eng_Latn. That means the non-English source examples are tokenized with the wrong language code, which will corrupt training. Consider tracking src_langs alongside sources and tokenizing per-language (or per-example) with the correct src_lang, and use the tokenizer’s target-encoding API (text_target/tgt_lang) for labels instead of mutating src_lang inside the labels loop.

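The per-language tokenization the review asks for reduces to: group example indices by source language, tokenize each group under its own `src_lang`, then restore the original order. A language-agnostic sketch of that bookkeeping (the real preprocessing would call the NLLB tokenizer and handle labels via its target-encoding API; here a duck-typed tokenizer stands in):

```python
from collections import defaultdict

def group_by_src_lang(sources, src_langs):
    """Map each language to its (original_index, text) pairs."""
    groups = defaultdict(list)
    for idx, (text, lang) in enumerate(zip(sources, src_langs)):
        groups[lang].append((idx, text))
    return dict(groups)

def tokenize_grouped(tokenizer, sources, src_langs):
    """Tokenize each language group with the correct src_lang, then
    reassemble the encodings in the original example order."""
    encoded = [None] * len(sources)
    for lang, items in group_by_src_lang(sources, src_langs).items():
        tokenizer.src_lang = lang  # set once per group, not per mixed batch
        for idx, text in items:
            encoded[idx] = tokenizer(text)
    return encoded
```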
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 39 out of 64 changed files in this pull request and generated 6 comments.



Comment on lines +445 to +451
print(f"\n📊 Final Statistics:")
print(f" Total steps: {metadata['total_steps']}")
print(f" Epochs: {metadata['epochs_completed']:.2f}")
print(f" Final loss: {metadata['final_train_loss']:.4f}")
print(f" Best BLEU: {metadata['best_eval_bleu']:.2f}")
print(f" Best chrF: {metadata['best_eval_chrf']:.2f}")
print(f" Time: {metadata['training_hours']:.2f} hours")
Comment on lines +51 to +56
if __name__ == "__main__":
src_train = "../../../nemo/datasets/polyglot/fineweb2_am/train.jsonl"
src_test = "../../../nemo/datasets/polyglot/fineweb2_am/test.jsonl"

dest_dir = "src/multimeditron/translation/datasets/formatted_datasets/general_datasets/fineweb/fineweb_am"
os.makedirs(dest_dir, exist_ok=True)
Comment on lines +23 to +27
IN_PATH = Path("../../../nemo/datasets/polyglot/clean_wikipedia/train.jsonl")
OUT_DIR = Path("src/multimeditron/translation/datasets/formatted_datasets/general_datasets/wikipedia")

OUT_DIR.mkdir(parents=True, exist_ok=True)
ENCODING = "utf-8"
Comment on lines +92 to +116
    def detect_language(self, text: str, confidence_threshold=0.80) -> str:
        """
        Detect language using fastText. Returns 'eng_Latn' if confidence < threshold
        to trigger pass-through behavior (no translation).
        """
        try:
            clean_text = text.replace('\n', ' ').strip()
            predictions = self.lang_detector.predict(clean_text, k=3)

            detected_code = predictions[0][0].replace('__label__', '')
            confidence = float(predictions[1][0])

            LOGGER.debug("Detected language %s (confidence %.3f)", detected_code, confidence)

            if confidence < confidence_threshold:
                LOGGER.warning(
                    "Low confidence language detection (%.3f < %.3f). Falling back to eng_Latn.",
                    confidence,
                    confidence_threshold,
                )
                for i in range(min(3, len(predictions[0]))):
                    alt_code = predictions[0][i].replace('__label__', '')
                    alt_conf = float(predictions[1][i])
                    LOGGER.warning("Alternative prediction %d: %s (%.3f)", i + 1, alt_code, alt_conf)
                return 'eng_Latn'
Comment on lines +173 to +181
    # Load dataset
    print(f"\n[1/4] Loading dataset from {input_file}...")
    try:
        with open(input_file, 'r', encoding='utf-8') as f:
            data = json.load(f)
        print(f" ✅ Loaded {len(data)} samples")
    except Exception as e:
        print(f" ❌ Error loading file: {e}")
        return


3 participants