
🚀 Added a Translation Pipeline #43

Open

leagrieder wants to merge 14 commits into EPFLiGHT:master from leagrieder:addTranslationModel

Conversation

@leagrieder leagrieder commented Jan 28, 2026

This PR introduces a translation interface for NLLB-200 with fastText language detection.

✨ Key Contributions

  • Translator (translator.py) for multimeditron inference

    • Automatic language detection with fastText (80% confidence threshold)
    • Smart routing to prevent mistranslation of ambiguous inputs
    • Bidirectional medical translation (user language ↔ English)
    • Compatible with base and fine-tuned NLLB-200 models
  • Consensus-based data generation

    • Synthetic parallel medical corpora built from multi-model translation agreement
    • Scalable approach for low-resource languages
  • Fine-tuning & evaluation framework

    • Scripts for NLLB-200 medical fine-tuning
    • Comprehensive experiments on translation quality and downstream medical QA

Contributor

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 39 out of 64 changed files in this pull request and generated 27 comments.



Comment on lines +218 to +220
# CRITICAL FIX: Clip predictions to valid token ID range
vocab_size = len(tokenizer)
preds = np.clip(preds, 0, vocab_size - 1)
Copilot AI Feb 4, 2026

The training script clips token IDs to the vocabulary size to prevent out-of-range errors during evaluation. However, clipping predictions could produce invalid tokens. A better approach would be to investigate why predictions are out of range in the first place, as this indicates a potential issue with model generation or tokenization configuration.

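If out-of-range ids cannot be eliminated at the source, a safer interim fix than `np.clip` is to map invalid ids (including the `-100` ignore index that label padding can leak into predictions) to the pad token before decoding, so garbage never aliases a real high-id token. A minimal sketch; the function name is mine, not from the PR:

```python
def sanitize_predictions(pred_ids, pad_token_id, vocab_size):
    """Map the -100 ignore index and any out-of-range ids to the pad token.

    Unlike np.clip, this does not silently remap invalid ids onto the
    highest real vocabulary entry; anything invalid becomes padding.
    """
    return [
        tid if 0 <= tid < vocab_size else pad_token_id
        for tid in pred_ids
    ]
```

This would typically run in `compute_metrics` just before `tokenizer.batch_decode`.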
"""NLLB-200 translator with fastText language detection."""

def __init__(self,
model_name="src/multimeditron/translation/models/nllb-consensus-finetuned-1epoch", #Fine tuned model - to use the base NLLB-200 3.3B model, add HF path here (nllb-200-3.3B)
Copilot AI Feb 4, 2026

The default model path uses a relative path that may not work correctly depending on where the code is executed from. Consider using an absolute path constructed from `__file__`, or making this a required parameter without a default value. The comment also suggests this should point to a HuggingFace model ID for the base model, but the current default is a local path.

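One way to make the default robust to the working directory is to anchor it at the module file. A sketch under assumptions about the package layout (a `models/` directory next to `translator.py`), with the HF model id still usable as an explicit override:

```python
from pathlib import Path

# Resolve the bundled fine-tuned model relative to this file, not the CWD.
# The "models/nllb-consensus-finetuned-1epoch" layout is an assumption here.
DEFAULT_MODEL_DIR = (
    Path(__file__).resolve().parent / "models" / "nllb-consensus-finetuned-1epoch"
)

def resolve_model_name(model_name=None):
    """Return an explicit model id/path, falling back to the packaged default."""
    return str(model_name or DEFAULT_MODEL_DIR)
```

Callers can then pass `"facebook/nllb-200-3.3B"` (or any HF id) to use the base model.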
Comment on lines +32 to +56
        print(f"[INFO] Loading NLLB model: {model_name}")
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model.to(self.device)

        print(f"[INFO] Loading fastText language detection model")
        try:
            import fasttext
            model_path = hf_hub_download(
                repo_id="facebook/fasttext-language-identification",
                filename="model.bin"
            )
            fasttext.FastText.eprint = lambda x: None
            self.lang_detector = fasttext.load_model(model_path)
            print(f"[INFO] fastText model loaded successfully")
        except Exception as e:
            print(f"[ERROR] Failed to load fastText: {e}")
            print("[INFO] Ensure: pip install 'numpy<2.0' fasttext")
            raise

        self.detected_user_lang = None
        print(f"[INFO] NLLB translator ready on {self.device}")
Copilot AI Feb 4, 2026

The print statements are suitable for debugging but should be replaced with proper logging (using the logging module) for production code. This allows users to control log levels and outputs more flexibly.

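The migration from prints is mechanical: a module-level logger, with `%`-style lazy formatting and `logger.exception` in the failure path. A minimal sketch (the logger name and function are illustrative, not from the PR):

```python
import logging

logger = logging.getLogger("multimeditron.translation")

def load_translator_logged(model_name: str, device: str) -> None:
    """Sketch: the [INFO]/[ERROR] prints routed through the logging module,
    so callers control verbosity and output destinations."""
    logger.info("Loading NLLB model: %s", model_name)
    try:
        # ... AutoTokenizer / AutoModelForSeq2SeqLM / fastText loading here ...
        logger.info("NLLB translator ready on %s", device)
    except Exception:
        # logger.exception records the traceback alongside the message.
        logger.exception("Failed to load fastText")
        raise
```

Scripts that want the old behavior just call `logging.basicConfig(level=logging.INFO)` once at startup.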
Comment on lines +82 to +86
        try:
            token_id = self.tokenizer.convert_tokens_to_ids(detected_code)
            if token_id == self.tokenizer.unk_token_id:
                print(f"[WARNING] '{detected_code}' not supported. Defaulting to eng_Latn.")
                return 'eng_Latn'
Copilot AI Feb 4, 2026

When the detected language code is not supported by the tokenizer, the method returns 'eng_Latn' as a fallback. However, this could lead to incorrect behavior, as the actual text might not be in English. Consider raising an exception or logging a warning to make this fallback behavior more explicit to callers.

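A possible shape for making the fallback explicit: a `strict` flag that raises for callers who must not silently mistranslate, plus a logged warning otherwise. This is a sketch against a duck-typed tokenizer (only `convert_tokens_to_ids` / `unk_token_id` are assumed); the function name is mine:

```python
import logging

logger = logging.getLogger(__name__)

def resolve_lang_code(tokenizer, detected_code, fallback="eng_Latn", strict=False):
    """Return a tokenizer-supported language code.

    strict=True surfaces unsupported codes to the caller instead of
    silently substituting the fallback.
    """
    if tokenizer.convert_tokens_to_ids(detected_code) == tokenizer.unk_token_id:
        if strict:
            raise ValueError(
                f"Language code {detected_code!r} is not supported by this tokenizer"
            )
        logger.warning("'%s' not supported; falling back to %s", detected_code, fallback)
        return fallback
    return detected_code
```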
Comment on lines +155 to +169
    def translate_from_english(self, text: str, tgt_lang: str = None) -> str:
        """
        Translate from English back to original language.
        If original was low confidence (eng_Latn), passes through unchanged.
        """
        if tgt_lang is None:
            if self.detected_user_lang is None:
                print("[WARNING] No detected language stored. Returning as-is.")
                return text
            tgt_lang = self.detected_user_lang

        if tgt_lang == 'eng_Latn':
            return text

        return self.translate(text, 'eng_Latn', tgt_lang)
Copilot AI Feb 4, 2026

The translate_from_english method relies on the detected_user_lang instance variable set by translate_to_english. This creates a stateful dependency between method calls that could lead to bugs if the methods are called out of order or in a multi-threaded context. Consider making this stateless by requiring the target language as a parameter or documenting this requirement clearly.

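A stateless alternative keeps the detected language in the caller's hands for the whole round trip, so concurrent requests cannot clobber each other's state. A sketch assuming the translator exposes `detect_language(text)` and `translate(text, src_lang, tgt_lang)` (the `.upper()` call stands in for the English-language model):

```python
def translate_round_trip(translator, text):
    """Detect once, pass the language explicitly to both directions.

    No instance state is written, so the function is safe to call from
    multiple threads sharing one translator (tokenizer locking aside).
    """
    user_lang = translator.detect_language(text)
    if user_lang == "eng_Latn":
        english = text
    else:
        english = translator.translate(text, user_lang, "eng_Latn")

    answer = english.upper()  # placeholder for the downstream model's output

    if user_lang == "eng_Latn":
        return answer
    return translator.translate(answer, "eng_Latn", user_lang)
```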
            b = bleu.sentence_score(candidate, refs).score
            c = chrf.sentence_score(candidate, refs).score
            scores[model] = 0.5 * b + 0.5 * c
        except:
Copilot AI Feb 4, 2026

Except block directly handles BaseException.

Comment on lines +523 to +524
        except:
            pass
Copilot AI Feb 4, 2026

'except' clause does nothing but pass and there is no explanatory comment.

Suggested change
-        except:
-            pass
+        except Exception as exc:
+            # Gradient checkpointing is an optional optimization; continue without it if enabling fails.
+            print(f" ⚠️ Could not enable gradient checkpointing: {exc}")

Comment on lines +654 to +655
        except:
            pass
Copilot AI Feb 4, 2026

'except' clause does nothing but pass and there is no explanatory comment.

Suggested change
-        except:
-            pass
+        except Exception as save_err:
+            log(f"⚠️ Failed to save emergency checkpoint: {save_err}")

    lang_data = load_jsonl(filepath)

    if SAMPLES_PER_LANGUAGE:
        lang_data = lang_data[:SAMPLES_PER_LANGUAGE]
Copilot AI Feb 4, 2026

This statement is unreachable.


        if lang not in writers:
            out_path = out_dir / f"wikipedia_{lang}_pretraining.jsonl"
            writers[lang] = open(out_path, "w", encoding=ENCODING)
Copilot AI Feb 4, 2026

File is opened but is not closed.

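`contextlib.ExitStack` fits the one-writer-per-language pattern well: handles are opened lazily as new languages appear, and all of them are guaranteed to close, even if an exception interrupts the loop. A sketch (the `"lang"` field name and output naming follow the snippet above, but the function itself is illustrative):

```python
import json
from contextlib import ExitStack
from pathlib import Path

def split_by_language(records, out_dir):
    """Write each record to wikipedia_<lang>_pretraining.jsonl, closing
    every output file via ExitStack on both success and error paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    writers = {}
    with ExitStack() as stack:
        for rec in records:
            lang = rec["lang"]
            if lang not in writers:
                out_path = out_dir / f"wikipedia_{lang}_pretraining.jsonl"
                writers[lang] = stack.enter_context(
                    open(out_path, "w", encoding="utf-8")
                )
            writers[lang].write(json.dumps(rec, ensure_ascii=False) + "\n")
    # All files are closed here, however the with-block was exited.
```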
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 39 out of 64 changed files in this pull request and generated 10 comments.



Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 39 out of 64 changed files in this pull request and generated 16 comments.



Comment on lines +29 to +35
import fasttext
from huggingface_hub import hf_hub_download
from openai import OpenAI

project_root = Path(__file__).parent.parent.parent.parent
sys.path.insert(0, str(project_root))

Copilot AI Mar 12, 2026

This experiment script imports OpenAI from the openai package, but openai is not listed in pyproject.toml dependencies/optionals. As-is, the script will fail in a clean environment. Consider adding openai as an optional dependency for experiments, or guarding the import with a clear error message that instructs how to install it.

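The guarded-import option can be generic so every experiment script reuses it. A sketch using only the standard library; the hint text and the idea of an `openai` extra are assumptions, not something the repo defines:

```python
import importlib

def require_optional(module_name, hint):
    """Import an optional dependency, failing with an actionable message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is required for this experiment script. {hint}"
        ) from exc

# Usage sketch at the top of the experiment script:
# openai = require_optional("openai", "Install it with: pip install openai")
```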
Comment on lines +140 to +148
        self.tokenizer.src_lang = src_lang

        inputs = self.tokenizer(
            text,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=512
        ).to(self.device)
Copilot AI Mar 12, 2026

self.tokenizer.src_lang = src_lang mutates shared tokenizer state. If a single NLLBTranslator instance is used concurrently (e.g., in a web server), concurrent calls can race and produce incorrect translations. Consider guarding translation calls with a lock, using separate tokenizer instances per thread/request, or using tokenizer methods that don’t rely on mutable global src_lang state.

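The simplest of the suggested fixes is a lock that makes the `src_lang` write and the dependent encode atomic. A sketch against a duck-typed tokenizer (only a mutable `.src_lang` and a `__call__` are assumed, as with the NLLB tokenizer):

```python
import threading

class LockedTokenizerSketch:
    """Serialize tokenizer use so the src_lang mutation and the encode
    that depends on it happen as one atomic step per request."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self._lock = threading.Lock()

    def encode(self, text, src_lang):
        with self._lock:
            # No other thread can change src_lang between these two lines.
            self.tokenizer.src_lang = src_lang
            return self.tokenizer(text)
```

Per-thread tokenizer instances avoid the contention entirely, at the cost of memory; the lock is the smaller change.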
Comment on lines +22 to +26
PROJECT_ROOT = Path(__file__).parent.parent.parent.parent
sys.path.insert(0, str(PROJECT_ROOT))

from multimeditron.translation.translator import NLLBTranslator

Copilot AI Mar 12, 2026

PROJECT_ROOT is set to Path(__file__).parent.parent.parent.parent, which resolves to .../src/multimeditron for this script. Adding that to sys.path does not make import multimeditron... work when running the script directly because the import root should be .../src. Either remove this block and require pip install -e ., or change it to insert the repository's src/ directory (e.g., walk parents until you find the src folder).

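The walk-the-parents variant suggested here is a few lines. A sketch assuming the repository keeps the package under a directory literally named `src/` (the helper name is mine):

```python
import sys
from pathlib import Path

def find_import_root(start, marker="src"):
    """Walk upward from `start` until a directory named `marker` is found,
    so `import multimeditron...` resolves when running from a checkout."""
    for parent in Path(start).resolve().parents:
        if parent.name == marker:
            return parent
    raise RuntimeError(f"No {marker}/ directory above {start}")

# Usage sketch at the top of a script:
# sys.path.insert(0, str(find_import_root(__file__)))
```

`pip install -e .` remains the cleaner option; this only covers scripts run straight from a clone.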
Comment on lines +105 to +122
    def translate_sample(self, sample: dict) -> list:
        source_lang = sample.get('language')
        if source_lang not in LANG_TO_NLLB:
            return []

        src_nllb = LANG_TO_NLLB[source_lang]
        self.stats['by_source_lang'][source_lang] = (
            self.stats['by_source_lang'].get(source_lang, 0) + 1
        )

        translations = []
        for lang_code, (nllb_code, _) in AFRICAN_LANGUAGES.items():
            question = self.translate_text(sample['question'], src_nllb, nllb_code)
            options = [
                self.translate_text(opt, src_nllb, nllb_code)
                for opt in sample['options']
            ]

Copilot AI Mar 12, 2026

translate_sample() indexes sample['question'], sample['options'], and sample['answer'] directly but load_medibench() does not enforce these keys exist (it can also set options to an empty list). This can raise KeyError or emit translations with empty options. Consider using .get(...) with validation (similar to the base NLLB script) and skipping malformed/empty MCQs before translating.

Comment on lines +157 to +160
            if (i + 1) % 50 == 0:
                torch.cuda.empty_cache()
                gc.collect()

Copilot AI Mar 12, 2026

torch.cuda.empty_cache() is called unconditionally. On environments where PyTorch is built without CUDA (or CUDA is unavailable), this can raise an exception. Consider guarding with if torch.cuda.is_available(): ... (and similarly for any other CUDA-specific calls).

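The guard can live in one helper shared by the scripts. Taking the torch module as a parameter keeps the sketch testable without a GPU (or without torch at all); in the scripts you would just call `maybe_free_accelerator_memory(torch)`:

```python
def maybe_free_accelerator_memory(torch_module):
    """Clear the CUDA cache only when CUDA support exists and is available.

    Returns True if the cache was cleared, False on CPU-only builds.
    """
    cuda = getattr(torch_module, "cuda", None)
    if cuda is not None and cuda.is_available():
        cuda.empty_cache()
        return True
    return False
```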
Comment on lines +39 to +45
PROJECT_ROOT = Path(__file__).parent.parent.parent.parent
sys.path.insert(0, str(PROJECT_ROOT))

from multimeditron.model.model import MultiModalModelForCausalLM, ChatTemplate
from multimeditron.model.data_loader import DataCollatorForMultimodal
from multimeditron.translation.translator import NLLBTranslator

Copilot AI Mar 12, 2026

PROJECT_ROOT = Path(__file__).parent.parent.parent.parent here resolves to .../src/multimeditron, which doesn’t help from multimeditron... imports when running the script directly. If you want these experiments to run from a fresh checkout, insert the repo’s src/ directory (or remove this and rely on editable install).

Comment on lines +33 to +38
project_root = Path(__file__).parent.parent.parent.parent
sys.path.insert(0, str(project_root))

from multimeditron.model.model import MultiModalModelForCausalLM, ChatTemplate
from multimeditron.model.data_loader import DataCollatorForMultimodal

Copilot AI Mar 12, 2026

project_root = Path(__file__).parent.parent.parent.parent points to .../src/multimeditron here, which won’t make from multimeditron... imports work when running the file directly. If this is meant to be runnable from a repo checkout, insert the repository’s src/ directory instead (or remove this and rely on editable install).

Comment on lines +22 to +25
project_root = Path(__file__).parent.parent.parent.parent
sys.path.insert(0, str(project_root))

from multimeditron.translation.translator import NLLBTranslator
Copilot AI Mar 12, 2026

project_root = Path(__file__).parent.parent.parent.parent points to .../src/multimeditron, which won’t help import multimeditron... when running this script directly; the import root should be .../src. Either remove the sys.path tweak and rely on installation, or adjust it to insert the repo’s src/ directory.

Comment on lines +137 to +140
            if (i + 1) % 50 == 0:
                torch.cuda.empty_cache()
                gc.collect()

Copilot AI Mar 12, 2026

torch.cuda.empty_cache() is invoked without checking torch.cuda.is_available(). If this script is run in a CPU-only PyTorch build, it can crash. Guard CUDA-specific cache clearing behind an availability check.

Comment on lines +129 to +143
        # BOTH directions
        sources.append(eng_text)
        targets.append(target_text)
        tgt_langs.append(target_lang)

        sources.append(target_text)
        targets.append(eng_text)
        tgt_langs.append('eng_Latn')

    model_inputs = tokenizer(
        sources,
        max_length=MAX_LENGTH,
        truncation=True,
        padding="max_length"
    )
Copilot AI Mar 12, 2026

In preprocess_function, sources contains both English and non-English text (for the X→EN direction), but tokenizer(sources, ...) runs with tokenizer.src_lang still set to eng_Latn. That means the non-English source examples are tokenized with the wrong language code, which will corrupt training. Consider tracking src_langs alongside sources and tokenizing per-language (or per-example) with the correct src_lang, and use the tokenizer’s target-encoding API (text_target/tgt_lang) for labels instead of mutating src_lang inside the labels loop.

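The per-language tokenization the review asks for reduces to: group example indices by source language, tokenize each group under its own `src_lang`, then restore the original order. A language-agnostic sketch of that bookkeeping (the real preprocessing would call the NLLB tokenizer and handle labels via its target-encoding API; here a duck-typed tokenizer stands in):

```python
from collections import defaultdict

def group_by_src_lang(sources, src_langs):
    """Map each language to its (original_index, text) pairs."""
    groups = defaultdict(list)
    for idx, (text, lang) in enumerate(zip(sources, src_langs)):
        groups[lang].append((idx, text))
    return dict(groups)

def tokenize_grouped(tokenizer, sources, src_langs):
    """Tokenize each language group with the correct src_lang, then
    reassemble the encodings in the original example order."""
    encoded = [None] * len(sources)
    for lang, items in group_by_src_lang(sources, src_langs).items():
        tokenizer.src_lang = lang  # set once per group, not per mixed batch
        for idx, text in items:
            encoded[idx] = tokenizer(text)
    return encoded
```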
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 39 out of 64 changed files in this pull request and generated 6 comments.



Comment on lines +445 to +451
print(f"\n📊 Final Statistics:")
print(f" Total steps: {metadata['total_steps']}")
print(f" Epochs: {metadata['epochs_completed']:.2f}")
print(f" Final loss: {metadata['final_train_loss']:.4f}")
print(f" Best BLEU: {metadata['best_eval_bleu']:.2f}")
print(f" Best chrF: {metadata['best_eval_chrf']:.2f}")
print(f" Time: {metadata['training_hours']:.2f} hours")
Comment on lines +51 to +56
if __name__ == "__main__":
src_train = "../../../nemo/datasets/polyglot/fineweb2_am/train.jsonl"
src_test = "../../../nemo/datasets/polyglot/fineweb2_am/test.jsonl"

dest_dir = "src/multimeditron/translation/datasets/formatted_datasets/general_datasets/fineweb/fineweb_am"
os.makedirs(dest_dir, exist_ok=True)
Comment on lines +23 to +27
IN_PATH = Path("../../../nemo/datasets/polyglot/clean_wikipedia/train.jsonl")
OUT_DIR = Path("src/multimeditron/translation/datasets/formatted_datasets/general_datasets/wikipedia")

OUT_DIR.mkdir(parents=True, exist_ok=True)
ENCODING = "utf-8"
Comment on lines +92 to +116
    def detect_language(self, text: str, confidence_threshold=0.80) -> str:
        """
        Detect language using fastText. Returns 'eng_Latn' if confidence < threshold
        to trigger pass-through behavior (no translation).
        """
        try:
            clean_text = text.replace('\n', ' ').strip()
            predictions = self.lang_detector.predict(clean_text, k=3)

            detected_code = predictions[0][0].replace('__label__', '')
            confidence = float(predictions[1][0])

            LOGGER.debug("Detected language %s (confidence %.3f)", detected_code, confidence)

            if confidence < confidence_threshold:
                LOGGER.warning(
                    "Low confidence language detection (%.3f < %.3f). Falling back to eng_Latn.",
                    confidence,
                    confidence_threshold,
                )
                for i in range(min(3, len(predictions[0]))):
                    alt_code = predictions[0][i].replace('__label__', '')
                    alt_conf = float(predictions[1][i])
                    LOGGER.warning("Alternative prediction %d: %s (%.3f)", i + 1, alt_code, alt_conf)
                return 'eng_Latn'
Comment on lines +173 to +181
    # Load dataset
    print(f"\n[1/4] Loading dataset from {input_file}...")
    try:
        with open(input_file, 'r', encoding='utf-8') as f:
            data = json.load(f)
        print(f" ✅ Loaded {len(data)} samples")
    except Exception as e:
        print(f" ❌ Error loading file: {e}")
        return


3 participants