by Eric Onyame* (University of Virginia), Akash Ghosh* (IIT-Patna), Subhadip Baidya (IIT-Patna), Sriparna Saha (IIT-Patna), Xiuying Chen (MBZUAI), Chirag Agarwal (University of Virginia)

*Equal contribution. Corresponding authors: Eric Onyame, Akash Ghosh
This repository hosts the codebase and dataset for CURE-Med, a framework for improving multilingual medical reasoning in large language models (LLMs). Below, we provide an overview of the project along with key training and implementation details.
Large language models (LLMs) perform strongly on monolingual math and commonsense reasoning, but they remain unreliable for multilingual medical reasoning—limiting safe use in real-world, multilingual healthcare settings. To address this, we introduce CUREMED-BENCH, a high-quality multilingual medical reasoning benchmark of open-ended questions with a single verifiable answer, spanning 13 languages, including under-represented languages such as Amharic, Yoruba, and Swahili. Building on this benchmark, we propose CURE-MED, a curriculum-informed reinforcement learning framework that combines code-switching-aware supervised fine-tuning with Group Relative Policy Optimization to improve both logical correctness and language stability. Across 13 languages, CURE-MED consistently outperforms strong baselines and scales effectively, reaching 85.21% language consistency and 54.35% logical correctness at 7B parameters, and 94.96% language consistency and 70.04% logical correctness at 32B parameters. Overall, our results move toward more reliable and equitable multilingual medical reasoning with LLMs.
Figure 1. CURE-MED pipeline: (A) clinically validated multilingual data curation (e.g., MedlinePlus), (B) code-switching-aware supervised fine-tuning of a Qwen2.5-Instruct backbone, and (C) GRPO-guided curriculum RL from high- to mid- to low-resource languages to improve logical correctness and language consistency.
High-resolution PDF: Figure 1
For full technical details and experiments, see the paper on arXiv and the project website.
- CUREMED-BENCH: Provided in `data.zip`, which contains open-ended medical reasoning questions with a single verifiable answer across 13 languages. Unzip `data.zip` before running training or evaluation.
- Hugging Face: CUREMED-BENCH is also available at https://huggingface.co/datasets/Aikyam-Lab/CUREMED-BENCH
- `baseline_inference/` — Baseline inference scripts for evaluation.
- `SFT/` — Code-switching-aware supervised fine-tuning (SFT) training pipeline.
- `SFT_Inference/` — Inference and evaluation for SFT checkpoints.
- `Curriculum_RFT/` — Curriculum-informed reinforcement learning / RFT training (GRPO-guided).
- `RFT_Inference/` — Inference and evaluation for RFT checkpoints.
- `figures/` — Figures used in the README and paper.
- `README.md` — Project documentation.
- `data.zip` — Packaged dataset release for local use.
- `datasets/` — Code-switched dataset release for local use.
SFT code is in `SFT/`:
- `code_switch_sft.py` — code-switching-aware SFT training script
- `deepspeed_zero3.yaml` — DeepSpeed ZeRO-3 config
- `code_switch_batch_script.sh` — Slurm batch script for launching SFT
We ran SFT with Python 3.11.13. Ensure your environment includes:
`torch`, `transformers`, `datasets`, `trl`, `accelerate`, `deepspeed`.
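A quick way to sanity-check the environment before launching jobs (a minimal sketch; it only verifies that the packages listed above are installed):

```python
# Verify that the libraries needed for SFT are importable and report versions.
import importlib.metadata as md

for pkg in ["torch", "transformers", "datasets", "trl", "accelerate", "deepspeed"]:
    try:
        print(f"{pkg}=={md.version(pkg)}")
    except md.PackageNotFoundError:
        print(f"{pkg} is missing -- install it before launching SFT")
```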
Provide SFT training files as JSONL under a directory like `/path/to/SFT_data/*.jsonl`. Each example must contain: `question`, `reasoning`, `answer`, `language`. Set the dataset path in the batch script via `--data_dir="/path/to/SFT_data"`.
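For reference, here is a minimal sketch of one example record and a loop that validates a data directory against the required fields. The record content is illustrative only (including the format of the `language` value), not taken from the released dataset:

```python
import glob
import json

REQUIRED_KEYS = {"question", "reasoning", "answer", "language"}

# Illustrative record -- not from the released dataset.
example = {
    "question": "Which vitamin deficiency causes scurvy?",
    "reasoning": "Scurvy results from impaired collagen synthesis caused by a lack of ascorbic acid.",
    "answer": "Vitamin C",
    "language": "English",
}
assert REQUIRED_KEYS <= example.keys()

# Check every JSONL file under the SFT data directory.
for path in glob.glob("/path/to/SFT_data/*.jsonl"):
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, 1):
            missing = REQUIRED_KEYS - json.loads(line).keys()
            if missing:
                raise ValueError(f"{path}:{i} is missing fields: {missing}")
```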
We used Qwen2.5-Instruct variants: 1.5B, 3B, 7B, 14B, 32B.
- 1.5B / 3B: recommended 4× A100
- 7B / 14B / 32B: recommended ≥ 8× A100
- Optimizer: AdamW (β1=0.9, β2=0.999)
- LR: 1e-5 (cosine, warmup ratio 0.1), epochs: 3
- Effective batch size: 32, max seq length: 4096
- Precision: bf16, DeepSpeed ZeRO-3 + gradient checkpointing
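For orientation, the settings above map onto Hugging Face `TrainingArguments` roughly as follows. This is a sketch, not the repo's actual configuration (which lives in `SFT/code_switch_sft.py`); in particular, the per-device batch size / gradient-accumulation split is one illustrative way to reach an effective batch size of 32 on 8 GPUs:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="sft_checkpoints",   # illustrative path
    num_train_epochs=3,
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    adam_beta1=0.9,
    adam_beta2=0.999,
    per_device_train_batch_size=1,  # 1 per device x 8 GPUs x 4 accumulation steps = 32
    gradient_accumulation_steps=4,
    bf16=True,
    gradient_checkpointing=True,
)
# ZeRO-3 is enabled at launch time through the Accelerate/DeepSpeed config
# (SFT/deepspeed_zero3.yaml); the 4096-token max sequence length is applied
# when examples are tokenized in the training script.
```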
Edit `SFT/code_switch_batch_script.sh`:
- set `base_model="Qwen/Qwen2.5-*-Instruct"`
- set `--data_dir="/path/to/SFT_data"`
- request GPUs via `#SBATCH --gres=gpu:a100:<N>`
- match Accelerate processes to GPU count (e.g., `--num_processes <N>`)

Submit: `sbatch SFT/code_switch_batch_script.sh`
RFT is implemented in `Curriculum_RFT/` as a 3-stage GRPO curriculum over staged datasets:
- Datasets: `Curriculum_RFT/staged/{high,medium,low}/` (each contains `*.jsonl`)
- Training code: `Curriculum_RFT/Training_Stages/{Stage_one_training,Stage_two_training,Stage_three_training}/`
We ran RFT with Python 3.11.13. All stages use full fine-tuning.
Each JSONL example must include: `question`, `reasoning`, `answer`, `language` (the same schema as the SFT data).
Run stages in this order (each stage initializes from the previous checkpoint):
- Stage 1 (High-resource): start from the SFT checkpoint + `staged/high/`
- Stage 2 (Medium-resource): start from the Stage 1 checkpoint + `staged/medium/`
- Stage 3 (Low-resource): start from the Stage 2 checkpoint + `staged/low/`
Each stage provides a Slurm launcher script inside its stage folder. Update the script (or args) to point to the correct starting checkpoint and dataset path.
Note: Ensure `accelerate --num_processes` matches the number of GPUs requested in Slurm (e.g., 4 GPUs → `--num_processes 4`). Output checkpoints are saved to the `--output_dir` specified in each stage script.
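To make the GRPO objective concrete, below is a minimal sketch of a reward that scores the two signals the curriculum targets: logical correctness and language consistency. This is an illustration only; the actual reward in `Curriculum_RFT/` may differ, and the `langdetect` dependency, the exact-match criterion, and the equal 0.5/0.5 weighting are all assumptions:

```python
from langdetect import detect  # pip install langdetect (an illustrative choice of language ID)

def reward(completion: str, gold_answer: str, target_lang: str) -> float:
    """Binary correctness + binary language-consistency reward for one completion.

    GRPO then normalizes rewards within each group of sampled completions
    to compute relative advantages.
    """
    correct = float(gold_answer.strip().lower() in completion.strip().lower())
    try:
        consistent = float(detect(completion) == target_lang)  # ISO 639-1 code, e.g. "sw"
    except Exception:  # language detection can fail on very short or empty text
        consistent = 0.0
    return 0.5 * correct + 0.5 * consistent

# A Swahili completion containing the gold answer should score 1.0,
# assuming the detector identifies the text as "sw".
print(reward("Jibu ni upungufu wa vitamini C.", "upungufu wa vitamini C", "sw"))
```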
Inference: The `SFT_Inference/` and `RFT_Inference/` folders contain scripts for running inference with the trained SFT and RFT checkpoints. Please update the relevant model/checkpoint and data paths in the scripts before running.
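For a quick local test outside those scripts, a checkpoint can also be loaded with standard `transformers` APIs. A minimal sketch (the checkpoint path and the Swahili prompt are placeholders; prefer the released inference scripts for evaluation):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "/path/to/checkpoint"  # your trained SFT or RFT checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, torch_dtype="bfloat16", device_map="auto"
)

# Illustrative Swahili prompt: "Which vitamin deficiency causes scurvy?"
messages = [{"role": "user", "content": "Ni upungufu gani wa vitamini husababisha kiseyeye?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=1024, do_sample=False)
print(tokenizer.decode(output[0, inputs.shape[-1]:], skip_special_tokens=True))
```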
Below is the BibTeX entry for the paper:
@misc{onyame2026curemedcurriculuminformedreinforcementlearning,
title={CURE-Med: Curriculum-Informed Reinforcement Learning for Multilingual Medical Reasoning},
author={Eric Onyame and Akash Ghosh and Subhadip Baidya and Sriparna Saha and Xiuying Chen and Chirag Agarwal},
year={2026},
eprint={2601.13262},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2601.13262},
}