# Optimal summary statistics for weak-lensing cosmology under a limited simulation budget

This repository investigates how to build optimal summary statistics for weak gravitational lensing cosmology under a limited simulation budget, distilling lessons learned from participating in the FAIR Universe - Weak Lensing ML Uncertainty Challenge.
We compare different strategies for building summary statistics — analytical, neural without pre-training, and neural with pre-training on cheaper simulations — within a unified evaluation framework.
All summary strategies are evaluated through the same three-step pipeline, which ensures a fair comparison across approaches.
Step 1 — Compression to 8D. Every summary (analytical or neural) is compressed into an 8-dimensional vector. This shared dimensionality puts all approaches on equal footing for the downstream posterior estimation.
Step 2 — Neural Posterior Estimation (NPE). A Masked Autoregressive Flow (MAF) is trained on (summary, θ) pairs drawn from the holdout dataset — with noise augmentation applied to the maps before compression — to approximate the posterior p(Ω_m, S_8 | summary).
Step 3 — Figure of Merit (FoM).
Posterior samples are drawn for maps from the fiducial split of the holdout dataset (Ω_m = 0.29, S_8 = 0.81). The FoM = 1 / sqrt(det Cov(Ω_m, S_8)) measures how tightly the posterior constrains the parameters.
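The FoM definition above is a few lines of NumPy given an array of posterior samples (the sample values below are synthetic, purely for illustration):

```python
import numpy as np

def figure_of_merit(samples):
    """FoM = 1 / sqrt(det Cov(Omega_m, S_8)) from an (N, 2) array of posterior samples."""
    cov = np.cov(samples, rowvar=False)  # 2x2 covariance of (Omega_m, S_8)
    return 1.0 / np.sqrt(np.linalg.det(cov))

# Synthetic check: a tighter posterior gives a larger FoM
rng = np.random.default_rng(0)
tight = rng.normal([0.29, 0.81], [0.01, 0.01], size=(10_000, 2))
loose = rng.normal([0.29, 0.81], [0.05, 0.05], size=(10_000, 2))
print(figure_of_merit(tight) > figure_of_merit(loose))  # True
```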
Scripts:

| Script | Description |
|---|---|
| `cosmoford/models_nopatch.py` | Compressor model, trained via `cosmoford/trainer.py` |
| `scripts/run_npe_budget_scan.py` | Trains the NPE flow and computes the FoM, sweeping over simulation budgets |
| `scripts/plot_fom_budget.py` | Plots FoM vs. simulation budget from saved results |
Datasets:

| Dataset | Split | Used for |
|---|---|---|
| `CosmoStat/neurips-wl-challenge-flat` | train / validation | Compressor training and validation |
| `CosmoStat/neurips-wl-challenge-holdout` | train | NPE training (summaries precomputed with noise augmentation) |
| `CosmoStat/neurips-wl-challenge-holdout` | fiducial | FoM evaluation |
Option A — Analytical summaries. Physically motivated statistics are computed directly from the masked convergence maps, such as peak counts, the wavelet ℓ₁-norm, or the power spectrum. A small MLP is then trained to compress these hand-crafted features into an 8D vector by maximizing a Gaussian log-likelihood.

Training script: `trainer fit -c <config TBD>`
Dataset: `CosmoStat/neurips-wl-challenge-flat`
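One common way to set up such a Gaussian log-likelihood compressor (a sketch of the idea, not necessarily the repository's exact loss or layer sizes) is to map the hand-crafted features to an 8D summary with an MLP and attach a head that predicts a Gaussian over (Ω_m, S_8), maximizing the log-likelihood of the true parameters:

```python
import torch
import torch.nn as nn

class GaussianCompressor(nn.Module):
    """Compress hand-crafted features to an 8D summary, then predict a
    diagonal Gaussian over theta = (Omega_m, S_8). Sizes are illustrative."""
    def __init__(self, n_features, summary_dim=8, theta_dim=2):
        super().__init__()
        self.compressor = nn.Sequential(
            nn.Linear(n_features, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, summary_dim),
        )
        self.head = nn.Linear(summary_dim, 2 * theta_dim)  # mean and log-variance

    def forward(self, x):
        return self.compressor(x)

    def loss(self, x, theta):
        mean, log_var = self.head(self.forward(x)).chunk(2, dim=-1)
        # Negative Gaussian log-likelihood (up to an additive constant)
        return 0.5 * ((theta - mean) ** 2 / log_var.exp() + log_var).sum(-1).mean()

model = GaussianCompressor(n_features=40)
x = torch.randn(16, 40)   # e.g. binned peak counts + power spectrum
theta = torch.rand(16, 2)
summary = model(x)        # (16, 8) summary vector
nll = model.loss(x, theta)
```

At inference time only `model(x)` is used; the Gaussian head exists purely to shape the summary during training.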
Option B — Neural compression without pre-training. An EfficientNetV2-S network is trained directly on the N-body simulations, compressing each convergence map to 8 summary statistics by maximizing a Gaussian log-likelihood.

Training script: `trainer fit -c configs/experiments/efficientnet_v2_s_logp_.yaml`
Dataset: `CosmoStat/neurips-wl-challenge-flat`
Option C — Neural compression with pre-training. The same EfficientNetV2-S architecture, but first pre-trained on a larger set of cheaper simulations to reduce overfitting when the N-body budget is small, then fine-tuned on the N-body dataset. The compressor is trained with a Gaussian log-likelihood loss.

Fine-tuning script: `trainer fit -c configs/finetune_from_pretrain_nopatch_logp.yaml`
Update `pretrained_checkpoint_path` in the config to point to your pre-trained checkpoint.
Fine-tuning dataset: `CosmoStat/neurips-wl-challenge-flat`
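The warm-start mechanics behind such a checkpoint path look roughly like the following sketch; the architecture and checkpoint layout here are illustrative stand-ins, not the repository's actual loading logic.

```python
import io
import torch
import torch.nn as nn

# Pre-training phase produces a checkpoint (an in-memory buffer stands in
# for a .ckpt file on disk)
backbone = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))
buf = io.BytesIO()
torch.save({"state_dict": backbone.state_dict()}, buf)
buf.seek(0)

# Fine-tuning phase: load the pre-trained weights into a fresh model;
# strict=False tolerates heads that differ between the two phases
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))
ckpt = torch.load(buf, map_location="cpu")
missing, unexpected = model.load_state_dict(ckpt["state_dict"], strict=False)
```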
Available pre-training datasets and their configs:

| Simulation type | Local dataset | Pre-training config |
|---|---|---|
| Gaussian Random Field (GRF) | `CosmoStat/GRF_HF` | None |
| LogNormal | `CosmoStat/lognormal` | `configs/experiments/pretrain_lognormal_nopatch_logp.yaml` |
| Gower Street | `CosmoStat/gowerstreet-train` | `configs/experiments/pretrain_gowerstreet_nopatch_logp.yaml` |
| OT-emulated (from LogNormal) | `CosmoStat/ot_emulated` | `configs/pretrain_otemulated_nopatch_logp.yaml` |
| OT-emulated from TBD | output of the emulator (see below) | TBD |
```
# Example: pre-train on LogNormal, then fine-tune on challenge data
trainer fit -c configs/experiments/pretrain_lognormal_nopatch_logp.yaml
trainer fit -c configs/finetune_from_pretrain_nopatch_logp.yaml
```

To bridge the gap between cheap simulations and the N-body distribution, a UNet emulator is trained using conditional optimal-transport flow matching (COT-FM). It maps LogNormal (or Gower Street) convergence maps to the distribution of N-body maps, conditioned on cosmological parameters. The emulated maps are then used as pre-training data for Option C.
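The core flow-matching objective can be sketched as follows: regress a velocity field along the straight-line path between a cheap map and its N-body counterpart. A real COT-FM setup additionally pairs samples by minibatch optimal transport and uses a conditional UNet; the tiny convnet here is a stand-in so the sketch stays self-contained.

```python
import torch
import torch.nn as nn

class TinyVelocityNet(nn.Module):
    """Stand-in for the conditional UNet: predicts the velocity field
    given the interpolated map, the time t, and the cosmology."""
    def __init__(self):
        super().__init__()
        # channels: map x_t (1) + time t broadcast (1) + cosmology (2)
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, 3, padding=1), nn.SiLU(),
            nn.Conv2d(16, 1, 3, padding=1),
        )
    def forward(self, x_t, t, cond):
        b, _, h, w = x_t.shape
        t_map = t.view(b, 1, 1, 1).expand(b, 1, h, w)
        cond_map = cond.view(b, 2, 1, 1).expand(b, 2, h, w)
        return self.net(torch.cat([x_t, t_map, cond_map], dim=1))

model = TinyVelocityNet()
x0 = torch.randn(8, 1, 32, 32)  # cheap simulation maps (e.g. LogNormal)
x1 = torch.randn(8, 1, 32, 32)  # N-body maps at matched cosmology
cond = torch.rand(8, 2)         # (Omega_m, S_8)

t = torch.rand(8)
x_t = (1 - t.view(-1, 1, 1, 1)) * x0 + t.view(-1, 1, 1, 1) * x1  # straight path
v_target = x1 - x0                                               # path velocity
loss = ((model(x_t, t, cond) - v_target) ** 2).mean()
```

At sampling time, integrating the learned velocity field from t=0 to t=1 transports a cheap map toward the N-body distribution.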
Training script: `cosmoford/emulator/cot_fm.py`
UNet configs: `configs/unet_condition_small.yaml` / `configs/unet_condition_large.yaml`
Build HF dataset from emulated maps: `scripts/hf_emulated_dataset.py`
| Dataset | Role |
|---|---|
| `CosmoStat/GRF_HF` | Cheap simulations to be corrected (GRF) |
| `CosmoStat/lognormal` | Cheap simulations to be corrected (LogNormal) |
| PM source | To be generated |
| `CosmoStat/neurips-wl-challenge-flat` | N-body target distribution for the emulator |
```
python cosmoford/emulator/cot_fm.py \
    --config_yaml configs/unet_condition_large.yaml \
    --dataset_dir_nbody <path/to/neurips-wl-challenge-flat> \
    --dataset_dir_logn_train <path/to/GRF_HF> \
    --num_epochs 100

# Build the emulated HF dataset
python scripts/hf_emulated_dataset.py
```

Installation:

```
pip install -e .
```

Requires Python ≥ 3.8. Key dependencies: torch, lightning, diffusers, torchdyn, nflows, datasets, wandb.
By default, datasets are loaded locally from /project/rrg-lplevass/shared/wl_chall_data/ (on the Rorqual cluster). The expected directory structure is:
```
/project/rrg-lplevass/shared/wl_chall_data/
├── neurips-wl-challenge-flat/   # Main challenge dataset (train/validation splits)
├── lognormal/                   # LogNormal pre-training data
├── gowerstreet-train/           # Gower Street pre-training data
├── ot_emulated/                 # OT-emulated pre-training data
└── GRF_HF/                      # Gaussian Random Field pre-training data
```
To load from HuggingFace Hub / GCS instead (e.g. when running outside the cluster), set `use_hub: true` in your config:

```
data:
  init_args:
    use_hub: true
```

To use a different local directory, set `data_dir`:

```
data:
  init_args:
    data_dir: /path/to/your/datasets
```

All options can also be passed as CLI overrides:
```
# Default: just pick a dataset mode (loads locally from the default path)
trainer fit -c configs/experiments/efficientnet_v2_s.yaml --data.dataset_mode=lognormal

# Load from HuggingFace Hub
trainer fit -c configs/experiments/efficientnet_v2_s.yaml --data.use_hub=true

# Load from a custom local path
trainer fit -c configs/experiments/efficientnet_v2_s.yaml --data.data_dir=/scratch/datasets
```

Available `dataset_mode` values: `train`, `full`, `lognormal`, `gowerstreet`, `gowerstreet-train`, `ot_emulated`, `grf`.
Contributors:

| | | |
|---|---|---|
| @AndreasTersenov | @ASKabalan | @b-remy |
| @EiffL | @noe-dia | @JuliaLinhart |
| @Justinezgh | @LaurencePeanuts | @SammyS15 |
| @sachaguer | @rouzib | |
See LICENSE file for details.