Official implementation of SEED: Towards More Accurate Semantic Evaluation for Visual Brain Decoding (ICLR 2026).
SEED is a semantic evaluation metric for visual brain decoding that compares reconstructed images against ground-truth images using complementary semantic signals. It combines object-level agreement, image-level feature similarity, and caption-level semantic similarity into one score.
The final SEED score is:
SEED = (Object F1 + Cap-Sim + EffNet) / 3
Where:
- Object F1 measures object-category overlap between detection results.
- Cap-Sim is the cosine similarity between generated caption embeddings.
- EffNet is image-level feature similarity.
Implementation note for this repo: seed/metrics.py computes EffNet as correlation distance and converts it back to similarity during final aggregation (1 - effnet_distance), which matches the paper-level formulation above.
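As a sketch of that aggregation (the function name and toy values here are illustrative, not part of the repo API), the conversion from EffNet distance back to similarity happens before averaging:

```python
def seed_score(obj_f1, cap_sim, effnet_distance):
    # Aggregate per-image components into the final SEED score.
    # EffNet values arrive as correlation distances, so each is
    # converted back to a similarity (1 - d) before averaging.
    per_image = [
        (f1 + cs + (1.0 - d)) / 3.0
        for f1, cs, d in zip(obj_f1, cap_sim, effnet_distance)
    ]
    return sum(per_image) / len(per_image)

# Toy values for three image pairs (not real metric outputs).
score = seed_score([0.5, 0.8, 0.6], [0.7, 0.9, 0.65], [0.4, 0.2, 0.35])
```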
# Create and activate environment
conda create --name seed python=3.8 -y
conda activate seed
# Install PyTorch (choose the command matching your CUDA setup)
conda install pytorch torchvision -c pytorch
# Install MMDetection dependencies
pip install openmim
mim install mmengine
mim install mmcv
# Install SEED package in editable mode
pip install -e .
If you prefer, you can run the provided setup script instead (you may need to adjust the PyTorch installation line):
bash installation.sh
Download the Grounding DINO checkpoint (required for detection). This links to the MM-Grounding-DINO-L model. For more information, see https://github.com/open-mmlab/mmdetection/blob/main/configs/mm_grounding_dino/README.md
wget https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-l_pretrain_all/grounding_dino_swin-l_pretrain_all-56d69e78.pth
Before running, update the image directory variables in seed_evaluation.sh to match your data paths: RECON_IMAGE_PATH and GT_IMAGE_PATH. Then run:
bash seed_evaluation.sh
This runs object detection on the reconstruction and GT images, then computes Object F1, Cap-Sim, EffNet, and the final SEED score.
Prepare paired reconstruction and GT image folders with matching filenames:
human_eval_data/
images_test/
gt/
a.png
b.png
...
recon/
a.png
b.png
...
Requirements:
- gt/ and recon/ must contain image files with the same names for corresponding GT and reconstruction pairs.
- Images must be readable by PIL (e.g., .png, .jpg).
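A quick way to verify the pairing requirement before running evaluation (the helper name is hypothetical, not part of the repo):

```python
from pathlib import Path

def unmatched_pairs(gt_names, recon_names):
    # Return filenames present in one folder but not the other.
    gt, recon = set(gt_names), set(recon_names)
    return sorted(gt - recon), sorted(recon - gt)

# With real folders, the name lists could be collected like this
# (the path below is illustrative):
# gt_names = [p.name for p in Path("human_eval_data/images_test/gt").glob("*")]
missing_recon, missing_gt = unmatched_pairs(["a.png", "b.png"], ["a.png"])
```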
Running detection + evaluation creates outputs under:
evaluations/<model_name>/
recon_detection_results/
preds/
*.json
vis/
*.png
gt_detection_results/
preds/
*.json
vis/
*.png
intermediate_results/
obj_f1.npy
effnet.npy
cap_sim.npy
recon_captions.npy
gt_captions.npy
Notes:
- preds/*.json stores per-image detection outputs used by Object F1.
- vis/*.png stores visualized detections.
- intermediate_results/*.npy stores per-image metric values and generated captions.
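Given the intermediate arrays saved above, the final score can be recomputed offline. This is a sketch (the function name is hypothetical; it assumes effnet.npy stores correlation distances, per the implementation note earlier):

```python
import numpy as np
from pathlib import Path

def seed_from_intermediate(intermediate_dir):
    # Recompute the final SEED score from the saved per-image arrays.
    # File names follow the output layout above; effnet.npy is assumed
    # to hold correlation distances, hence the 1 - d conversion.
    d = Path(intermediate_dir)
    obj_f1 = np.load(d / "obj_f1.npy")
    cap_sim = np.load(d / "cap_sim.npy")
    effnet_sim = 1.0 - np.load(d / "effnet.npy")
    return float(np.mean((obj_f1 + cap_sim + effnet_sim) / 3.0))
```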
Object F1 measures object-category agreement between reconstruction and GT images.
- Objects are detected for both image sets using image_detection.py (Grounding DINO config + weights).
- Per-image categories are collected from preds/*.json with score-threshold sweeps in seed/metrics.py.
- Precision and recall are computed from category overlap and converted to per-image F1.
Higher is better.
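The per-image F1 step can be sketched as follows. This is a simplified illustration of the described overlap computation; the actual code in seed/metrics.py also sweeps detection-score thresholds, which is omitted here:

```python
def object_f1(recon_categories, gt_categories):
    # Per-image F1 over detected object categories (simplified sketch;
    # threshold sweeping from seed/metrics.py is not shown).
    recon, gt = set(recon_categories), set(gt_categories)
    overlap = len(recon & gt)
    if overlap == 0:
        # Also covers empty category sets; handling of no-detection
        # cases is a simplification in this sketch.
        return 0.0
    precision = overlap / len(recon)
    recall = overlap / len(gt)
    return 2 * precision * recall / (precision + recall)
```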
Cap-Sim measures semantic similarity between generated captions for reconstruction and GT images.
- Captions are generated with GIT.
- Captions are embedded with Sentence Transformer.
- Cosine similarity is computed between paired caption embeddings.
Higher is better.
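The similarity step reduces to a cosine between paired embedding vectors. In the pipeline those vectors come from a Sentence Transformer applied to GIT-generated captions; in this sketch plain arrays stand in for them:

```python
import numpy as np

def caption_similarity(recon_emb, gt_emb):
    # Cosine similarity between one pair of caption embeddings.
    # In the pipeline, embeddings come from a Sentence Transformer;
    # here they are plain arrays for illustration.
    u = np.asarray(recon_emb, dtype=float)
    v = np.asarray(gt_emb, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
```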
EffNet measures image-level feature similarity using EfficientNet-B1 features.
In this repo:
- Features are extracted from EfficientNet-B1 (avgpool).
- The metric usually used in the related literature is the per-image correlation distance (scipy.spatial.distance.correlation).
- For SEED, the correlation distance is converted to a similarity as 1 - effnet_distance.
Higher is better after conversion to similarity.
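The distance and its conversion can be written out directly. This sketch restates the definition used by scipy.spatial.distance.correlation in plain NumPy (1 minus the Pearson correlation of the mean-centered vectors); the function name is illustrative:

```python
import numpy as np

def effnet_similarity(u, v):
    # Correlation distance, following the definition of
    # scipy.spatial.distance.correlation: 1 - Pearson correlation
    # of the mean-centered vectors. Then convert to the similarity
    # that SEED aggregates: 1 - distance.
    u = np.asarray(u, dtype=float) - np.mean(u)
    v = np.asarray(v, dtype=float) - np.mean(v)
    distance = 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return 1.0 - distance
```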
The final score is the average of the three components:
SEED = (Object F1 + Cap-Sim + EffNet) / 3
We provide our collected human survey results for researchers who are interested in developing new evaluation metrics or who plan to use the survey results to meta-evaluate different evaluation metrics.
Download and unzip the data:
wget https://github.com/Concarne2/SEED/releases/download/v1.0.0/human_eval_data.tar.gz
tar -xzf human_eval_data.tar.gz
This release contains the human survey results and related data, including the image files used for the survey, our evaluation results for those images, and the suggested usage of the survey results.
- 250131_final.csv: raw survey responses
- images/: paired image sets
  - images/gt/ (1000 PNGs)
  - images/recon/ (1000 PNGs, filename-matched to gt/)
- eval_metrics.npz: precomputed metric arrays for the 1000 items
  - Keys: pixcorr, ssim, alexnet2, alexnet5, inception, clip, effnet, swav, obj_f1, git_st
- survey_analysis.ipynb: suggested survey analysis notebook
- dataset.py, tau_optimization.py: metric/correlation utilities used by the notebook. These utilities are adapted from the t2v_metrics repo: https://github.com/linzhiqiu/t2v_metrics
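The access pattern for eval_metrics.npz is standard NumPy archive loading. This sketch uses an in-memory stand-in so it runs without the download; for the real file, replace the buffer with the path "human_eval_data/eval_metrics.npz". The key names come from the release notes above:

```python
import io
import numpy as np

# In-memory stand-in for eval_metrics.npz; with the real file you
# would call np.load("human_eval_data/eval_metrics.npz") instead.
buf = io.BytesIO()
np.savez(buf, obj_f1=np.zeros(1000), effnet=np.zeros(1000))  # stand-in arrays
buf.seek(0)
metrics = np.load(buf)
obj_f1_per_item = metrics["obj_f1"]  # one value per surveyed image pair
```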
- Minor numeric differences can appear across hardware/software stacks (especially GPU/CUDA combinations).
- Therefore when comparing models, run all evaluations in the same environment whenever possible.
This project is released under the Apache 2.0 License.
See LICENSE for details.
If you find this work useful, please cite:
@inproceedings{
park2026seed,
title={{SEED}: Towards More Accurate Semantic Evaluation for Visual Brain Decoding},
author={Juhyeon Park and Peter Yongho Kim and Jiook Cha and Shinjae Yoo and Taesup Moon},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=JV1eUVA6W7}
}