This repository contains the reference code and artifacts for the paper:
- The Mind's Eye: A Multi-Faceted Reward Framework for Guiding Visual Metaphor Generation — accepted to the Generative and Protective AI for Content Creation (GenProCC) workshop at NeurIPS 2025.
The project builds an iterative pipeline that:
- Decomposes a metaphor into Source (S), Target (T), and Meaning (M) with a text LLM
- Generates images with Stable Diffusion 3.5 or Janus based on a refined visual prompt
- Evaluates images using a multi-faceted reward signal (CLIP, VLM-based analysis, BERTScore, and a decomposition reward)
- Refines the prompt over multiple iterations via in-context feedback, or optionally trains a small LLM with GRPO using the same reward signal
- End-to-end metaphor-to-image generation pipeline with iterative refinement
- Pluggable text LLMs (API-based Gemini, or local Unsloth/Gemma LoRA adapters)
- Pluggable image backends: Stable Diffusion 3.5 (diffusers) or Janus
- VLM-based evaluation with either Gemini or a local vLLM-served Qwen2.5-VL model
- Unified combined reward with adjustable weights in `config.py`
- Optional GRPO finetuning loop that uses the same rewards to improve the S/T/M + visual prompt generation
- Environment
- Python 3.10+ recommended
- CUDA-capable GPU (48GB+ VRAM recommended; SD3.5/Janus benefit from more)
- Install dependencies:
  ```bash
  python -m venv .venv
  source .venv/bin/activate
  pip install --upgrade pip
  pip install -r requirements.txt
  ```
- Credentials and optional services
- Gemini (text/VLM): set `GEMINI_API_KEY` in your shell environment.
- Hugging Face token (optional): set `HF_TOKEN` if you need gated model files (e.g., LoRA weights) from the Hub.
- Local Qwen2.5-VL via vLLM (optional alternative to Gemini VLM):
  - Ensure `config.py` has `VLM_PROVIDER = "qwen"` and `QWEN_VLM_API_BASE` points to your vLLM server.
  - Example vLLM serve command (adjust to your hardware):

    ```bash
    vllm serve Qwen/Qwen2.5-VL-32B-Instruct-AWQ --port 8000 --tensor-parallel-size 2
    ```
- Configure

Adjust `config.py` to choose models and behavior:

- `ACTIVE_IMAGE_MODEL`: "stable_diffusion" or "janus"
- `GEMINI_MODEL_ID`: API text model (default: Gemma-3 27B via API wrapper)
- `USE_TEST_METAPHORS`: toggle to load metaphors from `data/metaphors_test.txt`
- `N_ITERATIONS`: prompt refinement iterations
- Reward weights: `W_*` keys for CLIP/VLM/BERTScore/decomposition alignment
- Generation params: `DEFAULT_SD_*`, `DEFAULT_JANUS_*`
- VLM provider/model: `VLM_PROVIDER`, `VLM_MODEL_ID_FOR_EVAL`, `QWEN_VLM_*`
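For orientation, here is a hedged sketch of what these options look like; the values below are examples only, and the real defaults plus the complete key list live in `config.py`:

```python
# Illustrative excerpt of config.py -- example values only; the key names match
# the options listed above, but consult config.py for the actual defaults.
ACTIVE_IMAGE_MODEL = "stable_diffusion"   # or "janus"
GEMINI_MODEL_ID = "gemma-3-27b-it"        # API text model ID (example value)
USE_TEST_METAPHORS = False                # True -> load data/metaphors_test.txt
N_ITERATIONS = 3                          # prompt refinement iterations (example)
VLM_PROVIDER = "gemini"                   # or "qwen" for a local vLLM server
DEFAULT_SD_STEPS = 28                     # example generation parameter
```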
- Run the iterative pipeline
- Put your metaphors in `config.py` (`METAPHOR_PROMPT_LIST`) or set `USE_TEST_METAPHORS = True` and fill `data/metaphors_test.txt`.
- Then run:

```bash
python main.py
```

Outputs go to `generated_images_visual_metaphor/<sanitized-metaphor>/`, including per-iteration PNGs and an `iteration_summary.json`. A run-level `average_scores_summary.json` is written in the root `OUTPUT_DIR`.
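If you want to post-process results programmatically, here is a minimal sketch for walking the output directory. It assumes the default directory name above; the summary schema is not documented here, so inspect the loaded object before relying on specific fields:

```python
# Minimal sketch: collect the per-metaphor iteration summaries written by main.py.
# Assumes the default output directory name; the JSON schema is repository-specific.
import json
from pathlib import Path

output_dir = Path("generated_images_visual_metaphor")
for summary_path in sorted(output_dir.glob("*/iteration_summary.json")):
    with summary_path.open() as f:
        summary = json.load(f)
    print(summary_path.parent.name, "->", type(summary).__name__)
```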
- `utils.llm_utils.break_down_metaphor(...)` extracts S/T/M and a compact, visually actionable prompt (≤77 tokens)
- `utils.pipeline_steps.run_iterative_pipeline(...)` runs N iterations:
  - generate with SD3.5 or Janus
  - score with CLIP + VLM (S′/T′/M′ extraction, presence/alignment) + BERTScore + decomposition reward
  - compute a combined reward (weighted sum; see the sketch below)
  - refine the prompt via the text LLM for the next iteration
- `main.py` orchestrates initialization, I/O, caching of processed metaphors, and average-score aggregation
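For intuition, a minimal sketch of the weighted-sum step follows. It is not the repository's actual implementation: the weight names and numbers are placeholders for the `W_*` keys in `config.py`, and the real logic lives in `utils/evaluation_utils.py`.

```python
# Hedged sketch of the combined reward: a weighted sum of the individual
# signals. Weight names and values are illustrative, not the real config.py keys.
WEIGHTS = {"clip": 0.25, "vlm": 0.35, "bertscore": 0.20, "decomposition": 0.20}

def combined_reward(scores: dict) -> float:
    """Weighted sum over whichever reward signals are present in `scores`."""
    return sum(weight * scores.get(name, 0.0) for name, weight in WEIGHTS.items())

# Example for one iteration (numbers are made up):
print(combined_reward({"clip": 0.31, "vlm": 0.80, "bertscore": 0.62, "decomposition": 0.70}))
```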
Optionally, `grpo_training.py` uses the same reward signal to train a small LLM (Unsloth/Gemma 4B) with GRPO on S/T/M + visual prompt generation.
Below is a quick tour of the key files and folders.
Drivers and scripts at repo root
- `main.py` — Entry point for the iterative generation pipeline (loads models, loops over metaphors, aggregates scores)
- `grpo_training.py` — GRPO trainer using Unsloth/Gemma 4B; uses the same multi-signal rewards and generates/saves images for auditing
- `eval_zero_shot.py` — Standalone evaluator for a single image/metaphor pair; emits a JSON with metrics
- `generate_images.py` — Earlier standalone script to generate images from sample prompts with SD3.5 + Gemini-based elaboration
- `config.py` — All runtime configuration (model IDs, weights, iteration counts, paths, API providers, generation params)
Utilities (moved under `utils/`)
- `utils/pipeline_steps.py` — Core iterative loop: image generation, VLM analysis, CLIP/BERTScore, combined reward, and prompt refinement
- `utils/llm_utils.py` — Text/VLM client initialization (Gemini API, local Unsloth/Gemma LoRA, Qwen vLLM), plus metaphor breakdown helpers
- `utils/sd_utils.py` — Stable Diffusion 3.5 (diffusers) and Janus initialization; LoRA loading; memory tuning
- `utils/evaluation_utils.py` — CLIP score, VLM analysis (Gemini or Qwen2.5-VL), BERTScore, combined reward, decomposition reward
- `utils/data_preprocessing.py` — Load, dedupe, split, and persist train/test metaphor lists
- `utils/logging_utils.py` — Centralized logging config (file + console) driven by `config.py`
Configuration and environment
- `requirements.txt` — Fully pinned environment for reproducibility (diffusers, transformers, torch 2.6, unsloth, vLLM, etc.)
Data, outputs, and artifacts
- `data/` — Project datasets and splits
  - `metaphors_train.txt`, `metaphors_test.txt` — Generated by `data_preprocessing.py` or created manually
- `unsloth_compiled_cache/` — Unsloth runtime cache
- `user_study/`, `user_study_images/` — Materials and images related to the user study
See more details in the user study README: `user_study/README.md`.
Evaluate an already generated image for a given metaphor:
```bash
python eval_zero_shot.py \
  --metaphor "The world is a garden" \
  --image_path path/to/image.png \
  --source "garden" \
  --target "world" \
  --meaning "The world is cultivated, diverse, and nurtured like a garden."
```
Emits a JSON with: CLIP score, VLM S′/T′/M′ and presence/alignment scores, BERTScores, and (if S/T/M provided) the decomposition reward.
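To evaluate many image/metaphor pairs, one option is to wrap the CLI from Python. The sketch below assumes the script prints its JSON to stdout; if it writes to a file instead, adjust accordingly.

```python
# Hedged sketch: call eval_zero_shot.py and parse its JSON output.
# Assumes the metrics JSON is printed to stdout; flags mirror the example above.
import json
import subprocess

cmd = [
    "python", "eval_zero_shot.py",
    "--metaphor", "The world is a garden",
    "--image_path", "path/to/image.png",
    "--source", "garden",
    "--target", "world",
    "--meaning", "The world is cultivated, diverse, and nurtured like a garden.",
]
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
metrics = json.loads(result.stdout)
print(metrics)
```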
The same reward signal can be used to finetune a small text LLM to produce better S/T/M + visual prompts.
- Set `ENABLE_GRPO_TRAINING = True` in `config.py`.
- Ensure a GPU with sufficient VRAM; SD/Janus image generation during rewards is the heavy part.
- Prepare training metaphors in `data/metaphors_train.txt`.
- Run `python -m grpo_training` or call `MetaphorGRPOTrainer().run_standalone_training()` (see the sketch below).
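For the programmatic route, a minimal sketch follows; it assumes `MetaphorGRPOTrainer` is importable from `grpo_training` and takes no required constructor arguments.

```python
# Hedged sketch: run GRPO training programmatically. Constructor arguments (if
# any) are omitted; check grpo_training.py for the actual interface.
from grpo_training import MetaphorGRPOTrainer

trainer = MetaphorGRPOTrainer()
trainer.run_standalone_training()
```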
Outputs (model + tokenizer) are saved under `GRPO_OUTPUT_DIR`; interim images are backed up to `GRPO_IMAGE_BACKUP_DIR` for auditing.
- GPU memory: SD3.5 is VRAM hungry. Reduce `DEFAULT_SD_HEIGHT`/`WIDTH`, `DEFAULT_SD_STEPS`, or switch to Janus if it is more stable on your hardware. The code proactively sets `PYTORCH_CUDA_ALLOC_CONF` and clears the CUDA cache between steps (a generic example of this pattern appears after this list).
- Gemini vs Qwen VLM: switch providers in `config.py`. For local Qwen2.5-VL, keep vLLM running and confirm `QWEN_VLM_API_BASE`.
- Long prompts: the visual prompt should be ≤77 tokens; the pipeline penalizes overlong prompts during GRPO and asks the LLM to stay concise during refinement.
- LoRA loading: if LoRA fails to load, the pipeline continues with the base SD model and logs a warning.
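If you still hit fragmentation-related OOMs, the general PyTorch pattern looks like the snippet below; this is standard PyTorch memory hygiene, not code from this repository.

```python
# General PyTorch memory hygiene (not repository-specific): set the allocator
# option before torch initializes CUDA, and free cached blocks between steps.
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch

if torch.cuda.is_available():
    torch.cuda.empty_cache()
```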
If you use this repository, please cite the paper:
```bibtex
@article{koushik2025mindseyemultifacetedreward,
  title={The Mind's Eye: A Multi-Faceted Reward Framework for Guiding Visual Metaphor Generation},
  author={Girish A. Koushik and Fatemeh Nazarieh and Katherine Birch and Shenbin Qian and Diptesh Kanojia},
  year={2025},
  eprint={2508.18569},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2508.18569},
}
```
This code is released under the MIT License. See LICENSE for details.

