This repository contains the reference code and artifacts for the paper:
- The Mind's Eye: A Multi-Faceted Reward Framework for Guiding Visual Metaphor Generation — accepted to the Generative and Protective AI for Content Creation (GenProCC) workshop at NeurIPS 2025.
The project builds an iterative pipeline that:
- Decomposes a metaphor into Source (S), Target (T), and Meaning (M) with a text LLM
- Generates images with Stable Diffusion 3.5 or Janus based on a refined visual prompt
- Evaluates images using a multi-faceted reward signal (CLIP, VLM-based analysis, BERTScore, and a decomposition reward)
- Refines the prompt over multiple iterations via in-context feedback, or optionally trains a small LLM with GRPO using the same reward signal
- End-to-end metaphor-to-image generation pipeline with iterative refinement
- Pluggable text LLMs (API-based Gemini, or local Unsloth/Gemma LoRA adapters)
- Pluggable image backends: Stable Diffusion 3.5 (diffusers) or Janus
- VLM-based evaluation with either Gemini or a local vLLM-served Qwen2.5-VL model
- Unified combined reward with adjustable weights in `config.py`
- Optional GRPO finetuning loop that uses the same rewards to improve the S/T/M + visual prompt generation
- Environment
- Python 3.10+ recommended
- CUDA-capable GPU (48GB+ VRAM recommended; SD3.5/Janus benefit from more)
- Install dependencies:
  ```bash
  python -m venv .venv
  source .venv/bin/activate
  pip install --upgrade pip
  pip install -r requirements.txt
  ```
- Credentials and optional services
- Gemini (text/VLM): set `GEMINI_API_KEY` in your shell environment.
- Hugging Face token (optional): set `HF_TOKEN` if you need gated model files (e.g., LoRA weights) from the Hub.
- Local Qwen2.5-VL via vLLM (optional alternative to Gemini VLM):
  - Ensure `config.py` has `VLM_PROVIDER = "qwen"` and `QWEN_VLM_API_BASE` points to your vLLM server.
  - Example vLLM serve command (adjust to your hardware):

    ```bash
    vllm serve Qwen/Qwen2.5-VL-32B-Instruct-AWQ --port 8000 --tensor-parallel-size 2
    ```
- Configure

Adjust `config.py` to choose models and behavior:

- `ACTIVE_IMAGE_MODEL`: "stable_diffusion" or "janus"
- `GEMINI_MODEL_ID`: API text model (default: Gemma-3 27B via API wrapper)
- `USE_TEST_METAPHORS`: toggle to load metaphors from `data/metaphors_test.txt`
- `N_ITERATIONS`: prompt refinement iterations
- Reward weights: `W_*` keys for CLIP/VLM/BERTScore/decomposition alignment
- Generation params: `DEFAULT_SD_*`, `DEFAULT_JANUS_*`
- VLM provider/model: `VLM_PROVIDER`, `VLM_MODEL_ID_FOR_EVAL`, `QWEN_VLM_*`
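For orientation, here is a hedged sketch of what these options look like; the values below are examples only, and the real defaults plus the complete key list live in `config.py`:

```python
# Illustrative excerpt of config.py -- example values only; the key names match
# the options listed above, but consult config.py for the actual defaults.
ACTIVE_IMAGE_MODEL = "stable_diffusion"   # or "janus"
GEMINI_MODEL_ID = "gemma-3-27b-it"        # API text model ID (example value)
USE_TEST_METAPHORS = False                # True -> load data/metaphors_test.txt
N_ITERATIONS = 3                          # prompt refinement iterations (example)
VLM_PROVIDER = "gemini"                   # or "qwen" for a local vLLM server
DEFAULT_SD_STEPS = 28                     # example generation parameter
```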
- Run the iterative pipeline
- Put your metaphors in `config.py` (`METAPHOR_PROMPT_LIST`) or set `USE_TEST_METAPHORS = True` and fill `data/metaphors_test.txt`.
- Then run:

```bash
python main.py
```

Outputs go to `generated_images_visual_metaphor/<sanitized-metaphor>/`, including per-iteration PNGs and an `iteration_summary.json`. A run-level `average_scores_summary.json` is written in the root `OUTPUT_DIR`.
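If you want to post-process results programmatically, here is a minimal sketch for walking the output directory. It assumes the default directory name above; the summary schema is not documented here, so inspect the loaded object before relying on specific fields:

```python
# Minimal sketch: collect the per-metaphor iteration summaries written by main.py.
# Assumes the default output directory name; the JSON schema is repository-specific.
import json
from pathlib import Path

output_dir = Path("generated_images_visual_metaphor")
for summary_path in sorted(output_dir.glob("*/iteration_summary.json")):
    with summary_path.open() as f:
        summary = json.load(f)
    print(summary_path.parent.name, "->", type(summary).__name__)
```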
- `utils.llm_utils.break_down_metaphor(...)` extracts S/T/M and a compact, visually actionable prompt (≤77 tokens)
- `utils.pipeline_steps.run_iterative_pipeline(...)` runs N iterations:
  - generate with SD3.5 or Janus
  - score with CLIP + VLM (S′/T′/M′ extraction, presence/alignment) + BERTScore + decomposition reward
  - compute a combined reward (weighted sum; see the sketch below)
  - refine the prompt via the text LLM for the next iteration
- `main.py` orchestrates initialization, I/O, caching of processed metaphors, and average-score aggregation
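For intuition, a minimal sketch of the weighted-sum step follows. It is not the repository's actual implementation: the weight names and numbers are placeholders for the `W_*` keys in `config.py`, and the real logic lives in `utils/evaluation_utils.py`.

```python
# Hedged sketch of the combined reward: a weighted sum of the individual
# signals. Weight names and values are illustrative, not the real config.py keys.
WEIGHTS = {"clip": 0.25, "vlm": 0.35, "bertscore": 0.20, "decomposition": 0.20}

def combined_reward(scores: dict) -> float:
    """Weighted sum over whichever reward signals are present in `scores`."""
    return sum(weight * scores.get(name, 0.0) for name, weight in WEIGHTS.items())

# Example for one iteration (numbers are made up):
print(combined_reward({"clip": 0.31, "vlm": 0.80, "bertscore": 0.62, "decomposition": 0.70}))
```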
Optionally, `grpo_training.py` uses the same reward signal to train a small LLM (Unsloth/Gemma 4B) with GRPO on S/T/M + visual prompt generation.
Below is a quick tour of the key files and folders.
Drivers and scripts at repo root
- `main.py` — Entry point for the iterative generation pipeline (loads models, loops over metaphors, aggregates scores)
- `grpo_training.py` — GRPO trainer using Unsloth/Gemma 4B; uses the same multi-signal rewards and generates/saves images for auditing
- `eval_zero_shot.py` — Standalone evaluator for a single image/metaphor pair; emits a JSON with metrics
- `generate_images.py` — Earlier standalone script to generate images from sample prompts with SD3.5 + Gemini-based elaboration
- `config.py` — All runtime configuration (model IDs, weights, iteration counts, paths, API providers, generation params)
Utilities (moved under `utils/`)
- `utils/pipeline_steps.py` — Core iterative loop: image generation, VLM analysis, CLIP/BERTScore, combined reward, and prompt refinement
- `utils/llm_utils.py` — Text/VLM client initialization (Gemini API, local Unsloth/Gemma LoRA, Qwen vLLM), plus metaphor breakdown helpers
- `utils/sd_utils.py` — Stable Diffusion 3.5 (diffusers) and Janus initialization; LoRA loading; memory tuning
- `utils/evaluation_utils.py` — CLIP score, VLM analysis (Gemini or Qwen2.5-VL), BERTScore, combined reward, decomposition reward
- `utils/data_preprocessing.py` — Load, dedupe, split, and persist train/test metaphor lists
- `utils/logging_utils.py` — Centralized logging config (file + console) driven by `config.py`
Configuration and environment
- `requirements.txt` — Fully pinned environment for reproducibility (diffusers, transformers, torch 2.6, unsloth, vLLM, etc.)
Data, outputs, and artifacts
- `data/` — Project datasets and splits
  - `metaphors_train.txt`, `metaphors_test.txt` — Generated by `data_preprocessing.py` or created manually
- `unsloth_compiled_cache/` — Unsloth runtime cache
- `user_study/`, `user_study_images/` — Materials and images related to the user study
See more details in the user study README: `user_study/README.md`.
Evaluate an already generated image for a given metaphor:
```bash
python eval_zero_shot.py \
  --metaphor "The world is a garden" \
  --image_path path/to/image.png \
  --source "garden" \
  --target "world" \
  --meaning "The world is cultivated, diverse, and nurtured like a garden."
```
Emits a JSON with: CLIP score, VLM S′/T′/M′ and presence/alignment scores, BERTScores, and (if S/T/M provided) the decomposition reward.
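To evaluate many image/metaphor pairs, one option is to wrap the CLI from Python. The sketch below assumes the script prints its JSON to stdout; if it writes to a file instead, adjust accordingly.

```python
# Hedged sketch: call eval_zero_shot.py and parse its JSON output.
# Assumes the metrics JSON is printed to stdout; flags mirror the example above.
import json
import subprocess

cmd = [
    "python", "eval_zero_shot.py",
    "--metaphor", "The world is a garden",
    "--image_path", "path/to/image.png",
    "--source", "garden",
    "--target", "world",
    "--meaning", "The world is cultivated, diverse, and nurtured like a garden.",
]
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
metrics = json.loads(result.stdout)
print(metrics)
```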
The same reward signal can be used to finetune a small text LLM to produce better S/T/M + visual prompts.
- Set `ENABLE_GRPO_TRAINING = True` in `config.py`.
- Ensure a GPU with sufficient VRAM; SD/Janus image generation during rewards is the heavy part.
- Prepare training metaphors in `data/metaphors_train.txt`.
- Run `python -m grpo_training` or call `MetaphorGRPOTrainer().run_standalone_training()` (see the sketch below).
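For the programmatic route, a minimal sketch follows; it assumes `MetaphorGRPOTrainer` is importable from `grpo_training` and takes no required constructor arguments.

```python
# Hedged sketch: run GRPO training programmatically. Constructor arguments (if
# any) are omitted; check grpo_training.py for the actual interface.
from grpo_training import MetaphorGRPOTrainer

trainer = MetaphorGRPOTrainer()
trainer.run_standalone_training()
```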
Outputs (model + tokenizer) are saved under `GRPO_OUTPUT_DIR`; interim images are backed up to `GRPO_IMAGE_BACKUP_DIR` for auditing.
- GPU memory: SD3.5 is VRAM hungry. Reduce `DEFAULT_SD_HEIGHT`/`WIDTH`, `DEFAULT_SD_STEPS`, or switch to Janus if it is more stable on your hardware. The code proactively sets `PYTORCH_CUDA_ALLOC_CONF` and clears the CUDA cache between steps (a generic example of this pattern appears after this list).
- Gemini vs Qwen VLM: switch providers in `config.py`. For local Qwen2.5-VL, keep vLLM running and confirm `QWEN_VLM_API_BASE`.
- Long prompts: the visual prompt should be ≤77 tokens; the pipeline penalizes overlong prompts during GRPO and asks the LLM to stay concise during refinement.
- LoRA loading: if LoRA fails to load, the pipeline continues with the base SD model and logs a warning.
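If you still hit fragmentation-related OOMs, the general PyTorch pattern looks like the snippet below; this is standard PyTorch memory hygiene, not code from this repository.

```python
# General PyTorch memory hygiene (not repository-specific): set the allocator
# option before torch initializes CUDA, and free cached blocks between steps.
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch

if torch.cuda.is_available():
    torch.cuda.empty_cache()
```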
If you use this repository, please cite the paper:
```bibtex
@article{koushik2025mindseyemultifacetedreward,
  title={The Mind's Eye: A Multi-Faceted Reward Framework for Guiding Visual Metaphor Generation},
  author={Girish A. Koushik and Fatemeh Nazarieh and Katherine Birch and Shenbin Qian and Diptesh Kanojia},
  year={2025},
  eprint={2508.18569},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2508.18569},
}
```
This code is released under the MIT License. See LICENSE for details.

