
Visual Metaphor Generation (The Mind's Eye)


Graphical abstract

This repository contains the reference code and artifacts for the paper:

  • The Mind's Eye: A Multi-Faceted Reward Framework for Guiding Visual Metaphor Generation — accepted to the Generative and Protective AI for Content Creation (GenProCC) workshop at NeurIPS 2025.

The project builds an iterative pipeline that:

  • Decomposes a metaphor into Source (S), Target (T), and Meaning (M) with a text LLM (worked example after this list)
  • Generates images with Stable Diffusion 3.5 or Janus based on a refined visual prompt
  • Evaluates images using a multi-faceted reward signal (CLIP, VLM-based analysis, BERTScore, and a decomposition reward)
  • Refines the prompt over multiple iterations via in-context feedback, or optionally trains a small LLM with GRPO using the same reward signal
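
For instance, the metaphor used in the zero-shot evaluation example later in this README decomposes as follows (S/T/M values taken from that section):
metaphor = "The world is a garden"
source = "garden"    # S: the concept supplying the imagery
target = "world"     # T: the concept being described
meaning = "The world is cultivated, diverse, and nurtured like a garden."  # M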

Highlights

  • End-to-end metaphor-to-image generation pipeline with iterative refinement
  • Pluggable text LLMs (API-based Gemini, or local Unsloth/Gemma LoRA adapters)
  • Pluggable image backends: Stable Diffusion 3.5 (diffusers) or Janus
  • VLM-based evaluation with either Gemini or a local vLLM-served Qwen2.5-VL model
  • Unified combined reward with adjustable weights in config.py
  • Optional GRPO finetuning loop that uses the same rewards to improve the S/T/M + visual prompt generation

Pipeline architecture

Quick start

  1. Environment
  • Python 3.10+ recommended
  • CUDA-capable GPU (48GB+ VRAM recommended; SD3.5/Janus benefit from more)
  • Install dependencies:
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
  2. Credentials and optional services
  • Gemini (text/VLM): set GEMINI_API_KEY in your shell environment.
  • Hugging Face token (optional): set HF_TOKEN if you need gated model files (e.g., LoRA weights) from the Hub. A shell example follows at the end of this step.
  • Local Qwen2.5-VL via vLLM (optional alternative to Gemini VLM):
    • Ensure config.py has VLM_PROVIDER = "qwen" and QWEN_VLM_API_BASE points to your vLLM server.
    • Example vLLM serve (adjust to your hardware):
vllm serve Qwen/Qwen2.5-VL-32B-Instruct-AWQ --port 8000 --tensor-parallel-size 2
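
Putting the credentials together, a typical shell setup might look like this (key and token values are placeholders):
export GEMINI_API_KEY="your-gemini-key"
export HF_TOKEN="your-hf-token"   # optional, only needed for gated Hub files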
  3. Configure: adjust config.py to choose models and behavior (an illustrative excerpt follows this list):
  • ACTIVE_IMAGE_MODEL: "stable_diffusion" or "janus"
  • GEMINI_MODEL_ID: API text model (default: Gemma-3 27B via API wrapper)
  • USE_TEST_METAPHORS: toggle to load metaphors from data/metaphors_test.txt
  • N_ITERATIONS: prompt refinement iterations
  • Reward weights: W_* keys for CLIP/VLM/BERTScore/decomposition alignment
  • Generation params: DEFAULT_SD_*, DEFAULT_JANUS_*
  • VLM provider/model: VLM_PROVIDER, VLM_MODEL_ID_FOR_EVAL, QWEN_VLM_*
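
A hypothetical excerpt of config.py using the keys above (values are illustrative and the exact W_* suffixes are assumptions; the file itself is authoritative):
ACTIVE_IMAGE_MODEL = "stable_diffusion"         # or "janus"
USE_TEST_METAPHORS = False                      # True: read data/metaphors_test.txt
N_ITERATIONS = 3                                # refinement rounds (illustrative)
VLM_PROVIDER = "qwen"                           # or "gemini"
QWEN_VLM_API_BASE = "http://localhost:8000/v1"  # your vLLM server
W_CLIP = 0.25                                   # reward weights; suffixes assumed
W_VLM = 0.35
W_BERTSCORE = 0.20
W_DECOMPOSITION = 0.20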
  4. Run the iterative pipeline
  • Put your metaphors in config.py (METAPHOR_PROMPT_LIST) or set USE_TEST_METAPHORS=True and fill data/metaphors_test.txt.
  • Then run:
python main.py

Outputs go to generated_images_visual_metaphor/<sanitized-metaphor>/, including per-iteration PNGs and an iteration_summary.json. A run-level average_scores_summary.json is written in the root OUTPUT_DIR.
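
The layout looks roughly like this (the sanitized folder and per-iteration file names are illustrative):
generated_images_visual_metaphor/
  average_scores_summary.json
  the_world_is_a_garden/
    iteration_1.png
    iteration_2.png
    iteration_summary.json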

How it works (high level)

  • utils.llm_utils.break_down_metaphor(...) extracts S/T/M and a compact, visually actionable prompt (≤77 tokens)
  • utils.pipeline_steps.run_iterative_pipeline(...) runs N iterations:
    • generate with SD3.5 or Janus
    • score with CLIP + VLM (S′/T′/M′ extraction, presence/alignment) + BERTScore + decomposition reward
    • compute a combined reward (weighted sum; see the sketch after this list)
    • refine the prompt via the text LLM for the next iteration
  • main.py orchestrates initialization, I/O, caching of processed metaphors, and average-score aggregation
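
A minimal sketch of that weighted sum (function and weight names are illustrative, not the repo's exact API; the real W_* keys live in config.py):
def combined_reward(clip_s, vlm_s, bert_s, decomp_s, w):
    # Weighted sum of the four reward signals (illustrative).
    return (w["clip"] * clip_s + w["vlm"] * vlm_s
            + w["bert"] * bert_s + w["decomp"] * decomp_s)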

Optionally, grpo_training.py uses the same reward signal to train a small LLM (Unsloth/Gemma 4B) with GRPO on S/T/M + visual prompt generation.

Repository structure

Below is a quick tour of the key files and folders.

Drivers and scripts at repo root

  • main.py — Entry point for the iterative generation pipeline (loads models, loops over metaphors, aggregates scores)
  • grpo_training.py — GRPO trainer using Unsloth/Gemma 4B; uses the same multi-signal rewards and generates/saves images for auditing
  • eval_zero_shot.py — Standalone evaluator for a single image/metaphor pair; emits a JSON with metrics
  • generate_images.py — Earlier standalone script to generate images from sample prompts with SD3.5 + Gemini-based elaboration
  • config.py — All runtime configuration (model IDs, weights, iteration counts, paths, API providers, generation params)

Utilities (utils/)

  • utils/pipeline_steps.py — Core iterative loop: image generation, VLM analysis, CLIP/BERTScore, combined reward, and prompt refinement
  • utils/llm_utils.py — Text/VLM client initialization (Gemini API, local Unsloth/Gemma LoRA, Qwen vLLM), plus metaphor breakdown helpers
  • utils/sd_utils.py — Stable Diffusion 3.5 (diffusers) and Janus initialization; LoRA loading; memory tuning
  • utils/evaluation_utils.py — CLIP score, VLM analysis (Gemini or Qwen2.5-VL), BERTScore, combined reward, decomposition reward (generic CLIP-score sketch after this list)
  • utils/data_preprocessing.py — Load, dedupe, split, and persist train/test metaphor lists
  • utils/logging_utils.py — Centralized logging config (file + console) driven by config.py
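
For reference, a generic CLIP-score computation with Hugging Face transformers; this sketches the technique rather than the exact code in utils/evaluation_utils.py, and the checkpoint ID is an assumption:
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")  # assumed checkpoint
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
image = Image.open("path/to/image.png")
inputs = processor(text=["The world is a garden"], images=image,
                   return_tensors="pt", padding=True)
clip_score = model(**inputs).logits_per_image.item()  # scaled image-text similarity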

Configuration and environment

  • requirements.txt — Fully pinned environment for reproducibility (diffusers, transformers, torch 2.6, unsloth, vLLM, etc.)

Data, outputs, and artifacts

  • data/ — Project datasets and splits
    • metaphors_train.txt, metaphors_test.txt — Generated by utils/data_preprocessing.py or created manually
  • unsloth_compiled_cache/ — Unsloth runtime cache
  • user_study/, user_study_images/ — Materials and images related to the user study

See more details in the user study README: user_study/README.md.

Running evaluation only (zero-shot)

Evaluate an already generated image for a given metaphor:

python eval_zero_shot.py \
  --metaphor "The world is a garden" \
  --image_path path/to/image.png \
  --source "garden" \
  --target "world" \
  --meaning "The world is cultivated, diverse, and nurtured like a garden."

Emits a JSON with: CLIP score, VLM S′/T′/M′ and presence/alignment scores, BERTScores, and (if S/T/M provided) the decomposition reward.
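
The JSON has roughly this shape (field names and values here are illustrative placeholders, not the script's exact schema):
{
  "clip_score": 0.31,
  "vlm_analysis": {"source": "garden", "target": "world",
                   "presence": 0.8, "alignment": 0.7},
  "bertscore_f1": 0.89,
  "decomposition_reward": 0.75
}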

GRPO Training

The same reward signal can be used to finetune a small text LLM to produce better S/T/M + visual prompts.

  • Set ENABLE_GRPO_TRAINING = True in config.py.
  • Ensure a GPU with sufficient VRAM; SD/Janus image generation during rewards is the heavy part.
  • Prepare training metaphors in data/metaphors_train.txt.
  • Run python -m grpo_training or call MetaphorGRPOTrainer().run_standalone_training().
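
For the programmatic route, the call from the last bullet looks like this (assuming MetaphorGRPOTrainer imports from grpo_training as named above):
from grpo_training import MetaphorGRPOTrainer

trainer = MetaphorGRPOTrainer()
trainer.run_standalone_training()   # saves model/tokenizer under GRPO_OUTPUT_DIR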

Outputs (model + tokenizer) are saved under GRPO_OUTPUT_DIR; interim images are backed up to GRPO_IMAGE_BACKUP_DIR for auditing.

Tips and troubleshooting

  • GPU memory: SD3.5 is VRAM-hungry. Reduce DEFAULT_SD_HEIGHT/WIDTH or DEFAULT_SD_STEPS, or switch to Janus if it is more stable on your hardware (example after this list). The code proactively sets PYTORCH_CUDA_ALLOC_CONF and clears the CUDA cache between steps.
  • Gemini vs Qwen VLM: switch providers in config.py. For local Qwen2.5-VL, keep vLLM running and confirm QWEN_VLM_API_BASE.
  • Long prompts: the visual prompt should be ≤77 tokens; the pipeline penalizes overlong prompts during GRPO and asks the LLM to stay concise during refinement.
  • LoRA loading: if LoRA fails to load, the pipeline will continue with the base SD model and log a warning.
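
For example, a lighter SD3.5 setup in config.py might look like this (parameter names from the GPU-memory tip; values are illustrative):
DEFAULT_SD_HEIGHT = 768   # lower resolution to reduce VRAM
DEFAULT_SD_WIDTH = 768
DEFAULT_SD_STEPS = 28     # fewer denoising steps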

Citation

If you use this repository, please cite the paper:

@article{koushik2025mindseyemultifacetedreward,
      title={The Mind's Eye: A Multi-Faceted Reward Framework for Guiding Visual Metaphor Generation}, 
      author={Girish A. Koushik and Fatemeh Nazarieh and Katherine Birch and Shenbin Qian and Diptesh Kanojia},
      year={2025},
      eprint={2508.18569},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.18569}, 
}

License

This code is released under the MIT License. See LICENSE for details.
