This project implements an image captioning model using a Transformer-based architecture. It takes an image as input and generates a descriptive textual caption. The model uses pre-trained CLIP features for image understanding and a Transformer decoder for text generation.
- Project Structure
- Setup Instructions
- Running the Pipeline
- Script Descriptions
- Model Architectures
- Final Model Result
## Project Structure

```
Image-to-Text-Generation/
├── .gitignore
├── README.md
├── requirements.txt
├── data/
│   ├── flickr8k_images/
│   │   ├── Images/              # <-- Raw JPG images (approx. 8000)
│   │   └── captions.txt         # <-- Original captions file from Flickr8k
│   ├── clip_features/           # Generated by extract_features.py
│   │   ├── train/
│   │   ├── val/
│   │   └── test/
│   └── processed/               # Generated by preprocess_data.py
│       ├── vocab.json
│       ├── train_set.json
│       ├── val_set.json
│       └── test_set.json
├── logging/                     # Generated by train.py
├── src/
│   ├── helpers/                 # Helpers for logging and result visualization
│   ├── preprocess_data.py
│   ├── extract_features.py
│   ├── model.py
│   ├── model_config.yaml        # Training configuration
│   ├── train.py
│   ├── test.py                  # Run the trained model on the entire test set
│   ├── generate.py              # Run the trained model on an individual input image
│   ├── utils.py
│   ├── run_pipeline.sh          # Script to run the full data prep and training
│   ├── image-to-text.sh         # Script to generate a caption for an image
│   └── evaluate_gen_result.py
├── saved_models/                # Generated by train.py (contains subdirectories for each run)
│   └── run_YYYYMMDD-HHMMSS/
│       ├── model_best.pt
│       └── model_epoch_N.pt
├── visualization/               # Curves and any visualization results generated by train.py
└── tests_and_results/           # Test set results
    └── YYYY-MM-DD/
```
## Setup Instructions

Prerequisites:

- Python 3.8 or higher.
- `pip` for installing packages.

This project uses the Flickr8k dataset.

- Download:
  - Flickr8k_Dataset (Images): available on Kaggle and other academic dataset repositories (search for "Flickr8k"). It usually comes as a `Flickr8k_Dataset.zip` containing the `Images` folder.
  - Flickr8k_text (Captions): this usually contains `Flickr8k.token.txt`, but this project expects `captions.txt` in a specific CSV format (`image,caption`). The `preprocess_data.py` script included in this repository reads `data/flickr8k_images/captions.txt`. Ensure your `captions.txt` has two columns, `image` and `caption`, with each image having multiple caption rows.

    A common source for this file is a pre-processed version of the Flickr8k text data, or you may need to adapt `Flickr8k.token.txt` yourself (see the sketch after the folder structure below). The expected `captions.txt` format:

    ```
    image_name_1.jpg,First caption for image 1
    image_name_1.jpg,Second caption for image 1
    ...
    image_name_2.jpg,First caption for image 2
    ```

- Folder structure:
  - Create the directory `data/flickr8k_images/` in the project root.
  - Extract/place the downloaded `Images` folder into `data/flickr8k_images/`, so you have `data/flickr8k_images/Images/`.
  - Place your `captions.txt` file directly into `data/flickr8k_images/`.

  The final structure should be:

  ```
  data/
  └── flickr8k_images/
      ├── Images/
      │   ├── 1000268201_693b08cb0e.jpg
      │   ├── ... (all ~8000 images)
      └── captions.txt
  ```
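If you only have `Flickr8k.token.txt` (where each line looks like `<image>.jpg#<n><TAB><caption>`), a small script can convert it to the expected CSV. The snippet below is a minimal sketch, not part of the repository; the input/output paths and the header row are assumptions, so adjust them to whatever `preprocess_data.py` actually expects.

```python
# convert_tokens_to_captions.py -- minimal sketch (not part of the repository).
# Converts Flickr8k.token.txt ("<image>.jpg#<n>\t<caption>") into the
# "image,caption" CSV format that captions.txt is described to use.
import csv

token_file = "Flickr8k.token.txt"                    # assumed input path
output_file = "data/flickr8k_images/captions.txt"    # location this project reads from

with open(token_file, encoding="utf-8") as fin, \
     open(output_file, "w", newline="", encoding="utf-8") as fout:
    writer = csv.writer(fout)
    writer.writerow(["image", "caption"])            # header row (assumption)
    for line in fin:
        line = line.strip()
        if not line:
            continue
        image_tag, caption = line.split("\t", 1)     # "<image>.jpg#<n>" and the caption text
        image_name = image_tag.split("#")[0]         # drop the "#<n>" caption index
        writer.writerow([image_name, caption])
```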
Navigate to the project root directory in your terminal and run:

```bash
pip install -r requirements.txt
```

This installs PyTorch, Transformers, NLTK, Pillow, and the other necessary libraries. The scripts will also attempt to download NLTK's `punkt` tokenizer data if it is not found.
## Running the Pipeline

All scripts should be run from the `src/` directory.

The `run_pipeline.sh` script automates the entire process of preparing the data, extracting image features, and training the captioning model.

- Navigate to the `src` directory:

  ```bash
  cd src
  ```

- Make the script executable (if not already):

  ```bash
  chmod +x run_pipeline.sh
  ```

- Run the script:

  ```bash
  ./run_pipeline.sh
  ```

This will:

- Run `preprocess_data.py` to clean captions, build a vocabulary, and create train/validation/test splits. Processed data is saved in `data/processed/`.
- Run `extract_features.py` to generate CLIP image features for all images in the dataset splits. Features are saved in `data/clip_features/`.
- Run `train.py` to train the `CaptioningTransformer` model. Checkpoints and the best model are saved in a timestamped subdirectory within `saved_models/`.
To evaluate a trained model on the test set, navigate to the `src` directory and run `test.py` with the required arguments. For example:

```bash
python test.py \
    --best_model_path ../saved_models/run_{timestamp}/model_best.pt \
    --clip_model_name openai/XXXX \
    --device cuda \
    --beam_search \
    --eval_mode full \
    --max_len 40
```

- `--best_model_path`: full path to the trained model checkpoint.
- `--clip_model_name`: defaults to `clip-vit-base-patch32`.
- `--beam_search`: optional flag; include it only if you want to use beam search decoding.

This command loads the processed test set from `data/processed/test_set.json` and the vocabulary from `data/processed/vocab.json`. It also loads the trained model and the CLIP model for feature extraction, then generates captions for each test image using either greedy decoding (the default) or beam search (if the `--beam_search` flag is used).

The generated captions for each image are saved to a timestamped file under `tests_and_results/YYYY-MM-DD/`. After generation, the script computes evaluation metrics and prints the average scores to the console.
The `image-to-text.sh` script loads a trained model and generates a caption for a specified image.

- Navigate to the `src` directory:

  ```bash
  cd src
  ```

- Make the script executable (if not already):

  ```bash
  chmod +x image-to-text.sh
  ```

- Modify `image-to-text.sh` (optional): the script is pre-configured with a sample image path and a model path. You may need to update these paths:

  ```bash
  python generate.py \
      --image_path ../data/flickr8k_images/Images/YOUR_IMAGE.jpg \
      --model_path ../saved_models/YOUR_RUN_DIR/model_best.pt \
      --vocab_path ../data/processed/vocab.json \
      --clip_model_name openai/XXXX \
      --max_len 40 \
      --device cuda \
      --beam_search
  ```

  - `--clip_model_name`: defaults to `clip-vit-base-patch32`.
  - `--max_len`: maximum length of the generated caption.
  - `--device`: `cuda` or `cpu`.
  - `--beam_search`: optional flag; include it only if you want to use beam search decoding.

  Replace `YOUR_IMAGE.jpg` with the image you want to test and `YOUR_RUN_DIR` with the specific run directory in `saved_models/` containing your trained `model_best.pt`.

- Run the script:

  ```bash
  ./image-to-text.sh
  ```

This will output the generated caption to the console.
## Script Descriptions

Located in the `src/` directory:
- `preprocess_data.py`:
  - Loads captions from `data/flickr8k_images/captions.txt`.
  - Cleans captions: converts to lowercase, removes punctuation, and tokenizes using NLTK.
  - Builds a vocabulary (`vocab.json`) from the training captions, mapping words to unique integer IDs. Special tokens (`<pad>`, `<start>`, `<end>`, `<unk>`) are included. Words occurring fewer than `MIN_WORD_FREQ` times are excluded. (See the vocabulary-building sketch after this item.)
  - Splits the dataset into training, validation, and test sets (`train_set.json`, `val_set.json`, `test_set.json`). These JSON files map image IDs to their respective image paths and tokenized captions.
  - Saves all processed data into `data/processed/`.
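As an illustration of the vocabulary step, here is a minimal sketch of how a frequency-thresholded vocabulary with the four special tokens can be built. It is not the repository's implementation; the function name and the exact special-token order/IDs are assumptions.

```python
# Minimal vocabulary-building sketch (assumed helper, not the repository's exact code).
from collections import Counter

def build_vocab(tokenized_captions, min_word_freq=5):
    """Map words appearing at least `min_word_freq` times to integer IDs.

    `tokenized_captions` is a list of token lists taken from the *training* captions.
    """
    counts = Counter(tok for caption in tokenized_captions for tok in caption)
    vocab = {"<pad>": 0, "<start>": 1, "<end>": 2, "<unk>": 3}   # special-token IDs are an assumption
    for word, freq in counts.most_common():
        if freq >= min_word_freq and word not in vocab:
            vocab[word] = len(vocab)
    return vocab

# Example: rare or unseen words fall back to <unk> when numericalizing.
vocab = build_vocab([["a", "dog", "runs"], ["a", "dog", "sleeps"]], min_word_freq=2)
ids = [vocab.get(tok, vocab["<unk>"]) for tok in ["a", "dog", "jumps"]]
```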
- `extract_features.py`:
  - Loads a pre-trained CLIP model (`openai/clip-vit-base-patch32` by default) and its processor.
  - Iterates through the images specified in `train_set.json`, `val_set.json`, and `test_set.json`.
  - For each image, extracts the global image features using CLIP. (A sketch of this step follows this item.)
  - Saves these features as NumPy arrays (`.npy` files) in `data/clip_features/{train|val|test}/`.
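The core of the extraction step looks roughly like the sketch below, which uses the Hugging Face `transformers` CLIP API to produce one 512-dimensional vector per image. The file paths and the `.npy` naming convention are illustrative assumptions, not the script's exact code.

```python
# Sketch of global CLIP feature extraction (paths and file naming are assumptions).
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_name = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(model_name).eval()
processor = CLIPProcessor.from_pretrained(model_name)

image = Image.open("../data/flickr8k_images/Images/1000268201_693b08cb0e.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    features = model.get_image_features(**inputs)   # shape (1, 512) for this checkpoint

# Save one .npy file per image so the training code can load precomputed features.
np.save("../data/clip_features/train/1000268201_693b08cb0e.npy", features.squeeze(0).numpy())
```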
- `model.py`:
  - `ImageCaptionDataset`: a PyTorch `Dataset` class. It loads precomputed image features and the corresponding tokenized captions, numericalizes the captions by converting tokens to IDs using the vocabulary, and adds the `<start>` and `<end>` tokens. (A minimal sketch follows this item.)
  - `PositionalEncoding`: standard Transformer positional encoding module to inject sequence-order information.
  - `CaptioningTransformer`: the main image captioning model. It consists of:
    - An embedding layer for caption tokens.
    - A positional encoder.
    - A linear projection layer for the input image features (to match `d_model`).
    - A stack of Transformer decoder layers.
    - A final linear layer to output logits over the vocabulary.
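For orientation, a dataset along the lines described above might look like the sketch below. The field names in the split JSON, the feature file naming, and the return format are assumptions, not the repository's class.

```python
# Minimal sketch of a dataset over precomputed CLIP features (JSON field names are assumptions).
import json
import numpy as np
import torch
from torch.utils.data import Dataset

class ImageCaptionDatasetSketch(Dataset):
    def __init__(self, split_json, features_dir, vocab):
        with open(split_json) as f:
            entries = json.load(f)
        # Flatten to one (image_id, tokenized_caption) pair per sample.
        self.samples = [(img_id, caption)
                        for img_id, info in entries.items()
                        for caption in info["captions"]]
        self.features_dir = features_dir
        self.vocab = vocab

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        img_id, tokens = self.samples[idx]
        feature = np.load(f"{self.features_dir}/{img_id}.npy")          # (feature_dim,)
        ids = ([self.vocab["<start>"]]
               + [self.vocab.get(t, self.vocab["<unk>"]) for t in tokens]
               + [self.vocab["<end>"]])
        return torch.from_numpy(feature).float(), torch.tensor(ids, dtype=torch.long)
```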
- `model_config.yaml`:
  - Defines the core settings required to run the model and the training process, including:
    - Paths: directories and filenames for training/validation data.
    - Model hyperparameters.
    - Training hyperparameters: learning rate, batch size, number of epochs, etc.
  - Update the values as needed to point to your data directories or to adjust model behavior.
- `train.py`:
  - Defines the training and validation loops. (A condensed sketch of one training epoch follows this item.)
  - Loads the vocabulary and datasets (`ImageCaptionDataset`) and creates DataLoaders.
  - Initializes the `CaptioningTransformer` model, the optimizer (AdamW), and the loss function (CrossEntropyLoss, ignoring padding).
  - `train_epoch`: handles one epoch of training, including the forward pass, loss calculation, backpropagation, gradient clipping, and the optimizer step.
  - `validate_epoch`: evaluates the model on the validation set, calculating the average loss and BLEU-4 score. For the BLEU score, captions are generated greedily.
  - The main training loop iterates for a configured number of epochs, calling `train_epoch` and `validate_epoch`.
  - Saves a model checkpoint (including the optimizer state and training configuration) for each epoch, and separately saves the model with the best CIDEr score (`model_best.pt`).
  - Includes early stopping based on validation loss.
  - Saves plots and any visualization results to the `visualization` folder.
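To make the training-loop description concrete, here is a condensed sketch of one epoch with teacher forcing, padding-ignoring cross-entropy, and gradient clipping. The argument names, batch layout, and `pad_id` handling are assumptions rather than the repository's exact signatures.

```python
# Condensed sketch of one training epoch (argument names and batch layout are assumptions).
import torch
import torch.nn as nn

def train_epoch_sketch(model, loader, optimizer, pad_id, device, clip_norm=1.0):
    criterion = nn.CrossEntropyLoss(ignore_index=pad_id)    # padding tokens do not contribute
    model.train()
    total_loss = 0.0
    for features, captions in loader:                       # captions: (batch, seq_len) token IDs
        features, captions = features.to(device), captions.to(device)
        inputs, targets = captions[:, :-1], captions[:, 1:] # teacher forcing: predict the next token
        logits = model(features, inputs)                     # (batch, seq_len - 1, vocab_size)
        loss = criterion(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)  # gradient clipping
        optimizer.step()
        total_loss += loss.item()
    return total_loss / max(len(loader), 1)
```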
- `evaluate_gen_result.py`:
  - Contains the caption evaluation pipeline for comparing generated image captions against reference texts using multiple NLP metrics (a short BLEU example follows this item):
    - BLEU-1 and BLEU-4: evaluate n-gram overlap between generated and reference captions.
    - CIDEr: a consensus-based metric that emphasizes content similarity weighted by TF-IDF, designed for image captioning.
    - ROUGE-L: measures the longest common subsequence between candidate and reference captions.
    - METEOR: accounts for synonymy and word-order alignment (skipped in test mode).
    - BERTScore: an embedding-based metric using DistilBERT to compute precision, recall, and F1 (used only in `full` mode).
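As a quick illustration of the BLEU portion of these metrics, the sketch below computes corpus-level BLEU-1 and BLEU-4 with NLTK. It is not the repository's evaluation code, and the smoothing choice is an assumption.

```python
# Illustrative BLEU-1/BLEU-4 computation with NLTK (not the repository's exact pipeline).
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Each hypothesis is paired with a list of reference captions (all tokenized).
references = [[["a", "dog", "runs", "on", "the", "beach"],
               ["a", "brown", "dog", "running", "along", "the", "shore"]]]
hypotheses = [["a", "dog", "runs", "along", "the", "beach"]]

smooth = SmoothingFunction().method1   # avoids zero scores on very short texts
bleu1 = corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0), smoothing_function=smooth)
bleu4 = corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smooth)
print(f"BLEU-1: {bleu1:.4f}, BLEU-4: {bleu4:.4f}")
```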
- `generate.py`:
  - Loads a trained `CaptioningTransformer` model (expects a raw `state_dict` such as `model_best.pt`) and its vocabulary.
  - Loads the CLIP model and processor.
  - Takes an image path as input.
  - Processes the image and extracts its features using CLIP.
  - Generates a caption for the image features with the trained captioning model. (A greedy-decoding sketch follows this item.)
  - Converts the generated token IDs back to words and formats the caption for display.
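For context on the default decoding strategy, here is a minimal greedy-decoding loop of the kind the description implies. The model's call signature and the special-token names mirror the architecture section but remain assumptions about the exact implementation.

```python
# Minimal greedy-decoding sketch (the model's call signature is an assumption).
import torch

@torch.no_grad()
def greedy_decode_sketch(model, image_feature, vocab, max_len=40, device="cpu"):
    inv_vocab = {idx: word for word, idx in vocab.items()}
    tokens = [vocab["<start>"]]
    image_feature = image_feature.unsqueeze(0).to(device)        # (1, feature_dim)
    for _ in range(max_len):
        tgt = torch.tensor([tokens], dtype=torch.long, device=device)
        logits = model(image_feature, tgt)                       # (1, len(tokens), vocab_size)
        next_id = logits[0, -1].argmax().item()                  # pick the most probable next token
        if next_id == vocab["<end>"]:
            break
        tokens.append(next_id)
    return " ".join(inv_vocab[i] for i in tokens[1:])             # drop <start>
```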
- `test.py`:
  - Loads a trained captioning model and its vocabulary.
  - Loads the CLIP image encoder.
  - Processes the test set (`test_set.json`) and generates captions, optionally using beam search decoding.
  - Saves the generated captions and computes evaluation metrics for the entire test set under `tests_and_results/YYYY-MM-DD/`.
- `utils.py`:
  - Contains helper functions, such as `setup_nltk_resources` to check for and download NLTK's `punkt` tokenizer data, and `load_captions_csv` to load the captions file with error handling. (A sketch of the NLTK check pattern follows this item.)
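A resource check of this kind typically looks like the sketch below, which looks for the `punkt` tokenizer and downloads it if missing; this is a generic pattern, not the repository's exact function.

```python
# Generic NLTK resource check (a common pattern; not the repository's exact function).
import nltk

def setup_nltk_resources_sketch():
    try:
        nltk.data.find("tokenizers/punkt")   # raises LookupError when punkt is not installed
    except LookupError:
        nltk.download("punkt")
```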
- `run_pipeline.sh`:
  - A shell script that automates running `preprocess_data.py`, then `extract_features.py`, and finally `train.py` in sequence.
- `image-to-text.sh`:
  - A shell script that runs `generate.py` with pre-set arguments to generate a caption for a sample image.
- `helpers/logging_util.py`:
  - Automatically creates a `logging/` directory inside your project's `data/` folder.
  - Saves output to a file named `log_<timestamp>.log` for easy traceability.
  - Dual output: logs to the console and to a file in `data/logging/`. (A sketch of such a logger follows this item.)
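A dual console-and-file logger can be set up as in the sketch below; the directory, logger name, and format string are assumptions about the helper, not its exact code.

```python
# Sketch of a dual console/file logger (directory, logger name, and format are assumptions).
import logging
import os
from datetime import datetime

def setup_logger_sketch(log_dir="../data/logging"):
    os.makedirs(log_dir, exist_ok=True)
    log_path = os.path.join(log_dir, f"log_{datetime.now():%Y%m%d_%H%M%S}.log")
    logger = logging.getLogger("image_captioning")
    logger.setLevel(logging.INFO)
    formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
    for handler in (logging.StreamHandler(), logging.FileHandler(log_path)):
        handler.setFormatter(formatter)
        logger.addHandler(handler)       # log to both the console and the timestamped file
    return logger
```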
- `helpers/visualization_util.py`:
  - Plots training and validation loss curves over epochs. (A plotting sketch follows this item.)
  - Saves the figure to the specified directory.
  - Saves per-epoch training/validation losses and evaluation scores to a JSON file.
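The plotting part can be as simple as the matplotlib sketch below; the output filename mirrors those listed in the Final Model Result section, but the function itself is an assumption.

```python
# Sketch of a loss-curve plot saved to the visualization folder (function is illustrative).
import os
import matplotlib.pyplot as plt

def plot_losses_sketch(train_losses, val_losses, out_dir="../visualization", timestamp="run"):
    os.makedirs(out_dir, exist_ok=True)
    epochs = range(1, len(train_losses) + 1)
    plt.figure()
    plt.plot(epochs, train_losses, label="train loss")
    plt.plot(epochs, val_losses, label="validation loss")
    plt.xlabel("epoch")
    plt.ylabel("cross-entropy loss")
    plt.legend()
    plt.savefig(os.path.join(out_dir, f"train_val_loss_{timestamp}.png"))
    plt.close()
```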
## Model Architectures

The `CaptioningTransformer` model is an image captioning system based on the Transformer architecture. It uses only the decoder part of a standard Transformer, conditioned on image features.
Key Components:
- Image Feature Input:
  - Source: pre-extracted global image features from a CLIP model (e.g., `openai/clip-vit-base-patch32`, which yields a 512-dimensional vector per image).
  - Projection (`self.image_feature_proj`): these features are passed through a linear layer (`nn.Linear(feature_dim, d_model)`) that projects them into the `d_model` dimensionality used by the Transformer decoder. The projected feature vector acts as the `memory` input to the decoder's cross-attention mechanism, providing the visual context.
- Caption Input (Target Sequence for Training):
  - Token Embedding (`self.caption_embed`): input caption tokens (word IDs from the vocabulary) are converted into dense vector representations of size `d_model` using an `nn.Embedding` layer. The output is scaled by `sqrt(d_model)`.
  - Positional Encoding (`self.pos_encoder`): sinusoidal positional encodings are added to the token embeddings to provide the model with information about the order of tokens in the sequence.
- Transformer Decoder (`self.transformer_decoder`):
  - This is the core component responsible for generating the caption sequence autoregressively.
  - It consists of a stack of `num_decoder_layers` (default: 6) identical `nn.TransformerDecoderLayer` modules.
  - Each `nn.TransformerDecoderLayer` contains:
    - Masked Multi-Head Self-Attention: attends to the previously generated tokens in the caption. A square subsequent mask (causal mask) ensures that a token at a given position can only attend to tokens at preceding positions (and itself), maintaining the autoregressive property.
    - Multi-Head Cross-Attention (Encoder-Decoder Attention): attends to the projected image features (`memory`). This allows the model to incorporate visual information from the image when predicting each token of the caption.
    - Feed-Forward Network (FFN): a position-wise fully connected feed-forward network with ReLU activation (typically two linear layers).
  - The `batch_first=True` argument is used, so input/output tensors are expected in the shape `(batch_size, sequence_length, feature_dim)`.
- Output Layer (`self.fc_out`):
  - The output of the final Transformer decoder layer (at each token position) is passed through a linear layer (`nn.Linear(d_model, vocab_size)`).
  - This layer produces logits (unnormalized scores) over the vocabulary for each position in the sequence.
  - During inference, a softmax can be applied to these logits to obtain probabilities, and a decoding strategy (e.g., greedy search or beam search) selects the next token.
Hyperparameters:
- `vocab_size`: size of the vocabulary.
- `feature_dim`: dimensionality of the input image features.
- `d_model`: the main internal dimensionality of the Transformer layers (embeddings, attention outputs, FFN inputs/outputs).
- `nhead`: number of attention heads in the multi-head attention mechanisms (default: 8).
- `num_decoder_layers`: number of stacked decoder layers (default: 6).
- `dim_feedforward`: dimensionality of the hidden layer in the FFNs (default: 2048).
- `dropout`: dropout rate applied within the Transformer layers (default: 0.1).
- `max_seq_length`: maximum sequence length for positional encodings (default: 40, i.e., the Flickr8k maximum caption length of 38 plus the `<start>` and `<end>` tokens).
Weight Initialization (`_init_weights`):

- The embedding layer and the linear layers (image projection, final output) are initialized from a uniform distribution (`uniform_(-initrange, initrange)` with `initrange = 0.1`). Biases are initialized to zero.
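Putting the pieces above together, a decoder-only captioning model along these lines might look like the following sketch. It mirrors the described components, hyperparameters, and weight initialization, but it is an illustrative reconstruction rather than the repository's exact `model.py`; in particular, the `d_model` default of 512 is an assumption (only `feature_dim = 512` is fixed by the CLIP checkpoint).

```python
# Illustrative reconstruction of the described architecture (not the repository's exact code).
import math
import torch
import torch.nn as nn

class PositionalEncodingSketch(nn.Module):
    """Standard sinusoidal positional encoding, batch-first layout."""
    def __init__(self, d_model, max_len=40):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe.unsqueeze(0))               # (1, max_len, d_model)

    def forward(self, x):                                         # x: (batch, seq_len, d_model)
        return x + self.pe[:, :x.size(1)]

class CaptioningTransformerSketch(nn.Module):
    def __init__(self, vocab_size, feature_dim=512, d_model=512, nhead=8,
                 num_decoder_layers=6, dim_feedforward=2048, dropout=0.1, max_seq_length=40):
        super().__init__()
        self.d_model = d_model
        self.caption_embed = nn.Embedding(vocab_size, d_model)
        self.pos_encoder = PositionalEncodingSketch(d_model, max_seq_length)
        self.image_feature_proj = nn.Linear(feature_dim, d_model)   # CLIP feature -> decoder memory
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=d_model, nhead=nhead, dim_feedforward=dim_feedforward,
            dropout=dropout, batch_first=True)
        self.transformer_decoder = nn.TransformerDecoder(decoder_layer, num_decoder_layers)
        self.fc_out = nn.Linear(d_model, vocab_size)
        self._init_weights()

    def _init_weights(self, initrange=0.1):
        self.caption_embed.weight.data.uniform_(-initrange, initrange)
        for layer in (self.image_feature_proj, self.fc_out):
            layer.weight.data.uniform_(-initrange, initrange)
            layer.bias.data.zero_()

    def forward(self, image_features, captions):
        # image_features: (batch, feature_dim); captions: (batch, seq_len) of token IDs.
        memory = self.image_feature_proj(image_features).unsqueeze(1)        # (batch, 1, d_model)
        tgt = self.pos_encoder(self.caption_embed(captions) * math.sqrt(self.d_model))
        seq_len = captions.size(1)
        causal_mask = torch.triu(                                            # square subsequent mask
            torch.full((seq_len, seq_len), float("-inf"), device=captions.device), diagonal=1)
        decoded = self.transformer_decoder(tgt=tgt, memory=memory, tgt_mask=causal_mask)
        return self.fc_out(decoded)                                          # (batch, seq_len, vocab_size)
```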
## Final Model Result

The final best model:

- Train log: `logging/log_20250807_154532.log`
- Per-epoch results: `visualization/train_val_results_20250807_191208.json`
- Visualization charts for per-epoch loss and validation metrics:
  - `visualization/val_score_20250807_191208.png`
  - `visualization/train_val_loss_20250807_191208.png`
- Test set results: `tests_and_results/2025-08-07/`
  - Average evaluation results on the test set: BLEU-1: 0.6505, BLEU-4: 0.2341, CIDEr: 0.6102, METEOR: 0.2201, ROUGE-L: 0.4934, BERTScore-F1: 0.5685