This project implements an image captioning model using a Transformer-based architecture. It takes an image as input and generates a descriptive textual caption. The model uses pre-trained CLIP features for image understanding and a Transformer decoder for text generation.
- Project Structure
- Setup Instructions
- Running the Pipeline
- Script Descriptions
- Model Architectures
- Final Model Result
## Project Structure

```
Image-to-Text-Generation/
├── .gitignore
├── README.md
├── requirements.txt
├── data/
│   ├── flickr8k_images/
│   │   ├── Images/              # <-- Raw JPG images (approx. 8000)
│   │   └── captions.txt         # <-- Original captions file from Flickr8k
│   ├── clip_features/           # Generated by extract_features.py
│   │   ├── train/
│   │   ├── val/
│   │   └── test/
│   └── processed/               # Generated by preprocess_data.py
│       ├── vocab.json
│       ├── train_set.json
│       ├── val_set.json
│       └── test_set.json
├── logging/                     # Generated by train.py
├── src/
│   ├── helpers/                 # Helpers for logging and result visualization
│   ├── preprocess_data.py
│   ├── extract_features.py
│   ├── model.py
│   ├── model_config.yaml        # Training configuration
│   ├── train.py
│   ├── test.py                  # Run the trained model on the entire test set
│   ├── generate.py              # Run the trained model on an individual input image
│   ├── utils.py
│   ├── run_pipeline.sh          # Script to run the full data prep and training
│   ├── image-to-text.sh         # Script to generate a caption for an image
│   └── evaluate_gen_result.py
├── saved_models/                # Generated by train.py (contains subdirectories for each run)
│   └── run_YYYYMMDD-HHMMSS/
│       ├── model_best.pt
│       └── model_epoch_N.pt
├── visualization/               # Curves and any visualization results generated by train.py
└── tests_and_results/           # Test set results
    └── YYYY-MM-DD/
```
## Setup Instructions

Prerequisites:

- Python 3.8 or higher.
- `pip` for installing packages.

This project uses the Flickr8k dataset.

- Download:
  - Flickr8k_Dataset (Images): available on Kaggle and other academic dataset repositories (search for "Flickr8k"). It usually comes as a `Flickr8k_Dataset.zip` containing the `Images` folder.
  - Flickr8k_text (Captions): this usually contains `Flickr8k.token.txt`, but this project expects `captions.txt` in a specific CSV format (`image,caption`). The `preprocess_data.py` script included in this repository reads `data/flickr8k_images/captions.txt`. Ensure your `captions.txt` has two columns, `image` and `caption`, with each image having multiple caption rows.

    A common source for this file is a pre-processed version of the Flickr8k text data, or you may need to adapt `Flickr8k.token.txt` yourself (see the sketch after the folder structure below). The expected `captions.txt` format:

    ```
    image_name_1.jpg,First caption for image 1
    image_name_1.jpg,Second caption for image 1
    ...
    image_name_2.jpg,First caption for image 2
    ```

- Folder structure:
  - Create the directory `data/flickr8k_images/` in the project root.
  - Extract/place the downloaded `Images` folder into `data/flickr8k_images/`, so you have `data/flickr8k_images/Images/`.
  - Place your `captions.txt` file directly into `data/flickr8k_images/`.

  The final structure should be:

  ```
  data/
  └── flickr8k_images/
      ├── Images/
      │   ├── 1000268201_693b08cb0e.jpg
      │   ├── ... (all ~8000 images)
      └── captions.txt
  ```
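If you only have `Flickr8k.token.txt` (where each line looks like `<image>.jpg#<n><TAB><caption>`), a small script can convert it to the expected CSV. The snippet below is a minimal sketch, not part of the repository; the input/output paths and the header row are assumptions, so adjust them to whatever `preprocess_data.py` actually expects.

```python
# convert_tokens_to_captions.py -- minimal sketch (not part of the repository).
# Converts Flickr8k.token.txt ("<image>.jpg#<n>\t<caption>") into the
# "image,caption" CSV format that captions.txt is described to use.
import csv

token_file = "Flickr8k.token.txt"                    # assumed input path
output_file = "data/flickr8k_images/captions.txt"    # location this project reads from

with open(token_file, encoding="utf-8") as fin, \
     open(output_file, "w", newline="", encoding="utf-8") as fout:
    writer = csv.writer(fout)
    writer.writerow(["image", "caption"])            # header row (assumption)
    for line in fin:
        line = line.strip()
        if not line:
            continue
        image_tag, caption = line.split("\t", 1)     # "<image>.jpg#<n>" and the caption text
        image_name = image_tag.split("#")[0]         # drop the "#<n>" caption index
        writer.writerow([image_name, caption])
```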
Navigate to the project root directory in your terminal and run:

```bash
pip install -r requirements.txt
```

This installs PyTorch, Transformers, NLTK, Pillow, and the other necessary libraries. The scripts will also attempt to download NLTK's `punkt` tokenizer data if it is not found.
## Running the Pipeline

All scripts should be run from the `src/` directory.

The `run_pipeline.sh` script automates the entire process of preparing the data, extracting image features, and training the captioning model.

- Navigate to the `src` directory:

  ```bash
  cd src
  ```

- Make the script executable (if not already):

  ```bash
  chmod +x run_pipeline.sh
  ```

- Run the script:

  ```bash
  ./run_pipeline.sh
  ```

This will:

- Run `preprocess_data.py` to clean captions, build a vocabulary, and create train/validation/test splits. Processed data is saved in `data/processed/`.
- Run `extract_features.py` to generate CLIP image features for all images in the dataset splits. Features are saved in `data/clip_features/`.
- Run `train.py` to train the `CaptioningTransformer` model. Checkpoints and the best model are saved in a timestamped subdirectory within `saved_models/`.
To evaluate a trained model on the test set, navigate to the `src` directory and run `test.py` with the required arguments. For example:

```bash
python test.py \
    --best_model_path ../saved_models/run_{timestamp}/model_best.pt \
    --clip_model_name openai/XXXX \
    --device cuda \
    --beam_search \
    --eval_mode full \
    --max_len 40
```

- `--best_model_path`: full path to the trained model checkpoint.
- `--clip_model_name`: defaults to `clip-vit-base-patch32`.
- `--beam_search`: optional flag; include it only if you want to use beam search decoding.

This command loads the processed test set from `data/processed/test_set.json` and the vocabulary from `data/processed/vocab.json`. It also loads the trained model and the CLIP model for feature extraction, then generates captions for each test image using either greedy decoding (the default) or beam search (if the `--beam_search` flag is used).

The generated captions for each image are saved to a timestamped file under `tests_and_results/YYYY-MM-DD/`. After generation, the script computes evaluation metrics and prints the average scores to the console.
The `image-to-text.sh` script loads a trained model and generates a caption for a specified image.

- Navigate to the `src` directory:

  ```bash
  cd src
  ```

- Make the script executable (if not already):

  ```bash
  chmod +x image-to-text.sh
  ```

- Modify `image-to-text.sh` (optional): the script is pre-configured with a sample image path and a model path. You may need to update these paths:

  ```bash
  python generate.py \
      --image_path ../data/flickr8k_images/Images/YOUR_IMAGE.jpg \
      --model_path ../saved_models/YOUR_RUN_DIR/model_best.pt \
      --vocab_path ../data/processed/vocab.json \
      --clip_model_name openai/XXXX \
      --max_len 40 \
      --device cuda \
      --beam_search
  ```

  - `--clip_model_name`: defaults to `clip-vit-base-patch32`.
  - `--max_len`: maximum length of the generated caption.
  - `--device`: `cuda` or `cpu`.
  - `--beam_search`: optional flag; include it only if you want to use beam search decoding.

  Replace `YOUR_IMAGE.jpg` with the image you want to test and `YOUR_RUN_DIR` with the specific run directory in `saved_models/` containing your trained `model_best.pt`.

- Run the script:

  ```bash
  ./image-to-text.sh
  ```

This will output the generated caption to the console.
## Script Descriptions

Located in the `src/` directory:
- `preprocess_data.py`:
  - Loads captions from `data/flickr8k_images/captions.txt`.
  - Cleans captions: converts to lowercase, removes punctuation, and tokenizes using NLTK.
  - Builds a vocabulary (`vocab.json`) from the training captions, mapping words to unique integer IDs. Special tokens (`<pad>`, `<start>`, `<end>`, `<unk>`) are included. Words occurring fewer than `MIN_WORD_FREQ` times are excluded. (See the vocabulary-building sketch after this item.)
  - Splits the dataset into training, validation, and test sets (`train_set.json`, `val_set.json`, `test_set.json`). These JSON files map image IDs to their respective image paths and tokenized captions.
  - Saves all processed data into `data/processed/`.
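As an illustration of the vocabulary step, here is a minimal sketch of how a frequency-thresholded vocabulary with the four special tokens can be built. It is not the repository's implementation; the function name and the exact special-token order/IDs are assumptions.

```python
# Minimal vocabulary-building sketch (assumed helper, not the repository's exact code).
from collections import Counter

def build_vocab(tokenized_captions, min_word_freq=5):
    """Map words appearing at least `min_word_freq` times to integer IDs.

    `tokenized_captions` is a list of token lists taken from the *training* captions.
    """
    counts = Counter(tok for caption in tokenized_captions for tok in caption)
    vocab = {"<pad>": 0, "<start>": 1, "<end>": 2, "<unk>": 3}   # special-token IDs are an assumption
    for word, freq in counts.most_common():
        if freq >= min_word_freq and word not in vocab:
            vocab[word] = len(vocab)
    return vocab

# Example: rare or unseen words fall back to <unk> when numericalizing.
vocab = build_vocab([["a", "dog", "runs"], ["a", "dog", "sleeps"]], min_word_freq=2)
ids = [vocab.get(tok, vocab["<unk>"]) for tok in ["a", "dog", "jumps"]]
```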
- `extract_features.py`:
  - Loads a pre-trained CLIP model (`openai/clip-vit-base-patch32` by default) and its processor.
  - Iterates through the images specified in `train_set.json`, `val_set.json`, and `test_set.json`.
  - For each image, extracts the global image features using CLIP. (A sketch of this step follows this item.)
  - Saves these features as NumPy arrays (`.npy` files) in `data/clip_features/{train|val|test}/`.
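The core of the extraction step looks roughly like the sketch below, which uses the Hugging Face `transformers` CLIP API to produce one 512-dimensional vector per image. The file paths and the `.npy` naming convention are illustrative assumptions, not the script's exact code.

```python
# Sketch of global CLIP feature extraction (paths and file naming are assumptions).
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_name = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(model_name).eval()
processor = CLIPProcessor.from_pretrained(model_name)

image = Image.open("../data/flickr8k_images/Images/1000268201_693b08cb0e.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    features = model.get_image_features(**inputs)   # shape (1, 512) for this checkpoint

# Save one .npy file per image so the training code can load precomputed features.
np.save("../data/clip_features/train/1000268201_693b08cb0e.npy", features.squeeze(0).numpy())
```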
- `model.py`:
  - `ImageCaptionDataset`: a PyTorch `Dataset` class. It loads precomputed image features and the corresponding tokenized captions, numericalizes the captions by converting tokens to IDs using the vocabulary, and adds the `<start>` and `<end>` tokens. (A minimal sketch follows this item.)
  - `PositionalEncoding`: standard Transformer positional encoding module to inject sequence-order information.
  - `CaptioningTransformer`: the main image captioning model. It consists of:
    - An embedding layer for caption tokens.
    - A positional encoder.
    - A linear projection layer for the input image features (to match `d_model`).
    - A stack of Transformer decoder layers.
    - A final linear layer to output logits over the vocabulary.
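For orientation, a dataset along the lines described above might look like the sketch below. The field names in the split JSON, the feature file naming, and the return format are assumptions, not the repository's class.

```python
# Minimal sketch of a dataset over precomputed CLIP features (JSON field names are assumptions).
import json
import numpy as np
import torch
from torch.utils.data import Dataset

class ImageCaptionDatasetSketch(Dataset):
    def __init__(self, split_json, features_dir, vocab):
        with open(split_json) as f:
            entries = json.load(f)
        # Flatten to one (image_id, tokenized_caption) pair per sample.
        self.samples = [(img_id, caption)
                        for img_id, info in entries.items()
                        for caption in info["captions"]]
        self.features_dir = features_dir
        self.vocab = vocab

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        img_id, tokens = self.samples[idx]
        feature = np.load(f"{self.features_dir}/{img_id}.npy")          # (feature_dim,)
        ids = ([self.vocab["<start>"]]
               + [self.vocab.get(t, self.vocab["<unk>"]) for t in tokens]
               + [self.vocab["<end>"]])
        return torch.from_numpy(feature).float(), torch.tensor(ids, dtype=torch.long)
```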
- `model_config.yaml`:
  - Defines the core settings required to run the model and the training process, including:
    - Paths: directories and filenames for training/validation data.
    - Model hyperparameters.
    - Training hyperparameters: learning rate, batch size, number of epochs, etc.
  - Update the values as needed to point to your data directories or to adjust model behavior.
- `train.py`:
  - Defines the training and validation loops. (A condensed sketch of one training epoch follows this item.)
  - Loads the vocabulary and datasets (`ImageCaptionDataset`) and creates DataLoaders.
  - Initializes the `CaptioningTransformer` model, the optimizer (AdamW), and the loss function (CrossEntropyLoss, ignoring padding).
  - `train_epoch`: handles one epoch of training, including the forward pass, loss calculation, backpropagation, gradient clipping, and the optimizer step.
  - `validate_epoch`: evaluates the model on the validation set, calculating the average loss and BLEU-4 score. For the BLEU score, captions are generated greedily.
  - The main training loop iterates for a configured number of epochs, calling `train_epoch` and `validate_epoch`.
  - Saves a model checkpoint (including the optimizer state and training configuration) for each epoch, and separately saves the model with the best CIDEr score (`model_best.pt`).
  - Includes early stopping based on validation loss.
  - Saves plots and any visualization results to the `visualization` folder.
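To make the training-loop description concrete, here is a condensed sketch of one epoch with teacher forcing, padding-ignoring cross-entropy, and gradient clipping. The argument names, batch layout, and `pad_id` handling are assumptions rather than the repository's exact signatures.

```python
# Condensed sketch of one training epoch (argument names and batch layout are assumptions).
import torch
import torch.nn as nn

def train_epoch_sketch(model, loader, optimizer, pad_id, device, clip_norm=1.0):
    criterion = nn.CrossEntropyLoss(ignore_index=pad_id)    # padding tokens do not contribute
    model.train()
    total_loss = 0.0
    for features, captions in loader:                       # captions: (batch, seq_len) token IDs
        features, captions = features.to(device), captions.to(device)
        inputs, targets = captions[:, :-1], captions[:, 1:] # teacher forcing: predict the next token
        logits = model(features, inputs)                     # (batch, seq_len - 1, vocab_size)
        loss = criterion(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)  # gradient clipping
        optimizer.step()
        total_loss += loss.item()
    return total_loss / max(len(loader), 1)
```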
- `evaluate_gen_result.py`:
  - Contains the caption evaluation pipeline for comparing generated image captions against reference texts using multiple NLP metrics (a short BLEU example follows this item):
    - BLEU-1 and BLEU-4: evaluate n-gram overlap between generated and reference captions.
    - CIDEr: a consensus-based metric that emphasizes content similarity weighted by TF-IDF, designed for image captioning.
    - ROUGE-L: measures the longest common subsequence between candidate and reference captions.
    - METEOR: accounts for synonymy and word-order alignment (skipped in test mode).
    - BERTScore: an embedding-based metric using DistilBERT to compute precision, recall, and F1 (used only in `full` mode).
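As a quick illustration of the BLEU portion of these metrics, the sketch below computes corpus-level BLEU-1 and BLEU-4 with NLTK. It is not the repository's evaluation code, and the smoothing choice is an assumption.

```python
# Illustrative BLEU-1/BLEU-4 computation with NLTK (not the repository's exact pipeline).
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Each hypothesis is paired with a list of reference captions (all tokenized).
references = [[["a", "dog", "runs", "on", "the", "beach"],
               ["a", "brown", "dog", "running", "along", "the", "shore"]]]
hypotheses = [["a", "dog", "runs", "along", "the", "beach"]]

smooth = SmoothingFunction().method1   # avoids zero scores on very short texts
bleu1 = corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0), smoothing_function=smooth)
bleu4 = corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smooth)
print(f"BLEU-1: {bleu1:.4f}, BLEU-4: {bleu4:.4f}")
```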
- `generate.py`:
  - Loads a trained `CaptioningTransformer` model (expects a raw `state_dict` such as `model_best.pt`) and its vocabulary.
  - Loads the CLIP model and processor.
  - Takes an image path as input.
  - Processes the image and extracts its features using CLIP.
  - Generates a caption for the image features with the trained captioning model. (A greedy-decoding sketch follows this item.)
  - Converts the generated token IDs back to words and formats the caption for display.
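For context on the default decoding strategy, here is a minimal greedy-decoding loop of the kind the description implies. The model's call signature and the special-token names mirror the architecture section but remain assumptions about the exact implementation.

```python
# Minimal greedy-decoding sketch (the model's call signature is an assumption).
import torch

@torch.no_grad()
def greedy_decode_sketch(model, image_feature, vocab, max_len=40, device="cpu"):
    inv_vocab = {idx: word for word, idx in vocab.items()}
    tokens = [vocab["<start>"]]
    image_feature = image_feature.unsqueeze(0).to(device)        # (1, feature_dim)
    for _ in range(max_len):
        tgt = torch.tensor([tokens], dtype=torch.long, device=device)
        logits = model(image_feature, tgt)                       # (1, len(tokens), vocab_size)
        next_id = logits[0, -1].argmax().item()                  # pick the most probable next token
        if next_id == vocab["<end>"]:
            break
        tokens.append(next_id)
    return " ".join(inv_vocab[i] for i in tokens[1:])             # drop <start>
```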
- `test.py`:
  - Loads a trained captioning model and its vocabulary.
  - Loads the CLIP image encoder.
  - Processes the test set (`test_set.json`) and generates captions, optionally using beam search decoding.
  - Saves the generated captions and computes evaluation metrics for the entire test set under `tests_and_results/YYYY-MM-DD/`.
- `utils.py`:
  - Contains helper functions, such as `setup_nltk_resources` to check for and download NLTK's `punkt` tokenizer data, and `load_captions_csv` to load the captions file with error handling. (A sketch of the NLTK check pattern follows this item.)
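A resource check of this kind typically looks like the sketch below, which looks for the `punkt` tokenizer and downloads it if missing; this is a generic pattern, not the repository's exact function.

```python
# Generic NLTK resource check (a common pattern; not the repository's exact function).
import nltk

def setup_nltk_resources_sketch():
    try:
        nltk.data.find("tokenizers/punkt")   # raises LookupError when punkt is not installed
    except LookupError:
        nltk.download("punkt")
```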
- `run_pipeline.sh`:
  - A shell script that automates running `preprocess_data.py`, then `extract_features.py`, and finally `train.py` in sequence.
- `image-to-text.sh`:
  - A shell script that runs `generate.py` with pre-set arguments to generate a caption for a sample image.
- `helpers/logging_util.py`:
  - Automatically creates a `logging/` directory inside your project's `data/` folder.
  - Saves output to a file named `log_<timestamp>.log` for easy traceability.
  - Dual output: logs to the console and to a file in `data/logging/`. (A sketch of such a logger follows this item.)
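A dual console-and-file logger can be set up as in the sketch below; the directory, logger name, and format string are assumptions about the helper, not its exact code.

```python
# Sketch of a dual console/file logger (directory, logger name, and format are assumptions).
import logging
import os
from datetime import datetime

def setup_logger_sketch(log_dir="../data/logging"):
    os.makedirs(log_dir, exist_ok=True)
    log_path = os.path.join(log_dir, f"log_{datetime.now():%Y%m%d_%H%M%S}.log")
    logger = logging.getLogger("image_captioning")
    logger.setLevel(logging.INFO)
    formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
    for handler in (logging.StreamHandler(), logging.FileHandler(log_path)):
        handler.setFormatter(formatter)
        logger.addHandler(handler)       # log to both the console and the timestamped file
    return logger
```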
- `helpers/visualization_util.py`:
  - Plots training and validation loss curves over epochs. (A plotting sketch follows this item.)
  - Saves the figure to the specified directory.
  - Saves per-epoch training/validation losses and evaluation scores to a JSON file.
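The plotting part can be as simple as the matplotlib sketch below; the output filename mirrors those listed in the Final Model Result section, but the function itself is an assumption.

```python
# Sketch of a loss-curve plot saved to the visualization folder (function is illustrative).
import os
import matplotlib.pyplot as plt

def plot_losses_sketch(train_losses, val_losses, out_dir="../visualization", timestamp="run"):
    os.makedirs(out_dir, exist_ok=True)
    epochs = range(1, len(train_losses) + 1)
    plt.figure()
    plt.plot(epochs, train_losses, label="train loss")
    plt.plot(epochs, val_losses, label="validation loss")
    plt.xlabel("epoch")
    plt.ylabel("cross-entropy loss")
    plt.legend()
    plt.savefig(os.path.join(out_dir, f"train_val_loss_{timestamp}.png"))
    plt.close()
```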
## Model Architectures

The `CaptioningTransformer` model is an image captioning system based on the Transformer architecture. It uses only the decoder part of a standard Transformer, conditioned on image features.
Key Components:
- Image Feature Input:
  - Source: pre-extracted global image features from a CLIP model (e.g., `openai/clip-vit-base-patch32`, which yields a 512-dimensional vector per image).
  - Projection (`self.image_feature_proj`): these features are passed through a linear layer (`nn.Linear(feature_dim, d_model)`) that projects them into the `d_model` dimensionality used by the Transformer decoder. The projected feature vector acts as the `memory` input to the decoder's cross-attention mechanism, providing the visual context.
- Caption Input (Target Sequence for Training):
  - Token Embedding (`self.caption_embed`): input caption tokens (word IDs from the vocabulary) are converted into dense vector representations of size `d_model` using an `nn.Embedding` layer. The output is scaled by `sqrt(d_model)`.
  - Positional Encoding (`self.pos_encoder`): sinusoidal positional encodings are added to the token embeddings to provide the model with information about the order of tokens in the sequence.
- Transformer Decoder (`self.transformer_decoder`):
  - This is the core component responsible for generating the caption sequence autoregressively.
  - It consists of a stack of `num_decoder_layers` (default: 6) identical `nn.TransformerDecoderLayer` modules.
  - Each `nn.TransformerDecoderLayer` contains:
    - Masked Multi-Head Self-Attention: attends to the previously generated tokens in the caption. A square subsequent mask (causal mask) ensures that a token at a given position can only attend to tokens at preceding positions (and itself), maintaining the autoregressive property.
    - Multi-Head Cross-Attention (Encoder-Decoder Attention): attends to the projected image features (`memory`). This allows the model to incorporate visual information from the image when predicting each token of the caption.
    - Feed-Forward Network (FFN): a position-wise fully connected feed-forward network with ReLU activation (typically two linear layers).
  - The `batch_first=True` argument is used, so input/output tensors are expected in the shape `(batch_size, sequence_length, feature_dim)`.
- Output Layer (`self.fc_out`):
  - The output of the final Transformer decoder layer (at each token position) is passed through a linear layer (`nn.Linear(d_model, vocab_size)`).
  - This layer produces logits (unnormalized scores) over the vocabulary for each position in the sequence.
  - During inference, a softmax can be applied to these logits to obtain probabilities, and a decoding strategy (e.g., greedy search or beam search) selects the next token.
Hyperparameters:
- `vocab_size`: size of the vocabulary.
- `feature_dim`: dimensionality of the input image features.
- `d_model`: the main internal dimensionality of the Transformer layers (embeddings, attention outputs, FFN inputs/outputs).
- `nhead`: number of attention heads in the multi-head attention mechanisms (default: 8).
- `num_decoder_layers`: number of stacked decoder layers (default: 6).
- `dim_feedforward`: dimensionality of the hidden layer in the FFNs (default: 2048).
- `dropout`: dropout rate applied within the Transformer layers (default: 0.1).
- `max_seq_length`: maximum sequence length for positional encodings (default: 40, i.e., the Flickr8k maximum caption length of 38 plus the `<start>` and `<end>` tokens).
Weight Initialization (`_init_weights`):

- The embedding layer and the linear layers (image projection, final output) are initialized from a uniform distribution (`uniform_(-initrange, initrange)` with `initrange = 0.1`). Biases are initialized to zero.
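Putting the pieces above together, a decoder-only captioning model along these lines might look like the following sketch. It mirrors the described components, hyperparameters, and weight initialization, but it is an illustrative reconstruction rather than the repository's exact `model.py`; in particular, the `d_model` default of 512 is an assumption (only `feature_dim = 512` is fixed by the CLIP checkpoint).

```python
# Illustrative reconstruction of the described architecture (not the repository's exact code).
import math
import torch
import torch.nn as nn

class PositionalEncodingSketch(nn.Module):
    """Standard sinusoidal positional encoding, batch-first layout."""
    def __init__(self, d_model, max_len=40):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe.unsqueeze(0))               # (1, max_len, d_model)

    def forward(self, x):                                         # x: (batch, seq_len, d_model)
        return x + self.pe[:, :x.size(1)]

class CaptioningTransformerSketch(nn.Module):
    def __init__(self, vocab_size, feature_dim=512, d_model=512, nhead=8,
                 num_decoder_layers=6, dim_feedforward=2048, dropout=0.1, max_seq_length=40):
        super().__init__()
        self.d_model = d_model
        self.caption_embed = nn.Embedding(vocab_size, d_model)
        self.pos_encoder = PositionalEncodingSketch(d_model, max_seq_length)
        self.image_feature_proj = nn.Linear(feature_dim, d_model)   # CLIP feature -> decoder memory
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=d_model, nhead=nhead, dim_feedforward=dim_feedforward,
            dropout=dropout, batch_first=True)
        self.transformer_decoder = nn.TransformerDecoder(decoder_layer, num_decoder_layers)
        self.fc_out = nn.Linear(d_model, vocab_size)
        self._init_weights()

    def _init_weights(self, initrange=0.1):
        self.caption_embed.weight.data.uniform_(-initrange, initrange)
        for layer in (self.image_feature_proj, self.fc_out):
            layer.weight.data.uniform_(-initrange, initrange)
            layer.bias.data.zero_()

    def forward(self, image_features, captions):
        # image_features: (batch, feature_dim); captions: (batch, seq_len) of token IDs.
        memory = self.image_feature_proj(image_features).unsqueeze(1)        # (batch, 1, d_model)
        tgt = self.pos_encoder(self.caption_embed(captions) * math.sqrt(self.d_model))
        seq_len = captions.size(1)
        causal_mask = torch.triu(                                            # square subsequent mask
            torch.full((seq_len, seq_len), float("-inf"), device=captions.device), diagonal=1)
        decoded = self.transformer_decoder(tgt=tgt, memory=memory, tgt_mask=causal_mask)
        return self.fc_out(decoded)                                          # (batch, seq_len, vocab_size)
```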
## Final Model Result

The final best model:

- Train log: `logging/log_20250807_154532.log`
- Per-epoch results: `visualization/train_val_results_20250807_191208.json`
- Visualization charts for per-epoch loss and validation metrics:
  - `visualization/val_score_20250807_191208.png`
  - `visualization/train_val_loss_20250807_191208.png`
- Test set results: `tests_and_results/2025-08-07/`
  - Average evaluation results on the test set: BLEU-1: 0.6505, BLEU-4: 0.2341, CIDEr: 0.6102, METEOR: 0.2201, ROUGE-L: 0.4934, BERTScore-F1: 0.5685