An end-to-end text-to-image diffusion model — from data preparation and caption generation through distributed training on AWS SageMaker, experiment tracking with MLflow, model optimization (ONNX / TensorRT), to automated deployment on Hugging Face Spaces via CI/CD.
🚀 Try it live on Hugging Face Spaces: Text-to-Image Flowers Generator
Model on HF Hub · Experiment Tracking · System Design
Generated with DDIM · 50 steps · CFG 5.0 · avg 2.37 s/image (GPU):
| ![]() | ![]() | ![]() | ![]() | ![]() |
|:---:|:---:|:---:|:---:|:---:|
| Yellow sunflower | Red rose | Purple lavender | White daisy | Cherry blossom |
The entire workflow — data prep, training, evaluation, export, and deployment — is orchestrated as a single reproducible DVC pipeline.
```mermaid
graph TD
    A[Oxford 102 Flowers\n~8K images] --> B[Florence-2-large\nCaption Generator]
    B --> B2[CLIP Embedding\nPrecomputation]
    B2 --> C[Upload to S3]
    C --> D[AWS SageMaker\nDistributed Training]
    D --> E[DeepSpeed ZeRO Stage 2\nFP16 Mixed Precision]
    E --> F[MLflow on DagShub\nExperiment Tracking]
    F --> G{val_diffuser_loss\nimproved?}
    G -- Yes --> H[Download Best Model]
    G -- No --> I[Skip downstream stages]
    H --> H2[Evaluate Model\nGenerate Samples]
    H2 --> J[ONNX Export\nCLIP + VAE + UNet]
    J --> J2[TensorRT Compile\nGPU-optimized engines]
    J2 --> K[Push to HF Hub]
    K --> L[HF Spaces\nDDIM · CPU Inference]
```
| # | Stage | What it does |
|---|---|---|
| 1 | `caption-generator` | Auto-generates detailed captions for ~8K flower images using Florence-2-large |
| 2 | `precompute-embeddings` | Pre-caches CLIP text embeddings to disk, eliminating per-batch CLIP inference |
| 3 | `data-push` | Uploads dataset (images + captions + embeddings) to S3 for SageMaker |
| 4 | `training` | Launches distributed DeepSpeed training on SageMaker (2 nodes) |
| 5 | `log_training_model` | Downloads best model from S3 — only if val loss improved |
| 6 | `evaluate` | Generates sample images and computes evaluation metrics |
| 7 | `onnx_convert` | Exports CLIP + VAE + UNet to ONNX for CPU inference |
| 8 | `tensorrt_convert` | Compiles ONNX models to TensorRT engines for GPU inference |
| 9 | `push_to_hub` | Pushes all model artifacts to Hugging Face Hub |
Stages 5–9 are automatically skipped by DVC when the model doesn't improve, preventing unnecessary exports and deployments.
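As a concrete illustration of stage 1 (`caption-generator`), here is a minimal captioning sketch following the public Florence-2 usage pattern from `transformers` (the real pipeline is `src/caption_generator.py`; the image path and the choice of task prompt are illustrative):

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# Florence-2 ships its modeling code on the Hub, hence trust_remote_code=True
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-large", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-large", torch_dtype=dtype, trust_remote_code=True
).to(device)

task = "<DETAILED_CAPTION>"  # one of Florence-2's captioning task prompts
image = Image.open("data/raw/flowers/images/image_00001.jpg").convert("RGB")  # illustrative path

inputs = processor(text=task, images=image, return_tensors="pt").to(device, dtype)
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=128,
    num_beams=3,
)
raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
caption = processor.post_process_generation(
    raw, task=task, image_size=(image.width, image.height)
)[task]
print(caption)
```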
```mermaid
graph LR
    A[git push main] --> B[GitHub Actions]
    B --> C[Upload app.py +\nDockerfile]
    C --> D[HF Spaces\nStreamlit App]
    D --> E[Downloads ONNX\nfrom HF Hub]
    E --> F[DDIM inference\non CPU]
```
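The deploy step boils down to pushing a few files into the Space repo. A hedged sketch with `huggingface_hub` (the actual mechanism is defined in `.github/workflows/deploy_to_hf_spaces.yml` and may differ, e.g. a git push instead; the Space id below is a placeholder):

```python
import os
from huggingface_hub import HfApi

api = HfApi(token=os.environ["HF_TOKEN"])
space_id = "your-username/text-to-image-flowers"  # placeholder Space id

# Upload the app files into the Docker Space
for local_path in ("saved_models/app.py", "saved_models/Dockerfile", "saved_models/requirements.txt"):
    api.upload_file(
        path_or_fileobj=local_path,
        path_in_repo=os.path.basename(local_path),
        repo_id=space_id,
        repo_type="space",
    )
```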
```mermaid
graph LR
    P[Text Prompt] --> CLIP[CLIP ViT-L/14\nfrozen]
    CLIP --> E[Embeddings\n77 × 768]
    I[Training Image] --> VAE_E[VAE Encoder\nSD ft-mse]
    VAE_E --> L[Latent\n16×16×4]
    L --> NOISE[Add Noise\nDDPM · T=1000]
    NOISE --> UNET[Cross-Attn UNet2D\nNoise Predictor]
    E --> UNET
    UNET --> DENOISE[Denoise\nDDIM · 30-50 steps]
    DENOISE --> VAE_D[VAE Decoder]
    VAE_D --> OUT[Generated Image\n128×128]
```
| Component | Details |
|---|---|
| Text Encoder | CLIP ViT-L/14 (openai/clip-vit-large-patch14) — frozen, 768-dim embeddings |
| VAE | Stable Diffusion VAE (stabilityai/sd-vae-ft-mse) — 8× latent compression |
| UNet | UNet2DConditionModel with cross-attention — channels [192, 384, 576] |
| Scheduler | DDPM (1000 steps, training) → DDIM (30–50 steps, inference) |
| Output | 128 × 128 px |
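A sketch of how these pieces could be instantiated with `diffusers`/`transformers`; the channel widths and cross-attention dimension follow the table above, while the remaining UNet arguments are illustrative and may differ from the actual training code:

```python
from diffusers import AutoencoderKL, DDPMScheduler, UNet2DConditionModel
from transformers import CLIPTextModel, CLIPTokenizer

# Frozen text encoder and VAE
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").eval()
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()

# Trainable noise predictor: 128 px image -> 16x16x4 latent after 8x VAE compression
unet = UNet2DConditionModel(
    sample_size=16,
    in_channels=4,
    out_channels=4,
    block_out_channels=(192, 384, 576),  # channel widths from the table above
    down_block_types=("CrossAttnDownBlock2D", "CrossAttnDownBlock2D", "DownBlock2D"),
    up_block_types=("UpBlock2D", "CrossAttnUpBlock2D", "CrossAttnUpBlock2D"),
    cross_attention_dim=768,             # CLIP ViT-L/14 hidden size
)

noise_scheduler = DDPMScheduler(num_train_timesteps=1000)
```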
| Training Setup | Details |
|---|---|
| Dataset | Oxford 102 Flowers (~8,189 images) |
| Captions | Auto-generated with Florence-2-large |
| Platform | AWS SageMaker — 2× ml.g4dn.xlarge (NVIDIA T4, 16 GB each) |
| Distribution | DeepSpeed ZeRO Stage 2 — optimizer states + gradients partitioned |
| Precision | FP16 mixed precision |
| Optimizer | AdamW (lr=2e-4, weight_decay=1e-2) + cosine warmup |
| Batch size | 64 |
| Epochs | 75 |
| Gradient clipping | 0.5 |
| CFG training | 10% unconditional dropout |
| Best val loss | 0.3124 |
| Experiment logs | DagShub MLflow |
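Putting these settings together, one training step looks roughly like the sketch below (the real implementation is `src/code/training_sagemaker_deepspeed.py`; variable names, the latent scaling, and the unconditional-embedding handling are illustrative):

```python
import torch
import torch.nn.functional as F

def training_step(batch, unet, vae, noise_scheduler, uncond_embedding, p_uncond=0.1):
    """One diffusion training step with 10% CFG dropout (illustrative sketch)."""
    images, text_emb = batch["pixel_values"], batch["clip_embeddings"]  # embeddings pre-cached on disk

    # Encode images to 16x16x4 latents with the frozen VAE
    with torch.no_grad():
        latents = vae.encode(images).latent_dist.sample() * vae.config.scaling_factor

    # Sample noise and a random timestep, then corrupt the latents (DDPM forward process)
    noise = torch.randn_like(latents)
    t = torch.randint(0, noise_scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy_latents = noise_scheduler.add_noise(latents, noise, t)

    # Classifier-free guidance training: drop the text condition for ~10% of samples
    drop = torch.rand(latents.shape[0], device=latents.device) < p_uncond
    text_emb = torch.where(drop[:, None, None], uncond_embedding, text_emb)

    # Predict the added noise and regress it with MSE
    noise_pred = unet(noisy_latents, t, encoder_hidden_states=text_emb).sample
    return F.mse_loss(noise_pred.float(), noise.float())
```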
| Component | Technology |
|---|---|
| Caption generation | Florence-2-large (Microsoft) |
| Text encoding | CLIP ViT-L/14 (OpenAI) |
| Image compression | Stable Diffusion VAE ft-mse |
| Noise prediction | UNet2DConditionModel (HF Diffusers) |
| Training infrastructure | AWS SageMaker (multi-node spot instances) |
| Distributed training | DeepSpeed ZeRO Stage 2 |
| Experiment tracking | MLflow on DagShub |
| Pipeline orchestration | DVC |
| Model optimization | ONNX Runtime (CPU) · TensorRT (GPU) |
| Deployment | Hugging Face Spaces + GitHub Actions CI/CD |
| App framework | Streamlit |
```
├── .github/workflows/
│   └── deploy_to_hf_spaces.yml      # CI/CD — auto-deploy to HF Spaces on push
├── data/raw/flowers/
│   ├── images/                      # Oxford 102 Flowers dataset (~8K images)
│   ├── captions/                    # Florence-2-generated captions
│   └── embeddings/                  # Pre-cached CLIP embeddings
├── samples/                         # Generated sample images
├── saved_models/
│   ├── app.py                       # Streamlit app (local + HF Spaces)
│   ├── Dockerfile                   # HF Spaces container
│   ├── requirements.txt             # HF Spaces dependencies
│   ├── diffuser.pth                 # Trained UNet weights (DVC-tracked)
│   ├── onnx_models/                 # ONNX exports (DVC-tracked)
│   └── trt_models/                  # TensorRT engines (DVC-tracked)
├── src/
│   ├── code/                        # SageMaker training container
│   │   ├── dataloader.py            # Dataset with CLIP embedding cache
│   │   └── training_sagemaker_deepspeed.py
│   ├── caption_generator.py         # Florence-2 caption pipeline
│   ├── precompute_embeddings.py     # CLIP embedding precomputation
│   ├── trainingjob.py               # SageMaker job launcher
│   ├── log_training_model.py        # Best model download (conditional)
│   ├── evaluate.py                  # Model evaluation + sample generation
│   ├── onnx_converter.py            # PyTorch → ONNX export
│   ├── tensorrt_converter.py        # ONNX → TensorRT compilation
│   ├── push_to_hub.py               # Upload to HF Hub
│   ├── upload.py                    # Dataset upload to S3
│   └── common.py                    # Shared utilities
├── notebooks/
│   └── Diffusion.ipynb              # EDA & exploration
├── dvc.yaml                         # Pipeline definition (9 stages)
├── params.yaml.template             # Config template
├── requirements.txt                 # Project dependencies
├── MODEL_CARD.md                    # Model documentation
└── SYSTEM_DESIGN.md                 # System architecture & design decisions
```
```bash
git clone https://github.com/aniketpoojari/Text-To-Image-Diffusion.git
cd Text-To-Image-Diffusion
pip install -r requirements.txt
cp params.yaml.template params.yaml
# Fill in AWS credentials, MLflow URI, and HF token
dvc repro
```

```bash
cd saved_models
streamlit run app.py
```

Supports three inference backends — select in the sidebar:
- ONNX Runtime — CPU-optimized (default on HF Spaces)
- TensorRT — fastest on NVIDIA GPUs
- PyTorch — full precision, any device
Automated via GitHub Actions on every push to main. One-time setup:
- Create a Docker Space on huggingface.co/new-space
- Add GitHub repo secrets:
  - `HF_TOKEN`
**Why precompute CLIP embeddings?** CLIP inference was the per-batch bottleneck during training. Pre-caching embeddings to disk as `.pt` files eliminates this overhead entirely and enables multi-worker data loading.
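A minimal sketch of the precompute step (the real script is `src/precompute_embeddings.py`; the one-caption-per-`.txt`-file layout and output paths are assumptions):

```python
import torch
from pathlib import Path
from transformers import CLIPTextModel, CLIPTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").to(device).eval()

captions_dir = Path("data/raw/flowers/captions")
out_dir = Path("data/raw/flowers/embeddings")
out_dir.mkdir(parents=True, exist_ok=True)

with torch.no_grad():
    for caption_file in captions_dir.glob("*.txt"):
        caption = caption_file.read_text().strip()
        tokens = tokenizer(caption, padding="max_length", max_length=77,
                           truncation=True, return_tensors="pt").to(device)
        emb = text_encoder(**tokens).last_hidden_state.squeeze(0)  # (77, 768)
        torch.save(emb.cpu(), out_dir / f"{caption_file.stem}.pt")
```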
**Why DeepSpeed ZeRO Stage 2?** ZeRO Stage 2 partitions optimizer states and gradients across GPUs, which fits the full UNet + pretrained VAE on 2× T4 instances (16 GB each) that would otherwise OOM with standard data parallelism.
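An illustrative DeepSpeed config matching that description (assuming the batch size of 64 is global across the two T4s; the project's actual config may differ):

```python
# Illustrative ZeRO Stage 2 config for this setup
ds_config = {
    "train_micro_batch_size_per_gpu": 32,   # assumes the batch size of 64 is global over 2 GPUs
    "fp16": {"enabled": True},              # FP16 mixed precision
    "zero_optimization": {
        "stage": 2,                         # partition optimizer states + gradients
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "gradient_clipping": 0.5,
}

# import deepspeed
# engine, optimizer, _, lr_scheduler = deepspeed.initialize(
#     model=unet, model_parameters=unet.parameters(), config=ds_config
# )
```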
**Why conditional model download?** The `log_training_model` stage compares the new run's `val_diffuser_loss` against the current model's metadata. If there is no improvement, file hashes stay the same and DVC automatically skips all downstream stages — no unnecessary ONNX exports or HF Hub uploads.
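Conceptually, the check boils down to something like this (the metric name comes from the pipeline; the metadata file path is hypothetical):

```python
import json
import mlflow

def model_improved(run_id: str, metadata_path: str = "saved_models/model_metadata.json") -> bool:
    """Return True if the finished run beats the currently saved model on val_diffuser_loss."""
    client = mlflow.tracking.MlflowClient()
    new_loss = client.get_run(run_id).data.metrics["val_diffuser_loss"]
    with open(metadata_path) as f:
        best_loss = json.load(f)["val_diffuser_loss"]
    return new_loss < best_loss

# If the run improved, the checkpoint is pulled from S3 and overwrites saved_models/diffuser.pth;
# otherwise nothing changes on disk and DVC skips the downstream export and deployment stages.
```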
**Why DDIM over DDPM at inference?** DDPM requires 1000 denoising steps. DDIM achieves comparable quality in 30–50 steps — a 20–33× speedup critical for CPU-based deployment on HF Spaces.
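A condensed DDIM sampling loop with classifier-free guidance, reusing the `unet` and `vae` names from the sketches above (the deployed loop lives in `saved_models/app.py`; `text_emb` and `uncond_emb` are (1, 77, 768) CLIP embeddings):

```python
import torch
from diffusers import DDIMScheduler

@torch.no_grad()
def generate(unet, vae, text_emb, uncond_emb, steps=50, cfg_scale=5.0):
    """Sample one 128x128 image with DDIM + classifier-free guidance (sketch)."""
    scheduler = DDIMScheduler(num_train_timesteps=1000)  # same noise schedule as training
    scheduler.set_timesteps(steps)                        # 50 steps instead of 1000

    latents = torch.randn(1, 4, 16, 16)                   # start from pure Gaussian noise
    for t in scheduler.timesteps:
        noise_cond = unet(latents, t, encoder_hidden_states=text_emb).sample
        noise_uncond = unet(latents, t, encoder_hidden_states=uncond_emb).sample
        noise_pred = noise_uncond + cfg_scale * (noise_cond - noise_uncond)  # CFG 5.0
        latents = scheduler.step(noise_pred, t, latents).prev_sample

    return vae.decode(latents / vae.config.scaling_factor).sample
```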
**Why ONNX for deployment?** ONNX Runtime CPU inference is significantly faster than PyTorch CPU due to graph optimization and kernel fusion. This makes the HF Spaces demo (CPU-only free tier) practical.
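Loading one of the exported graphs with ONNX Runtime on CPU looks roughly like this (the file name and input names under `saved_models/onnx_models/` are assumptions; inspect the session inputs rather than relying on them):

```python
import numpy as np
import onnxruntime as ort

# Assumed export path; check the actual contents of saved_models/onnx_models/
sess = ort.InferenceSession("saved_models/onnx_models/unet.onnx",
                            providers=["CPUExecutionProvider"])

# Input names/dtypes depend on how the export was done, so inspect them first
print([(i.name, i.shape, i.type) for i in sess.get_inputs()])

# Example call assuming typical diffusers-style input names
noise_pred = sess.run(None, {
    "sample": np.random.randn(1, 4, 16, 16).astype(np.float32),
    "timestep": np.array([999], dtype=np.int64),
    "encoder_hidden_states": np.random.randn(1, 77, 768).astype(np.float32),
})[0]
```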
For a deeper dive into architecture and trade-offs, see SYSTEM_DESIGN.md.




