Denoising Autoencoder & Transfer Learning

A production-ready implementation demonstrating how unsupervised pre-training on 57,000 images with a Denoising Autoencoder can be used for Transfer Learning, improving the performance of a CNN classifier trained on only 1,800 labeled samples.

📑 Table of Contents

🎯 Project Overview
🛠️ Technical Stack
🚀 Installation & Quick Start
📁 Project Structure
⚠️ Known Limitations
🏗️ Architecture
🔧 Customization
📊 Results
💡 Key Learnings
📖 Supporting Theory
🚀 Future Enhancements
📚 References

🎯 Project Overview

Objective: Build and compare two CNN models: one trained from scratch (baseline) and another using Transfer Learning from an Encoder pre-trained on a Denoising Autoencoder. The goal is to quantify the benefit of unsupervised pre-training in a simulated low-data scenario.

Key Achievements:

✅ Semi-Supervised Learning: Successfully demonstrated a pipeline where an encoder, pre-trained on 57,000 unlabeled images, improved the performance of a classifier trained on only 1,800 labeled images.
✅ Improved Generalization: The pre-trained model achieved higher validation accuracy (72.17% vs 70.33%) and more stable training curves, indicating better generalization.
✅ Resolved Critical Ambiguity: Achieved a ~4x accuracy improvement (from 8.8% to 32.4%) on the most difficult class (Category 6: "Shirt"), proving it learned more robust features.
✅ Denoising Implementation: Forced the autoencoder to learn robust, noise-invariant features by training it to reconstruct clean images from corrupted versions (noise factor 0.2).

Techniques Implemented:

Denoising Autoencoder: Remove random noise (factor 0.2) from corrupted images to force robust feature learning.
Transfer Learning: Reuse pretrained encoder layers (learned features) in a supervised classifier, freezing their weights.
Unsupervised Pretraining: Learned complex features from 57,000 unlabeled images prior to supervised training.
Supervised Fine-tuning: Trained the new classifier head with only 1,800 labeled samples.
Feature Transfer: Freezing the pre-trained encoder layers for feature initialization, ensuring stable feature extraction.

💼 Real-World Applications

This technique—leveraging self-supervised or unsupervised pretraining to initialize a model, which is then fine-tuned on a smaller set of labeled data—is foundational to modern deep learning. Its utility spans various domains:

Semi-Supervised Learning

Medical Imaging: Limited labeled data (pathologies), abundant unlabeled scans (normal/general).
Satellite Imagery: Vast unlabeled images, few labeled examples for specific classifications (e.g., land use).
Industrial Inspection: Few defect examples, many normal samples used to learn baseline feature representations.

Feature Extraction and Transfer Learning

Pretrained Encoders: Utilizing the encoder for complex downstream classification tasks.
Dimensionality Reduction: Compressing high-dimensional input (e.g., $28 \times 28$ image) into a compact, meaningful latent representation.
Transfer Learning in Low-Resource Domains: Applying features learned from a massive, general dataset (or unsupervised task) to a specialized domain with scarce labeled data.

Anomaly Detection

Learn "Normal" Patterns: Training the autoencoder on only non-anomalous data to establish a baseline of normal features.
Outlier Detection: Flagging inputs based on high reconstruction error, as the model struggles to reproduce features it hasn't learned (e.g., a defect on a product).
Quality Control in Manufacturing: Automated system for identifying defects in real-time.

Image Denoising and Restoration

Medical Image Enhancement: Removing acquisition noise from sensitive scans (MRI, X-ray) to improve diagnostic clarity.
Low-Light Photography Improvement: Restoration of detail and reduction of digital noise introduced in dark environments.
Restoration of Degraded Historical Photos: Applying denoising and inpainting techniques to clean up noise and damage on old or degraded film scans.

⚠️ Training Note: This project purposefully simulates a low-data scenario. The key metric is not the absolute test accuracy, but the comparative improvement and feature quality gained from pre-training.

🛠️ Technical Stack

Framework: TensorFlow 2.10.0 / Keras
Dataset: Fashion MNIST (70,000 images)
Data Split:

Unsupervised: 57,000 (train), 3,000 (validation)
Supervised: 1,800 (train), 600 (validation), 600 (test) Training: 10 epochs, batch size 256, Adam Optimizer
Environment: Python 3.10
Hardware: CPU (local) or GPU (Colab)

Core Dependencies

tensorflow==2.10.0      # Deep Learning Framework (Keras included)
numpy==1.23.5           # Numerical computing
matplotlib==3.6.2       # Plotting graphs and images
seaborn==0.12.1         # Visualization (confusion matrix heatmaps)
scikit-learn==1.1.3     # Data splitting (train_test_split) & metrics (confusion_matrix)

🚀 Installation & Quick Start

Recommended: Conda Environment

# Create the environment from the YAML file
conda env create -f environment.yml
conda activate dl-autoencoder

# Run the main training and analysis script
python Matheus_DAE.py

Alternative: pip Installation

# Create a virtual environment (optional, but recommended)
python -m venv autoencoder-env
source autoencoder-env/bin/activate  # Linux/Mac
# or
autoencoder-env\Scripts\activate  # Windows

# Install dependencies
pip install -r requirements.txt

# Run the main script
python Matheus_DAE.py

Expected Output

The script will run the full pipeline:

Print data shapes.
Baseline CNN summary and training.
Baseline validation accuracy plot.
Baseline confusion matrix.
Visualization of noisy images.
Autoencoder summary and training.
Visualization of reconstructed (denoised) images.
Pre-trained CNN summary and training.
Pre-trained validation accuracy plot.
Pre-trained confusion matrix.
Final comparison plot (Validation Accuracy: Baseline vs Pre-trained).

📁 Project Structure

01_AutoencoderTransferLearning/
|   Diagrams.vsdx          # Architecture diagrams (Visio)
|   environment.yml        # Conda environment specification
|   Instructions.pdf
|   Matheus_DAE.py         # Main script with complete implementation
|   README.md              # This file (updated version)
|   requirements.txt       # Dependencies for pip
|   WrittenAnalysis.docx   # Original academic analysis with plots

⚠️ Known Limitations

Users should be aware of the following constraints:

Simulated Data Scarcity: The supervised dataset of 1,800 samples is an artificial constraint of the assignment. It does not represent the model's maximum performance if all 60,000 labels were used, but rather simulates a real-world scenario.
Low Resolution: Fashion MNIST consists of 28x28 grayscale images, limiting the complexity of the learned features.
Test Set Variance: The supervised test set is small (600 samples). This can lead to statistical variance in the results; therefore, the validation accuracy (also 600 samples) and per-class analysis are more reliable indicators of improvement.
Simple Architectures: The models are intentionally shallow for academic demonstration and quick training, not for achieving state-of-the-art accuracy.

🏗️ Architecture

This project implements three distinct architectures:

1. Baseline CNN (cnn_v1_model_Matheus)

A standard CNN classifier trained from scratch on the 1,800 labeled data samples.

Input: 28×28×1 (grayscale image)
├─ Conv2D(16 filters, 3×3, stride=2, relu)       # 14×14×16
├─ Conv2D(8 filters, 3×3, stride=2, relu)        # 7×7×8
├─ Flatten                                         # 392 features
├─ Dense(100, relu)                                # Hidden layer
└─ Dense(10, softmax)                              # Classification output

Total params: 41,630

2. Denoising Autoencoder (autoencoder_Matheus)

Trained on the 57,000 unlabeled images to reconstruct clean images from noisy inputs.

--- ENCODER ---
Input: 28×28×1 (noisy image)
├─ Conv2D(16 filters, 3×3, stride=2, relu)       # 14×14×16
└─ Conv2D(8 filters, 3×3, stride=2, relu)        # 7×7×8 (Latent Space)

--- DECODER ---
Input: 7×7×8 (Latent Space)
├─ Conv2DTranspose(8 filters, 3×3, stride=2, relu)  # 14×14×8
├─ Conv2DTranspose(16 filters, 3×3, stride=2, relu) # 28×28×16
└─ Conv2D(1 filter, 3×3, sigmoid, 'same')         # 28×28×1 (Reconstructed Image)

Total params: 3,217

3. Pretrained CNN (cnn_v2_Matheus)

A new classifier that uses the Autoencoder's Encoder as a frozen feature extractor.

--- ENCODER (Transferred and Frozen) ---
Input: 28×28×1 (grayscale image)
├─ Conv2D(16 filters, 3×3, stride=2, relu)       # 14×14×16
└─ Conv2D(8 filters, 3×3, stride=2, relu)        # 7×7×8
(Non-trainable params: 1,320)

--- CLASSIFIER HEAD (Trainable) ---
├─ Flatten                                         # 392 features
├─ Dense(100, relu)                                # Hidden layer
└─ Dense(10, softmax)                              # Classification output
(Trainable params: 40,310)

Total params: 41,630

🔧 Customization

Experimental Design: Unsupervised Pre-training for Data Scarcity

The central challenge of this project was not creating a custom layer, but rather designing a methodology to validate the value of semi-supervised learning.

Problem: How do you train an effective image classifier when you have very few labeled data (1,800 images) but a vast set of unlabeled data (57,000 images)?

Solution: A two-stage pre-training and fine-tuning approach:

Step 1: Unsupervised Feature Learning (57k images)

Instead of just training a standard Autoencoder, we trained a Denoising Autoencoder (DAE).
We added Gaussian noise (factor 0.2) to the 57,000 input images.
The DAE is trained to minimize the Mean Squared Error (MSE) between its output and the original, clean images.
Why is this crucial? This forces the Encoder to learn robust, meaningful features (textures, edges, shapes) rather than simply memorizing the images (learning an identity function). It must learn the essence of a fashion item to reconstruct it from corrupted data.

Step 2: Supervised Transfer Learning (1.8k images)

We create a new classification model (CNN v2).
The Encoder layers trained in Step 1 are transferred to this new model.
The Encoder layers are frozen (trainable = False), treating them as a fixed feature extractor.
We add a "classifier head" (Flatten and Dense layers) on top.
We train only this new "head" on the 1,800 labeled samples.

A/B Test Comparison:

Model A (Baseline): Has to learn everything (what an edge is, what a texture is, and how to classify) from only 1,800 samples.
Model B (Pre-trained): Already knows what edges and textures look like (learned from 57k samples). It only needs to learn how to map these rich features to 10 classes, using the 1,800 samples.

This methodology directly simulates a real-world scenario where data collection is easy, but data labeling is expensive and time-consuming.

Training Configuration

Configuration	Baseline CNN	Autoencoder	Pretrained CNN
Optimizer	Adam	Adam	Adam
Loss Function	Categorical Crossentropy	Mean Squared Error (pixel-wise)	Categorical Crossentropy
Epochs	10	10	10
Batch Size	256	256	256
Labeled Samples	1,800	N/A	1,800 (Fine-tuning)
Unlabeled Samples	N/A	57,000 (Pretraining)	N/A
Encoder Weights	Randomly Initialized	Trained	Frozen (Transferred)

📊 Results

Comparative Performance Analysis

The analysis of the results reveals a subtle story. While the Baseline's test accuracy was marginally higher (likely due to variance in a small 600-sample set), the validation metrics and per-class analysis prove the pre-trained model was superior in generalization and feature learning.

Metric	Baseline CNN (Trained on 1.8k)	Pretrained CNN (Pre-trained on 57k, Tuned on 1.8k)	Impact
Training Accuracy	72.09%	75.77%	+3.68%
Validation Accuracy	70.33%	72.17%	+1.84%
Test Accuracy	75.33%	74.50%	-0.83%
Pre-training Data	None	57,000 unlabeled	N/A

Observation: The pre-trained model consistently outperformed the baseline on train and validation metrics, indicating it started training with a much better feature foundation.

The "Shirt" Problem (Category 6): Proof of Value

The true advantage of pre-training is seen when analyzing Category 6 ("Shirt"), the dataset's most difficult class, which is visually ambiguous and easily confused with "T-shirt/top" (0), "Pullover" (2), and "Coat" (4).

Baseline CNN (No pre-training):

Out of 68 "Shirts" in the test set, it correctly identified only 6.
Per-class accuracy: 8.82%
Massively confused "Shirts" with "Coats" (27 errors), "Dresses" (13 errors), "T-shirts" (10 errors), and "Pullovers" (10 errors).
Essentially, the model failed to learn the distinctive feature of a "Shirt" from only 1,800 examples.

Pretrained CNN (With pre-training):

Out of 68 "Shirts" in the test set, it correctly identified 22.
Per-class accuracy: 32.35%
A ~4x (266%) improvement!
The model still finds the class difficult (confusing it with "Coats" - 25 errors), but its ability to distinguish it is drastically better.

Conclusion: The Encoder, pre-trained on 57,000 images, learned the subtle features (like collars, buttons, textures) that differentiate a "Shirt" from a "Coat" or "Pullover" — features the baseline model could not extract from its 1,800-sample supervised dataset.

💡 Key Learnings

1. Test Accuracy Isn't the Whole Story

A superficial look at test accuracy (75.33% vs 74.50%) would suggest pre-training failed. However, the consistently higher validation accuracy and the 4x improvement on the hardest class prove the pre-trained model was objectively superior and more generalizable. On small test sets, variance can mask true performance.

2. Pre-training Shines in Low-Data Scenarios

This experiment quantitatively validates the use of unlabeled data. Unsupervised pre-training acts as a "teacher" that teaches the model the fundamentals of the world (what shapes, textures, and edges look like) before the supervised "tutor" gives it a specific task (classification).

3. Denoising Forces Robust Feature Learning

By adding corruption (noise) and forcing the model to reconstruct the clean image, we prevent it from learning a trivial "identity function" (i.e., output = input). The model is forced to learn the underlying structure of the data, making the encoder a much more powerful feature extractor that is robust to small variations.

4. Feature Learning Solves Ambiguity

The "Shirt" vs. "Pullover" vs. "Coat" problem is one of visual ambiguity. The baseline model, with only 1,800 samples, did not have enough data to disambiguate these classes. The pre-trained model used 57,000 images to build a much richer feature space where these classes became more separable.

📖 Supporting Theory

What is an Autoencoder (AE)?

An Autoencoder is a type of unsupervised neural network used to learn compressed representations (encodings) of data, typically for dimensionality reduction or feature learning. It consists of two parts:

Encoder: A network that compresses the input data x into a lower-dimensional latent space representation, z. It learns to capture the most important information.
- z = encoder(x)
- In our case: 28x28x1 (784 pixels) → 7x7x8 (392 features)
Decoder: A network that attempts to reconstruct the original input x' from the latent representation z.
- x' = decoder(z)
- In our case: 7x7x8 → 28x28x1

The model is trained to minimize the reconstruction loss (like MSE) between x and x'. The latent space z becomes a powerful, compressed representation of the input.

Why a Denoising Autoencoder (DAE)?

A standard Autoencoder trained on simple data might just learn the "identity function" (copying the input to the output), resulting in an encoder that learned nothing useful.

A Denoising Autoencoder solves this by introducing a corruption step:

A clean sample x is selected.
Random noise is added to create a corrupted version x_noisy.
The Encoder receives x_noisy as input: z = encoder(x_noisy).
The Decoder reconstructs from z: x' = decoder(z).
Crucially: The loss is calculated between the output x' and the original clean image x.
- loss = MSE(x', x)

This forces the model to not just copy, but to learn the underlying structure of the data to remove the noise and recreate the original. The encoder is forced to extract only the most robust and meaningful features, ignoring statistical noise.

What is Transfer Learning?

Transfer Learning is a technique where a model trained on a Task A is reused (in whole or in part) for a Task B. This is extremely effective when Task B has little data.

We use the analogy of a medical student:

Baseline Model: Like a student trying to learn anatomy (features) and surgery (classification) at the same time, with only a few textbooks (1,800 samples).
Pre-trained Model: Like a resident physician. They spent years studying anatomy (pre-training on 57k images) and now just need to focus on learning the surgery (classification with 1.8k samples). They learn faster and more effectively because their knowledge base is vast.

In this project, the "anatomy knowledge" is the Encoder layers. We transfer them and freeze them (trainable = False), so the new model doesn't "forget" what it learned from 57,000 images. It only trains the new Dense layers to apply this knowledge to the classification task.

🚀 Future Enhancements

Fine-Tuning: After the initial training of the classifier "head," we could unfreeze the encoder layers and continue training the entire network with a very low learning rate (e.g., 1e-5). This would allow the pre-trained features to "fine-tune" themselves to the classification task.
Deeper Architectures: Using a deeper encoder (e.g., VGG or ResNet-style) as the autoencoder backbone could capture even more complex features, potentially leading to better classification performance.
Variational Autoencoders (VAEs): Replacing the DAE with a VAE (as in the next project) for pre-training. This would teach the encoder a smooth, probabilistic latent space distribution, which can also lead to generalizable features.

📚 References

Essential Papers

Vincent et al. (2008) - "Extracting and Composing Robust Features with Denoising Autoencoders"
- Seminal paper that introduced the Denoising Autoencoder.
Bengio et al. (2007) - "Greedy Layer-Wise Training of Deep Networks"
- Fundamental concept of layer-wise pre-training that inspired modern approaches.

Dataset

Fashion MNIST: GitHub Repository
Xiao et al. (2017) - "Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms"

🎓 Academic Context

Course: COMP 263 - Deep Learning
Institution: Centennial College
Term: Fall 2024
Grade: High Honors (GPA: 4.45/4.5)

Skills Demonstrated

Semi-Supervised Learning: Designed and executed a pipeline combining unsupervised pre-training (DAE) with supervised fine-tuning (CNN).
Transfer Learning: Successfully implemented feature extraction by transferring and freezing layers from a pre-trained encoder.
Comparative Analysis (A/B Testing): Conducted a rigorous analysis (Baseline vs. Pre-trained), dissecting validation and per-class accuracy metrics to prove the value of pre-training.
Robust Feature Engineering: Implemented a Denoising Autoencoder to learn noise-invariant representations instead of a standard AE.
Data Scarcity Simulation: Managed a complex data split (57k/1.8k/600/600) to accurately simulate a real-world low-label scenario.
Technical Communication: Documented complex architectures (nested models) and theoretical concepts (DAE, Transfer Learning) clearly and accessibly.

Author: Matheus Ferreira Teixeira
GitHub: github.com/domvito55
LinkedIn: linkedin.com/in/mathteixeira

📝 License

This project is academic coursework at Centennial College. Free to use for learning purposes with proper attribution.

FilesExpand file tree

README.md

Latest commit

History