Comparative study of Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN/LSTM) for image classification. Two architectures were built and evaluated on the Fashion MNIST dataset, reaching 88% test accuracy with the CNN and 86.2% with the RNN, demonstrating the CNN's advantage for spatial pattern recognition.
- 🎯 Project Overview
- 🛠️ Technical Stack
- 🚀 Installation & Quick Start
- 📁 Project Structure
- ⚠️ Known Limitations
- 🏗️ Architecture
- 📊 Results
- 💡 Key Learnings
- 📖 Supporting Theory
- 🚀 Future Enhancements
- 📚 References
Objective: Compare CNN and RNN architectures for classifying 10 categories of clothing items from the Fashion MNIST dataset, demonstrating when to use each architecture type.
Key Achievements:
- ✅ CNN achieved 88% test accuracy vs RNN's 86.2% - validated CNN superiority for spatial tasks
- ✅ Identified category 6 (Shirts) as most challenging (65.9% accuracy) due to visual similarity with related classes
- ✅ Demonstrated proper train-validation-test methodology with balanced generalization (no overfitting)
- ✅ Built comprehensive evaluation pipeline: training curves, confusion matrices, probability distributions
- Automated clothing categorization for online retailers
- Visual search systems ("find similar items")
- Inventory management with image recognition
- Size recommendation based on garment type detection
- Multi-class image classification pipelines
- Transfer learning foundations for custom datasets
- Confidence-based decision making (prediction probabilities)
- Production deployment of CNN models
- Customer preference analysis from product images
- Trend detection in fashion categories
- Automated tagging for product databases
- Quality control with visual inspection
Framework: TensorFlow 2.10.0 / Keras
Dataset: Fashion MNIST (70,000 images: 60,000 train, 10,000 test)
Training: 8 epochs, batch size 256, Adam optimizer
Hardware: CPU (GPU optional)
Environment: Python 3.10 (TensorFlow 2.10 compatibility requirement)
tensorflow==2.10.0 # Deep learning framework (pinned for reproducibility with Python 3.10)
numpy==1.23.5 # Numerical operations and array manipulation
matplotlib==3.6.2 # Training curves and probability visualizations
seaborn==0.12.1 # Confusion matrix heatmaps
scikit-learn==1.1.3 # Train-test split and confusion matrix

Version Compatibility Note: All dependencies are 2022 stable releases tested together; versions are pinned for reproducibility with Python 3.10.
Fashion MNIST
- 70,000 grayscale images (28×28 pixels)
- 10 classes of clothing and accessories
- Training set: 60,000 images
- Test set: 10,000 images
- Balanced classes (7,000 samples per category)
Classes:
- T-shirt/top
- Trouser
- Pullover
- Dress
- Coat
- Sandal
- Shirt
- Sneaker
- Bag
- Ankle boot
Source: Built into tf.keras.datasets.fashion_mnist
# Create environment from YAML
conda env create -f environment.yml
conda activate dl-cnn-rnn
# Run complete analysis
python matheus_CNN_RNN.py

Expected output:
- Dataset size and image resolution
- CNN model summary and training progress
- Training vs validation accuracy plots
- Test accuracy evaluation
- Prediction probability distributions (4 samples)
- Confusion matrix visualization
- RNN training and evaluation (same sequence)
Runtime: ~5-10 minutes on CPU, ~2-3 minutes on GPU
# Create virtual environment
python -m venv dl-env
source dl-env/bin/activate # Linux/Mac
# or
dl-env\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
# Run analysis
python matheus_CNN_RNN.py

Issue: TensorFlow warnings about oneDNN optimizations
- Solution: Warnings suppressed in code via environment variables (already handled)
Issue: Out of memory errors
- Solution: Reduce batch size from 256 to 128 (line 76 in script)
00_CNN_RNN_ImageClassification/
├── Diagrams.vsdx # Network architecture diagrams
├── environment.yml # Conda environment specification
├── Instructions.pdf
├── matheus_CNN_RNN.py # Complete CNN + RNN implementation
├── README.md # This file
├── requirements.txt # pip dependencies
└── WrittenAnalysis.docx # Detailed analysis and diagrams
Users should be aware of the following constraints:
- Dataset Scope: Fashion MNIST only (28×28 grayscale)
- Impact: Simplified dataset not representative of real-world complexity
- Note: Excellent for learning fundamentals, not production-ready
- Low Resolution: 28×28 pixels
- Impact: Limited detail for fine-grained classification
- Example: Shirts vs T-shirts difficult to distinguish at this resolution
- RNN Architecture Choice: LSTM treats image rows as a sequence
- Impact: Not optimal for spatial data (rows lack temporal relationship)
- Note: Demonstrates RNN capabilities but highlights architectural mismatch
- Single Model Per Architecture: No ensemble or hyperparameter tuning
- Impact: Results show single configuration, not optimal performance
- Fixed Random Seed
- Impact: Results reproducible but single train-validation split
- Workaround: Multiple runs with different seeds for robust evaluation
Input Layer: 28×28×1 (grayscale images)
↓
Conv2D Layer 1: 32 filters, 3×3 kernel, ReLU activation
↓
MaxPooling2D: 2×2 window (spatial downsampling)
↓
Conv2D Layer 2: 32 filters, 3×3 kernel, ReLU activation
↓
MaxPooling2D: 2×2 window
↓
Flatten: Convert 2D feature maps to 1D
↓
Dense Layer: 100 neurons, ReLU activation
↓
Output Layer: 10 neurons, Softmax activation (10 classes)
Total Parameters: ~140,000
Design Principles:
- Progressive spatial downsampling (28×28 → 14×14 → 7×7)
- Hierarchical feature extraction (edges → textures → shapes)
- Small kernels (3×3) following modern CNN best practices
- ReLU activation prevents vanishing gradients
Input Layer: 28 timesteps × 28 features
(each image row treated as timestep)
↓
LSTM Layer: 128 hidden units
↓
Output Layer: 10 neurons, Softmax activation
Total Parameters: ~84,000 (40% fewer than CNN)
Design Principles:
- Treat image rows as temporal sequence
- LSTM captures long-range dependencies
- Simpler architecture than CNN (fewer parameters)
- Demonstrates RNN capability despite architectural mismatch
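The ~84K figure can be sanity-checked by hand. The sketch below tallies parameters using the standard single-bias LSTM convention (an estimate; exact totals shift slightly with the framework's bias conventions):

```python
def lstm_params(input_dim: int, units: int) -> int:
    # Four gate blocks (forget, input, cell, output), each with
    # input weights, recurrent weights, and one bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_params(inputs: int, outputs: int) -> int:
    return inputs * outputs + outputs

lstm = lstm_params(28, 128)      # 28 pixel-features per row, 128 hidden units
head = dense_params(128, 10)     # softmax head over 10 classes
print(lstm, head, lstm + head)   # 80384 1290 81674
```

The total lands in the low-80K range, in line with the rounded ~84K above and roughly 40% below the CNN's count.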
| Metric | CNN | RNN | Winner |
|---|---|---|---|
| Training Accuracy | 89.99% | 87.23% | CNN |
| Validation Accuracy | ~89% | ~87% | CNN |
| Test Accuracy | 88.0% | 86.2% | CNN (+1.8%) |
| Training Speed | Faster | Slower | CNN |
| Parameters | ~140K | ~84K | RNN (simpler) |
| Generalization | Excellent | Excellent | Tie |
| Class | Name | Accuracy | Confusion Pattern |
|---|---|---|---|
| 0 | T-shirt/top | ~84% | Confused with shirts (81 errors) |
| 1 | Trouser | >97% | Highly distinctive shape |
| 2 | Pullover | ~85% | Some overlap with coats |
| 3 | Dress | ~87% | Clear silhouette |
| 4 | Coat | ~83% | Confused with pullovers/shirts |
| 5 | Sandal | ~94% | Distinctive footwear pattern |
| 6 | Shirt | 65.9% | Most challenging (see analysis) |
| 7 | Sneaker | ~92% | Clear footwear pattern |
| 8 | Bag | >97% | Most distinctive shape |
| 9 | Ankle boot | ~91% | Clear boot silhouette |
Problem: Only 65.9% accuracy (659/1000 correct) - significantly below overall 88%
Misclassification Pattern:
- 136 predicted as T-shirt/top (0)
- 90 predicted as Pullover (2)
- 66 predicted as Coat (4)
- 30 predicted as Dress (3)
Reverse Confusion (other classes → Shirt):
- T-shirts → Shirts: 81 times
- Coats → Shirts: 82 times
- Pullovers → Shirts: 66 times
Root Cause: High visual similarity at 28×28 resolution. All categories share vertical orientation and similar upper-body shapes.
CNN Training Dynamics:
- Smooth convergence to ~89-90% accuracy
- Training and validation curves closely aligned (gap <1%)
- No overfitting indicators
- Stable plateau by epoch 6-7
RNN Training Dynamics:
- Similar convergence pattern to CNN
- Slightly higher variance in validation curve
- Final accuracy ~2% lower than CNN
- Also no overfitting (healthy generalization)
Strong Diagonal: Categories 1, 5, 7, 8, 9 have >900/1000 correct predictions
Bidirectional Confusion: Shirts ↔ T-shirts, Shirts ↔ Coats show symmetric errors
Implication: Feature space overlap suggests hierarchical classification could help
Key Insight: CNN achieved 88% accuracy vs RNN's 86.2% because convolutional layers are designed for spatial hierarchies (edges → textures → shapes), while RNNs are optimized for sequential data.
What we discovered: The 1.8% accuracy gap validates architectural choice. CNNs extract local patterns (edges, corners) via convolution, then combine them hierarchically. RNNs treat rows as sequence, missing 2D spatial relationships.
Practical implication: For image tasks, always prefer CNNs (or Vision Transformers for high-resolution). Reserve RNNs for true sequential data (text, audio, time series).
Key Insight: Both models showed <1% gap between training and validation accuracy, confirming proper regularization without overfitting or underfitting.
What we observed:
- CNN: 89.99% train, 88% test (1.99% gap)
- RNN: 87.23% train, 86.2% test (1.03% gap)
Practical implication: Train-validation-test split (80-20 + separate test) is essential. Monitoring curves during training prevents wasting compute on diverging models.
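The split itself is one line with scikit-learn's train_test_split; here is a plain-NumPy sketch of the same idea (illustrative, not the project's exact code):

```python
import numpy as np

def train_val_split(x, y, val_frac=0.2, seed=42):
    # Shuffle indices with a fixed seed, then carve off the head for validation
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    n_val = int(len(x) * val_frac)
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    return x[train_idx], y[train_idx], x[val_idx], y[val_idx]

x = np.arange(100).reshape(100, 1)
y = np.arange(100)
x_tr, y_tr, x_val, y_val = train_val_split(x, y)
print(len(x_tr), len(x_val))  # 80 20
```

Fixing the seed makes the split reproducible, which is exactly the trade-off discussed under Known Limitations.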
Key Insight: Shirt classification (65.9%) was 22 percentage points below the overall accuracy due to high visual similarity with T-shirts, pullovers, and coats at 28×28 resolution.
What we learned: Confusion matrix analysis reveals which categories need improvement. Symmetric confusion (A→B and B→A) suggests feature space overlap, not model bias.
Practical implication: In production, use per-class metrics (precision, recall, F1) alongside overall accuracy. Identify weak categories and collect more diverse samples or apply data augmentation.
Key Insight: RNN had 40% fewer parameters (~84K vs ~140K) but achieved lower accuracy, demonstrating that architectural match to problem structure matters more than parameter count.
What we discovered: Fewer parameters can mean underfitting if architecture doesn't capture problem structure. RNNs process sequences causally (row N depends on row N-1), but image rows lack this dependency.
Practical implication: Don't optimize for parameter count alone. Choose architecture based on data structure: CNNs for spatial, RNNs for temporal, Transformers for long-range dependencies.
Convolutional Neural Networks, or CNNs, are specialized neural networks designed for processing grid-like data—most notably, images. The revolutionary insight behind CNNs is borrowed from biological vision: just as neurons in the visual cortex respond to stimuli in specific regions of the visual field (called receptive fields), CNN neurons process small patches of the input image rather than the entire image at once.
The key operation in a CNN is convolution: imagine sliding a small window (the "kernel" or "filter") across an image, performing the same mathematical operation at each position. For example, a 3×3 filter might detect horizontal edges by looking at brightness differences between the top and bottom of its window:
# Edge detection filter example
horizontal_edge_filter = [
    [-1, -1, -1],  # Top row (darker above)
    [ 0,  0,  0],  # Middle row
    [ 1,  1,  1],  # Bottom row (brighter below)
]
# When this slides over an image, it produces high values
# wherever there's a horizontal edge

This simple 3×3 filter has only 9 parameters (the values in the matrix). But here's the magic: we use this same filter across the entire image. Whether an edge appears in the top-left corner or bottom-right corner, the same filter detects it. This property is called translation invariance: patterns are recognized regardless of where they appear.
Parameter Sharing is the secret sauce: instead of having different parameters for every position in the image (which would be millions of parameters), we reuse the same filter everywhere. A single 3×3 filter applied to a 28×28 image involves only 9 parameters but produces 784 outputs (one per position). This is the fundamental reason CNNs are so efficient.
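To make parameter sharing concrete, here is a minimal NumPy sketch that slides the 9-parameter horizontal-edge filter over a toy image (loops kept explicit for clarity; real frameworks use optimized convolution ops):

```python
import numpy as np

kernel = np.array([[-1, -1, -1],
                   [ 0,  0,  0],
                   [ 1,  1,  1]])  # horizontal-edge filter from above

# Toy 6x6 image: dark top half (0), bright bottom half (1)
image = np.zeros((6, 6))
image[3:, :] = 1.0

# Slide the same 9-parameter kernel over every 3x3 patch
out = np.zeros((4, 4))
for i in range(4):
    for j in range(4):
        out[i, j] = np.sum(image[i:i+3, j:j+3] * kernel)

print(out)
# Rows of `out` straddling the dark/bright boundary respond strongest
```

The same 9 numbers fire at every horizontal position along the edge, which is translation invariance in action.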
Hierarchical Feature Learning: Early layers detect simple patterns (edges, corners). Middle layers combine these into textures and parts (collars, pockets). Deep layers recognize complete objects (shirts, shoes). This mirrors how biological vision systems work, building complex understanding from simple components.
Before CNNs, the standard approach to image classification was using fully connected networks (what we now call "feedforward" or "dense" networks). Let's understand why this was catastrophically inefficient.
Consider our Fashion MNIST images: 28×28 pixels = 784 input features. If we connect this directly to a hidden layer with just 500 neurons (a modest size), we need:
784 inputs × 500 neurons = 392,000 parameters (just for the first layer!)
Now imagine real images. A small 200×200 color image has:
200 × 200 × 3 (RGB) = 120,000 input features
First hidden layer (2000 neurons):
120,000 × 2000 = 240 million parameters
For a single layer. This is computationally absurd.
The fundamental problem: Fully connected networks treat every pixel as independent. They can't exploit the fact that nearby pixels are related (spatial locality) or that a pattern learned in one part of the image should transfer to another part (translation invariance). Every connection must be learned independently, leading to parameter explosion.
Historical Context:
The breakthrough came in 1998 with Yann LeCun's LeNet-5, which successfully read handwritten digits for check processing. LeNet introduced convolution and pooling, reducing parameters from millions to tens of thousands while improving accuracy.
But CNNs didn't dominate until 2012, when AlexNet (Krizhevsky et al.) won ImageNet competition with 15.3% error rate—crushing the 26.2% error of traditional methods. AlexNet had 60 million parameters but processed 224×224 images—something impossible with fully connected architectures.
The Paradigm Shift:
CNNs replaced:
❌ Every pixel connects to every neuron (O(image_size × neurons))
✅ Small filters slide across image (O(filter_size × neurons))
For our 28×28 Fashion MNIST with 32 filters of size 3×3:
Fully connected: 28 × 28 × 32 = 25,088 parameters
Convolutional: 3 × 3 × 32 = 288 parameters (87× reduction!)
This efficiency enables deep networks (many layers) without computational explosion.
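The comparison above is plain arithmetic, worth spelling out:

```python
# Fully connected: all 784 pixels feed each of 32 neurons
fc_params = 28 * 28 * 32        # 25,088 weights

# Convolutional: one shared 3x3 kernel per filter, 32 filters
conv_params = 3 * 3 * 32        # 288 weights

print(fc_params, conv_params, fc_params // conv_params)   # 25088 288 87
```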
Our CNN follows the classic pattern:
Conv2D → MaxPooling → Conv2D → MaxPooling → Flatten → Dense
Let's break down each component:
Conv2D Layer (Feature Detection):
Conv2D(32 filters, kernel_size=3×3, activation='ReLU')

- 32 filters: 32 different patterns to detect (edges at various angles, textures, etc.)
- 3×3 kernel: Each filter looks at 3×3 pixel neighborhoods
- ReLU activation: Introduces non-linearity (f(x) = max(0, x))
What happens: The input 28×28 image becomes 32 "feature maps" of size 26×26 (slight shrinkage from convolution borders). Each feature map highlights where a specific pattern was detected.
MaxPooling (Dimensionality Reduction):
MaxPooling2D(pool_size=2×2)

- Divides each feature map into 2×2 blocks
- Keeps only the maximum value from each block
- 26×26 → 13×13 (halves spatial dimensions)
Purpose: Provides translation invariance (exact position doesn't matter) and reduces computation. The strongest activation in each region is what matters, not precisely where it occurred.
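A 2×2 max-pool is a few lines of NumPy (an illustrative sketch; frameworks provide fused, strided implementations):

```python
import numpy as np

def max_pool_2x2(fmap: np.ndarray) -> np.ndarray:
    # Split each 2x2 block onto its own axis pair, then take the max
    h, w = fmap.shape
    return fmap.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 1],
                 [0, 0, 5, 6],
                 [1, 2, 7, 8]])
print(max_pool_2x2(fmap))
# Keeps only the strongest activation in each 2x2 block
```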
Flatten (Reshape for Classification): Converts 2D feature maps into a 1D vector. After two Conv-Pool blocks, we have 32 feature maps of size 6×6, i.e. 32 × 6 × 6 = 1,152 features. Flatten makes this a single vector of 1,152 numbers.
Dense Layer (Classification): Standard fully connected layer. Now that we've extracted meaningful features via convolution, we can use traditional neural network layers to combine them for classification (10 classes = 10 output neurons).
Recurrent Neural Networks (RNNs) are designed for sequential data where order matters and each element depends on previous ones. Think of reading a sentence: understanding the word "bank" requires knowing the previous context—"river bank" vs "savings bank" mean completely different things.
The key innovation of RNNs is the hidden state—a memory that persists across time steps. Imagine reading a book where you maintain a mental summary of the story so far. Each new sentence updates this summary, and your understanding of the new sentence depends on this accumulated knowledge.
The RNN Loop:
# Pseudocode for RNN processing
hidden_state = initialize_zeros()
for timestep in sequence:
    # Combine the current input with the previous memory
    hidden_state = tanh(W_input @ input[timestep] + W_hidden @ hidden_state)
    output[timestep] = W_output @ hidden_state

At each timestep, the RNN:
- Takes current input (e.g., a word embedding)
- Combines it with previous hidden state (memory from previous words)
- Produces new hidden state (updated understanding)
- Generates output (prediction or classification)
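The loop above becomes runnable with NumPy and random (untrained) weights; the shapes here are illustrative assumptions, not the project's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 28, 16, 10

# Untrained weights, random for illustration only
W_in  = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_h   = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_out = rng.normal(scale=0.1, size=(output_dim, hidden_dim))

sequence = rng.normal(size=(5, input_dim))   # 5 timesteps of 28 features
hidden = np.zeros(hidden_dim)

for x_t in sequence:
    # New memory mixes the current input with the previous memory
    hidden = np.tanh(W_in @ x_t + W_h @ hidden)

logits = W_out @ hidden   # classification read-out from the final state
print(logits.shape)       # (10,)
```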
The Vanishing Gradient Problem:
Early RNNs suffered a critical flaw: gradients (signals for learning) diminished exponentially as they traveled backward through many timesteps. This meant RNNs couldn't learn long-term dependencies—they had short-term memory loss.
LSTM: The Solution:
Long Short-Term Memory (LSTM) networks, introduced in 1997, solved this with gating mechanisms:
┌─────────────────────────────────┐
│ LSTM Cell │
│ │
│ ┌─────────┐ ┌─────────┐ │
│ │ Forget │ │ Input │ │
│ │ Gate │ │ Gate │ │
│ └────┬────┘ └────┬────┘ │
│ │ │ │
│ ▼ ▼ │
│ Cell State (long-term memory) │
│ │ │
│ ▼ │
│ ┌─────────┐ │
│ │ Output │ │
│ │ Gate │ │
│ └────┬────┘ │
│ ▼ │
│ Hidden State (short-term) │
└─────────────────────────────────┘
Three gates control information flow:
- Forget gate: Decides what to discard from long-term memory (sigmoid → 0-1 scale)
- Input gate: Decides what new information to store
- Output gate: Decides what to output from memory
This gating allows LSTMs to maintain information across hundreds of timesteps, enabling them to learn long-range dependencies in text, speech, and time series data.
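The three gates can be written out directly. Below is a single-step NumPy sketch of the standard LSTM equations (weight shapes and initialization are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    # W, U, b hold the four gate blocks stacked: [forget, input, cell, output]
    z = W @ x + U @ h + b
    n = h.shape[0]
    f = sigmoid(z[0:n])          # forget gate: what to drop from long-term memory
    i = sigmoid(z[n:2*n])        # input gate: what new info to store
    g = np.tanh(z[2*n:3*n])      # candidate cell update
    o = sigmoid(z[3*n:4*n])      # output gate: what to expose
    c_new = f * c + i * g        # cell state (long-term memory)
    h_new = o * np.tanh(c_new)   # hidden state (short-term memory)
    return h_new, c_new

rng = np.random.default_rng(1)
n, d = 8, 28                     # hidden size, input size (assumed for the demo)
W = rng.normal(scale=0.1, size=(4 * n, d))
U = rng.normal(scale=0.1, size=(4 * n, n))
b = np.zeros(4 * n)
h, c = np.zeros(n), np.zeros(n)
h, c = lstm_step(rng.normal(size=d), h, c, W, U, b)
print(h.shape, c.shape)          # (8,) (8,)
```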
Sequential data is everywhere: text, speech, time series, video, DNA sequences. The defining characteristic is that order matters and context depends on history.
The Problem with Feedforward Networks:
Imagine trying to predict the next word in: "The clouds gathered and it started to _____"
A feedforward network has no memory. If you feed it one word at a time, it makes independent predictions:
Input: "The" → Predict next word (no context!)
Input: "clouds" → Predict next word (forgot "The")
Input: "gathered" → Predict next word (forgot everything before)
To work around this, you'd need to feed the entire history at once, but:
- Variable-length sequences require fixed input size (padding/truncation)
- Position information is lost (word 1 and word 100 processed identically)
- No parameter sharing (word "the" at position 1 vs position 50 requires different parameters)
The RNN Solution:
RNNs process sequences step-by-step while maintaining state:
# Processing "The clouds gathered and it started to rain"
hidden = process("The", hidden) # hidden stores: "article seen"
hidden = process("clouds", hidden) # hidden stores: "weather-related"
hidden = process("gathered", hidden) # hidden stores: "weather worsening"
# ...
hidden = process("to", hidden) # hidden stores: complete context
output = predict_next(hidden)        # Predicts "rain" with full context

Real-World Applications:
Natural Language Processing:
- Machine translation: "I love neural networks" → "J'adore les réseaux neuronaux" (word order differs between languages)
- Sentiment analysis: "The movie was not very good" (negative despite "good")
- Text generation: Each generated word depends on all previous words
Time Series Forecasting:
- Stock prices: Today's price depends on historical trends
- Weather prediction: Temperature sequences exhibit temporal patterns
- Energy demand: Consumption patterns have daily/weekly cycles
Speech Recognition:
- Audio is naturally sequential (phonemes → words → sentences)
- "recognize speech" vs "wreck a nice beach" (sound identical, context differs)
Why Feedforward Networks Fail Here:
Feedforward networks assume independent, identically distributed (i.i.d.) data. Sequential data violates this:
- Not independent: Today's weather affects tomorrow's weather
- Order matters: "Dog bites man" ≠ "Man bites dog"
- Variable length: Sentences have different lengths
RNNs embrace this non-i.i.d. structure, making them the natural architecture for sequential data.
In this project, we achieved 86.2% accuracy with RNN on Fashion MNIST—surprisingly good for the "wrong" architecture. Understanding why it works (and why CNN is still better) reveals deep insights about architectural alignment.
How We Forced Images Into RNN:
Images are 2D grids, but RNNs expect 1D sequences. Our approach:
# Original image: 28×28 grid
image = [
[row_0_pixels...], # 28 pixels
[row_1_pixels...], # 28 pixels
...
[row_27_pixels...] # 28 pixels
]
# RNN treatment: 28 timesteps, 28 features each
# Timestep 0: Process row 0 (top of image)
# Timestep 1: Process row 1
# ...
# Timestep 27: Process row 27 (bottom of image)

The RNN builds a hidden state sequentially: "I've seen the top rows, now processing middle, now bottom."
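In code, no reshaping is even needed: a batch of 28×28 images already has the (timesteps, features) layout a Keras LSTM layer expects. A sketch with random data standing in for Fashion MNIST:

```python
import numpy as np

batch = np.random.rand(32, 28, 28)   # 32 images, 28x28 each

# For the RNN, axis 1 (rows) is the time axis, axis 2 the per-step features
timesteps, features = batch.shape[1], batch.shape[2]
print(timesteps, features)           # 28 28

# Timestep t of image 0 is simply its t-th row of pixels
row_0 = batch[0, 0]                  # top row: first thing the LSTM sees
row_27 = batch[0, 27]                # bottom row: last thing it sees
```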
Why It Works At All:
Some vertical structure exists in clothing images:
- Shirts have collars at top, torsos in middle, hems at bottom
- Shoes have laces/openings at top, soles at bottom
- Trousers have waistbands at top, legs extending downward
The LSTM's memory can capture: "Top rows looked like a collar → likely a shirt/dress." This top-down processing explains 86.2% accuracy—features do flow somewhat vertically.
Why CNN is Superior (88% vs 86.2%):
The critical difference: 2D spatial relationships vs 1D sequential processing.
CNNs process all spatial directions simultaneously:
A 3×3 convolution filter looks at:
┌───────┐
│ □ □ □ │ Horizontal, vertical, AND diagonal
│ □ X □ │ relationships captured simultaneously
│ □ □ □ │
└───────┘
For a shoe image, CNN sees:
- Laces at top + sole at bottom (vertical)
- Rounded toe shape (horizontal curves)
- Diagonal stitching patterns
- Textural consistency across the entire region
RNN processes one row at a time:
Row 0: [pixels...] → hidden_state_0
Row 1: [pixels...] → hidden_state_1
For the same shoe:
- Sees laces in early timesteps
- Sees sole in late timesteps
- Cannot directly compare row 5 with row 15 (spatially distant rows)
- Diagonal patterns split across many timesteps (hard to detect)
- Must rely on LSTM memory to maintain spatial coherence
The Gap Explained:
The 1.8% accuracy difference (88% - 86.2%) represents patterns that are:
- Easy for CNNs: Diagonal stitching, circular buttons, horizontal stripes
- Hard for RNNs: Features spanning non-consecutive rows, requiring memory across 10+ timesteps
Quantitative Analysis:
Looking at confusion matrices, RNN struggles most with Category 6 (Shirts):
CNN: Shirt → T-shirt misclassifications: 136
RNN: Shirt → T-shirt misclassifications: ~155 (estimated)
Why? Shirt collars (top) vs hemlines (bottom) are far apart spatially. RNNs must maintain this information across 15-20 timesteps, while CNNs process it in parallel.
The Fundamental Insight:
Architecture should match data structure:
- Sequential data (text, audio, time series) → RNN's temporal processing is ideal
- Spatial data (images, graphs) → CNN's parallel spatial processing is ideal
RNNs on images are like reading a painting line-by-line with a narrow slit—you can understand it, but you miss the holistic spatial composition that makes it meaningful.
This project's results (88% vs 86.2%) empirically validate this principle: the 1.8% gap is the cost of architectural mismatch.
Training curves reveal not just performance but the learning dynamics of each architecture. In this project, both CNN and RNN exhibited healthy convergence patterns, but with subtle differences.
Key Indicators:
- Training Accuracy Rising: Model is learning patterns from training data
- Validation Accuracy Rising: Model generalizes to unseen data
- Small Gap Between Curves: Little overfitting (healthy generalization)
- Plateau After ~5 Epochs: Model has learned most available patterns
CNN vs RNN Dynamics:
In our results, CNN showed:
- Steeper initial learning curve (faster convergence)
- Higher final accuracy (88% vs 86.2%)
- Slightly more stable validation (less oscillation)
RNN showed:
- Gradual, steady learning (LSTM gating introduces stability)
- Lower final accuracy (architectural mismatch)
- Occasional validation dips (LSTM's sequential nature causes sensitivity to initialization)
What Would Be Concerning:
Overfitting Pattern:
Training: 95% → 98% → 99% (keeps rising)
Validation: 85% → 85% → 84% (plateaus or drops)
→ Model memorizing training data, not generalizing
Neither architecture showed this pattern, confirming healthy training.
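The diverging pattern can even be checked programmatically; a heuristic sketch (the function name, window, and tolerance are arbitrary example choices):

```python
def looks_overfit(train_acc, val_acc, window=3, tol=0.0):
    # Flag runs where training accuracy keeps rising while validation
    # accuracy stalls or drops over the last `window` epochs
    t, v = train_acc[-window:], val_acc[-window:]
    train_rising = all(b > a for a, b in zip(t, t[1:]))
    val_flat_or_falling = v[-1] <= v[0] + tol
    return train_rising and val_flat_or_falling

# The concerning pattern sketched above:
print(looks_overfit([0.95, 0.98, 0.99], [0.85, 0.85, 0.84]))  # True
# A healthy pattern (both curves rising together):
print(looks_overfit([0.85, 0.88, 0.90], [0.84, 0.87, 0.88]))  # False
```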
What They Reveal:
Each model output is a 10-element probability vector summing to 1.0:
# Example: Model is confident this is class 1 (Trousers)
[0.01, 0.92, 0.03, 0.01, 0.01, 0.00, 0.01, 0.00, 0.00, 0.01]
#  ↑     ↑                                          ↑
# T-shirt  Trouser (predicted)                     Bag

Calibration Analysis:
The bar charts in Results section show:
- Green bar: True label (ground truth)
- Blue bar: Predicted label (if different)
- Height: Model confidence
Interpretation:
- Tall blue bar matching green: High confidence, correct prediction ✓
- Tall blue bar not matching green: High confidence, wrong prediction (overconfident) ✗
- Flat distribution: Model uncertain (healthy when classes are ambiguous)
CNN showed taller bars (higher confidence) than RNN, indicating more decisive feature extraction from spatial patterns.
Practical Implications:
In production systems, probability distributions enable:
- Confidence thresholds: Only act on predictions > 80% confidence
- Human-in-the-loop: Flag uncertain predictions (< 60%) for manual review
- Multi-label scenarios: If two classes have similar probabilities (both > 40%), suggest both
- Error analysis: Systematic confusion patterns guide model improvements
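These decision rules reduce to a few lines of code; a sketch (the function name and thresholds are the example values from the list above, not tuned recommendations):

```python
import numpy as np

def route_prediction(probs: np.ndarray,
                     act_threshold: float = 0.80,
                     review_threshold: float = 0.60) -> str:
    # probs: 10-element softmax output summing to ~1.0
    top = float(probs.max())
    if top >= act_threshold:
        return "act"            # confident enough to use automatically
    if top < review_threshold:
        return "human_review"   # too uncertain, flag for manual check
    return "caution"            # middle ground

confident = np.array([0.01, 0.92, 0.03, 0.01, 0.01,
                      0.00, 0.01, 0.00, 0.00, 0.01])
uncertain = np.full(10, 0.1)    # flat distribution: model has no idea
print(route_prediction(confident), route_prediction(uncertain))  # act human_review
```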
❌ "Higher accuracy always means better model"
✅ Correct understanding: Must check generalization gap (train-test difference) and per-class performance. 90% accuracy with 99% on easy classes and 50% on hard classes is worse than 88% balanced across all classes.
❌ "RNNs failed because they're bad at images"
✅ Correct understanding: RNNs achieved 86.2% accuracy (good!), just 1.8% below CNN. The gap validates architectural principles but doesn't mean RNNs are useless for images (they work, just suboptimal).
❌ "Fewer parameters = better model (less overfitting)"
✅ Correct understanding: Parameter count should match problem complexity. RNN's 84K parameters weren't enough to capture 2D spatial patterns as effectively as CNN's 140K parameters.
❌ "Confusion matrix diagonal should be 100%"
✅ Correct understanding: Perfect classification (100% on all classes) is unrealistic and often indicates overfitting. 88% with balanced errors across classes demonstrates healthy generalization.
❌ "LSTMs are just better RNNs, always use them"
✅ Correct understanding: LSTMs solve vanishing gradients for long sequences, but add complexity. For short sequences (< 30 steps), vanilla RNNs often suffice. Our 28 timesteps (image rows) is borderline—LSTM's gating helped but wasn't critical.
❌ "Pooling layers lose information"
✅ Correct understanding: MaxPooling discards precise spatial positions but preserves feature presence. This is desirable—"dog detected" matters more than "dog at exact pixel 247." The resulting translation invariance improves generalization.
- Data Augmentation
- Random rotations (±15°), flips, zooms
- Expected impact: +2-3% accuracy by increasing training diversity
- Focus: Category 6 (Shirts) would benefit most
- Deeper CNN Architecture
- Add 3rd convolutional block (64 filters)
- Expected impact: +1-2% accuracy, better feature extraction
- Trade-off: ~2x more parameters, slightly slower training
- Learning Rate Scheduling
- Reduce learning rate after plateau (e.g., epoch 5)
- Expected impact: Squeeze out additional 0.5-1% accuracy
- Implementation: ReduceLROnPlateau callback
- Transfer Learning
- Use pre-trained VGG or ResNet (ImageNet weights)
- Fine-tune on Fashion MNIST
- Expected impact: 92-95% accuracy (state-of-the-art range)
- Ensemble Methods
- Combine CNN + deeper CNN + residual connections
- Expected impact: +2-3% accuracy via model diversity
- Cost: 3x inference time
- Hierarchical Classification
- Level 1: "Upper body" vs "Lower body" vs "Footwear" vs "Accessories"
- Level 2: Specific categories within each group
- Expected impact: Reduce Shirt confusion (exploit category relationships)
- Attention Mechanisms
- Add attention layers to CNN (focus on discriminative regions)
- Expected impact: Better interpretability, potential +1% accuracy
- LeCun et al. (1998), "Gradient-Based Learning Applied to Document Recognition": foundational CNN architecture (LeNet) establishing convolutional neural networks
- Hochreiter & Schmidhuber (1997), "Long Short-Term Memory": original LSTM paper introducing gating mechanisms for RNNs
- Xiao et al. (2017), "Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms": introduction of the Fashion MNIST dataset as an MNIST replacement
- Goodfellow et al. (2016), "Deep Learning": comprehensive reference for CNN and RNN fundamentals (Chapters 9-10)
- Convolutional Neural Networks (CNNs): Extract spatial hierarchies via convolution
- Recurrent Neural Networks (RNNs): Process sequential data with temporal dependencies
- LSTM (Long Short-Term Memory): RNN variant with gating to prevent vanishing gradients
- Confusion Matrix: Visualization of per-class classification performance
- Train-Validation-Test Split: Essential methodology for robust model evaluation
- Fashion MNIST: GitHub Repository
- 70,000 grayscale images (28×28 pixels)
- 10 clothing categories
- Balanced classes (7,000 samples each)
- Training: 60,000 | Test: 10,000
Course: COMP 263 - Deep Learning
Institution: Centennial College
Term: Fall 2024
Grade: High Honors (GPA: 4.45/4.5)
- Architecture Comparison: Systematic evaluation of CNN vs RNN for image classification
- Model Evaluation: Comprehensive metrics (accuracy, confusion matrix, probability distributions)
- Statistical Analysis: Train-validation-test methodology with proper random seed control
- Data Preprocessing: Normalization, one-hot encoding, dataset splitting
- Visualization: Professional matplotlib/seaborn plots (training curves, confusion matrices)
- Critical Thinking: Identified root cause of category 6 (Shirt) poor performance through confusion matrix analysis
Author: Matheus Ferreira Teixeira
GitHub: github.com/domvito55
LinkedIn: linkedin.com/in/mathteixeira
This project is academic coursework at Centennial College. Free to use for learning purposes with proper attribution.