Comparative study of Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN/LSTM) for image classification. Two architectures were built and evaluated on the Fashion MNIST dataset, reaching 88% test accuracy with the CNN and 86.2% with the RNN, demonstrating the CNN's advantage for spatial pattern recognition.
- 🎯 Project Overview
- 🛠️ Technical Stack
- 🚀 Installation & Quick Start
- 📁 Project Structure
- ⚠️ Known Limitations
- 🏗️ Architecture
- 📊 Results
- 💡 Key Learnings
- 📖 Supporting Theory
- 🚀 Future Enhancements
- 📚 References
Objective: Compare CNN and RNN architectures for classifying 10 categories of clothing items from the Fashion MNIST dataset, demonstrating when to use each architecture type.
Key Achievements:
- ✅ CNN achieved 88% test accuracy vs RNN's 86.2% - validated CNN superiority for spatial tasks
- ✅ Identified category 6 (Shirts) as most challenging (65.9% accuracy) due to visual similarity with related classes
- ✅ Demonstrated proper train-validation-test methodology with balanced generalization (no overfitting)
- ✅ Built comprehensive evaluation pipeline: training curves, confusion matrices, probability distributions
- Automated clothing categorization for online retailers
- Visual search systems ("find similar items")
- Inventory management with image recognition
- Size recommendation based on garment type detection
- Multi-class image classification pipelines
- Transfer learning foundations for custom datasets
- Confidence-based decision making (prediction probabilities)
- Production deployment of CNN models
- Customer preference analysis from product images
- Trend detection in fashion categories
- Automated tagging for product databases
- Quality control with visual inspection
Framework: TensorFlow 2.10.0 / Keras
Dataset: Fashion MNIST (70,000 images: 60,000 train, 10,000 test)
Training: 8 epochs, batch size 256, Adam optimizer
Hardware: CPU (GPU optional)
Environment: Python 3.10 (TensorFlow 2.10 compatibility requirement)
tensorflow==2.10.0 # Deep learning framework (pinned for reproducibility with Python 3.10)
numpy==1.23.5 # Numerical operations and array manipulation
matplotlib==3.6.2 # Training curves and probability visualizations
seaborn==0.12.1 # Confusion matrix heatmaps
scikit-learn==1.1.3 # Train-test split and confusion matrix

Version Compatibility Note: All dependencies are 2022 stable releases tested together; versions are pinned for reproducibility with Python 3.10.
Fashion MNIST
- 70,000 grayscale images (28×28 pixels)
- 10 classes of clothing and accessories
- Training set: 60,000 images
- Test set: 10,000 images
- Balanced classes (7,000 samples per category)
Classes:
- T-shirt/top
- Trouser
- Pullover
- Dress
- Coat
- Sandal
- Shirt
- Sneaker
- Bag
- Ankle boot
Source: Built into tf.keras.datasets.fashion_mnist
# Create environment from YAML
conda env create -f environment.yml
conda activate dl-cnn-rnn
# Run complete analysis
python matheus_CNN_RNN.py

Expected output:
- Dataset size and image resolution
- CNN model summary and training progress
- Training vs validation accuracy plots
- Test accuracy evaluation
- Prediction probability distributions (4 samples)
- Confusion matrix visualization
- RNN training and evaluation (same sequence)
Runtime: ~5-10 minutes on CPU, ~2-3 minutes on GPU
# Create virtual environment
python -m venv dl-env
source dl-env/bin/activate # Linux/Mac
# or
dl-env\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
# Run analysis
python matheus_CNN_RNN.py

Issue: TensorFlow warnings about oneDNN optimizations
- Solution: Warnings suppressed in code via environment variables (already handled)
Issue: Out of memory errors
- Solution: Reduce batch size from 256 to 128 (line 76 in script)
00_CNN_RNN_ImageClassification/
├── Diagrams.vsdx # Network architecture diagrams
├── environment.yml # Conda environment specification
├── Instructions.pdf
├── matheus_CNN_RNN.py # Complete CNN + RNN implementation
├── README.md # This file
├── requirements.txt # pip dependencies
└── WrittenAnalysis.docx # Detailed analysis and diagrams
Users should be aware of the following constraints:
- Dataset Scope: Fashion MNIST only (28×28 grayscale)
- Impact: Simplified dataset not representative of real-world complexity
- Note: Excellent for learning fundamentals, not production-ready
- Low Resolution: 28×28 pixels
- Impact: Limited detail for fine-grained classification
- Example: Shirts vs T-shirts difficult to distinguish at this resolution
- RNN Architecture Choice: LSTM treats image rows as a sequence
- Impact: Not optimal for spatial data (rows lack temporal relationship)
- Note: Demonstrates RNN capabilities but highlights architectural mismatch
- Single Model Per Architecture: No ensemble or hyperparameter tuning
- Impact: Results show single configuration, not optimal performance
- Fixed Random Seed
- Impact: Results reproducible but single train-validation split
- Workaround: Multiple runs with different seeds for robust evaluation
Input Layer: 28×28×1 (grayscale images)
↓
Conv2D Layer 1: 32 filters, 3×3 kernel, ReLU activation
↓
MaxPooling2D: 2×2 window (spatial downsampling)
↓
Conv2D Layer 2: 32 filters, 3×3 kernel, ReLU activation
↓
MaxPooling2D: 2×2 window
↓
Flatten: Convert 2D feature maps to 1D
↓
Dense Layer: 100 neurons, ReLU activation
↓
Output Layer: 10 neurons, Softmax activation (10 classes)
Total Parameters: ~140,000
Design Principles:
- Progressive spatial downsampling (28×28 → 14×14 → 7×7)
- Hierarchical feature extraction (edges → textures → shapes)
- Small kernels (3×3) following modern CNN best practices
- ReLU activation prevents vanishing gradients
Input Layer: 28 timesteps × 28 features
(each image row treated as timestep)
↓
LSTM Layer: 128 hidden units
↓
Output Layer: 10 neurons, Softmax activation
Total Parameters: ~84,000 (40% fewer than CNN)
Design Principles:
- Treat image rows as temporal sequence
- LSTM captures long-range dependencies
- Simpler architecture than CNN (fewer parameters)
- Demonstrates RNN capability despite architectural mismatch
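The ~84K figure can be sanity-checked by hand. The sketch below tallies parameters using the standard single-bias LSTM convention (an estimate; exact totals shift slightly with the framework's bias conventions):

```python
def lstm_params(input_dim: int, units: int) -> int:
    # Four gate blocks (forget, input, cell, output), each with
    # input weights, recurrent weights, and one bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_params(inputs: int, outputs: int) -> int:
    return inputs * outputs + outputs

lstm = lstm_params(28, 128)      # 28 pixel-features per row, 128 hidden units
head = dense_params(128, 10)     # softmax head over 10 classes
print(lstm, head, lstm + head)   # 80384 1290 81674
```

The total lands in the low-80K range, in line with the rounded ~84K above and roughly 40% below the CNN's count.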
| Metric | CNN | RNN | Winner |
|---|---|---|---|
| Training Accuracy | 89.99% | 87.23% | CNN |
| Validation Accuracy | ~89% | ~87% | CNN |
| Test Accuracy | 88.0% | 86.2% | CNN (+1.8%) |
| Training Speed | Faster | Slower | CNN |
| Parameters | ~140K | ~84K | RNN (simpler) |
| Generalization | Excellent | Excellent | Tie |
| Class | Name | Accuracy | Confusion Pattern |
|---|---|---|---|
| 0 | T-shirt/top | ~84% | Confused with shirts (81 errors) |
| 1 | Trouser | >97% | Highly distinctive shape |
| 2 | Pullover | ~85% | Some overlap with coats |
| 3 | Dress | ~87% | Clear silhouette |
| 4 | Coat | ~83% | Confused with pullovers/shirts |
| 5 | Sandal | ~94% | Distinctive footwear pattern |
| 6 | Shirt | 65.9% | Most challenging (see analysis) |
| 7 | Sneaker | ~92% | Clear footwear pattern |
| 8 | Bag | >97% | Most distinctive shape |
| 9 | Ankle boot | ~91% | Clear boot silhouette |
Problem: Only 65.9% accuracy (659/1000 correct) - significantly below overall 88%
Misclassification Pattern:
- 136 predicted as T-shirt/top (0)
- 90 predicted as Pullover (2)
- 66 predicted as Coat (4)
- 30 predicted as Dress (3)
Reverse Confusion (other classes → Shirt):
- T-shirts → Shirts: 81 times
- Coats → Shirts: 82 times
- Pullovers → Shirts: 66 times
Root Cause: High visual similarity at 28×28 resolution. All categories share vertical orientation and similar upper-body shapes.
CNN Training Dynamics:
- Smooth convergence to ~89-90% accuracy
- Training and validation curves closely aligned (gap <1%)
- No overfitting indicators
- Stable plateau by epoch 6-7
RNN Training Dynamics:
- Similar convergence pattern to CNN
- Slightly higher variance in validation curve
- Final accuracy ~2% lower than CNN
- Also no overfitting (healthy generalization)
Strong Diagonal: Categories 1, 5, 7, 8, 9 have >900/1000 correct predictions
Bidirectional Confusion: Shirts ↔ T-shirts, Shirts ↔ Coats show symmetric errors
Implication: Feature space overlap suggests hierarchical classification could help
Key Insight: CNN achieved 88% accuracy vs RNN's 86.2% because convolutional layers are designed for spatial hierarchies (edges → textures → shapes), while RNNs are optimized for sequential data.
What we discovered: The 1.8% accuracy gap validates architectural choice. CNNs extract local patterns (edges, corners) via convolution, then combine them hierarchically. RNNs treat rows as sequence, missing 2D spatial relationships.
Practical implication: For image tasks, always prefer CNNs (or Vision Transformers for high-resolution). Reserve RNNs for true sequential data (text, audio, time series).
Key Insight: Both models showed <1% gap between training and validation accuracy, confirming proper regularization without overfitting or underfitting.
What we observed:
- CNN: 89.99% train, 88% test (1.99% gap)
- RNN: 87.23% train, 86.2% test (1.03% gap)
Practical implication: Train-validation-test split (80-20 + separate test) is essential. Monitoring curves during training prevents wasting compute on diverging models.
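The split itself is one line with scikit-learn's train_test_split; here is a plain-NumPy sketch of the same idea (illustrative, not the project's exact code):

```python
import numpy as np

def train_val_split(x, y, val_frac=0.2, seed=42):
    # Shuffle indices with a fixed seed, then carve off the head for validation
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    n_val = int(len(x) * val_frac)
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    return x[train_idx], y[train_idx], x[val_idx], y[val_idx]

x = np.arange(100).reshape(100, 1)
y = np.arange(100)
x_tr, y_tr, x_val, y_val = train_val_split(x, y)
print(len(x_tr), len(x_val))  # 80 20
```

Fixing the seed makes the split reproducible, which is exactly the trade-off discussed under Known Limitations.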
Key Insight: Shirt classification (65.9%) was 22 percentage points below the overall accuracy due to high visual similarity with T-shirts, pullovers, and coats at 28×28 resolution.
What we learned: Confusion matrix analysis reveals which categories need improvement. Symmetric confusion (A→B and B→A) suggests feature space overlap, not model bias.
Practical implication: In production, use per-class metrics (precision, recall, F1) alongside overall accuracy. Identify weak categories and collect more diverse samples or apply data augmentation.
Key Insight: RNN had 40% fewer parameters (~84K vs ~140K) but achieved lower accuracy, demonstrating that architectural match to problem structure matters more than parameter count.
What we discovered: Fewer parameters can mean underfitting if architecture doesn't capture problem structure. RNNs process sequences causally (row N depends on row N-1), but image rows lack this dependency.
Practical implication: Don't optimize for parameter count alone. Choose architecture based on data structure: CNNs for spatial, RNNs for temporal, Transformers for long-range dependencies.
Convolutional Neural Networks, or CNNs, are specialized neural networks designed for processing grid-like data—most notably, images. The revolutionary insight behind CNNs is borrowed from biological vision: just as neurons in the visual cortex respond to stimuli in specific regions of the visual field (called receptive fields), CNN neurons process small patches of the input image rather than the entire image at once.
The key operation in a CNN is convolution: imagine sliding a small window (the "kernel" or "filter") across an image, performing the same mathematical operation at each position. For example, a 3×3 filter might detect horizontal edges by looking at brightness differences between the top and bottom of its window:
# Edge detection filter example
horizontal_edge_filter = [
    [-1, -1, -1],  # Top row (darker above)
    [ 0,  0,  0],  # Middle row
    [ 1,  1,  1],  # Bottom row (brighter below)
]
# When this slides over an image, it produces high values
# wherever there's a horizontal edge

This simple 3×3 filter has only 9 parameters (the values in the matrix). But here's the magic: we use this same filter across the entire image. Whether an edge appears in the top-left corner or bottom-right corner, the same filter detects it. This property is called translation invariance: patterns are recognized regardless of where they appear.
Parameter Sharing is the secret sauce: instead of having different parameters for every position in the image (which would be millions of parameters), we reuse the same filter everywhere. A single 3×3 filter applied to a 28×28 image involves only 9 parameters but produces 784 outputs (one per position). This is the fundamental reason CNNs are so efficient.
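To make parameter sharing concrete, here is a minimal NumPy sketch that slides the 9-parameter horizontal-edge filter over a toy image (loops kept explicit for clarity; real frameworks use optimized convolution ops):

```python
import numpy as np

kernel = np.array([[-1, -1, -1],
                   [ 0,  0,  0],
                   [ 1,  1,  1]])  # horizontal-edge filter from above

# Toy 6x6 image: dark top half (0), bright bottom half (1)
image = np.zeros((6, 6))
image[3:, :] = 1.0

# Slide the same 9-parameter kernel over every 3x3 patch
out = np.zeros((4, 4))
for i in range(4):
    for j in range(4):
        out[i, j] = np.sum(image[i:i+3, j:j+3] * kernel)

print(out)
# Rows of `out` straddling the dark/bright boundary respond strongest
```

The same 9 numbers fire at every horizontal position along the edge, which is translation invariance in action.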
Hierarchical Feature Learning: Early layers detect simple patterns (edges, corners). Middle layers combine these into textures and parts (collars, pockets). Deep layers recognize complete objects (shirts, shoes). This mirrors how biological vision systems work, building complex understanding from simple components.
Before CNNs, the standard approach to image classification was using fully connected networks (what we now call "feedforward" or "dense" networks). Let's understand why this was catastrophically inefficient.
Consider our Fashion MNIST images: 28×28 pixels = 784 input features. If we connect this directly to a hidden layer with just 500 neurons (a modest size), we need:
784 inputs × 500 neurons = 392,000 parameters (just for the first layer!)
Now imagine real images. A small 200×200 color image has:
200 × 200 × 3 (RGB) = 120,000 input features
First hidden layer (2000 neurons):
120,000 × 2000 = 240 million parameters
For a single layer. This is computationally absurd.
The fundamental problem: Fully connected networks treat every pixel as independent. They can't exploit the fact that nearby pixels are related (spatial locality) or that a pattern learned in one part of the image should transfer to another part (translation invariance). Every connection must be learned independently, leading to parameter explosion.
Historical Context:
The breakthrough came in 1998 with Yann LeCun's LeNet-5, which successfully read handwritten digits for check processing. LeNet introduced convolution and pooling, reducing parameters from millions to tens of thousands while improving accuracy.
But CNNs didn't dominate until 2012, when AlexNet (Krizhevsky et al.) won ImageNet competition with 15.3% error rate—crushing the 26.2% error of traditional methods. AlexNet had 60 million parameters but processed 224×224 images—something impossible with fully connected architectures.
The Paradigm Shift:
CNNs replaced:
❌ Every pixel connects to every neuron (O(image_size × neurons))
✅ Small filters slide across image (O(filter_size × neurons))
For our 28×28 Fashion MNIST with 32 filters of size 3×3:
Fully connected: 28 × 28 × 32 = 25,088 parameters
Convolutional: 3 × 3 × 32 = 288 parameters (87× reduction!)
This efficiency enables deep networks (many layers) without computational explosion.
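The comparison above is plain arithmetic, worth spelling out:

```python
# Fully connected: all 784 pixels feed each of 32 neurons
fc_params = 28 * 28 * 32        # 25,088 weights

# Convolutional: one shared 3x3 kernel per filter, 32 filters
conv_params = 3 * 3 * 32        # 288 weights

print(fc_params, conv_params, fc_params // conv_params)   # 25088 288 87
```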
Our CNN follows the classic pattern:
Conv2D → MaxPooling → Conv2D → MaxPooling → Flatten → Dense
Let's break down each component:
Conv2D Layer (Feature Detection):
Conv2D(32 filters, kernel_size=3×3, activation='ReLU')

- 32 filters: 32 different patterns to detect (edges at various angles, textures, etc.)
- 3×3 kernel: Each filter looks at 3×3 pixel neighborhoods
- ReLU activation: Introduces non-linearity (f(x) = max(0, x))
What happens: The input 28×28 image becomes 32 "feature maps" of size 26×26 (slight shrinkage from convolution borders). Each feature map highlights where a specific pattern was detected.
MaxPooling (Dimensionality Reduction):
MaxPooling2D(pool_size=2×2)

- Divides each feature map into 2×2 blocks
- Keeps only the maximum value from each block
- 26×26 → 13×13 (halves spatial dimensions)
Purpose: Provides translation invariance (exact position doesn't matter) and reduces computation. The strongest activation in each region is what matters, not precisely where it occurred.
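A 2×2 max-pool is a few lines of NumPy (an illustrative sketch; frameworks provide fused, strided implementations):

```python
import numpy as np

def max_pool_2x2(fmap: np.ndarray) -> np.ndarray:
    # Split each 2x2 block onto its own axis pair, then take the max
    h, w = fmap.shape
    return fmap.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 1],
                 [0, 0, 5, 6],
                 [1, 2, 7, 8]])
print(max_pool_2x2(fmap))
# Keeps only the strongest activation in each 2x2 block
```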
Flatten (Reshape for Classification): Converts 2D feature maps into a 1D vector. After two Conv-Pool blocks, we have 32 feature maps of size 6×6, i.e. 32 × 6 × 6 = 1,152 features. Flatten makes this a single vector of 1,152 numbers.
Dense Layer (Classification): Standard fully connected layer. Now that we've extracted meaningful features via convolution, we can use traditional neural network layers to combine them for classification (10 classes = 10 output neurons).
Recurrent Neural Networks (RNNs) are designed for sequential data where order matters and each element depends on previous ones. Think of reading a sentence: understanding the word "bank" requires knowing the previous context—"river bank" vs "savings bank" mean completely different things.
The key innovation of RNNs is the hidden state—a memory that persists across time steps. Imagine reading a book where you maintain a mental summary of the story so far. Each new sentence updates this summary, and your understanding of the new sentence depends on this accumulated knowledge.
The RNN Loop:
# Pseudocode for RNN processing
hidden_state = initialize_zeros()
for timestep in sequence:
    # Combine the current input with the previous memory
    hidden_state = tanh(W_input @ input[timestep] + W_hidden @ hidden_state)
    output[timestep] = W_output @ hidden_state

At each timestep, the RNN:
- Takes current input (e.g., a word embedding)
- Combines it with previous hidden state (memory from previous words)
- Produces new hidden state (updated understanding)
- Generates output (prediction or classification)
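The loop above becomes runnable with NumPy and random (untrained) weights; the shapes here are illustrative assumptions, not the project's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 28, 16, 10

# Untrained weights, random for illustration only
W_in  = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_h   = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_out = rng.normal(scale=0.1, size=(output_dim, hidden_dim))

sequence = rng.normal(size=(5, input_dim))   # 5 timesteps of 28 features
hidden = np.zeros(hidden_dim)

for x_t in sequence:
    # New memory mixes the current input with the previous memory
    hidden = np.tanh(W_in @ x_t + W_h @ hidden)

logits = W_out @ hidden   # classification read-out from the final state
print(logits.shape)       # (10,)
```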
The Vanishing Gradient Problem:
Early RNNs suffered a critical flaw: gradients (signals for learning) diminished exponentially as they traveled backward through many timesteps. This meant RNNs couldn't learn long-term dependencies—they had short-term memory loss.
LSTM: The Solution:
Long Short-Term Memory (LSTM) networks, introduced in 1997, solved this with gating mechanisms:
┌─────────────────────────────────┐
│ LSTM Cell │
│ │
│ ┌─────────┐ ┌─────────┐ │
│ │ Forget │ │ Input │ │
│ │ Gate │ │ Gate │ │
│ └────┬────┘ └────┬────┘ │
│ │ │ │
│ ▼ ▼ │
│ Cell State (long-term memory) │
│ │ │
│ ▼ │
│ ┌─────────┐ │
│ │ Output │ │
│ │ Gate │ │
│ └────┬────┘ │
│ ▼ │
│ Hidden State (short-term) │
└─────────────────────────────────┘
Three gates control information flow:
- Forget gate: Decides what to discard from long-term memory (sigmoid → 0-1 scale)
- Input gate: Decides what new information to store
- Output gate: Decides what to output from memory
This gating allows LSTMs to maintain information across hundreds of timesteps, enabling them to learn long-range dependencies in text, speech, and time series data.
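The three gates can be written out directly. Below is a single-step NumPy sketch of the standard LSTM equations (weight shapes and initialization are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    # W, U, b hold the four gate blocks stacked: [forget, input, cell, output]
    z = W @ x + U @ h + b
    n = h.shape[0]
    f = sigmoid(z[0:n])          # forget gate: what to drop from long-term memory
    i = sigmoid(z[n:2*n])        # input gate: what new info to store
    g = np.tanh(z[2*n:3*n])      # candidate cell update
    o = sigmoid(z[3*n:4*n])      # output gate: what to expose
    c_new = f * c + i * g        # cell state (long-term memory)
    h_new = o * np.tanh(c_new)   # hidden state (short-term memory)
    return h_new, c_new

rng = np.random.default_rng(1)
n, d = 8, 28                     # hidden size, input size (assumed for the demo)
W = rng.normal(scale=0.1, size=(4 * n, d))
U = rng.normal(scale=0.1, size=(4 * n, n))
b = np.zeros(4 * n)
h, c = np.zeros(n), np.zeros(n)
h, c = lstm_step(rng.normal(size=d), h, c, W, U, b)
print(h.shape, c.shape)          # (8,) (8,)
```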
Sequential data is everywhere: text, speech, time series, video, DNA sequences. The defining characteristic is that order matters and context depends on history.
The Problem with Feedforward Networks:
Imagine trying to predict the next word in: "The clouds gathered and it started to _____"
A feedforward network has no memory. If you feed it one word at a time, it makes independent predictions:
Input: "The" → Predict next word (no context!)
Input: "clouds" → Predict next word (forgot "The")
Input: "gathered" → Predict next word (forgot everything before)
To work around this, you'd need to feed the entire history at once, but:
- Variable-length sequences require fixed input size (padding/truncation)
- Position information is lost (word 1 and word 100 processed identically)
- No parameter sharing (word "the" at position 1 vs position 50 requires different parameters)
The RNN Solution:
RNNs process sequences step-by-step while maintaining state:
# Processing "The clouds gathered and it started to rain"
hidden = process("The", hidden) # hidden stores: "article seen"
hidden = process("clouds", hidden) # hidden stores: "weather-related"
hidden = process("gathered", hidden) # hidden stores: "weather worsening"
# ...
hidden = process("to", hidden) # hidden stores: complete context
output = predict_next(hidden)        # Predicts "rain" with full context

Real-World Applications:
Natural Language Processing:
- Machine translation: "I love neural networks" → "J'adore les réseaux neuronaux" (word order differs between languages)
- Sentiment analysis: "The movie was not very good" (negative despite "good")
- Text generation: Each generated word depends on all previous words
Time Series Forecasting:
- Stock prices: Today's price depends on historical trends
- Weather prediction: Temperature sequences exhibit temporal patterns
- Energy demand: Consumption patterns have daily/weekly cycles
Speech Recognition:
- Audio is naturally sequential (phonemes → words → sentences)
- "recognize speech" vs "wreck a nice beach" (sound identical, context differs)
Why Feedforward Networks Fail Here:
Feedforward networks assume independent, identically distributed (i.i.d.) data. Sequential data violates this:
- Not independent: Today's weather affects tomorrow's weather
- Order matters: "Dog bites man" ≠ "Man bites dog"
- Variable length: Sentences have different lengths
RNNs embrace this non-i.i.d. structure, making them the natural architecture for sequential data.
In this project, we achieved 86.2% accuracy with RNN on Fashion MNIST—surprisingly good for the "wrong" architecture. Understanding why it works (and why CNN is still better) reveals deep insights about architectural alignment.
How We Forced Images Into RNN:
Images are 2D grids, but RNNs expect 1D sequences. Our approach:
# Original image: 28×28 grid
image = [
[row_0_pixels...], # 28 pixels
[row_1_pixels...], # 28 pixels
...
[row_27_pixels...] # 28 pixels
]
# RNN treatment: 28 timesteps, 28 features each
# Timestep 0: Process row 0 (top of image)
# Timestep 1: Process row 1
# ...
# Timestep 27: Process row 27 (bottom of image)

The RNN builds a hidden state sequentially: "I've seen the top rows, now processing middle, now bottom."
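In code, no reshaping is even needed: a batch of 28×28 images already has the (timesteps, features) layout a Keras LSTM layer expects. A sketch with random data standing in for Fashion MNIST:

```python
import numpy as np

batch = np.random.rand(32, 28, 28)   # 32 images, 28x28 each

# For the RNN, axis 1 (rows) is the time axis, axis 2 the per-step features
timesteps, features = batch.shape[1], batch.shape[2]
print(timesteps, features)           # 28 28

# Timestep t of image 0 is simply its t-th row of pixels
row_0 = batch[0, 0]                  # top row: first thing the LSTM sees
row_27 = batch[0, 27]                # bottom row: last thing it sees
```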
Why It Works At All:
Some vertical structure exists in clothing images:
- Shirts have collars at top, torsos in middle, hems at bottom
- Shoes have laces/openings at top, soles at bottom
- Trousers have waistbands at top, legs extending downward
The LSTM's memory can capture: "Top rows looked like a collar → likely a shirt/dress." This top-down processing explains 86.2% accuracy—features do flow somewhat vertically.
Why CNN is Superior (88% vs 86.2%):
The critical difference: 2D spatial relationships vs 1D sequential processing.
CNNs process all spatial directions simultaneously:
A 3×3 convolution filter looks at:
┌───────┐
│ □ □ □ │ Horizontal, vertical, AND diagonal
│ □ X □ │ relationships captured simultaneously
│ □ □ □ │
└───────┘
For a shoe image, CNN sees:
- Laces at top + sole at bottom (vertical)
- Rounded toe shape (horizontal curves)
- Diagonal stitching patterns
- Textural consistency across the entire region
RNN processes one row at a time:
Row 0: [pixels...] → hidden_state_0
Row 1: [pixels...] → hidden_state_1
For the same shoe:
- Sees laces in early timesteps
- Sees sole in late timesteps
- Cannot directly compare row 5 with row 15 (spatially distant rows)
- Diagonal patterns split across many timesteps (hard to detect)
- Must rely on LSTM memory to maintain spatial coherence
The Gap Explained:
The 1.8% accuracy difference (88% - 86.2%) represents patterns that are:
- Easy for CNNs: Diagonal stitching, circular buttons, horizontal stripes
- Hard for RNNs: Features spanning non-consecutive rows, requiring memory across 10+ timesteps
Quantitative Analysis:
Looking at confusion matrices, RNN struggles most with Category 6 (Shirts):
CNN: Shirt → T-shirt misclassifications: 136
RNN: Shirt → T-shirt misclassifications: ~155 (estimated)
Why? Shirt collars (top) vs hemlines (bottom) are far apart spatially. RNNs must maintain this information across 15-20 timesteps, while CNNs process it in parallel.
The Fundamental Insight:
Architecture should match data structure:
- Sequential data (text, audio, time series) → RNN's temporal processing is ideal
- Spatial data (images, graphs) → CNN's parallel spatial processing is ideal
RNNs on images are like reading a painting line-by-line with a narrow slit—you can understand it, but you miss the holistic spatial composition that makes it meaningful.
This project's results (88% vs 86.2%) empirically validate this principle: the 1.8% gap is the cost of architectural mismatch.
Training curves reveal not just performance but the learning dynamics of each architecture. In this project, both CNN and RNN exhibited healthy convergence patterns, but with subtle differences.
Key Indicators:
- Training Accuracy Rising: Model is learning patterns from training data
- Validation Accuracy Rising: Model generalizes to unseen data
- Small Gap Between Curves: Little overfitting (healthy generalization)
- Plateau After ~5 Epochs: Model has learned most available patterns
CNN vs RNN Dynamics:
In our results, CNN showed:
- Steeper initial learning curve (faster convergence)
- Higher final accuracy (88% vs 86.2%)
- Slightly more stable validation (less oscillation)
RNN showed:
- Gradual, steady learning (LSTM gating introduces stability)
- Lower final accuracy (architectural mismatch)
- Occasional validation dips (LSTM's sequential nature causes sensitivity to initialization)
What Would Be Concerning:
Overfitting Pattern:
Training: 95% → 98% → 99% (keeps rising)
Validation: 85% → 85% → 84% (plateaus or drops)
→ Model memorizing training data, not generalizing
Neither architecture showed this pattern, confirming healthy training.
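The diverging pattern can even be checked programmatically; a heuristic sketch (the function name, window, and tolerance are arbitrary example choices):

```python
def looks_overfit(train_acc, val_acc, window=3, tol=0.0):
    # Flag runs where training accuracy keeps rising while validation
    # accuracy stalls or drops over the last `window` epochs
    t, v = train_acc[-window:], val_acc[-window:]
    train_rising = all(b > a for a, b in zip(t, t[1:]))
    val_flat_or_falling = v[-1] <= v[0] + tol
    return train_rising and val_flat_or_falling

# The concerning pattern sketched above:
print(looks_overfit([0.95, 0.98, 0.99], [0.85, 0.85, 0.84]))  # True
# A healthy pattern (both curves rising together):
print(looks_overfit([0.85, 0.88, 0.90], [0.84, 0.87, 0.88]))  # False
```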
What They Reveal:
Each model output is a 10-element probability vector summing to 1.0:
# Example: Model is confident this is class 1 (Trousers)
[0.01, 0.92, 0.03, 0.01, 0.01, 0.00, 0.01, 0.00, 0.00, 0.01]
#  ↑     ↑                                          ↑
# T-shirt  Trouser (predicted)                     Bag

Calibration Analysis:
The bar charts in Results section show:
- Green bar: True label (ground truth)
- Blue bar: Predicted label (if different)
- Height: Model confidence
Interpretation:
- Tall blue bar matching green: High confidence, correct prediction ✓
- Tall blue bar not matching green: High confidence, wrong prediction (overconfident) ✗
- Flat distribution: Model uncertain (healthy when classes are ambiguous)
CNN showed taller bars (higher confidence) than RNN, indicating more decisive feature extraction from spatial patterns.
Practical Implications:
In production systems, probability distributions enable:
- Confidence thresholds: Only act on predictions > 80% confidence
- Human-in-the-loop: Flag uncertain predictions (< 60%) for manual review
- Multi-label scenarios: If two classes have similar probabilities (both > 40%), suggest both
- Error analysis: Systematic confusion patterns guide model improvements
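These decision rules reduce to a few lines of code; a sketch (the function name and thresholds are the example values from the list above, not tuned recommendations):

```python
import numpy as np

def route_prediction(probs: np.ndarray,
                     act_threshold: float = 0.80,
                     review_threshold: float = 0.60) -> str:
    # probs: 10-element softmax output summing to ~1.0
    top = float(probs.max())
    if top >= act_threshold:
        return "act"            # confident enough to use automatically
    if top < review_threshold:
        return "human_review"   # too uncertain, flag for manual check
    return "caution"            # middle ground

confident = np.array([0.01, 0.92, 0.03, 0.01, 0.01,
                      0.00, 0.01, 0.00, 0.00, 0.01])
uncertain = np.full(10, 0.1)    # flat distribution: model has no idea
print(route_prediction(confident), route_prediction(uncertain))  # act human_review
```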
❌ "Higher accuracy always means better model"
✅ Correct understanding: Must check generalization gap (train-test difference) and per-class performance. 90% accuracy with 99% on easy classes and 50% on hard classes is worse than 88% balanced across all classes.
❌ "RNNs failed because they're bad at images"
✅ Correct understanding: RNNs achieved 86.2% accuracy (good!), just 1.8% below CNN. The gap validates architectural principles but doesn't mean RNNs are useless for images (they work, just suboptimal).
❌ "Fewer parameters = better model (less overfitting)"
✅ Correct understanding: Parameter count should match problem complexity. RNN's 84K parameters weren't enough to capture 2D spatial patterns as effectively as CNN's 140K parameters.
❌ "Confusion matrix diagonal should be 100%"
✅ Correct understanding: Perfect classification (100% on all classes) is unrealistic and often indicates overfitting. 88% with balanced errors across classes demonstrates healthy generalization.
❌ "LSTMs are just better RNNs, always use them"
✅ Correct understanding: LSTMs solve vanishing gradients for long sequences, but add complexity. For short sequences (< 30 steps), vanilla RNNs often suffice. Our 28 timesteps (image rows) is borderline—LSTM's gating helped but wasn't critical.
❌ "Pooling layers lose information"
✅ Correct understanding: MaxPooling discards precise spatial positions but preserves feature presence. This is desirable—"dog detected" matters more than "dog at exact pixel 247." The resulting translation invariance improves generalization.
- Data Augmentation
- Random rotations (±15°), flips, zooms
- Expected impact: +2-3% accuracy by increasing training diversity
- Focus: Category 6 (Shirts) would benefit most
- Deeper CNN Architecture
- Add 3rd convolutional block (64 filters)
- Expected impact: +1-2% accuracy, better feature extraction
- Trade-off: ~2x more parameters, slightly slower training
- Learning Rate Scheduling
- Reduce learning rate after plateau (e.g., epoch 5)
- Expected impact: Squeeze out additional 0.5-1% accuracy
- Implementation: ReduceLROnPlateau callback
- Transfer Learning
- Use pre-trained VGG or ResNet (ImageNet weights)
- Fine-tune on Fashion MNIST
- Expected impact: 92-95% accuracy (state-of-the-art range)
- Ensemble Methods
- Combine CNN + deeper CNN + residual connections
- Expected impact: +2-3% accuracy via model diversity
- Cost: 3x inference time
- Hierarchical Classification
- Level 1: "Upper body" vs "Lower body" vs "Footwear" vs "Accessories"
- Level 2: Specific categories within each group
- Expected impact: Reduce Shirt confusion (exploit category relationships)
- Attention Mechanisms
- Add attention layers to CNN (focus on discriminative regions)
- Expected impact: Better interpretability, potential +1% accuracy
- LeCun et al. (1998), "Gradient-Based Learning Applied to Document Recognition": foundational CNN architecture (LeNet) establishing convolutional neural networks
- Hochreiter & Schmidhuber (1997), "Long Short-Term Memory": original LSTM paper introducing gating mechanisms for RNNs
- Xiao et al. (2017), "Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms": introduction of the Fashion MNIST dataset as an MNIST replacement
- Goodfellow et al. (2016), "Deep Learning": comprehensive reference for CNN and RNN fundamentals (Chapters 9-10)
- Convolutional Neural Networks (CNNs): Extract spatial hierarchies via convolution
- Recurrent Neural Networks (RNNs): Process sequential data with temporal dependencies
- LSTM (Long Short-Term Memory): RNN variant with gating to prevent vanishing gradients
- Confusion Matrix: Visualization of per-class classification performance
- Train-Validation-Test Split: Essential methodology for robust model evaluation
- Fashion MNIST: GitHub Repository
- 70,000 grayscale images (28×28 pixels)
- 10 clothing categories
- Balanced classes (7,000 samples each)
- Training: 60,000 | Test: 10,000
Course: COMP 263 - Deep Learning
Institution: Centennial College
Term: Fall 2024
Grade: High Honors (GPA: 4.45/4.5)
- Architecture Comparison: Systematic evaluation of CNN vs RNN for image classification
- Model Evaluation: Comprehensive metrics (accuracy, confusion matrix, probability distributions)
- Statistical Analysis: Train-validation-test methodology with proper random seed control
- Data Preprocessing: Normalization, one-hot encoding, dataset splitting
- Visualization: Professional matplotlib/seaborn plots (training curves, confusion matrices)
- Critical Thinking: Identified root cause of category 6 (Shirt) poor performance through confusion matrix analysis
Author: Matheus Ferreira Teixeira
GitHub: github.com/domvito55
LinkedIn: linkedin.com/in/mathteixeira
This project is academic coursework at Centennial College. Free to use for learning purposes with proper attribution.