Technical Report: Binary Semantic Segmentation

Overview

This technical report details the implementation and experimental evaluation of binary semantic segmentation using UNet and ResNet34_UNet architectures on the Oxford-IIIT Pet Dataset. The project demonstrates end-to-end deep learning pipeline for computer vision tasks, including model architecture design, training optimization, and comprehensive evaluation.

1. Implementation Details

Model Architectures

Two main model architectures were implemented:

UNet (`src/models/unet.py`)

The implementation follows the classic U-shaped design from the original paper. The encoder gradually reduces feature map dimensions while the decoder reconstructs them to the original resolution. Skip connections between corresponding encoder and decoder levels preserve fine details. The implementation uses double convolution blocks (double_conv) containing two convolutional layers with batch normalization and ReLU activation. The final layer is a 1×1 convolution that outputs the segmentation map.

ResNet34_UNet (`src/models/resnet34_unet.py`)

This architecture combines a ResNet34 encoder with a UNet-style decoder. The encoder uses ResNet34's initial convolutional layers and residual blocks (BasicBlock), while the decoder mirrors UNet's structure with transposed convolutions for upsampling and skip connections from ResNet34's encoder layers. ResNet34's original output layer is disabled to ensure proper connection between encoder and decoder. This design benefits from ResNet34's strong feature extraction while maintaining UNet's segmentation capabilities.

Training Procedure (`src/train.py`)

The training process includes these key components:

Device Selection

The script automatically selects the appropriate device (CUDA, MPS, or CPU) using the get_device function from utils.py. Users should adjust the batch size according to their device's memory capacity.

Data Loading

Uses the load_dataset function from oxford_pet.py to load the Oxford-IIIT Pet dataset, creating both training and validation sets. Data augmentation is implemented through the augment_dataset function, which applies random rotations, horizontal flips, crops, or combinations of these. By default, each image is augmented 3 times, though this can be adjusted. The script includes logic to prevent duplicate augmentations across training runs.

Model Configuration

Only unet or resnet34_unet are valid model types. The script defaults to resnet34_unet if an invalid type is entered. Each training run receives a unique run_id for tracking purposes.

Optimization Strategy

Loss Function: Binary Cross-Entropy with Logits Loss (nn.BCEWithLogitsLoss)
Optimizer: Adam optimizer (optim.Adam)
Learning Rate Scheduling: ReduceLROnPlateau scheduler that adjusts the learning rate based on validation Dice scores. The rate reduces by 0.5× after 3 epochs without improvement (patience=3), with a minimum rate of 1e-6.
Gradient Clipping: torch.nn.utils.clip_grad_norm_ (max norm=1.0) to prevent exploding gradients

Training Loop

The script runs for multiple epochs, processing batched data via DataLoader. Each batch completes forward passes, loss calculations, backpropagation, and optimization. Training/validation metrics (losses and Dice scores) are logged to CSV files using utils.py functions. The script saves periodic checkpoints and retains the best model based on validation Dice score.

Evaluation Procedure (`src/evaluate.py`)

The evaluation script assesses trained models with the following components:

Data Loading: Uses the same load_dataset function to load validation data
Model Loading: Loads a specified pre-trained model and switches to evaluation mode (net.eval())
Metrics: Calculates Dice scores using the dice_score function from utils.py
Evaluation Process: Processes the validation set in batches, converts model outputs to binary masks (threshold=0.5 after Sigmoid), and computes batch Dice scores

Inference Procedure (`src/inference.py`)

This script runs predictions on unseen data with:

Data Loading: Loads test data using the same load_dataset function
Prediction and Output: Processes test images, generates binary masks, and optionally saves inputs, predictions, and ground truth masks as PNGs in inference_results
Results Logging: All results are logged to CSV files

2. Data Preprocessing

Preprocessing occurs primarily in oxford_pet.py and utils.py through the OxfordPetDataset and SimpleOxfordPetDataset classes.

Dataset Processing

Mask Processing

The _preprocess_mask method in OxfordPetDataset includes an augment_uncertain option. By default (False), both foreground (1) and uncertain regions (3) become foreground (1.0) in the binary mask. When enabled (True), uncertain regions randomly become foreground (1.0) or background (0.0) with 50% probability.

Data Augmentation

Augmentation Methods: utils.py contains augment_dataset for bulk augmentation and apply_augmentations for individual image-mask pairs
Available Transformations: Rotation, horizontal flipping, scaled cropping, and combined transformations
Implementation: When the --augment flag is active, train.py uses augment_dataset to generate enhanced data, then combines it with the original dataset via the AugmentedDataset class

3. Experimental Results

Model Performance Comparison: UNet vs. ResNet34_UNet

The experimental results demonstrate that ResNet34_UNet consistently achieved higher Dice scores than the standard UNet within the same argument groups. This suggests that leveraging ResNet34's feature extraction significantly enhances segmentation performance. ResNet34_UNet showed faster convergence due to its deep residual connections, which helped mitigate the vanishing gradient problem.

Data Augmentation Impact

The experiments examined the effect of data augmentation, including random rotations, horizontal flips, and cropping. Results showed that augmentation led to noticeable improvement in Dice scores, likely by increasing model robustness to variations in pet shapes and orientations.

Key Observations

Learning Rate Scheduling Impact: The ReduceLROnPlateau scheduler was effective in improving model performance. ResNet34_UNet benefited more from this scheduler, converging faster than UNet.
Data Augmentation Effects: Augmentation improved Dice scores, with the impact being more pronounced in earlier training stages. However, excessive augmentation showed diminishing returns beyond a certain point.
Overfitting Analysis: Overfitting was not a major issue due to effective augmentation and batch normalization in both architectures. ResNet34_UNet maintained better generalization compared to UNet.
Model Efficiency: ResNet34_UNet demonstrated superior memory efficiency and faster convergence compared to UNet, making it more suitable for resource-constrained environments.

Performance Metrics

Training Accuracy: ~95%+ on training set
Validation Accuracy: ~90%+ on validation set
IoU Score: ~0.8+ for well-segmented images
Convergence: Typically converges within 30-50 epochs

4. Execution Examples

Training

python src/train.py --data_path ./data/oxford-iiit-pet --model_type resnet34_unet --epochs 30 --batch_size 32 --learning_rate 1e-4 --augment

Evaluation

python src/evaluate.py --data_path ./data/oxford-iiit-pet --model_path ./saved_models/resnet34_unet_best_model.pth --model_type resnet34_unet --batch_size 4

Inference

python src/inference.py --data_path ./data/oxford-iiit-pet --model ./saved_models/resnet34_unet_best_model.pth --model_type resnet34_unet --batch_size 1 --save_results

5. Discussion and Future Directions

Architectural Insights

The experiments demonstrate that conventional CNN architectures, especially ResNet34_UNet, remain highly effective for binary semantic segmentation tasks. This aligns with the concept of diminishing returns, where increased model complexity doesn't always translate to proportional performance gains in well-defined tasks.

Practical Considerations

The implementation revealed crucial insights about applying segmentation models in practice. Data augmentation experiments showed that dataset quality and diversity often outweigh architectural sophistication, supporting the importance of a data-centric approach to machine learning.

Future Directions

Potential areas for exploration include:

Hybrid Architectures: Exploring models that combine CNNs with self-attention mechanisms
Multi-class Extension: Extending from binary to multi-class segmentation
Efficiency Optimization: Investigating time-space trade-offs for edge device deployment
Advanced Augmentation: Exploring more sophisticated data augmentation techniques

6. Technical Specifications

Environment Requirements

Python 3.7+
PyTorch >= 1.9.0
CUDA-compatible GPU (recommended) or CPU with sufficient memory

Memory Requirements

UNet: ~14.5GB RAM (batch size 32)
ResNet34_UNet: ~7GB RAM (batch size 32)

Dataset Information

Oxford-IIIT Pet Dataset: 37 pet categories, ~7,400 images
Trimap annotations with pixel-level labels
Automatic preprocessing to binary masks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Technical Report: Binary Semantic Segmentation

Overview

1. Implementation Details

Model Architectures

UNet (`src/models/unet.py`)

ResNet34_UNet (`src/models/resnet34_unet.py`)

Training Procedure (`src/train.py`)

Device Selection

Data Loading

Model Configuration

Optimization Strategy

Training Loop

Evaluation Procedure (`src/evaluate.py`)

Inference Procedure (`src/inference.py`)

2. Data Preprocessing

Dataset Processing

Mask Processing

Data Augmentation

3. Experimental Results

Model Performance Comparison: UNet vs. ResNet34_UNet

Data Augmentation Impact

Key Observations

Performance Metrics

4. Execution Examples

Training

Evaluation

Inference

5. Discussion and Future Directions

Architectural Insights

Practical Considerations

Future Directions

6. Technical Specifications

Environment Requirements

Memory Requirements

Dataset Information

FilesExpand file tree

TECHNICAL_REPORT.md

Latest commit

History

TECHNICAL_REPORT.md

File metadata and controls

Technical Report: Binary Semantic Segmentation

Overview

1. Implementation Details

Model Architectures

UNet (src/models/unet.py)

ResNet34_UNet (src/models/resnet34_unet.py)

Training Procedure (src/train.py)

Device Selection

Data Loading

Model Configuration

Optimization Strategy

Training Loop

Evaluation Procedure (src/evaluate.py)

Inference Procedure (src/inference.py)

2. Data Preprocessing

Dataset Processing

Mask Processing

Data Augmentation

3. Experimental Results

Model Performance Comparison: UNet vs. ResNet34_UNet

Data Augmentation Impact

Key Observations

Performance Metrics

4. Execution Examples

Training

Evaluation

Inference

5. Discussion and Future Directions

Architectural Insights

Practical Considerations

Future Directions

6. Technical Specifications

Environment Requirements

Memory Requirements

Dataset Information

UNet (`src/models/unet.py`)

ResNet34_UNet (`src/models/resnet34_unet.py`)

Training Procedure (`src/train.py`)

Evaluation Procedure (`src/evaluate.py`)

Inference Procedure (`src/inference.py`)