Skip to content
This repository was archived by the owner on Jul 31, 2025. It is now read-only.

Latest commit

 

History

History
164 lines (96 loc) · 9.3 KB

File metadata and controls

164 lines (96 loc) · 9.3 KB

Technical Report: Binary Semantic Segmentation

Overview

This technical report details the implementation and experimental evaluation of binary semantic segmentation using UNet and ResNet34_UNet architectures on the Oxford-IIIT Pet Dataset. The project demonstrates end-to-end deep learning pipeline for computer vision tasks, including model architecture design, training optimization, and comprehensive evaluation.

1. Implementation Details

Model Architectures

Two main model architectures were implemented:

UNet (src/models/unet.py)

The implementation follows the classic U-shaped design from the original paper. The encoder gradually reduces feature map dimensions while the decoder reconstructs them to the original resolution. Skip connections between corresponding encoder and decoder levels preserve fine details. The implementation uses double convolution blocks (double_conv) containing two convolutional layers with batch normalization and ReLU activation. The final layer is a 1×1 convolution that outputs the segmentation map.

ResNet34_UNet (src/models/resnet34_unet.py)

This architecture combines a ResNet34 encoder with a UNet-style decoder. The encoder uses ResNet34's initial convolutional layers and residual blocks (BasicBlock), while the decoder mirrors UNet's structure with transposed convolutions for upsampling and skip connections from ResNet34's encoder layers. ResNet34's original output layer is disabled to ensure proper connection between encoder and decoder. This design benefits from ResNet34's strong feature extraction while maintaining UNet's segmentation capabilities.

Training Procedure (src/train.py)

The training process includes these key components:

Device Selection

The script automatically selects the appropriate device (CUDA, MPS, or CPU) using the get_device function from utils.py. Users should adjust the batch size according to their device's memory capacity.

Data Loading

Uses the load_dataset function from oxford_pet.py to load the Oxford-IIIT Pet dataset, creating both training and validation sets. Data augmentation is implemented through the augment_dataset function, which applies random rotations, horizontal flips, crops, or combinations of these. By default, each image is augmented 3 times, though this can be adjusted. The script includes logic to prevent duplicate augmentations across training runs.

Model Configuration

Only unet or resnet34_unet are valid model types. The script defaults to resnet34_unet if an invalid type is entered. Each training run receives a unique run_id for tracking purposes.

Optimization Strategy

  • Loss Function: Binary Cross-Entropy with Logits Loss (nn.BCEWithLogitsLoss)
  • Optimizer: Adam optimizer (optim.Adam)
  • Learning Rate Scheduling: ReduceLROnPlateau scheduler that adjusts the learning rate based on validation Dice scores. The rate reduces by 0.5× after 3 epochs without improvement (patience=3), with a minimum rate of 1e-6.
  • Gradient Clipping: torch.nn.utils.clip_grad_norm_ (max norm=1.0) to prevent exploding gradients

Training Loop

The script runs for multiple epochs, processing batched data via DataLoader. Each batch completes forward passes, loss calculations, backpropagation, and optimization. Training/validation metrics (losses and Dice scores) are logged to CSV files using utils.py functions. The script saves periodic checkpoints and retains the best model based on validation Dice score.

Evaluation Procedure (src/evaluate.py)

The evaluation script assesses trained models with the following components:

  • Data Loading: Uses the same load_dataset function to load validation data
  • Model Loading: Loads a specified pre-trained model and switches to evaluation mode (net.eval())
  • Metrics: Calculates Dice scores using the dice_score function from utils.py
  • Evaluation Process: Processes the validation set in batches, converts model outputs to binary masks (threshold=0.5 after Sigmoid), and computes batch Dice scores

Inference Procedure (src/inference.py)

This script runs predictions on unseen data with:

  • Data Loading: Loads test data using the same load_dataset function
  • Prediction and Output: Processes test images, generates binary masks, and optionally saves inputs, predictions, and ground truth masks as PNGs in inference_results
  • Results Logging: All results are logged to CSV files

2. Data Preprocessing

Preprocessing occurs primarily in oxford_pet.py and utils.py through the OxfordPetDataset and SimpleOxfordPetDataset classes.

Dataset Processing

Mask Processing

The _preprocess_mask method in OxfordPetDataset includes an augment_uncertain option. By default (False), both foreground (1) and uncertain regions (3) become foreground (1.0) in the binary mask. When enabled (True), uncertain regions randomly become foreground (1.0) or background (0.0) with 50% probability.

Data Augmentation

  • Augmentation Methods: utils.py contains augment_dataset for bulk augmentation and apply_augmentations for individual image-mask pairs
  • Available Transformations: Rotation, horizontal flipping, scaled cropping, and combined transformations
  • Implementation: When the --augment flag is active, train.py uses augment_dataset to generate enhanced data, then combines it with the original dataset via the AugmentedDataset class

3. Experimental Results

Model Performance Comparison: UNet vs. ResNet34_UNet

The experimental results demonstrate that ResNet34_UNet consistently achieved higher Dice scores than the standard UNet within the same argument groups. This suggests that leveraging ResNet34's feature extraction significantly enhances segmentation performance. ResNet34_UNet showed faster convergence due to its deep residual connections, which helped mitigate the vanishing gradient problem.

Data Augmentation Impact

The experiments examined the effect of data augmentation, including random rotations, horizontal flips, and cropping. Results showed that augmentation led to noticeable improvement in Dice scores, likely by increasing model robustness to variations in pet shapes and orientations.

Key Observations

  1. Learning Rate Scheduling Impact: The ReduceLROnPlateau scheduler was effective in improving model performance. ResNet34_UNet benefited more from this scheduler, converging faster than UNet.

  2. Data Augmentation Effects: Augmentation improved Dice scores, with the impact being more pronounced in earlier training stages. However, excessive augmentation showed diminishing returns beyond a certain point.

  3. Overfitting Analysis: Overfitting was not a major issue due to effective augmentation and batch normalization in both architectures. ResNet34_UNet maintained better generalization compared to UNet.

  4. Model Efficiency: ResNet34_UNet demonstrated superior memory efficiency and faster convergence compared to UNet, making it more suitable for resource-constrained environments.

Performance Metrics

  • Training Accuracy: ~95%+ on training set
  • Validation Accuracy: ~90%+ on validation set
  • IoU Score: ~0.8+ for well-segmented images
  • Convergence: Typically converges within 30-50 epochs

4. Execution Examples

Training

python src/train.py --data_path ./data/oxford-iiit-pet --model_type resnet34_unet --epochs 30 --batch_size 32 --learning_rate 1e-4 --augment

Evaluation

python src/evaluate.py --data_path ./data/oxford-iiit-pet --model_path ./saved_models/resnet34_unet_best_model.pth --model_type resnet34_unet --batch_size 4

Inference

python src/inference.py --data_path ./data/oxford-iiit-pet --model ./saved_models/resnet34_unet_best_model.pth --model_type resnet34_unet --batch_size 1 --save_results

5. Discussion and Future Directions

Architectural Insights

The experiments demonstrate that conventional CNN architectures, especially ResNet34_UNet, remain highly effective for binary semantic segmentation tasks. This aligns with the concept of diminishing returns, where increased model complexity doesn't always translate to proportional performance gains in well-defined tasks.

Practical Considerations

The implementation revealed crucial insights about applying segmentation models in practice. Data augmentation experiments showed that dataset quality and diversity often outweigh architectural sophistication, supporting the importance of a data-centric approach to machine learning.

Future Directions

Potential areas for exploration include:

  1. Hybrid Architectures: Exploring models that combine CNNs with self-attention mechanisms
  2. Multi-class Extension: Extending from binary to multi-class segmentation
  3. Efficiency Optimization: Investigating time-space trade-offs for edge device deployment
  4. Advanced Augmentation: Exploring more sophisticated data augmentation techniques

6. Technical Specifications

Environment Requirements

  • Python 3.7+
  • PyTorch >= 1.9.0
  • CUDA-compatible GPU (recommended) or CPU with sufficient memory

Memory Requirements

  • UNet: ~14.5GB RAM (batch size 32)
  • ResNet34_UNet: ~7GB RAM (batch size 32)

Dataset Information

  • Oxford-IIIT Pet Dataset: 37 pet categories, ~7,400 images
  • Trimap annotations with pixel-level labels
  • Automatic preprocessing to binary masks