This project implements a full pose estimation pipeline using keypoints from the COCO dataset. It includes data preprocessing, heatmap generation, model training, and evaluation using the PCK@0.2 metric.
- Predict 17 keypoints of a human body (COCO format)
- Use heatmaps as supervision (Gaussian blobs)
- Explore soft-argmax and coordinate losses
- Evaluate using PCK@0.2
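
The Gaussian-blob supervision mentioned above can be sketched as follows. This is a minimal NumPy illustration, not the project's `heatmap_generator.py`: the function name, the `sigma` value, and the assumption that keypoints are already expressed in heatmap pixel coordinates are all hypothetical.

```python
import numpy as np

def gaussian_heatmaps(keypoints, visibility, height, width, sigma=2.0):
    """Render one Gaussian blob per visible keypoint.

    keypoints:  (K, 2) sequence of (x, y) in heatmap pixel coordinates.
    visibility: (K,) sequence; entries > 0 are rendered, others stay all-zero.
    Returns a (K, height, width) float32 array whose peak value is 1.0.
    sigma is an assumed blob width, not a value taken from the project.
    """
    keypoints = np.asarray(keypoints, dtype=np.float32)
    ys = np.arange(height, dtype=np.float32)[:, None]  # column of row indices, (H, 1)
    xs = np.arange(width, dtype=np.float32)[None, :]   # row of column indices, (1, W)
    heatmaps = np.zeros((len(keypoints), height, width), dtype=np.float32)
    for k, ((x, y), v) in enumerate(zip(keypoints, visibility)):
        if v > 0:
            # Squared distance of every pixel to the keypoint, broadcast to (H, W)
            d2 = (xs - x) ** 2 + (ys - y) ** 2
            heatmaps[k] = np.exp(-d2 / (2.0 * sigma ** 2))
    return heatmaps

# Example: two keypoints, the second one not labeled
hm = gaussian_heatmaps([[10, 20], [5, 5]], [2, 0], height=96, width=72)
```

Keypoints given in input-image coordinates would need to be rescaled to the heatmap size before calling a function like this, so that the target has the same resolution as the model output.
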
pose-estimation-project/
├── README.md
├── dataset
│   ├── annotations
│   ├── coco_subset
│   ├── coco_subset_large
│   ├── generate_subset.py
│   ├── train2017
│   └── val2017
├── output
├── output_200_image
├── output_epoch30
├── src
│   ├── checkpoints
│   ├── pck.py
│   ├── pose_model.pth
│   ├── simple_pose_net.py
│   ├── train.py
│   ├── unet_pose_net.py
│   └── unit_test_simple_pose_net.py
└── utils
    ├── debug_visualize_heatmaps.py
    ├── decode_keypoints.py
    ├── heatmap_generator.py
    ├── heatmaps_decoder.py
    ├── soft_argmax.py
    ├── test_heatmap_decoder.py
    ├── visualiser.py
    └── visualize_predictions.py
- Subset of COCO Keypoints 2017
- Filtered for images with at least one visible person
- Used subset sizes: 200 and 2000 images
- Input resolution: 256×192
- Output resolution: 96×72 or 192×256
- Encoder: Pretrained ResNet18
- Decoder:
  - SimplePoseNet: 3-layer upsampling
  - UNetPoseNet: U-Net with skip connections
- Loss:
  - BCEWithLogits + Soft-argmax L1 loss
  - Joint-wise weighted coordinate loss
- Optimizer: Adam (LR=1e-3)
- Metric: PCK@0.2
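
The soft-argmax step used by the coordinate loss can be sketched as a softmax over each heatmap followed by an expected-coordinate computation. This is a NumPy illustration under assumptions, not the contents of `soft_argmax.py`: the function name and the `beta` temperature parameter are hypothetical.

```python
import numpy as np

def soft_argmax_2d(logits, beta=1.0):
    """Differentiable-style keypoint decoding from raw heatmap logits.

    logits: (K, H, W) array of unnormalized scores.
    beta:   assumed temperature; larger values approach a hard argmax.
    Returns a (K, 2) array of expected (x, y) coordinates in pixels.
    """
    k, h, w = logits.shape
    flat = logits.reshape(k, -1) * beta
    flat = flat - flat.max(axis=1, keepdims=True)  # subtract max for numerical stability
    probs = np.exp(flat)
    probs = probs / probs.sum(axis=1, keepdims=True)
    probs = probs.reshape(k, h, w)

    xs = np.arange(w, dtype=np.float64)
    ys = np.arange(h, dtype=np.float64)
    # Marginalize over rows to get p(x), over columns to get p(y), then take expectations
    x = (probs.sum(axis=1) * xs).sum(axis=1)
    y = (probs.sum(axis=2) * ys).sum(axis=1)
    return np.stack([x, y], axis=1)

# Example: a single sharp peak at x=12, y=30 on a 96x72 map
demo = np.zeros((1, 96, 72))
demo[0, 30, 12] = 50.0
coords = soft_argmax_2d(demo)
```

Because the output is a weighted average rather than a pixel index, it can fall between pixels, which is what lets an L1 loss on coordinates be combined with the heatmap loss.
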
- Best PCK@0.2 on 2000 images: ~0.51
- Model learns rough vicinity of joints, but not fine arrangement
- Qualitative examples show blobs forming, but not well structured
- Soft-argmax decoder improved keypoint sharpness
- U-Net decoder improved learning for difficult joints (ankles, wrists)
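
For reference, a PCK@0.2 score like the one reported above can be computed along these lines. The sketch assumes a per-person normalization length (for example a bounding-box or torso size); which length `pck.py` actually uses is not stated here, so `norm_length` and the function name are assumptions.

```python
import numpy as np

def pck(pred, gt, visibility, norm_length, alpha=0.2):
    """Fraction of labeled keypoints predicted within alpha * norm_length of ground truth.

    pred, gt:    (N, K, 2) arrays of (x, y) coordinates in the same pixel space.
    visibility:  (N, K) array; only entries > 0 are scored.
    norm_length: (N,) per-person reference length (assumed choice of normalizer).
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    mask = np.asarray(visibility) > 0
    dist = np.linalg.norm(pred - gt, axis=-1)                       # (N, K) pixel errors
    thresh = alpha * np.asarray(norm_length, dtype=np.float64)[:, None]
    correct = (dist <= thresh) & mask
    return float(correct.sum()) / max(int(mask.sum()), 1)

# Example: errors of 0, 10, and 30 px against a threshold of 0.2 * 100 = 20 px
pred = np.array([[[0.0, 0.0], [10.0, 0.0], [30.0, 0.0]]])
gt = np.zeros((1, 3, 2))
score = pck(pred, gt, np.array([[1, 1, 1]]), np.array([100.0]))
```

Unlabeled keypoints are excluded from both the numerator and the denominator, so missing annotations do not count as errors.
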
- Ground-truth heatmap resolution must match model output resolution
- Predicted keypoints are sometimes correct in location but assigned to the wrong joint
- GT annotations in COCO have noisy or missing keypoints
- Model sometimes predicts joints in the correct area but wrong order
- Predicted skeleton structure not yet coherent
- Multiple person ambiguity: Supervision is limited to the first visible person, causing the model to sometimes place keypoints on other individuals in multi-person images.
- Visual debugging is crucial
- Output resolution affects heatmap precision
- Masking invisible keypoints stabilizes training
- Balancing coordinate vs heatmap loss helps
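
Two of the points above, masking invisible keypoints and balancing the heatmap and coordinate terms, can be illustrated together. This is a NumPy sketch of one possible formulation, not the loss in `train.py`: `coord_weight`, `joint_weights`, and both function names are hypothetical.

```python
import numpy as np

def bce_with_logits(logits, targets):
    """Element-wise binary cross-entropy on raw logits, in a numerically stable form."""
    return np.maximum(logits, 0) - logits * targets + np.log1p(np.exp(-np.abs(logits)))

def combined_loss(logits, target_heatmaps, pred_xy, target_xy, visibility,
                  coord_weight=0.1, joint_weights=None):
    """Heatmap BCE plus weighted L1 coordinate loss, averaged over visible joints only.

    logits, target_heatmaps: (K, H, W); pred_xy, target_xy: (K, 2); visibility: (K,).
    coord_weight balances the two terms (assumed value, to be tuned).
    joint_weights optionally up-weights hard joints in the coordinate term.
    """
    vis = (np.asarray(visibility) > 0).astype(np.float64)
    n_vis = max(vis.sum(), 1.0)                      # avoid division by zero
    if joint_weights is None:
        joint_weights = np.ones_like(vis)

    heat = bce_with_logits(logits, target_heatmaps).mean(axis=(1, 2))            # (K,)
    coord = np.abs(np.asarray(pred_xy) - np.asarray(target_xy)).sum(axis=1)      # (K,)

    heat_term = (heat * vis).sum() / n_vis
    coord_term = (coord * vis * joint_weights).sum() / n_vis
    return float(heat_term + coord_weight * coord_term)

# Example: the second joint is invisible, so its large coordinate error is ignored
logits = np.zeros((2, 4, 4))
targets = np.zeros((2, 4, 4))
pred_xy = np.array([[1.0, 1.0], [100.0, 100.0]])
target_xy = np.array([[1.0, 1.0], [0.0, 0.0]])
loss_masked = combined_loss(logits, targets, pred_xy, target_xy, [1, 0])
loss_unmasked = combined_loss(logits, targets, pred_xy, target_xy, [1, 1])
```

Without the mask, a single unlabeled joint with placeholder coordinates can dominate the coordinate term, which is one plausible reason masking stabilizes training.
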
- Structured refinement with graph-based loss
- Hourglass or Transformer-style decoders
- Bone length constraints
- Pose refinement from initial prediction
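
As one possible reading of the bone length constraint idea, the sketch below penalizes the difference between predicted and ground-truth limb lengths. It is an assumption about how such a term could look, not something implemented in this project; `LIMBS` and `bone_length_loss` are hypothetical names, and the index pairs follow the standard COCO keypoint order.

```python
import numpy as np

# Arm and leg segments in COCO keypoint order:
# 5/6 shoulders, 7/8 elbows, 9/10 wrists, 11/12 hips, 13/14 knees, 15/16 ankles
LIMBS = [(5, 7), (7, 9), (6, 8), (8, 10), (11, 13), (13, 15), (12, 14), (14, 16)]

def bone_length_loss(pred_xy, target_xy, visibility, limbs=LIMBS):
    """Mean absolute difference between predicted and ground-truth limb lengths.

    pred_xy, target_xy: (17, 2) arrays of (x, y); visibility: (17,).
    Only limbs whose two endpoints are both labeled contribute.
    """
    pred_xy = np.asarray(pred_xy, dtype=np.float64)
    target_xy = np.asarray(target_xy, dtype=np.float64)
    vis = np.asarray(visibility) > 0
    a = np.array([i for i, _ in limbs])
    b = np.array([j for _, j in limbs])

    pred_len = np.linalg.norm(pred_xy[a] - pred_xy[b], axis=1)
    gt_len = np.linalg.norm(target_xy[a] - target_xy[b], axis=1)
    both = vis[a] & vis[b]
    if not both.any():
        return 0.0
    return float(np.abs(pred_len - gt_len)[both].mean())

# Example: a pure translation keeps every limb length, so the penalty is zero
target = np.arange(34, dtype=np.float64).reshape(17, 2)
shifted = target + 5.0
zero_penalty = bone_length_loss(shifted, target, np.ones(17))
scaled_penalty = bone_length_loss(2.0 * target, target, np.ones(17))
```

A term like this is invariant to where the skeleton is placed and only reacts to implausible limb proportions, so it would complement rather than replace the per-joint losses.
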


