Team Name: Gradient Geeks
Team Members: Suchana Hazra, Siddharth Sen, Uttam Mahata, Anurag Ghosh
Submission Date: 13 October 2025
This repository contains the solution for the Smart Product Pricing challenge.
Our approach combines three kinds of data (textual product descriptions, product images, and structured metadata) to predict product prices. The solution uses text embeddings, image embeddings, dimensionality reduction, and gradient boosting models.
- Predict product prices using text, image, and structured features.
- Text descriptions contain valuable pricing cues but need cleaning.
- Images provide visual cues related to product quality and category.
- Redundant or sparse features were removed to improve model performance.
- Text Processing: Clean text → embed using MiniLM → PCA for dimensionality reduction.
- Image Processing: Preprocess images → embed using pretrained CNN/CLIP → PCA.
- Feature Fusion: Concatenate text embeddings, image embeddings, and structured features.
- Regression Models: Fit ensemble models (LightGBM, XGBoost, CatBoost) to predict prices.
- Product Text → Text Cleaning → MiniLM Embedding → PCA → Concatenate
- Product Image → Preprocessing → CNN/CLIP Embedding → PCA → Concatenate
- Structured Features → Clean/Encode → Concatenate
- Concatenate → GBM Regressor → Price Prediction
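The fusion step in the pipeline above can be sketched as a column-wise concatenation of the three feature blocks. This is a minimal illustration, not the repository's code: `fuse_features` is a hypothetical helper, and the three structured columns are an assumed placeholder, since the source does not say how many structured features survive cleaning.

```python
import numpy as np

def fuse_features(text_feats, image_feats, structured_feats):
    """Concatenate per-product feature blocks column-wise (hypothetical helper)."""
    blocks = [np.asarray(b, dtype=np.float32)
              for b in (text_feats, image_feats, structured_feats)]
    # Every block must describe the same products, one row per product.
    n_rows = {b.shape[0] for b in blocks}
    if len(n_rows) != 1:
        raise ValueError(f"row counts differ across blocks: {sorted(n_rows)}")
    return np.hstack(blocks)

n_products = 4
text_feats = np.zeros((n_products, 128))          # stand-in for PCA-reduced text embeddings
image_feats = np.ones((n_products, 128))          # stand-in for PCA-reduced image embeddings
structured_feats = np.full((n_products, 3), 2.0)  # assumption: 3 structured columns

X = fuse_features(text_feats, image_feats, structured_feats)
print(X.shape)  # (4, 259)
```

The row-count check is there because the three blocks are typically produced by separate jobs, and a silent misalignment between them would corrupt every training example.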
- Text cleaning: regex, lowercasing, punctuation removal, stopword removal
- Text embedding: MiniLM (384-dimensional)
- Text PCA: reduced from 384 to 128 dimensions
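The cleaning steps listed for the text branch (regex, lowercasing, punctuation removal, stopword removal) might look like the sketch below. The function name `clean_text` and the tiny `STOPWORDS` set are assumptions for illustration; the source does not say which stopword list was used. The cleaned string would then be passed to the MiniLM encoder.

```python
import re
import string

# Hypothetical, deliberately tiny stopword list; a real run would likely use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "for", "with", "in", "on", "to", "is"}

_PUNCT_RE = re.compile(f"[{re.escape(string.punctuation)}]")
_SPACE_RE = re.compile(r"\s+")

def clean_text(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace, drop stopwords."""
    text = text.lower()
    text = _PUNCT_RE.sub(" ", text)
    tokens = _SPACE_RE.sub(" ", text).strip().split(" ")
    return " ".join(t for t in tokens if t and t not in STOPWORDS)

print(clean_text("The BEST Organic Coffee, 12-oz (Pack of 2)!"))
# best organic coffee 12 oz pack 2
```

Punctuation is replaced with a space rather than deleted, so quantities such as `12-oz` stay as separate tokens (`12`, `oz`) instead of fusing into `12oz`.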
- Image preprocessing: resize, normalize, convert to tensor
- Image embedding: pretrained CNN/CLIP (2048-dimensional)
- Image PCA: reduced from 2048 to 128 dimensions
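Both embedding branches end with a PCA reduction to 128 dimensions. The sketch below shows that step for the 2048-dimensional image embeddings using plain NumPy as a stand-in for a library PCA (for example scikit-learn's); the source does not say which implementation was used. The helper names `fit_pca` and `apply_pca` and the random arrays are assumptions for illustration only.

```python
import numpy as np

def fit_pca(X: np.ndarray, n_components: int):
    """Return (mean, components) for a PCA fitted on the rows of X."""
    mean = X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def apply_pca(X: np.ndarray, mean: np.ndarray, components: np.ndarray) -> np.ndarray:
    """Project rows of X onto previously fitted principal directions."""
    return (X - mean) @ components.T

rng = np.random.default_rng(0)
train_emb = rng.normal(size=(500, 2048))  # random stand-in for training image embeddings
test_emb = rng.normal(size=(50, 2048))    # random stand-in for test image embeddings

mean, components = fit_pca(train_emb, n_components=128)
train_128 = apply_pca(train_emb, mean, components)
test_128 = apply_pca(test_emb, mean, components)
print(train_128.shape, test_128.shape)  # (500, 128) (50, 128)
```

Fitting on the training embeddings and reusing the same `mean` and `components` for the test set keeps the two sets in the same 128-dimensional space. The same two functions would apply to the 384-dimensional text embeddings.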
- Structured features: drop redundant features and encode categorical variables (target encoding / label encoding)
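The two encodings named above can be illustrated in a few lines of plain Python. This is a sketch under assumptions: the smoothed form of target encoding, the `smoothing` parameter, and the toy categories and prices are all illustrative choices that the source does not specify.

```python
from collections import defaultdict

def label_encode(values):
    """Map each distinct category to an integer, in order of first appearance."""
    mapping = {}
    for v in values:
        mapping.setdefault(v, len(mapping))
    return [mapping[v] for v in values], mapping

def target_encode(categories, targets, smoothing=1.0):
    """Smoothed mean-target encoding (assumed variant):
    (category_sum + smoothing * global_mean) / (category_count + smoothing)."""
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    encoding = {c: (sums[c] + smoothing * global_mean) / (counts[c] + smoothing)
                for c in sums}
    return encoding, global_mean

categories = ["tea", "tea", "coffee", "snack"]  # toy data, not from the challenge
prices = [10.0, 20.0, 30.0, 40.0]

codes, mapping = label_encode(categories)
encoding, global_mean = target_encode(categories, prices)
print(codes)               # [0, 0, 1, 2]
print(encoding["coffee"])  # 27.5
```

Smoothing pulls rare categories toward the global mean price. Because target encoding uses the label, it is usually fitted on training folds only, and unseen categories fall back to `global_mean`.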
- Gradient Boosting (LightGBM, XGBoost, CatBoost)
- Hyperparameter tuning via cross-validation
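Hyperparameter tuning via cross-validation could be sketched as a small grid search. This example makes several assumptions: it uses scikit-learn's `GradientBoostingRegressor` as a stand-in for LightGBM, XGBoost, or CatBoost, the parameter grid is hypothetical, and the data is synthetic rather than the challenge's fused features.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))  # synthetic stand-in for the fused feature matrix
y = 50 + 10 * X[:, 0] - 5 * X[:, 1] + rng.normal(scale=1.0, size=200)  # synthetic "prices"

# Hypothetical grid; the values actually searched are not given in the source.
param_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}

search = GridSearchCV(
    estimator=GradientBoostingRegressor(random_state=0),
    param_grid=param_grid,
    scoring="neg_mean_absolute_error",  # scikit-learn maximizes, hence the negated error
    cv=KFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)  # mean cross-validated MAE of the best configuration
```

The same pattern should carry over to the scikit-learn-compatible wrappers that the three boosting libraries provide, with the grid keys adjusted to each library's parameter names.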
| Metric | Score |
|---|---|
| SMAPE | 0.047 |
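For reference, a common definition of SMAPE is sketched below. The table does not state which variant was used or whether the score is a fraction or a percentage; this sketch assumes the fractional form with the mean of absolute values in the denominator, which ranges from 0 to 2.

```python
import numpy as np

def smape(y_true, y_pred) -> float:
    """Symmetric mean absolute percentage error, as a fraction in [0, 2]."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    # Treat a term as 0 where both values are 0, avoiding division by zero.
    ratio = np.divide(np.abs(y_pred - y_true), denom,
                      out=np.zeros_like(denom), where=denom != 0)
    return float(ratio.mean())

print(round(smape([100.0, 200.0], [110.0, 180.0]), 4))  # 0.1003
```

In the example, the two terms are 10/105 and 20/190, whose mean is about 0.1003.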
- Clone the repository: https://github.com/gradientgeeks/amazon-ml-challenge-2025/