This repository contains various approaches for the Amazon ML Challenge 2025 price prediction task.
Team ML Mavericks achieved an exceptional All-India Rank of 80 out of approximately 23,000 teams, placing us in the top 0.3% with a final SMAPE score of 43.28 (winning score: 39.7).
Team Members:
- Neel Shah
- Sneh Shah
- Harsh Maheshwari
- Harsh Shah
In e-commerce, determining the optimal price point for a product is crucial for marketplace success and customer satisfaction. The challenge was to develop an ML solution that analyzes product details and predicts the price holistically, using only the provided text and product images; all external price lookups were strictly prohibited.
- Training Dataset: 75,000 products with complete details and prices
- Test Dataset: 75,000 products for final evaluation
Features:
- `sample_id`: Unique identifier for each sample
- `catalog_content`: Text field containing the title, product description, and Item Pack Quantity (IPQ), concatenated
- `image_link`: Public URL for downloading the product image
- `price`: Target variable (training data only)
SMAPE (Symmetric Mean Absolute Percentage Error)
SMAPE = (100/n) * Σ |predicted_price - actual_price| / ((|actual_price| + |predicted_price|) / 2)
- Range: 0% to 200% (lower is better)
- Our Final Score: 43.28%
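The metric above is straightforward to compute locally; a minimal pure-Python sketch:

```python
def smape(actual, predicted):
    """Symmetric Mean Absolute Percentage Error, in percent (0-200)."""
    return 100.0 * sum(
        abs(p - a) / ((abs(a) + abs(p)) / 2.0)
        for a, p in zip(actual, predicted)
    ) / len(actual)

print(smape([100.0], [100.0]))  # 0.0 — perfect prediction
print(smape([100.0], [50.0]))   # 66.67 (i.e. 200/3)
```

Note that SMAPE penalizes under-prediction more heavily than MAPE does for the same absolute error, which is why several of the notebooks train on log-transformed prices.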
- ⚠️ STRICTLY PROHIBITED: external price lookups from the internet or databases
- Models limited to MIT/Apache 2.0 licenses with ≤8B parameters
- Must predict positive float values for all test samples
The main complexity was deriving prices holistically using only:
- Product titles and descriptions
- Item pack quantities
- Product images
- NO external price references allowed
This made it a true test of feature engineering and multimodal learning capabilities.
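For example, the Item Pack Quantity arrives embedded in free-form catalog text, so a first feature-engineering step is to parse it out. A minimal sketch — the regex patterns and the fallback to 1 are assumptions for illustration, not the repository's exact parsing logic:

```python
import re

def extract_ipq(catalog_content: str) -> int:
    """Pull an Item Pack Quantity out of free-form catalog text.
    Falls back to 1 when no pack hint is found (assumed default)."""
    patterns = [
        r"pack of (\d+)",
        r"(\d+)\s*[- ]?pack",
        r"set of (\d+)",
        r"ipq[:\s]+(\d+)",
    ]
    text = catalog_content.lower()
    for pat in patterns:
        m = re.search(pat, text)
        if m:
            return int(m.group(1))
    return 1

print(extract_ipq("Organic Green Tea, Pack of 6, 100g each"))  # 6
print(extract_ipq("Stainless steel bottle"))                   # 1
```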
Complete data exploration and diagnostic analysis to understand pricing patterns:
- `classification-idea.ipynb` - Price classification and binning strategies
- `diagnostic-analysis.ipynb` - Deep diagnostic analysis of price drivers and validation gaps
- `exploratory-data-analysis-visualization.ipynb` - Comprehensive EDA with visualizations
- `feature-engineering.ipynb` - Advanced feature extraction and preprocessing techniques
Focus: Understanding what truly drives product pricing through statistical analysis
Advanced models leveraging both text and visual information:
- `multimodal-brand-clip-image-price-prediction.ipynb` - CLIP-based image + brand features with ensemble
- `qwen2-5-finetune-multimodal.ipynb` - Qwen2.5 vision-language model fine-tuning
Focus: Combining visual product information with textual descriptions for holistic price prediction
Comprehensive text-only solutions spanning traditional ML to cutting-edge LLMs:
- `hybrid-ensemble-validation-test-gap-fixes.ipynb` - 🥇 FINAL SOLUTION with validation-test gap fixes
- `advanced-hybrid-solution.ipynb` - Ultra-advanced ensemble approach (SMAPE: 38-44%)
- `granite-4.0-llm-price-prediction-with-unsloth.ipynb` - Granite 4.0 with Unsloth optimization
- `final-granite-amazon-25-alternative.ipynb` - Alternative Granite implementation
- `qwen-optimized-fast-training.ipynb` - Optimized Qwen training pipeline
- `qwen2-5-finetune-text-only.ipynb` - Qwen2.5 text-only fine-tuning
- `flan-t5-model-main-third-method-dynamic-length.ipynb` - FLAN-T5 with dynamic length handling
- `flan-t5-model-main-inference.ipynb` - FLAN-T5 inference pipeline
- `flan-t5-mlp-regression-log-transformed.ipynb` - T5 + MLP with log transformation
- `bert-regression-model-price-prediction.ipynb` - BERT-based regression approach
- `modern-bert-mmd-loss-price-prediction.ipynb` - Modern BERT with MMD loss
- `comprehensive-bert-text-preprocessing-model.ipynb` - BERT with advanced preprocessing
- `text-only-bert-optimized-approach.ipynb` - Optimized BERT implementation
- `llm-batch-feature-extraction-15-fields.ipynb` - LLM-based comprehensive feature extraction
- `vllm-ultra-fast-feature-extraction-a100.ipynb` - Ultra-fast GPU-optimized feature extraction
- `llm-feature-extraction.ipynb` - General LLM feature extraction pipeline
- `ml-feature-engineering-approach.ipynb` - Traditional ML with engineered features
- `gradient-boosting-solution-amazon-ml.ipynb` - Gradient boosting implementation
- `faiss-similarity-search.ipynb` - FAISS-based similarity search
- `amazon-ml-price-prediction.ipynb` - General ML approach
- `multi-task-t5-beam-search-learning.ipynb` - Multi-task T5 with beam search
- `t5-conditional-generation-pytorch-lightning.ipynb` - T5 with PyTorch Lightning
- `t5-conditional-generation-pytorch-lightning-alt.ipynb` - Alternative T5 PyTorch approach
- `t5-encoder-neural-network-price-classification.ipynb` - T5 encoder + neural network
- `tensorflow-lstm-price-prediction-model.ipynb` - LSTM-based neural network
- `updated-t5-model-aug-data.ipynb` - T5 with data augmentation
Total: 30+ implementations covering the full spectrum from traditional ML to state-of-the-art LLMs
- Amazon 25 Problem Statement.pdf: Official challenge documentation
- Multimodal Model Architectures.pdf: Reference material for multimodal approaches
Task: Predict product prices from catalog content and images
Metric: SMAPE (Symmetric Mean Absolute Percentage Error)
Data: Product catalog descriptions and images from Amazon
1. Explore the data: start with `1. Exploratory Data Analysis/`
2. Choose an approach:
   - For image + text: `2. MultiModal Approach/`
   - For text-only: `3. Text Only Approach/`
- Check README files in each folder for detailed descriptions
- Individual notebooks are prefixed with team member names (Neel Shah, Sneh Shah, Harsh Maheshwari, Harsh Shah)
- Collaborative approaches are organized by methodology
- Check individual notebook documentation for performance metrics
- Compare validation scores and training times across approaches
- Consider computational requirements for your setup
- Biggest gains came from deeply understanding the data before training
- Focus on rigorous preprocessing and thoughtful feature engineering
- Comprehensive exploratory data analysis was crucial
- With 75k test samples, local validation was critical
- Helped navigate leaderboard and avoid overfitting
- Trust your validation over leaderboard fluctuations
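A local validation loop of this kind can be sketched as simple k-fold SMAPE scoring. The constant-median baseline below is purely illustrative (a stand-in for whatever model is being evaluated), not the team's actual pipeline:

```python
import random
import statistics

def smape(actual, predicted):
    """SMAPE in percent (0-200)."""
    return 100.0 * sum(
        abs(p - a) / ((abs(a) + abs(p)) / 2.0)
        for a, p in zip(actual, predicted)
    ) / len(actual)

def cross_validate(prices, k=5, seed=0):
    """k-fold SMAPE scores for a constant-median baseline."""
    idx = list(range(len(prices)))
    random.Random(seed).shuffle(idx)
    scores = []
    for fold in range(k):
        val_idx = set(idx[fold::k])
        train = [prices[i] for i in idx if i not in val_idx]
        val = [prices[i] for i in val_idx]
        pred = statistics.median(train)  # baseline: predict the median price
        scores.append(smape(val, [pred] * len(val)))
    return scores
```

A low variance across folds is what lets you trust the local score over a noisy 25k-sample public leaderboard.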
- Collaborative debugging, pivoting, and motivation
- Diverse approaches and perspectives
- 72-hour intensive sprint requiring sustained teamwork
- Combined text and image features effectively
- Explored vision-language models (Qwen2.5, CLIP-based)
- Brand + image feature combinations proved valuable
- 30+ text-only implementations from traditional ML to modern LLMs
- Range from Gradient Boosting to Large Language Models
- Ensemble methods combining different model types
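One common way to combine heterogeneous models is a weighted average with weights set by inverse validation SMAPE; this is an illustrative scheme, and the blending used in the notebooks may differ:

```python
def blend(predictions, val_smapes):
    """Blend per-model price predictions, weighting each model
    by 1 / its validation SMAPE (illustrative weighting scheme)."""
    weights = [1.0 / s for s in val_smapes]
    total = sum(weights)
    n_samples = len(predictions[0])
    return [
        sum(w * preds[i] for w, preds in zip(weights, predictions)) / total
        for i in range(n_samples)
    ]

# two models' price predictions for two test items;
# the model with the lower SMAPE (40) gets the larger weight
blended = blend([[100.0, 40.0], [120.0, 60.0]], [40.0, 50.0])
```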
- Duration: 3-day intensive sprint (72 hours)
- Public Leaderboard: Based on 25k test samples
- Final Ranking: Complete 75k test set evaluation
- Final Achievement: Rank 80/23,000+ teams with SMAPE 43.28%
All notebooks follow a clear, descriptive naming pattern:
- Pattern: `[model/approach]-[specific-technique]-[use-case].ipynb`
- Example: `hybrid-ensemble-validation-test-gap-fixes.ipynb`
- No more cryptic names - every notebook clearly describes its purpose
- Start with `3. Text Only Approach/hybrid-ensemble-validation-test-gap-fixes.ipynb` (Final Solution)
- Compare with `3. Text Only Approach/advanced-hybrid-solution.ipynb` (Ultra-advanced)
- Begin with `1. Exploratory Data Analysis/` to understand the data
- Explore different approaches in `3. Text Only Approach/`
- Try multimodal approaches in `2. MultiModal Approach/`
- LLMs: Granite, Qwen notebooks in Text Only folder
- Traditional ML: Gradient boosting and feature engineering notebooks
- Deep Learning: BERT, T5, LSTM implementations
- Multimodal: CLIP and Qwen2.5 in MultiModal folder
Every notebook now includes:
- ✅ Professional explanatory markdown at the beginning
- ✅ Architecture and approach description
- ✅ Key features and innovations
- ✅ Expected performance metrics
- ✅ Clear, descriptive filenames
Total: 35+ fully documented notebooks across all approaches!