This repository contains my implementation of deep learning models that predict install probability for real-time bidding (RTB).
NOTE:
- 60% of the data in `train_data.csv` is used as the train set.
- 20% of the data in `train_data.csv` is used as the internal validation set.
- 20% of the data in `train_data.csv` is used as the internal test set.
The results in this table are prediction results on the internal validation set. The best-performing model on the validation set is then evaluated on the internal test set (as shown in the last row of the table).
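For reference, a minimal sketch of the 60/20/20 split described above (this is not necessarily the repo's exact splitting code; `random_state=42` is an illustrative choice):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/train_data.csv")

# First carve off 40% as a holdout, then split the holdout in half.
train_df, holdout_df = train_test_split(df, test_size=0.4, random_state=42)
valid_df, test_df = train_test_split(holdout_df, test_size=0.5, random_state=42)
# train_df: 60% (train), valid_df: 20% (internal valid), test_df: 20% (internal test)
```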
Note: This repository was developed under Python 3.8 and TensorFlow 2.8 with CPU only. However, the code automatically detects the available devices (CPU or GPU-enabled). If a GPU-enabled TensorFlow build is properly set up, the models will run on GPU.
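Device availability can be checked with TensorFlow's standard `tf.config` API (this is generic TensorFlow usage, not repo-specific code):

```python
import tensorflow as tf

# List the devices TensorFlow can see. GPUs appear automatically when a
# GPU-enabled TensorFlow build and matching drivers are installed.
print("GPUs found:", tf.config.list_physical_devices("GPU"))
print("CPUs found:", tf.config.list_physical_devices("CPU"))

# With no extra code, TensorFlow places ops on the first visible GPU if one
# exists, otherwise on the CPU.
```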
- Step 1: Navigate to the root directory of the repo, where `requirements.txt` and `execute_pipeline.py` can be found.
- Step 2: Create a new folder named `data` and put `train_data.csv` and `assessment_data.csv` there.
- Step 3: Run `pip install -r requirements.txt`.
- Step 1: Navigate to the root directory of the repo, where `requirements.txt` and `execute_pipeline.py` can be found.
- Step 2: Run `python execute_pipeline.py`.
The following instructions show how to launch the full ML pipeline, which consists of these steps:
- Data preprocessing
- Model building and training
- Model evaluation
- Final submission file generation
1. Leave `IS_PREPROCESS` set to `True` and `IS_TRAIN` set to `True` inside `execute_pipeline.py`
2. Run `python execute_pipeline.py`

In this option, the whole pipeline starts from scratch: it generates features, trains models on those features, evaluates the models on the internal validation and test sets, and finally makes predictions on the assessment set.
During this process, models are trained and evaluated on the following two sets of features (a rough illustration of the difference is sketched after this list):
- basic feature encoding
- a more advanced feature encoding
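As a rough illustration of the difference between the two encodings (the actual preprocessors in `a_preprocessing` may differ; the column name `app_category` is hypothetical): a basic encoding might map categories to integer IDs, while a more advanced encoding might add engineered statistics such as per-category frequencies.

```python
import pandas as pd

# Toy frame; `app_category` is an illustrative column name.
df = pd.DataFrame({"app_category": ["games", "social", "games", "news"]})

# Basic encoding: map each category to an integer ID.
df["app_category_id"] = df["app_category"].astype("category").cat.codes

# More advanced encoding: engineered statistics, e.g. category frequency.
freq = df["app_category"].value_counts(normalize=True)
df["app_category_freq"] = df["app_category"].map(freq)
print(df)
```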
1. Change `IS_PREPROCESS` to `False` inside `execute_pipeline.py`
2. Run `python execute_pipeline.py`

In this option, the program skips the feature generation step and instead assumes the training feature files are already available in the expected location. It trains models on those features, evaluates them on the internal validation and test sets, and finally makes predictions on the assessment set.
1. Change `IS_PREPROCESS` to `False` inside `execute_pipeline.py`
2. Change `IS_TRAIN` to `False` inside `execute_pipeline.py`
3. Run `python execute_pipeline.py`

In this option, the program assumes that the features have already been generated and the trained weights are saved in the expected folders. The model weights are loaded, evaluated on the internal validation and test sets, and finally used to make predictions on the assessment set. A minimal sketch of how the two flags might gate the pipeline stages follows.
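The sketch below only illustrates the control flow implied by the three options; the function names are illustrative stand-ins, not the repo's actual API.

```python
# Illustrative stand-ins for the real pipeline stages in this repo.
def run_preprocessing():    # generate feature files into the expected folders
    print("preprocessing ...")

def train_models():         # train on the generated features and save weights
    print("training ...")

def evaluate_models():      # evaluate on the internal valid and test sets
    print("evaluating ...")

def generate_submission():  # make predictions on the assessment set
    print("predicting ...")

IS_PREPROCESS = True  # False: reuse previously generated feature files
IS_TRAIN = True       # False: load previously saved model weights

if __name__ == "__main__":
    if IS_PREPROCESS:
        run_preprocessing()
    if IS_TRAIN:
        train_models()
    evaluate_models()        # runs in every option
    generate_submission()    # runs in every option
```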
`execute_pipeline.py`, located at the root directory level, is the main entry point. It can be launched via `python execute_pipeline.py` to trigger the whole ML pipeline.
There are 5 subfolders:
- `a_preprocessing` contains all files used for feature engineering. `naive_preprocessor.py` implements basic data preprocessing, while `explore_preprocessor.py` performs more advanced data exploration and feature engineering.
- `b_models` contains all model implementations, including an XGBoost baseline classifier and a DLRM deep learning model.
- `c_ensemble` contains utility functions to combine predictions from multiple models into a single prediction.
- `d_autoencoder_experiment` contains an experimental script demonstrating the idea of using an autoencoder to recast the install prediction task as an anomaly detection task (a toy sketch follows after this list). It is a work in progress due to limited time and can be revived in the future. To run the code, make sure the advanced data preprocessor has generated feature files in the expected locations, then navigate to `d_autoencoder_experiment` and launch `python autoencoder_exp.py`.
- `metrics` contains all utility functions for metrics.
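To make the autoencoder idea concrete, here is a toy sketch (synthetic data and an arbitrary architecture, not the code in `autoencoder_exp.py`): train an autoencoder to reconstruct the majority class (no-install rows), then treat a high reconstruction error on a new row as an anomaly, i.e. a likely install.

```python
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
x_majority = rng.normal(size=(1000, 16)).astype("float32")  # no-install rows

# Small dense autoencoder trained to reconstruct the majority class only.
autoencoder = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(16,)),
    tf.keras.layers.Dense(8, activation="relu"),  # bottleneck encoder
    tf.keras.layers.Dense(16),                    # decoder
])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(x_majority, x_majority, epochs=5, batch_size=64, verbose=0)

# Score new rows by reconstruction error; higher error = more anomalous,
# which this experiment interprets as a higher install likelihood.
x_new = rng.normal(size=(5, 16)).astype("float32")
recon = autoencoder.predict(x_new, verbose=0)
print(np.mean((x_new - recon) ** 2, axis=1))
```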
