This project is part of the EPFL Machine Learning course for fall 2021 and was implemented by Younes Moussaif, Clémence Barsi and Pauline Conti.
The aim of this project is to implement machine learning models able to predict whether a tweet originally contained a ":)" or ":(" smiley face. The available training data contains 2.5 million tweets that the models need to correctly classify as positive or negative. To obtain our test accuracy, we uploaded our submissions to AIcrowd.
The Python libraries and versions used for this project are listed below:
- python = 3.7.11
- scipy = 1.7.1
- pandas = 1.3.4
- numpy = 1.21.2
- matplotlib = 3.5.0
- pytorch = 1.10.1
- torchvision = 0.11.2
- torchaudio = 0.10.1
- cudatoolkit = 10.2
- scikit-learn = 1.0.1
- jupyterlab = 3.2.1
- nb_conda_kernels
- nltk = 4.62.3
- transformers = 4.14.1
- gensim = 4.1.2
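As a quick sanity check (not part of the original repository), the following snippet prints the installed versions of the main libraries so they can be compared against the list above:

```python
# Hypothetical helper: print the installed versions of the main dependencies
# to compare them against the list above.
import sys
import numpy, scipy, pandas, matplotlib, sklearn, torch, nltk, transformers, gensim

for name, module in [("python", sys), ("scipy", scipy), ("pandas", pandas),
                     ("numpy", numpy), ("matplotlib", matplotlib),
                     ("pytorch", torch), ("scikit-learn", sklearn),
                     ("nltk", nltk), ("transformers", transformers),
                     ("gensim", gensim)]:
    version = sys.version.split()[0] if module is sys else module.__version__
    print(f"{name} = {version}")
```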
In order to run the notebooks, run the following command while the environment with the above dependencies is active:
conda install -c conda-forge ipywidgets
To run our best performing model, the script run.py must be run with the same directory structure as in the GitHub repository, meaning:
- in the folder data/models/BERT, both files:
- best_submission_bert_custom.pkl drive link
- best_submission_bert.pkl drive link
- in the folder data/twitter-datasets/
- test_data.csv
- train_pos.csv
- train_neg.csv
- train_pos_full.csv
- train_neg_full.csv
Since the above two .pkl files are too big for GitHub, we uploaded them to an external Google Drive here: drive link
This will create a .csv output file, data/submissions/output_run_py.csv, containing our predictions for our best run (#169220).
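For reference, here is a minimal, hypothetical sketch of what producing such a submission file could look like. The file names match the structure described above, but the `model.predict` call, the {-1, 1} label encoding, and the Id/Prediction column names are assumptions, not a copy of run.py:

```python
import pickle
import pandas as pd

# Load one of the saved models (file name from the structure described above).
with open("data/models/BERT/best_submission_bert.pkl", "rb") as f:
    model = pickle.load(f)

# Read the test tweets; the exact file format is an assumption.
test_df = pd.read_csv("data/twitter-datasets/test_data.csv")

# Hypothetical call: the real prediction pipeline lives in run.py / models_bert.py.
predictions = model.predict(test_df)  # assumed to yield labels in {-1, 1}

# Write the submission in the usual AIcrowd Id/Prediction format (assumed).
submission = pd.DataFrame({"Id": range(1, len(predictions) + 1),
                           "Prediction": predictions})
submission.to_csv("data/submissions/output_run_py.csv", index=False)
```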
If you want to train one of our BERT models, we recommend using Google Colab with a GPU (we used a P100 GPU on Colab); more information on how to do that is in the notebook BERT_models.ipynb. Additionally, instructions on how to create a conda virtual environment with the packages required to run run.py are given inside the notebook, and you can run run.py from within it (this was done using conda version 4.10.3).
On a GPU, run.py takes approximately 1 minute to run (depending on your GPU); on a CPU, it takes ~1 hour.
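If you are unsure which device will be used, a short PyTorch check (a generic sketch, not taken from run.py) looks like this:

```python
import torch

# Select the GPU when CUDA is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Running on: {device}")
if device.type == "cuda":
    print(f"GPU: {torch.cuda.get_device_name(0)}")
```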
To reproduce our other results, open preprocessing_embedding_baseline_example.ipynb and follow the steps below (most of them are implemented in the notebook, except for placing the large file glove-twitter-25.gz in the right place):
- additionally to the packages you must install at the beginning of `BERT_models.ipynb`, install gensim using `pip install gensim`
- first preprocess the tweets using the function `preprocessing` from `preprocessing.py` (on train_pos and train_neg; some additional files required to run the functions are provided in our repo)
- Transform the tweets into number vectors (embed them) using functions in `embeddings.py` (see the sketch after this list):
  - GloVe:
    - make sure you have downloaded all the files in the data directory; `glove-twitter-25.gz` was too large to host on GitHub, so we host it on Google Drive: link
    - load the model with `load_glove_model`
    - use `clean_cols` to remove unused columns
    - use `df_to_numpy_features` to obtain a numpy matrix containing the features and one that contains the labels
    - normalize the features using `standardize_cols`
  - TF-IDF:
    - use `tf_idf_embedding` to produce the sparse embeddings, then `add_label_tfidf` to get the corresponding labels
- Run one of the models or cross-validation functions from `baseline.py`
- If you want to make a submission, use the corresponding submission function in `submission.py`
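The sketch below shows how these steps could be chained together for the GloVe and TF-IDF baselines. The function names come from the repository files listed in these steps, but their signatures and the file paths are assumptions; refer to the notebook for the exact calls:

```python
# Hypothetical end-to-end sketch of the baseline pipeline.
# The function names exist in this repository, but their signatures are
# assumed here for illustration purposes only.
from preprocessing import preprocessing
from embeddings import (load_glove_model, clean_cols,
                        df_to_numpy_features, standardize_cols,
                        tf_idf_embedding, add_label_tfidf)

# 1. Preprocess the raw tweets (assumed to return a pandas DataFrame).
df = preprocessing("data/twitter-datasets/train_pos.csv",
                   "data/twitter-datasets/train_neg.csv")

# 2a. GloVe: embed each tweet with the pretrained GloVe Twitter vectors.
glove = load_glove_model("data/glove-twitter-25.gz")
df = clean_cols(df)                     # drop columns the models do not use
X, y = df_to_numpy_features(df, glove)  # features and labels as numpy arrays
X = standardize_cols(X)                 # normalize each feature column

# 2b. Alternative: sparse TF-IDF embeddings with their labels.
X_tfidf = tf_idf_embedding(df)
y_tfidf = add_label_tfidf(df)

# 3. Train or cross-validate one of the baseline models from baseline.py
#    on (X, y), then generate a submission with submission.py if desired.
```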
The repository contains the following notebooks and scripts:
- `BERT_models.ipynb` contains the pipeline we used for the BERT models
- `preprocessing_embedding_baseline_example.ipynb` contains code to reproduce our preprocessing, embedding, and results with the baseline models
- `EDA.ipynb` contains the code we used to generate Figure 1 from the report
- `graph_BERT.ipynb` contains the code to produce Figure 2 from the report
- `baseline.py` contains the functions used to train and fit our baseline models
- `embeddings.py` contains the functions used to obtain the embeddings for the models
- `helpers.py` contains helper functions
- `helper_bert.py` contains the helpers needed for the BERT models
- `models_bert.py` contains the functions and implementation of the BERT models
- `preprocessing.py` contains the functions used to preprocess our data before embedding
- `preprocessing_bert.py` contains the functions used to preprocess the data before BERT
- `submission.py` allows generating the files needed to make a submission on AIcrowd
- `train_bert.py` contains the functions performing the training of the BERT models