This repository contains source code for detecting sarcasm in Reddit comments. The dataset used in this analysis is available on Kaggle.
SARC.yaml defines a conda environment with all Python packages required to run the source code. paths.json lists the file paths used by the scripts; the home file path must be filled in before running any of them. The train-balanced-sarcasm.csv file from the dataset must also be placed in the data directory before running the source code.
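The two prerequisites above can be checked before any script is run. The following is a minimal sketch, not part of the repository: the function name `check_setup` is hypothetical, and it assumes that paths.json is a JSON object whose home path is stored under a key named `home` and that the data directory sits directly under that home path. Adjust both to match the actual paths.json.

```python
import json
from pathlib import Path


def check_setup(paths_file="paths.json", home_key="home",
                data_dirname="data", csv_name="train-balanced-sarcasm.csv"):
    """Return a list of setup problems; an empty list means nothing was found."""
    paths_path = Path(paths_file)
    if not paths_path.is_file():
        return [f"{paths_file} not found"]

    with paths_path.open() as f:
        paths = json.load(f)

    # ASSUMPTION: the home path lives under the key "home".
    home = paths.get(home_key)
    if not home:
        return [f'"{home_key}" is not filled in in {paths_file}']

    problems = []
    # ASSUMPTION: the data directory is <home>/data.
    csv_path = Path(home) / data_dirname / csv_name
    if not csv_path.is_file():
        problems.append(f"missing dataset file: {csv_path}")
    return problems
```

Calling `check_setup()` from the directory that holds paths.json returns an empty list when the home path is set and the CSV file is in place, and otherwise a short description of what is missing.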
- bert.py: Contains a wrapper class for the Hugging Face transformers DistilBert implementation.
- clean.py: Preprocessing script for the sarcasm dataset.
- dataset.py: Contains a torch.utils.data.Dataset class for the sarcasm dataset.
- stats.py: Script for plotting the token count distribution of the sarcasm dataset.
- test.py: Script for calculating test set accuracy.
- train.py: Script for fine-tuning DistilBert for sarcasm detection.
Run the scripts in the following order to produce the sarcasm detection training and test results:
1. clean.py
2. train.py
3. test.py
Results of train.py and test.py are printed to the command prompt once the training and testing processes have completed.
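The run order can also be scripted so that a failure in an earlier step stops the later ones. This is a minimal sketch rather than something shipped with the repository: the helper name `run_pipeline` is hypothetical, and it assumes the three scripts take no command-line arguments and are run from the repository root.

```python
import subprocess
import sys

# Order taken from this README: preprocessing, then training, then evaluation.
PIPELINE = ["clean.py", "train.py", "test.py"]


def run_pipeline(scripts=PIPELINE, python=sys.executable):
    """Run each script in order with the current interpreter.

    ASSUMPTION: the scripts need no command-line arguments.
    """
    for script in scripts:
        print(f"Running {script} ...")
        # check=True raises CalledProcessError on a non-zero exit code,
        # so later scripts do not run after an earlier one has failed.
        subprocess.run([python, script], check=True)
```

With the conda environment from SARC.yaml activated, calling `run_pipeline()` runs the three scripts back to back. Their output is not captured, so the training and test results still appear in the command prompt.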