This project was developed as part of the DSAIT4090 Natural Language Processing course at TU Delft. It investigates how negatives and hard negatives affect RAG performance, both through their inclusion in IR training and through their presence in the input contexts for the LLM.
- Download the data files from https://drive.google.com/drive/folders/1qIZcNcU2wtiJNr3BUyX2GIUtnHEfbQDi?usp=sharing and put them into the `data/` directory.
- Structure the files such that `wiki_musique_corpus.json` sits directly in `data/`, while `dev.json`, `train.json`, and `test.json` go into a new subdirectory `data/qa/`. The resulting file paths are `data/wiki_musique_corpus.json`, `data/qa/dev.json`, `data/qa/train.json`, and `data/qa/test.json` (a quick sanity check for this layout is sketched below).
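A minimal layout check (a hypothetical helper, not part of the repository) could look like this:

```python
# check_data_layout.py -- hypothetical helper, not part of the repository.
# Verifies that the data files are where the pipeline expects them.
from pathlib import Path

EXPECTED_FILES = [
    Path("data/wiki_musique_corpus.json"),
    Path("data/qa/dev.json"),
    Path("data/qa/train.json"),
    Path("data/qa/test.json"),
]

missing = [str(p) for p in EXPECTED_FILES if not p.is_file()]
if missing:
    raise FileNotFoundError(f"Missing data files: {', '.join(missing)}")
print("All data files are in place.")
```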
- Create a conda environment with `conda env create -f environment.yml` to install the required dependencies (update `environment.yml` as dependencies are added or removed; you may need to delete and recreate the environment afterwards). If `dexter-cqa` is a dependency, you might also need the Microsoft Visual C++ Build Tools from https://visualstudio.microsoft.com/visual-cpp-build-tools/; during installation, make sure to select "Desktop development with C++".
- Get a Hugging Face token (from https://huggingface.co/) and an OpenAI key (from the developer platform, https://platform.openai.com/docs/overview) and put them in a `.env` file in the root directory of the project, as follows (a sketch of how they can be loaded appears after this list):
  - huggingface_token=[insert huggingface token]
  - OPENAI_KEY=[key here]
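The scripts presumably read these tokens from the environment; a minimal sketch using `python-dotenv` (an assumption; the repository may load them differently) looks like this:

```python
# Hypothetical sketch: load the tokens from .env with python-dotenv.
# The repository's scripts may read them in a different way.
import os
from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env into os.environ
hf_token = os.environ["huggingface_token"]
openai_key = os.environ["OPENAI_KEY"]
```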
- Run `python corpus_management/encode_corpus.py` to encode the corpus into dense embeddings. This process will take a while (potentially several hours depending on your hardware) and will save the embedded corpus as a memmap file in `data/embeddings`. This step is necessary for efficient retrieval later (see the read-back sketch below).
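Since the embeddings are stored as a raw memmap, they can be mapped back into memory without loading the whole file. The sketch below is hypothetical: the filename, dtype, and embedding dimension are assumptions, so check `encode_corpus.py` for the actual values.

```python
# Hypothetical read-back of the embedded corpus; the filename, dtype, and
# embedding dimension below are assumptions -- see encode_corpus.py.
import numpy as np

EMBED_DIM = 768          # assumed encoder output dimension
DTYPE = np.float32       # assumed storage dtype

emb = np.memmap("data/embeddings/corpus.memmap", dtype=DTYPE, mode="r")
emb = emb.reshape(-1, EMBED_DIM)  # one row per corpus passage
print(f"{emb.shape[0]} passage embeddings of dimension {emb.shape[1]}")
```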
- Training can also take a while, and you might need to configure the training parameters in `train_adore.py` based on your system's capabilities. You can first check your hardware with `python setup_analysis/check_gpu_availability.py` (a minimal version of such a check is sketched below). Once the parameters are configured appropriately, run `python train_adore.py`, which will save model weights for each of the 6 epochs in `model_checkpoint/`.
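For reference, a minimal GPU check (the repository's `setup_analysis/check_gpu_availability.py` may report more detail) could look like this:

```python
# Minimal GPU availability check; the repository's script may do more.
import torch

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GiB")
else:
    print("No CUDA GPU found; training will run on CPU and be much slower.")
```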
- Then, in `relevant_contexts_experiment.py`, choose the experiment you want and run the file with `python relevant_contexts_experiment.py`. When complete, it will write its results to `results/`.