This project was developed as part of the DSAIT4090 Natural Language Processing course at TU Delft. It investigates how negatives and hard negatives affect RAG performance, both through their inclusion in IR training and through their presence in the input contexts for the LLM.
- Download the data files from https://drive.google.com/drive/folders/1qIZcNcU2wtiJNr3BUyX2GIUtnHEfbQDi?usp=sharing and put them into the `data/` directory.
- Structure the files such that `wiki_musique_corpus.json` sits directly in `data/`, while `dev.json`, `train.json`, and `test.json` go into a new subdirectory `data/qa/`. The resulting file paths are `data/wiki_musique_corpus.json`, `data/qa/dev.json`, `data/qa/train.json`, and `data/qa/test.json` (a quick sanity check for this layout is sketched below).
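A minimal layout check (a hypothetical helper, not part of the repository) could look like this:

```python
# check_data_layout.py -- hypothetical helper, not part of the repository.
# Verifies that the data files are where the pipeline expects them.
from pathlib import Path

EXPECTED_FILES = [
    Path("data/wiki_musique_corpus.json"),
    Path("data/qa/dev.json"),
    Path("data/qa/train.json"),
    Path("data/qa/test.json"),
]

missing = [str(p) for p in EXPECTED_FILES if not p.is_file()]
if missing:
    raise FileNotFoundError(f"Missing data files: {', '.join(missing)}")
print("All data files are in place.")
```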
- Create a conda environment with `conda env create -f environment.yml` to install the required dependencies (update `environment.yml` as dependencies are added or removed; you may need to delete and recreate the environment afterwards). If `dexter-cqa` is a dependency, you might also need the Microsoft Visual C++ Build Tools from https://visualstudio.microsoft.com/visual-cpp-build-tools/; during installation, make sure to select "Desktop development with C++".
- Get a Hugging Face token (from https://huggingface.co/) and an OpenAI key (from the developer platform, https://platform.openai.com/docs/overview) and put them in a `.env` file in the root directory of the project, as follows (a sketch of how they can be loaded appears after this list):
  - huggingface_token=[insert huggingface token]
  - OPENAI_KEY=[key here]
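The scripts presumably read these tokens from the environment; a minimal sketch using `python-dotenv` (an assumption; the repository may load them differently) looks like this:

```python
# Hypothetical sketch: load the tokens from .env with python-dotenv.
# The repository's scripts may read them in a different way.
import os
from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env into os.environ
hf_token = os.environ["huggingface_token"]
openai_key = os.environ["OPENAI_KEY"]
```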
- Run `python corpus_management/encode_corpus.py` to encode the corpus into dense embeddings. This process will take a while (potentially several hours depending on your hardware) and will save the embedded corpus as a memmap file in `data/embeddings`. This step is necessary for efficient retrieval later (see the read-back sketch below).
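Since the embeddings are stored as a raw memmap, they can be mapped back into memory without loading the whole file. The sketch below is hypothetical: the filename, dtype, and embedding dimension are assumptions, so check `encode_corpus.py` for the actual values.

```python
# Hypothetical read-back of the embedded corpus; the filename, dtype, and
# embedding dimension below are assumptions -- see encode_corpus.py.
import numpy as np

EMBED_DIM = 768          # assumed encoder output dimension
DTYPE = np.float32       # assumed storage dtype

emb = np.memmap("data/embeddings/corpus.memmap", dtype=DTYPE, mode="r")
emb = emb.reshape(-1, EMBED_DIM)  # one row per corpus passage
print(f"{emb.shape[0]} passage embeddings of dimension {emb.shape[1]}")
```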
- Training can also take a while, and you might need to configure the training parameters in `train_adore.py` based on your system's capabilities. You can first check your hardware with `python setup_analysis/check_gpu_availability.py` (a minimal version of such a check is sketched below). Once the parameters are configured appropriately, run `python train_adore.py`, which will save model weights for each of the 6 epochs in `model_checkpoint/`.
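For reference, a minimal GPU check (the repository's `setup_analysis/check_gpu_availability.py` may report more detail) could look like this:

```python
# Minimal GPU availability check; the repository's script may do more.
import torch

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GiB")
else:
    print("No CUDA GPU found; training will run on CPU and be much slower.")
```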
- Then, in `relevant_contexts_experiment.py`, choose the experiment you want and run the file with `python relevant_contexts_experiment.py`. When complete, it will write its results to `results/`.