This repository contains code and data used by the SmileyLlama project to train SmileyLlama and its variants, and to produce results used in the paper. The SmileyLlama model is not hosted here; rather, it's hosted on huggingface, along with variants trained for adhering to properties specified in a prompt and for generating binders to SARS-CoV-2 Main Protease (MPro).
For those who want a gentle, yet hands-on introduction to SmileyLlama, download the Demo.ipynb jupyter notebook, which provides a demonstration of SmileyLlama's abilities and a brief tutorial on writing prompts for it and related models.
Supervised fine-tuning and DPO of SmileyLlama are very memory-intensive due to the number of parameters. To replicate the work in this study, a 4xGPU node with 48 GB VRAM per GPU is recommended; less VRAM will result in an out-of-memory error. For smaller setups, adjust the gradient_accumulation_steps setting in the relevant axolotl configuration files (sft/8b-lora32/cf_lora.yml and prompt_following/dpo-instr/cf_dpo_lora.yml) so that the overall batch size remains unchanged; see the sketch below. In axolotl, the total batch size is the product of the micro batch size, gradient accumulation steps, and number of GPUs. This was tested using Nvidia A40 GPUs.
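As an illustration of that arithmetic (the numbers below are hypothetical and not the values in the shipped configs), a single-GPU run can match the effective batch size of a 4xGPU node by scaling gradient_accumulation_steps:
# Hypothetical illustration: axolotl's effective batch size is
# micro_batch_size * gradient_accumulation_steps * number_of_gpus.
# The real values are set in cf_lora.yml / cf_dpo_lora.yml.
def effective_batch_size(micro_batch_size, gradient_accumulation_steps, n_gpus):
    return micro_batch_size * gradient_accumulation_steps * n_gpus

reference = effective_batch_size(micro_batch_size=4, gradient_accumulation_steps=2, n_gpus=4)   # e.g. 4x A40 node
single_gpu = effective_batch_size(micro_batch_size=4, gradient_accumulation_steps=8, n_gpus=1)  # smaller setup
assert reference == single_gpu == 32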
Inference with SmileyLlama should not be done on a GPU with less than 16 GB of VRAM. Inference can also be run on the CPU, but this will be slow.
Tested on Python 3.10.12 (Python 3.10 is available from python.org, and version management can be done with pyenv), gcc 11.4.0, and CUDA 11.8.0. Runs on Linux; tested on Rocky Linux 8.10 (Green Obsidian).
A few environments are required to replicate the work in SmileyLlama, including fine-tuning the models.
Make sure to have Python 3.10 and CUDA version 11 or 12 (tested on 11.8 and 12.2) installed.
cd envs
python -m venv axo
source axo/bin/activate
pip install packaging wheel psutil
pip install torch==2.3.1
pip install flash-attn==2.6.2 --no-build-isolation
pip install -r axo-requirements.txt
Make sure to have Python 3.10.12, gcc 11.4.0, and CUDA 11.8.0, or compatible versions, installed.
cd envs
python -m venv ana-env
source ana-env/bin/activate
pip install packaging wheel
pip install torch==2.3.1+cu118 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r ana-env-requirements.txt
cd ../scripts
pip install -e .
python -m ipykernel install --user --name=ana-env
Also, remember to register a kernel for each environment (as in the last command above) so it can be used in jupyter notebooks.
Follow steps on https://github.com/BenevolentAI/guacamol
Installing these will take somewhere on the order of 10 minutes on a "normal" desktop computer if all goes well. It can take much longer if flash attention is compiled (on the order of hours) instead of loaded from a prebuilt binary.
You can use the ana-env to run this demo, or any environment with torch, transformers, and rdkit. The demo folder contains a jupyter notebook that will take you through how to download and use SmileyLlama to generate molecules with specified features. SmileyLlama's weights are about 16 GB, so the time it takes to download them will vary based on your internet speed. Beyond the download, a "normal" desktop will probably take on the order of 5 minutes to run the demo. The outputs of the notebook are already shown, although some parts involve randomness, so your results may differ.
To download and use SmileyLlama or its derivative models, you can visit this link. All scripts and jupyter notebooks in this codebase reference either these models or the Llama models by their huggingface identifiers (e.g. "THGLab/Llama-3.1-8B-SmileyLlama-1.1").
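For reference, the models load through the standard transformers API. The sketch below is minimal, and the prompt string is illustrative only; Demo.ipynb documents the prompt format the models were actually trained with.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "THGLab/Llama-3.1-8B-SmileyLlama-1.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# ~16 GB of weights; use a GPU with at least 16 GB VRAM, or fall back to CPU (slow).
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

# Illustrative prompt only; see Demo.ipynb for the prompt format used in training.
messages = [{"role": "user", "content": "Generate a drug-like molecule as a SMILES string."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=64, do_sample=True)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))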
We've included code and data required to regenerate the figures in this paper. However, some of the scripts in mpro require the iMiner library to run, which is not yet released. Still, the derivative model of SmileyLlama produced by optimizing with iMiner is available at https://huggingface.co/THGLab/Llama-3.1-8B-SmileyLlama-1.1-Mpro.
The code and data required to generate SMILES strings and run the guacamol benchmark on Llama-3.1-Instruct are in llama_k_shot. Code and data used to compare SmileyLlama and Llama are found in mmlu (lm-evaluation-harness needs to be installed to run the benchmark). The lm-evaluation-harness can also be used to calculate perplexity on wikitext using:
lm_eval --model=hf --model_args="pretrained=/path/to/model" --tasks=wikitext
You can download the necessary data for this section from figshare into the sft directory with the following commands:
cd sft
wget -O random_smiles.jsonl https://ndownloader.figshare.com/files/60278828
wget -O chembl_random_smiles.txt https://ndownloader.figshare.com/files/60278825
wget -O chembl_33.csv https://ndownloader.figshare.com/files/60278831
To create new datasets with random SMILES from the chembl_random_smiles.txt and random_smiles.jsonl files, use the make_sft_data.ipynb notebook.
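If you want to check the shape of the downloaded data before building new datasets (a quick sketch; the authoritative schema is whatever make_sft_data.ipynb produces), you can peek at the first records:
import json

# Print the field names of the first SFT record in the downloaded JSONL file.
with open("random_smiles.jsonl") as f:
    print(json.loads(f.readline()).keys())

# chembl_random_smiles.txt appears to be plain text; print the first few lines to confirm its format.
with open("chembl_random_smiles.txt") as f:
    for _ in range(3):
        print(f.readline().strip())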
sft/8b-lora32 contains the config file used by axolotl for fine-tuning. To restart fine-tuning, you'll need to first gain access to Llama. You can do this either by exporting your huggingface access token (as in the first command below) after requesting access to Llama through your account, or by acquiring Llama from another source and specifying the path to the Llama weights instead of meta-llama/Llama-3.1-8B-Instruct in the first line of the config. Then preprocess, begin fine-tuning, and merge the LoRA into the weights.
# Export HuggingFace Token to download Llama (or modify the path to point to the weights)
export HF_TOKEN=<Your HuggingFace Token>
# Preprocess the data
CUDA_VISIBLE_DEVICES="" python3 -m axolotl.cli.preprocess cf_lora.yml
# Begin fine-tuning
srun accelerate launch -m axolotl.cli.train cf_lora.yml
# Merge the LoRA into the weights. The new, fine-tuned models' weights will be in `sft/outputs/merged`
python3 -m axolotl.cli.merge_lora $(pwd)/cf_lora.yml --lora_model_dir="$(pwd)/outputs"
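As a quick sanity check (not part of the original pipeline), the merged checkpoint in sft/outputs/merged can be loaded like any other Hugging Face model directory:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

merged_dir = "sft/outputs/merged"  # produced by the merge_lora step above
tokenizer = AutoTokenizer.from_pretrained(merged_dir)
model = AutoModelForCausalLM.from_pretrained(merged_dir, torch_dtype=torch.bfloat16, device_map="auto")
print(model.config.model_type)  # should report a llama architecture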
Code and data for analyzing the ability of SmileyLlama to follow instructions in the prompt before and after DPO for instruction following can be found in the prompt_following folder.
Similarly to the previous section, to restart DPO, simply modify prompt_following/dpo-instr/cf_dpo_lora.yml to have the relevant paths in your system and run
srun accelerate launch --use-deepspeed -m axolotl.cli.train cf_dpo_lora.yml --dataset_processes=1
python3 -m axolotl.cli.merge_lora $(pwd)/cf_dpo_lora.yml --lora_model_dir="$(pwd)/outputs"
The data required for this can be found in prompt_following/dpo-instr/dpodataset/dpodataset.jsonl.
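To inspect how the preference pairs are structured without opening the file by hand (a small sketch; no particular field names are assumed):
import json

# Print the keys and truncated values of the first DPO preference record.
with open("prompt_following/dpo-instr/dpodataset/dpodataset.jsonl") as f:
    record = json.loads(f.readline())
for key, value in record.items():
    print(f"{key}: {str(value)[:80]}")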
This directory contains all relevant parts of the project used to optimize SmileyLlama for inhibition of SARS-CoV-2 Main Protease (MPro). Analysis of a few sample ligands generated by SmileyLlama after this optimization (the model that generated them is available on huggingface as THGLab/Llama-3.1-8B-SmileyLlama-1.1-Mpro) can be found in mpro/ligand_analysis. Files for DPO and outputs of the model throughout the training process can be found in mpro/run/. Outputs from iMiner used for comparison in our paper can be found in mpro/iminer_ref_details. The Jupyter notebooks used to generate figures relating to optimization for MPro inhibition are mpro/MproFigures.ipynb and mpro/cleaner_inference.ipynb.
To reproduce the Llama 0-shot and 20-shot values in Table 1, use the llama_k_shot/guacamol_analysis.py script. It analyzes pre-generated molecules which were produced with the gen_0_shot and gen_20_shot scripts.
To reproduce the SmileyLlama values in Table 1 and Figure S1, use the sft/guacamol_analysis.ipynb jupyter notebook.
To reproduce the visualizations of properties in Figure 2, use the sft/distribution_vis.ipynb notebook.
To reproduce the SFT and DPO results in Table 2 and Figure 3b, use the prompt_following/prompt_following_analysis.ipynb notebook.
To reproduce Figure 3a, use the prompt_following/figure3a.ipynb notebook.
To reproduce Figure 4, use the mpro/MproFigures.ipynb and mpro/cleaner_inference.ipynb notebooks.
To visualize the interactions between selected generations and Mpro as in Figure 5, use the results from the Protein-Ligand Interaction Profiler (PLIP) in mpro/ligand_analysis/plip.