This repository contains code and data used by the SmileyLlama project to train SmileyLlama and its variants, and to produce results used in the paper. The SmileyLlama model is not hosted here; rather, it's hosted on huggingface, along with variants trained for adhering to properties specified in a prompt and for generating binders to SARS-CoV-2 Main Protease (MPro).
For those who want a gentle, yet hands-on introduction to SmileyLlama, download the Demo.ipynb jupyter notebook, which provides a demonstration of SmileyLlama's abilities and a brief tutorial on writing prompts for it and related models.
Supervised fine-tuning and DPO of SmileyLlama are very memory-intensive due to the number of parameters. To replicate the work in this study, a 4xGPU node with 48 GB VRAM per GPU is recommended; less VRAM will result in an out-of-memory error. For smaller setups, adjust the gradient_accumulation_steps setting in the relevant axolotl configuration files (sft/8b-lora32/cf_lora.yml and prompt_following/dpo-instr/cf_dpo_lora.yml) so that the overall batch size remains unchanged; see the sketch below. In axolotl, the total batch size is the product of the micro batch size, gradient accumulation steps, and number of GPUs. This was tested using Nvidia A40 GPUs.
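As an illustration of that arithmetic (the numbers below are hypothetical and not the values in the shipped configs), a single-GPU run can match the effective batch size of a 4xGPU node by scaling gradient_accumulation_steps:
# Hypothetical illustration: axolotl's effective batch size is
# micro_batch_size * gradient_accumulation_steps * number_of_gpus.
# The real values are set in cf_lora.yml / cf_dpo_lora.yml.
def effective_batch_size(micro_batch_size, gradient_accumulation_steps, n_gpus):
    return micro_batch_size * gradient_accumulation_steps * n_gpus

reference = effective_batch_size(micro_batch_size=4, gradient_accumulation_steps=2, n_gpus=4)   # e.g. 4x A40 node
single_gpu = effective_batch_size(micro_batch_size=4, gradient_accumulation_steps=8, n_gpus=1)  # smaller setup
assert reference == single_gpu == 32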
Inference with SmileyLlama should not be done on a GPU with less than 16 GB of VRAM. Inference can also be run on the CPU, but this will be slow.
Tested on Python 3.10.12 (Python 3.10 is available from python.org, and version management can be done with pyenv), gcc 11.4.0, and CUDA 11.8.0. Runs on Linux; tested on Rocky Linux 8.10 (Green Obsidian).
A few environments are required to replicate the work in SmileyLlama, including fine-tuning the models.
Make sure to have Python 3.10 and CUDA version 11 or 12 (tested on 11.8 and 12.2) installed.
cd envs
python -m venv axo
source axo/bin/activate
pip install packaging wheel psutil
pip install torch==2.3.1
pip install flash-attn==2.6.2 --no-build-isolation
pip install -r axo-requirements.txt
Make sure to have Python 3.10.12, gcc 11.4.0, and CUDA 11.8.0, or compatible versions, installed.
cd envs
python -m venv ana-env
source ana-env/bin/activate
pip install packaging wheel
pip install torch==2.3.1+cu118 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r ana-env-requirements.txt
cd ../scripts
pip install -e .
python -m ipykernel install --user --name=ana-env
Also, remember to register a kernel for each environment (as in the last command above) so it can be used in jupyter notebooks.
Follow steps on https://github.com/BenevolentAI/guacamol
Installing these will take somewhere on the order of 10 minutes on a "normal" desktop computer if all goes well. It can take much longer if flash attention is compiled (on the order of hours) instead of loaded from a prebuilt binary.
You can use the ana-env to run this demo, or any environment with torch, transformers, and rdkit. The demo folder contains a jupyter notebook that will take you through how to download and use SmileyLlama to generate molecules with specified features. SmileyLlama's weights are about 16 GB, so the time it takes to download them will vary based on your internet speed. Beyond the download, a "normal" desktop will probably take on the order of 5 minutes to run the demo. The outputs of the notebook are already shown, although some parts involve randomness, so your results may differ.
To download and use SmileyLlama or its derivative models, you can visit this link. All scripts and jupyter notebooks in this codebase reference either these models or the Llama models by their huggingface identifiers (e.g. "THGLab/Llama-3.1-8B-SmileyLlama-1.1").
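For reference, the models load through the standard transformers API. The sketch below is minimal, and the prompt string is illustrative only; Demo.ipynb documents the prompt format the models were actually trained with.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "THGLab/Llama-3.1-8B-SmileyLlama-1.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# ~16 GB of weights; use a GPU with at least 16 GB VRAM, or fall back to CPU (slow).
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

# Illustrative prompt only; see Demo.ipynb for the prompt format used in training.
messages = [{"role": "user", "content": "Generate a drug-like molecule as a SMILES string."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=64, do_sample=True)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))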
We've included code and data required to regenerate the figures in this paper. However, some of the scripts in mpro require the iMiner library to run, which is not yet released. Still, the derivative model of SmileyLlama produced by optimizing with iMiner is available at https://huggingface.co/THGLab/Llama-3.1-8B-SmileyLlama-1.1-Mpro.
The code and data required to generate SMILES strings and run the guacamol benchmark on Llama-3.1-Instruct are in llama_k_shot. Code and data used to compare SmileyLlama and Llama are found in mmlu (lm-evaluation-harness needs to be installed to run the benchmark). The lm-evaluation-harness can also be used to calculate perplexity on wikitext using:
lm_eval --model=hf --model_args="pretrained=/path/to/model" --tasks=wikitext
You can download the necessary data for this section from figshare into the sft directory with the following commands:
cd sft
wget -O random_smiles.jsonl https://ndownloader.figshare.com/files/60278828
wget -O chembl_random_smiles.txt https://ndownloader.figshare.com/files/60278825
wget -O chembl_33.csv https://ndownloader.figshare.com/files/60278831
To create new datasets with random SMILES from the chembl_random_smiles.txt and random_smiles.jsonl files, use the make_sft_data.ipynb notebook.
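If you want to check the shape of the downloaded data before building new datasets (a quick sketch; the authoritative schema is whatever make_sft_data.ipynb produces), you can peek at the first records:
import json

# Print the field names of the first SFT record in the downloaded JSONL file.
with open("random_smiles.jsonl") as f:
    print(json.loads(f.readline()).keys())

# chembl_random_smiles.txt appears to be plain text; print the first few lines to confirm its format.
with open("chembl_random_smiles.txt") as f:
    for _ in range(3):
        print(f.readline().strip())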
sft/8b-lora32 contains the config file used by axolotl for fine-tuning. To restart fine-tuning, you'll need to first gain access to Llama. You can do this either by exporting your huggingface access token (as in the first command below) after requesting access to Llama through your account, or by acquiring Llama from another source and specifying the path to the Llama weights instead of meta-llama/Llama-3.1-8B-Instruct in the first line of the config. Then preprocess, begin fine-tuning, and merge the LoRA into the weights.
# Export HuggingFace Token to download Llama (or modify the path to point to the weights)
export HF_TOKEN=<Your HuggingFace Token>
# Preprocess the data
CUDA_VISIBLE_DEVICES="" python3 -m axolotl.cli.preprocess cf_lora.yml
# Begin fine-tuning
srun accelerate launch -m axolotl.cli.train cf_lora.yml
# Merge the LoRA into the weights. The new, fine-tuned models' weights will be in `sft/outputs/merged`
python3 -m axolotl.cli.merge_lora $(pwd)/cf_lora.yml --lora_model_dir="$(pwd)/outputs"
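As a quick sanity check (not part of the original pipeline), the merged checkpoint in sft/outputs/merged can be loaded like any other Hugging Face model directory:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

merged_dir = "sft/outputs/merged"  # produced by the merge_lora step above
tokenizer = AutoTokenizer.from_pretrained(merged_dir)
model = AutoModelForCausalLM.from_pretrained(merged_dir, torch_dtype=torch.bfloat16, device_map="auto")
print(model.config.model_type)  # should report a llama architecture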
Code and data for analyzing the ability of SmileyLlama to follow instructions in the prompt before and after DPO for instruction following can be found in the prompt_following folder.
Similarly to the previous section, to restart DPO, simply modify prompt_following/dpo-instr/cf_dpo_lora.yml to have the relevant paths in your system and run
srun accelerate launch --use-deepspeed -m axolotl.cli.train cf_dpo_lora.yml --dataset_processes=1
python3 -m axolotl.cli.merge_lora $(pwd)/cf_dpo_lora.yml --lora_model_dir="$(pwd)/outputs"
The data required for this can be found in prompt_following/dpo-instr/dpodataset/dpodataset.jsonl.
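To inspect how the preference pairs are structured without opening the file by hand (a small sketch; no particular field names are assumed):
import json

# Print the keys and truncated values of the first DPO preference record.
with open("prompt_following/dpo-instr/dpodataset/dpodataset.jsonl") as f:
    record = json.loads(f.readline())
for key, value in record.items():
    print(f"{key}: {str(value)[:80]}")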
This directory contains all relevant parts of the project used to optimize SmileyLlama for inhibition of SARS-CoV-2 Main Protease (MPro). Analysis of a few sample ligands generated by SmileyLlama after this optimization (the model that generated them is available on huggingface as THGLab/Llama-3.1-8B-SmileyLlama-1.1-Mpro) can be found in mpro/ligand_analysis. Files for DPO and outputs of the model throughout the training process can be found in mpro/run/. Outputs from iMiner used for comparison in our paper can be found in mpro/iminer_ref_details. The Jupyter notebooks used to generate figures relating to optimization for MPro inhibition are mpro/MproFigures.ipynb and mpro/cleaner_inference.ipynb.
To reproduce the Llama 0-shot and 20-shot values in Table 1, use the llama_k_shot/guacamol_analysis.py script. It analyzes pre-generated molecules which were produced with the gen_0_shot and gen_20_shot scripts.
To reproduce the SmileyLlama values in Table 1 and Figure S1, use the sft/guacamol_analysis.ipynb jupyter notebook.
To reproduce the visualizations of properties in Figure 2, use the sft/distribution_vis.ipynb notebook.
To reproduce the SFT and DPO results in Table 2 and Figure 3b, use the prompt_following/prompt_following_analysis.ipynb notebook.
To reproduce Figure 3a, use the prompt_following/figure3a.ipynb notebook.
To reproduce Figure 4, use the mpro/MproFigures.ipynb and mpro/cleaner_inference.ipynb notebooks.
To visualize the interactions between selected generations and Mpro as in Figure 5, use the results from the Protein-Ligand Interaction Profiler (PLIP) in mpro/ligand_analysis/plip.