This repository contains scripts and utilities for unveiling linguistic regions in large language models. The instructions below walk through the full workflow: preprocess the data, train the model, select regions, and assess the model's adaptability to damage.
- unveiling_code/: Main project directory containing code organized by function.
- damage/: Contains scripts related to model assessment.
- data_preprocess/: Contains scripts for dataset preparation and tokenization.
- region_selection/: Contains scripts for extracting linguistic regions.
- training/: Contains scripts and utilities for model training and evaluation.
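Before running anything, it can help to confirm that a checkout has the layout listed above. The following is a minimal sketch, not part of the repository: `missing_dirs` is a hypothetical helper, and the root path passed to it is an assumption that should be replaced with the location of your own checkout.

```python
from pathlib import Path

# Subdirectories named in the list above.
EXPECTED_DIRS = ["damage", "data_preprocess", "region_selection", "training"]


def missing_dirs(repo_root):
    """Return the expected subdirectories that are absent under repo_root."""
    root = Path(repo_root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    # The root path is an assumption; point it at your own checkout.
    print(missing_dirs("Unveiling-Coding-Regions-in-LLMs"))
```

An empty list means all four directories were found.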
Ensure that Python and the necessary libraries are installed. Required packages include transformers, datasets, and torch, among others; install any that are missing with pip.
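One quick way to see which of these packages are missing is to ask Python whether it can locate them without importing them. This is a small sketch under the assumption that the three packages named above are the ones to check; the repository's complete requirement list is not given here, and `missing_packages` is a hypothetical helper.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level package names that cannot be located for import."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # Packages named in the text; the full requirement list is not specified.
    missing = missing_packages(["transformers", "datasets", "torch"])
    if missing:
        print("Missing packages, install with: pip install " + " ".join(missing))
    else:
        print("All listed packages are importable.")
```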
Run the following steps in the order given; each step depends on the output of the one before it:
- Navigate to the data preprocessing directory:
cd Unveiling-Coding-Regions-in-LLMs/data_preprocess
- Download Dataset for Processing:
Use the following command to download desired datasets:
python create_code_dataset.py
- Preprocess Dataset:
Run the preprocessing script to tokenize and prepare the dataset for training. To process a different language, change the language argument:
Example for Go:
bash run_preprocess.sh tiny-codes go tokenizers/llama-3.2
Example for Java:
bash run_preprocess.sh tiny-codes java tokenizers/llama-3.2
- Navigate to the training directory:
cd Unveiling-Coding-Regions-in-LLMs/training/further_training
- Run the Training Script:
Use the preprocessed dataset to calculate importance scores, running the command once per language:
bash code_train_core-10000.sh "tiny-codes" "llama-3.2" "meta-llama/Llama-3.2-3B-Instruct" "go"
bash code_train_core-10000.sh "tiny-codes" "llama-3.2" "meta-llama/Llama-3.2-3B-Instruct" "java"
- Navigate to the region selection directory:
cd Unveiling-Coding-Regions-in-LLMs/region_selection
- Extract Core Linguistic Regions:
Execute these scripts to identify regions:
python extract_accumulated_core_linguistic_region.py
python extract_spot.py
- Navigate to the damage directory:
cd Unveiling-Coding-Regions-in-LLMs/damage
- Run the Damage Assessment Script:
Use the following command to evaluate model robustness:
python damage_model.py
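The steps above can be chained for one language with a small driver. The sketch below is not part of the repository: `build_steps` and `run_pipeline` are hypothetical helpers, the default argument values are copied from the example commands, and the meaning of each positional argument is inferred from those examples rather than documented. It defaults to a dry run that only prints the commands.

```python
import subprocess
from pathlib import Path


def build_steps(language, dataset="tiny-codes",
                tokenizer_dir="tokenizers/llama-3.2", model_tag="llama-3.2",
                model="meta-llama/Llama-3.2-3B-Instruct"):
    """Return (subdirectory, command) pairs in the order the steps are listed."""
    return [
        ("data_preprocess", ["python", "create_code_dataset.py"]),
        ("data_preprocess",
         ["bash", "run_preprocess.sh", dataset, language, tokenizer_dir]),
        ("training/further_training",
         ["bash", "code_train_core-10000.sh", dataset, model_tag, model, language]),
        ("region_selection",
         ["python", "extract_accumulated_core_linguistic_region.py"]),
        ("region_selection", ["python", "extract_spot.py"]),
        ("damage", ["python", "damage_model.py"]),
    ]


def run_pipeline(repo_root, language, dry_run=True):
    """Print each step and, unless dry_run, execute it in its subdirectory."""
    executed = []
    for subdir, command in build_steps(language):
        workdir = Path(repo_root) / subdir
        print(f"[{workdir}] {' '.join(command)}")
        if not dry_run:
            # check=True raises CalledProcessError, stopping at the first failure.
            subprocess.run(command, cwd=workdir, check=True)
        executed.append((subdir, command))
    return executed


if __name__ == "__main__":
    # Dry run: nothing is executed. The root path is an assumption.
    run_pipeline("Unveiling-Coding-Regions-in-LLMs", "go", dry_run=True)
```

Passing `dry_run=False` executes the commands for real; whether the download step needs to be repeated for each language is not stated here, so review the printed commands first.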