Dataset 1: RealToxicityPrompts [Dataset Link] | [Paper]
Notes:
- The entire codebase for `ROME` can be found in the directory `rome-main`. Cloned from Locating and Editing Factual Associations in GPT (NeurIPS '22).
- `rome-main/trace_main.py` is the main script to run a vanilla example of causal tracing on one of the datasets from the original paper.
- The directory `rome-main/dsets` contains the datasets they use. This is where we need to add the `RealToxicityPrompts` dataset and load it from for inference. It is present in the file `rome-main/dsets/realtoxicityprompts.py`.
For William:
- Refer to the `RealToxicityPrompts` paper to determine which model they used, and pick one of the GPT-2 variants to run `ROME` with the prompts from the dataset.
- Choose the "challenging" subset of prompts, i.e., where `dataset["challenging"] == true`.
- For our use case, the causal tracing part alone is enough for now; we don't need to worry about the editing part yet.
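The subset selection above can be sketched as a small filter — the record field names (`challenging`, `prompt.text`) are assumptions based on the public dataset release, and in practice the records would come from the loader in `rome-main/dsets` rather than an in-memory list:

```python
def challenging_prompts(records):
    """Return prompt texts for records where challenging == true.

    `records` is any iterable of dict-like rows; field names here are
    assumptions based on the RealToxicityPrompts release format.
    """
    return [r["prompt"]["text"] for r in records if r.get("challenging")]


# Hypothetical downstream use (names are placeholders, not rome-main APIs):
# for text in challenging_prompts(dataset):
#     run_causal_trace(gpt2_model, text)  # tracing only; no editing step
```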