- Evaluating Large Language Models: A Comprehensive Survey [Paper]
- Chain-of-Verification Reduces Hallucination in Large Language Models [Paper]
- DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models [Paper]
- LLM Lies: Hallucinations are not Bugs, but Features as Adversarial Examples [Paper]
- Large Language Models Can Be Good Privacy Protection Learners [Paper]
- ProPILE: Probing Privacy Leakage in Large Language Models [Paper]
- Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study [Paper]
- Jailbroken: How Does LLM Safety Training Fail? [Paper]
- MASTERKEY: Automated Jailbreaking of Large Language Model Chatbots [Paper]
- Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization [Paper]
- Defending ChatGPT against Jailbreak Attack via Self-Reminder [Paper] (a minimal sketch of the self-reminder idea appears after this list)
- Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM [Paper]
- GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts [Paper] [Code]
- Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models [Paper]
- Prompts Should not be Seen as Secrets: Systematically Measuring Prompt Extraction Attack Success [Paper]
- Multi-step Jailbreaking Privacy Attacks on ChatGPT [Paper]
- A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily [Paper]
- DeepInception: Hypnotize Large Language Model to Be Jailbreaker [Paper]
- Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation [Paper]
- Multilingual Jailbreak Challenges in Large Language Models [Paper]
- Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations [Paper]
- AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models [Paper]
- Open Sesame! Universal Black Box Jailbreaking of Large Language Models [Paper]
- SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks [Paper] (see the SmoothLLM sketch after this list)
- Tricking LLMs into Disobedience: Understanding, Analyzing, and Preventing Jailbreaks [Paper]
- Universal and Transferable Adversarial Attacks on Aligned Language Models [Paper] [Code]
- Jailbreak Chat
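
Two of the defenses above are simple enough to sketch. The self-reminder defense wraps the untrusted user prompt between safety reminders before it reaches the model. Below is a minimal Python sketch of that idea; `query_model`, the function names, and the exact reminder wording are illustrative assumptions, not the paper's verbatim prompts.

```python
# Minimal sketch of the self-reminder defense: sandwich the untrusted
# user prompt between safety reminders before querying the model.
# `query_model` is a hypothetical stand-in for whatever chat API you use.

REMINDER_PREFIX = (
    "You should be a responsible AI assistant and should not generate "
    "harmful or misleading content. Please answer the following query "
    "in a responsible way.\n"
)
REMINDER_SUFFIX = (
    "\nRemember, you should be a responsible AI assistant and should "
    "not generate harmful or misleading content."
)

def wrap_with_self_reminder(user_prompt: str) -> str:
    """Return the user prompt sandwiched between the two reminders."""
    return f"{REMINDER_PREFIX}{user_prompt}{REMINDER_SUFFIX}"

def defended_query(user_prompt: str, query_model) -> str:
    """Send the reminder-wrapped prompt to an arbitrary chat model."""
    return query_model(wrap_with_self_reminder(user_prompt))
```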
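
SmoothLLM exploits the brittleness of adversarial suffixes: it queries the model on several randomly perturbed copies of the prompt and returns a response consistent with the majority jailbroken/not-jailbroken vote. The sketch below, again with a hypothetical `query_model` callable, shows the character-swap perturbation variant (the paper also considers insert and patch perturbations); the refusal-keyword judge and parameter values are crude illustrative assumptions.

```python
import random
import string

def perturb(prompt: str, q: float = 0.1) -> str:
    """Randomly replace a fraction q of characters with printable ones
    (the swap scheme; one of several perturbations in the paper)."""
    chars = list(prompt)
    n_swaps = max(1, int(q * len(chars)))
    for i in random.sample(range(len(chars)), n_swaps):
        chars[i] = random.choice(string.printable)
    return "".join(chars)

REFUSALS = ("I'm sorry", "I cannot", "I can't", "As an AI")

def is_jailbroken(response: str) -> bool:
    """Crude keyword check: treat any non-refusal as jailbroken.
    Real evaluations use more careful judges."""
    return not any(r in response for r in REFUSALS)

def smoothllm(prompt: str, query_model, n_copies: int = 6, q: float = 0.1) -> str:
    """Query perturbed copies, take a majority vote on the jailbroken
    label, and return one response agreeing with the majority."""
    responses = [query_model(perturb(prompt, q)) for _ in range(n_copies)]
    votes = [is_jailbroken(r) for r in responses]
    majority = sum(votes) > len(votes) / 2
    consistent = [r for r, v in zip(responses, votes) if v == majority]
    return random.choice(consistent)
```

The majority vote is what gives the defense its guarantee in the paper: a suffix optimized for the exact prompt rarely survives independent random perturbations across most copies.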