LLM_Decrypter: Solving Ciphertext Reasoning via Tokenization Engineering

Python 3.10+ License: MIT

📌 Overview

LLM_Decrypter is a research-oriented project that explores the limitations of Large Language Models (LLMs) in performing symbolic reasoning tasks—specifically cryptanalysis.

Standard LLMs often struggle with decryption because their default tokenizers (like BPE) group multiple characters into single tokens, obscuring the character-level patterns essential for solving ciphers. This project implements a Character-Level Tokenization Workaround via strategic whitespace injection, allowing the model to perform granular pattern recognition and frequency analysis on encrypted text.
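The injection step described above is simple enough to sketch in a few lines; this is a minimal illustration of the idea, not necessarily the exact implementation in this repo:

```python
def inject_whitespace(text: str) -> str:
    """Insert a space between every character so a sub-word tokenizer
    emits (approximately) one token per character."""
    return " ".join(text)

print(inject_whitespace("SECRET"))  # S E C R E T
```

With the spaces in place, a BPE-style tokenizer can no longer merge "SEC" or "RET" into single tokens, so each character keeps its own position in the model's input sequence.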

🚀 Key Methodology: Character-Level Injection

The primary technical challenge addressed in this repo is the "Tokenization Gap." Most sub-word tokenizers merge characters into statistically frequent multi-character units, which destroys the per-character positional structure required for ciphers like Caesar or Vigenère.
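To see why positional, per-character logic matters, here is a standard Caesar shift (not code from this repo, just an illustration of the rule the model must recover): every letter moves by a fixed offset, so the mapping is only visible at character granularity.

```python
def caesar_shift(text: str, shift: int) -> str:
    """Shift each letter by `shift` positions, wrapping within the alphabet;
    non-letters pass through unchanged."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

cipher = caesar_shift("HELLO WORLD", 3)   # "KHOOR ZRUOG"
plain = caesar_shift(cipher, -3)          # recovers "HELLO WORLD"
```

A tokenizer that sees "KHOOR" as one opaque token has no access to the five independent character shifts inside it, which is exactly the gap the whitespace workaround closes.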

Our Approach:

  • Whitespace Augmentation: Dynamically injects spaces between every character in the prompt (e.g., SECRET becomes S E C R E T).
  • Atomic Reasoning: By forcing the LLM to treat each character as an individual token, we bypass the sub-word compression that typically leads to reasoning errors in cryptographic tasks.
  • Pattern Preservation: This methodology ensures that the model's self-attention mechanism can map one-to-one relationships between ciphertext and plaintext characters without being biased by common word fragments.
  • Evaluation: We compare LLM-generated plaintexts and predicted encryption methods against the ground truth to measure accuracy. Plaintexts are scored with Levenshtein distance, since requiring an exact match is too strict for LLM outputs.
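The Levenshtein-based scoring in the last bullet can be sketched as follows; this is a generic edit-distance implementation with a normalized similarity score, under the assumption that the repo normalizes by string length (the exact normalization used here may differ):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(pred: str, truth: str) -> float:
    """Normalized score in [0, 1]; 1.0 means an exact match."""
    if not pred and not truth:
        return 1.0
    return 1 - levenshtein(pred, truth) / max(len(pred), len(truth))
```

A partially correct decryption (e.g. one wrong letter in a 20-character plaintext) still earns most of the credit, which gives a much smoother signal than exact-match accuracy.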

🏗️ Architecture

The project is structured as a modular evaluation pipeline:

  1. Dataset: The Hugging Face encryption dataset (https://huggingface.co/datasets/Sakonii/EncryptionDataset), which contains ciphertexts, plaintexts, and labels for 9 common encryption methods.
  2. Preprocessing Layer: A transformation module that implements whitespace injection and custom prompt framing.
  3. Inference Pipeline: An extensible interface for testing SOTA models (via HuggingFace) to compare decryption and classification accuracy with and without the tokenization workaround.
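Putting the preprocessing layer together, a prompt-framing step might look like the sketch below. The record's field names (`ciphertext`, `plaintext`, `method`) are hypothetical; consult the dataset card for the actual schema:

```python
# Hypothetical record shape; the real field names in the HF dataset may differ.
record = {"ciphertext": "KHOOR", "plaintext": "HELLO", "method": "caesar"}

def build_prompt(ciphertext: str, inject: bool = True) -> str:
    """Frame a decryption prompt, optionally applying whitespace injection."""
    text = " ".join(ciphertext) if inject else ciphertext
    return (
        "The following text is encrypted. Identify the cipher and "
        f"recover the plaintext:\n\n{text}\n\nAnswer:"
    )

prompt = build_prompt(record["ciphertext"])
```

Running the same prompt with `inject=False` gives the baseline condition, so the two settings differ only in tokenization, isolating the effect the project measures.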

🛠️ Tech Stack

  • Core: Python 3.10+
  • LLM Integration: OpenAI API, HuggingFace Transformers
  • Evaluation: Custom string-matching scripts for character-level accuracy metrics.
  • Data Handling: Huggingface datasets for bulk evaluation and result logging across various cipher complexities.

📋 Usage

Setup

git clone https://github.com/Nachiket-ML/LLM_Decrypter.git
cd LLM_Decrypter
