LLM_Decrypter is a research-oriented project that explores the limitations of Large Language Models (LLMs) in performing symbolic reasoning tasks—specifically cryptanalysis.
Standard LLMs often struggle with decryption because their default tokenizers (like BPE) group multiple characters into single tokens, obscuring the character-level patterns essential for solving ciphers. This project implements a Character-Level Tokenization Workaround via strategic whitespace injection, allowing the model to perform granular pattern recognition and frequency analysis on encrypted text.
The primary technical challenge addressed in this repo is the "Tokenization Gap." Most sub-word tokenizers merge frequent character sequences into single tokens to compress the input, which destroys the character-level positional structure required for ciphers like Caesar or Vigenère.
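To see the gap concretely, here is a quick check with a representative BPE tokenizer (this snippet assumes the GPT-2 tokenizer loaded via HuggingFace Transformers; it is an illustration, not part of the pipeline):

```python
from transformers import AutoTokenizer

# GPT-2's BPE tokenizer, used here purely as a representative sub-word tokenizer.
tok = AutoTokenizer.from_pretrained("gpt2")

# The raw word is merged into a handful of multi-character sub-word tokens,
# while the space-injected version yields roughly one token per character.
print(tok.tokenize("SECRET"))
print(tok.tokenize("S E C R E T"))
```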
- Whitespace Augmentation: Dynamically injects spaces between every character in the prompt (e.g., `SECRET` becomes `S E C R E T`); a minimal implementation is sketched after this list.
- Atomic Reasoning: By forcing the LLM to treat each character as an individual token, we bypass the sub-word compression that typically leads to reasoning errors in cryptographic tasks.
- Pattern Preservation: This methodology ensures that the model's self-attention mechanism can map one-to-one relationships between ciphertext and plaintext characters without being biased by common word fragments.
- Evaluation: We compare LLM-generated plaintexts and predicted encryption methods against the ground truth. Since exact-match decryption is rarely achievable for LLMs, plaintext accuracy is scored with Levenshtein distance, which gives credit for partially correct decryptions.
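A minimal sketch of those two building blocks, whitespace injection and Levenshtein scoring (the function names here are illustrative, not the repo's actual API):

```python
def inject_whitespace(text: str) -> str:
    """Insert a space between every character so each maps to its own token."""
    return " ".join(text)

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # deletion
                curr[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),   # substitution
            ))
        prev = curr
    return prev[-1]

print(inject_whitespace("SECRET"))                   # -> "S E C R E T"
print(levenshtein("ATTACK AT DAWN", "ATTACK AT DUSK"))  # -> 3
```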
The project is structured as a modular evaluation pipeline:
- Datasets: A HuggingFace encryption dataset (https://huggingface.co/datasets/Sakonii/EncryptionDataset) containing ciphertext/plaintext pairs labeled with 9 common encryption methods.
- Preprocessing Layer: A transformation module that implements whitespace injection and custom prompt framing.
- Inference Pipeline: An extensible interface for testing SOTA models (via HuggingFace or the OpenAI API) to compare decryption and classification accuracy with and without the tokenization workaround; see the sketch after this list.
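Putting the pieces together, an end-to-end pass over one example might look like the following sketch. The dataset column names (`ciphertext`), the prompt wording, and the model identifier are assumptions; check the dataset card and repo for the actual schema.

```python
from datasets import load_dataset
from openai import OpenAI

# Column names are assumed here; consult the dataset card for the real schema.
ds = load_dataset("Sakonii/EncryptionDataset", split="train")

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def decrypt(ciphertext: str, inject: bool = True) -> str:
    """Query a chat model, optionally applying the whitespace workaround."""
    text = " ".join(ciphertext) if inject else ciphertext
    prompt = (
        "The following text is encrypted with a classical cipher. "
        f"Identify the cipher and recover the plaintext:\n{text}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any chat-capable model works
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

sample = ds[0]
print(decrypt(sample["ciphertext"], inject=True))   # with the workaround
print(decrypt(sample["ciphertext"], inject=False))  # baseline
```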
- Core: Python 3.10+
- LLM Integration: OpenAI API, HuggingFace Transformers
- Evaluation: Custom string-matching scripts for character-level accuracy metrics.
- Data Handling: Huggingface datasets for bulk evaluation and result logging across various cipher complexities.
To get started, clone the repository:

```bash
git clone https://github.com/Nachiket-ML/LLM_Decrypter.git
cd LLM_Decrypter
```
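Then install the dependencies (assuming the repo follows the common convention of shipping a requirements.txt):

```bash
pip install -r requirements.txt
```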