LLM_Decrypter is a research-oriented project that explores the limitations of Large Language Models (LLMs) in performing symbolic reasoning tasks—specifically cryptanalysis.
Standard LLMs often struggle with decryption because their default tokenizers (like BPE) group multiple characters into single tokens, obscuring the character-level patterns essential for solving ciphers. This project implements a Character-Level Tokenization Workaround via strategic whitespace injection, allowing the model to perform granular pattern recognition and frequency analysis on encrypted text.
The primary technical challenge addressed in this repo is the "Tokenization Gap." Most sub-word tokenizers merge frequent character sequences into single tokens to compress the input, which destroys the character-level positional structure required for ciphers like Caesar or Vigenère.
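To see the gap concretely, here is a quick check with a representative BPE tokenizer (this snippet assumes the GPT-2 tokenizer loaded via HuggingFace Transformers; it is an illustration, not part of the pipeline):

```python
from transformers import AutoTokenizer

# GPT-2's BPE tokenizer, used here purely as a representative sub-word tokenizer.
tok = AutoTokenizer.from_pretrained("gpt2")

# The raw word is merged into a handful of multi-character sub-word tokens,
# while the space-injected version yields roughly one token per character.
print(tok.tokenize("SECRET"))
print(tok.tokenize("S E C R E T"))
```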
- Whitespace Augmentation: Dynamically injects spaces between every character in the prompt (e.g., `SECRET` becomes `S E C R E T`); a minimal implementation is sketched after this list.
- Atomic Reasoning: By forcing the LLM to treat each character as an individual token, we bypass the sub-word compression that typically leads to reasoning errors in cryptographic tasks.
- Pattern Preservation: This methodology ensures that the model's self-attention mechanism can map one-to-one relationships between ciphertext and plaintext characters without being biased by common word fragments.
- Evaluation: We compare LLM-generated plaintexts and predicted encryption methods against the ground truth. Since exact-match decryption is rarely achievable for LLMs, plaintext accuracy is scored with Levenshtein distance, which gives credit for partially correct decryptions.
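A minimal sketch of those two building blocks, whitespace injection and Levenshtein scoring (the function names here are illustrative, not the repo's actual API):

```python
def inject_whitespace(text: str) -> str:
    """Insert a space between every character so each maps to its own token."""
    return " ".join(text)

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # deletion
                curr[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),   # substitution
            ))
        prev = curr
    return prev[-1]

print(inject_whitespace("SECRET"))                   # -> "S E C R E T"
print(levenshtein("ATTACK AT DAWN", "ATTACK AT DUSK"))  # -> 3
```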
The project is structured as a modular evaluation pipeline:
- Datasets: A HuggingFace encryption dataset (https://huggingface.co/datasets/Sakonii/EncryptionDataset) containing ciphertext/plaintext pairs labeled with 9 common encryption methods.
- Preprocessing Layer: A transformation module that implements whitespace injection and custom prompt framing.
- Inference Pipeline: An extensible interface for testing SOTA models (via HuggingFace or the OpenAI API) to compare decryption and classification accuracy with and without the tokenization workaround; see the sketch after this list.
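Putting the pieces together, an end-to-end pass over one example might look like the following sketch. The dataset column names (`ciphertext`), the prompt wording, and the model identifier are assumptions; check the dataset card and repo for the actual schema.

```python
from datasets import load_dataset
from openai import OpenAI

# Column names are assumed here; consult the dataset card for the real schema.
ds = load_dataset("Sakonii/EncryptionDataset", split="train")

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def decrypt(ciphertext: str, inject: bool = True) -> str:
    """Query a chat model, optionally applying the whitespace workaround."""
    text = " ".join(ciphertext) if inject else ciphertext
    prompt = (
        "The following text is encrypted with a classical cipher. "
        f"Identify the cipher and recover the plaintext:\n{text}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any chat-capable model works
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

sample = ds[0]
print(decrypt(sample["ciphertext"], inject=True))   # with the workaround
print(decrypt(sample["ciphertext"], inject=False))  # baseline
```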
- Core: Python 3.10+
- LLM Integration: OpenAI API, HuggingFace Transformers
- Evaluation: Custom string-matching scripts for character-level accuracy metrics.
- Data Handling: Huggingface datasets for bulk evaluation and result logging across various cipher complexities.
To get started, clone the repository:

```bash
git clone https://github.com/Nachiket-ML/LLM_Decrypter.git
cd LLM_Decrypter
```
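Then install the dependencies (assuming the repo follows the common convention of shipping a requirements.txt):

```bash
pip install -r requirements.txt
```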