The Strawberry Problem 🍓
Emergence of Character-level Understanding in Tokenized Language Models

Accepted in the Main Track (Oral Presentation - top 15% accepted papers)
The 2025 Conference on Empirical Methods in Natural Language Processing
EMNLP 2025

Adrian Cosma, Stefan Ruseti, Emilian Radoi, Mihai Dascalu

📜 Paper PDF | 📘 Abstract | ⚒️ Usage | 📖 Citation | 📝 License

📘 Abstract

Despite their remarkable progress across diverse domains, Large Language Models (LLMs) consistently fail at simple character-level tasks, such as counting letters in words, due to a fundamental limitation: tokenization. In this work, we frame this limitation as a problem of low mutual information and analyze it in terms of concept emergence. Using a suite of 19 synthetic tasks that isolate character-level reasoning in a controlled setting, we show that such capabilities emerge slowly, suddenly, and only late in training. We further show that percolation-based models of concept emergence explain these patterns, suggesting that learning character composition is not fundamentally different from learning commonsense knowledge. To address this bottleneck, we propose a lightweight architectural modification that significantly improves character-level reasoning while preserving the inductive advantages of subword models. Together, our results bridge low-level perceptual gaps in tokenized LMs and provide a principled framework for understanding and mitigating their structural blind spots. We make our code publicly available.
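As a rough illustration of the tokenization limitation described above, the toy sketch below contrasts what a subword model receives (a short sequence of integer IDs) with the character-level question being asked. It is not code from this repository: the three-piece vocabulary, the greedy segmentation, and the names `TOY_VOCAB`, `toy_tokenize`, and `count_letter` are all hypothetical and chosen only for the example.

```python
# Toy illustration only. The vocabulary and segmentation are hypothetical
# and are NOT the tokenizer used in this repository.
TOY_VOCAB = {"str": 0, "aw": 1, "berry": 2}


def toy_tokenize(word, vocab):
    """Greedy longest-match segmentation of `word` into pieces from `vocab`."""
    ids = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first, then shrink it.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize {word[i:]!r}")
    return ids


def count_letter(word, letter):
    """Character-level ground truth, trivial once the characters are visible."""
    return word.count(letter)


# The model is given only these IDs; the spelling of each piece is not
# part of its input and has to be learned indirectly during training.
print(toy_tokenize("strawberry", TOY_VOCAB))  # [0, 1, 2]
print(count_letter("strawberry", "r"))        # 3
```

Nothing in the ID sequence `[0, 1, 2]` states that piece `0` contains one "r" and piece `2` contains two, which is the kind of information a tokenized model must recover on its own.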

⚒️ Usage

Change into the experiments/ directory (cd experiments/) and run the following steps in order:

Step 1 - Generate vocabularies

    bash generate_datasets.sh

Step 2 - Train the models

    bash train.sh
    bash wiki_train.sh

Step 3 - Perform ablation studies

    bash ablation.sh
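The steps above can also be chained from a small driver. The sketch below is a convenience wrapper, not part of the repository: it assumes it is launched from the repository root, that the four scripts live in `experiments/`, and that they should run in the order listed. The names `PIPELINE`, `build_commands`, and `run_pipeline` are hypothetical.

```python
import subprocess
from pathlib import Path

# Script names are taken from the steps above, in the order they are listed.
PIPELINE = [
    "generate_datasets.sh",  # Step 1
    "train.sh",              # Step 2
    "wiki_train.sh",         # Step 2
    "ablation.sh",           # Step 3
]


def build_commands(scripts=PIPELINE):
    """Return one `bash <script>` argument list per script."""
    return [["bash", name] for name in scripts]


def run_pipeline(experiments_dir="experiments", dry_run=True):
    """Print each command; execute it inside `experiments_dir` unless dry_run."""
    for cmd in build_commands():
        print("$", " ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing script.
            subprocess.run(cmd, cwd=Path(experiments_dir), check=True)


# Dry run by default: this only prints the commands that would be executed.
run_pipeline(dry_run=True)
```

Calling `run_pipeline(dry_run=False)` would execute the scripts for real; training is likely to be long-running, so the dry run is the default here.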

📖 Citation

If you found our work useful, please cite our paper:

@inproceedings{cosma-etal-2025-strawberry,
    title = "The Strawberry Problem: Emergence of Character-level Understanding in Tokenized Language Models",
    author = "Cosma, Adrian  and
      Ruseti, Stefan  and
      Radoi, Emilian  and
      Dascalu, Mihai",
    editor = "Christodoulopoulos, Christos  and
      Chakraborty, Tanmoy  and
      Rose, Carolyn  and
      Peng, Violet",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-main.1434/",
    doi = "10.18653/v1/2025.emnlp-main.1434",
    pages = "28240--28251",
    ISBN = "979-8-89176-332-6",
    abstract = "Despite their remarkable progress across diverse domains, Large Language Models (LLMs) consistently fail at simple character-level tasks, such as counting letters in words, due to a fundamental limitation: tokenization. In this work, we frame this limitation as a problem of low mutual information and analyze it in terms of concept emergence. Using a suite of 19 synthetic tasks that isolate character-level reasoning in a controlled setting, we show that such capabilities emerge suddenly and only late in training. We find that percolation-based models of concept emergence explain these patterns, suggesting that learning character composition is not fundamentally different from learning commonsense knowledge. To address this bottleneck, we propose a lightweight architectural modification that significantly improves character-level reasoning while preserving the inductive advantages of subword models. Together, our results bridge low-level perceptual gaps in tokenized LMs and provide a principled framework for understanding and mitigating their structural blind spots. We make our code publicly available."
}

📝 License

This work is licensed under the Creative Commons Attribution-NonCommercial 4.0 International license (CC BY-NC 4.0).
