Words Segmentation

This repository contains a pretokenizer that segments text into "words" for further processing.

We define three classes of tokens:

  1. C0 control characters (always atomic)
  2. "Words": runs of non-space, non-control characters, plus an optional single trailing whitespace character
  3. Whitespace runs
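This default segmentation can be approximated with a single regular expression. The sketch below is illustrative only, not the library's actual implementation (the name PRETOKEN is hypothetical):

import re

# One alternative per token class, tried left to right at each position:
PRETOKEN = re.compile(
    r"[\x00-\x1f]"                      # 1. a single C0 control character
    r"|[^\s\x00-\x1f]+[^\S\x00-\x1f]?"  # 2. a word + optional trailing (non-control) whitespace
    r"|[^\S\x00-\x1f]+"                 # 3. a run of non-control whitespace
)

print(PRETOKEN.findall("hello world!\n"))
# ['hello ', 'world!', '\n']  -- the newline stays atomic because it is a C0 control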

For any script where the default segmentation is not suitable, you can implement a custom pretokenizer: modify LANGUAGE_SPECS in languages.py to register a custom segmentation function for specific scripts.

For example:

from typing import Dict

# LanguageSpec, segment_chinese, and segment_japanese are defined in languages.py.
LANGUAGE_SPECS: Dict[str, LanguageSpec] = {
    "Chinese": {
        "scripts": ("Han",),  # runs of Han text are routed to the Chinese segmenter
        "callback": segment_chinese,
    },
    "Japanese": {
        "scripts": ("Han", "Hiragana", "Katakana"),  # Japanese mixes all three scripts
        "callback": segment_japanese,
    },
}
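For illustration, a callback for a script like Chinese might wrap an off-the-shelf word segmenter. The sketch below assumes a plain text-in, words-out signature and the third-party jieba package; it is not the repo's actual segment_chinese:

import jieba  # third-party Chinese word segmenter (an assumption, not a dependency of this repo)

def segment_chinese(text: str) -> list[str]:
    # Hypothetical callback: split a run of Han text into words.
    return list(jieba.cut(text))

print(segment_chinese("我爱北京天安门"))
# e.g. ['我', '爱', '北京', '天安门']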

Finally, the max_bytes parameter splits long words into smaller chunks while preserving Unicode grapheme boundaries.
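A minimal sketch of such byte-capped chunking, assuming the third-party regex package for its \X (extended grapheme cluster) pattern; this is not the library's actual implementation:

import regex  # third-party; supports \X for extended grapheme clusters

def chunk_by_bytes(word: str, max_bytes: int) -> list[str]:
    # Greedily pack grapheme clusters into chunks of at most max_bytes
    # UTF-8 bytes, never splitting inside a cluster.
    chunks: list[str] = []
    current, current_bytes = [], 0
    for grapheme in regex.findall(r"\X", word):
        size = len(grapheme.encode("utf-8"))
        if current and current_bytes + size > max_bytes:
            chunks.append("".join(current))
            current, current_bytes = [], 0
        current.append(grapheme)
        current_bytes += size
    if current:
        chunks.append("".join(current))
    return chunks

Note that a single grapheme can exceed max_bytes on its own (the family emoji 👩‍👩‍👧‍👦 is 25 UTF-8 bytes); the sketch keeps such graphemes whole rather than splitting them.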

Usage

Install:

pip install words-segmentation

Pretokenize text using the Hugging Face-style tokenizer implementation:

from words_segmentation.tokenizer import WordsSegmentationTokenizer

pretokenizer = WordsSegmentationTokenizer(max_bytes=16)
tokens = pretokenizer.tokenize("hello world! 我爱北京天安门 👩‍👩‍👧‍👦")
# ['hello ', 'world! ', '我', '爱', '北京', '天安门', ' ', '👩‍👩‍👧‍👦']

Perhaps there will come a day when we have a universal pretokenizer that works for all languages. Until then, some writing systems need custom logic, so we implement custom fallback pretokenizers for them (such as the Chinese and Japanese segmenters registered in LANGUAGE_SPECS above).

Tokenization Parity

Foroutan and Meister et al. (2025) note that:

In multilingual models, the same meaning can take far more tokens in some languages, penalizing users of underrepresented languages with worse performance and higher API costs.


Let's compare the same sentence translated across languages, to measure whitespace pre-tokenization parity:

| Language | Text (Google Translate) | Bytes (UTF-8) | Tokens (GPT-4) | Words (Whitespace+) |
|----------|-------------------------|--------------:|---------------:|--------------------:|
| English | Tours are cheaper for larger groups, so if you're by yourself or with just one friend, try to meet other people and form a group of four to six for a better per-person rate. | 173 | 40 | 34 |
| Italian | I tour sono più economici per i gruppi più numerosi, quindi se sei da solo o con un solo amico, prova a incontrare altre persone e a formare un gruppo da quattro a sei persone per ottenere una tariffa più conveniente a persona. | 230 | 58 | 43 |
| German | Touren sind für größere Gruppen günstiger. Wenn Sie also alleine oder mit nur einem Freund unterwegs sind, versuchen Sie, andere Leute kennenzulernen und eine Gruppe von vier bis sechs Personen zu bilden, um einen besseren Preis pro Person zu erhalten. | 256 | 64 | 40 |
| Chinese | 团体旅游价格更便宜,所以如果您独自一人或只有一个朋友,请尝试结识其他人并组成一个四到六人的团体,以获得更好的每人价格。 | 177 | 64 | 34 |
| Japanese | ツアーはグループが多ければ安くなるので、一人または友達とだけ参加する場合は、他の人と会って4人から6人のグループを作ると、一人当たりの料金が安くなります。 | 227 | 74 | 48 |
| Finnish | Retket ovat halvempia suuremmille ryhmille, joten jos olet yksin tai vain yhden ystävän kanssa, yritä tavata muita ihmisiä ja muodosta neljän tai kuuden hengen ryhmä saadaksesi paremman hinnan per henkilö. | 212 | 79 | 30 |
| Russian | Туры обходятся дешевле для больших групп, поэтому, если вы одни или с одним другом, постарайтесь познакомиться с другими людьми и сформировать группу из четырех-шести человек, чтобы получить более выгодную цену на человека. | 409 | 100 | 32 |
| Arabic | تكون الجولات أرخص بالنسبة للمجموعات الكبيرة، لذلك إذا كنت بمفردك أو مع صديق واحد فقط، فحاول مقابلة أشخاص آخرين وتشكيل مجموعة مكونة من أربعة إلى ستة أشخاص للحصول على سعر أفضل للشخص الواحد. | 341 | 140 | 33 |
| Hebrew | סיורים זולים יותר לקבוצות גדולות יותר, כך שאם אתם לבד או עם חבר אחד בלבד, נסו לפגוש אנשים אחרים וליצור קבוצה של ארבעה עד שישה אנשים לקבלת מחיר טוב יותר לאדם. | 281 | 151 | 31 |
| Greek | Οι εκδρομές είναι φθηνότερες για μεγαλύτερες ομάδες, οπότε αν είστε μόνοι σας ή με έναν μόνο φίλο, προσπαθήστε να γνωρίσετε άλλα άτομα και να σχηματίσετε μια ομάδα τεσσάρων έως έξι ατόμων για καλύτερη τιμή ανά άτομο. | 394 | 193 | 36 |
| Tamil | பெரிய குழுக்களுக்கு சுற்றுலாக்கள் மலிவானவை, எனவே நீங்கள் தனியாகவோ அல்லது ஒரு நண்பருடனோ இருந்தால், மற்றவர்களைச் சந்தித்து நான்கு முதல் ஆறு பேர் கொண்ட குழுவை உருவாக்கி, ஒரு நபருக்கு சிறந்த விலையைப் பெற முயற்சிக்கவும். | 587 | 293 | 26 |
| Kannada | ದೊಡ್ಡ ಗುಂಪುಗಳಿಗೆ ಪ್ರವಾಸಗಳು ಅಗ್ಗವಾಗಿರುತ್ತವೆ, ಆದ್ದರಿಂದ ನೀವು ಒಬ್ಬಂಟಿಯಾಗಿ ಅಥವಾ ಒಬ್ಬ ಸ್ನೇಹಿತನೊಂದಿಗೆ ಇದ್ದರೆ, ಇತರ ಜನರನ್ನು ಭೇಟಿ ಮಾಡಲು ಪ್ರಯತ್ನಿಸಿ ಮತ್ತು ಪ್ರತಿ ವ್ಯಕ್ತಿಗೆ ಉತ್ತಮ ದರಕ್ಕಾಗಿ ನಾಲ್ಕರಿಂದ ಆರು ಜನರ ಗುಂಪನ್ನು ರಚಿಸಿ. | 565 | 361 | 26 |
| Shan | ၶၢဝ်းတၢင်း တႃႇၸုမ်းယႂ်ႇၼၼ်ႉ ၵႃႈၶၼ်မၼ်း ထုၵ်ႇလိူဝ်လႄႈ သင်ဝႃႈ ၸဝ်ႈၵဝ်ႇ ယူႇႁင်းၵူၺ်း ဢမ်ႇၼၼ် မီးဢူၺ်းၵေႃႉ ၵေႃႉလဵဝ်ၵွႆးၼႆၸိုင် ၶတ်းၸႂ် ႁူပ်ႉထူပ်း ၵူၼ်းတၢင်ႇၵေႃႉသေ ႁဵတ်းၸုမ်း 4 ၵေႃႉ တေႃႇထိုင် 6 ၵေႃႉ ႁႂ်ႈလႆႈ ၵႃႈၶၼ် ၼိုင်ႈၵေႃႉ ဢၼ်လီလိူဝ်ၼၼ်ႉယဝ်ႉ။ | 669 | 531 | 23 |
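The three numeric columns can be reproduced along the following lines. This sketch assumes the third-party tiktoken package for the GPT-4 token counts (not a dependency of this repo); the word counts come from this library's pretokenizer:

from words_segmentation.tokenizer import WordsSegmentationTokenizer
import tiktoken  # assumption: used here to approximate the GPT-4 token counts

enc = tiktoken.encoding_for_model("gpt-4")
# A large max_bytes so that long words are not split and the count stays a word count.
pretokenizer = WordsSegmentationTokenizer(max_bytes=1024)

def parity_stats(text: str) -> dict:
    return {
        "bytes_utf8": len(text.encode("utf-8")),
        "tokens_gpt4": len(enc.encode(text)),
        "words": len(pretokenizer.tokenize(text)),
    }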

Byte Efficiency

English really is the most efficient language in terms of byte count, which is not surprising given its plain Latin alphabet, without diacritics or ligatures (1 byte per character in UTF-8). Other languages that use the Latin alphabet are also relatively efficient (e.g. Italian, German, Finnish), but their use of diacritics and ligatures increases the byte count.

Languages that use non-Latin scripts (e.g. Arabic, Hebrew, Shan) have a much higher byte count, due to the need for multiple bytes per character in UTF-8 encoding. Hebrew and Arabic use two bytes per character, while Shan uses three bytes per character, not counting ligatures.
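These per-character widths are easy to verify; a quick check on sample characters (chosen here for illustration) from the script families discussed above:

for ch in ("a", "ä", "א", "ت", "ၶ"):
    print(ch, len(ch.encode("utf-8")), "bytes")
# a 1 bytes   -- plain Latin
# ä 2 bytes   -- Latin with a diacritic
# א 2 bytes   -- Hebrew
# ت 2 bytes   -- Arabic
# ၶ 3 bytes   -- Shan (Myanmar block)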

Tokenization Efficiency (GPT-4)

English is also the most efficient language in terms of token count, which is not surprising given that the tokenizer was trained primarily on English text. Other languages that use the Latin alphabet are also relatively efficient, but the moment we move to non-Latin scripts, the token count increases significantly (up to 13x for Shan).

Words Efficiency

Assuming whitespace tokenization as a proxy for words, English is not the most efficient language. This makes sense from a language efficiency perspective: word counts carry no computational bias towards English. Languages range between 23 and 43 words for the same sentence, with English right in the middle at 34.


Cite

If you use this code in your research, please consider citing the work:

@misc{moryossef2025words,
  title={Words Segmentation: A Word Level Pre-tokenizer for Languages of the World},
  author={Moryossef, Amit},
  howpublished={\url{https://github.com/sign/words-segmentation}},
  year={2025}
}
