Cicero Tokenizer is a tokenizer created for my custom LLM. Its design is loosely based on Byte Pair Encoding (BPE), and it is optimized for dictionary creation and tokenization speed. The pre-computed dictionaries are built from the Amazon Reviews dataset.
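
For readers unfamiliar with BPE, here is a minimal sketch of the textbook algorithm in Python. It is not Cicero Tokenizer's actual implementation, which is only loosely based on BPE. All function names (`learn_merges`, `tokenize`, and the helpers) are made up for this illustration, and the sketch makes no attempt at the speed optimizations mentioned above.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, or None if there is none."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    # On ties, max() keeps the pair that was seen first.
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_merges(corpus, num_merges):
    """Build the dictionary: an ordered list of merges learned from the corpus."""
    # Start from single characters; words are split on whitespace.
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges


def tokenize(word, merges):
    """Apply the learned merges to a single word, in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = next(iter(merge_pair({symbols: 1}, pair)))
    return list(symbols)


merges = learn_merges("low low low lower lowest", num_merges=3)
tokens = tokenize("lower", merges)
```

In this sketch, dictionary creation is the `learn_merges` step and tokenization is the `tokenize` step, the two stages the description above says Cicero Tokenizer is optimized for.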
Cicero Tokenizer is licensed under the Apache License 2.0 with the Commons Clause.
Commercial use is not permitted, as stated in NOTICE. If you want to use Cicero Tokenizer commercially, please contact me.