Brian Lovin / Hacker News

joelburget

A few interesting findings:

* the cl100k_base tokenizer has ~100k tokens -- previous tokenizers had ~50k. (enc.n_vocab gives 100277, but some token IDs in that range, starting at 100256, don't decode)

* it has exactly 1110 tokens which are just digits: 10 one-digit tokens, 100 two-digit tokens, and 1000 three-digit tokens (none with preceding spaces). This is a huge improvement over GPT-2's tokenizer, which was a mess.

* there are <|fim_prefix|>, <|fim_middle|>, and <|fim_suffix|> tokens (see Efficient Training of Language Models to Fill in the Middle)
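As a sketch of how those sentinels are used (the token names come from the repo; the "PSM" prompt ordering is from the fill-in-the-middle paper referenced above, and whether any OpenAI endpoint accepts this exact format is an assumption):

```python
# "Prefix-suffix-middle" (PSM) layout from the fill-in-the-middle paper:
# the document is re-ordered so the model generates the missing middle
# after seeing both the prefix and the suffix.
FIM_PREFIX = "<|fim_prefix|>"
FIM_MIDDLE = "<|fim_middle|>"
FIM_SUFFIX = "<|fim_suffix|>"

def psm_prompt(prefix: str, suffix: str) -> str:
    # At inference time the prompt ends with the middle sentinel, and the
    # model is asked to fill in the span between prefix and suffix.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"

prompt = psm_prompt("def add(a, b):\n", "    return result\n")
```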

The biggest news to me is the improved handling of numbers, which could explain some of the improved performance on arithmetic. One disappointment is that it tokenizes numbers from the front, e.g. "1000000" -> 100|000|0, rather than grouping by place value. This is one of those "so close!" moments -- I would work for free to fix this.
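Both observations -- the 1110 digit tokens and the front-first grouping -- can be sketched in plain Python (this mimics the behavior described above; it is not tiktoken's actual merge logic):

```python
from itertools import product

# 10 one-digit + 100 two-digit + 1000 three-digit strings = 1110 tokens
digit_tokens = ["".join(p) for k in (1, 2, 3)
                for p in product("0123456789", repeat=k)]

def chunk_from_front(s: str) -> list[str]:
    # Greedy left-to-right split into at-most-3-digit pieces,
    # matching the "1000000" -> 100|000|0 behavior described above.
    return [s[i:i + 3] for i in range(0, len(s), 3)]

def chunk_from_back(s: str) -> list[str]:
    # Right-aligned grouping, which would line up with place value.
    chunks, i = [], len(s)
    while i > 0:
        chunks.append(s[max(0, i - 3):i])
        i -= 3
    return chunks[::-1]

chunk_from_front("1000000")  # ['100', '000', '0']
chunk_from_back("1000000")   # ['1', '000', '000']
```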

forgingahead

I know OpenAI has been getting a lot of flak about their seemingly extreme measures of "safety" (and I agree to an extent, although it's more nuanced from my perspective), but full kudos to them for open sourcing many useful projects that can serve as building blocks for many others. From CLIP, to Whisper, and now this project, I do appreciate that effort from their team. So thanks, if you're reading this!

mellosouls

The main complaints against OpenAI have been its lack of openness and betrayal of its claimed founding principles.

In that context, "open sourcing many useful projects" seems like the bare minimum it should be doing, because that was the promise still enshrined in its name - this is not a bog-standard commercial organisation where that would not be expected.

To be clear, OpenAI does deserve massive kudos for its achievements, but openness is not among them.

krageon

It's not for safety, it is to make money. The "open" storyline is to get relatively cheap goodwill off things that market their paid products.

est

Requires az:// blob download.

I hope pypi libraries can provide complete standalone offline versions instead of requests+urllib3+some_object_storage shenanigans.

If these blobs are too large to host on pypi, maybe give us an alternative way to download everything at once, so we can deploy the full lib to a server without network access?

capableweb

Lots of ML/AI stuff wants you to sign some sort of license before you can use it, and being able to deploy offline kind of defeats that. There are of course ways around it (download once, store it yourself, put it in the right directory), but the maintainers are unlikely to help you work around that.

yencabulator

> Lots of ML/AI stuff wants to your sign some sort of license

What's the point of this MIT license, then: https://github.com/openai/tiktoken/blob/main/LICENSE

capableweb

Maybe the license is for the code, not the model itself?

seemethere

This is most likely because pypi has size restrictions on what you can upload, and most users won't go the extra mile to actually pip install from your bespoke download site.

phneutral26

Maybe the name is a bit misleading at first sight... But the project is great!

intelVISA

Fitting for a company named OpenAI with closed models.

Workaccount2

They investigated themselves and found that their products are too powerful for the public, so in the best interest of everyone they had to make the incredibly difficult, but right thing to do, choice of....

....going closed source and monetizing.

ianai

I genuinely hope they consider a rename. I wouldn’t want to use software with such a sleazy connection. It could have been “OAITokenizer”.

echelon

This is so great.

1. Name

2. OpenAI is releasing useful stuff

3. Rust in AI!

Vetch

NB: Hugging Face has long had a Rust-based tokenizer.

ipsum2

Looks like they have 4 different tokenizers. Besides gpt2, does anyone know which models correspond to which tokenizers? One of them is probably Codex.

    "gpt2": gpt2,
    "r50k_base": r50k_base,
    "p50k_base": p50k_base,
    "cl100k_base": cl100k_base,

minimaxir

cl100k_base is a new tokenizer that is apparently being used in their new Embeddings project.

Omie6541

can we please have some good example inputs/outputs in the readme itself? what is the expected output of print(enc.encode("hello world"))?

mnks

Have a look at https://beta.openai.com/tokenizer which uses a JavaScript reimplementation of the GPT-2 / GPT-3 BPE tokenizer. In this case it's [31373, 995].
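For intuition about where numbers like [31373, 995] come from, here is a toy greedy BPE encoder in plain Python. The merge table is made up for the demo -- the real vocabularies and merge ranks live in the downloaded encoding files, and the real tokenizer maps each merged piece to an integer ID (per the web tokenizer above, 31373 and 995 are the GPT-2 IDs for "hello" and " world"):

```python
def bpe_encode(text, merges):
    """Toy BPE: repeatedly apply the lowest-ranked merge until none fit.
    `merges` maps (left, right) -> rank; lower rank merges first."""
    parts = list(text)
    while True:
        candidates = [(merges[pair], i)
                      for i, pair in enumerate(zip(parts, parts[1:]))
                      if pair in merges]
        if not candidates:
            return parts
        _, i = min(candidates)
        parts[i:i + 2] = [parts[i] + parts[i + 1]]

# Hypothetical merge table just for this demo.
merges = {("h", "e"): 0, ("l", "l"): 1, ("he", "ll"): 2, ("hell", "o"): 3}
bpe_encode("hello", merges)   # ['hello'] -- every merge applies in turn
bpe_encode("yellow", merges)  # ['y', 'e', 'll', 'o', 'w']
```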

Llamamoe

Are GPT models even bottlenecked by input tokenization? What real world speedup does this actually translate into?

Jensson

Could be used for their wrapper API where they check and alter your input and the model's output. This way they could quickly detect "bad words" and so on and return their standard platitudes about how the model can't answer.

Or maybe they do a quick pass on websites and filter out websites with bad words quickly, just leaving a small fraction for the model to train on. Parsing to get the standardized tokens quickly would make that easier.
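The mechanism being speculated about here is cheap to sketch (purely hypothetical -- nothing in the repo suggests OpenAI actually does this; the IDs below are made up):

```python
# Token-level screening: precompute the token IDs of flagged words once,
# then scanning an encoded stream is a single set operation rather than
# a string search over raw text.
FLAGGED_IDS = {31373, 50256}  # hypothetical IDs of flagged tokens

def has_flagged(token_ids) -> bool:
    return not FLAGGED_IDS.isdisjoint(token_ids)

has_flagged([995, 31373])  # True
has_flagged([1, 2, 3])     # False
```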

FL33TW00D

Can someone optimize this further? Seems like there is significant low hanging fruit, as evidenced by this line: https://github.com/openai/tiktoken/blob/main/src/lib.rs#L419

stabbles

Sounds like it could be optimized to run 10x faster on a single thread. ~7MB/s is not that fast

mark_l_watson

Any information on which human languages it works with? Some languages, like Thai, can be challenging, so I am wondering how general this is.

joelburget

It works on all human languages, just inefficiently. I ran it over a sample I found on Wikipedia:

    sample = "ฟองมันฟันหนู, ฟันหนูฟองมัน, ฝนทองฟองมัน"
    len(sample), len(enc.encode(sample))
This returns `39, 40`, so it's just encoding one character at a time. It's probably like this for almost all non-English text.
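The one-token-per-character result makes sense given that byte-level BPE starts from UTF-8 bytes: Thai characters are 3 bytes each, so the learned merges are already collapsing each character's bytes, just rarely merging across characters (a plain-Python illustration, not tiktoken itself):

```python
# Byte-level BPE operates on UTF-8 bytes. Each Thai character encodes
# to 3 bytes, so ~1 token per character means the merges compress bytes
# back to roughly character granularity, but no further.
sample = "ฟองมันฟันหนู"
raw_bytes = sample.encode("utf-8")
len(sample), len(raw_bytes)  # (12, 36): 12 characters, 36 bytes
```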

theragra

Yeah, at least it does that with Russian too

black_puppydog

Would be curious to know what hardware the benchmark was run on. That drop-off beyond 16 threads is steep...
