
PaulHoule

The summer that BERT came out I was working at a startup that was using character-based CNN models for classification. We were thinking a lot about alternate representations; other members of the team were keen on word vectors, but I wasn't, particularly because the documents we were working on frequently had out-of-dictionary words, because those words were important, and because discarding them would lead to failure.

(We were working on "foundation models" too, so it's not just being out-of-dictionary in the final model that's a problem but being out-of-dictionary in the foundation model which is more expensive to train.)

We were doing OK with character-based models for classification, but people believed that storing the "dictionary" inside the neural net was not a good use of its capacity, so there was a lot of enthusiasm for tokens.

Meanwhile I felt so sure that schemes like Word2Vec were doomed that I had left an earlier project using RNNs where the goal was text understanding with a foundation model made by training an RNN to write fake abstracts for case reports from PubMed.

When byte-pair encoding was introduced I remember telling people in a meeting that it was the first tokenization scheme we'd looked at that I could endorse.

I have to admit though that I wish we could work at the character level.

binarymax

I was really excited for CANINE [1] but it never really went anywhere. Tokens are a hack. They work for the most part, but it’s clear when they don’t.

[1] https://arxiv.org/abs/2103.06874

yndoendo

Do you mean that all produced output must be a chain of words found in a dictionary?

In the real world, humans create and use non-dictionary words to communicate daily. A good example is "notify", which is defined in the dictionary, versus "notifier", which is not, and is used to describe "a means to notify someone". The code to send an email notification is an "email notifier"; then there are text message, voice call, and call-center callback notifiers ....

All industries and organizations have jargon, custom-defined words not found in a dictionary, and non-distinctive acronyms.

How would ML output be useful if it cannot handle real-world communication and can only produce lab-sanitized, in-dictionary responses?

entilzha

(Author here)

If I understand your question right, this is one of the reasons BPE is nice and why the parent liked it. For any character sequence, provided the characters are in the alphabet used to create the BPE vocab, there are no unknown words/sequences. One downside of some previous tokenization methods, e.g. dictionary-based methods, is that you could get unknown/UNK tokens.

In our paper with bytes, we also avoid the UNK issue, since we can have an embedding for every possible byte, as there aren't that many (and for sequences of bytes we use hash embeddings, although we did test n-gram lookups for the top-K most frequent byte n-grams in the training data).
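A minimal sketch of why a byte-level vocabulary can never produce UNK: every value 0-255 gets its own embedding row, so any UTF-8 string maps to known IDs. The array sizes and names here are illustrative, not from the paper.

```python
# Toy illustration: a byte-level embedding table covers every possible
# input, so no string can fall outside the vocabulary.
import numpy as np

VOCAB_SIZE = 256          # one entry per possible byte value
EMBED_DIM = 8             # toy embedding width

rng = np.random.default_rng(0)
byte_embeddings = rng.normal(size=(VOCAB_SIZE, EMBED_DIM))

def embed(text: str) -> np.ndarray:
    ids = list(text.encode("utf-8"))   # any string -> byte IDs in 0..255
    return byte_embeddings[ids]        # no lookup can ever miss

vecs = embed("définitely out-of-dictionary ✓")
print(vecs.shape)  # (num_bytes, EMBED_DIM); never an UNK
```

Even the accented characters and the ✓ decompose into in-range bytes, which is the whole point.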

cs702

Nice work. Thank you for commenting on HN!

Did you guys try using an RNN or some other kind of DNN to encode the patches?

phh

That's the OP's point. At the time, the community was split between word-level, which has the shortcomings you're describing, and byte-level, which is uselessly compute-intensive. BPE was the first reasonable in-between. BLT improves on BPE by making the compression learnable rather than precomputed.

modeless

I really hope this works out. Death to tokenizers!

Interesting that it's a hierarchical structure but only two levels of hierarchy. Stacking more levels seems like an obvious direction for further research.

Note: I posted this comment on another related story[1] and the author replied:

"Author here :), I do think it’s a good direction to look into! That said, aside from it being a bit too much to do at once, you’d also have to be careful about how you distributed your FLOP budget across the hierarchy. With two levels, you can make one level (bytes/local encoder) FLOP efficient and the other (patches/global encoder) FLOP intensive. You’d also need to find a way to group patches into larger units. But ya, there are many directions to go from here!"

[1] https://news.ycombinator.com/item?id=42413430

smaddox

Agree more levels seems like it could be beneficial. And another Meta paper published a day later shows how that might work: https://ai.meta.com/research/publications/large-concept-mode...

flimflamm

To create a patch, a small model is used to predict the likelihood for the next character in the input string. Input string: 'Lazy dog jumped over a fence.' Use the model to predict the likelihood of each character.

For example:

    100% sure the next character is 'a'.
    Or maybe it's 10% sure it's 'a', 10% sure it's 'b', and so on.
Then we chunk character estimates together. How many characters? Enough characters so that the total uncertainty (entropy) in each chunk is about the same. And there you have your 'patch' (or 'token').
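A toy sketch of the chunking idea described above, assuming made-up per-character entropies (a real system would get them from the small model's predictions):

```python
# Toy sketch: accumulate per-character entropy until a budget is
# reached, then start a new patch. Entropy values are invented.
import math

def entropy(probs):
    """Shannon entropy in bits of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def chunk_by_total_entropy(chars, char_entropies, budget=3.0):
    patches, current, total = [], [], 0.0
    for ch, h in zip(chars, char_entropies):
        current.append(ch)
        total += h
        if total >= budget:            # patch has "enough" uncertainty
            patches.append("".join(current))
            current, total = [], 0.0
    if current:
        patches.append("".join(current))
    return patches

text = "Lazy dog"
# Pretend entropies: high at word starts, low inside words.
ents = [3.1, 0.4, 0.3, 0.2, 2.8, 0.5, 0.4, 0.3]
print(chunk_by_total_entropy(list(text), ents))  # → ['L', 'azy ', 'dog']
```

Note the patch boundaries land near the high-entropy positions, i.e. where the small model is most surprised.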

yorwba

> How many characters? Enough characters so that the total uncertainty (entropy) in each chunk is about the same.

That's not how it's described in Section 2.3 of the paper. They only use the entropy of the next byte and whether it exceeds a threshold (Global Constraint) or is larger than the preceding byte's entropy by another threshold (Approx. Monotonic Constraint).

That does mean that long repetitive sequences can result in pathologically long patches, as demonstrated in Appendix E.
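For concreteness, a hedged sketch of the two boundary rules as summarized above; the threshold values here are invented for illustration, not taken from the paper.

```python
# Sketch of the two patch-boundary rules: start a new patch when the
# next-byte entropy exceeds a global threshold, or when it jumps
# relative to the previous byte's entropy. Thresholds are made up.

def boundaries(entropies, global_thresh=2.0, delta_thresh=1.0):
    """Return indices where a new patch should start."""
    starts = [0]
    for i in range(1, len(entropies)):
        global_hit = entropies[i] > global_thresh                       # Global Constraint
        monotonic_hit = entropies[i] - entropies[i - 1] > delta_thresh  # Approx. Monotonic Constraint
        if global_hit or monotonic_hit:
            starts.append(i)
    return starts

ents = [0.5, 0.3, 2.6, 0.4, 0.2, 1.5, 0.1]
print(boundaries(ents))  # → [0, 2, 5]
```

This also makes the pathological case visible: a long run of near-zero entropies (a repetitive sequence) never triggers either rule, so the patch just keeps growing.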

But what I'm really curious about is the "small CNN byte-level model with 2-byte context" in Figure 3 (f), because it's never mentioned in any other part of the paper.

entilzha

(Author Here)

Good description! Maybe what the parent got mixed up on is that an alternate way to view this is as trying to chunk bytes so each patch has roughly similar information. E.g., we initially tried a bunch of patching schemes, like keeping a running total of entropy until the total exceeds a threshold, but ended up finding that simple things worked better.

I’ll see if we can add more information about the small CNN in the next update to the arXiv paper.

cschmidt

I'm curious if you're aware of some papers from around 2005 on using contextual entropy to do unsupervised word segmentation on Chinese, and other languages that don't use spaces for word boundaries.

https://aclanthology.org/Y03-1017/ https://aclanthology.org/I05-1009/ https://aclanthology.org/P06-2056/

Exactly the same approach of segmenting a word when the entropy goes up compared to the previous byte.

psb217

One way of thinking about the "Approximate Monotonic Constraint" is that you're running a quick and dirty edge detector on the entropy. Ie, you're clipping based on the gradient of per-byte entropy wrt timestep compared to detecting an edge based on gradient of per-pixel intensity wrt pixel coordinates. It would be interesting to look at the raw sequences of per-byte entropies to see how strongly these sorts of "edges" correlate with human interpretable boundaries (words, prefixes, suffixes, etc).

flimflamm

"That's not how it's described" - Thanks for the correction!

dv_dt

So a variant might be to try using some standard compression algorithm to train with?
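One way to read this suggestion as a toy experiment: use a general-purpose compressor's incremental output size as a cheap stand-in for next-byte entropy. This is purely illustrative and not from the paper.

```python
# Toy experiment: bytes that make the compressed stream grow are
# "surprising", so compressed-size growth approximates entropy.
import zlib

def surprise_profile(data: bytes) -> list[int]:
    """Compressed-size growth contributed by each successive byte."""
    sizes = [len(zlib.compress(data[: i + 1])) for i in range(len(data))]
    return [sizes[0]] + [b - a for a, b in zip(sizes, sizes[1:])]

profile = surprise_profile(b"aaaaaaaaaabbbbbbbbbb")
print(profile)  # repeated bytes contribute little extra compressed size
```

Re-compressing every prefix is quadratic and crude, but it shows the shape of the idea: boundaries would go where the profile spikes.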

dang

Recent and related:

Sharing new research, models, and datasets from Meta FAIR - https://news.ycombinator.com/item?id=42412360 - Dec 2024 (61 comments)

vishpr

So the only thing teaching the model (the loss) is probability prediction in single-byte space. And that is enough? Looks very promising, if I am not misunderstanding.

nodja

From my understanding this removes not only tokenization but also sampling, correct?

Sampling can be a pain point of LLMs, but it also enables interesting usages, like forcing a grammar so the model always outputs valid JSON, tuning temperature to get a more varied distribution, XTC sampling, etc.

What would be the equivalent of these in a BLT?

I can only think of providing the decoder an extra input of allowed/prohibited bytes and running the decode over and over until it outputs something valid; maybe there's a simpler and more obvious approach.

yorwba

It doesn't remove sampling, and forcing grammar by specifying allowed/prohibited bytes doesn't require running the decoder over and over, you just compute the softmax at the output layer over allowed bytes only and sample from those accordingly, same as with BPE-based models.
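A minimal sketch of the masked-softmax constraint described here, with random logits standing in for a real model's output and a hypothetical JSON-ish byte subset as the grammar:

```python
# Restrict the output distribution to allowed byte values before
# sampling, so grammar constraints need no rejection loop.
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=256)            # stand-in scores for all 256 bytes

allowed = [ord(c) for c in "{}[]\":,0123456789"]  # toy JSON-ish subset
mask = np.full(256, -np.inf)
mask[allowed] = 0.0

masked = logits + mask                   # disallowed bytes get -inf
probs = np.exp(masked - masked.max())
probs /= probs.sum()                     # softmax over allowed bytes only

sampled = rng.choice(256, p=probs)
assert sampled in allowed                # always a legal byte, first try
```

Since disallowed bytes get exactly zero probability, a single forward pass plus one sample always yields a grammar-legal byte, which is yorwba's point.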

dr_dshiv

Does this mean AI can pre-train on binaries?

bloomingkales

Some believe AI can now output compiled binaries (e.g. update Notepad.exe with this feature).

We all think AI writing code for us will be the end, but it might be an even simpler takeover.

8n4vidtmkvmk

That just sounds worse, though? We can't validate that the change is correct if we can't read the code. It is interesting, though.

refulgentis

Idk what they mean; I've never seen anyone claim, or come close to claiming, that it could alter an executable binary by itself. I chose to interpret it as "some people think an LLM can code well enough to add features on top of a ~50KLOC codebase automatically".

hackernewds

at some point you can't, or won't be allowed to, do any validations


iandanforth

I find it interesting how far linguistic and experience-based approaches have fallen out of fashion. Humans don't read character by character; even if we can, it's not a standard operating mode. We have word stems and understand modifications by endings. Tokenization doesn't replicate this experience (seriously, look at the tokens that appear in LLM vocabularies), nor does character or byte encoding. Humans have multiple ways to parse words. You can grok a full sentence, read a phrase, read word by word, or sound out a new word character by character. Very few papers explicitly claim that a method is good because it replicates the way a human would perform a task, or perceive the world.

I suspect as LLM reliance increases we'll want to align the models to our experience more closely. I further suspect this will make the errors that models make more comprehensible.

DerSaidin

> Unlike tokenization, BLT has no fixed vocabulary for patches.

iiuc this means: the vocabulary of patches is not known prior to training.

I guess once training has established a vocabulary of patches, that same fixed vocabulary is used for inference (if this is not true I don't see how it could work).

Right?

RandyOrion

An interesting read on alternative tokenization methods.

Questions:

1. What's the goal of entropy-based byte token grouping as tokenization? Is this tokenization method best suited for that goal?

2. What about simply using a byte-level sequence-to-sequence autoencoder with downsampling for tokenization?

boulos

This is neat work, but I also love the (presumably intentional?) backronym of BLT.


Byte Latent Transformer: Patches Scale Better Than Tokens - Hacker News