



August 9, 2022


This is very interesting!

I've tried something somewhat similar in the past. I was looking at implementing an extremely fast decompressor, with ratio similar to LZ4. I was able to get 2x the decompression speed of LZ4, but struggled with compression ratio. The idea was to have 16 byte matches, and allow the matches to apply a 16-bit mask, telling whether each byte is part of the match or a literal. Then I restricted the compressor to only be able to use 16 distinct masks.

This was extremely fast to decompress, because each 16-byte match is: load the 16-byte match into an AVX2 register, load 16 bytes of literals, load the mask you're using, shuffle the literals, then blend the literals and the match. And because the matches are fixed size, you can start the fetch for multiple matches in parallel.
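A scalar sketch of that decode step (the real version would do this with one AVX2 shuffle plus a blend; the function and variable names here are illustrative, not from any actual implementation):

```python
def decode_masked_match(window, offset, literals, mask):
    """Scalar model of one fixed-size 16-byte masked match.

    For lane i of 16:
      mask bit i set   -> byte comes from the match (window[offset + i])
      mask bit i clear -> byte is the next literal, consumed in order

    A SIMD decoder would load the match and the literals, shuffle the
    literals into the clear lanes, and blend on the mask in one shot;
    here we just walk the 16 lanes one at a time.
    """
    out = []
    next_literal = iter(literals)
    for i in range(16):
        if (mask >> i) & 1:
            out.append(window[offset + i])
        else:
            out.append(next(next_literal))
    return bytes(out)

window = bytes(range(16))  # previously decoded bytes
# mask 0x00FF: low 8 lanes from the match, high 8 lanes are literals
decoded = decode_masked_match(window, 0, b"ABCDEFGH", 0x00FF)
```

Restricting the compressor to 16 distinct masks means the decoder can keep all of them pre-loaded, and the mask index fits in 4 bits of the instruction.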

However, the problem I ran into, and would love to solve, is that I also wanted fast-ish compression: it is very hard to search for good matches quickly when the matches have holes in them.

I guess the author is looking at GPU compression, so they are taking a somewhat brute-force approach. But I'd be interested to see how they're doing the match finding, and what kind of speed they're getting.


The author is focused on GPU texture compression for games[1], so they're not too concerned with compression speed. Given a game will be played on (at a minimum) tens of thousands of machines, some of which may be quite limited (e.g. handheld consoles or mobile devices), game developers are more than happy to trade a one-off compression cost for smaller output and faster decompression. They'd likely only perform this on a later "release build" of their game too.

The author mentions in a tweet that compression went from minutes to seconds when switching from CPU to GPU[2]. From memory, he has elsewhere said that a few seconds of compression time is entirely reasonable for such tasks, but I can't find a direct reference.




The initial link in this piece, Compression is Compilation, was really helpful to read first: I think I got more out of it than out of this piece itself.


And the idea that compression is intelligence should also be mentioned:


Here's an interesting counterargument to this by one of the leading AI researchers (creator of Keras and much more), François Chollet:


The idea that a compressor is a compiler made me think back to Chaitin's work on incompleteness and the limits of mathematics, which I wrote a review of a few years ago:


I don't understand why the author encodes every literal byte as a separate instruction when in reality they're just consecutive bytes.

The whole:

  LIT 13
  LIT 24
  LIT 65
  LIT 32
could have been written as a single instruction:

  LIT [13, 24, 65, 32, ...]
It's almost as if the author tries too hard to support their point that their variant looks better.

"Notice how much faster it makes progress through the file vs. LZSS"

Yeah, because you encode every literal separately? It's all LIT instructions?


Yes and no. The author was making the point that each new LIT relied on an input-dependent branch, whereas when you're decoding XOR [a,b,c] the branch is on the pre-decoded length of the instruction. LIT could be encoded like this but isn't.
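The control-flow difference can be sketched like this (both decoders are simplified models, not the author's actual code):

```python
def decode_per_byte_literals(tokens):
    """LZSS-style: every literal is its own token, so the decoder
    takes an input-dependent branch for every single literal byte."""
    out = bytearray()
    for tok in tokens:
        if tok[0] == "LIT":               # one branch per literal byte
            out.append(tok[1])
        else:                             # ("COPY", offset, length)
            _, off, length = tok
            for _ in range(length):
                out.append(out[-off])
    return bytes(out)

def decode_literal_runs(tokens):
    """Run-style: a LIT carries a whole run of bytes, so the branch
    is taken once per instruction, not once per byte."""
    out = bytearray()
    for tok in tokens:
        if tok[0] == "LIT":               # ("LIT", b"...") - bulk copy
            out += tok[1]
        else:
            _, off, length = tok
            for _ in range(length):
                out.append(out[-off])
    return bytes(out)
```

Both produce identical output for equivalent streams; the difference is only how often the decoder's hard-to-predict token-type branch fires.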


Got it: there's no "LIT with length" instruction in LZSS, so it's logically a new compare-and-branch every time a literal character is hit. For some reason I had read it as a size argument rather than a performance one.


Can a XOR-only decompressor do the standard LZ77 trick of encoding many repeated bytes with a copy operation that refers to not-yet-decompressed data? It seems to me that you'd have to have a lot of 0 patch bytes for that. Huffman coding could encode all the 0 bytes in 1 bit each, but that still loses over regular LZ77. It seems to me that you still might want to keep "COPY" around for RLE-style data.
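For reference, the standard LZ77 trick being described: a copy whose offset is smaller than its length reads bytes the decoder has just written, so a single short instruction expands a run (a small illustrative sketch, not tied to any particular codec):

```python
def lz_copy(out, offset, length):
    """Overlapping LZ77 copy: because bytes are copied one at a time,
    offset < length re-reads freshly written output, giving RLE for free."""
    for _ in range(length):
        out.append(out[-offset])

buf = bytearray(b"A")
lz_copy(buf, 1, 7)  # offset 1, length 7: seven more copies of 'A'
```

A XOR-only scheme would instead have to emit the run's worth of zero patch bytes, which is exactly the entropy-coding loss the comment describes.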


Yes, since that is mostly about how you handle the offsets. At most you have one instruction for source-side copies and one for target-side copies. I will say, however, that this is useful for files but not for deltas, as it makes delta composability much more complicated and also makes streaming harder.

(i.e. with only original-source copies and a -> delta1 -> delta2 -> delta3 -> b, composing delta1/2/3 prior to applying them is easy and simple to reason about. If you allow target-side copies, it is a lot messier.)

Way back in the day, when I upgraded Subversion's delta format, I did a lot of work testing out various mechanisms - in practice, target-side copies were much worse, and much more expensive to process, than doing source-only copies and then zlib'ing the delta instructions + new data ;)


I wonder how an LZ_ADD-compressed bitmap would compare to a PNG. PNG is basically just DEFLATE on a bitmap, but you first apply some additional encoding to each row to help the compressor, by expressing the row as a kind of difference from the previous row. It would be cool if that became unnecessary.
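The per-row encoding being referred to is PNG's filtering step; the "Up" filter, for example, replaces each byte with its difference from the byte directly above (a minimal sketch of that one filter, ignoring PNG's other filter types and per-row filter selection):

```python
def up_filter(row, prev_row):
    """PNG 'Up' filter: each byte minus the byte above it, mod 256.
    Smooth vertical gradients turn into rows of mostly-zero bytes,
    which DEFLATE then compresses very well."""
    return bytes((b - p) & 0xFF for b, p in zip(row, prev_row))

def up_unfilter(filtered, prev_row):
    """Inverse: add the byte above back in, mod 256."""
    return bytes((f + p) & 0xFF for f, p in zip(filtered, prev_row))

row0 = bytes([10, 20, 30, 40])
row1 = bytes([12, 20, 33, 40])
filtered = up_filter(row1, row0)  # small residuals where rows differ
```

An instruction set with an ADD/delta primitive could in principle express the same row-to-row differencing inside the compressor itself, which is what would make the separate filtering pass unnecessary.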


Why am I just finding out about this now?