Navarro's book Compact Data Structures is an excellent survey of this literature! https://www.cambridge.org/core/books/compact-data-structures...
I love the perspective of computing the information theoretic lower bounds for a data structure and then trying to achieve that with as little overhead as possible while still supporting efficient operations on the structure.
Most applications don't require true random access.
Instead you're typically reading a range of data, and then you can decompress just the blocks required for the data you want to see.
Caching of partial queries can also help substantially. For example, if many queries involve querying the max() of some per-second data grouped by minute, it is well worth caching that rather than reading the source data every time to calculate the max().
Typically the query engine can keep counts of every subquery, how frequently it's used, and how many data points it involves, and use that to decide how long to cache it for. As far as I'm aware no open-source TSDB does this, despite it being a simple, massive win, especially for alerting systems and dashboards that run very similar queries frequently.
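A minimal sketch of the idea (my own hypothetical names, in-memory only; a real engine would also need eviction and invalidation):

    import time

    class SubqueryCache:
        """Cache materialized subquery results, e.g. per-minute max() rollups.

        TTL grows with how often a subquery is requested and how many source
        points it covers, so expensive, popular rollups are kept the longest.
        """

        def __init__(self, base_ttl=60.0):
            self.base_ttl = base_ttl
            self.entries = {}  # key -> (result, expires_at)
            self.hits = {}     # key -> request count

        def get(self, key):
            self.hits[key] = self.hits.get(key, 0) + 1
            entry = self.entries.get(key)
            if entry is not None and entry[1] > time.time():
                return entry[0]
            return None

        def put(self, key, result, n_source_points):
            # Heuristic: more hits and more underlying points -> longer TTL.
            ttl = (self.base_ttl * (1 + self.hits.get(key, 0))
                   * max(1, n_source_points // 1000))
            self.entries[key] = (result, time.time() + ttl)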
I was thinking the same thing as I was reading this. I doubt you could retain 100% random access. Instead, I think you would generally create "blocks" that are compressed over a time range compatible with your application (i.e. 1-day blocks, 1-week blocks, etc.), and then when picking dates/times you load X blocks and further refine the query after decompression (example: load the May 5th block --> May 5th 8:10AM - 11:13AM). At least, that is what I have done in the past in my own home-grown apps. Essentially each block starts and terminates compression; the number of blocks is a trade-off between compression efficiency and granularity.
Correct, almost all time-series databases divide the data into shards / partitions / whatever-it's-called, which are then split by column, each of which is then compressed as a single unit.
Some databases use a fixed block size (e.g., as you mention, "1 day"), which is simple and stateless to manage, while others dynamically "split" blocks into smaller blocks (frequently called "ranges"), or merge them back later. The latter is significantly more complex, but is a much better approach for varying workloads where you don't know the right shard size in advance, or need to deal with the possibility of highly varying workloads, e.g. a lot of traffic at specific times of day or days of the week/month/year.
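For the fixed-block variant, a rough sketch (my own illustration, not any particular database's layout): compress one day of points per block, and have a range query decompress only the blocks that overlap the requested interval before refining row by row:

    import json
    import zlib

    DAY = 86_400  # block size in seconds

    def build_blocks(points):
        """points: iterable of (unix_ts, value) -> {day_index: compressed_blob}."""
        by_day = {}
        for ts, value in points:
            by_day.setdefault(ts // DAY, []).append((ts, value))
        return {day: zlib.compress(json.dumps(rows).encode())
                for day, rows in by_day.items()}

    def query(blocks, start, end):
        """Decompress only the blocks overlapping [start, end), then refine."""
        out = []
        for day in range(start // DAY, end // DAY + 1):
            blob = blocks.get(day)
            if blob is None:
                continue
            for ts, value in json.loads(zlib.decompress(blob)):
                if start <= ts < end:
                    out.append((ts, value))
        return out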
> (...) how do you handle random access in these cases?
Given we're discussing time series, random access is something that's not required at all. There is no use case where you're looking for a single, specific data point of a time series. Time series are used to store and retrieve batches of data points in order to process and/or analyze/visualize them in batches. Time series queries consist mainly of "get me all the data points of a time series reported between timestamps A and B".
> There is no use case where you're looking for a single, specific data point of a time series.
I'm no expert but it seems trivial to come up with some. The article talked of storing temperature and humidity, so in relation to that:
- "What was the hottest day in the past <variable time period>" will require you to start decompressing at $now-$period. Having to start at the dawn of time is kind of a pain
- "I was so warm last night that I even woke up from it at 5am, what were the temperature and humidity then?"
- Record highs are not going to be at night, so if you want to find that, we can skip a lot of data there. Now, temperatures don't change every millisecond so it's not a lot of data even if you would have to start decompressing 50 years back, but in general, this kind of optimization is only possible if you have random access. (A better example here might be if you have petabytes of LHC data compressed this way and want to find something in it.)
Random access is not the common case but not a bad feature to have, either. The comment kind of acknowledges this by mentioning batches, but starts by dismissing the problem.
The commenter you're responding to isn't saying that it's literally impossible to retrieve a single point without doing a full scan of the data, but rather that it's not the top priority of these data structures to support queries for individual points.
The GP comment even says:
> Time series queries consist mainly of "get me all the data points of a time series reported between timestamps A and B"
which seems exactly like what you mentioned with "$now-$period".
These kinds of data structures, often found in OLAP databases, assume that point queries will be less common, and they accept a bit higher latencies for those queries. But those latencies are still fairly small.
> "What was the hottest day in the past <variable time period>" will require you to start decompressing at $now-$period.
Your example relies on a query to retrieve all data points between timestamp A (X days ago) and B (now).
> - "I was so warm last night that I even woke up from it at 5am, what were the temperature and humidity then?"
What's your definition of "then"?
In time series, it's typically a statistic calculated from all data points within a time window. It might be a max, it might be a percentile, it might be a summary statistic of some kind.
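For instance, a minimal sketch of such a windowed statistic (hypothetical helper, max per 60-second bucket):

    def window_max(points, window=60):
        """Downsample (ts, value) points to the max value per window bucket."""
        buckets = {}
        for ts, v in points:
            key = ts - ts % window  # start of the bucket this point falls in
            buckets[key] = max(buckets.get(key, float("-inf")), v)
        return dict(sorted(buckets.items()))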
> delta encoding of timestamps, and then letting a good standard compression algorithm such as zstd or deflate take care of the rest
This can cut both ways. If the integer data is irregularly distributed, you may increase entropy and give the compressor a harder time than it would have had just looking for slightly longer common substrings.
I've lost count of the number of compression experiments where some fancier encoding that greatly reduced redundancy in the uncompressed representation also happened to worsen (or at least have no effect on) the compressed output size. I'd even be tempted to say it makes things worse more often than not.
I tried this with real data (satellite and Columbus module telemetry). At least for the timestamps, it is always a big win, because values are typically sampled at a precisely, or at least somewhat, regular frequency.
For sample values, it is not as clear cut. The best way I found is to just try both approaches and use the one that results in better compression, if you can afford it. But for values I usually don't bother.
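A small illustration of "try both and keep the winner" (stdlib zlib standing in for zstd/deflate; 64-bit little-endian values assumed):

    import struct
    import zlib

    def delta_encode(values):
        return [values[0]] + [b - a for a, b in zip(values, values[1:])]

    def best_encoding(values):
        """Compress raw vs. delta-encoded and keep whichever is smaller."""
        raw = struct.pack(f"<{len(values)}q", *values)
        delta = struct.pack(f"<{len(values)}q", *delta_encode(values))
        raw_c, delta_c = zlib.compress(raw, 9), zlib.compress(delta, 9)
        return ("delta", delta_c) if len(delta_c) < len(raw_c) else ("raw", raw_c)

    # Regularly sampled timestamps: deltas are nearly constant, a big win.
    timestamps = list(range(1_700_000_000, 1_700_100_000, 10))
    print(best_encoding(timestamps)[0])  # almost certainly "delta"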
> Key to optimizing it was having a database that supported different compression strategies based on the data.
I keep wondering why automatic detection of a compression strategy is hardly ever used. Different algorithms are just better at different things.
I thought timsort was one of the few things that does this, but a quick look at Wikipedia actually indicates that I misremember. With that gone, off the top of my head I cannot think of any utility or library that does this sort of thing. It doesn't seem hard to just take a few kilobytes of data, try a handful of good algorithms, pick the best one, and repeat after a few megabytes. I imagine the gains would be double-digit percentages for at least a fifth of the datasets out there, without necessarily losing out on (de)compression speed. Curious if anyone ever looked into this, though there will always be some uncertainty about bias in the dataset selection.
E.g. 9 out of 10 times xz is clearly better than bzip2 (both at max compression settings), but sometimes bzip2 actually beats xz by some margin. I don't know when or why, but it has me trying both options whenever I care about the final compressed size. I've also seen a lower gzip level actually compress better (I think it was -8 vs -9, but it may also have been between -1 and -3 or so).
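A sketch of the sample-and-pick idea using stdlib codecs (zlib, bz2, lzma standing in for gzip/bzip2/xz; sample and chunk sizes are made up):

    import bz2
    import lzma
    import zlib

    CODECS = {"zlib": lambda d: zlib.compress(d, 9),
              "bz2": bz2.compress,
              "lzma": lzma.compress}

    def pick_codec(chunk, sample_size=16 * 1024):
        """Compress a small prefix with each codec; pick the smallest output."""
        sample = chunk[:sample_size]
        return min(CODECS, key=lambda name: len(CODECS[name](sample)))

    def compress_adaptive(data, chunk_size=4 * 1024 * 1024):
        """Re-run codec selection every few megabytes of input."""
        out = []
        for i in range(0, len(data), chunk_size):
            chunk = data[i:i + chunk_size]
            name = pick_codec(chunk)
            out.append((name, CODECS[name](chunk)))
        return out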
I've always liked this kind of thing. I've also done some experiments with automatically building better starting dictionaries for a corpus in cases where the data is small.
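Zstandard exposes exactly this via dictionary training; a minimal sketch using the third-party zstandard package (pip install zstandard), assuming a corpus of many small, similar records:

    import zstandard  # third-party: pip install zstandard

    def compress_with_trained_dict(samples, payloads, dict_size=16 * 1024):
        """Train a shared dictionary on sample records, then use it as the
        starting dictionary for compressing many small, similar payloads."""
        trained = zstandard.train_dictionary(dict_size, samples)
        cctx = zstandard.ZstdCompressor(dict_data=trained)
        # The dictionary bytes must be stored too; decompression needs them.
        return trained.as_bytes(), [cctx.compress(p) for p in payloads]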
Another interesting case is using a compressed stream to flag anomalies in the data when the compression ratio suddenly drops, like in log analysis.
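A sketch of that trick (window size and threshold are arbitrary): compress each window of log lines, track a running average of the ratio, and flag windows that compress much worse than the norm:

    import zlib

    def anomalous_windows(lines, window=100, drop=0.7):
        """Yield (offset, ratio) for windows whose compression ratio
        (raw/compressed) falls well below the running average, i.e. the
        data suddenly became less repetitive."""
        ema = None  # exponential moving average of the ratio
        for i in range(0, len(lines), window):
            blob = "\n".join(lines[i:i + window]).encode()
            ratio = len(blob) / max(1, len(zlib.compress(blob)))
            if ema is not None and ratio < drop * ema:
                yield i, ratio
            ema = ratio if ema is None else 0.9 * ema + 0.1 * ratio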
That actually sounds pretty bad to me, when many metrics will be constant or very rarely changing, or perfectly correlated with another metric.
Were you arithmetic coding stuff, or still doing bit-twiddling tricks?
If you think that sounds bad, I would love to see what you think is good. Show your work, and then they can just add it as one of the options. It does basically all the tricks in the article and then some. This is a mean, not a median; the median is likely much lower, but no one cares about the median when calculating how much space you need.
RE: Correlated with another metric. That would be an extremely expensive one to calculate. You need an algorithm that can operate while ingesting millions of points per second, 24/7.
This discussion about a Barclay's Bank json list of 74,000 numbers might have some ideas: https://news.ycombinator.com/item?id=28326036
I was able to get it from a 1.3 MB JSON file (uncompressed) to a 151 kB uncompressed (but 4 kB compressed) file using mostly deltas. https://news.ycombinator.com/item?id=28348965
This is great! I'll have to think about how to combine "storing deltas within the list" and "storing deltas of the list," but there should be some significant gains available.
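One hypothetical way to combine the two (my sketch, made-up names): delta-encode within each list first, then subtract the previous list's deltas element-wise, leaving only second-order residuals for the compressor:

    def delta(xs):
        """Deltas within one list: keep the first element, then differences."""
        return [xs[0]] + [b - a for a, b in zip(xs, xs[1:])]

    def encode_lists(lists):
        """Deltas within each list, then deltas across successive lists."""
        out, prev = [], None
        for xs in lists:
            inner = delta(xs)
            if prev is None or len(prev) != len(inner):
                out.append(inner)  # no comparable predecessor
            else:
                out.append([b - a for a, b in zip(prev, inner)])
            prev = inner
        return out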
I love zigzag encoding. The way it uses shift and xor is beautiful. It gives it a performance edge of 3 cycles on my CPU versus signed-LEB encoding. One thing I did once, to encode Unicode lookup tables in a smaller form (to reduce Python's binary size), was discovering it can be very profitable to (1) delta encode; then (2) zigzag encode; and finally (3) apply deflate. For example:
    _PyUnicode_PhrasebookOffset2: size is 178,176 bytes
    _PyUnicode_PhrasebookOffset2: packed size is 100,224 bytes
    _PyUnicode_PhrasebookOffset2: rle size is 282,216 bytes
    _PyUnicode_PhrasebookOffset2: deflate size is 52,200 bytes
    _PyUnicode_PhrasebookOffset2: bz2 size is 76,876 bytes
    _PyUnicode_PhrasebookOffset2: δleb size is 47,198 bytes
    _PyUnicode_PhrasebookOffset2: δzd size is 12,748 bytes

It's so cool.
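For reference, a sketch of the zigzag transform and the three-step pipeline (stdlib zlib standing in for deflate; 64-bit values assumed):

    import zlib

    def zigzag(n):
        """Map signed to unsigned: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4."""
        return (n << 1) ^ (n >> 63)  # arithmetic shift propagates the sign bit

    def delta_zigzag_deflate(values):
        """(1) delta encode, (2) zigzag + varint encode, (3) deflate."""
        deltas = [values[0]] + [b - a for a, b in zip(values, values[1:])]
        buf = bytearray()
        for d in deltas:
            z = zigzag(d)
            while True:  # LEB128-style varint on the zigzagged value
                byte, z = z & 0x7F, z >> 7
                buf.append(byte | (0x80 if z else 0))
                if not z:
                    break
        return zlib.compress(bytes(buf), 9)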
For anyone else not familiar: δ is a lowercase delta (at least, I had to look it up)