Brian Lovin
/
Hacker News
Daily Digest email

Get the top HN stories in your inbox every day.

planb

As far as I see the currently best performing solution [0] does not account for hash collisions and therefore probably generates wrong results if enough different cities are in the dataset. Or am I missing something?

[0] https://github.com/gunnarmorling/1brc/blob/main/src/main/jav...

gunnarmorling

Yes, you are right. This came up yesterday and indeed two solutions were violating the "must work for all station names" rule by relying on specific hash functions optimized for the specific data set, which I unfortunately missed during evaluation. I've just removed these entries from the leaderboard for the time being. Both authors are reworking their submissions and then they'll be added back.

[0] https://twitter.com/mtopolnik/status/1742652716919251052

rkangel

Have you considered having two datasets? The "development" dataset which is public, and the "competition" dataset which is private?

That might help against algorithms that are (accidentally, or on purpose) tuned to the specific dataset.

paxys

Yeah if you are releasing the competition dataset then it can and will 100% be gamed. What is to stop me from just hardcoding and printing the result without any computation?

londons_explore

The next obvious solution is to have something that works fast for non-colliding station names, but then falls back to another (slow) implementation for colliding station names.

It's very cheap to detect station name collisions - just sample ~10,000 points in the file, and hope to find at least one of each station. If you find less stations than the full run, you haven't yet seen them all, keep hunting. If you find two that collide, fall back to the slow method. Checking 0.00001% of the dataset is cheap.
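A minimal sketch of that collision check, assuming a hypothetical `CollisionProbe` class and a 16-bit bucket space (any real submission would use its own tuned hash):

```java
import java.util.List;

// Sketch of the sampling idea above: hash a sample of station names and
// fall back to a slow path if any two *distinct* names share a bucket.
class CollisionProbe {
    static final int BUCKETS = 1 << 16;

    // Returns true if the sampled names are collision-free under this hash.
    static boolean collisionFree(List<String> sampledNames) {
        String[] seen = new String[BUCKETS];
        for (String name : sampledNames) {
            int bucket = (name.hashCode() & 0x7fffffff) % BUCKETS;
            if (seen[bucket] != null && !seen[bucket].equals(name)) {
                return false; // two distinct names collide -> use slow path
            }
            seen[bucket] = name;
        }
        return true;
    }
}
```

("Aa" and "BB" are a classic pair with equal `String.hashCode()`, so they always collide regardless of bucket count.)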

posix86

But how would you detect that a station name is colliding or not colliding? With a hash set?

anonymoushn

It seems like you have to run the slow collision-resistant thing over the whole file.

thomaswue

What are the constraints on station names - i.e. min length, max length, maximum number of different names?

gunnarmorling

Just clarified this in the README:

* Input value ranges are as follows:

- Station name: non null UTF-8 string of min length 1 character and max length 100 characters

- Temperature value: non null double between -99.9 (inclusive) and 99.9 (inclusive), always with one fractional digit

* Implementations must not rely on specifics of a given data set, e.g. any valid station name as per the constraints above and any data distribution (number of measurements per station) must be supported

londons_explore

I believe the whole thing can be done in 0.3 seconds with the following approach:

(Describing only the 'happy path' here - other paths can be made fast too, but will require different implementations)

* Since temperatures are only to 0.1 decimal points, we have a finite number of temperatures. ~400 temperatures will cover all common cases.

* We also have a finite number of place names. (~400)

* Just make a lookup table of all temps and all place names. (~160,000)

* autogenerate a state machine that will map each of the above 160,000 things, at any rotation within a 4 byte register, to a unique bin in a hash table. The state machine will have one 32 bit state register (16 bits to output, 16 to carry over to the next cycle) where every cycle a lookup in the state transition table is done, and the next 4 bytes of data XOR'ed on top.

* run through all the data, at RAM speed, incrementing counters for each state the machine ends up in. (there will only be 65k). These counters fit fully in cache.

* With just 32 bits of state, with AVX512 we can be running 512 copies of this in parallel if we like, per core! AKA, compute will not be the bottleneck.

* from the values of the counters, you can calculate the answers. 65k is a far smaller number than 1 billion, so you don't need to do this bit fast.

* for anything that doesn't map to a valid bin (higher/lower temperatures, unknown place names), just fallback to slow code (one state of the state machine can be reserved for 'escape to slow code'. Use these escapes for min/max handling too, since it will only happen a few thousand times).

* I think this approach can operate at RAM speed with just one core with AVX512, so no benefit in splitting across cores.
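A simplified, scalar sketch of the counter-table idea above (no state machine, no SIMD; class and method names are hypothetical). It only illustrates why a billion rows collapse into a small array of counters from which min/mean/max fall out afterwards:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Every (station, temperature) pair gets its own counter bin;
// min/mean/max are recovered from the bins at the end.
class CounterBins {
    static final int TEMPS = 1999;               // -99.9 .. 99.9 in tenths
    final Map<String, Integer> ids = new HashMap<>();
    long[] counters = new long[0];

    void record(String station, int tenths) {    // tenths in [-999, 999]
        int id = ids.computeIfAbsent(station, k -> ids.size());
        if ((id + 1) * TEMPS > counters.length)
            counters = Arrays.copyOf(counters, (id + 1) * TEMPS);
        counters[id * TEMPS + (tenths + 999)]++;
    }

    // Returns {min, mean, max} in tenths (mean truncated toward zero).
    long[] stats(String station) {
        int id = ids.get(station);
        long min = Long.MAX_VALUE, max = Long.MIN_VALUE, sum = 0, n = 0;
        for (int t = 0; t < TEMPS; t++) {
            long c = counters[id * TEMPS + t];
            if (c == 0) continue;
            int v = t - 999;
            min = Math.min(min, v);
            max = Math.max(max, v);
            sum += c * v;
            n += c;
        }
        return new long[] { min, sum / n, max };
    }
}
```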

toth

You don't need the lookup table. All you are asked for is min/mean/max, which can all be computed in one pass without storing the data. All you need is a hash table with 400 entries, each holding 3 floats (running min, mean and max) and an int (count, for updating the running mean). That's just 16 bytes per entry; if you use 16 bytes for the name of the station you can fit everything under 16K.

IO will dominate the running time for this, and JSON parsing will be second.
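A minimal sketch of that one-pass aggregation (class names hypothetical; a real submission would avoid `String` and `Double.parseDouble` entirely):

```java
import java.util.HashMap;
import java.util.Map;

// A small per-station record of running min/max/sum/count is all the
// state the task needs -- no raw rows are ever stored.
class Stats {
    double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY, sum = 0;
    long count = 0;

    void add(double t) {
        min = Math.min(min, t);
        max = Math.max(max, t);
        sum += t;
        count++;
    }

    double mean() { return sum / count; }

    // Process "name;temp" lines in one pass.
    static Map<String, Stats> aggregate(Iterable<String> lines) {
        Map<String, Stats> byStation = new HashMap<>();
        for (String line : lines) {
            int sep = line.lastIndexOf(';');
            byStation.computeIfAbsent(line.substring(0, sep), k -> new Stats())
                     .add(Double.parseDouble(line.substring(sep + 1)));
        }
        return byStation;
    }
}
```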

refulgentis

I don't have a CS background and when I eventually had to do interviews for Google, "Calculate mean/median/mode of temperatures" was the interview question I went with, intending to avoid BS leetcode*

I always worried it was too easy, but I'm heartened by how many comments miss the insight I always looked for and you named: you don't need to store a single thing.

I do wonder if it'll work here, at least as simply as the tracking vars you mention. With so many rows, and the implicit spirit of "you should be able to handle _all_ data", overflow might become a legitimate concern - e.g. we might not be able to track the mean as simply as maintaining sum + count vars.

* "oh sorry, time's up, but the right answer is a red-black self-balancing ternary tree recursive breadth-first search", ironically, was one of my Apple interviews.

toth

In general, median and mode are much harder than min/avg/max. You can't compute the former with constant memory in one pass (you can do approximate median, but not exact median).

(Here there is a restricted range for the temperature with only 1999 possible values (-99.9 to 99.9 in 0.1 increments), so you could do it in constant memory; you'd need something like 4*1999 bytes per unique place name.)
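That constant-memory exact median can be sketched as a fixed histogram over the bounded range (class name hypothetical):

```java
// Exact median in constant memory, exploiting the bounded range:
// one counter per possible tenth-of-a-degree value.
class HistogramMedian {
    final long[] counts = new long[1999];   // -99.9 .. 99.9 in tenths
    long total = 0;

    void add(int tenths) { counts[tenths + 999]++; total++; }

    // Median in tenths of a degree (lower median for even counts).
    int medianTenths() {
        long target = (total + 1) / 2, seen = 0;
        for (int i = 0; i < counts.length; i++) {
            seen += counts[i];
            if (seen >= target) return i - 999;
        }
        throw new IllegalStateException("empty histogram");
    }
}
```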

For the sum, overflow is not an issue if you use 64-bit integers. Parse everything to integers in tenths of a degree, and even if all 1 billion rows are 99.9 temperature for the same place name (worst possible case), you are very far from overflowing.
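A sketch of that tenths-of-a-degree parsing (hypothetical helper; a tuned submission would work on raw bytes rather than `String`). Even the worst case, 1e9 rows of 99.9, sums to 999,000,000,000 in tenths, nowhere near `Long.MAX_VALUE` (~9.2e18):

```java
// Parse a fixed-format temperature ("-99.9" .. "99.9", one fractional
// digit) to an integer number of tenths of a degree.
class Tenths {
    static int parse(String s) {             // "-99.9" -> -999, "0.0" -> 0
        int i = 0, sign = 1;
        if (s.charAt(0) == '-') { sign = -1; i = 1; }
        int v = 0;
        for (; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != '.') v = v * 10 + (c - '0');  // skip the decimal point
        }
        return sign * v;
    }
}
```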

londons_explore

Maintaining sum+count is more expensive than you'd imagine... Because to maintain the sum, you need to convert the numbers from ascii to decimal. And to do that you at least need to know the memory alignment of the number and the position of the decimal point.

All far more expensive than the state machine approach, which is pretty much 'alignment doesn't matter, ASCII doesn't matter, we're gonna just handle all possible states that could end up in a 32-bit register'.

CamperBob2

Streaming calculation of the exact median with no storage at all is non-trivial at best in the general case, and I'm not aware of any way to calculate the mode at all. Any pointers to the methods you used in your interview answers?

If you came up with them on the fly, then... well, sign here for your bonus. Can you start Monday?

londons_explore

Merely finding the start/end of each line will use more computation (in the inner loop) than the approach I outlined. Let alone converting the number from ASCII to a float, or looking up the place name in a hash table (oh, and the name is variable length, so you're gonna need to find how long the name is first).

toth

I deleted two previous comments because I realized I misunderstood your proposal. I understand it better now, but I am still confused about something...

Your state machine would need at least 2*160,000 states (you need an extra bit to flag whether you have reached a newline in the last word and need to increment a counter or not), correct? And you are assuming the input is 4 bytes, so won't your transition table need (2^32)*2*160,000 ~ 10^15 entries (at least 3 bytes each)?

undefined

[deleted]

undefined

[deleted]

usefulcat

> IO will dominate the running time for this, and JSON parsing will be second.

Memory bandwidth might dominate, but probably not I/O. The input file is ~12GB, the machine being used for the test has 32GB, and the fastest and slowest of five runs are discarded. The slowest run will usually be the first run (if the file is not already cached in memory), after which there should be little or no file I/O.

nojvek

Is there a way to validate from the app whether all file pages are cached in memory?

What if the code was run under constraints such as a max memory limit in a Docker container?

toth

s/JSON/string/

Twirrim

I took a few shots at this in rust (following on from https://www.reddit.com/r/rust/comments/18ws370/optimizing_a_...), and every single time the bottleneck was down to the string parsing, everything else was largely irrelevant.

You can do this whole thing in about 24 seconds, as long as you're smart about how you chunk up the text and leverage threading. https://github.com/coriolinus/1brc/blob/main/src/main.rs

dist1ll

> I think this approach can operate at RAM speed with just one core with AVX512, so no benefit in splitting across cores.

A single core can't saturate RAM bandwidth. Cores are limited by memory parallelism and latency. Most modern x86 server chips can retire 2 SIMD loads per clock cycle (i.e. 64GB/s @ 1GHz using AVX2), so AVX-512 is not necessary to max out per-core bandwidth - but if you're loading from DRAM you'll likely cap out much, much earlier (closer to 10-16GB/s on commodity servers).

As long as the majority of your data spills into RAM, you'll see a massive throughput cliff on a single core. With large streaming use-cases, you'll almost always benefit from multi-core parallelism.

It's very easy to test: allocate a large block of memory (ideally orders of magnitude larger than your L3 cache), pre-fault it, so you're not stalling on page faults, and do an unrolled vector load (AVX2/AVX-512) in a tight loop.
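A crude Java stand-in for that test (hypothetical class; plain scalar reads rather than AVX loads, so the JIT may or may not vectorize the loop, and the number is only indicative):

```java
// Stream through a buffer much larger than L3 and measure effective
// read bandwidth. The first pass pre-faults the pages so the timed
// pass doesn't stall on page faults.
class BandwidthProbe {
    static double gigabytesPerSecond(int megabytes) {
        long[] buf = new long[megabytes * 1024 * 1024 / 8];
        long sink = 0;
        for (int i = 0; i < buf.length; i++) sink += buf[i]; // pre-fault
        long start = System.nanoTime();
        for (int i = 0; i < buf.length; i++) sink += buf[i]; // timed pass
        long elapsed = System.nanoTime() - start;
        if (sink == 42) System.out.println(sink);  // keep the loop observable
        return (double) megabytes / 1024 / (elapsed / 1e9);
    }
}
```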

menaerus

> A single core can't saturate RAM bandwidth.

I was also intrigued by the same statement and was wondering about it as well. For AMD I don't know but on Intel I believe that each L1 cache-miss is going to be routed through the line-fill-buffers (LFB) first before going to L2/L3. Since this LFB is only 10-12 entries large, this number practically becomes the theoretical limit on the number of (load) memory operations each core can issue at a time.

> Most modern x86 server chips can retire 2 SIMD loads per clock cycle (i.e. 32GB/s @ 1GHz using AVX2),

I think in general yes, but it appears that Alder Lake and Sapphire Rapids have 3 SIMD load ports, so they can execute up to 3 AVX2 _mm256_load_si256 loads per clock cycle.

londons_explore

The goal is that none of the intermediate data spills - the only user of RAM is reading the input data.

But for that, you're right, the workload must be split across as many cores as necessary to keep DRAM maxed out or you're throwing away performance. Luckily this problem is easily split.

LegionMammal978

> With just 32 bits of state, with AVX512 we can be running 512 copies of this in parallel if we like, per core! AKA, compute will not be the bottleneck.

How can the state machine be run in parallel, when the next state always has a dependency on the previous state?

Also, how exactly would the state register be decoded? After you XOR it with 4 bytes of input, it could be practically any of the 4.7 billion possible values, in the case of an unexpected place name.

And even for expected place names longer than 4 bytes, wouldn't they need several states each, to be properly distinguished from other names with a common prefix?

floobertoober

I would ask for clarification about the rules - it isn't clear to me whether code that is special cased for the known 400 place names (even if it supports additional names via a slower path) would be considered valid:

> Q: Can I make assumptions on the names of the weather stations showing up in the data set?

> A: No, while only a fixed set of station names is used by the data set generator, any solution should work with arbitrary UTF-8 station names (for the sake of simplicity, names are guaranteed to contain no ; character).

londons_explore

The code wouldn't be special cased... it would run through the first 10,000 or so rows to find out the dataset to generate the state transition table. If any invalid entry is found later, you'll end up in an invalid state, and can detect that case and fall back to slow code (or even theoretically update the state transition table and return to the fast path).

btilly

Even simpler. Just have a Dictionary mapping place name to ID. The slow path is "I haven't seen this name before and I need to store it, making sure that I'm not racing any other thread doing the same."

You'd have an upper limit on how many place names you can have. But no hard coding of what they are.
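A sketch of that dictionary (class name hypothetical): with `ConcurrentHashMap.computeIfAbsent`, threads only contend on the first sighting of a name; every later lookup is a plain read.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Map place names to dense integer IDs; the slow path (allocating a
// new id) runs at most once per distinct name.
class StationIds {
    private final ConcurrentHashMap<String, Integer> ids = new ConcurrentHashMap<>();
    private final AtomicInteger next = new AtomicInteger();

    int idFor(String name) {
        return ids.computeIfAbsent(name, k -> next.getAndIncrement());
    }
}
```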

spullara

You have to read and parse the entire file to find the place names.

gpderetta

> autogenerate a state machine that will map each of the above 160,000 things, at any rotation within a 4 byte register, to a unique bin in a hash table. The state machine will have one 32 bit state register (16 bits to output, 16 to carry over to the next cycle) where every cycle a lookup in the state transition table is done, and the next 4 bytes of data XOR'ed on top.

You are basically describing a perfect hash, right?

londons_explore

nearly but not quite - you can deliberately 'collide' when there is no data you need to keep going forward (the obvious example being when you end on a '\r' character). This is necessary to keep an upper bound on the state space.

Remember that in the extreme, a state could mean "city A, temp zero, city b" - "0\rB" so it isn't a direct mapping from state machine state to city.

jodrellblank

Where does your ~160,000 come from? There's a billion rows, couldn't there be a different place on every row giving a billion place names?

londons_explore

There are only ~400 place names in the generated data. 160,000 = 400*400

The state machine generation would need to know all 400 - but it's easy enough to scan the first few hundred thousand rows to find them all, and have a fallback in case a name is seen that has never been seen before. (The fallback would be done by having the state machine jump to an 'invalid' state, and then at the end you check that the invalid state's counter is zero; if it isn't, you redo everything with slow code.)

tzs

> There are only ~400 place names in the generated data

The README says the maximum number of unique place names is 10 000, so you should probably design for the fast case to handle that many instead of just going by what is in the database that it happens to be measuring with at the moment.

londons_explore

This code is only fast in the 'happy' case (ie. ~400 unique places, with no odd distributions, temperatures with a std dev of 10 from a mean between 0 and 40). It is still correct in the 'unusual' cases, but would be slow because it would revert to fallback code.

Jaxan

Go for it!

londons_explore

You nerd sniper you!

But more seriously, the JVM's support for vector intrinsics is very basic right now, and I think I'd spend far more time battling the JVM to output the code I want it to output than is fun. Java just isn't the right language if you need to superoptimize stuff.

Theoretically all of the above is super simple SIMD stuff, but I have a suspicion that SIMD scatter/gather (needed for the state lookup table) isn't implemented.

switchbak

If you really want to nerd snipe, write an optimized version in a non-JVM language to compare to the fastest Java one. But that's kind of boring anyway, we've seen this play out a thousand times.

I still appreciate the intention of seeing how far we can squeeze performance out of modern Java+JVM though. Too bad very few folks have the freedom to apply that at their day jobs.

anonymoushn

It sounds like you're unfamiliar with vectorized string processing tasks, so you could surely learn a lot by working on an implementation of the proposed approach. I think it's fine to make it in C++ and then port to Java.

For more such tasks you could go to highload.fun.

JohnKemeny

> The slowest and the fastest runs are discarded. The mean value of the remaining three runs is the result for that contender and will be added to the leaderboard.

I think it's better to discard the two slowest, or simply accept the fastest as the correct. There's (in my opinion) no good reason to discard the best runs.

mayank

This is a pretty standard measure called the Trimmed Mean: https://statisticsbyjim.com/basics/trimmed-mean/
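The challenge's scoring (drop fastest and slowest of five runs, average the rest) is exactly a trimmed mean with k=1. A minimal sketch (class name hypothetical):

```java
import java.util.Arrays;

// Trimmed mean: drop the k smallest and k largest samples, then
// average the rest.
class TrimmedMean {
    static double of(double[] samples, int k) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        double sum = 0;
        for (int i = k; i < sorted.length - k; i++) sum += sorted[i];
        return sum / (sorted.length - 2 * k);
    }
}
```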

ProblemFactory

Variability in software runtime arises mostly from other software running on the same system.

If you are looking for a real-world, whole-system benchmark (like a database or app server), then taking the average makes sense.

If you are benchmarking an individual algorithm or program and its optimisations, then taking the fastest run makes sense - that was the run with least external interference. The only exception might be if you want to benchmark with cold caches, but then you need to reset these carefully between runs as well.

stabbles

For performance benchmarking, the minimal runtime is typically the best estimator if the computations are identical, because it measures performance without interference.

If the language is garbage collected, or if the test is randomized you obviously don't want to look at the minimum.

mayank

> the minimal runtime is typically the best estimator

Depends what you’re estimating. The minimum is usually not representative of “real world” performance, which is why we use measures of central tendency over many runs for performance benchmarks.

hyperpape

There are good reasons to discard the best runs. If you think of your system as having predictable behavior that's slowed down by background processing happening on the machine, it might make sense to use the best run. But if there are any intrinsic sources of non-determinism in the program (and it's far more likely than you might think), taking the best time is likely to be unrepresentative.

https://tratt.net/laurie/blog/2019/minimum_times_tend_to_mis... is a good piece on the subject.

paxys

If you don't think discarding the fastest is acceptable then why are you in favor of discarding the slowest?

JohnKemeny

Because there can be outliers in terms of slowness; perhaps the CPU gets a hiccup, the GC runs an abnormal amount of extra times, the kernel decides to reschedule stuff, whatever, I don't know much about these things.

There can't be outliers in terms of fastness; the CPU doesn't accidentally run the program much faster.

But then again, what the hell do I know...

paxys

It's the same logic both ways. The OS runs 10 background tasks on average at any given time. On one of your runs it was doing 15 things, while on another it was doing 5 things. They are both outliers.

undefined

[deleted]

e63f67dd-065b

The rule lawyer in me wants to spend the first run spinning up a background daemon that loads everything into memory, pins it there, and maybe even prefetches everything into cache as the subsequent runs perform basically a linear scan (you never have to pagemiss if you have an oracle!).

> write a Java program for retrieving temperature measurement values from a text file and calculating the min, mean, and max temperature per weather station

Depending on how far you want to stretch it, doesn’t precomputing the result on 1st run count too? Or pre-parsing the numbers into a compact format you can just slurp directly into rolling sums for subsequent runs.

Not the slightest in the spirit of the competition, but not against the rules as far as I can tell.

Edit: if we don't like pre-computation, we can still play with fancy out-of-the-box tricks like pre-sorting the input, pre-parsing, compacting, aligning everything, etc.

Edit 2: while we're here, why not just patch the calculate_time script to return 0 seconds :) And return 9999 for competitors, for good measure

camkego

There is a real issue with providing the contestant the real exact file that will be used in the contest. Because there are 1e9 shades of gray between hard coding the correct answer in a single line program which doesn't even read the input, and processing the file as if we don't know what it contains.

This might become a contest of judging what is fair pre-computing and what is not.

That's why machine learning contests don't let participants see the final data.

londons_explore

Fix:

A run consists of:

  * I generate random measurements with the provided script, which will be stored in a readonly file.

  * I run your program to calculate the results (and time it).

  * I run my baseline program and your entry is valid if my baseline program calculates the same results.

lelanthran

The filesystem has to be read-only as well, otherwise the first run can simply write the results to ~/.me_cheater and the program first reads that file before attempting to read the dataset.

yellow_lead

I think that would violate this rule

> The computation must happen at application runtime, i.e. you cannot process the measurements file at build time (for instance, when using GraalVM) and just bake the result into the binary

gunnarmorling

Yeah, and I'd count caching results between runs into the same category, i.e. against the spirit of this challenge. Folks should keep in mind that the goal is to learn something new, rather than "winning" by cheating the rules.

cm2187

Though the caching done by the OS (I/O for instance) is fair game

e63f67dd-065b

Ah, I didn't actually follow the link to the github repo. I don't think that precludes pre-processing the input into something that needs no parsing, however, or spawning a background daemon that prefetches everything in the right order.

The first run is indeed runtime, not build time, so technically I think I still count with pre-computation, sending that to the daemon (or just stashing it somewhere in /tmp, or even crazier patching the jarfile), and just printing it back out on the second run onwards.

robertlagrant

> I don't think that precludes pre-processing the input into something that needs no parsing

Why not preprocess it into a file that contains the answer? (-: My read of the description was just: you're meant to read the file as part of the run.

rocqua

I think the rules should stipulate each run is on its own tmpfs and all processes and pagecaches are cleared between runs.

mgaunard

Isn't this simply bound by the speed of the disk? Surely none of the suggested optimizations (SIMD, multi-threading) are relevant.

It would also depend of how many different stations there are and what they are for the hash lookup, but seriously I doubt this will be anything measurable compared to I/O.

rockwotj

A few things:

Disk access can certainly be parallelized, and NVMe is blazing fast, so the bottleneck is more the CPU than the disk. There are systems built around modern hardware that realize this (redpanda.com, where I work, is one such example)

Parsing is a lot of the compute time and some SIMD tricks like SWAR for finding delimiters can be helpful. Stringzilla is a cool library if you want to see a clean implementation of these algorithms: https://github.com/ashvardanian/StringZilla

See my reply here about the file being cached completely in memory after the first run: https://news.ycombinator.com/item?id=38864034

londons_explore

I would hope that any reasonably performant implementation would be faster not only than NVMe, but also faster than CPU to RAM data transfers.

The AMD EPYC Milan in the test server supports memory reads at 150 gigabytes/sec, but that's a 32-core machine, and our test only gets 8 of those cores, so we probably can't expect more than 37 gigabytes per second of read bandwidth.

The total file is ~12 gigabytes, so we should expect to be able to get the task done in 0.3 seconds or so.

There are only ~400 weather stations, so all the results fit in cache. It's unclear if it might be faster to avoid converting ascii to a float - it might be faster to add up all the ascii numbers, and figure out how to convert to a float only when you have the total.

codeflo

> any reasonably performant implementation would be faster not only than NVMe

Saturating NVMe bandwidth is wicked hard if you don't know what you're doing. Most people think they're I/O bound because profiling hinted at some read() function somewhere, when in fact they're CPU bound in a runtime layer they don't understand.

menaerus

> The total file is ~12 gigabytes, so we should expect to be able to get the task done in 0.3 seconds or so.

How? EPYC Milan supports PCIe 4.0, and the theoretical maximum throughput of a sequential read over an x4 link is ~8GB/s.

In bare-metal machines, this figure is rather closer to 5-6GB/s, and in cloud environments such as CCX33, I imagine it could be even less.

So, unless I am missing something you'd need ~2 seconds only to read the contents of the ~12GB file.

vintermann

Are you sure that would give the same result? The first thing that struck me about this challenge is that there's going to be floating point precision issues.

anonymoushn

It looks like all existing solutions are too slow by a huge factor then :)

NavinF

Totally depends on your workload and hardware. I have a normal consumer SSD in my desktop and it can easily sustain 7GB/s (56gbps) as long as I only use 700GB of the 2TB available. A normal server has enough PCIe lanes for 15 SSDs like mine so a normal server's IO bandwidth is comparable to its memory bandwidth. More expensive servers have faster lanes (PCIe 5.0) and more of those lanes.

Anyway the OP's file only has 1B rows so it'll be ~1GB compressed which fits in memory after the first discarded run. IO bandwidth shouldn't matter in this scenario:

> All submissions will be evaluated by running the program on a Hetzner Cloud CCX33 instance (8 dedicated vCPU, 32 GB RAM). The time program is used for measuring execution times, i.e. end-to-end times are measured. Each contender will be run five times in a row. The slowest and the fastest runs are discarded

Edit: The github repo says it's 12 GB uncompressed. That confirms that IO bandwidth doesn't matter

D4ckard

Daniel Lemire has an interesting talk on this: https://www.youtube.com/watch?v=wlvKAT7SZIQ. The key takeaway is that the disk is rarely the bottleneck.

codeflo

It's an old bit of wisdom that's hard to stamp out. That's partly because anything that you don't understand below you in the runtime stack looks like "I/O" during profiling, so the misunderstanding perpetuates even among people who attempt to take a closer look, but don't know enough.

usefulcat

> Isn't this simply bound by the speed of the disk?

Depends on the OS and filesystem being used. The input file is ~12GB, and it's being run 5 times on a machine with 32GB, so after the first run the file could be cached entirely in memory. For example, Linux using ext2 would probably cache the entire file after the first run, but using ZFS it probably would not.

mikewarot

Clearly the fastest way to parse this is to load it all into RAM, and work backwards from the end. That has the advantage of putting the digits in ascending order, and then the delimiter, then the string... until EOF or NL is encountered.

undefined

[deleted]

throwaway2037

I agree. From here: https://news.ycombinator.com/item?id=38865942

<< Just clarified this in the README: * Input value ranges are as follows:

- Station name: non null UTF-8 string of min length 1 character and max length 100 characters

- Temperature value: non null double between -99.9 (inclusive) and 99.9 (inclusive), always with one fractional digit >>

If we assume worst case entries: <100 char station name>;-99.9

That is 106 chars plus newline (1 or 2 chars). That could be 108 GB of data! And, I want to be picky here: the README says "UTF-8" and "max length 100 characters". Eh, we could start a holy war here arguing about what a character means in UTF-8. Let's be generous and assume it means one (8-bit) byte (like C). I have good faith in the original author!

Some Googling tells me that the current fastest PCIe Gen5 NVMe drive is the Crucial T700, with a maximum sequential read of 12,400 MB/s. That is 8.7 seconds to read 108 GB of data. A few other comments mentioned that the sample file is only 12 GB, so that is 1 second, and it can be 100% cached by the Linux kernel for 2nd and later reads. However, if the test file is greater than 32 GB, you are definitely screwed for file system caching. You will have to pay a multi-second "I/O tax" to read the file each time. This seems to far outweigh any computation time.

Thoughts?

Last, I still think this is a great challenge. I think I just found my next interview question. This question is easy to understand and program a basic implementation. Further discussion can be fractally complex about the myriad of optimizations that could be applied.

No trolling: Can you imagine if someone tries this exercise in Python? There are a goofy amount of optimizations to deal with huge data in that language thanks to SciPy, Pandas and other friends. I am sure someone can write an insanely performant solution. (I know, I know: The original challenge says no external dependencies.)

lessbergstein

Yes, this is an extremely trivial problem. Anybody who knows how to program in more than one language is going to find this silly. awk or perl would finish it before jit compilation gets started.

viraptor

It's a trivial problem to implement, but not trivial to implement faster than others. Awk won't beat solutions that are clever about multiple read queues and parallel processing.

declaredapple

Bit of an update, the record is now 8 seconds in java.

I have tried 2 naive awk implementations and both took 10x the basic java implementation btw (I'm sure it too can be optimized).

lessbergstein

Just stick to Java

Jaxan

Please try it out in awk. It would be interesting to see whether awk can do this in a few seconds.

lifthrasiir

> Q: Can I make assumptions on the names of the weather stations showing up in the data set?

> A: No, while only a fixed set of station names is used by the data set generator, any solution should work with arbitrary UTF-8 station names (for the sake of simplicity, names are guaranteed to contain no `;` character).

I'm unsure if it's intentional or not, but this essentially means that the submission should be correct for all inputs, but can and probably should be tuned for the particular input regenerated by `create_measurements.sh`. I can imagine submissions with a perfect hash function tuned for given set of stations, for example.

Hugsun

Given this requirement, they would be wise to have the test data be different from the example data. That would prevent overfitting optimizations.

josephg

I'm not sure that it is overfitting to optimize your code for the test set. The requirement is just that you don't break compatibility for arbitrary names in the process.

This sort of thing shows up all the time in the real world - where, for example, 90% of traffic will hit one endpoint. Or 90% of a database is one specific table. Discovering and microoptimizing for the common case is an important skill.

lifthrasiir

That's why I was so unsure. I think though, that it is generally better to have the input generator seeded and do not disclose the chosen seed in advance, which prevents overfitting to a single input and yet encourages overfitting to a much bigger class of inputs.

gunnarmorling

There should be no optimizations which rely on a-priori knowledge of the dataset. I.e., if you calculate a perfect hash function at runtime by inspecting the dataset, that's fine (not sure whether it's beneficial), but hard-coding it is not.

londons_explore

UTF-8 makes this far harder... But if you stick to the letter but not spirit of the rules, you can simply fall back to a slow implementation if you ever detect any byte > 127 (indicating a multibyte UTF-8 character).

jerf

Only if you validate that the UTF-8 is actually valid. If you just accept that it is, you can treat it as just some bytes. Nothing in the spec that I see requires processing the UTF-8 as an actual Unicode string.

The easiest way to handle Unicode is to not handle it at all, and just shove it down the line. This is often even correct, as long as you don't need to do any string operations on it.

If the author wanted to play Unicode games, requiring normalization would be the way to go, but that would turn this into a very different challenge.

alkonaut

Since the tail of the line has a known format, I guess we are rescued by the fact that the last 0x3B is the semicolon, as the rest is just a decimal number. We can't know that the first 0x3B byte is the semicolon, since the place names are only guaranteed not to contain `;` but can contain 0x013B. So a parser should start from the rear of the line, read the number up to the semicolon, and then treat the place name as byte soup. Had two places shared a line, this challenge would have required real UTF-8 parsing and been much harder.
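
The rear-scan idea can be sketched like so: a toy splitter, assuming each line holds exactly one `name;temperature` record, that never interprets the station bytes until the very end:

```java
import java.nio.charset.StandardCharsets;

public class RearParse {
    // Split a line by scanning backwards: the number after the last ';' is
    // short and has a known shape, so the separator is found in a few steps,
    // and everything to its left can stay raw byte soup.
    static String[] splitLine(byte[] line) {
        int sep = line.length - 1;
        while (line[sep] != ';') sep--;
        return new String[] {
            new String(line, 0, sep, StandardCharsets.UTF_8),
            new String(line, sep + 1, line.length - sep - 1, StandardCharsets.UTF_8)
        };
    }

    public static void main(String[] args) {
        byte[] line = "İzmir;25.3".getBytes(StandardCharsets.UTF_8);
        String[] parts = splitLine(line);
        System.out.println(parts[0] + " -> " + parts[1]); // İzmir -> 25.3
    }
}
```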

solarized

Just for fun. Speed testing awk vs Java.

  awk -F';' '{
      station = $1
      temperature = $2 + 0
      sum[station] += temperature
      count[station]++
      if (count[station] == 1 || temperature < min[station]) {
          min[station] = temperature
      }
      if (count[station] == 1 || temperature > max[station]) {
          max[station] = temperature
      }
  }
  END {
      PROCINFO["sorted_in"] = "@ind_str_asc"  # gawk: iterate stations alphabetically
      sep = "{"
      for (s in sum) {
          printf "%s%s=%.1f/%.1f/%.1f", sep, s, min[s], sum[s] / count[s], max[s]
          sep = ", "
      }
      print "}"
  }' measurement.txt

arkh

I'd like to see it speed tested against an instance of Postgres using the file Foreign Data Wrapper https://www.postgresql.org/docs/current/file-fdw.html

    CREATE EXTENSION file_fdw;
    CREATE SERVER stations FOREIGN DATA WRAPPER file_fdw;
    CREATE FOREIGN TABLE records (
      station_name text,
      temperature float
    ) SERVER stations OPTIONS (filename 'path/to/file.csv', format 'csv', delimiter ';');
    SELECT station_name, MIN(temperature) AS temp_min, AVG(temperature) AS temp_mean, MAX(temperature) AS temp_max
    FROM records
    GROUP BY station_name
    ORDER BY station_name;

ftisiot

Just modified the original post to add the file_fdw. Again, none of the instances (PG or ClickHouse) were optimised for the workload https://ftisiot.net/posts/1brows/

rustforlinux

>The result is 44.465s!

Very nice! Is this time for the first run or second? Is there a big difference between the first and second run, please?

yencabulator

Not including this in the benchmark time is cheating:

    \copy TEST(CITY, TEMPERATURE) FROM 'measurements.txt' DELIMITER ';' CSV;

DiggyJohnson

For some reason I'm impressed that the previous page in the documentation is https://www.postgresql.org/docs/current/earthdistance.html. Lotta fun stuff in Postgres these days.

nathancahill

Man, Postgres is so cool and powerful.

aargh_aargh

Using FDWs on a daily basis I fully realize their power and appeal but at this exclamation I paused and thought - is this really how we think today? That reading a CSV file directly is a cool feature, state of the art? Sure, FDWs are much more than that, but I would assume we could achieve much more with Machine Learning, and not even just the current wave of LLMs.

Why not have the machine consider the data it is currently seeing (type, even actual values), think about what end-to-end operation is required, how often it needs to be repeated, make a time estimate (then verify the estimate, change it for the next run if needed, keep a history for future needs), choose one of methods it has at its disposal (index autogeneration, conversion of raw data, denormalization, efficient allocation of memory hierarchy, ...). Yeah, I'm not focusing on this specific one billion rows challenge but rather what computers today should be able to do for us.

Algunenano

Using ClickHouse local:

    time clickhouse local -q "SELECT concat('{', arrayStringConcat(groupArray(v), ', '), '}')
    FROM
    (
        SELECT concat(station, '=', min(t), '/', max(t), '/', avg(t)) AS v
        FROM file('measurements.txt', 'CSV', 'station String, t Float32')
        GROUP BY station
        ORDER BY station ASC
    )
    SETTINGS format_csv_delimiter = ';', max_threads = 8" >/dev/null

    real 0m15.201s
    user 2m16.124s
    sys 0m2.351s

Most of the time is spent parsing the file.

jackhalford

Your sum variable could get pretty large; consider using a streaming mean update instead, something like: new_mean = ((n * old_mean) + temp) / (n + 1)

apwheele

Had this same thought as well. Definitely comes up when calculating variance this way in streaming data.

https://stats.stackexchange.com/a/235151/1036

I am not sure if `n*old_mean` is a good idea, though. Welford's update is typically something like this inside the loop:

count += 1; delta = current - mean; mean += delta/count;
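
A minimal sketch of that update rule, a Welford-style running mean that never keeps a large sum:

```java
import java.util.Locale;

public class StreamingMean {
    public static void main(String[] args) {
        double[] temps = {25.3, -3.1, 10.0, 18.4};
        double mean = 0.0;
        long count = 0;
        for (double current : temps) {
            // Welford update: nudge the mean by the scaled-down delta.
            count += 1;
            double delta = current - mean;
            mean += delta / count;
        }
        System.out.printf(Locale.ROOT, "%.3f%n", mean); // 12.650
    }
}
```

For doubles at this scale a plain sum and count is usually fine in practice; the streaming form mainly matters when the sum could overflow or lose precision.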

worthless-trash

Interesting challenge; shame it's only Java. Can't wait till people start hand-rolling their own JVM bytecode.

kriskrunch

Check out the discussions[0]; it looks like there are submissions in several languages: Go, Rust, Python, and C++, to name a few.

[0] https://github.com/gunnarmorling/1brc/discussions

Animats

It looks like the problem is dominated by reading in the data file. Some fast solutions just read the whole file into memory.

somat

It would be a more interesting challenge if the data file were larger than memory. I would love to see what people would come up with on some baby VM with 512 MB of RAM.

Even more interesting would be small RAM, little local storage, and a large file only available over the network. I would like to see something other than HTTP, but realistically it would be HTTP.
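
For what it's worth, a plain line-by-line reader already runs in constant memory regardless of file size. A toy baseline (a `StringReader` stands in for the real file here; `Files.newBufferedReader` would behave the same):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class StreamingBaseline {
    public static void main(String[] args) throws IOException {
        // Stand-in for the real measurements file; only one line is ever
        // held in memory at a time, so file size doesn't matter.
        BufferedReader in = new BufferedReader(new StringReader("Oslo;-3.4\nHamburg;12.0\n"));
        long rows = 0;
        for (String line; (line = in.readLine()) != null; ) {
            rows++; // per-station min/mean/max aggregation would go here
        }
        System.out.println("rows=" + rows); // rows=2
    }
}
```

The interesting part of a low-RAM variant would be keeping the per-station aggregates small, not the file scan itself.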

fny

Rather than read the file into memory, memory mapping can be used.
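
A minimal memory-mapping sketch (the temp file stands in for the real measurements file; note a single `FileChannel.map` call is limited to 2 GB, so real solutions map a file this large in chunks or use the newer `MemorySegment` API):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapSketch {
    public static void main(String[] args) throws IOException {
        // Tiny stand-in for the measurements file.
        Path file = Files.createTempFile("measurements", ".txt");
        Files.writeString(file, "Hamburg;12.0\nOslo;-3.4\n");

        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // The OS pages the mapping in on demand: no read() calls, no heap copies.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            int rows = 0;
            while (buf.hasRemaining()) {
                if (buf.get() == '\n') rows++;
            }
            System.out.println("rows=" + rows); // rows=2
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```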

userbinator

Alternatively, "must be written in Java" can be interpreted to mean "must use the JVM to begin execution", and you can clearly spawn another process from Java...

yardstick

Another rule was no external dependencies.

I guess you could write out a new binary from Java itself and then run that binary, but it would be against the spirit of the challenge.

sweetjuly

There's sun.misc.Unsafe which is a builtin that gives you raw read/write. Turn that into ROP, mmap your native program, and now you're full native :)

nneonneo

Just use JNA to allocate an RWX buffer with mmap, unpack a bit of clever code into the buffer, and forge a native function pointer to call it :)

vessenes

Ooh fun, Advent of Code chaser!

A fair comparison between languages should include the make and build times. I haven't used Java / Maven for years, and I'm reminded why, heading into minute 2 of downloads for './mvnw clean verify'.

kaba0

Java build times are very fast. You are just measuring your internet speed here.

(Also, gradle is faster as a build tool for incremental compilation)

troupo

Gradle and fast should not be in the same sentence.

I have no idea what gradle and its daemon do except spending orders of magnitude more time starting up than running the actual build.

You're much better off running javac directly. Then it's fast.

kaba0

Then someone on your team failed to learn the bare minimum needed to write a sane Gradle file and is putting imperative stuff into the configuration phase.

For big projects, compiling only what’s necessary is a huge time win.

pkolaczk

Gradle is faster than what? Than maven? Maybe. But not than Go or Cargo.

kaba0

What step are we talking about? javac itself is absolutely on the same order of magnitude of speed as Go per LOC, while Rust is significantly slower (which makes sense; the latter is a properly optimizing compiler with heavy static analysis, while the former two just spew out Java bytecode/machine code).

Gradle with a daemon is also pretty fast; you are probably just used to some complex project with hundreds of dependencies and comparing it to a Cargo file with a single dependency.

troupo

I don't even think it's faster than Maven.

wiseowise

That depends on project and scope of the project.

wokwokwok

Itself.

> gradle is faster as a build tool for incremental compilation

Implicit:

> …than it is building from scratch, where it needs to download lots of stuff

I mean, yes, saying Java builds are fast does seem a bit “rolls eyes, yes, technically, by LOC, when the compiler is actually running”… but, ^_^! let’s not start banging on about how great cargo/Rust compile times are: they’re really terrible once procedural macros are used, or when sys dependencies invoke some heinous C dependency build like autoconf (and lots of crates do), and there’s still a subpar incremental compilation story.

So, you know. Eh. Live and let live. Gradle isn’t that bad.

occams_chainsaw

...very relative

sixlelyt321

But runtime is slow; is that what you want to convey?

kaba0

No, I would even go as far as to say that a naive, non-microbenchmark java program can often perform better than a naive, low-level language one.

Someone

> A fair comparison between languages should include the make and build times

If you do that, you should also include programming time, and divide both by the number of runs the code will have over its lifetime. Also add an appropriate fraction of the time needed to learn to program.

In such a challenge, that likely would make a very naive version win.

Apart from being impractical, I think that would be against the idea behind this challenge.

Dylan16807

> In such a challenge, that likely would make a very naive version win.

Not if you give a realistic number for an actual database of this scale. Tens of hours of dev time isn't much after you divide it by a few million.

wiseowise

Why would you clean?

“I discard cache and it is so slow, aarghhh”

mrkeen

Don't need maven

> No external dependencies may be used

chii

maven is just a build tool. It creates a single uberjar which you run with the JDK directly - you don't ship the maven project.

But the real meaning is that you're not allowed to use external libraries, rather than build tool related dependency.

kebsup

In the C course at the Czech Technical University, we had a very similar assignment. The submissions of all the course students were continuously evaluated on a leaderboard, and many spent tens of hours optimizing to get the extra points for a better grade (but mostly for status points).

divbzero

I suspect Java is not the fastest language for this. I’d love to see unofficial contenders tackle this challenge using different languages.

dvko

There are a handful of implementations in other languages already. Here’s mine in C99: https://github.com/dannyvankooten/1brc

I’ve also seen versions in Rust, Go, Python, Clickhouse and DuckDB. The discussions tab on the GitHub repo lists some of these.

pbh101

> Each contender will be run five times in a row. The slowest and the fastest runs are discarded. The mean value of the remaining three runs is the result for that contender and will be added to the leaderboard.

Shouldn’t he take the single fastest time, assuming the file (and JDK) being in the file cache is controlled for?

bayindirh

This is done to simulate real-world performance. Your binary is not the only binary on the system, and other services may be running as well. So the fastest time is the happiest path and the slowest is the unluckiest.

The range of the remaining three is what you'd expect to get 99% of the time on a real-world system.

cwillu

Any production system where you're routinely scanning a text file with a billion records is probably running a batch process, and I'd be shocked if the usual performance wasn't much closer to the worst case than the average.

undefined

[deleted]

pbh101

But this is a contrived test looking for the fastest solution, so your arguments point to taking the fastest run: the one with the least interference.

bayindirh

Yes, the one that works fastest in the average case will be the fastest in the real world, and will be the one least affected by general noise present in the system.

The test is very well designed, we may say.

fragmede

> Your binary is not the only binary in the system and other services may be running as well.

Technically yes, but these days most of my machines are single purpose VMs; database/load balancer/app server/etc, so it still seems weird not to take the fastest.

bayindirh

Then your VM is not the only VM sharing the bare metal. The same thing applies, only at a slightly higher level.

As long as you share the metal with other stuff (be it containers, binaries, or VMs), there's always competition for resources, and your average time becomes your real-world time.

berkes

That VM is hardly "single purpose".

There's logrotate and other cleanup tasks, monitoring, DNS, a firewall, and plenty more running on that server. No matter how much you offload to the host (or forgo), there's always a kernel and supporting daemons running alongside or under your app.

rockwotj

Why?

The author does explicitly want all the files to be cached in memory for the later runs: https://x.com/gunnarmorling/status/1742181941409882151?s=20
