1 Billion Row Challenge

Calculate the min, max, and average of 1 billion measurements

Don't see your favorite language listed above? Open an Issue to add it!

Choose one of the languages listed above to see the language-specific leaderboard and instructions for submitting your solution to that language's repository.

Global leaderboard

TODO: Make sure this is up-to-date

#     Time         Solution   Language   Author
1.    6.159s       link       Java       royvanrijn
2.    6.532s       link       Java       Thomas Wuerthinger
3.    7.620s       link       Java       Quan Anh Mai
4.    9.062s       link       Java       obourgain
5.    9.338s       link       Java       Elliot Barlas
6.    10.589s      link       Java       Artsiom Korzun
7.    10.613s      link       Java       Sam Pullara
8.    11.038s      link       Java       Andrew Sun
9.    11.222s      link       Java       Jamie Stansfield
10.   13.277s      link       Java       Yavuz Tas
      4m 13.449s   link       Java       Reference implementation

You can view language-specific leaderboards on each language's competition page.

💪 The challenge

Your mission, should you choose to accept it, is to write a program that retrieves temperature measurement values from a text file and calculates the min, mean, and max temperature per weather station. There's just one caveat: the file has 1,000,000,000 rows! That's more than 10 GB of data! 😱

The text file has a simple structure with one measurement value per row:

Hamburg;12.0
Bulawayo;8.9
Palembang;38.8
Hamburg;34.2
St. John's;15.2
Cracow;12.6
... etc. ...

The program should print out the min, mean, and max values per station, alphabetically ordered. The format that is expected varies slightly from language to language, but the following example shows the expected output for the first three stations:

Bulawayo;8.9;22.1;35.2
Hamburg;12.0;23.1;34.2
Palembang;38.8;39.9;41.0

Oh, and this input.txt is different for each submission since it's generated on-demand. So no hard-coding the results! 😉
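To make the task concrete, here is a minimal, unoptimized sketch of the aggregation in Python. It is not the reference implementation of any language repository; the function name aggregate is hypothetical, the sort is a plain code-point sort, and rounding the mean with Python's ".1f" formatting is an assumption, since the exact output format varies by language.

```python
import io
from typing import Dict, Iterable, List


def aggregate(lines: Iterable[str]) -> List[str]:
    """Return 'station;min;mean;max' rows, sorted by station name."""
    # station name -> [min, max, sum, count]
    stats: Dict[str, List[float]] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # tolerate a trailing empty line
        name, _, raw = line.partition(";")
        value = float(raw)
        entry = stats.get(name)
        if entry is None:
            stats[name] = [value, value, value, 1]
        else:
            if value < entry[0]:
                entry[0] = value
            if value > entry[1]:
                entry[1] = value
            entry[2] += value
            entry[3] += 1
    # Assumption: one decimal place, rounded by Python's float formatting.
    return [
        f"{name};{lo:.1f};{total / count:.1f};{hi:.1f}"
        for name, (lo, hi, total, count) in sorted(stats.items())
    ]


sample = io.StringIO("Hamburg;12.0\nBulawayo;8.9\nPalembang;38.8\nHamburg;34.2\n")
for row in aggregate(sample):
    print(row)
```

A loop like this reads one line at a time, so memory use stays proportional to the number of stations rather than the number of rows. Making it fast on a billion rows is the actual challenge.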

Choose a language from the cards at the top of this page to get started! 🚀

Rules and limits

  • No external library dependencies may be used. That means no lodash, no numpy, no Boost, no nothing. You're limited to the standard library of your language.

  • Implementations must be provided as a single source file. Try to keep it relatively short; don't copy-paste a library into your solution as a cheat.

  • The computation must happen at application runtime; you cannot process the measurements file at build time.

  • Input value ranges are as follows:

    • Station name: non-null UTF-8 string with a minimum length of 1 character and a maximum length of 100 bytes (i.e. this could be 100 one-byte characters, or 50 two-byte characters, etc.)
    • Temperature value: non-null double between -99.9 (inclusive) and 99.9 (inclusive), always with one fractional digit
  • There is a maximum of 10,000 unique station names.

  • Implementations must not rely on specifics of a given data set. Any valid station name as per the constraints above and any data distribution (number of measurements per station) must be supported.
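The value ranges above can be turned into a small validation and parsing sketch. Because every temperature has exactly one fractional digit, one common trick (an assumption here, not something the rules require) is to parse values as integer tenths of a degree and avoid floating-point parsing entirely. The names parse_tenths and is_valid_station are hypothetical.

```python
def parse_tenths(raw: bytes) -> int:
    """Parse b'-12.3' into -123 (tenths of a degree) without using float."""
    negative = raw.startswith(b"-")
    digits = raw[1:] if negative else raw
    whole, sep, frac = digits.partition(b".")
    # The rules guarantee exactly one fractional digit.
    if not sep or len(frac) != 1 or not whole.isdigit() or not frac.isdigit():
        raise ValueError(f"unexpected temperature format: {raw!r}")
    tenths = int(whole) * 10 + int(frac)
    if tenths > 999:  # outside the -99.9 .. 99.9 range
        raise ValueError(f"temperature out of range: {raw!r}")
    return -tenths if negative else tenths


def is_valid_station(name: str) -> bool:
    """Check the length limits: at least 1 character, at most 100 UTF-8 bytes."""
    encoded = name.encode("utf-8")
    return 1 <= len(encoded) <= 100 and b";" not in encoded


print(parse_tenths(b"-99.9"), parse_tenths(b"12.0"))
print(is_valid_station("St. John's"), is_valid_station("é" * 51))
```

Note that the 100-byte limit is measured after UTF-8 encoding, so 51 two-byte characters already exceed it.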

Entering the challenge

Some languages have special instructions, but in general, here's what you can expect:

  1. Create a fork of the 1BRC repository for your language on your own GitHub profile. This will let you submit your solution via a pull request.

  2. Create a new implementation file in the repository. How you do this varies by language. For example, in JavaScript you might create a new src/<username>.js file, while in C++ you might create a new src/<username>.cpp file. It's recommended to copy the default reference solution and modify it from there.

  3. Make that implementation fast. Really fast.

  4. Test & benchmark your solution! There are usually language-specific instructions on how to do this, but in general you run <some-command> bench <username> to compare your solution against the reference implementation. If you see any differences in the output, fix them before submitting your implementation.

  5. Create a pull request against the upstream repository! 🎉 There are usually some additional instructions in the pull request template on information you should include, such as how long your solution took on your computer and your computer's specs.

  6. Someone or some robot will run your solution "officially" on the same hardware as everyone else's solution (so no hardware differences) and report the results. If you're the fastest, you win! 🏆 If not, you'll still probably go on the leaderboard. 🥉
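Step 4 above asks you to fix any differences from the reference implementation. The real bench command is language-specific and not shown here; as a rough illustration only, comparing two outputs line by line could look like the following sketch. The function name first_difference is hypothetical.

```python
from itertools import zip_longest
from typing import Iterable, Optional, Tuple


def first_difference(
    expected: Iterable[str], actual: Iterable[str]
) -> Optional[Tuple[int, Optional[str], Optional[str]]]:
    """Return (line number, expected line, actual line) for the first mismatch.

    Returns None when both outputs are identical. A missing line on either
    side is reported as None.
    """
    pairs = zip_longest(expected, actual)
    for number, (want, got) in enumerate(pairs, start=1):
        want = want.rstrip("\n") if want is not None else None
        got = got.rstrip("\n") if got is not None else None
        if want != got:
            return number, want, got
    return None


reference = ["Bulawayo;8.9;22.1;35.2\n", "Hamburg;12.0;23.1;34.2\n"]
candidate = ["Bulawayo;8.9;22.1;35.2\n", "Hamburg;12.0;23.2;34.2\n"]
print(first_difference(reference, candidate))
```

Reporting only the first mismatch keeps the check cheap and usually points straight at a rounding or ordering bug.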

If you'd like to discuss ideas for implementing 1BRC with the community, you can use the GitHub Discussions of the @1brc GitHub organization or the discussions of the language-specific repository. Please keep it friendly and civil.

Prize 🎁

If you enter this challenge, you may learn something new, get to inspire others, and take pride in seeing your name listed on the leaderboard above. Rumor has it that the winner of the Java competition (the original challenge language) may receive a unique 1️⃣🐝🏎️ t-shirt, too!

FAQ

Make sure you check your language-specific FAQ as well. 😉

What is the encoding of the measurements.txt file?

The file is encoded as UTF-8.

Can I make assumptions on the names of the weather stations showing up in the data set?

No. While only a fixed set of station names is used by the data set generator, any solution should work with arbitrary UTF-8 station names. For the sake of simplicity, names are guaranteed to contain no ; character.
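Since names never contain a ; character, each row can be split at the first ; even when it is read as raw bytes: in UTF-8, the byte for ; (0x3B) never occurs inside a multi-byte character. A minimal sketch, with the hypothetical helper name split_record and an illustrative non-ASCII station name that is not taken from the data set:

```python
from typing import Tuple


def split_record(line: bytes) -> Tuple[str, bytes]:
    """Split one raw row into (decoded station name, raw value bytes)."""
    name, sep, value = line.rstrip(b"\n").partition(b";")
    if not sep:
        raise ValueError(f"missing ';' separator: {line!r}")
    # Decode only the name; the value is plain ASCII and can stay as bytes.
    return name.decode("utf-8"), value


print(split_record("Kraków;12.6\n".encode("utf-8")))
```

Working on bytes and decoding only the station name avoids decoding the whole file up front, which is one possible design choice rather than a requirement.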

Can I copy code from other submissions?

Yes, you can. The primary focus of the challenge is about learning something new, rather than "winning". When you do so, please give credit to the relevant source submissions. Please don't re-submit other entries with no or only trivial improvements.

My solution runs in 2 sec on my machine. Am I the fastest 1BRC-er in the world?

Probably not. 😊 1BRC results are reported in wall-clock time, so results of different implementations are only comparable when obtained on the same machine. For instance, if an implementation is faster on a 32-core workstation than on the 8-core evaluation instance, that doesn't allow for any conclusions. When sharing 1BRC results, you should always also share the result of running the baseline implementation on the same hardware.
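One way to report a result that stays meaningful across machines is to measure your solution and the baseline on the same hardware and share the ratio. The sketch below assumes both runs are available as Python callables; wall_clock_seconds and relative_to_baseline are hypothetical helper names, not part of any 1BRC tooling.

```python
import time
from typing import Callable


def wall_clock_seconds(run: Callable[[], object]) -> float:
    """Measure the elapsed wall-clock time of a single run, in seconds."""
    start = time.perf_counter()
    run()
    return time.perf_counter() - start


def relative_to_baseline(solution_s: float, baseline_s: float) -> float:
    """How many times faster the solution ran than the baseline on one machine."""
    return baseline_s / solution_s


# Times taken from the global leaderboard above:
# 6.159s for the top entry, 4m 13.449s (253.449s) for the reference implementation.
print(f"{relative_to_baseline(6.159, 253.449):.1f}x faster than the baseline")
```

A single run can be noisy, so repeating the measurement a few times on an otherwise idle machine gives a more trustworthy number.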

Why 1️⃣🐝🏎️?

It's the abbreviation of the project name: the One Billion Row Challenge.

Released under the MIT License.