This presentation was recorded at GOTO Amsterdam 2024. #GOTOcon #GOTOams
https://gotoams.nl

Roy van Rijn – Experienced Developer & Architect, Robotics Enthusiast & Hobby Mathematician @royvanrijn

ORIGINAL TALK TITLE
How Fast Can You Parse a File with 1 Billion Rows of Weather Data Using Java?

RESOURCES
https://x.com/royvanrijn
https://www.linkedin.com/in/royvanrijn
https://github.com/royvanrijn
https://royvanrijn.com

Links
https://adventofcode.com
https://x.com/gunnarmorling
https://www.morling.dev

ABSTRACT
In January 2024, Gunnar Morling posted a challenge online: How fast can you parse a file with 1 billion rows of weather data using Java?

Little did I know this deceptively simple question would lead me down a path that taught me all about: parallelism, memory mapped files, SWAR techniques (SIMD within a register), bit twiddling, branchless code, mechanical sympathy, Graal native compilation and finally… I even turned to the dark side: using sun.misc.Unsafe.

Join me in this deep dive where I’ll explain all the code changes and tricks that took me from the reference implementation, which processes the billion records in 4+ minutes, to processing everything in under 2 seconds.

Who knew Java could be this fast? […]
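
To give a feel for the SWAR idea named in the abstract, here is a minimal sketch (not taken from the talk's code) that looks for the ';' separator in eight bytes at once using plain long arithmetic. The class and method names (SwarSketch, indexOfSemicolon, loadLittleEndian) are made up for this illustration, and the sketch assumes the bytes are loaded in little-endian order so that the lowest byte of the long is the first byte of the text.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

final class SwarSketch {
    private static final long SEMICOLONS = 0x3B3B3B3B3B3B3B3BL; // ';' repeated in every byte
    private static final long LOW_BITS   = 0x0101010101010101L;
    private static final long HIGH_BITS  = 0x8080808080808080L;

    /** Returns the index (0..7) of the first ';' byte in the word, or -1 if there is none. */
    static int indexOfSemicolon(long word) {
        long x = word ^ SEMICOLONS;                    // bytes equal to ';' become 0x00
        long match = (x - LOW_BITS) & ~x & HIGH_BITS;  // high bit set at the first zero byte
        // The lowest set bit marks the first match; divide the bit position by 8 for the byte index.
        return match == 0 ? -1 : Long.numberOfTrailingZeros(match) >>> 3;
    }

    /** Hypothetical helper: reads 8 bytes starting at offset as a little-endian long. */
    static long loadLittleEndian(byte[] bytes, int offset) {
        return ByteBuffer.wrap(bytes, offset, 8).order(ByteOrder.LITTLE_ENDIAN).getLong();
    }

    public static void main(String[] args) {
        byte[] line = "Oslo;-3.5\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(indexOfSemicolon(loadLittleEndian(line, 0)));
    }
}
```

A byte-by-byte loop would compare and branch eight times for the same input; here one XOR, one subtraction and two ANDs cover all eight bytes, which is the appeal of the technique.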

TIMECODES
00:00 Intro
01:49 The challenge
06:07 Watch, learn, adopt, experiment
08:00 Mechanical sympathy
09:32 Temperature as integer
10:37 Memory mapped files
11:54 Getting unsafe
13:31 SWAR
17:22 Stringless
18:18 Branchless programming
20:35 Parse the temperature
30:14 Keeping track
36:22 Which JVM?
37:21 Graal (native-image)
39:38 Summary
40:50 Results
42:00 Outro
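
The "Temperature as integer" and "Branchless programming" topics in the list above can be illustrated with a small sketch. It is a simplified illustration, not the parsing routine shown in the talk. It assumes every temperature has an optional minus sign, one or two integer digits, and exactly one fractional digit (for example 5.7, -12.3), and it stores the value as an int in tenths of a degree. The names TemperatureSketch, isEqual and parseTenths are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class TemperatureSketch {

    /** Returns 1 when a == b and 0 otherwise, for values in 0..127, without a branch. */
    static int isEqual(int a, int b) {
        // a ^ b is 0 on a match; subtracting 1 then gives -1, whose sign bit is set.
        return ((a ^ b) - 1) >>> 31;
    }

    /** Parses a temperature starting at buf[start] into tenths of a degree, e.g. "-12.3" gives -123. */
    static int parseTenths(byte[] buf, int start) {
        int negative = isEqual(buf[start], '-');         // 1 if there is a leading '-'
        int pos = start + negative;                      // skip the sign without an if
        int twoDigits = 1 - isEqual(buf[pos + 1], '.');  // 1 for "dd.d", 0 for "d.d"
        int d0 = buf[pos] - '0';
        int d1 = buf[pos + 1] - '0';                     // only meaningful when twoDigits == 1
        int frac = buf[pos + 2 + twoDigits] - '0';
        int whole = d0 + twoDigits * (9 * d0 + d1);      // d0, or 10 * d0 + d1
        int abs = whole * 10 + frac;
        return (abs ^ -negative) + negative;             // negate when negative == 1
    }

    public static void main(String[] args) {
        byte[] text = "-12.3\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(parseTenths(text, 0));
    }
}
```

Keeping the value as an integer avoids floating-point parsing and rounding in the hot loop; the division by ten only has to happen once, when the final results are printed.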

Download slides and read the full abstract here:
https://gotoams.nl/2024/sessions/3164

RECOMMENDED BOOKS
Monica Beckwith • JVM Performance Engineering • https://amzn.to/3zuJ7Ig
Scott Oaks • Java Performance • https://amzn.to/4eNhlH4
Trisha Gee, Kathy Sierra & Bert Bates • Head First Java • https://amzn.to/3k59BJ6
Trisha Gee & Kevlin Henney • 97 Things Every Java Programmer Should Know • https://amzn.to/3kiTwJJ


https://www.linkedin.com/company/goto-
https://www.instagram.com/goto_con
https://www.facebook.com/GOTOConferences
#Java #JVM #GraalVM #Parsing #Parallelism #MemoryMappedFiles #SWAR #BitTwiddling #BranchlessCode #MechanicalSympathy #GraalNative #JavaProgramming #AdventOfCode #1BillionRowChallenge #GunnarMorling #RoyvanRijn

Looking for a unique learning experience?
Attend the next GOTO conference near you! Get your ticket at https://gotopia.tech
Sign up for updates and specials at https://gotopia.tech/newsletter

SUBSCRIBE TO OUR CHANNEL – new videos posted almost daily.
https://www.youtube.com/user/GotoConferences/?sub_confirmation=1

Comments

  • @djchrisi
    The view count gives testimony to what a fun challenge that was.

  • @Abhigyan103
    How can I learn Java at this advanced level? Every course just teaches object-oriented programming.

  • @geoffxander7970
    When a software engineer stumbles upon the dark arts of real computer science…
    The only thing that comes to mind not talked about was AVX2 or SSE4.x (I don't know if they're supported natively in Java).

  • @RumberoEuropeo
    These tricks are great for graph analytics too!

  • @meryplays8952
    ok, how about Golang?

  • @TechTalksWeekly
    This is a brilliant talk and it's been featured in the last issue of Tech Talks Weekly newsletter 🎉
    Congrats Roy!

  • @eduardopalhares3526
    Really Great talk, thanks for sharing this knowledge

  • @LtdJorge
    If you optimize for specific architectures, you can do the SIMD lookup with fewer instructions and much wider registers. For example, I’m using the memchr Rust crate by the genius BurntSushi, and specifically the AVX2 implementation. It does loops of 4 sequential comparisons with 256-bit registers. The SIMD part is just 2 instructions.

  • @Syntax753
    Kudos for mentioning Advent of Code! And yeah, most people can parse 1 Billion rows of weather data between every blink (and more if using strong Java)

  • @RouteNRide
    Imagine now if we didn't have to deal with the absolute idiot who created that human-readable string data format… Completely unrealistic problem, because the first billable hour of work would go to making the data persistence computer friendly, not trying to parse strings fast (e.g. [2 bytes city id] [2 bytes temperature], or 4+4 if > 64k cities). Besides, no worthy software engineer would ever create the problem of mixing data that is naturally partitioned (sensors or cities). It was an embarrassingly parallel problem made worse by tossing everything in a single file.

  • @Tony-dp1rl
    Any language that cannot do this entirely I/O-limited, by reading the billion rows from disk in parallel, should be shamed. This would run at I/O speed in JavaScript, Java, C#, Python, Turbo Pascal, even Lua could do this. 🙂

  • @serrrsch
    I've followed the competition on Twitter and GitHub but this talk is just a Gem in how it is being told. Big up for the presentation/slides skills!

  • @androth1502
    native compilation… is it really java anymore?

  • @actorenEU
    In 1975, at the University of Delft, my professor and I collaboratively developed an assembler and interpreter for the computer practicum. We had to run this on an IBM mainframe, using a higher-level language. To make it functional, we had to employ extensive masking and shifting operations. I vividly remember the complex logical intricacies we had to navigate to get it all working correctly. From this hands-on experience, I truly admire your work and the effort it takes to even get it working.

  • @Nashadelicable
    This was so much fun watching

  • @CenturionDobrius
    Amazing presentation, thanks a lot for sharing ❤

  • @TimBradleyFromOz
    From 4 minutes 49 seconds 679 milliseconds >>> to >>> 1 second 535 milliseconds?
    Wow!
    Great talk, thanks!

  • @robchr
    Wizard level optimizations

  • @juveraey
    Even my low brain can understand what you said. Great talk, thank you Roy van Rijn

  • @nightking4615
    Using memory maps is cheating!

  • @chauchau0825
    This is gold

  • @wsollers1
    This was an amazing talk

  • @ericm97
    What a journey. Lovely talk 🙂
