The June results are in for Graph500, and guess what? We took the 4th position in the capacity ranking with Chaos.
If you have not heard of it, Graph500 is a leading benchmark for large-scale graph processing backed by a wide range of HPC experts from academia, industry, and national laboratories. The goal of the benchmark is to run a breadth-first search (BFS) algorithm as fast as possible and on the largest possible graph. The benchmark is evaluated using two metrics: minimum execution time (speed ranking), and maximum problem size (capacity ranking).
In order to become #4, we had to process a graph with 16 trillion edges (scale 40) that required 256 terabytes of storage. We did this using 20 servers in a little over 10.5 hours. More details and complete ranking available at Graph500.org Capacity Ranking – June 2016.
As you can see, we are in pretty interesting company: all systems in the top 10 (and most beyond) are supercomputers. So how did we do it using only 20 commodity servers?
The answer is Chaos, the scale-out graph processing system using secondary storage that we built over the last couple of years. Where all these supercomputers use massive amounts of main memory, we use a pool of magnetic disks to store and process the graph. There is obviously a trade-off here: our system is much slower to process the same graph. But the cool thing is that we can get to same scale as supercomputers using cheap, commodity hardware.
What is even cooler is the modifications we made to Chaos to make it possible to do this. I dubbed them: Ragnarok mode 😉 I cannot include all the details in this blog post, but I might write a short paper about this, so stay tuned if you’re interested.
Ragnarok mode Overview
Processing input graphs almost as large as the available storage capacity without running out of space requires that any data be either only read or read and immediately replaced. Leveraging the fact that BFS needs not revisit an edge after its associated vertices have been discovered, we define a new processing model where the system produces updates by inserting them directly in the edge stream marked with a flag. In this mode, outgoing edges from a newly discovered edge are rewritten as updates within the edge stream by reversing source and destination. On the next pass, updates will cause new vertices to be discovered after which they are filtered out. As a result, the edge stream shrinks over time to only the set of edges that still have not been explored.
Rewriting the edge stream files in place causes difficulties with updates as the reversing of source and destination usually implies that an update must be written to a different partition. We addressed these issues by rewriting the I/O subsystem in Chaos to implement circular edge streams which collapse the parts of the file that have been read while appending newly generated data (edges or updates).
Finally, the use of a single stream with edges and updates makes it possible to fold the scatter inside the gather phase. We can also take advantage of the undirected nature of BFS to only store edges in one direction and provide the ability to access both endpoints by reversing edges at each iteration.
Edit: Fixed broken link (2019-09-17).