Iceberg

Hilbert Curves

When you want to cluster data together over multiple dimensions, you can use Z-Order. But a better algorithm is the Hilbert Curve, a space-filling fractal that maps points in a multi-dimensional space onto a 1-dimensional line while making a best attempt to keep points that are close together in that space close together on the line. From Databricks' Liquid Clustering design doc we get this graphical representation of what it looks like (dotted-line squares represent files). A Hilbert curve has the property that consecutive nodes along the curve (the red line in the figure) are always at a distance of 1 from each other in the grid, so the curve never jumps across the space.
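The distance-of-1 property can be checked with a small sketch. This is the standard textbook conversion from a position along a 2-dimensional Hilbert curve to grid coordinates, not the code any particular engine uses; the function name `hilbert_point` and the 8 x 8 grid are assumptions made for illustration.

```python
def hilbert_point(n, d):
    """Return the (x, y) cell at position d along a Hilbert curve
    covering an n x n grid, where n is a power of two."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate/flip the current sub-square so the curve pieces join end to end.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        # Move into the quadrant selected by (rx, ry).
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


n = 8  # hypothetical grid size
path = [hilbert_point(n, d) for d in range(n * n)]

# Manhattan distance between each pair of consecutive cells on the curve.
steps = [abs(x1 - x0) + abs(y1 - y0)
         for (x0, y0), (x1, y1) in zip(path, path[1:])]
print(set(steps))
```

Every step has length 1 and every cell of the grid is visited exactly once, which is why cutting the curve into contiguous runs (files) tends to give compact regions.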

Z-Ordering

Z-ordering is an optimization technique in big data that allows faster access because similar data lives together. This section discusses the algorithm that defines what counts as similar. Imagine a logical grid where all the values of one column run across the top and all the values from another run down the side. If we sort this data, every datum can be placed somewhere in that grid. Now, if the squares of the grid are mapped to files and all the data in each square lives in the corresponding file, searching becomes much easier: a query only needs to read the subset of files in which the data may live.
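As a rough illustration of the grid-to-files idea, the sketch below uses the usual definition of a Z-order (Morton) value, interleaving the bits of the two column values, then sorts a small grid by that value and cuts it into files. The 4 x 4 grid, the file size of 4 and the names `z_order_index` and `candidate_files` are all assumptions for the example, not how any particular table format implements it.

```python
def z_order_index(x, y, bits=16):
    """Interleave the bits of x and y (x in the even positions,
    y in the odd positions) to get a single sortable value."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


# Every cell of a hypothetical 4 x 4 grid of (x, y) column values.
cells = [(x, y) for x in range(4) for y in range(4)]
cells.sort(key=lambda c: z_order_index(*c))

# Cut the sorted data into "files" of 4 rows each.
files = [cells[i:i + 4] for i in range(0, len(cells), 4)]

# Per-file min/max statistics for each column.
stats = [(min(x for x, _ in f), max(x for x, _ in f),
          min(y for _, y in f), max(y for _, y in f)) for f in files]


def candidate_files(x, y):
    """Return the indexes of the files whose min/max ranges could hold (x, y)."""
    return [i for i, (x0, x1, y0, y1) in enumerate(stats)
            if x0 <= x <= x1 and y0 <= y <= y1]


print(candidate_files(3, 0))
```

With this layout each file covers one 2 x 2 square of the grid, so a lookup on both columns only has to open a single file out of four.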