Posts
Cut Costs, Build Faster
It’s common to develop against external systems that we share with other devs. But wouldn’t it be so much better if we could just develop on our laptops with our favourite IDE in an isolated space? Well, we can! In this example, we’re going to test the functionality of Apache Iceberg. Note that we’re not going to test its performance: that needs more data than will fit on a laptop.
First look at the Unity Catalog
Databricks’ Unity Catalog has now been open sourced on GitHub. But there’s not a huge amount of code there: I counted a mere 133 Java files which were neither tests nor examples. This is not too surprising since it’s little more than a REST API over a (H2) database that stores metadata. It is just a catalog after all. What is a little more surprising is that “MANAGED table creation is not supported yet.
When more CPUs do not help your problem with CPUs
This is an interesting problem on Discord where the symptoms belie the cause. Here, a very beefy Spark cluster is taking a long time to process an (admittedly) large amount of data. However, it’s the CPUs that are getting hammered. The temptation at this point is to add more CPU resources, but this won’t help much. When Spark jobs that are not computationally intensive are using large amounts of CPU, there’s an obvious suspect.
The Death of Data Locality?
Data locality is where the computation and the storage are on the same node. This means we don’t need to move huge data sets around the network. But it’s a pattern that has fallen out of fashion in recent years. With a lot of cloud offerings, we lose the data locality that made Hadoop such a great framework on which to run Spark some 10 years ago. The cloud providers counter this with a “just rent more nodes” argument.
Hilbert Curves
When you want to cluster data together over multiple dimensions, you can use Z-Order. But a better algorithm is the Hilbert Curve, a fractal that makes a best attempt to keep adjacent points together in a 1-dimensional space. From Databricks’ Liquid Clustering design doc we get this graphical representation of what it looks like: (Dotted line squares represent files). A Hilbert curve has the property that adjacent nodes (on the red line, above) have a distance of 1.
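To make the distance-of-1 property concrete, here is a minimal sketch of the textbook index-to-coordinate mapping for a 2-dimensional Hilbert curve. The function name `hilbert_d2xy` and the `order` parameter are my own choices for illustration; this is not taken from the Liquid Clustering design doc or any Databricks code.

```python
def hilbert_d2xy(order: int, d: int) -> tuple[int, int]:
    """Map position d along a Hilbert curve to (x, y) on a 2**order x 2**order grid."""
    n = 1 << order          # side length of the grid
    x = y = 0
    t = d
    s = 1                   # size of the sub-square handled in this step
    while s < n:
        rx = 1 & (t // 2)   # which half in x at this level
        ry = 1 & (t ^ rx)   # which half in y at this level
        if ry == 0:
            if rx == 1:
                # flip the sub-square so the curve joins up with its neighbour
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x     # rotate by swapping the axes
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


# Walk the whole curve on an 8 x 8 grid and check each step moves one cell.
points = [hilbert_d2xy(3, d) for d in range(64)]
steps = [abs(a[0] - b[0]) + abs(a[1] - b[1]) for a, b in zip(points, points[1:])]
print(set(steps))
```

Every consecutive pair of positions on the curve lands in neighbouring grid cells, which is why rows sorted by their Hilbert index tend to end up in the same file as their neighbours.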
Z-Ordering
Z-ordering is an optimization technique in big data that allows faster access since similar data lives together. Here we discuss the algorithm that defines what counts as similar. Imagine a logical grid where all the values of one column run across the top and all the values from another run down the side. If we were to sort this data, every datum could be placed somewhere in that grid. Now, if the squares of the grid were mapped to files and all the data in each cell were to live in those files, we would have made searching much easier, as we would know the subset of files in which a given datum may live.
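The usual way to turn a position in that grid into a single sortable value is to interleave the bits of the two coordinates. The sketch below assumes the column values have already been mapped to small non-negative integers; `z_order` and its `bits` parameter are illustrative names, not the API of any particular engine.

```python
def z_order(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Z-order (Morton) value."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # bits of x go to the even positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # bits of y go to the odd positions
    return z


# Sort the cells of a 4 x 4 grid by Z-order value; cells that are close in
# both dimensions tend to end up close together in the sorted list.
cells = sorted(((x, y) for x in range(4) for y in range(4)),
               key=lambda c: z_order(c[0], c[1]))
print(cells[:4])
```

Cutting the sorted list into consecutive chunks gives the files: a query that filters on either column then only needs to read the chunks whose value range overlaps the filter.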