With worked examples that illustrate functionality, we can make systems architecture much more scientific.
Databricks’ Unity Catalog has now been open sourced on GitHub. But there’s not a huge amount of code there: I counted a mere 133 Java files that were neither tests nor examples. This is not too surprising since it’s little more than a REST API over an (H2) database that stores metadata. It is just a catalog, after all. What is a little more surprising is that “MANAGED table creation is not supported yet”.
This is an interesting problem on Discord where the symptoms belie the cause. Here, a very beefy Spark cluster is taking a long time to process an (admittedly) large amount of data, and it’s the CPUs that are getting hammered. The temptation at this point is to add more CPU resources, but that won’t help much. When Spark jobs that are not computationally intensive are burning large amounts of CPU, there’s an obvious suspect.
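Before renting more nodes, it’s worth seeing where those cycles actually go. Here is a minimal sketch (mine, not from the Discord thread) that uses Spark’s `TaskMetrics` to break each task’s time down into run time, CPU, GC and deserialization; if CPU-but-not-compute is the problem, the breakdown usually points at the culprit:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Logs a per-task breakdown of where the time went. Note the units:
// executorCpuTime is in nanoseconds, the other metrics in milliseconds.
class CpuBreakdownListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) { // metrics can be absent for failed tasks
      println(
        f"task ${taskEnd.taskInfo.taskId}%5d: " +
        f"run=${m.executorRunTime}%6dms " +
        f"cpu=${m.executorCpuTime / 1000000}%6dms " +
        f"gc=${m.jvmGCTime}%6dms " +
        f"deserialize=${m.executorDeserializeTime}%6dms")
    }
  }
}

// Register it on an existing session:
// spark.sparkContext.addSparkListener(new CpuBreakdownListener)
```

The same numbers are visible per stage in the Spark UI, but a listener lets you aggregate them over a whole job.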
Data locality means the computation runs on the same node that stores the data, so huge data sets never need to move around the network. But it’s a pattern that has fallen out of fashion in recent years. With a lot of cloud offerings, we lose the data locality that made Hadoop such a great framework on which to run Spark some 10 years ago. The cloud providers counter this with a “just rent more nodes” argument.
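Where locality still exists (HDFS, co-located storage), Spark will try to honour it, and how hard it tries is tunable. A sketch of the relevant knobs (the values here are illustrative, not recommendations):

```scala
import org.apache.spark.sql.SparkSession

// spark.locality.wait is how long the scheduler holds a task back,
// hoping for a slot on a node that has the data, before settling for
// a less-local one. Raising it trades scheduling latency for less
// network traffic; "0s" abandons locality altogether. On cloud object
// stores like S3 no executor ever "has" the data, so these waits buy
// nothing there.
val spark = SparkSession.builder()
  .appName("locality-tuning-sketch")
  .config("spark.locality.wait", "3s")        // the default
  .config("spark.locality.wait.node", "10s")  // wait longer for NODE_LOCAL
  .getOrCreate()
```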
When you want to cluster data together over multiple dimensions, you can use Z-Order. But a better algorithm is the Hilbert curve, a space-filling fractal that does its best to keep points that are neighbours in multiple dimensions close together in a 1-dimensional space. Databricks’ Liquid Clustering design doc gives a graphical representation of what it looks like (dotted-line squares represent files). A Hilbert curve has the property that adjacent nodes (on the red line, above) are a distance of 1 apart.
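To make that concrete, here is a small Scala port of the classic xy2d routine from Wikipedia’s Hilbert curve article. It maps a 2-D point on an n×n grid (n a power of two) to its distance along the curve; sorting rows by this distance is what keeps multi-dimensional neighbours in the same files:

```scala
object Hilbert {
  // Rotate/flip a quadrant so the sub-curve is in standard orientation.
  private def rot(n: Int, xy: Array[Int], rx: Int, ry: Int): Unit =
    if (ry == 0) {
      if (rx == 1) {
        xy(0) = n - 1 - xy(0)
        xy(1) = n - 1 - xy(1)
      }
      val t = xy(0); xy(0) = xy(1); xy(1) = t // swap x and y
    }

  /** Distance along the Hilbert curve of point (x, y) on an n x n grid. */
  def xy2d(n: Int, x: Int, y: Int): Int = {
    val xy = Array(x, y)
    var d  = 0
    var s  = n / 2
    while (s > 0) {
      val rx = if ((xy(0) & s) > 0) 1 else 0
      val ry = if ((xy(1) & s) > 0) 1 else 0
      d += s * s * ((3 * rx) ^ ry)
      rot(n, xy, rx, ry)
      s /= 2
    }
    d
  }

  def main(args: Array[String]): Unit =
    // Walk a 4x4 grid: cells whose d values are consecutive are
    // adjacent in the grid, which is the locality property clustering exploits.
    for (x <- 0 until 4; y <- 0 until 4)
      println(s"($x, $y) -> d = ${xy2d(4, x, y)}")
}
```

Z-Order (bit interleaving) is cheaper to compute, but its curve makes occasional long jumps across the space; the Hilbert curve’s adjacent-points-at-distance-1 property avoids those jumps, which is why it clusters better.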