Worked examples that illustrate functionality make systems architecture much more scientific.
It’s common to develop against external systems that we share with other devs. But wouldn’t it be so much better if we could just develop on our laptops, in our favourite IDE, in an isolated space? Well, we can! In this example, we’re going to test the functionality of Apache Iceberg. Note that we’re not going to test its performance: that would need more data than fits on a laptop.
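As a minimal sketch of what that looks like: the shell invocation below follows the Iceberg quickstart (match the runtime artifact to your Spark and Scala versions; the catalog name `local` and the `/tmp` warehouse path are arbitrary choices for illustration):

```scala
// Launch an isolated, laptop-local Iceberg environment:
//
//   spark-shell \
//     --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2 \
//     --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
//     --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
//     --conf spark.sql.catalog.local.type=hadoop \
//     --conf spark.sql.catalog.local.warehouse=/tmp/iceberg_warehouse
//
// Then exercise the functionality under test:
spark.sql("CREATE TABLE local.db.events (id BIGINT, payload STRING) USING iceberg")
spark.sql("INSERT INTO local.db.events VALUES (1, 'hello'), (2, 'world')")
spark.sql("SELECT * FROM local.db.events").show()

// Iceberg's metadata tables are handy for functional checks, e.g. snapshots:
spark.sql("SELECT * FROM local.db.events.snapshots").show()
```

Everything, table metadata and data files alike, lands under /tmp/iceberg_warehouse, so the whole test environment is isolated and trivially disposable.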
Databricks’ Unity Catalog has now been open sourced on GitHub. But there’s not a huge amount of code there: I counted a mere 133 Java files that were neither tests nor examples. This is not too surprising since it’s little more than a REST API over an (H2) database that stores metadata. It is just a catalog, after all. What is a little more surprising is that “MANAGED table creation is not supported yet”.
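You can see the REST-API-over-a-database nature for yourself by poking the server the repo ships with. A quick sketch, assuming the quickstart server is running locally on its default port 8080 (the path follows the OSS docs; adjust host and port if your setup differs):

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object ListCatalogs extends App {
  // List the catalogs held in Unity Catalog's (H2-backed) metadata store.
  val client = HttpClient.newHttpClient()
  val request = HttpRequest.newBuilder()
    .uri(URI.create("http://localhost:8080/api/2.1/unity-catalog/catalogs"))
    .GET()
    .build()

  val response = client.send(request, HttpResponse.BodyHandlers.ofString())
  println(s"status: ${response.statusCode()}")
  println(response.body()) // JSON describing the catalogs
}
```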
This is an interesting problem on Discord where the symptoms belie the cause. Here, a very beefy Spark cluster is taking a long time to process an (admittedly) large amount of data. However, it’s the CPUs that are getting hammered. The temptation at this point is to add more CPU resources, but this won’t help much. When Spark jobs that are not computationally intensive are using large amounts of CPU, there’s an obvious suspect.
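Before reaching for more hardware, it’s worth seeing where the CPU time actually goes. One cheap first step is Spark’s monitoring REST API, served by the driver UI (port 4040 by default): the executor summaries it returns include totalGCTime, and a high GC-to-runtime ratio is one classic sign that the cores are busy collecting garbage rather than doing useful work. A sketch (the application id below is hypothetical; GET /api/v1/applications lists the real ones):

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object ExecutorMetrics extends App {
  val appId = "app-20240101000000-0000" // hypothetical; look yours up first

  // Fetch per-executor summaries; compare totalGCTime against totalDuration.
  val client = HttpClient.newHttpClient()
  val request = HttpRequest.newBuilder()
    .uri(URI.create(s"http://localhost:4040/api/v1/applications/$appId/executors"))
    .GET()
    .build()

  println(client.send(request, HttpResponse.BodyHandlers.ofString()).body())
}
```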
Data locality is where the computation and the storage live on the same node, which means we don’t need to move huge data sets around the network. But it’s a pattern that has fallen out of fashion in recent years. With a lot of cloud offerings, we lose the data locality that made Hadoop such a great framework on which to run Spark some ten years ago. The cloud providers counter this with a “just rent more nodes” argument.
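Spark itself still schedules for locality when it can: each task gets a locality level (PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL, ANY, visible in the stage page of the UI), and spark.locality.wait controls how long the scheduler holds out for a local slot before settling for a worse one. A minimal sketch (the 10s value is purely illustrative; the default is 3s):

```scala
import org.apache.spark.sql.SparkSession

object LocalityDemo extends App {
  // Ask the scheduler to wait longer for a data-local slot before
  // degrading through NODE_LOCAL -> RACK_LOCAL -> ANY.
  val spark = SparkSession.builder()
    .master("local[*]") // local run for the sketch; use your cluster master
    .appName("locality-demo")
    .config("spark.locality.wait", "10s")      // global wait before degrading
    .config("spark.locality.wait.node", "10s") // per-level override
    .getOrCreate()

  // The locality level each task actually achieved appears in the
  // "Locality Level" column of the Spark UI's stage page.
  spark.range(1000000L).selectExpr("sum(id)").show()
  spark.stop()
}
```

Of course, when the data sits in object storage there is no node-local copy to wait for, so every read is effectively remote; that is precisely the locality that Hadoop-era deployments gave us for free.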