No longer medieval architecture

With worked examples that illustrate functionality, we make systems architecture much more scientific.

Cut Costs, Build Faster

It’s common to develop against external systems that we share with other devs. But wouldn’t it be so much better if we could just develop on our laptops, with our favourite IDE, in an isolated space? Well, we can! In this example, we’re going to test the functionality of Apache Iceberg. Note that we’re not going to test its performance; for that we’d need more data than fits on a laptop.
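As a minimal sketch of what that looks like (assuming the iceberg-spark-runtime jar for your Spark version is on the classpath; the warehouse path and table names are illustrative), a file-based Hadoop catalog lets Spark exercise Iceberg entirely on a laptop:

```scala
import org.apache.spark.sql.SparkSession

// A local, isolated Iceberg playground: no shared cluster, no catalog server.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("iceberg-on-a-laptop")
  .config("spark.sql.extensions",
    "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
  .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.local.type", "hadoop") // file-based, no server needed
  .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse")
  .getOrCreate()

// Exercise the functionality we care about, e.g. snapshots for time travel.
spark.sql("CREATE TABLE local.db.events (id BIGINT, payload STRING) USING iceberg")
spark.sql("INSERT INTO local.db.events VALUES (1, 'first'), (2, 'second')")
spark.sql("SELECT * FROM local.db.events.snapshots").show()
```

Deleting /tmp/iceberg-warehouse resets the whole experiment, which is exactly the isolation we wanted.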

First look at the Unity Catalog

Databricks’ Unity Catalog has now been open sourced on GitHub. But there’s not a huge amount of code there: I counted a mere 133 Java files that were neither tests nor examples. This is not too surprising, since it’s little more than a REST API over an (H2) database that stores metadata. It is just a catalog, after all. What is a little more surprising is that “MANAGED table creation is not supported yet”.
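To see quite how thin that layer is, you can poke the server’s REST endpoints directly. A sketch, assuming a locally running instance of the GitHub build on its default port 8080 (the endpoint path here follows the repo’s documented API; adjust if your build differs):

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// List the catalogs the server knows about.
val client = HttpClient.newHttpClient()
val request = HttpRequest.newBuilder()
  .uri(URI.create("http://localhost:8080/api/2.1/unity-catalog/catalogs"))
  .GET()
  .build()
val response = client.send(request, HttpResponse.BodyHandlers.ofString())
// The JSON returned is read straight out of the embedded (H2) metadata store.
println(response.body())
```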

When more CPUs do not help your problem with CPUs

This is an interesting problem on Discord where the symptoms belie the cause. Here, a very beefy Spark cluster is taking a long time to process an (admittedly) large amount of data, and it’s the CPUs that are getting hammered. The temptation at this point is to add more CPU resources, but this won’t help much. When Spark jobs that are not computationally intensive are using large amounts of CPU, there’s an obvious suspect.
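A classic culprit for compute-light jobs that still burn CPU is JVM garbage collection (an assumption on my part; the diagnosis above doesn’t name it). The Spark UI’s Executors tab already reports GC Time next to Task Time, and for a durable record you can switch on GC logging per executor. A sketch, assuming a JDK 9+ runtime for the -Xlog flag:

```scala
import org.apache.spark.sql.SparkSession

// If GC time dominates task time, adding CPUs won't save you; fixing memory
// pressure (partitioning, caching, serialization) will.
val spark = SparkSession.builder()
  .appName("gc-suspect")
  .config("spark.executor.extraJavaOptions",
    "-Xlog:gc*:file=/tmp/executor-gc.log:time,uptime,level,tags")
  .getOrCreate()
```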

The Death of Data Locality?

Data locality is where the computation and the storage are on the same node. This means we don’t need to move huge data sets around the network. But it’s a pattern that has fallen out of fashion in recent years. With a lot of cloud offerings, we lose the data locality that made Hadoop such a great framework on which to run Spark some 10 years ago. The cloud providers counter this with a “just rent more nodes” argument.
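Locality is still visible in Spark’s scheduler, which tags each task PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL or ANY. When the storage is a remote object store, every task is effectively ANY, so one pragmatic response (a sketch, not a universal recommendation) is simply to stop waiting for locality that will never come:

```scala
import org.apache.spark.sql.SparkSession

// spark.locality.wait (default 3s) is how long the scheduler delays a task
// hoping a more local slot frees up. Against S3 and friends there is nothing
// to wait for, so the delay is pure overhead.
val spark = SparkSession.builder()
  .appName("no-locality-to-wait-for")
  .config("spark.locality.wait", "0s")
  .getOrCreate()
```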