Cut Costs, Build Faster
It’s common to develop against external systems that we share with other devs. But wouldn’t it be so much better if we could just develop on our laptops with our favourite IDE in an isolated space? Well, we can!
In this example, we’re going to test the functionality of Apache Iceberg. Note that we’re not going to test its performance: that would need more data than fits on a laptop.
Now, if you’re developing an Apache Iceberg app, you probably need a catalog and you’re probably using Spark (other engines are available). How do we get all this working on our dev machine?
Why do you care?
At this point, some might say “well, I’m fine developing while connected to the cloud.” But this doesn’t scale. If you’re a lone dev, fine, but for the rest of us we need Developer Productivity Engineering, “the next big thing in software development”. The idea is to build faster and better. This obviously cuts costs.
What we want is a development environment and a full cluster all on one laptop. We don’t need expensive cloud environments or expensive infrastructure guys, and the whole thing is easier to manage because everything is replicated on, and isolated to, each developer’s machine. Say goodbye to development data changing underneath you while other devs are using it!
Tools of the trade
Thanks to the wonders of Docker, we can have a whole ecosystem that we can easily spin up.
Even better, with Fabric8’s docker-maven-plugin you can start a cluster, run your tests and tear the cluster down, all as part of your build and with no manual intervention.
It’s free, open source and integrates with the most popular build tool in the Java ecosystem (Maven).
Choosing a Catalog for your Test Suite
You can, of course, run Iceberg and Spark in a single JVM using the local disk for tests most of the time. If you want to keep it really simple, you would use the Hadoop catalog backed by the local filesystem.
[Note that the term ‘Hadoop’ is a bit of a misnomer here. “While the name of the catalog is the Hadoop catalog it works on any file system (or things that look like file systems like cloud object stores)” - Apache Iceberg: The Definitive Guide]
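As a minimal sketch of that simple setup, the helper below collects the Spark settings for an Iceberg Hadoop catalog on the local disk. The property keys follow Iceberg’s Spark configuration; the class name `LocalCatalogConf`, the catalog name and the warehouse path are assumptions made for illustration. It only builds a map, so it compiles without Spark on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: Spark settings for an Iceberg Hadoop catalog on local disk. */
final class LocalCatalogConf {

    private LocalCatalogConf() {}

    /**
     * @param catalogName   name Spark SQL will use for the catalog (assumed, e.g. "local")
     * @param warehousePath local directory that will hold table data and metadata
     */
    static Map<String, String> hadoopCatalog(String catalogName, String warehousePath) {
        String prefix = "spark.sql.catalog." + catalogName;
        Map<String, String> conf = new LinkedHashMap<>();
        // Turn on Iceberg's SQL extensions for the Spark session.
        conf.put("spark.sql.extensions",
                "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions");
        // Register an Iceberg catalog under the chosen name.
        conf.put(prefix, "org.apache.iceberg.spark.SparkCatalog");
        // "hadoop" selects the file-system-backed catalog described above.
        conf.put(prefix + ".type", "hadoop");
        conf.put(prefix + ".warehouse", warehousePath);
        return conf;
    }
}
```

In a test you might pass each entry to `SparkSession.builder().master("local[*]").config(key, value)` before calling `getOrCreate()`, and then address tables through the catalog name you chose.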
However, “concurrent writes with a Hadoop catalog are not safe with a local FS or S3” backing it, according to the official documentation. So, if you want to test behaviour during concurrent writes, the Hadoop catalog is not a good choice.
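To make “testing behaviour during concurrent writes” concrete, here is a rough sketch of a harness that releases several writers at the same moment and counts how many of them fail. The class `ConcurrentWriters` and its `write` parameter are hypothetical; in a real test, `write` would wrap whatever Spark statement commits to the table, and what counts as an acceptable number of failures depends on the catalog under test.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical test helper: fire N writers at once and count the ones that throw. */
final class ConcurrentWriters {

    private ConcurrentWriters() {}

    static int countFailures(int writers, Callable<Void> write) {
        // One thread per writer, so every writer can wait on the latch at the same time.
        ExecutorService pool = Executors.newFixedThreadPool(writers);
        CountDownLatch start = new CountDownLatch(1);
        try {
            List<Future<Void>> results = new ArrayList<>();
            for (int i = 0; i < writers; i++) {
                results.add(pool.submit(() -> {
                    start.await();       // hold until all writers have been submitted
                    return write.call(); // placeholder for the real table write
                }));
            }
            start.countDown();           // release every writer together
            int failures = 0;
            for (Future<Void> result : results) {
                try {
                    result.get();
                } catch (ExecutionException e) {
                    failures++;          // the write itself threw
                }
            }
            return failures;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for writers", e);
        } finally {
            pool.shutdownNow();
        }
    }
}
```

A test could then assert on the returned count, and afterwards check that the table holds exactly the rows from the writes that succeeded.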
Another choice might be Hive Metastore since it comes bundled with Spark anyway. But once again, the moment you try to play with concurrent access, you hit an error. The default Hive Metastore is backed by an embedded Apache Derby database, which is normally great for testing, but it “doesn’t support the interfaces (or concurrent access) required for Iceberg to work”.
You could then use a more production-ready DB like Postgres. But now we’re starting to need external processes. This is not necessarily a major problem, but Hive Metastore was built for Hive, not Iceberg.
For example, if you change the table schema directly in Hive, it does not change the schema in your Iceberg table. The same goes for setting table properties.
So, why use Hive?
A More Realistic Catalog for your Tests
If you want an environment that is more production-like, you’ll need a more industrial-grade catalog. Since we like free and open source solutions, and REST catalogs are becoming ever more popular, we chose Apache Polaris. However, if you want to run Polaris in the same JVM to keep things simpler, you might be disappointed. Because the transitive dependencies of Polaris and Spark are incompatible, you’ll find yourself in Classpath Hell. That’s when firing up Polaris in Docker becomes a great solution.
Bringing it all together
Right, so we have our Behaviour Driven Development tests that fire up a Spark instance configured with the Iceberg extensions and using Polaris as its catalog. That Polaris catalog runs in its own Docker container, which Maven starts before the tests. Once the tests have run, Maven tears it down. The output is automatically generated documentation that covers various Iceberg corner cases.
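For a feel of what pointing Spark at a Polaris container might involve, here is a sketch of the relevant settings. The keys are the standard Iceberg REST catalog properties; the class `PolarisCatalogConf` and every value passed to it (endpoint, catalog names, credentials) are assumptions for illustration and depend on how your container is set up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: Spark settings for an Iceberg REST catalog such as Polaris. */
final class PolarisCatalogConf {

    private PolarisCatalogConf() {}

    /**
     * @param sparkCatalogName   name Spark SQL will use for the catalog
     * @param uri                REST endpoint of the container (assumed, e.g. http://localhost:8181/api/catalog)
     * @param polarisCatalogName catalog created inside Polaris, passed as the warehouse
     * @param clientId           principal's client id (placeholder)
     * @param clientSecret       principal's client secret (placeholder)
     */
    static Map<String, String> restCatalog(String sparkCatalogName, String uri,
                                           String polarisCatalogName,
                                           String clientId, String clientSecret) {
        String prefix = "spark.sql.catalog." + sparkCatalogName;
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("spark.sql.extensions",
                "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions");
        conf.put(prefix, "org.apache.iceberg.spark.SparkCatalog");
        // "rest" makes Iceberg talk to the catalog over HTTP instead of the file system.
        conf.put(prefix + ".type", "rest");
        conf.put(prefix + ".uri", uri);
        conf.put(prefix + ".warehouse", polarisCatalogName);
        // OAuth2 client credentials in the "id:secret" form the REST client expects.
        conf.put(prefix + ".credential", clientId + ":" + clientSecret);
        // Assumed scope; adjust to the roles defined in your Polaris instance.
        conf.put(prefix + ".scope", "PRINCIPAL_ROLE:ALL");
        return conf;
    }
}
```

Because the catalog sits behind an HTTP endpoint in another process, none of Polaris’s dependencies need to be on the test classpath, which is what sidesteps the classpath clash.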
And that’s it. No need for expensive infrastructure or a DevOps team. It’s all available in one GitHub repo.
Enjoy!