The Death of Data Locality?
Data locality means the computation runs on the same node that stores the data, so huge data sets never need to move across the network. But it’s a pattern that has fallen out of fashion in recent years.
With a lot of cloud offerings, we lose the data locality that made Hadoop such a great framework on which to run Spark some 10 years ago. The cloud providers counter this with a “just rent more nodes” argument. But if you have full control over your infrastructure (say you have an on-prem cluster), throwing away data locality is a huge waste.
Just to recap, data locality gives you doubleplusgood efficiency. Not only does the network not take a hit sending huge amounts of data from storage to compute nodes, but we retain OS treats like caching.
What? The OS has built-in caching? Have you ever grepped a large directory and then noticed that running the same command a second time is orders of magnitude faster than the first? That’s because modern operating systems leave pages in memory unless there is a reason to evict them. For this reason, there is normally no point in adding a caching layer for simple queries on the same machine where the database lives - a strange anti-pattern I’ve seen in the wild.
Of course, none of this is available over the network.
Another advantage of data locality is that apps can employ a pattern called “memory mapping”. The idea is that, as far as the app is concerned, a file is just a location in memory: you read it just as you would a sequence of bytes in RAM (Hadoop takes advantage of this here.)
Why is this useful? Well, you don’t even need to make IO kernel calls, so there is no context switching and certainly no copying of data between buffers. Here is an example of how to do this in Java. You can prove to yourself that there are no kernel calls by running:
sudo strace -p $(jstack $(jps | grep MemoryMapMain | awk '{print $1}') | grep '^"main' | perl -pe 's/\].*//g' | perl -pe 's/.*\[//g')
Note that there are kernel calls when setting up the memory mapping, but after that there are no further calls as we read the entire file.
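A minimal sketch of such a memory-mapped read (the class name MemoryMapMain matches the strace command above; the file name and contents are invented for illustration - a real workload would map an existing large file):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MemoryMapMain {
    public static void main(String[] args) throws IOException {
        // Create a small sample file so the sketch is self-contained.
        Path path = Path.of("mmap-demo.bin");
        Files.write(path, new byte[]{1, 2, 3, 4});

        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            // map() issues a single mmap syscall; after that, reads are plain
            // memory accesses (page faults at worst), not read() syscalls.
            MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            long sum = 0;
            while (buffer.hasRemaining()) {
                sum += buffer.get();
            }
            // Every byte was read without a single read() kernel call.
            System.out.println(sum);
        }
    }
}
```

Attaching strace to the main thread while the loop runs should show no read() calls, only the initial mmap during setup.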
So, why have many architects abandoned data locality? It’s generally a matter of economics, as the people at MinIO point out here. If your data is not homogeneous, you might be paying for, say, 32 CPUs on a node that’s only being used for storage. For example, you might have a cluster holding 10 years of data but mainly query the last two years. If the data for the first eight years lives on expensive hardware and is rarely accessed, that could be a waste of money.
So, should you use data locality today? The answer, as ever, is “it depends on your use case”.