When more CPUs do not help your problem with CPUs
This is an interesting problem that came up on Discord, where the symptoms belie the cause. A very beefy Spark cluster is taking a long time to process an admittedly large amount of data. However, it's the CPUs that are getting hammered.
The temptation at this point is to add more CPU resources, but this won't help much.
When a Spark job that is not computationally intensive is still burning large amounts of CPU, there's an obvious suspect. Let's check the time spent in Garbage Collection:
Shuffle per worker seems modest, but look at those GC times. In a five-hour job, nearly two hours is spent just garbage collecting.
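You don't have to squint at the executors tab to catch this, either. Here's a minimal sketch of a `SparkListener` that tallies JVM GC time against executor run time across tasks; the class name `GcTimeListener` and the reporting format are my own, but `executorRunTime` and `jvmGCTime` are standard fields on Spark's `TaskMetrics`:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
import java.util.concurrent.atomic.AtomicLong

// Tallies executor run time vs. JVM GC time across finished tasks so a
// GC-bound job stands out without digging through the Spark UI.
// Attach with: spark.sparkContext.addSparkListener(new GcTimeListener)
class GcTimeListener extends SparkListener {
  private val runTimeMs = new AtomicLong(0)
  private val gcTimeMs  = new AtomicLong(0)

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val metrics = taskEnd.taskMetrics
    if (metrics != null) {
      runTimeMs.addAndGet(metrics.executorRunTime) // total task run time, ms
      gcTimeMs.addAndGet(metrics.jvmGCTime)        // time the JVM spent in GC, ms
    }
  }

  // Call at the end of the job (or periodically) to see the GC share.
  def report(): String = {
    val run = runTimeMs.get
    val gc  = gcTimeMs.get
    val pct = if (run > 0) 100.0 * gc / run else 0.0
    f"executor run time: ${run}ms, GC time: ${gc}ms ($pct%.1f%% of run time)"
  }
}
```

If that percentage creeps toward the 30–40% range we're seeing here, you're not CPU-bound on useful work; you're CPU-bound on cleaning up garbage.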
And this is something of a surprise to people new to Spark. Sure, it delivers on its promise of processing more data than can fit in memory, but if you want it to be performant, you need to give it as much memory as possible.
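As a rough sketch of what "give it memory" looks like in practice, here are the standard knobs, set at session creation time. The specific sizes are illustrative only; tune them against what your cluster's nodes actually have:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("memory-hungry-job")
  // More heap per executor means fewer, shorter GC pauses for the same data volume.
  .config("spark.executor.memory", "16g")
  // Off-heap storage takes pressure off the garbage collector entirely.
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "8g")
  // Fraction of heap reserved for execution and storage (default is 0.6).
  .config("spark.memory.fraction", "0.7")
  .getOrCreate()
```

The point isn't these particular numbers; it's that the lever to pull is memory, not cores.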