UPDATE February 6, 2022: The behaviour seems to be triggered by that the JVM process sometimes gets scheduled on a different NUMA node from where the heap is allocated. I tested this by using this command to run the sieve:
numactl --cpunodebind=0 --membind=0 java PrimeSieve
On my first try this gave me the error message
setting membind: Operation not permitted
. The workaround I
used was to run the docker container in privileged mode:
docker run --privileged --rm -it primes-java-bitset
I don’t know if the docker containers in the drag-racing competition are run like this, but I’ll figure that out eventually. If not, I will have to try see if there is a way to pin this that does not require root…
See:
- CraigJPerry’s HN answer suggesting this, along with the rationale.
- My experiment confirming that the theory seems spot on: test-runs/numa-pin
Some thoughts:
Now, why does the boolean-array
variant of the sieve not
suffer from this? Because the boolean array is not allocated on the
heap?
This NUMA scheduling seems pretty important to consider when deploying JVM applications using Docker, right?
See also my follow up article: Be aware of NUMA when your application needs to run very fast
Orinal post below:
I am currently chasing a Java BitSet
performance-eating
ghost. Can’t say I am particularly close to catching it, but I have just
managed to minimize the
reproduction substantially. Even if it is not yet particularly
minimal, it will have to do for now.
The Problem
When writing implementations of the Eratosthenes sieve in Clojure I started to notice strange performance behaviour:
Today I am investigating why my Clojure solution runs so fast on my Macs (an x64 and an M1), but so slow on the some of the benchmarking machines running Dave's drag-racing competition. It has brought me to @_bsless's Clojure Goes Fast profiler: https://t.co/rQUAMpWHPS
— Peter Strömberg aka PEZ (@pappapez) January 26, 2022
There are several strangeties, but one is easier to investigate than the others. A ”flicker” in the performance of the sieve when using Java’s BitSet. It manifests itself when it is run in Docker on som Linux Machines. Like when it is run in the Programming Languages drag-racing, as set up by Dave Plummer. Some of the times (more than half of the times) the sieve performs at only half of the performance it ”should” have.
For the purpose of the investigation, I could use my Agical freedoms to procure equipment, and I bought a machine that was reasonably similar to the drag-racing machines. An ASUS Mini-PC with Ryzen7. You can see the detailed specs at the repro repo.
I started to suspect it was not about Clojure, and decided to implement the sieve in Java in hopes of being able to reproduce it there. Which, a bit to my surprise, I actually could. I have created a repository for it.
The strange behaviour is not new, it is there also with Java 8 (actually a bit more pronounced, see below).
Reproduction
The repository has this main content:
PrimeSieve.java
, a program that sieveslimit
numbers of numbers for primes, defaulting to 1 million- The program starts with sieving the numbers as many times as it can for 5 seconds as a warmup.
- Then sieves the numbers for 5 seconds counting the number of passes.
- Then prints the number of passes, along with a sanity check that the sieve is correctly updated.
- The BitSet is recreated each pass.
run.sh
, a script that will compilePrimeSieve.java
and run it X times. X defaults to 10.Dockerfile
, a docker image definition based onopenjdk:17
that when run will runrun.sh
then exit.machine-pez-x64-ubuntu.md
, the specs for a machine where I can repro the behaviourtest-runs
, a directory with modified copies of the main program files + results from the runs (very WIP)
This is the code being run over and over again in the benchmark:
public PrimeSieve(int n) {
this.n = n;
int half_n = (n + 1) >> 1;
this.sieve = new BitSet(half_n);
this.sieve.set(0, half_n);
}
public void run() {
int q = (int) Math.ceil(Math.sqrt(n));
for (int p = 3; p < q; p += 2) {
if (sieve.get(p >> 1)) {
for (int i = (p * p) >> 1; i < n >> 1; i += p) {
sieve.clear(i);
}
}
}
}
To run the reproduction:
$ cd experiments/java-bitset-perf-flicker
$ docker build -t primes-java-bitset . && docker run --rm -it primes-java-bitset
Sample output from my Mac M1:
0: Passes: 3659, count: 78498, Valid: true
1: Passes: 3623, count: 78498, Valid: true
2: Passes: 3678, count: 78498, Valid: true
3: Passes: 3684, count: 78498, Valid: true
4: Passes: 3653, count: 78498, Valid: true
5: Passes: 3663, count: 78498, Valid: true
6: Passes: 3666, count: 78498, Valid: true
7: Passes: 3651, count: 78498, Valid: true
8: Passes: 3672, count: 78498, Valid: true
9: Passes: 3682, count: 78498, Valid: true
Very stable performance, albeit not stellar.
Sample output from a Linux x64 machine I have:
0: Passes: 6077, count: 78498, Valid: true
1: Passes: 6059, count: 78498, Valid: true
2: Passes: 2758, count: 78498, Valid: true
3: Passes: 2751, count: 78498, Valid: true
4: Passes: 2752, count: 78498, Valid: true
5: Passes: 2768, count: 78498, Valid: true
6: Passes: 2758, count: 78498, Valid: true
7: Passes: 2760, count: 78498, Valid: true
8: Passes: 6070, count: 78498, Valid: true
9: Passes: 2759, count: 78498, Valid: true
It either performs very well, or it performs at 0.5X of very well. When running this 100 times or more, the variance among the 1X runs is low, as is the variance among the 0.5X runs.
Also on JDK8, just even more crazy
On JDK8 the program has better ”normal” performance, while the runs hit with the drop is at the same level as with newer JDKs.
0: Passes: 6431, count: 78498, Valid: true
1: Passes: 6421, count: 78498, Valid: true
2: Passes: 2749, count: 78498, Valid: true
3: Passes: 6381, count: 78498, Valid: true
4: Passes: 2741, count: 78498, Valid: true
5: Passes: 2749, count: 78498, Valid: true
6: Passes: 6397, count: 78498, Valid: true
7: Passes: 6415, count: 78498, Valid: true
8: Passes: 2742, count: 78498, Valid: true
9: Passes: 2745, count: 78498, Valid: true
On JDK7 all runs suffer the performance drop
0 - Passes: 2743, count: 78498, Valid: true
1 - Passes: 2743, count: 78498, Valid: true
2 - Passes: 2735, count: 78498, Valid: true
3 - Passes: 2748, count: 78498, Valid: true
4 - Passes: 2747, count: 78498, Valid: true
5 - Passes: 2746, count: 78498, Valid: true
6 - Passes: 2734, count: 78498, Valid: true
7 - Passes: 2745, count: 78498, Valid: true
8 - Passes: 2746, count: 78498, Valid: true
9 - Passes: 2751, count: 78498, Valid: true
10 - Passes: 2746, count: 78498, Valid: true
11 - Passes: 2732, count: 78498, Valid: true
12 - Passes: 2750, count: 78498, Valid: true
13 - Passes: 2753, count: 78498, Valid: true
14 - Passes: 2742, count: 78498, Valid: true
15 - Passes: 2747, count: 78498, Valid: true
16 - Passes: 2746, count: 78498, Valid: true
17 - Passes: 2746, count: 78498, Valid: true
18 - Passes: 2744, count: 78498, Valid: true
19 - Passes: 2750, count: 78498, Valid: true
20 - Passes: 2748, count: 78498, Valid: true
21 - Passes: 2743, count: 78498, Valid: true
22 - Passes: 2742, count: 78498, Valid: true
23 - Passes: 2746, count: 78498, Valid: true
24 - Passes: 2746, count: 78498, Valid: true
25 - Passes: 2747, count: 78498, Valid: true
26 - Passes: 2736, count: 78498, Valid: true
27 - Passes: 2738, count: 78498, Valid: true
28 - Passes: 2742, count: 78498, Valid: true
29 - Passes: 2742, count: 78498, Valid: true
30 - Passes: 2735, count: 78498, Valid: true
31 - Passes: 2745, count: 78498, Valid: true
32 - Passes: 2740, count: 78498, Valid: true
33 - Passes: 2737, count: 78498, Valid: true
34 - Passes: 2733, count: 78498, Valid: true
35 - Passes: 2748, count: 78498, Valid: true
36 - Passes: 2742, count: 78498, Valid: true
37 - Passes: 2745, count: 78498, Valid: true
38 - Passes: 2742, count: 78498, Valid: true
39 - Passes: 2749, count: 78498, Valid: true
40 - Passes: 2743, count: 78498, Valid: true
41 - Passes: 2743, count: 78498, Valid: true
42 - Passes: 2736, count: 78498, Valid: true
43 - Passes: 2744, count: 78498, Valid: true
... (snip, same picture with the remaining 56 runs)
(It could, of course, also be that on JDK7 there never kicks in a performance boost.)
Is it really about the BitSet?
Actually, I really don’t have enough information to know this. In
fact so far I have seen other strange performance numbers using a
regular array of boolean
s for the sieve storage. (Even if
in this particular setup, the boolean-array
performs super stable).
For me the mystery started with that a Clojure version of the sieve performed much better on my M1 Mac than on this Linux machine. (Both running in Docker). I have on my todo list to summarize the clues as I have seen so far. I’ll update the repository with reproductions of these other setups as I can make them minimal enough.
This one might, or might not, be related:
On some machines, this code runs 20% slower than the out-commented version, when repeatedly run for 5 secs with n=1M. n and sqrtN are instance variables, and the instance is freshly created each run. I have a theory for the performance diff. Do you? pic.twitter.com/V1z9wXFl8h
— Peter Strömberg aka PEZ (@pappapez) February 5, 2022
I’ve also speculated around that it might have to do with that the
JIT compiler does something different between the two types of
performance runs. There’s a commented line in run.sh
that
runs the sieve program with -XX:+PrintCompilation
to print
out JIT compiler logs. However, I am not savvy enough to make sense of
the output. Which brings me to …
Experiment variants
I’ll keep this updated with the experiments I (or you?) conduct.
- It’s not
System.currentTimeMillis()
: test-runs/timed-passes-10K - Running for 60s instead of 5 makes no difference: test-runs/w5-r60
- No flicker when switching the
BitSet
for an array ofboolean
s: test-runs/boolean-array - Pinning the JVM and the heap allocation to the same NUMA node: test-runs/numa-pin
Help wanted
This mystery is haunting me. I’d love me some help how to minimize the reproduction further.
Please consider sending pull requests with repros/experiments, etcetera, and also feel encourage to use the Discussions forum on the repository.
Here is a Hacker News thread where I get some help with what I could test and how to try frame the problem: https://news.ycombinator.com/item?id=30226083