A few words from Agical

Help solve this Java BitSet performance-flicker mystery

UPDATE February 6, 2022: The behaviour seems to be triggered by that the JVM process sometimes gets scheduled on a different NUMA node from where the heap is allocated. I tested this by using this command to run the sieve:

numactl --cpunodebind=0 --membind=0 java PrimeSieve

On my first try this gave me the error message setting membind: Operation not permitted. The workaround I used was to run the docker container in privileged mode:

docker run --privileged --rm -it primes-java-bitset

I don’t know if the docker containers in the drag-racing competition are run like this, but I’ll figure that out eventually. If not, I will have to try see if there is a way to pin this that does not require root…

See:

Some thoughts:

Now, why does the boolean-array variant of the sieve not suffer from this? Because the boolean array is not allocated on the heap?

This NUMA scheduling seems pretty important to consider when deploying JVM applications using Docker, right?

See also my follow up article: Be aware of NUMA when your application needs to run very fast

Orinal post below:

I am currently chasing a Java BitSet performance-eating ghost. Can’t say I am particularly close to catching it, but I have just managed to minimize the reproduction substantially. Even if it is not yet particularly minimal, it will have to do for now.

The Problem

When writing implementations of the Eratosthenes sieve in Clojure I started to notice strange performance behaviour:

There are several strangeties, but one is easier to investigate than the others. A ”flicker” in the performance of the sieve when using Java’s BitSet. It manifests itself when it is run in Docker on som Linux Machines. Like when it is run in the Programming Languages drag-racing, as set up by Dave Plummer. Some of the times (more than half of the times) the sieve performs at only half of the performance it ”should” have.

For the purpose of the investigation, I could use my Agical freedoms to procure equipment, and I bought a machine that was reasonably similar to the drag-racing machines. An ASUS Mini-PC with Ryzen7. You can see the detailed specs at the repro repo.

I started to suspect it was not about Clojure, and decided to implement the sieve in Java in hopes of being able to reproduce it there. Which, a bit to my surprise, I actually could. I have created a repository for it.

The strange behaviour is not new, it is there also with Java 8 (actually a bit more pronounced, see below).

Reproduction

The repository has this main content:

This is the code being run over and over again in the benchmark:

public PrimeSieve(int n) {
  this.n = n;
  int half_n = (n + 1) >> 1;
  this.sieve = new BitSet(half_n);
  this.sieve.set(0, half_n);
}

public void run() {
  int q = (int) Math.ceil(Math.sqrt(n));
  for (int p = 3; p < q; p += 2) {
    if (sieve.get(p >> 1)) {
      for (int i = (p * p) >> 1; i < n >> 1; i += p) {
        sieve.clear(i);
      }
    }
  }
}

To run the reproduction:

$ cd experiments/java-bitset-perf-flicker
$ docker build -t primes-java-bitset  . && docker run --rm -it primes-java-bitset

Sample output from my Mac M1:

  0: Passes: 3659, count: 78498, Valid: true
  1: Passes: 3623, count: 78498, Valid: true
  2: Passes: 3678, count: 78498, Valid: true
  3: Passes: 3684, count: 78498, Valid: true
  4: Passes: 3653, count: 78498, Valid: true
  5: Passes: 3663, count: 78498, Valid: true
  6: Passes: 3666, count: 78498, Valid: true
  7: Passes: 3651, count: 78498, Valid: true
  8: Passes: 3672, count: 78498, Valid: true
  9: Passes: 3682, count: 78498, Valid: true

Very stable performance, albeit not stellar.

Sample output from a Linux x64 machine I have:

  0: Passes: 6077, count: 78498, Valid: true
  1: Passes: 6059, count: 78498, Valid: true
  2: Passes: 2758, count: 78498, Valid: true
  3: Passes: 2751, count: 78498, Valid: true
  4: Passes: 2752, count: 78498, Valid: true
  5: Passes: 2768, count: 78498, Valid: true
  6: Passes: 2758, count: 78498, Valid: true
  7: Passes: 2760, count: 78498, Valid: true
  8: Passes: 6070, count: 78498, Valid: true
  9: Passes: 2759, count: 78498, Valid: true

It either performs very well, or it performs at 0.5X of very well. When running this 100 times or more, the variance among the 1X runs is low, as is the variance among the 0.5X runs.

Also on JDK8, just even more crazy

On JDK8 the program has better ”normal” performance, while the runs hit with the drop is at the same level as with newer JDKs.

  0: Passes: 6431, count: 78498, Valid: true
  1: Passes: 6421, count: 78498, Valid: true
  2: Passes: 2749, count: 78498, Valid: true
  3: Passes: 6381, count: 78498, Valid: true
  4: Passes: 2741, count: 78498, Valid: true
  5: Passes: 2749, count: 78498, Valid: true
  6: Passes: 6397, count: 78498, Valid: true
  7: Passes: 6415, count: 78498, Valid: true
  8: Passes: 2742, count: 78498, Valid: true
  9: Passes: 2745, count: 78498, Valid: true

On JDK7 all runs suffer the performance drop

0 - Passes: 2743, count: 78498, Valid: true
1 - Passes: 2743, count: 78498, Valid: true
2 - Passes: 2735, count: 78498, Valid: true
3 - Passes: 2748, count: 78498, Valid: true
4 - Passes: 2747, count: 78498, Valid: true
5 - Passes: 2746, count: 78498, Valid: true
6 - Passes: 2734, count: 78498, Valid: true
7 - Passes: 2745, count: 78498, Valid: true
8 - Passes: 2746, count: 78498, Valid: true
9 - Passes: 2751, count: 78498, Valid: true
10 - Passes: 2746, count: 78498, Valid: true
11 - Passes: 2732, count: 78498, Valid: true
12 - Passes: 2750, count: 78498, Valid: true
13 - Passes: 2753, count: 78498, Valid: true
14 - Passes: 2742, count: 78498, Valid: true
15 - Passes: 2747, count: 78498, Valid: true
16 - Passes: 2746, count: 78498, Valid: true
17 - Passes: 2746, count: 78498, Valid: true
18 - Passes: 2744, count: 78498, Valid: true
19 - Passes: 2750, count: 78498, Valid: true
20 - Passes: 2748, count: 78498, Valid: true
21 - Passes: 2743, count: 78498, Valid: true
22 - Passes: 2742, count: 78498, Valid: true
23 - Passes: 2746, count: 78498, Valid: true
24 - Passes: 2746, count: 78498, Valid: true
25 - Passes: 2747, count: 78498, Valid: true
26 - Passes: 2736, count: 78498, Valid: true
27 - Passes: 2738, count: 78498, Valid: true
28 - Passes: 2742, count: 78498, Valid: true
29 - Passes: 2742, count: 78498, Valid: true
30 - Passes: 2735, count: 78498, Valid: true
31 - Passes: 2745, count: 78498, Valid: true
32 - Passes: 2740, count: 78498, Valid: true
33 - Passes: 2737, count: 78498, Valid: true
34 - Passes: 2733, count: 78498, Valid: true
35 - Passes: 2748, count: 78498, Valid: true
36 - Passes: 2742, count: 78498, Valid: true
37 - Passes: 2745, count: 78498, Valid: true
38 - Passes: 2742, count: 78498, Valid: true
39 - Passes: 2749, count: 78498, Valid: true
40 - Passes: 2743, count: 78498, Valid: true
41 - Passes: 2743, count: 78498, Valid: true
42 - Passes: 2736, count: 78498, Valid: true
43 - Passes: 2744, count: 78498, Valid: true
... (snip, same picture with the remaining 56 runs)

(It could, of course, also be that on JDK7 there never kicks in a performance boost.)

Is it really about the BitSet?

Actually, I really don’t have enough information to know this. In fact so far I have seen other strange performance numbers using a regular array of booleans for the sieve storage. (Even if in this particular setup, the boolean-array performs super stable).

For me the mystery started with that a Clojure version of the sieve performed much better on my M1 Mac than on this Linux machine. (Both running in Docker). I have on my todo list to summarize the clues as I have seen so far. I’ll update the repository with reproductions of these other setups as I can make them minimal enough.

This one might, or might not, be related:

I’ve also speculated around that it might have to do with that the JIT compiler does something different between the two types of performance runs. There’s a commented line in run.sh that runs the sieve program with -XX:+PrintCompilation to print out JIT compiler logs. However, I am not savvy enough to make sense of the output. Which brings me to …

Experiment variants

I’ll keep this updated with the experiments I (or you?) conduct.

Help wanted

This mystery is haunting me. I’d love me some help how to minimize the reproduction further.

Please consider sending pull requests with repros/experiments, etcetera, and also feel encourage to use the Discussions forum on the repository.

Here is a Hacker News thread where I get some help with what I could test and how to try frame the problem: https://news.ycombinator.com/item?id=30226083