Help solve this Java BitSet performance-flicker mystery

Peter Strömberg 2022-02-05

UPDATE February 6, 2022: The behaviour seems to be triggered by that the JVM process sometimes gets scheduled on a different NUMA node from where the heap is allocated. I tested this by using this command to run the sieve:

numactl --cpunodebind=0 --membind=0 java PrimeSieve

On my first try this gave me the error message setting membind: Operation not permitted. The workaround I used was to run the docker container in privileged mode:

docker run --privileged --rm -it primes-java-bitset

I don’t know if the docker containers in the drag-racing competition are run like this, but I’ll figure that out eventually. If not, I will have to try see if there is a way to pin this that does not require root…

See:

CraigJPerry’s HN answer suggesting this, along with the rationale.
My experiment confirming that the theory seems spot on: test-runs/numa-pin

Some thoughts:

Now, why does the boolean-array variant of the sieve not suffer from this? Because the boolean array is not allocated on the heap?

This NUMA scheduling seems pretty important to consider when deploying JVM applications using Docker, right?

See also my follow up article: Be aware of NUMA when your application needs to run very fast

Orinal post below:

I am currently chasing a Java BitSet performance-eating ghost. Can’t say I am particularly close to catching it, but I have just managed to minimize the reproduction substantially. Even if it is not yet particularly minimal, it will have to do for now.

The Problem

When writing implementations of the Eratosthenes sieve in Clojure I started to notice strange performance behaviour:

Today I am investigating why my Clojure solution runs so fast on my Macs (an x64 and an M1), but so slow on the some of the benchmarking machines running Dave's drag-racing competition. It has brought me to @_bsless's Clojure Goes Fast profiler: https://t.co/rQUAMpWHPS
— Peter Strömberg aka PEZ (@pappapez) January 26, 2022

There are several strangeties, but one is easier to investigate than the others. A ”flicker” in the performance of the sieve when using Java’s BitSet. It manifests itself when it is run in Docker on som Linux Machines. Like when it is run in the Programming Languages drag-racing, as set up by Dave Plummer. Some of the times (more than half of the times) the sieve performs at only half of the performance it ”should” have.

For the purpose of the investigation, I could use my Agical freedoms to procure equipment, and I bought a machine that was reasonably similar to the drag-racing machines. An ASUS Mini-PC with Ryzen7. You can see the detailed specs at the repro repo.

I started to suspect it was not about Clojure, and decided to implement the sieve in Java in hopes of being able to reproduce it there. Which, a bit to my surprise, I actually could. I have created a repository for it.

The strange behaviour is not new, it is there also with Java 8 (actually a bit more pronounced, see below).

Reproduction

The repository has this main content:

PrimeSieve.java, a program that sieves limit numbers of numbers for primes, defaulting to 1 million
- The program starts with sieving the numbers as many times as it can for 5 seconds as a warmup.
- Then sieves the numbers for 5 seconds counting the number of passes.
- Then prints the number of passes, along with a sanity check that the sieve is correctly updated.
- The BitSet is recreated each pass.
run.sh, a script that will compile PrimeSieve.java and run it X times. X defaults to 10.
Dockerfile, a docker image definition based on openjdk:17 that when run will run run.sh then exit.
machine-pez-x64-ubuntu.md, the specs for a machine where I can repro the behaviour
test-runs, a directory with modified copies of the main program files + results from the runs (very WIP)

This is the code being run over and over again in the benchmark:

public PrimeSieve(int n) {
  this.n = n;
  int half_n = (n + 1) >> 1;
  this.sieve = new BitSet(half_n);
  this.sieve.set(0, half_n);
}

public void run() {
  int q = (int) Math.ceil(Math.sqrt(n));
  for (int p = 3; p < q; p += 2) {
    if (sieve.get(p >> 1)) {
      for (int i = (p * p) >> 1; i < n >> 1; i += p) {
        sieve.clear(i);
      }
    }
  }
}

To run the reproduction:

$ cd experiments/java-bitset-perf-flicker
$ docker build -t primes-java-bitset  . && docker run --rm -it primes-java-bitset

Sample output from my Mac M1:

  0: Passes: 3659, count: 78498, Valid: true
  1: Passes: 3623, count: 78498, Valid: true
  2: Passes: 3678, count: 78498, Valid: true
  3: Passes: 3684, count: 78498, Valid: true
  4: Passes: 3653, count: 78498, Valid: true
  5: Passes: 3663, count: 78498, Valid: true
  6: Passes: 3666, count: 78498, Valid: true
  7: Passes: 3651, count: 78498, Valid: true
  8: Passes: 3672, count: 78498, Valid: true
  9: Passes: 3682, count: 78498, Valid: true

Very stable performance, albeit not stellar.

Sample output from a Linux x64 machine I have:

  0: Passes: 6077, count: 78498, Valid: true
  1: Passes: 6059, count: 78498, Valid: true
  2: Passes: 2758, count: 78498, Valid: true
  3: Passes: 2751, count: 78498, Valid: true
  4: Passes: 2752, count: 78498, Valid: true
  5: Passes: 2768, count: 78498, Valid: true
  6: Passes: 2758, count: 78498, Valid: true
  7: Passes: 2760, count: 78498, Valid: true
  8: Passes: 6070, count: 78498, Valid: true
  9: Passes: 2759, count: 78498, Valid: true

It either performs very well, or it performs at 0.5X of very well. When running this 100 times or more, the variance among the 1X runs is low, as is the variance among the 0.5X runs.

Also on JDK8, just even more crazy

On JDK8 the program has better ”normal” performance, while the runs hit with the drop is at the same level as with newer JDKs.

  0: Passes: 6431, count: 78498, Valid: true
  1: Passes: 6421, count: 78498, Valid: true
  2: Passes: 2749, count: 78498, Valid: true
  3: Passes: 6381, count: 78498, Valid: true
  4: Passes: 2741, count: 78498, Valid: true
  5: Passes: 2749, count: 78498, Valid: true
  6: Passes: 6397, count: 78498, Valid: true
  7: Passes: 6415, count: 78498, Valid: true
  8: Passes: 2742, count: 78498, Valid: true
  9: Passes: 2745, count: 78498, Valid: true

On JDK7 all runs suffer the performance drop

0 - Passes: 2743, count: 78498, Valid: true
1 - Passes: 2743, count: 78498, Valid: true
2 - Passes: 2735, count: 78498, Valid: true
3 - Passes: 2748, count: 78498, Valid: true
4 - Passes: 2747, count: 78498, Valid: true
5 - Passes: 2746, count: 78498, Valid: true
6 - Passes: 2734, count: 78498, Valid: true
7 - Passes: 2745, count: 78498, Valid: true
8 - Passes: 2746, count: 78498, Valid: true
9 - Passes: 2751, count: 78498, Valid: true
10 - Passes: 2746, count: 78498, Valid: true
11 - Passes: 2732, count: 78498, Valid: true
12 - Passes: 2750, count: 78498, Valid: true
13 - Passes: 2753, count: 78498, Valid: true
14 - Passes: 2742, count: 78498, Valid: true
15 - Passes: 2747, count: 78498, Valid: true
16 - Passes: 2746, count: 78498, Valid: true
17 - Passes: 2746, count: 78498, Valid: true
18 - Passes: 2744, count: 78498, Valid: true
19 - Passes: 2750, count: 78498, Valid: true
20 - Passes: 2748, count: 78498, Valid: true
21 - Passes: 2743, count: 78498, Valid: true
22 - Passes: 2742, count: 78498, Valid: true
23 - Passes: 2746, count: 78498, Valid: true
24 - Passes: 2746, count: 78498, Valid: true
25 - Passes: 2747, count: 78498, Valid: true
26 - Passes: 2736, count: 78498, Valid: true
27 - Passes: 2738, count: 78498, Valid: true
28 - Passes: 2742, count: 78498, Valid: true
29 - Passes: 2742, count: 78498, Valid: true
30 - Passes: 2735, count: 78498, Valid: true
31 - Passes: 2745, count: 78498, Valid: true
32 - Passes: 2740, count: 78498, Valid: true
33 - Passes: 2737, count: 78498, Valid: true
34 - Passes: 2733, count: 78498, Valid: true
35 - Passes: 2748, count: 78498, Valid: true
36 - Passes: 2742, count: 78498, Valid: true
37 - Passes: 2745, count: 78498, Valid: true
38 - Passes: 2742, count: 78498, Valid: true
39 - Passes: 2749, count: 78498, Valid: true
40 - Passes: 2743, count: 78498, Valid: true
41 - Passes: 2743, count: 78498, Valid: true
42 - Passes: 2736, count: 78498, Valid: true
43 - Passes: 2744, count: 78498, Valid: true
... (snip, same picture with the remaining 56 runs)

(It could, of course, also be that on JDK7 there never kicks in a performance boost.)

Is it really about the BitSet?

Actually, I really don’t have enough information to know this. In fact so far I have seen other strange performance numbers using a regular array of booleans for the sieve storage. (Even if in this particular setup, the boolean-array performs super stable).

For me the mystery started with that a Clojure version of the sieve performed much better on my M1 Mac than on this Linux machine. (Both running in Docker). I have on my todo list to summarize the clues as I have seen so far. I’ll update the repository with reproductions of these other setups as I can make them minimal enough.

This one might, or might not, be related:

On some machines, this code runs 20% slower than the out-commented version, when repeatedly run for 5 secs with n=1M. n and sqrtN are instance variables, and the instance is freshly created each run. I have a theory for the performance diff. Do you? pic.twitter.com/V1z9wXFl8h
— Peter Strömberg aka PEZ (@pappapez) February 5, 2022

I’ve also speculated around that it might have to do with that the JIT compiler does something different between the two types of performance runs. There’s a commented line in run.sh that runs the sieve program with -XX:+PrintCompilation to print out JIT compiler logs. However, I am not savvy enough to make sense of the output. Which brings me to …

Experiment variants

I’ll keep this updated with the experiments I (or you?) conduct.

It’s not System.currentTimeMillis(): test-runs/timed-passes-10K
Running for 60s instead of 5 makes no difference: test-runs/w5-r60
No flicker when switching the BitSet for an array of booleans: test-runs/boolean-array
Pinning the JVM and the heap allocation to the same NUMA node: test-runs/numa-pin

Help wanted

This mystery is haunting me. I’d love me some help how to minimize the reproduction further.

Please consider sending pull requests with repros/experiments, etcetera, and also feel encourage to use the Discussions forum on the repository.

Here is a Hacker News thread where I get some help with what I could test and how to try frame the problem: https://news.ycombinator.com/item?id=30226083

Peter Strömberg

Peter Strömberg is an autodidact programming nerd living in Stockholm, Sweden. He has been all over the (coder) place for a long time. After some 20 years in the profession he made an attempt to leave it for another passion of his: Product. That attempt ended when Peter was introduced to Clojure and he felt how it gave him back all the joy of programming, and then some.

You can contract Peter for remote Clojure/ClojureScript gigs up to 80% of an FTE. The other 20% Agical pays him to spend doing open source work. Most often this is about maintaining Calva, a popular Visual Studio Code Clojure IDE extension which Peter created 2018.