
Taming JVM Memory on JDK 25 — Part 2: OOMs While the Heap Is Flat

Part 1 fixed the collector and sized the heap. This part covers what happens next — native memory leaks from glibc arenas, RSS vs heap diagnosis, and the fixes that actually work.

Michael Olsavsky
Software Engineer
April 21, 2026 · 11 min read

Part 1 fixed the collector and sized the heap. This part covers what happens next — the OOMs you get anyway.

Where Part 1 left us

In Part 1 we pinned -XX:+UseG1GC, set MaxRAMPercentage to something sane (70% for small containers), and flipped the pod to requests == limits for Guaranteed QoS. The payoff was exactly what we hoped for:

  • Heap usage became a boring flat line. jvm.memory.used sat at ~40–50% of -Xmx, stable across days.
  • GC logs went quiet — no full GCs, no long pauses, no allocation spikes.
  • p99 latency stopped spiking at random.

And then, a week later, we started getting OOMKilled anyway.

No heap pressure. No java.lang.OutOfMemoryError in the app log. GC fine. But the container's memory.working_set_bytes crept upward until it hit the limit, then the pod got a silent SIGKILL from the kernel. Restart, buy a few hours, repeat.

This is the class of bug Part 1 can't fix — because the memory leaking isn't heap.

Heap is not RSS

The kernel enforces the container memory limit against RSS (resident set size), not against -Xmx. RSS is everything the process has resident in physical memory: heap and metaspace, code cache, thread stacks, direct buffers, GC internal structures, JIT-compiled code, JNI allocations, and everything malloc() hands out under the JVM's feet. The canonical list is in the HotSpot GC Tuning Guide and the native-memory areas are tracked in detail by Native Memory Tracking.

When RSS drifts above heap for no obvious reason, the cause is almost always in that last bucket: native allocations owned by the C library, not the JVM. This is the exact failure mode documented across a decade of industry write-ups — DZone — JDK 17 Memory Bloat in Containers: A Post-Mortem, DZone — Troubleshooting Problems With Native (Off-Heap) Memory in Java Applications, Red Hat — Application uses a lot of memory under RHEL 6, and Brice Dutheil — Handling native memory fragmentation of glibc all describe the same pattern.

How to actually see it

The moment heap doesn't match RSS, stop looking at Java dashboards and start looking at the process from the outside.

1. Compare RSS to heap, continuously. Plot process.rss next to jvm.memory.committed. If the gap grows without bound, the leak is native. This is the single most useful chart for this class of bug.
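The Java side of that chart comes from your metrics exporter; the RSS side can be sampled straight from /proc. A minimal sketch, assuming Linux and bash (PID is a placeholder, point it at your java process):

```shell
# Print a process's resident set size in kB, read from the kernel's accounting.
# PID defaults to this shell for demonstration; set it to your JVM's pid.
PID=${PID:-$$}
rss_kb() { awk '/^VmRSS:/ {print $2}' "/proc/$1/status"; }
rss_kb "$PID"   # emit one sample; loop this with sleep to feed a dashboard
```

One sample per scrape interval next to `jvm.memory.committed` is all the chart needs.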

2. Turn on Native Memory Tracking. Start the JVM with -XX:NativeMemoryTracking=summary and diff over time (Oracle — Native Memory Tracking):

jcmd <pid> VM.native_memory baseline
# ... wait ...
jcmd <pid> VM.native_memory summary.diff

NMT accounts for every byte the JVM knows it allocated — heap, metaspace, threads, code, GC, internal. If NMT's "Total committed" is far smaller than RSS, the leak is happening outside what the JVM tracks. That's your signal to look at the allocator.
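That comparison can be scripted: pull the "Total … committed=" figure out of the NMT summary and subtract it from RSS. A sketch, assuming the summary format we saw on our JDKs, with `jcmd` on PATH and NMT enabled for the live-JVM usage:

```shell
# Extract the first committed=<N>KB figure (the "Total" line) from an
# NMT summary on stdin, printing the number of kB.
nmt_committed_kb() {
  grep -oE 'committed=[0-9]+KB' | head -n1 | tr -dc '0-9'
}

# Against a live JVM started with -XX:NativeMemoryTracking=summary:
#   rss=$(awk '/^VmRSS:/ {print $2}' "/proc/$PID/status")
#   nmt=$(jcmd "$PID" VM.native_memory summary | nmt_committed_kb)
#   echo "untracked native memory: $(( rss - nmt )) kB"
```

If that difference grows while both inputs are flat individually, something outside the JVM's books is allocating.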

3. Read /proc/<pid>/smaps. Look for a long tail of ~64 MB anonymous mappings. That's the fingerprint of glibc arenas, as the glibc manual and the Linux man page for mallopt(3) both describe.
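Counting those mappings is scriptable from /proc/&lt;pid&gt;/maps directly. A sketch, assuming Linux and bash; real arena mappings vary slightly around 64 MB, hence the 60–66 MB window:

```shell
# Count anonymous mappings sized like glibc arenas (~64 MB) for a pid.
arena_like_mappings() {
  local n=0 range perms off dev inode path start end sz
  while read -r range perms off dev inode path; do
    [ -n "$path" ] && continue                   # skip file-backed/named mappings
    start=${range%-*}; end=${range#*-}
    sz=$(( (16#$end - 16#$start) / 1048576 ))    # mapping size in MB
    if [ "$sz" -ge 60 ] && [ "$sz" -le 66 ]; then n=$((n + 1)); fi
  done < "/proc/$1/maps"
  echo "$n"
}
arena_like_mappings "$$"
```

A JVM suffering from arena fragmentation typically shows dozens of these, and the count climbs between samples.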

4. Use pmap -x and malloc_stats(). pmap shows the mapping layout; glibc's malloc_info(3) and malloc_stats(3) print arena counts and per-arena usage.

In our case, the gap was ~2 GB on a 4 GB container. NMT accounted for ~1.5 GB of native memory. The missing chunk was glibc's.

What we found: the glibc arena story

glibc's malloc is optimised for multithreaded programs by splitting its heap into multiple arenas. Threads hash to an arena, each arena has its own lock, and contention is low. The authoritative description is in the glibc source and the glibc MallocInternals wiki page; the tunable is documented in mallopt(3) as M_ARENA_MAX and its environment-variable equivalent MALLOC_ARENA_MAX.

The default cap is:

MALLOC_ARENA_MAX = 8 × number_of_CPUs        (on 64-bit systems)

Each arena reserves address space in 64 MB chunks, and — critically — glibc is very reluctant to return arena memory to the OS once it has been touched, even if the arena is mostly empty. Fragmentation across many arenas compounds this: a handful of live allocations can pin an entire 64 MB region. This behaviour, and the 64 MB per-arena sizing, is described in the Heroku — Tuning glibc Memory Behavior write-up and in the foundational Facebook Engineering — Scalable memory allocation using jemalloc post, which exists precisely because Facebook hit this on their own server fleet.

And here's the container-specific twist. glibc computes number_of_CPUs from sysconf(_SC_NPROCESSORS_ONLN) — i.e., the host's online CPU count, not the container's cgroup CPU quota. On a node with 64 logical CPUs, every JVM container, no matter how small, is sized for up to 8 × 64 = 512 arenas. At 64 MB per arena that's a theoretical ceiling of ~32 GB of native address space per process for fragmentation overhead alone.
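You can observe the mismatch from inside the container. A sketch, assuming bash and a cgroup-v2 mount at /sys/fs/cgroup (the v1 path differs):

```shell
# What glibc sees (host's online CPUs) vs what the container is allotted.
host_cpus=$(getconf _NPROCESSORS_ONLN 2>/dev/null || nproc)
if [ -r /sys/fs/cgroup/cpu.max ]; then
  read -r quota period < /sys/fs/cgroup/cpu.max
  if [ "$quota" = "max" ]; then
    cg_cpus=$host_cpus                  # no quota set; container sees the host
  else
    cg_cpus=$(( quota / period ))       # e.g. 200000/100000 -> 2 CPUs
  fi
else
  cg_cpus=$host_cpus                    # no v2 quota file; fall back to host count
fi
echo "glibc default arena cap: $(( 8 * host_cpus )); container CPU budget: ~$cg_cpus"
```

On a big node with a small quota, the first number dwarfs the second, and that gap is exactly the arena over-provisioning described above.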

You don't need to hit the ceiling for this to hurt — you just need enough threads bouncing across enough arenas for fragmentation to balloon past your container limit. This mismatch between glibc's CPU detection and the container's actual CPU budget is the root cause called out in every one of the write-ups above.

Our service was the textbook case: many Netty/Lettuce event loops (each with its own ClientResources instead of sharing one), the OpenTelemetry auto-instrumentation agent adding more instrumented allocation paths, and a 40-core host underneath a 2-core container. The arenas did exactly what they were designed to do. The container just couldn't afford it.

How to fix it

Three levers, in order of effort.

1. Cap MALLOC_ARENA_MAX

Set it as an env var on the container:

env:
  - name: MALLOC_ARENA_MAX
    value: "2"

The Cloud Foundry Java buildpack pins this to 2 by default for exactly this reason — their production incidents with container OOMs trace back to the same root cause. Heroku's guidance and Red Hat's KB article both recommend 2 as the starting point for containerised workloads.
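It's worth verifying the cap actually reached the process: the variable must be present at exec time, and exporting it into a running JVM's environment does nothing. A quick check, assuming Linux:

```shell
# Print the MALLOC_ARENA_MAX a process was started with, or "unset".
check_arena_env() {
  tr '\0' '\n' < "/proc/$1/environ" | grep '^MALLOC_ARENA_MAX=' || echo "unset"
}
check_arena_env "$$"
```

Running this against pid 1 inside the pod confirms the Deployment spec made it through to the JVM.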

Values of 2, 4, or 8 are all reasonable — benchmark under realistic concurrency. We landed on 4 after measuring p99 latency against 2, 4, 8, and unset on a Redis-heavy path:

| MALLOC_ARENA_MAX | Steady-state RSS | p99 latency |
| --- | --- | --- |
| unset (8 × 40 host CPUs) | OOMKilled within hours | — |
| 8 | ~3.2 GB | baseline |
| 4 | ~2.4 GB | +2% vs baseline |
| 2 | ~2.0 GB | +11% vs baseline |

For most Spring Boot services, the difference between 2 and 4 is noise and the step from unset to capped is where the real win lives. The DZone post-mortem JDK 17 Memory Bloat in Containers reports the same shape — "reduced native arena overhead from approximately 1.5GB to below 200MB" after capping arenas — and the Cloud Foundry buildpack issue thread has similar numbers from production apps.

2. Allocate less in the first place

A cap bounds the damage, but the real fix is reducing the number of native-allocation hotspots.

Share Lettuce ClientResources. The Lettuce reference guide is explicit: ClientResources is expensive to create and is intended to be shared. Each instance spins up its own EventLoopGroup, EventExecutorGroup, and DNS resolver. We had several ClientResources instances created implicitly by auto-configured beans; consolidating to one shared instance removed a pile of Netty thread pools and their arena affinity.

Audit the OpenTelemetry agent. The OpenTelemetry Java agent is convenient but allocates aggressively and introduces long-lived native buffers for exporters. Disable the instrumentations you don't use (-Dotel.instrumentation.<name>.enabled=false), prefer OTLP over legacy exporters, and consider the manual SDK if you only need a subset of signals. The agent's resource footprint is acknowledged directly in the project's performance notes.

Direct buffers and Netty pools. Cap -XX:MaxDirectMemorySize, and if you use Netty directly, set -Dio.netty.allocator.numDirectArenas / numHeapArenas explicitly rather than letting Netty pick based on host CPUs — Netty's PooledByteBufAllocator uses Runtime.availableProcessors() * 2 as its default, which has the same cgroup-blindness problem as glibc on some JVM/OS combinations.
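Put together, a container-friendly launch line might look like the following. The values are illustrative, not a recommendation; size them from your own direct-buffer usage and concurrency:

```shell
# Illustrative flags: bound direct memory and pin Netty's arena counts
# instead of letting either default off the host CPU count.
java -XX:MaxDirectMemorySize=256m \
     -Dio.netty.allocator.numDirectArenas=2 \
     -Dio.netty.allocator.numHeapArenas=2 \
     -jar app.jar
```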

3. Replace the allocator

If glibc still isn't behaving, swap it. jemalloc and tcmalloc are both drop-in replacements via LD_PRELOAD:

RUN apt-get update && apt-get install -y libjemalloc2
ENV LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2
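After deploying, confirm the preload actually took effect, since a wrong library path makes LD_PRELOAD fail silently and you fall back to glibc without noticing. One way to check, assuming Linux:

```shell
# Report which allocator library is mapped into a process, if any.
allocator_loaded() {
  grep -m1 -oE '(jemalloc|tcmalloc)[^ ]*' "/proc/$1/maps" || echo "glibc malloc"
}
allocator_loaded "$$"
```

Run it against the JVM's pid inside the container; if it prints "glibc malloc", the preload didn't land.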

Both are fundamentally better than glibc for long-running, many-threaded server processes: they fragment less, return memory to the OS more aggressively, and expose far better profiling tools.

  • jemalloc is designed to give an upper-bound fragmentation rate (jemalloc background docs). Set MALLOC_CONF=prof:true,prof_leak:true and you get heap profiles that point at the offending call sites — invaluable when you do have a true native leak rather than just fragmentation.
  • TCMalloc (gperftools) is Google's allocator, also drop-in via LD_PRELOAD, with similar fragmentation characteristics and its own heap-profiling tooling.

This is the path Facebook took with jemalloc at scale (Scalable memory allocation using jemalloc) and the path the Presto maintainers ultimately recommend (prestodb/presto#8993). LinkedIn's Venice team hit and fixed the same class of issue (Taming memory fragmentation in Venice with Jemalloc).

The JDK 25 angle

JDK 25 doesn't fix glibc — glibc is not in the JDK. But it does give you better tools for finding this class of bug:

  • NMT overhead is lower in JDK 25, making it cheaper to leave summary mode on in production.
  • jcmd System.map and System.dump_map, introduced in earlier versions and still evolving, give you a structured view of the process's virtual-memory layout without dropping to pmap (JDK jcmd reference).
  • The Consolidated JDK 25 Release Notes document further improvements to G1's own native-memory footprint (shared G1CardSet across co-evacuated regions), reducing the JVM's contribution to the non-heap bucket and making the glibc contribution easier to isolate.

None of that makes the arenas smaller. It just makes the bug findable.

Takeaway

Part 1 made the heap boring. Part 2 is about what remains visible in RSS once the heap is boring: the native-memory bucket the JVM doesn't own. If your container is OOMKilled while the heap is flat, the bug is almost never in your code or your -Xmx. It's in the gap between what the JVM tracks and what the kernel counts — and glibc, doing exactly what it was designed to do on a 64-core bare-metal box, is very often sitting in that gap with a pile of 64 MB arenas it has no intention of giving back.

Cap the arenas, share your client resources, and if you're still fighting it, switch the allocator. Combined with the Part 1 setup — G1, sane MaxRAMPercentage, Guaranteed QoS — you end up with a JVM that actually fits inside the box you rented for it.


#jvm #kubernetes #performance #java