Taming JVM Memory on JDK 25 — Part 2: OOMs While the Heap Is Flat
Part 1 fixed the collector and sized the heap. This part covers what happens next — native memory leaks from glibc arenas, RSS vs heap diagnosis, and the fixes that actually work.
Where Part 1 left us
In Part 1 we pinned -XX:+UseG1GC, set MaxRAMPercentage to something sane (70% for small containers), and flipped the pod to requests == limits for Guaranteed QoS. The payoff was exactly what we hoped for:
- Heap usage became a boring flat line. `jvm.memory.used` sat at ~40–50% of `-Xmx`, stable across days.
- GC logs went quiet — no full GCs, no long pauses, no allocation spikes.
- p99 latency stopped spiking at random.
And then, a week later, we started getting OOMKilled anyway.
No heap pressure. No java.lang.OutOfMemoryError in the app log. GC fine. But the container's memory.working_set_bytes crept upward until it hit the limit, then the pod got a silent SIGKILL from the kernel. Restart, buy a few hours, repeat.
This is the class of bug Part 1 can't fix — because the memory leaking isn't heap.
Heap is not RSS
The kernel enforces the container memory limit against RSS (resident set size), not against -Xmx. RSS is everything the process has resident in physical memory: heap and metaspace, code cache, thread stacks, direct buffers, GC internal structures, JIT-compiled code, JNI allocations, and everything malloc() hands out under the JVM's feet. The canonical list is in the HotSpot GC Tuning Guide and the native-memory areas are tracked in detail by Native Memory Tracking.
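The number the kernel actually enforces can be read straight from procfs, with no Java tooling in the loop. A minimal sketch, assuming Linux; the `parse_vmrss_kb` and `read_rss_kb` helpers are ours, not a standard API:

```python
import re

def parse_vmrss_kb(status_text: str) -> int:
    """Extract resident set size (kB) from the text of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("no VmRSS line found")
    return int(match.group(1))

def read_rss_kb(pid: str = "self") -> int:
    """RSS of a live process, as the kernel (and the OOM killer) counts it."""
    with open(f"/proc/{pid}/status") as f:
        return parse_vmrss_kb(f.read())
```

Polling this next to `jvm.memory.committed` is enough to draw the "gap" chart described below.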
When RSS drifts above heap for no obvious reason, the cause is almost always in that last bucket: native allocations owned by the C library, not the JVM. This is the exact failure mode documented across a decade of industry write-ups — DZone — JDK 17 Memory Bloat in Containers: A Post-Mortem, DZone — Troubleshooting Problems With Native (Off-Heap) Memory in Java Applications, Red Hat — Application uses a lot of memory under RHEL 6, and Brice Dutheil — Handling native memory fragmentation of glibc all describe the same pattern.
How to actually see it
The moment heap doesn't match RSS, stop looking at Java dashboards and start looking at the process from the outside.
1. Compare RSS to heap, continuously. Plot process.rss next to jvm.memory.committed. If the gap grows without bound, the leak is native. This is the single most useful chart for this class of bug.
2. Turn on Native Memory Tracking. Start the JVM with `-XX:NativeMemoryTracking=summary` and diff over time (Oracle — Native Memory Tracking):

```shell
jcmd <pid> VM.native_memory baseline
# ... wait ...
jcmd <pid> VM.native_memory summary.diff
```
NMT accounts for every byte the JVM knows it allocated — heap, metaspace, threads, code, GC, internal. If NMT's "Total committed" is far smaller than RSS, the leak is happening outside what the JVM tracks. That's your signal to look at the allocator.
3. Read /proc/<pid>/smaps. Look for a long tail of ~64 MB anonymous mappings. That's the fingerprint of glibc arenas, as the glibc manual and the Linux man page for mallopt(3) both describe.
4. Use `pmap -x` and `malloc_stats()`. `pmap` shows the mapping layout; glibc's `malloc_info(3)` and `malloc_stats(3)` (man page) print arena counts and per-arena usage.
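To make step 3 concrete, the 64 MB fingerprint can be counted mechanically. A rough sketch, assuming Linux and the `/proc/<pid>/maps` text format; `count_64mb_anon_regions` is our own helper. In practice each arena often shows up as a pair of adjacent mappings (a committed part plus a reserved part) that together span 64 MB, so treat the count as a fingerprint, not an exact arena tally:

```python
def count_64mb_anon_regions(maps_text: str, tolerance_mb: float = 4.0) -> int:
    """Count anonymous mappings close to glibc's 64 MB per-arena reservation.

    Parses /proc/<pid>/maps lines: '<start>-<end> <perms> <offset> <dev> <inode> [path]'.
    Anonymous regions have no pathname field.
    """
    count = 0
    for line in maps_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if len(fields) > 5:  # has a pathname -> file-backed, not an arena
            continue
        start, _, end = fields[0].partition("-")
        size_mb = (int(end, 16) - int(start, 16)) / (1024 * 1024)
        if abs(size_mb - 64) <= tolerance_mb:
            count += 1
    return count
```

Run it against `open(f"/proc/{pid}/maps").read()` before and after load; a count that climbs with thread activity while heap stays flat is the arena signature.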
In our case, the gap was ~2 GB on a 4 GB container. NMT accounted for ~1.5 GB of native memory. The missing chunk was glibc's.
What we found: the glibc arena story
glibc's malloc is optimised for multithreaded programs by splitting its heap into multiple arenas. Threads hash to an arena, each arena has its own lock, and contention is low. The authoritative description is in the glibc source and the glibc MallocInternals wiki page; the tunable is documented in mallopt(3) as M_ARENA_MAX and its environment-variable equivalent MALLOC_ARENA_MAX.
The default cap is:
MALLOC_ARENA_MAX = 8 × number_of_CPUs (on 64-bit systems)
Each arena reserves address space in 64 MB chunks, and — critically — glibc is very reluctant to return arena memory to the OS once it has been touched, even if the arena is mostly empty. Fragmentation across many arenas compounds this: a handful of live allocations can pin an entire 64 MB region. This behaviour, and the 64 MB per-arena sizing, is described in the Heroku — Tuning glibc Memory Behavior write-up and in the foundational Facebook Engineering — Scalable memory allocation using jemalloc post, which exists precisely because Facebook hit this on their own server fleet.
And here's the container-specific twist. glibc computes number_of_CPUs from sysconf(_SC_NPROCESSORS_ONLN) — i.e., the host's online CPU count, not the container's cgroup CPU quota. On a node with 64 logical CPUs, every JVM container, no matter how small, is sized for up to 8 × 64 = 512 arenas. At 64 MB per arena that's a theoretical ceiling of ~32 GB of native address space per process for fragmentation overhead alone.
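That arithmetic, written out as a sketch. The helper names are ours; the 8 × CPUs default and the 64 MB per-arena reservation come from the glibc documentation cited above:

```python
import os

ARENA_MB = 64  # glibc's per-arena reservation on 64-bit

def default_arena_cap(online_cpus: int, pointer_bits: int = 64) -> int:
    """glibc's default M_ARENA_MAX: 8 x CPUs on 64-bit, 2 x CPUs on 32-bit."""
    return (8 if pointer_bits == 64 else 2) * online_cpus

def arena_ceiling_mb(online_cpus: int) -> int:
    """Worst-case native address space glibc may reserve for arenas alone."""
    return default_arena_cap(online_cpus) * ARENA_MB

# What glibc sees: the host's online CPUs, not the cgroup quota.
host_cpus = os.sysconf("SC_NPROCESSORS_ONLN")
```

On the 64-CPU node from the example, `arena_ceiling_mb(64)` comes out to 32768 MB, regardless of how small the container's CPU quota is.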
You don't need to hit the ceiling for this to hurt — you just need enough threads bouncing across enough arenas for fragmentation to balloon past your container limit. This mismatch between glibc's CPU detection and the container's actual CPU budget is the root cause called out in every one of these write-ups:
- Cloud Foundry Java buildpack issue #320 — Tuning Glibc Environment Variables (`MALLOC_ARENA_MAX`) — and the follow-up PR #160 that pinned `MALLOC_ARENA_MAX=2` as the buildpack default.
- Presto issue #8993 — Consider lowering MALLOC_ARENA_MAX to prevent native memory OOM — the Facebook / Presto team hitting the same thing.
- Kubernetes issue #28290 — Google Container Engine and Java memory consumption — the GKE-specific manifestation.
- Broadcom / VMware KB — Java Application gets Out of Memory exit code 137 due to MALLOC_ARENA_MAX.
- Malt Engineering — Java in K8s: how we've reduced memory usage without changing any code.
- Arcesium — From Malloc to Jemalloc: Slashing Java Container Memory Usage by 10%.
Our service was the textbook case: many Netty/Lettuce event loops (each with its own ClientResources instead of sharing one), the OpenTelemetry auto-instrumentation agent adding more instrumented allocation paths, and a 40-core host underneath a 2-core container. The arenas did exactly what they were designed to do. The container just couldn't afford it.
How to fix it
Three levers, in order of effort.
1. Cap MALLOC_ARENA_MAX
Set it as an env var on the container:
```yaml
env:
  - name: MALLOC_ARENA_MAX
    value: "2"
```
The Cloud Foundry Java buildpack pins this to 2 by default for exactly this reason — their production incidents with container OOMs trace back to the same root cause. Heroku's guidance and Red Hat's KB article both recommend 2 as the starting point for containerised workloads.
Values of 2, 4, or 8 are all reasonable — benchmark under realistic concurrency. We landed on 4 after measuring p99 latency against 2, 4, 8, and unset on a Redis-heavy path:
| MALLOC_ARENA_MAX | Steady-state RSS | p99 latency |
|---|---|---|
| unset (8 × 40 host CPUs) | OOMKilled within hours | — |
| 8 | ~3.2 GB | baseline |
| 4 | ~2.4 GB | +2% vs baseline |
| 2 | ~2.0 GB | +11% vs baseline |
For most Spring Boot services, the difference between 2 and 4 is noise and the step from unset to capped is where the real win lives. The DZone post-mortem JDK 17 Memory Bloat in Containers reports the same shape — "reduced native arena overhead from approximately 1.5GB to below 200MB" after capping arenas — and the Cloud Foundry buildpack issue thread has similar numbers from production apps.
2. Allocate less in the first place
A cap bounds the damage, but the real fix is reducing the number of native-allocation hotspots.
Share Lettuce ClientResources. The Lettuce reference guide is explicit: ClientResources is expensive to create and is intended to be shared. Each instance spins up its own EventLoopGroup, EventExecutorGroup, and DNS resolver. We had several ClientResources instances created implicitly by auto-configured beans; consolidating to one shared instance removed a pile of Netty thread pools and their arena affinity.
Audit the OpenTelemetry agent. The OpenTelemetry Java agent is convenient but allocates aggressively and introduces long-lived native buffers for exporters. Disable the instrumentations you don't use (-Dotel.instrumentation.<name>.enabled=false), prefer OTLP over legacy exporters, and consider the manual SDK if you only need a subset of signals. The agent's resource footprint is acknowledged directly in the project's performance notes.
Direct buffers and Netty pools. Cap -XX:MaxDirectMemorySize, and if you use Netty directly, set -Dio.netty.allocator.numDirectArenas / numHeapArenas explicitly rather than letting Netty pick based on host CPUs — Netty's PooledByteBufAllocator uses Runtime.availableProcessors() * 2 as its default, which has the same cgroup-blindness problem as glibc on some JVM/OS combinations.
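Netty's sizing heuristic is roughly the following, paraphrased from `PooledByteBufAllocator`'s defaults. This is a sketch, not Netty's exact code: the default chunk size varies by Netty version (16 MB historically, smaller in recent 4.1.x releases), so the numbers are illustrative:

```python
def netty_default_arenas(available_processors: int,
                         max_memory_bytes: int,
                         chunk_size_bytes: int = 16 * 1024 * 1024) -> int:
    """Roughly how Netty picks numDirectArenas / numHeapArenas when unset:
    2 arenas per reported CPU, capped so pooled chunks stay under ~1/3 of memory."""
    min_num_arenas = available_processors * 2
    cap = max_memory_bytes // chunk_size_bytes // 2 // 3
    return max(0, min(min_num_arenas, cap))
```

On a runtime that reports the host's CPU count instead of the cgroup quota, the unset default scales with the host, which is the glibc failure mode all over again; setting the arena counts explicitly removes that dependency.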
3. Replace the allocator
If glibc still isn't behaving, swap it. jemalloc and tcmalloc are both drop-in replacements via LD_PRELOAD:
```dockerfile
RUN apt-get install -y libjemalloc2
ENV LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2
```
Both are fundamentally better than glibc for long-running, many-threaded server processes: they fragment less, return memory to the OS more aggressively, and expose far better profiling tools.
- jemalloc is designed to give an upper bound on fragmentation (jemalloc background docs). Set `MALLOC_CONF=prof:true,prof_leak:true` and you get heap profiles that point at the offending call sites — invaluable when you do have a true native leak rather than just fragmentation.
- TCMalloc (gperftools) is Google's allocator, also drop-in via `LD_PRELOAD`, with similar fragmentation characteristics and its own heap-profiling tooling.
This is the path Facebook took with jemalloc at scale (Scalable memory allocation using jemalloc) and the path the Presto maintainers ultimately recommend (prestodb/presto#8993). LinkedIn's Venice team hit and fixed the same class of issue (Taming memory fragmentation in Venice with Jemalloc).
The JDK 25 angle
JDK 25 doesn't fix glibc — glibc is not in the JDK. But it does give you better tools for finding this class of bug:
- NMT overhead is lower in JDK 25, making it cheaper to leave `summary` mode on in production.
- The jcmd commands `System.map` and `System.dump_map`, introduced in earlier versions and still evolving, give you a structured view of the process's virtual-memory layout without dropping to `pmap` (JDK `jcmd` reference).
- The Consolidated JDK 25 Release Notes document further improvements to G1's own native-memory footprint (shared `G1CardSet` structures across co-evacuated regions), reducing the JVM's contribution to the non-heap bucket and making the glibc contribution easier to isolate.
None of that makes the arenas smaller. It just makes the bug findable.
Takeaway
Part 1 made the heap boring. Part 2 is about what remains visible in RSS once the heap is boring: the native-memory bucket the JVM doesn't own. If your container is OOMKilled while the heap is flat, the bug is almost never in your code or your -Xmx. It's in the gap between what the JVM tracks and what the kernel counts — and glibc, doing exactly what it was designed to do on a 64-core bare-metal box, is very often sitting in that gap with a pile of 64 MB arenas it has no intention of giving back.
Cap the arenas, share your client resources, and if you're still fighting it, switch the allocator. Combined with the Part 1 setup — G1, sane MaxRAMPercentage, Guaranteed QoS — you end up with a JVM that actually fits inside the box you rented for it.
References
JDK documentation and tooling
- HotSpot Virtual Machine Garbage Collection Tuning Guide
- Native Memory Tracking
- `jcmd` reference
- Consolidated JDK 25 Release Notes
glibc and allocator documentation
- glibc MallocInternals wiki
- `mallopt(3)` — `M_ARENA_MAX` / `MALLOC_ARENA_MAX`
- `malloc_stats(3)` / `malloc_info(3)`
- `sysconf(3)` — `_SC_NPROCESSORS_ONLN` (host-level CPU count)
- jemalloc background and jemalloc.net
- TCMalloc (gperftools)
Canonical industry write-ups on this bug
- Cloud Foundry java-buildpack #320 — Tuning Glibc Environment Variables
- Cloud Foundry java-buildpack PR #160 — pin `MALLOC_ARENA_MAX=2`
- DZone — JDK 17 Memory Bloat in Containers: A Post-Mortem
- DZone — Troubleshooting Problems With Native (Off-Heap) Memory in Java Applications
- Red Hat — Application uses a lot of memory under RHEL 6
- Broadcom / VMware KB — Java Application gets Out of Memory exit code 137 due to MALLOC_ARENA_MAX
- Heroku — Tuning glibc Memory Behavior
- Presto #8993 — Consider lowering MALLOC_ARENA_MAX
- Facebook Engineering — Scalable memory allocation using jemalloc
- Kubernetes #28290 — Google Container Engine and Java memory consumption
- Brice Dutheil — Handling native memory fragmentation of glibc
- Malt Engineering — Java in K8s: reducing memory usage without changing any code
- Arcesium — From Malloc to Jemalloc: Slashing Java Container Memory Usage by 10%
- LinkedIn Engineering — Taming memory fragmentation in Venice with Jemalloc
Library references for the "allocate less" section