Confusion over Kubernetes out of memory with Java OOM

Recently I faced a situation where one of the services I work on in a Kubernetes cluster was getting restarted roughly once an hour. Checking the logs, unsurprisingly, didn’t lead to anything useful, except showing that the JVM process was suddenly killed with no further error. To make matters worse, the restarts seemed to happen at random: one time the container failed after five minutes, another time it ran for a couple of hours.

By checking the Kubernetes pod status, I found out that at some point the status had changed from Running to OOMKilled. Obviously, the first step I took to investigate the issue was to get a heap dump of the container. In Spring Boot it is possible to get a heap dump via the actuator’s /heapdump endpoint, provided it is not disabled.
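For reference, this is roughly how I pulled the status and the heap dump; the pod name and port are placeholders, and the exact actuator path depends on the Spring Boot version and on whether the heapdump endpoint is exposed in your setup:

# Look for "Reason: OOMKilled" under Last State in the container section
kubectl describe pod my-service-pod

# Forward the application's port and download the heap dump from the actuator
kubectl port-forward pod/my-service-pod 8080:8080
curl -o heap.hprof http://localhost:8080/actuator/heapdump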

Analyzing the heap dump didn’t shed any light on the issue. I repeated the analysis on several heap dumps, and each time I got more confused and couldn’t reach a sensible conclusion.

As a last resort, before giving up, I took a quick look at the pod configuration and saw this:

jvm:
  minHeapSize: 256m
  maxHeapSize: 256m

resources:
  cpu: 300m
  memory: 512Mi
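This looks like a Helm-style values snippet, and my assumption is that it boils down to a JVM started with a fixed 256 MB heap inside a container whose memory is capped at 512 MiB, something like:

# JVM started with a fixed 256 MB heap:
#   java -Xms256m -Xmx256m -jar service.jar
#
# Container capped by the pod spec:
resources:
  limits:
    cpu: 300m
    memory: 512Mi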

So basically, the pod runs with 512 MiB of RAM, half of which is reserved for the JVM heap. That rang a bell: Linux might have killed the process. To confirm my hypothesis I had a look at the system log with dmesg and found this:

[4368955.387661] Memory cgroup out of memory: Kill process 4883 (java) score 11 or sacrifice child
[4368955.393077] Killed process 4883 (java) total-vm:1925104kB, anon-rss:516924kB, file-rss:10852kB, shmem-rss:0kB
[4368955.966522] oom_reaper: reaped process 4883 (java), now anon-rss:0kB, file-rss:4kB, shmem-rss:0kB
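The numbers add up: anon-rss plus file-rss (516924 kB + 10852 kB = 527776 kB) is just above the 512 MiB (524288 kB) cgroup limit. A quick way to see the limit the kernel enforces from inside the container is shown below; the pod name is a placeholder and the paths assume cgroup v1 (they differ under cgroup v2):

kubectl exec my-service-pod -- cat /sys/fs/cgroup/memory/memory.limit_in_bytes
kubectl exec my-service-pod -- cat /sys/fs/cgroup/memory/memory.usage_in_bytes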

So what’s the problem?

Well, the Linux kernel has an out-of-memory management mechanism, AKA the OOM killer, which activates when the kernel faces a severely low memory condition (here, a memory cgroup hitting its limit). The kernel then literally kills a process, usually the one with the highest “badness” score, which tends to be the one using the most memory.
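If you are curious, you can inspect the score the killer ranks processes by. The PID below is the one from the dmesg output above and would have to be read on the node before the kill, so treat this purely as an illustration:

cat /proc/4883/oom_score       # the "badness" score the OOM killer ranks processes by
cat /proc/4883/oom_score_adj   # adjustment, set by the kubelet based on the pod's QoS class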

Solution

The simplest fix is to increase the pod memory limit (and, where the cluster allows it, enable swap). Other than that, if the application does not have a memory leak, there is not much to do except fine-tune the Linux kernel, and that’s out of the scope of this article.
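As an example of the first option, something along these lines gives the JVM headroom for its non-heap memory (metaspace, threads, GC structures) on top of the 256 MB heap. The exact value is a guess on my part and should be based on the service’s observed footprint:

jvm:
  minHeapSize: 256m
  maxHeapSize: 256m

resources:
  cpu: 300m
  memory: 1Gi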
