The story starts with a message on our workplace chat:
“Our k8s master servers are rebooting randomly every day!”
This announcement had a good side and a bad side. On the good side, our clusters handled this unfortunate event without problems, even when two out of five servers were rebooting at the same time. On the bad side, we didn’t receive an alert, and we had absolutely zero information about what could possibly cause this. We were flying blind.
So we started the investigation by looking at system statistics. Unfortunately, the polling interval for those statistics was too long, and as a result the statistics hid valuable information.
We still didn’t know the cause of the restarts, but we found that the following two kernel settings were in place:
kernel.hung_task_panic = 1
kernel.softlockup_panic = 1
Those settings instruct the kernel to panic when a task hangs or when a soft lockup occurs. Furthermore, we found out that the kernel automatically reboots after a panic, because we have kernel.panic = 70 in our sysctl settings: this reboots the machine 70 seconds after a panic. The combination of triggering a panic and auto-rebooting prevented us from capturing a kernel crash dump.
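As a sketch, disabling the automatic reboot is a one-line sysctl change (the file name below is a hypothetical example):

```
# /etc/sysctl.d/99-panic-debug.conf (hypothetical file name)
# 0 disables the automatic reboot after a panic, so the crash
# output stays on the console long enough to be captured
kernel.panic = 0
```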
The above two observations explained why we were flying blind. Thus, we switched to per-second polling with the sar tool in order to get finer-grained system statistics, and we disabled the automatic reboot so that we could actually capture the kernel messages.
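On systems like ours, sysstat’s collector is driven from cron. A sketch of a per-second collection schedule (the path to sa1 varies by distribution) could look like this:

```
# /etc/cron.d/sysstat (sketch; sa1's path varies by distribution)
# collect 60 one-second samples every minute, instead of the
# default of one sample every 10 minutes
* * * * * root /usr/lib64/sa/sa1 1 60
```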
The first crash came within a few days. We saw an incomplete crash dump on the console, and on other occasions nothing at all. A coworker informed us that this behavior is the result of a missing nomodeset kernel boot parameter. We added it and rebooted all our servers with this setting in place.
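For reference, a kernel boot parameter like nomodeset is typically added through the GRUB defaults; a sketch, assuming a GRUB 2 based setup:

```
# /etc/default/grub: append nomodeset to the existing kernel command line
# (the "..." stands for whatever parameters are already there)
GRUB_CMDLINE_LINUX="... nomodeset"

# then regenerate the GRUB configuration and reboot
grub2-mkconfig -o /boot/grub2/grub.cfg
```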
In order to be sure that we didn’t miss any valuable information again, we decided to configure netconsole, a kernel module that logs kernel messages over UDP to a remote syslog server. This configuration ensures that we get the kernel crash dump at a remote endpoint.
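A minimal netconsole setup can be sketched as module configuration; the addresses, interface and MAC address below are placeholders:

```
# /etc/modules-load.d/netconsole.conf: load the module at boot
netconsole

# /etc/modprobe.d/netconsole.conf: where to send kernel messages
# format: <src-port>@<src-ip>/<dev>,<dst-port>@<dst-ip>/<dst-mac>
options netconsole netconsole=6665@192.0.2.10/eth0,514@192.0.2.20/00:53:00:aa:bb:cc
```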
To further ensure that we captured kernel crash dumps, we opened a few remote console sessions from a tmux session running on a server. The scrollback buffer of the tmux panes was large enough that we could keep those console sessions running for a few days without losing anything.
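The relevant tmux knob here is the pane scrollback depth; a sketch (the exact limit is a matter of taste and available memory):

```
# ~/.tmux.conf: keep a deep scrollback so days of console output survive
set-option -g history-limit 500000
```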
Another crash came in, and this time we had a full call trace.
In that trace, we could see that a user space process, kube-apiserver (a Kubernetes daemon that runs on the master servers), was involved, and that the kernel issued a panic because a soft lockup occurred.
Another server crashed with a different call trace. This trace indicated that a network kernel driver timed out for 22 seconds; here is the beginning of that trace:
bnx2x: [bnx2x_clean_tx_queue:1205(eth0)]timeout waiting for queue: txdata->tx_pkt_prod(19537) != txdata->tx_pkt_cons(19447)
bnx2x: [bnx2x_state_wait:310(eth0)]timeout waiting for state 9
watchdog: BUG: soft lockup - CPU#7 stuck for 22s! [kube-apiserver:63429]
Why 22 seconds? In our kernel settings, we have kernel.watchdog_thresh = 10, and the kernel waits twice that threshold before it reports a soft lockup.
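The arithmetic can be checked directly; the extra couple of seconds in the “stuck for 22s” report come from detection latency:

```shell
# the soft lockup is reported after 2 * kernel.watchdog_thresh seconds;
# with kernel.watchdog_thresh = 10 that is 20 seconds
awk 'BEGIN { thresh = 10; print 2 * thresh }'   # prints 20
```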
The fact that we had two different call traces complicated the situation, but we didn’t give up just yet!
The per-second system statistics revealed something very interesting. Going through all system metrics with the sar tool, we found that CPU utilization in system (kernel) space jumped to 100% just a few seconds before the crash, and only on the CPU mentioned in the crash:
sar -f sa14 -s 08:12:00 -P 41      <-- this is the CPU mentioned in the crash
Linux 4.14.32-1.el7.x86_64    04/14/2018    _x86_64_    (56 CPU)

08:12:00 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
08:12:01 AM      41      0.00      0.00      0.00      0.00      0.00    100.00
08:12:02 AM      41      0.99      0.00      0.99      0.00      0.00     98.02
08:12:03 AM      41      1.00      0.00      0.00      0.00      0.00     99.00
08:12:04 AM      41      0.00      0.00      0.00      0.00      0.00    100.00
08:12:05 AM      41      0.00      0.00      0.00      0.00      0.00    100.00
08:12:06 AM      41      0.00      0.00      0.00      0.00      0.00    100.00
08:12:07 AM      41      0.99      0.00      0.99      0.00      0.00     98.02
08:12:08 AM      41      0.00      0.00      0.00      0.00      0.00    100.00
08:12:09 AM      41      1.00      0.00      0.00      0.00      0.00     99.00
08:12:10 AM      41      0.00      0.00      0.00      0.00      0.00    100.00
08:12:11 AM      41      0.00      0.00      0.00      0.00      0.00    100.00
08:12:12 AM      41      1.00      0.00      0.00      0.00      0.00     99.00
08:12:13 AM      41      0.00      0.00      0.00      0.00      0.00    100.00
08:12:14 AM      41      0.00      0.00      0.00      0.00      0.00    100.00
08:12:15 AM      41      0.00      0.00      0.99      0.00      0.00     99.01
08:12:16 AM      41      1.00      0.00      0.00      0.00      0.00     99.00
08:12:17 AM      41      1.00      0.00      0.00      0.00      0.00     99.00
08:12:18 AM      41      0.00      0.00      0.00      0.00      0.00    100.00
08:12:19 AM      41      0.00      0.00      1.00      0.00      0.00     99.00   <<---
08:12:20 AM      41      0.99      0.00     54.46      0.00      0.00     44.55   <<--- CPU spike
08:12:21 AM      41      0.00      0.00    100.00      0.00      0.00      0.00   <<---
08:12:22 AM      41      0.00      0.00    100.00      0.00      0.00      0.00
08:12:23 AM      41      0.00      0.00    100.00      0.00      0.00      0.00
08:12:24 AM      41      0.00      0.00    100.00      0.00      0.00      0.00
08:12:25 AM      41      0.00      0.00    100.00      0.00      0.00      0.00
08:12:26 AM      41      0.00      0.00    100.00      0.00      0.00      0.00
08:12:27 AM      41      0.00      0.00    100.00      0.00      0.00      0.00
08:12:28 AM      41      0.00      0.00    100.00      0.00      0.00      0.00
08:12:29 AM      41      0.00      0.00    100.00      0.00      0.00      0.00
08:12:30 AM      41      0.00      0.00    100.00      0.00      0.00      0.00
08:12:31 AM      41      0.00      0.00    100.00      0.00      0.00      0.00
08:12:32 AM      41      0.00      0.00    100.00      0.00      0.00      0.00
That was a useful piece of information, as it indicated that the system entered some kind of loop that exhausted CPU resources, so any task running on that CPU hung. We examined all other system resources (memory, I/O, etc.) prior to the crash and found only the spike in CPU utilization and an increase in the run queue, which is a symptom of CPU exhaustion.
We then looked at all the system and application logs but we didn’t find anything that could explain this behavior.
We have a diverse hardware profile for our Kubernetes master servers: Dell and HP machines with different network cards. That made a hardware or firmware problem unlikely. Needless to say, we still went through the manual process of confirming that we were running the latest firmware for the cards, the BIOS and other devices. Moreover, we have servers in our Distributed Load Balancing service that run the same kernel, but we hadn’t observed any crashes there. So we thought the issue could be related to the type of workload we run on the master servers of our Kubernetes clusters.
We know from experience that the RAID controller or a disk can cause a crash when the trace contains calls related to the filesystem or disks. In our case, however, all the call traces contained calls related to memory and the scheduler, ruling out the storage layer.
In our traces, the kernel reported that user space processes were hanging, but we also saw a few traces where items in network queues were stuck. So we concluded that the kernel and the user space processes were suffering from the same issue: something was causing them to hang. What was it?
We tried (sometimes unsuccessfully) to resolve the addresses in the call traces to symbols, and we saw recent changes (in kernel version 4.12) in that code area, leading us to believe it might be a kernel bug. However, our confidence in this theory was low, as we don’t understand the internals of the kernel.
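As a sketch of what resolving those addresses involves: each frame in a call trace has the form symbol+0xoffset/0xlength, which can be split apart and then looked up in /proc/kallsyms or System.map, or, with debuginfo available, fed through the kernel’s scripts/decode_stacktrace.sh. A minimal parsing example (the frame below is hypothetical):

```shell
# split a hypothetical trace frame into symbol and offset;
# the symbol can then be looked up in /proc/kallsyms or System.map
frame='fsnotify_put_mark+0x1a/0x40'
echo "$frame" | awk -F'[+/]' '{ print "symbol:", $1, "offset:", $2 }'
# -> symbol: fsnotify_put_mark offset: 0x1a
```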
So we reached the point of saying it must be a kernel issue, even though we had no hard evidence beyond the kernel crashes themselves and could not be 100% sure that the kernel was the cause. We were running kernel version 4.14 and decided to roll back to kernel version 4.9 as a first step towards mitigating the issue. In parallel, we informed the kernel developers about the two different kernel crashes we had seen; see the relevant threads: “Kernel 4.14.x crash around fsnotify_mark_connector” and “Kernel panics with 4.14.X versions”.
Several people reported similar crashes, and a kernel bug was filed. In the meantime, kernel version 4.9 was running fine on our Kubernetes master servers, so the mitigation plan worked.
We were surprised to receive responses to our e-mail threads within a day. The kernel developers suggested that we enable more debugging and also test kernel version 4.16. We followed their suggestions and waited for the first crash to occur. After two days we received a patch from a Google engineer. Unfortunately, the patch arrived just a few hours before I left for vacation, so I couldn’t test it right away, but it was included in the next release of kernel version 4.14. Luckily, I was able to test the patched kernel after my vacation, and I only needed to wait two to three days without a crash to confirm the fix.
An important thing to remember here: we received a patch within two days, and within ten days the patch was included in the long-term maintenance kernel version 4.14! Amazing!
Hunting kernel bugs does NOT require a deep understanding of kernel internals. It only requires that you follow standard troubleshooting techniques and communicate the issue clearly to the right people.
During this troubleshooting, we also saw kernel errors during the boot process. At Booking.com, we have a zero-errors policy for our systems, so we decided to address those errors as well. We contacted the maintainers of the kernel subsystem that reported the error; see the “4.14.x yields errors for EDAC” thread. A developer replied to us within a day, and after several e-mail exchanges we found the issue.