Curious Case of Increased System Calls

Sudheer Bhat
Posted on July 05, 2022

InMobi Exchange runs some really high requests-per-second (RPS) systems with extremely low latency requirements. We use async-profiler to periodically profile these high RPS systems to ensure that we haven't regressed or introduced sub-optimal code in the hot path.

During one such exercise in mid May 2022, we observed that the system calls started contributing to around 7% of the total CPU samples! This was a huge increase from the samples that were collected just a month earlier, during late March and early April.

Figure 1: System Calls in Profiling Report for April 2022

Figure 2: System Calls in Profiling Report for May 2022

The fallback to system calls in System.currentTimeMillis and System.nanoTime had suddenly started contributing ~6 to 7 percent of the total CPU used by our Java process.

Analysis

On Linux, a mechanism called vDSO allows functions such as gettimeofday and clock_gettime to execute in user space without entering the kernel. It is transparent to Java code, which meant a code change on our side was highly unlikely to be the cause. Suspecting the underlying kernel instead, we looked around our clusters and observed this behavior on the 5.4.0-1078-azure kernel version. We checked the behavior on the 5.4.0-1073-azure kernel version as well, and everything was fine there!

The following is a sample program used for testing the gettimeofday function.

#include <stddef.h>
#include <sys/time.h>

int main(void)
{
    struct timeval tv;
    for (int i = 0; i < 1000; i++) {
        /* glibc wrapped, shouldn't actually syscall */
        gettimeofday(&tv, NULL);
    }
    return 0;
}

Here is the strace output when this runs on the 5.4.0-1078-azure kernel, where the vDSO is not working. Note the gettimeofday system calls at the end of the trace.

sudo gcc vdsotest.c -o vdso && strace ./vdso
execve("./a.out", ["./a.out"], 0x7ffe0bd23dc0 /* 14 vars */) = 0
brk(NULL)                               = 0x5586dfff6000
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=36045, ...}) = 0
mmap(NULL, 36045, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7ffbfd7d7000
close(3)                                = 0
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\240\35\2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=2030928, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ffbfd7d5000
mmap(NULL, 4131552, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7ffbfd1c6000
mprotect(0x7ffbfd3ad000, 2097152, PROT_NONE) = 0
mmap(0x7ffbfd5ad000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1e7000) = 0x7ffbfd5ad000
mmap(0x7ffbfd5b3000, 15072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7ffbfd5b3000
close(3)                                = 0
arch_prctl(ARCH_SET_FS, 0x7ffbfd7d64c0) = 0
mprotect(0x7ffbfd5ad000, 16384, PROT_READ) = 0
mprotect(0x5586df99e000, 4096, PROT_READ) = 0
mprotect(0x7ffbfd7e0000, 4096, PROT_READ) = 0
munmap(0x7ffbfd7d7000, 36045)           = 0
gettimeofday({tv_sec=1653284709, tv_usec=747997}, NULL) = 0
gettimeofday({tv_sec=1653284709, tv_usec=748048}, NULL) = 0
gettimeofday({tv_sec=1653284709, tv_usec=748086}, NULL) = 0

Here is the strace output on the 5.4.0-1073-azure kernel, where the vDSO is working. No gettimeofday system calls appear; the program exits right after the loader's setup work.

sudo gcc vdsotest.c -o vdsotest && sudo strace ./vdsotest
execve("./vdsotest", ["./vdsotest"], 0x7ffe7f889110 /* 16 vars */) = 0
brk(NULL)                               = 0x55da84545000
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=31602, ...}) = 0
mmap(NULL, 31602, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fe64421d000
close(3)                                = 0
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\20\35\2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=2030928, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fe64421b000
mmap(NULL, 4131552, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fe643c0b000
mprotect(0x7fe643df2000, 2097152, PROT_NONE) = 0
mmap(0x7fe643ff2000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1e7000) = 0x7fe643ff2000
mmap(0x7fe643ff8000, 15072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fe643ff8000
close(3)                                = 0
arch_prctl(ARCH_SET_FS, 0x7fe64421c4c0) = 0
mprotect(0x7fe643ff2000, 16384, PROT_READ) = 0
mprotect(0x55da83b2b000, 4096, PROT_READ) = 0
mprotect(0x7fe644225000, 4096, PROT_READ) = 0
munmap(0x7fe64421d000, 31602)           = 0
exit_group(0)                           = ?
+++ exited with 0 +++

We looked at the Linux kernel code to reason about this behavior, and below is what we figured out.

Since Azure uses Hyper-V as the hypervisor, the file responsible for setting up the timer-related functionality is linux/hyperv_timer.c at v5.4 · torvalds/linux.

When the gettimeofday function is called, its vDSO definition is in linux/gettimeofday.c at v5.4 · torvalds/linux. It falls back to the system call if the check in question returns true.

linux/gettimeofday.h at v5.4 · torvalds/linux determines the clocksource, which decides whether the vDSO can be used. If the clocksource is not properly set, it returns U64_MAX, which results in the fallback.
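
As a quick sanity check (our own addition, not part of the kernel analysis above), the clocksource the kernel actually selected can be read from sysfs. Below is a minimal Java sketch, assuming the standard sysfs layout.

import java.nio.file.Files;
import java.nio.file.Path;

public class ClockSourceCheck {
    public static void main(String[] args) throws Exception {
        // The kernel exposes the currently active clocksource through sysfs.
        Path current = Path.of(
                "/sys/devices/system/clocksource/clocksource0/current_clocksource");
        System.out.println("current clocksource: " + Files.readString(current).trim());
    }
}

On a healthy Hyper-V guest we would expect a Hyper-V-specific value here, such as hyperv_clocksource_tsc_page; a different or missing value would be consistent with the vDSO falling back to the system call.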

Since we were convinced that it's an issue with the underlying kernel, we raised a support request with our cloud provider, Microsoft.

Latency Numbers

Now it was time to measure the latency impact. We used JMH for a basic microbenchmark. The measured latency is for System.nanoTime, and the unit of measure is nanoseconds.
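
The post doesn't include the benchmark source; the following is a minimal sketch of what such a JMH benchmark might look like. The class name, fork count, and warmup settings are assumptions, while the method name matches the getNanoTime entry in the results below.

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Warmup;

@BenchmarkMode(Mode.AverageTime)        // report average time per operation
@OutputTimeUnit(TimeUnit.NANOSECONDS)   // in ns/op, matching the tables below
@Warmup(iterations = 3)
@Measurement(iterations = 9)            // Cnt = 9 in the results below
@Fork(1)
public class NanoTimeBenchmark {

    @Benchmark
    public long getNanoTime() {
        // Returning the value keeps the JIT from dead-code-eliminating the call.
        return System.nanoTime();
    }
}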

Case 1: On a machine where vDSO works

Benchmark Mode Cnt Score Error Units
Benchmark.getNanoTime avgt 9 28.151 ± 0.219 ns/op

Case 2: On a machine where vDSO doesn't work

Benchmark Mode Cnt Score Error Units
Benchmark.getNanoTime avgt 9 537.079 ± 23.312 ns/op

Case 1, where the vDSO works, is roughly 20x faster!

The important question that arises here is: do we really need to care about something being ~20x slower when the latency is still measured in nanoseconds? Well, the answer is yes!

Let's take a dump of the system calls made on a faulty production box over a period of 5 seconds and then filter for the gettimeofday and clock_gettime calls.

# Get all the sysevents in last 5 seconds.
sudo perf stat -e 'syscalls:sys_enter_*' -a sleep 5
706702      syscalls:sys_enter_clock_gettime
1718430      syscalls:sys_enter_gettimeofday

# As expected, on the box where we have downgraded the kernel, both of these report 0.
# From the above benchmark, each such syscall from a Java context costs ~530 nanoseconds per operation.
jshell> (1718430 + 706702) * 530 / Math.pow(10,9)
$187 ==> 1.28531996   # That is a full 1.28 seconds every ~5 seconds!

# Had these been vDSO calls
jshell> (1718430 + 706702) * 28 / Math.pow(10,9)
$188 ==> 0.067903696 # ~70 millis.

# Net impact
jshell> 1.28531996 - 0.067903696
$189 ==> 1.2174162640000001 # ~1.21 second penalty!! 

Resolution

Partnering with Microsoft, we are currently addressing the issue in the affected kernels and are also looking for a short-term mitigation plan.

Conclusion and Takeaways

Regularly profile your high-impact, latency-sensitive applications in production. It might spring a surprise where you least expect it.

Tools like async-profiler make this really easy, with minimal overhead. The whole exercise helped us understand the vDSO mechanism a little better, along with the impact of it being misconfigured on a busy system.