Today we want to introduce a simple yet very valuable feature: Kernel and User Times graphs in the reports. It can significantly improve your analysis of how the garbage collector behaves.
During garbage collection, CPU time falls into two groups: time spent in the GC threads themselves (user) and time spent in OS system calls (kernel). Both kinds of work are done by multiple threads (unless the Serial collector or a non-parallel OldGen collector is in use), and the number of threads usually equals the number of available cores (unless the -XX:ParallelGCThreads=<value> flag is set).
Then, the real pause time can be roughly calculated like this:
(user_CPU + kernel_CPU) / ParallelGCThreads.
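As a rough illustration of the formula above, the sketch below reads the user, sys and real values from a timing suffix of the form [Times: user=... sys=..., real=... secs] and applies the estimate. It assumes the classic HotSpot log layout (as printed with -XX:+PrintGCDetails on JDK 8 and earlier). The names parse_times and estimate_pause are hypothetical helpers, not part of any GCPlot API, and the thread count must be supplied by the caller.

```python
import re

# Timing suffix in the classic HotSpot GC log format (an assumption about
# the log layout), e.g. "[Times: user=0.24 sys=0.00, real=0.03 secs]".
TIMES_RE = re.compile(
    r"\[Times: user=(?P<user>\d+\.\d+) sys=(?P<sys>\d+\.\d+), "
    r"real=(?P<real>\d+\.\d+) secs\]"
)


def parse_times(line):
    """Return {'user', 'sys', 'real'} in seconds, or None if no match."""
    match = TIMES_RE.search(line)
    if match is None:
        return None
    return {name: float(value) for name, value in match.groupdict().items()}


def estimate_pause(user_cpu, kernel_cpu, gc_threads):
    """Rough pause estimate: (user_CPU + kernel_CPU) / ParallelGCThreads."""
    if gc_threads < 1:
        raise ValueError("gc_threads must be at least 1")
    return (user_cpu + kernel_cpu) / gc_threads
```

Comparing the estimate with the logged real value gives a quick sanity check: a real time far above the estimate may suggest that the GC threads spent part of the pause waiting rather than working.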
The problem with analyzing only the real value is fundamental: you don’t know whether a given pause was caused by an inappropriate GC configuration or by OS delays and problems.
Consider the following example:
The Young pause in the red rectangle took ~508 ms, which stands out strongly from the rest. The question is: how did this happen? Why did the garbage collector get stuck here?
It turns out that this pause has nothing to do with the particular GC algorithm. Plotting the Kernel time shows that it was the root cause:
This is a good indicator that the issue lies in the underlying OS rather than in the JVM.
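The same reasoning can be applied to raw numbers instead of a graph. The sketch below is a minimal illustration, assuming each pause is available as a (user_cpu, kernel_cpu) pair in seconds. It computes the share of CPU time spent in the kernel and picks out the pauses where that share is high. The function names and the 0.5 threshold are assumptions chosen for illustration, not how GCPlot itself classifies pauses.

```python
def kernel_share(user_cpu, kernel_cpu):
    """Fraction of a pause's CPU time that was spent in the kernel."""
    total = user_cpu + kernel_cpu
    if total <= 0:
        # No CPU time recorded at all; treat as "no kernel involvement".
        return 0.0
    return kernel_cpu / total


def kernel_dominated(pauses, threshold=0.5):
    """Return the pauses whose kernel share is at or above the threshold.

    `pauses` is an iterable of (user_cpu, kernel_cpu) pairs in seconds.
    The 0.5 default is an arbitrary illustrative cut-off.
    """
    return [p for p in pauses if kernel_share(*p) >= threshold]
```

A pause returned by kernel_dominated is a candidate for investigation at the OS level (for example swapping or slow disk I/O) before touching any GC flags.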
We at the GCPlot GC Logs Analyzer platform consider this a major flaw of today’s approaches to GC log analysis, so we started supporting separate graphs for Kernel and User CPU usage. This opens up the possibility of much more detailed analysis at the level of individual pauses.
Apart from that, our platform can still produce tons of other graphs and statistics, which tell you nearly everything about how your GC flies.