Correlating CPU usage from top output to a particular thread in the JVM

Posted . Visible to the public.

http://www-01.ibm.com/support/docview.wss?uid=swg21158192 Show archive.org snapshot
Technote (troubleshooting)

Problem(Abstract)

This document contains Information on how to correlate a process ID (PID) from top output to a thread in your javacore file.
Cause

This document describes how to find Java™ threads from an IBM JRE running on Linux that are consuming large amounts of CPU cycles. This document does not describe all of the possible reasons and causes for high CPU problems. It's possible that the high CPU usage is a Java heap problem or an operating system or "machine" problem. Before attempting to find the Java threads that might be consuming CPU, be sure to rule out operating system paging and JRE Heap problems.
Environment

Linux
Resolving the problem

There are a couple ways to find the Java threads consuming the CPU. The method to choose depends on the documentation collected at the time of the problem. The following example data was gathered as indicated in the MustGather: Performance, Hang or High CPU Issues on Linux technote when the JVM is consuming high CPU.

The linperf.sh script from this MustGather URL will collect top -H output as well as ps -eLf output. You need the javacore files and either the top -H output or the ps -eLf output.

The linperf.sh script takes one required parameter, the process ID (PID) of the WebSphere Java process causing the high CPU problem. Here's an example:
./linperf.sh 27050

The top.out flie is your standard top output. This file needs to be reviewed to see if the PID passed to the linperf.sh script is truly the process consuming the CPU.

Here's an example of the top output as collected by the linperf.sh script:
Tue May 12 14:21:34 EDT 2009

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
27050 root 16 0 1164m 223m 40m S 80.5 11.1 82:40.26 /opt/IBM/WebSphere6.1/AppServer/java/bin/java
26437 root 16 0 398m 156m 36m S 2.0 7.8 92:18.92 /opt/IBM/WebSphere6.1/AppServer/java/bin/java
17125 root 15 0 2960 924 704 R 2.0 0.0 0:00.01 top -bc -d 60 -n 5
1 root 16 0 2704 544 464 S 0.0 0.0 0:00.97 init [5]
2 root RT 0 0 0 0 S 0.0 0.0 0:00.09 [migration/0]

From this output, we see that in fact our suspected high CPU process, 27050, is consuming 80.5% of the CPU. The next step now would be to review the verbose garbage collection output in the native_stderr.log file to ensure that the garbage collection code for the Java heap is not causing this problem. That is not covered this here. Assume that this example is not a Java garbage collection problem.

Now that you know that the suspected process is consuming CPU, and this is not an operating system paging problem or Java GC problem, move on to reviewing more of the MustGather documentation that will list out the threads of our suspect process.

For newer versions of the Linux kernel on Red Hat Linux, the top -H is the best output to look at. The top -H output will be in the topdashH.out file. If the top -H fails on the Linux machine, it is because of a downlevel Procps package (the /proc file system utilities).

The topdashH.out file consists of series of top commands run over around four minutes of time. The big difference between the regular top output and the top -H output, is while the regular top output lists all of the processes running on the machine, the top -H output lists all of the threads from one particular process. (Normally this would be one of the WebSphere processes)

As with default top output, the threads from the process are sorted according to CPU usage with the highest consumers at the top. The topdashH.out file needs to be reviewed from top to bottom to see what thread(s) is(are) consuming the CPU over the four minutes this command ran.

Here's a top -H example:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
27052 root 18 0 1164m 223m 40m R 63.1 11.1 0:00.85 java
19511 root 20 0 1164m 223m 40m S 15.8 11.1 0:02.74 java
27050 root 16 0 1164m 223m 40m S 0.0 11.1 0:35.05 java
27053 root 17 0 1164m 223m 40m S 0.0 11.1 0:14.83 java
27054 root 15 0 1164m 223m 40m S 0.0 11.1 2:33.91 java
27056 root 16 0 1164m 223m 40m S 0.0 11.1 0:00.00 java

Except for the first process listed, 27052, this is a list of the threads from the WebSphere process, and not a list of processes. Thread 27052 is the actual process ID of the WebSphere Java process. The rest of the list are the threads. From this output, you see that thread 19511 is consuming a reported 15.8% of the CPU.

Convert 19511 to hex (4C37), and search the javacore files for this native thread id.

In this example, the javacore list this thread:
"WebContainer : 1" (TID:0x0933CB00, sys_thread_t:0x09EC4774, state:CW, native ID:0x00004C37) prio=5
at java/text/FieldPosition$Delegate.formatted(FieldPosition.java:291(Compiled Code))
at java/text/SimpleDateFormat.subFormat(SimpleDateFormat.java:1037(Compiled Code))
at java/text/SimpleDateFormat.format(SimpleDateFormat.java:836(Compiled Code))
at java/text/SimpleDateFormat.format(SimpleDateFormat.java:808(Compiled Code))
at com/ibm/ejs/ras/StreamEvent6.formatTimeAndThread(StreamEvent6.java:400(Compiled Code))
at com/ibm/ejs/ras/StreamEvent6.formatHeaderBasic(StreamEvent6.java:284(Compiled Code))
at com/ibm/ejs/ras/StreamEvent6.writeSelfToStream(StreamEvent6.java:132(Compiled Code))
at com/ibm/ejs/ras/SystemStream.doPrintLine(SystemStream.java:786(Compiled Code))
at com/ibm/ejs/ras/SystemStream.println(SystemStream.java:691(Compiled Code))
at TE01Servlet.p(TE01Servlet.java:106(Compiled Code))
at TE01Servlet.doLoop(TE01Servlet.java:94(Compiled Code))
at TE01Servlet.doPost(TE01Servlet.java:45)
at TE01Servlet.doGet(TE01Servlet.java:18)
at javax/servlet/http/HttpServlet.service(HttpServlet.java:743)
at javax/servlet/http/HttpServlet.service(HttpServlet.java:856)
at com/ibm/ws/webcontainer/servlet/ServletWrapper.service(ServletWrapper.java:966)
at com/ibm/ws/webcontainer/servlet/ServletWrapper.handleRequest(ServletWrapper.java:478)
at com/ibm/ws/wswebcontainer/servlet/ServletWrapper.handleRequest(ServletWrapper.java:463)
at com/ibm/ws/webcontainer/servlet/CacheServletWrapper.handleRequest(CacheServletWrapper.java:92)
at com/ibm/ws/webcontainer/WebContainer.handleRequest(WebContainer.java:744)
at com/ibm/ws/wswebcontainer/WebContainer.handleRequest(WebContainer.java:1425)
at com/ibm/ws/webcontainer/channel/WCChannelLink.ready(WCChannelLink.java:92)
at com/ibm/ws/http/channel/inbound/impl/HttpInboundLink.handleDiscrimination(HttpInboundLink.java:465)
at com/ibm/ws/http/channel/inbound/impl/HttpInboundLink.handleNewInformation(HttpInboundLink.java:394)
at com/ibm/ws/http/channel/inbound/impl/HttpInboundLink.ready(HttpInboundLink.java:274)
at com/ibm/ws/tcp/channel/impl/NewConnectionInitialReadCallback.sendToDiscriminators(NewConnectionInitialReadCallback.java:214)
at com/ibm/ws/tcp/channel/impl/NewConnectionInitialReadCallback.complete(NewConnectionInitialReadCallback.java:113)
at com/ibm/ws/tcp/channel/impl/AioReadCompletionListener.futureCompleted(AioReadCompletionListener.java:152(Compiled Code))
at com/ibm/io/async/AbstractAsyncFuture.invokeCallback(AbstractAsyncFuture.java:213(Compiled Code))
at com/ibm/io/async/AsyncChannelFuture$1.run(AsyncChannelFuture.java:163)
at com/ibm/ws/util/ThreadPool$Worker.run(ThreadPool.java:1498(Compiled Code))

For this example, we see that the javacore files show this thread looping somewhere in the java/text/SimpleDateFormat.format code. Review the thread(s) you found in all of the javacore files provided to see how and if the code stack moves through time.

Now to review the ps -eLf output. If the Linux Procps package is downlevel and we didn't get the top -H output, then you'll need to use the ps -eLf output. This output is in the file: ps.out.

This file is harder to use for a couple of reasons. First off, this file lists every thread of every process running on the Linux machine. So there can be a large amount of output to review. The second problem with this output is that it doesn't list the %CPU output. Rather the ps -eLf output gives us the CPU_TIME (listed as TIME). This is the amount of time this thread has used since the WebSphere process was started. Threads that are quickly consuming CPU time will have a high TIME value, but so will long running threads. So it helps to know what threads in the javacore files could be a problem. So then you can take these threads, and work backwards from a suspected thread in a javacore file to the ps output and thus verify if your suspects are truly consuming CPU.

For older 1.4.2 Java SDK's see this section:

Use the top output and the javacore files to find the threads consuming the CPU.

The following sample top output is for a high CPU condition:
PID USER PRI NI SIZE RSS SHARhey PatE STAT %CPU %MEM TIME COMMAND
19099 root 25 0 433M 433M 48484 R 99.9 12.3 20:07 /opt/Web...
19440 root 15 0 2160 2156 1500 R 1.9 0.0 0:00 top
1 root 15 0 512 512 444 S 0.0 0.0 0:04 init

From the sample top output, you can see that PID "19099" is utilizing 99.9% of the CPU. Locate the javacore file that was generated at the same time asyour top output. To find the thread, perform a search within the javacore file for the PID showing high utilization of the CPU. In this example it would be PID "19099".

Following is an extraction example from the javacore file that was collected at the same time as the top output:

"Thread-1161" (TID:0x16510C18, sys_thread_t:0x51E78568, state:R, native ID:0x51B437) prio=5
at java.io.RandomAccessFile.read
at java.io.RandomAccessFile.readInt
at jet.connect.DbStringColumnBuffer.readValueFromFile
at jet.connect.DbColumnBuffer.readFromFile
at jet.connect.DbPage.read
at jet.connect.DbPage.readFromFile
at jet.connect.DbBuffer.$GQ
at jet.connect.DbBuffer.getPageOfTheRow
at jet.connect.DbBuffer.refreshRecord
at jet.connect.DbRecord.refresh
at jet.connect.jdbms.JDbResultSet.nextRecord
at jet.server.schedule.CompletedTable.
getCompletedTaskRecordsByTaskID
at jet.server.jrserver.JRCompletedTaskTable.
getCompletedTaskRecordsByTaskID
at cats.rpt.JREntThread.updateTaskStatus
at cats.rpt.JREntThread.run

Native Stack of ""Thread-1161"" PID 19099
-------------------------
 pthread_sigmask at 4002A136 in libpthread.so.0
 sysRead at 4038F041 in libhpi.so
 JVM_Read at 4024EBBE in libjvm.so
 readSingle at 403C614D in libjava.so
 Java_java_io_RandomAccessFile_read at 403C5678 in
 libjava.so 457214B5

From this extraction, it is evident that "Thread-1161" is the cause for high CPU. You can now review the stack for the previous thread to find the cause of the problem.

Fabio Silva
Posted by Fabio Silva to Fabio Silva's deck (2018-02-20 12:36)