[solved] XWiki 10.2 memory leak (was: XWiki 10.2 busy CPU)

OK, I’ve seen this problem a couple of times so far: XWiki 10.2 becomes unresponsive. Checking the server, I can see the CPU is running at 100% on an idle wiki.

  • The problem goes away after a restart.
  • I don’t know how to trigger the problem; it just happens.
  • I didn’t notice this problem in 10.1.

The next time it happens, what’s the best way for me to capture what’s going on so I can debug where the issue is?

2 Likes

Yes, we’ve had several such reports and we need to figure out what’s going on.

You’ll just need to take some Java thread dumps when the wiki is busy and send them to us for analysis, so that we can check what’s happening inside XWiki at that time.

See for example https://access.redhat.com/solutions/18178
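
For example, something along these lines against the Tomcat java process would do (<pid> is a placeholder; jstack ships with the JDK, and kill -3 makes the JVM print the dump into catalina.out instead of a separate file):

# find the pid of the Tomcat/XWiki java process
ps -ef | grep tomcat

# take a thread dump with jstack, repeating it a few times ~10 seconds apart
jstack -l <pid> > /tmp/threaddump-$(date +%s).txt

# alternative: ask the JVM itself to dump its threads to catalina.out
kill -3 <pid>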

Thanks, that would help a lot!

You might also want to follow https://forum.xwiki.org/t/xwiki-and-tomcat-crashes/2788/5 which probably describes a related issue.

1 Like

Thanks for the information. I’m watching the other issue and I’ll take the thread dump when my problem pops up next.

…and mission accomplished. I’ve managed to peg the CPU again in 10.2 and I’ve taken a thread dump: dump.txt (200.1 KB)

This may or may not be relevant to the problem; I had just run a query on the database to correct the MIME types of the DRF files I work with:
update xwikiattachment set xwa_mimetype = 'text/plain' where xwa_filename like '%.DRF' and (xwa_mimetype is NULL or xwa_mimetype = 'image/x-raw-kodak');

and then reloaded a page with those attachments.

I’ll test it again and see if I can reproduce the problem.

Did the same thing again; the busy CPU wasn’t triggered. Something else, then.

The thread dumps you posted (I found 6 of them) pretty much only contain threads passively or actively waiting. Are you sure you took them while the CPU was at 100%?

Using 100% of the CPU is OK if it does not stay like that for a long time and if it does not block everything else (some short-lived actions simply take all the CPU they can get, but that does not mean they have priority).

Yes, 6 thread dumps with some time spent between them.

The system’s dual CPUs were pegged at 100% running Java. XWiki itself became very slow for viewing pages, and unresponsive when trying to edit pages.

I didn’t wait this time to see how long it would take, but the last time this happened, it ran for several hours before I restarted tomcat8.

I’ll keep an eye out for when it happens again and get you a process dump so you can see how the CPU is behaving.
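
For reference, one way to tie the hot thread back to a dump would be something like this (a sketch; it assumes the JDK’s jstack is on the PATH, and <pid>/<tid> are placeholders):

# per-thread CPU usage inside the java process
top -H -p <pid>

# convert the id of the busiest thread to hex
printf '0x%x\n' <tid>

# find that thread in the dump via its native id (nid=0x...)
jstack -l <pid> | grep -A 20 "nid=<hex-tid>"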

Another busy event. Server was busy when I got into the office.

Output from ‘top’:

top - 14:24:37 up 11 days, 22:46,  2 users,  load average: 1.06, 1.04, 1.00
Tasks: 183 total,   1 running, 182 sleeping,   0 stopped,   0 zombie
%Cpu0  :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  1.0 us,  0.0 sy,  0.0 ni, 99.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  8175184 total,  3287252 free,  3044324 used,  1843608 buff/cache
KiB Swap:  8385532 total,  8385532 free,        0 used.  4787624 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
27812 tomcat8   20   0 5098280 2.794g  53568 S 100.0 35.8   2334:09 java
    1 root      20   0   37900   5980   4040 S   0.0  0.1   0:29.19 systemd
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.07 kthreadd
    3 root      20   0       0      0      0 S   0.0  0.0   0:14.27 ksoftirqd/0
    5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0H
    7 root      20   0       0      0      0 S   0.0  0.0   3:18.08 rcu_sched
    8 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcu_bh
    9 root      rt   0       0      0      0 S   0.0  0.0   0:00.35 migration/0
   10 root      rt   0       0      0      0 S   0.0  0.0   0:05.44 watchdog/0
   11 root      rt   0       0      0      0 S   0.0  0.0   0:05.42 watchdog/1
   12 root      rt   0       0      0      0 S   0.0  0.0   0:00.32 migration/1
   13 root      20   0       0      0      0 S   0.0  0.0   0:21.94 ksoftirqd/1
   15 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/1:0H
   16 root      20   0       0      0      0 S   0.0  0.0   0:00.00 kdevtmpfs
   17 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 netns

Output from ‘vmstat 3’:

root@wiki:~# vmstat 3
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 3287908 245548 1598060    0    0     0    10    0   20  7  0 92  0  0
 1  0      0 3287420 245548 1598608    0    0     0    13  395  265 50  1 50  0  0
 1  0      0 3287388 245548 1598704    0    0     0     0  382  266 50  0 50  0  0
 1  0      0 3287388 245548 1598704    0    0     0     5  379  243 50  0 50  0  0
 1  0      0 3287324 245548 1598704    0    0     0     0  369  239 50  0 50  0  0
 1  0      0 3287420 245548 1598704    0    0     0     9  402  266 50  0 50  0  0
 1  0      0 3287420 245548 1598704    0    0     0     0  402  288 50  0 50  0  0
 1  0      0 3287356 245548 1598704    0    0     0     0  398  287 50  0 50  0  0
 1  0      0 3287356 245548 1598704    0    0     0     0  381  243 50  0 50  0  0
 1  0      0 3287484 245548 1598704    0    0     0     0  386  254 50  1 50  0  0
 1  0      0 3287420 245552 1598704    0    0     0   201  394  266 50  0 50  0  0
 1  0      0 3287388 245552 1598704    0    0     0    15  364  223 50  0 50  0  0
 1  0      0 3287356 245552 1598704    0    0     0     0  408  285 50  0 50  0  0
 1  0      0 3287420 245552 1598704    0    0     0     4  389  269 50  0 50  0  0
 1  0      0 3287420 245552 1598704    0    0     0     0  387  270 50  0 50  0  0
 1  0      0 3287452 245552 1598704    0    0     0     0  385  242 50  0 50  0  0
 1  0      0 3287452 245552 1598704    0    0     0     8  391  247 50  0 50  0  0
 1  0      0 3287484 245552 1598704    0    0     0     0  368  222 50  0 50  0  0
 1  0      0 3287420 245552 1598704    0    0     0     0  395  273 50  0 49  0  0

Attached is the contents of the catalina.out with 6 consecutive thread dumps: dump.txt (2.2 MB)

The first thread dump starts at line 11510 and begins with “Full thread dump Java”.

Also, this time I noticed I am getting out-of-memory errors, which I didn’t have (or didn’t notice) before. Unfortunately, the VM was unresponsive so I couldn’t debug to see where the memory was going.

JAVA_OPTS="-server -Djava.awt.headless=true -Xmx2048m -XX:+UseConcMarkSweepGC -Djava.net.preferIPv4Stack=true"

I’ve increased the Xmx to 3072m and we’ll see if it happens again.
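
For reference, the adjusted line would look something like this (only -Xmx changed from the settings above):

JAVA_OPTS="-server -Djava.awt.headless=true -Xmx3072m -XX:+UseConcMarkSweepGC -Djava.net.preferIPv4Stack=true"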

You might want to enable an automatic heap dump in case of OOM. See https://dzone.com/articles/memory-analysis-how-to-obtain-java-heat-dump for example.
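
A minimal sketch of the flags, assuming a Java 8 / Tomcat setup (the dump path is just an example and must be writable by the tomcat8 user):

JAVA_OPTS="$JAVA_OPTS -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/tomcat8"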

Enabled. Let’s see what develops. Thanks.

1 Like

Another busy-CPU event. Most of the memory (3 GB) is in use. I had to restart the JVM again after just over 24 hours.

Here is the full JavaMelody output (5.4 MB PNG) for the JVM at the moment I was having the problem. And here is the memory histogram (139 KB). Oh, and don’t forget the full heap dump (.hprof.bz2 compressed, 712 MB).

Can anyone suggest a real-time debugger that I could connect to the JVM so I can monitor what is happening at the moment I’m having a problem?

At the moment, I don’t know if I’m dealing with one problem or many problems disguised as one.

The problem is that live-debugging a JVM which is in an unusable state is not easy, especially if you got an OOM, since most Java debuggers need the VM to send them information. In any case, the memory dump should give valuable information, at least on the memory side.

On the CPU side, one thing I have not tried yet is setting up Glowroot in collector mode (instead of being accessed live like JavaMelody, it sends everything to a configured collector instance which is not affected by whatever brought the XWiki VM down, so you don’t get live information, but it’s as recent as you can get). Now, if you got an OOM, you can generally forget about everything else, since the rest is often just a consequence of it.

I quickly looked at your memory dump: most of the RAM is used by the document cache, which contains 22721 documents. You might want to significantly reduce the value you set in xwiki.cfg (property xwiki.store.cache.capacity) :slight_smile:

[edit: I was looking at the wrong instance]
My settings are:

#-# Maximum number of documents to keep in the cache.
#-# The default is 500.
xwiki.store.cache.capacity=1000

Also, I don’t believe the wiki has 22K documents to cache.

Actually, after looking a bit more at the details, the root document cache contains far fewer documents than the 22k instances of XWikiDocument found in the cache overall.

It looks like a lot of attachment instances retain references to a lot of different documents. It’s the first time I’ve seen this, so maybe it comes from some fishy manipulation of attachments (possibly triggered by an XWikiAttachment instance being moved from one XWikiDocument instance to another several times), but I will see if there is a way to bulletproof this a bit more so we don’t end up in this situation.

Is there any way I can debug this further for you?

My JVM runs three XWiki instances, only one of which has more than a few documents stored in it. Another has a few, and the last is just a default install.

[edit] I’ve enabled JMX monitoring so I can use VisualVM to connect to the JVM and get a better idea of what it is doing.
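
For reference, enabling remote JMX typically means adding flags along these lines (the port is arbitrary, and disabling authentication/SSL like this is only sensible on a trusted network):

JAVA_OPTS="$JAVA_OPTS -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=9010 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false"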

I’m onto something; I will make a snapshot jar available soon.

https://jira.xwiki.org/browse/XWIKI-15181

Datapoint: Editing pages seems to cause the problem fairly quickly. In the last hour, I’ve edited about 20 “app in minutes” pages multiple times and I’m running out of memory again.

The CPU busyness is the JVM GC going nuts trying to get some memory back. Screenshot from VisualVM monitoring.
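
To confirm this from logs as well, GC logging can be turned on with Java 8-era flags along these lines (the log path is just an example):

JAVA_OPTS="$JAVA_OPTS -Xloggc:/var/log/tomcat8/gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps"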

Here is a patched version of the xwiki-platform-legacy-oldcore jar: http://maven.xwiki.org/snapshots/org/xwiki/platform/xwiki-platform-legacy-oldcore/10.2.1-SNAPSHOT/xwiki-platform-legacy-oldcore-10.2.1-20180411.125748-6.jar

It should greatly reduce the chances of hitting this issue.
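
A minimal sketch of swapping the jar in, assuming a Debian-style tomcat8 install with the webapp under /var/lib/tomcat8/webapps/xwiki and an existing jar named xwiki-platform-legacy-oldcore-10.2.jar (both are assumptions; adjust the paths and jar name to your layout):

service tomcat8 stop
cd /var/lib/tomcat8/webapps/xwiki/WEB-INF/lib
# move the old legacy-oldcore jar out of the webapp (name/location assumed)
mv xwiki-platform-legacy-oldcore-10.2.jar /tmp/
wget http://maven.xwiki.org/snapshots/org/xwiki/platform/xwiki-platform-legacy-oldcore/10.2.1-SNAPSHOT/xwiki-platform-legacy-oldcore-10.2.1-20180411.125748-6.jar
service tomcat8 start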

1 Like