File handle exhaustion: Xwiki issue with LDAP or LDAP issue possibly related to Xwiki

I upgraded from Xwiki 10.11.2 to 10.11.3 (LTS) two days ago. The server is running Debian 9 and I’m using the Xwiki Debian packages.

After one day (yesterday), OpenLDAP running on this server (and also used by Xwiki) failed due to file handle exhaustion. As I needed the system to work again, I restarted OpenLDAP, which solved the immediate problem.

Today we ran into the same issue and I took some information from the system before restarting slapd again. This showed thousands of established TCP connections between the processes java and slapd in “lsof”.
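For anyone who wants to track this over time rather than eyeballing “lsof”, here is a rough Python sketch that counts established connections per process from saved lsof output. It assumes the output comes from something like `lsof -nP -iTCP -sTCP:ESTABLISHED` and that slapd listens on the default LDAP port 389; neither detail is stated above, and the function name and the sample lines are made up for illustration.

```python
from collections import Counter

# Made-up sample in the usual lsof column layout (not taken from the real server).
SAMPLE = """\
COMMAND  PID     USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
java    1234  tomcat8  101u  IPv6  50001      0t0  TCP 127.0.0.1:40001->127.0.0.1:389 (ESTABLISHED)
java    1234  tomcat8  102u  IPv6  50002      0t0  TCP 127.0.0.1:40002->127.0.0.1:389 (ESTABLISHED)
slapd    987 openldap   20u  IPv4  50003      0t0  TCP 127.0.0.1:389->127.0.0.1:40001 (ESTABLISHED)
java    1234  tomcat8  103u  IPv6  50004      0t0  TCP 127.0.0.1:40003->127.0.0.1:3306 (ESTABLISHED)
"""


def count_ldap_connections(lsof_output: str, ldap_port: int = 389) -> Counter:
    """Count ESTABLISHED TCP connections per command that involve ldap_port.

    ldap_port=389 is an assumption (the default LDAP port); adjust as needed.
    """
    counts = Counter()
    for line in lsof_output.splitlines():
        fields = line.split()
        # Skip the header line and anything that is not an established socket.
        if len(fields) < 10 or fields[-1] != "(ESTABLISHED)":
            continue
        command, name = fields[0], fields[-2]
        local, _, remote = name.partition("->")
        # Compare whole port numbers so that e.g. 3890 does not match 389.
        ports = {local.rsplit(":", 1)[-1], remote.rsplit(":", 1)[-1]}
        if str(ldap_port) in ports:
            counts[command] += 1
    return counts
```

Calling `count_ldap_connections(SAMPLE)` counts the two java lines and the one slapd line and ignores the MySQL connection. Running it periodically against real lsof output would show whether the number keeps growing.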

The only Java processes running on that machine belong to Tomcat, which only hosts Xwiki.

I also installed Debian security updates two days ago, but there basically was nothing else related to either OpenLDAP or Java:

Start-Date: 2019-03-13  17:05:19
Commandline: apt-get upgrade
Upgrade: libopenjp2-7:amd64 (2.1.2-1.1+deb9u2, 2.1.2-1.1+deb9u3), php7.0-bz2:amd64
(7.0.33-0+deb9u2, 7.0.33-0+deb9u3), php7.0-cli:amd64 (7.0.33-0+deb9u2, 7.0.33-0+deb9u3),
php7.0-gd:amd64 (7.0.33-0+deb9u2, 7.0.33-0+deb9u3), php7.0-opcache:amd64
(7.0.33-0+deb9u2, 7.0.33-0+deb9u3), php7.0:amd64 (7.0.33-0+deb9u2, 7.0.33-0+deb9u3),
php7.0-recode:amd64 (7.0.33-0+deb9u2, 7.0.33-0+deb9u3), php7.0-common:amd64
(7.0.33-0+deb9u2, 7.0.33-0+deb9u3), openssh-sftp-server:amd64 (1:7.4p1-10+deb9u5,
1:7.4p1-10+deb9u6), php7.0-json:amd64 (7.0.33-0+deb9u2, 7.0.33-0+deb9u3),
php7.0-mbstring:amd64 (7.0.33-0+deb9u2, 7.0.33-0+deb9u3), php7.0-readline:amd64
(7.0.33-0+deb9u2, 7.0.33-0+deb9u3), php7.0-xml:amd64 (7.0.33-0+deb9u2,
7.0.33-0+deb9u3), php7.0-curl:amd64 (7.0.33-0+deb9u2, 7.0.33-0+deb9u3), php7.0-zip:amd64
(7.0.33-0+deb9u2, 7.0.33-0+deb9u3), php7.0-ldap:amd64 (7.0.33-0+deb9u2, 7.0.33-0+deb9u3), 
php7.0-mcrypt:amd64 (7.0.33-0+deb9u2, 7.0.33-0+deb9u3), php7.0-imap:amd64
(7.0.33-0+deb9u2, 7.0.33-0+deb9u3), openssh-server:amd64 (1:7.4p1-10+deb9u5,
1:7.4p1-10+deb9u6), php7.0-intl:amd64 (7.0.33-0+deb9u2, 7.0.33-0+deb9u3),
openssh-client:amd64 (1:7.4p1-10+deb9u5, 1:7.4p1-10+deb9u6), libapache2-mod-php7.0:amd64
(7.0.33-0+deb9u2, 7.0.33-0+deb9u3), php7.0-mysql:amd64 (7.0.33-0+deb9u2,
7.0.33-0+deb9u3)
End-Date: 2019-03-13  17:07:02

Start-Date: 2019-03-13  17:32:17
Commandline: apt-get upgrade
Upgrade: xwiki-mysql-common:amd64 (10.11.2, 10.11.3), xwiki-common:amd64 (10.11.2, 10.11.3), 
xwiki-tomcat8-mysql:amd64 (10.11.2, 10.11.3), xwiki-tomcat8-common:amd64 (10.11.2, 10.11.3)
End-Date: 2019-03-13  17:33:05

So I presumed that the issue was related to the Xwiki upgrade and just performed a downgrade to 10.11.2.

I’m not sure if this changed anything, though: Tomcat has only been running for about an hour now but already has about 205 established connections to slapd - and it was only me performing some tests from the browser against the server and maybe two other users, that’s it.

Both processes belonging to the connections (slapd and java) are waiting in a “futex()” syscall if I strace them, but that’s probably not saying much…

I don’t really have a clue how to debug this behaviour - are that many established connections to be expected? I’d rather expect the LDAP connections to be pretty short-lived, created and closed as needed, with maybe only a small pool of connections being kept open for caching purposes…

I really doubt it is related to the Xwiki upgrade, since only the LDAP authenticator extension makes LDAP connections.

The connections are supposed to be short-lived, yes, since each authenticator uses a new connection.

That being said, I looked at the extension code, and it seems the connection is released only when the Java garbage collector removes the connection object, which is not very reliable to say the least, but it has been like this forever. I just released a new version (9.3.5) in which the connection is closed as soon as it is no longer used.
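To illustrate the difference between waiting for the garbage collector and closing explicitly, here is a minimal sketch in Python. It is not the extension's actual code (which is Java); `FakeLdapConnection`, `authenticate` and the toy password check are hypothetical stand-ins that open no real socket. The idea is simply that the close happens in a `finally` block, so the connection is released whether the authentication succeeds, fails or throws. In Java the equivalent would be a `try`/`finally` or try-with-resources.

```python
class FakeLdapConnection:
    """Hypothetical stand-in for an LDAP connection; opens no real socket."""

    open_count = 0  # number of connections currently open

    def __init__(self):
        FakeLdapConnection.open_count += 1
        self.closed = False

    def bind(self, user, password):
        # Toy rules for illustration only, not real LDAP behaviour.
        if not user:
            raise ValueError("empty user name")
        return password == "secret"

    def close(self):
        # Safe to call more than once.
        if not self.closed:
            self.closed = True
            FakeLdapConnection.open_count -= 1


def authenticate(user, password):
    """Open a connection, try to bind, and always close it afterwards."""
    conn = FakeLdapConnection()
    try:
        return conn.bind(user, password)
    finally:
        # Runs on success, failure and exception alike, so the number of
        # open connections does not depend on when the GC happens to run.
        conn.close()
```

With this pattern the number of open connections drops back to zero after every call, however many authentications happen in a short time.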

But the authenticator has worked like this forever, so that does not explain why you have this issue only now. Maybe something is causing a lot more authentications than you used to have, and Java does not have time to release the connections.

Mh, no idea… I now re-upgraded to 10.11.3 and the fixed LDAP plugin, and the connection mess is completely gone. Thanks!

No idea why I hadn’t had any problems with this previously… I can’t think of any change to the server which could have influenced this, besides the Xwiki upgrade… Maybe Xwiki 10.11.3 somehow performs many more authentications than 10.11.2? But that’s just guessing, I’ve no idea, really…