If it happens again, please also check memory usage. This type of error is generally caused by one of two things: low memory or exhausted file handles.
If you aren't leaking file handles, then setting ulimit to twice your max ccu should work just fine.
edit: my notes in an unfinished article on load testing for the manual say:
On Linux, use ulimit -n (twice the number of ES5 clients allowed). For example, if your ES5 is configured for 100K ccu, use ulimit -n 200000.
edit 2: You may be able to check memory usage by checking the Console.log file - there's a low memory warning when it gets pretty low. Also try the ES Admin's Server Monitoring, Reporting; there are three graphs for memory.
Last edited by tcarr; 05-25-2012 at 12:34 AM.
Teresa Carrigan
Senior Engineer
Electrotank, Inc.
Hmm the problem persists. Memory does not seem to be an issue. The server process has 14 GB allocated to it, all of which ES claims for itself when it starts up. The reporting actually shows ES still had 12 GB free when connections started dropping.
The next suspect, then, is file handles. Our ES5 is configured for 100k ccu, with ulimit at 250k. I've also set a high upper limit in several other places, such as /proc/sys/fs/file-max and /etc/sysctl.conf (fs.file-max). I even upped this to 500k in all places, which is unrealistically high, since our actual usage at the time of failure is about 700 ccu with ~4000 total open files, as reported by "sudo /usr/sbin/lsof -n | wc -l".
A few things that may help figure out the issue:
1. The problem actually has some effects across reboots. After a reboot, the server will run fine for 30-60 minutes before refusing connections. For the best result (so far), we have to reboot, kill the ES process that is launched automatically from /etc/rc.local, then manually run the startup script again. Even after a manual launch, last time it only lasted ~30 hours before starting to refuse connections again.
2. We reverted to server version 5.3.1 and saw similar behavior, although it logged a different exception (below). After this exception, the server refused new connections just like with the Netty exception under 5.3.2. However, lsof only showed a few thousand open files, nothing near the upper limit of 500k.
3. Prior to when this problem started, the server ran fine for 18 days without a reboot. During that time it handled approximately the same daily load of users as what we see currently. Our only interaction with the server was to read log files via FTP and monitor it via the admin console. Then, suddenly, it started refusing connections and has been unstable since.
Code:
2012-May-27 14:12:20:078 [accept BinaryTCP-0] ERROR com.electrotank.io.Acceptor - Exception accepting new connection
java.io.IOException: Too many open files
    at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
    at sun.nio.ch.ServerSocketChannelImpl.accept(Unknown Source)
    at com.electrotank.io.Acceptor.run(Acceptor.java:69)
    at java.lang.Thread.run(Unknown Source)
What can we investigate next? All the data we have suggests that the system has plenty of resources available.
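One thing worth noting about the numbers above: "lsof -n | wc -l" counts rows for every process on the machine, while the "Too many open files" limit applies per process. A minimal sketch of how a JVM can report its own descriptor count and limit is below. It assumes a HotSpot-style JVM on Unix, where com.sun.management.UnixOperatingSystemMXBean is available; the class and method names (FdUsage, recommendedLimit, currentUsage) are made up for illustration. To reflect the ElectroServer process it would have to run inside that JVM (for example from an extension); run standalone, it only shows the limits inherited from the launching shell.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import com.sun.management.UnixOperatingSystemMXBean;

final class FdUsage {
    /** Rule of thumb quoted earlier in the thread: ulimit -n = 2 x max ccu. */
    static long recommendedLimit(long maxCcu) {
        return 2 * maxCcu;
    }

    /** Returns {open, max} file descriptor counts for this JVM, or null if unavailable. */
    static long[] currentUsage() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof UnixOperatingSystemMXBean) {
            UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
            return new long[] {
                unix.getOpenFileDescriptorCount(),
                unix.getMaxFileDescriptorCount()
            };
        }
        return null; // not a Unix platform, or the JVM does not expose this bean
    }

    public static void main(String[] args) {
        long maxCcu = 100_000; // configured ccu mentioned in the post above
        long[] usage = currentUsage();
        if (usage == null) {
            System.out.println("File descriptor counts are not available on this JVM");
            return;
        }
        long recommended = recommendedLimit(maxCcu);
        System.out.printf("open=%d max=%d recommended=%d%n", usage[0], usage[1], recommended);
        if (usage[1] < recommended) {
            System.out.println("WARNING: the limit seen by this process is below the recommended value");
        }
    }
}
```

If the "max" value printed from inside the server process is much lower than the value shown by ulimit in an interactive shell, the limit is probably not being applied to the user or script that launches the server.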
The only thing I can think of is that something is leaking file handles, because nobody else is seeing this problem with ElectroServer and only 700 ccu.
It's a holiday weekend here, but I'll ask Jason if he has any ideas when he gets back to his desk on Tuesday. In the meantime it might be useful to take a thread dump before rebooting, to see if that gives us a clue. You might also set up YourKit profiling; I can give you instructions on how to do that if you need them. When a bug turns out to be in ElectroServer itself, Jason will ask for access to the YourKit profiler so that he can see what is happening. I don't think it's likely that this is a bug in ES5, however.
Teresa Carrigan
Senior Engineer
Electrotank, Inc.
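For reference, a thread dump is usually taken from outside the process with "jstack <pid>" or "kill -3 <pid>". If that is not convenient, something similar can be produced from inside the JVM. The sketch below uses the standard ThreadMXBean API; the class name ThreadDump is hypothetical, and where you would call it from in ElectroServer (for example an extension) is an assumption, not something this thread confirms.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

final class ThreadDump {
    /** Builds a plain-text dump of all live threads with their stack traces. */
    static String dump() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();
        // false, false: skip lock details, which are not supported on every JVM
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            out.append('"').append(info.getThreadName()).append("\" ")
               .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(dump());
    }
}
```

Threads stuck in the accept or I/O paths at the time connections are refused would be the first thing to look for in the output.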
Can you please log in to the server under the user that ElectroServer runs as and send me the results of the command:
ulimit -a
Thanks,
Jason
tcarr (05-28-2012)
This thread on StackOverflow seems relevant, and might be useful to you in tracking down a filehandle leak.
edit: what garbage collection scheme are you using? Because if Java doesn't reclaim the closed filehandles until GC is run, and you aren't running GC very often, that might be the problem.
Teresa Carrigan
Senior Engineer
Electrotank, Inc.
Thanks Teresa, that stackoverflow article makes a really good point!
We are using the UseConcMarkSweepGC/CMSIncrementalMode options for the runtime. We also have a large amount of memory that doesn't get used up very quickly, and which is allocated to the JVM entirely at launch.
Perhaps then the non-GC'ed sockets just happened to build up too high at some point.
We've lowered the heap size and will be monitoring, in case there is still a leak of handles somewhere. We'll look at putting in a periodic manual GC in the code as well.
Jason, the result is:
Code:
ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 139264
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) unlimited
open files                      (-n) 500000
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 139264
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
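The periodic manual GC mentioned above could look roughly like the sketch below. It is only an illustration: the class name PeriodicGc and the interval are made up, System.gc() is a request that the JVM may ignore (and it is a no-op if the JVM is started with -XX:+DisableExplicitGC), and forcing collections can cause pauses, so it is more of a diagnostic aid than a fix for a real handle leak.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

final class PeriodicGc {
    // Single daemon thread so the scheduler never keeps the JVM alive on its own
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "periodic-gc");
            t.setDaemon(true);
            return t;
        });
    private final AtomicLong requests = new AtomicLong();

    /** Requests a garbage collection every 'period' units, starting after one period. */
    void start(long period, TimeUnit unit) {
        scheduler.scheduleAtFixedRate(() -> {
            System.gc(); // a hint only; the JVM decides whether to collect
            requests.incrementAndGet();
        }, period, period, unit);
    }

    /** Number of GC requests issued so far. */
    long requestCount() {
        return requests.get();
    }

    void stop() {
        scheduler.shutdownNow();
    }

    public static void main(String[] args) throws InterruptedException {
        PeriodicGc gc = new PeriodicGc();
        gc.start(200, TimeUnit.MILLISECONDS); // short interval for the demo only
        Thread.sleep(1000);
        gc.stop();
        System.out.println("GC requests issued: " + gc.requestCount());
    }
}
```

In a real deployment the interval would be minutes rather than milliseconds, and it would be worth watching whether the open descriptor count actually drops after each request.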
I would think that UseConcMarkSweepGC/CMSIncrementalMode would garbage collect file handles quickly enough, but yes, allocating a smaller heap might help. If the problem is leakage of file handles (opening files and forgetting to close them), you will want to check your server code in each place that opens a file.
Teresa Carrigan
Senior Engineer
Electrotank, Inc.
RJ has some suggestions on this issue:
On some platforms like Ubuntu, it can be surprisingly tricky to get the ulimit to stick for the right user (root - which is the user running ElectroServer). First question I would have was which user they ran the command below with, and are they sure it's the same user running ElectroServer.
One way to verify they have ulimit set correctly for ES is to have ulimit -n echoed or redirected to a text file on ElectroServer startup:
ulimit -n > /tmp/ulimit_on_startup.txt
java ... ElectroServer
If it's not set correctly for the ES user (which is likely root), this article discusses how to set it right with Ubuntu:
http://posidev.com/blog/2009/06/04/s...ers-on-ubuntu/
That said, I've still had cases (which were, admittedly, quite possibly my error) where I couldn't get the above to take when using something like Daemontools/Svc. In those cases I had to set the ulimit directly in the supervise run script for ElectroServer.
Not sure this will help, but it's the next thing I would check.
Teresa Carrigan
Senior Engineer
Electrotank, Inc.
We found that forgoing the startup script and instead launching the server manually did improve the situation. It might be that the high upper bounds were not sticking (and yes, it's running as root).
So that's half the issue, and then there is still a slow connection leak. The number of established connections increases by about 5,500 per day. We have a daily load of ~220,000 new connections, so 2.5% of our connections never close.
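The arithmetic behind that figure, and what it implies for how long a given descriptor limit lasts, can be sketched as follows. The 5,500 and 220,000 values are from this post and 500,000 is the open files limit from the ulimit output; the 1,024 limit is purely an assumption for illustration (a common default soft limit on Linux), not something measured on this server. The estimate also ignores descriptors held by live connections, so it is an upper bound.

```java
final class LeakRate {
    /** Fraction of new connections that are never closed. */
    static double leakedFraction(long leakedPerDay, long newConnectionsPerDay) {
        return (double) leakedPerDay / newConnectionsPerDay;
    }

    /** Days until the leak alone fills the remaining headroom under a descriptor limit. */
    static double daysUntilLimit(long limit, long openNow, long leakedPerDay) {
        return (double) (limit - openNow) / leakedPerDay;
    }

    public static void main(String[] args) {
        long leakedPerDay = 5_500;       // growth in established connections per day (from the post)
        long newPerDay = 220_000;        // new connections per day (from the post)
        long configuredLimit = 500_000;  // "open files" value from ulimit -a
        long hypotheticalLimit = 1_024;  // ASSUMPTION: a limit that did not stick, for illustration

        System.out.printf("leaked fraction: %.1f%%%n",
            100 * leakedFraction(leakedPerDay, newPerDay));
        System.out.printf("days to reach %d: %.1f%n",
            configuredLimit, daysUntilLimit(configuredLimit, 0, leakedPerDay));
        System.out.printf("days to reach %d: %.2f%n",
            hypotheticalLimit, daysUntilLimit(hypotheticalLimit, 0, leakedPerDay));
    }
}
```

With the configured limit the leak alone would take roughly three months to exhaust the descriptors, which is why it matters whether the high limit is really the one in effect for the server process.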
Our extensions don't open files or connections. We pair clients, and kick them off the server under some conditions.
How about TCP keepAlive? Since our users are on mobile devices, it's quite likely there will always be a small percentage who lose connection and don't disconnect cleanly. Other projects encountered these exceptions due to not setting keepAlive in Netty:
http://apache-avro.679487.n3.nabble....td3788498.html
http://mail-archives.apache.org/mod_...A@yahoo.com%3E
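For illustration, this is what enabling keepAlive looks like on a plain java.net socket. It is a sketch only: the class name KeepAliveDemo is made up, it uses a loopback connection so it can run anywhere, and it does not show how ElectroServer configures its own sockets. In Netty 3 the equivalent is normally the "child.keepAlive" bootstrap option, but whether ES exposes that setting is not confirmed in this thread. Note also that on Linux the kernel waits 2 hours by default (net.ipv4.tcp_keepalive_time) before sending the first probe, so dead mobile connections would still linger for a while unless that is tuned.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

final class KeepAliveDemo {
    /**
     * Accepts one loopback connection, enables SO_KEEPALIVE on the accepted
     * socket, and returns the flag as reported back by the socket.
     */
    static boolean acceptWithKeepAlive() throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback); // port 0 = any free port
             Socket client = new Socket(loopback, server.getLocalPort());
             Socket accepted = server.accept()) {
            // With keepAlive on, the OS probes idle peers and eventually closes
            // connections whose remote end has silently disappeared.
            accepted.setKeepAlive(true);
            return accepted.getKeepAlive();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("keepAlive enabled: " + acceptWithKeepAlive());
    }
}
```

An application-level idle timeout (disconnecting clients that have sent nothing for some period) is a common complement to TCP keepAlive, since it does not depend on kernel settings.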