thinault: is the nginx server standalone or on the same server as the application?
are memcache and sql standalone too, or on the same machine?
also, you might want to check out the tcp stack settings, if you haven't done so already..
and how about nginx buffer and handling?
do you have a spare server for the backend that could loadbalance the application connections?
is the nginx request timeout on high load too low?
are the network settings ok? like, no misconfiguration in vlans, duplex or speed?
any firewall rules that might be dropping packets, or timing out a connection that stays persistent for a while?
also, are you using iptables?
is the "-A RH-Firewall-1-INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT" line there?
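For the duplex/speed question above, here is a minimal sketch of how the negotiated link settings could be checked on Linux. It assumes `ethtool` is installed and that the interface is called `eth0`; both are assumptions, and the helper name `link_summary` is made up for illustration.

```shell
#!/bin/sh
# Keep only the link settings that matter for a duplex/speed mismatch.
# Reads `ethtool <interface>` output on stdin and strips the indentation.
link_summary() {
    grep -E '^[[:space:]]*(Speed|Duplex|Auto-negotiation|Link detected):' \
        | sed 's/^[[:space:]]*//'
}
```

Typical usage would be `ethtool eth0 | link_summary` (usually as root). A half-duplex link, or a speed lower than what the switch port supports, would be worth a closer look.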
- all services are running on the same box, except for the AI but it is not relevant
- I don't use memcache nor sql, but mongodb (same box)
- Here is my sysctl.conf that I have tweaked:
net.ipv4.ip_forward=0
kernel.sysrq = 0
net.ipv4.tcp_syncookies = 0
net.ipv4.tcp_max_syn_backlog = 6553
net.core.somaxconn = 20000
net.core.netdev_max_backlog = 3000
- I've not tried anything on the nginx side, can you point me to a resource to get me started configuring the "buffer and handling"?
- I'm not familiar with the network settings you mention. FYI I run Arch Linux with kernel 2.6.38.2
- iptables:
cat /etc/iptables/simple_firewall.rules
*filter
:INPUT DROP [0:0]
:FORWARD DROP [0:0]
:OUTPUT ACCEPT [0:0]
-A INPUT -p icmp -j ACCEPT
-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
-A INPUT -i lo -j ACCEPT
-A INPUT -p tcp -j REJECT --reject-with tcp-reset
-A INPUT -p udp -j REJECT --reject-with icmp-port-unreachable
-A INPUT -j REJECT --reject-with icmp-proto-unreachable
COMMIT
At the moment I'm trying to tweak the netty server options, like this:
val receiveSize = 1024 * 1024
val backlogSize = 1024 * 64
bootstrap.setOption("tcpNoDelay", true)
bootstrap.setOption("receiveBufferSize", receiveSize)
bootstrap.setOption("child.receiveBufferSize", receiveSize)
bootstrap.setOption("backlog", backlogSize)
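One thing to keep in mind with the `backlog` option above: on Linux, the backlog an application passes to `listen()` is silently capped at `net.core.somaxconn`. With the sysctl.conf posted earlier (somaxconn = 20000), a requested backlog of 1024 * 64 = 65536 would effectively become 20000. A small sketch of that rule, where the helper name `effective_backlog` is made up for illustration:

```shell
#!/bin/sh
# Print the backlog the kernel will actually use for a listening socket.
# $1: backlog requested by the application
# $2: somaxconn value (optional; read from /proc if omitted)
effective_backlog() {
    requested=$1
    somaxconn=${2:-$(cat /proc/sys/net/core/somaxconn)}
    if [ "$requested" -gt "$somaxconn" ]; then
        # The kernel truncates anything above somaxconn without an error.
        echo "$somaxconn"
    else
        echo "$requested"
    fi
}
```

For example, `effective_backlog 65536 20000` prints `20000`.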
What's annoying is that the request dropping only happens in production, after a few hours, which considerably slows down experimentation.
Thanks for your help!
looks ok to me.
you could read here about nginx, and also the nginx forums and faq.
http://wiki.nginx.org/HttpProxyModule
https://calomel.org/nginx.html
does this resemble anything you are dealing with?
http://serverfault.com/questions/339412/nginx-timeout-after-200-concurrent-connections
could you paste in your nginx config or send it to me via email?
/etc/sysctl.conf:
net.ipv4.ip_local_port_range = 1024 65000
net.ipv4.tcp_rmem = 4096 87380 8388608
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 30
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1
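Settings in /etc/sysctl.conf only take effect after they are loaded (for example with `sysctl -p`), so it can help to compare the file with what the running kernel reports. The sketch below does that by reading the matching files under /proc/sys; the function names are made up for illustration, and the key-to-path mapping assumes plain dotted keys like the ones listed above.

```shell
#!/bin/sh
# Map a sysctl key such as net.core.somaxconn to its /proc/sys path.
sysctl_key_to_path() {
    printf '/proc/sys/%s\n' "$(printf '%s' "$1" | tr '.' '/')"
}

# For each "key = value" line in a sysctl.conf-style file, print the
# configured value next to the value the running kernel reports.
compare_sysctl() {
    conf=${1:-/etc/sysctl.conf}
    # Skip blank lines and comments, then split each line on the first "=".
    grep -Ev '^[[:space:]]*(#|;|$)' "$conf" | while IFS='=' read -r key value; do
        key=$(printf '%s' "$key" | tr -d '[:space:]')
        value=$(printf '%s' "$value" | sed 's/^[[:space:]]*//; s/[[:space:]]*$//')
        path=$(sysctl_key_to_path "$key")
        if [ -r "$path" ]; then
            # Multi-value settings (e.g. tcp_rmem) are tab-separated in /proc.
            running=$(tr '\t' ' ' < "$path")
        else
            running="(not available)"
        fi
        printf '%s: configured=%s running=%s\n' "$key" "$value" "$running"
    done
}
```

Running `compare_sysctl` with no argument checks /etc/sysctl.conf. As a side note, `tcp_tw_recycle` is known to cause dropped connections for clients behind NAT, so it may be worth testing that one separately from the others.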
also, try shutting iptables or any other firewall off for a while to see if it helps, and if it does, tune the filters and turn it back on.
I don't know your email but I don't mind sharing my nginx config publicly:
- main config file https://gist.github.com/2567675
- lichess config file https://gist.github.com/2567679
I'm looking at the serverfault page, seems interesting.
There is no firewall on my box (that I know of). I'll try more tweaks if the server goes down again today, as I've made some changes this morning already and I want to see how they do.
Stupid little green men.
YOU MUST DIE!!!
heh, so as not to sound completely useless: "buffer and handling" sounds more like handling buffer overflows. buffer/array/memory/string all basically mean the same shit: if there's too much data, it overflows. nginx might be managing several kinds of buffers at once, and it may be that just one of them has a limit that is set too low.
I've made a bit of progress today. I can now predict when the site is gonna stop responding.
When I run the command:
netstat -pan | grep "java" | wc -l
It tells me the number of sockets (generally tcp) currently used by processes whose name contains "java". Lichess is written in Scala, which runs on the JVM, so the lichess process name contains "java".
When the number given by this command reaches 900 (give or take 10), the site stops responding.
So I restart the lichess process, the number falls to zero and then grows quickly again.
Depending on site traffic, it takes between 20 minutes and 10 hours to reach this mysterious limit of 900.
Here we are. I don't know what to do next, but I keep searching (and learning a lot of interesting stuff doing it).
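A possible next step, sketched under assumptions: break the netstat count above down by TCP state, and compare it with the open-file limit of the process. A large share of CLOSE_WAIT sockets would hint that the application is not closing connections the client already dropped. Also, 900 sockets plus the files a JVM keeps open is not far from 1024, a common default per-process limit; that is only a guess, nothing in this thread confirms it. The function names are made up for illustration.

```shell
#!/bin/sh
# Count TCP sockets owned by java processes, grouped by connection state.
# Reads `netstat -pan` output on stdin, so it also works on a saved capture.
count_java_socket_states() {
    # Column 6 is the TCP state; the last column is "PID/program name".
    awk '$1 ~ /^tcp/ && $NF ~ /java/ { states[$6]++ }
         END { for (s in states) print states[s], s }' | sort -rn
}

# Print the soft "Max open files" limit from /proc/<pid>/limits text on stdin.
soft_open_files_limit() {
    awk '/^Max open files/ { print $4 }'
}
```

Usage would look like `netstat -pan | count_java_socket_states`, then `soft_open_files_limit < /proc/<pid>/limits` and `ls /proc/<pid>/fd | wc -l`, where `<pid>` is the lichess process id. If the last two numbers are close when the site stops responding, the limit is a likely suspect.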
There's a bug, which _may_ be related to your 900 lost(?) sockets:
When I lose the connection to lichess because my local router/modem (which connects me "directly" to the internet) crashes, I cannot play any further. After the router crashes, I have to reboot it, then I reload the page of my game. Now the Reconnecting message disappears. But when I try to make a move, I can make the move "locally" (the piece graphically goes to its destination square) but the Reconnecting message immediately appears again. After reloading again, the Reconnecting message disappears but the moved piece is back on its old square.