INFO: What happens to ssh every 2:11:15?

I was getting a weird artifact in my logs.  A daemon process that was in charge of keeping an ssh connection open to a remote host was restarting ssh every two hours eleven minutes:

myuser 15208 0.0 0.0 0 0 Z 02:01 0:00 [ssh <defunct>]
myuser 15511 0.0 0.0 0 0 Z 04:12 0:00 [ssh <defunct>]
myuser 15548 0.0 0.0 0 0 Z 06:24 0:00 [ssh <defunct>]
myuser 15584 0.0 0.0 0 0 Z 08:35 0:00 [ssh <defunct>]
myuser 15619 0.0 0.3 3408 1704 S 10:46 0:00 ssh -T myhost ...

What the heck is going on? I was running this from behind a DSL modem, and I had experienced some intermittent problems with it before. Was it the modem? Googling on the model # indicated nothing similar reported by others. Was it my ISP or Telco? Phone calls to them indicated that 2 hours was the median time between dropped connections for some old modems, but not mine and not my circuit type. Hmm. Many people pointed to the TCP KeepAlive default of 7200 seconds — two hours — but my problem had a period of over two hours. Almost exactly, consistently, two hours eleven minutes.

As it turns out, the TCP KeepAlive time of 7200 seconds plus the default KeepAlive probe interval (75) times the default probe count (9) add up to 2:11:15.
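To make that arithmetic concrete, here is a small sketch. The function name keepalive_dead_time is made up for illustration. The three numbers are the Linux defaults for tcp_keepalive_time, tcp_keepalive_intvl and tcp_keepalive_probes.

```shell
# Seconds until the kernel gives up on a silent peer:
# idle time before the first probe, plus one interval per unanswered probe.
keepalive_dead_time() {
    echo $(( $1 + $2 * $3 ))
}

# Linux defaults: tcp_keepalive_time, tcp_keepalive_intvl, tcp_keepalive_probes
total=$(keepalive_dead_time 7200 75 9)
printf '%d seconds = %d:%02d:%02d\n' "$total" \
    $(( total / 3600 )) $(( total % 3600 / 60 )) $(( total % 60 ))
# prints: 7875 seconds = 2:11:15
```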

If you want to change this for one reason or another, try:

echo "30" > /proc/sys/net/ipv4/tcp_keepalive_time

… or likewise for the other keepalive settings (remember that you'll still have 11:15 worth of probe interval × count; lower tcp_keepalive_intvl and tcp_keepalive_probes too if you need to know sooner). Better yet, read http://av.stanford.edu/books/tcpip/tcp_keep.htm for some actual theory on the subject.
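If you do lower all three, something like the following would do it. This is only a sketch: set_keepalive and the KEEPALIVE_DIR override are names I made up, and the values 30/10/5 are purely illustrative, not a recommendation. Writing to the real /proc files needs root.

```shell
# Directory holding the keepalive knobs; overridable so the sketch can be
# tried against a scratch directory without root (hypothetical variable name).
KEEPALIVE_DIR="${KEEPALIVE_DIR:-/proc/sys/net/ipv4}"

set_keepalive() {
    # $1 = idle seconds before the first probe
    # $2 = seconds between probes
    # $3 = number of unanswered probes before the connection is declared dead
    echo "$1" > "$KEEPALIVE_DIR/tcp_keepalive_time" &&
    echo "$2" > "$KEEPALIVE_DIR/tcp_keepalive_intvl" &&
    echo "$3" > "$KEEPALIVE_DIR/tcp_keepalive_probes"
}

# Example (as root): first probe after 30 s idle, then 5 probes 10 s apart,
# so a dead peer is noticed after roughly 30 + 10 * 5 = 80 seconds.
# set_keepalive 30 10 5
```

The call is left commented out so that pasting the block only defines the function and changes nothing.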

One good use for this information is keeping a persistent connection open between two machines, e.g. using Net::SSH::sshopen2 for a bidirectional connection to a process executed on a remote machine. Say you're on a flaky link that drops the connection often but briefly, and the nature of the stuff you're doing is such that you want it to reconnect and try again rather than obliviously sit through the blip.

(The reason I ramble at such length about what one might use this for is that you do NOT want to follow these directions if you have the more common “momentarily flaky” connection problem, such as terminal sessions that you wish to keep open despite a moment of flakiness. In that case you do NOT want to enable short TCP keepalives, since they are really “detect deads,” and they will increase the likelihood that a blip in the connection kills your terminal session. You pretty much want to do the OPPOSITE of this, with two caveats: 1. if you are behind a NAT router and your connection isn't actually flaky, you might really be seeing a timeout of the NAT table, not connection flakeage, and so you DO want a keepalive shorter than the NAT table timeout [it's all a bit much, isn't it?]; 2. you are probably best off just using “screen” and doing a screen -r to reattach to your old screen once you get reconnected [screen is awesome for all sorts of reasons, and with screen, if you can divorce yourself from the graphical burden, you've essentially got a total multitasking virtual desktop with persistent state as long as you've got a vt100 terminal].)

The setup I would recommend is the following:

1. Set up your local ssh_config to make sure you're using KeepAlive yes (the option is spelled TCPKeepAlive in newer OpenSSH releases).

2. Set up your local TCP settings to have a short keepalive time and probe interval/count.  (Some kernels apparently don't behave with a keepalive time under 90 seconds, but I have had success with much lower numbers.)

3. Set up your remote sshd_config to use ClientAliveInterval and ClientAliveCountMax with reasonable values.  This is a sort of reverse, in-band version of what the TCP keepalive does on the local machine: the ssh daemon sends a message through the encrypted channel every ClientAliveInterval seconds and hangs up the connection if ClientAliveCountMax of them in a row go unanswered. This makes sure that the process you run on the remote machine gets hung up properly.

4. Make sure that your sshopen2 call, and the code that sends and receives along it, recognize when the SSH connection has been closed and deal with it, for example with an eval loop that reconnects when $@ is set.
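Step 4 is about Perl, but the same reconnect-on-failure idea can be sketched in shell around the plain ssh client. This is a swapped-in illustration, not the sshopen2 code itself. The function name reconnect_loop and its arguments are made up. It relies on ssh exiting with status 255 when the connection itself fails, as opposed to passing through the remote command's own exit status.

```shell
# Run a command; rerun it whenever it exits 255 (ssh's "connection error"
# status), up to a maximum number of attempts.
# Usage: reconnect_loop MAX_ATTEMPTS DELAY_SECONDS command [args...]
reconnect_loop() {
    max=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        "$@" && status=0 || status=$?
        # Anything other than 255 is the remote command's own result: stop.
        if [ "$status" -ne 255 ]; then
            return "$status"
        fi
        # Give up once the attempt budget is spent.
        if [ "$attempt" -ge "$max" ]; then
            return 255
        fi
        attempt=$(( attempt + 1 ))
        sleep "$delay"
    done
}

# Example (not run here; the trailing "..." is the elided remote command
# from the ps listing at the top):
# reconnect_loop 5 10 ssh -o TCPKeepAlive=yes -T myhost ...
```

One catch is that a remote command which itself exits 255 looks the same as a dropped connection, so this only suits commands that never return that status.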

 
