#155 PuTTY failing after a while with "Server key not valid"

Milestone: v0.8.x (devel)
Status: closed-fixed
Owner: Henry N.
Priority: 5
Updated: 2009-04-15
Created: 2009-04-09
Creator: Anonymous
Private: No

Hi,
a few days ago I upgraded my coLinux install to version 0.8.0 snapshot 20090329.
While using the ndis-bridge feature I noticed my SSH connections crashing after a while (like 20 minutes, sometimes less, sometimes after hours).

PuTTY and WinSCP report the error as "Server's host key did not match the signature supplied".
WinSCP is built on the SSH code of PuTTY, so it seems to be a low-level error.

I tested a few setups to isolate the problem:
- coLinux 0.8.0 (20090329) with ndis-bridge: fails after a while.
- coLinux 0.8.0 (20090329) with pcap-bridge: fails after a while.
- coLinux 0.7.4 rc1 (20090329) with ndis-bridge: fails after a while.
- coLinux 0.7.4 rc1 (20090329) with pcap-bridge: fails after a while.

- coLinux 0.7.3 (20080608) with pcap-bridge: still going strong after 2 hours and a lot of traffic.

The fact that I get the same error with RC1 and the 0.8.0 development version is not surprising, as they should be based on the same code.

From searching the net I found these possible causes for the problem:
- The cached server key signature is not valid anymore (would prevent me from logging in)
- The network traffic gets mauled between server and client
- The client or server software can be at fault.

My setup specs:
- AMD AthlonXP 3800+, 2gb ram
- Windows XP Pro SP3 + all updates to date
- Network card: NVIDIA nForce 10/100/1000 Mbps Ethernet
- Guest OS is ArchLinux (ver 2009.02) (using only prebuilt packages from the ArchLinux repositories)
- Server software: OpenSSL (0.9.8j pacman package #1) / OpenSSH (5.1p1 pacman package #2)
- Client software: PuTTY (0.60) / WinSCP (4.1.7 build 413)

I tested the virtual machine on the Windows host system. There cannot be any signal degradation on the wire, because the bridged network shouldn't use it. This, and the fact that I cannot reproduce the problem with version 0.7.3, leads me to believe the newer coLinux versions are corrupting the data somewhere in between.

I would like to know what you make of this problem. Solutions, workarounds and confirmations of reproducibility are also welcome.

Thanks, Keith

Discussion

  • Henry N.
    2009-04-09

    Hello Keith,

    do you use PuTTY on the same desktop where coLinux is running?
    Or does it go over the wire?

    I don't think this is a problem in the network bridge between coLinux and the host. I use PuTTY every day for many hours and have never seen corrupted data or errors like this. I log in automatically with public/private key pairs. PuTTY and coLinux run on the same desktop; I use PuTTY as the connection between the host (Windows) and the guest (coLinux). I also get no errors from network mounts or from Internet downloads over this bridged interface.

    The coLinux ndis-bridge has been heavily tested with "netio" without any errors.

    Please test your network connection without ssh, for example with netio ( http://www.nwlab.net/art/netio/netio.html ). Use the -t option to test only TCP (ssh uses TCP only).
    Test first from the location where you detected the errors. Then test the connection between your host (Windows) and guest (coLinux).

    There are very few changes to pcap-bridge between 0.7.3 and 0.7.4. Here is the list of changes:
    http://colinux.svn.sourceforge.net/viewvc/colinux/branches/devel/src/colinux/os/winnt/user/conet-bridged-daemon/?view=log

    Revision 1057 was the release build for 0.7.3, so there are only 4 changes up to version 0.7.4.
    The biggest change was revision 1222, which was first included in build 20090227.

    To find the regression, you can use snapshots from http://www.henrynestler.com/colinux/testing/devel-0.8.0/

    Henry

     
  • Thanks for the quick response.

    Above I wrote:
    I tested the virtual machine on the windows host system.

    This means I used the same desktop/computer system to run the virtual machine and to run the client software (PuTTY, WinSCP). Therefore I concluded the traffic does not go over the wire.

    In the hours between my bug report and this comment, the same virtual machine running under coLinux 0.7.3 has kept working fine. I did not experience any corrupted downloads inside the VM, but I will test this later on.

    I also noticed my sleep command sometimes aborting while running this bash script (to generate traffic):
    while true; do echo -n "$RANDOM"; sleep 0.5; done
    It prints some kind of failed assertion in xnanosleep.c at some line; I will report the actual error message later. Maybe this is related, because I don't get such aborts with version 0.7.3.

    I will investigate the problem further tomorrow or the day after that.

    Keith

     
  • After further investigating the problem I found this thread on the net:
    http://fixunix.com/openssl/518688-re-uml-devel-dev-random-problems-fp-registers-corruption.html

    It mentions something strange happening in the User Mode Linux kernel which causes random variables in random processes to get corrupted when doing something with OpenSSL.

    My little bash script became very unreliable when I started an OpenSSL key generation process in another PuTTY session. Normally I get a never-ending string of random decimal numbers, but now the sleep program starts aborting with the message:

    sleep: xnanosleep.c:67: xnanosleep: Assertion `0 <= seconds' failed.
    Aborted

    Can someone please try to reproduce the error?
    Start this bash script in one putty session:
    while true; do echo -n "$RANDOM"; sleep 0.1; done
    Start the following command in another one:
    openssl genrsa -out /dev/null 4096

    It should be obvious that the sleep command is failing, because there are a lot of error messages.
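
A minimal sketch of how the aborts could be counted instead of spotted by eye. The function name `count_sleep_failures` and its arguments are hypothetical and not part of the original report; the sketch only assumes that an aborted `sleep` exits with a non-zero status.

```shell
# Hypothetical helper: run `sleep` repeatedly and count the invocations
# that exit with a non-zero status (an aborted sleep does).
# The body runs in a subshell so its variables stay local.
count_sleep_failures() (
    iterations=$1
    delay=$2
    failures=0
    i=0
    while [ "$i" -lt "$iterations" ]; do
        # Discard the assertion message; only the exit status matters here.
        if ! sleep "$delay" 2>/dev/null; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures"
)

# Example use (commented out so that sourcing this file has no side effects):
#   openssl genrsa -out /dev/null 4096 &
#   count_sleep_failures 600 0.1
```

On an unaffected system the function should print 0; any other number is the count of failed `sleep` calls during the run.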

    I used colinux 0.8.0 (20090329) this time with the pcap bridge for networking.

    Also, henryn, I monitored my network traffic, and the traffic does reach the wire. My network is very reliable, but of course we can't rule out that the packet corruption comes from interference on the wire. But this does not explain sleep aborting when doing something with OpenSSL.

    This seems like a whole other bug, but I think the networking thing and the corruption of memory are related.

     
  • Henry N.
    2009-04-10

    Hello Keith,

    please open a separate tracker item for the sleep bug. It has nothing to do with PuTTY.
    I can confirm it under Debian 4.0 with coLinux 0.7.4-rc1 on the FLTK console. The message is:

    sleep: xnanosleep.c:58: xnanosleep: Assertion `0 <= seconds' failed.

    For the network, it is normal that all packets from the Windows host to the coLinux guest also go out over the wire. Windows does not know that we are in capture mode on this network interface. My question was rather whether the ssh connection goes over the wire. You said no. Then please let's run some netio tests from Windows to coLinux.

    Henry

     
  • Bug research update.

    I've done the netio performance benchmark twice from windows to colinux and twice from colinux to windows (four runs total). Here are the results:

    From windows to colinux #1:
    TCP connection established.
    Packet size 1k bytes: 11246 KByte/s Tx, 17599 KByte/s Rx.
    Packet size 2k bytes: 11296 KByte/s Tx, 18233 KByte/s Rx.
    Packet size 4k bytes: 11561 KByte/s Tx, 18841 KByte/s Rx.
    Packet size 8k bytes: 11566 KByte/s Tx, 19752 KByte/s Rx.
    Packet size 16k bytes: 11564 KByte/s Tx, 20031 KByte/s Rx.
    Packet size 32k bytes: 11586 KByte/s Tx, 18700 KByte/s Rx.
    Done.

    From windows to colinux #2:
    TCP connection established.
    Packet size 1k bytes: 11331 KByte/s Tx, 17431 KByte/s Rx.
    Packet size 2k bytes: 11228 KByte/s Tx, 17502 KByte/s Rx.
    Packet size 4k bytes: 11564 KByte/s Tx, 17893 KByte/s Rx.
    Packet size 8k bytes: 11542 KByte/s Tx, 18682 KByte/s Rx.
    Packet size 16k bytes: 11500 KByte/s Tx, 16087 KByte/s Rx.
    Packet size 32k bytes: 10794 KByte/s Tx, 20235 KByte/s Rx.
    Done.

    From colinux to windows #1:
    TCP connection established.
    Packet size 1k bytes: 17529 KByte/s Tx, 11034 KByte/s Rx.
    Packet size 2k bytes: 17824 KByte/s Tx, 11249 KByte/s Rx.
    Packet size 4k bytes: 17897 KByte/s Tx, 10706 KByte/s Rx.
    Packet size 8k bytes: 19426 KByte/s Tx, 10716 KByte/s Rx.
    Packet size 16k bytes: 19306 KByte/s Tx, 11464 KByte/s Rx.
    Packet size 32k bytes: 19284 KByte/s Tx, 11487 KByte/s Rx.
    Done.

    From colinux to windows #2:
    TCP connection established.
    Packet size 1k bytes: 17279 KByte/s Tx, 11073 KByte/s Rx.
    Packet size 2k bytes: 17536 KByte/s Tx, 11239 KByte/s Rx.
    Packet size 4k bytes: 17704 KByte/s Tx, 11451 KByte/s Rx.
    Packet size 8k bytes: 17088 KByte/s Tx, 11338 KByte/s Rx.
    Packet size 16k bytes: 19456 KByte/s Tx, 11458 KByte/s Rx.
    Packet size 32k bytes: 19034 KByte/s Tx, 11500 KByte/s Rx.
    Done.

    Nothing interesting here. I also looked at the netio description and source, and it doesn't seem to do integrity checks on the data being received. The way I understand the error, the key or the packets get corrupted along the way. So it would be nice to know whether what was sent out is also received unharmed at the other end.
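
One way to check integrity rather than throughput would be to transfer a file of random data over the bridged interface and compare checksums on both ends. A minimal sketch, assuming the POSIX `cksum` utility is available where the comparison runs (on Windows that would mean something like Cygwin); the function name, file names and transfer method are placeholders, not anything used in this report.

```shell
# Hypothetical helper: succeed only if both files have the same CRC and length.
# Reading from stdin keeps the file names out of the cksum output.
same_checksum() {
    [ "$(cksum < "$1")" = "$(cksum < "$2")" ]
}

# Example use (placeholder file names, commented out):
#   dd if=/dev/urandom of=sent.bin bs=1024 count=10240    # on the sender
#   ... transfer sent.bin over the bridged interface as received.bin ...
#   same_checksum sent.bin received.bin && echo intact || echo CORRUPTED
```

A CRC is weak against deliberate tampering but is enough to notice accidental corruption. Note that TCP's own checksum may already drop damaged segments, so a clean result here would not rule out corruption elsewhere.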

    After looking over my last log I found that the PuTTY sessions which crashed were exactly x hours old, where x is mostly 1, 2 or 3. See below:
    xxx pts/0 x.x.x.x Fri Apr 10 16:56 - 17:56 (01:00)
    xxx pts/0 x.x.x.x Fri Apr 10 20:13 - 21:13 (01:00)
    xxx pts/0 x.x.x.x Fri Apr 10 21:14 - 04:14 (07:00)
    xxx pts/2 x.x.x.x Fri Apr 10 20:14 - 03:14 (07:00)

    (These results are older and I'm not 100% sure these sessions have crashed but their entries are suspicious.)
    xxx pts/1 x.x.x.x Tue Apr 7 13:38 - 18:38 (05:00)
    xxx pts/0 x.x.x.x Tue Apr 7 19:57 - 20:57 (01:00)
    xxx pts/1 x.x.x.x Tue Apr 7 22:59 - 01:59 (03:00)
    xxx pts/0 x.x.x.x Tue Apr 7 22:59 - 23:59 (01:00)
    xxx pts/1 x.x.x.x Thu Apr 9 11:35 - 12:35 (01:00)
    xxx pts/0 x.x.x.x Thu Apr 9 11:34 - 15:34 (04:00)
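
The whole-hour pattern in the listings above can be filtered mechanically. A minimal sketch, assuming the `last` output format shown here, with the duration as the final field in parentheses; the function name `whole_hour_sessions` is hypothetical.

```shell
# Hypothetical helper: read `last` output on stdin and print only the entries
# whose duration is a whole number of hours, e.g. (01:00) or (1+02:00).
whole_hour_sessions() {
    awk '$NF ~ /^\(([0-9]+\+)?[0-9][0-9]:00\)$/ && $NF != "(00:00)"'
}

# Example use (commented out):
#   last | whole_hour_sessions
```

Sessions that ended after, say, 00:33 or 04:30 are dropped, which makes the suspicious entries easier to see in a long log.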

    Then I remembered there is a configuration option with a value of 3600 in sshd_config, so I changed it to a much shorter interval (10 seconds). The option is the SSH-1 key regeneration interval (KeyRegenerationInterval). But I use SSH protocol v2, so this parameter did not affect my PuTTY connection crashes.

    PuTTY also has such an option, located in the session editor under Connection -> SSH -> Kex: "Max minutes before rekey (0 for no limit)".
    When I changed this to 1 minute the session crashes became much more frequent. See below:
    xxx pts/1 x.x.x.x Sat Apr 11 19:42 - 19:44 (00:02)
    xxx pts/0 x.x.x.x Sat Apr 11 19:27 - 23:58 (04:30)
    xxx pts/1 x.x.x.x Sat Apr 11 16:25 - 16:27 (00:02)
    xxx pts/0 x.x.x.x Sat Apr 11 16:25 - 16:31 (00:06)
    xxx pts/0 x.x.x.x Sat Apr 11 13:46 - 15:46 (02:00)
    xxx pts/1 x.x.x.x Sat Apr 11 13:45 - 13:47 (00:02)
    xxx pts/1 x.x.x.x Sat Apr 11 13:43 - 13:44 (00:01)
    xxx pts/1 x.x.x.x Sat Apr 11 13:13 - 13:14 (00:01)
    xxx pts/0 x.x.x.x Sat Apr 11 13:12 - 13:45 (00:33)

    Now there are some sessions which lasted a lot longer than the others. This is because I had two sessions open most of the time: one running the 'echo $RANDOM' script and one for calling 'last | head' when the other session had crashed. The 'last | head' session does not generate much output, and this seems to affect the probability of the connection crashing. I am not sure yet how exactly the output and the key exchange are related.

    I am currently trying to reproduce the connection crashing with the increased number of key exchanges with version 0.7.3, but I don't think it will crash.

    I would like to ask anyone (Henry ;) ) to try and reproduce the crashing by setting your rekey PuTTY option to 1 minute. Thanks.

    Keith

     
  • Henry N.
    2009-04-12

    • status: open --> open-accepted
     
  • Henry N.
    2009-04-12

    Hello Keith,
    > I would like to ask anyone (Henry ;) ) to try and reproduce the crashing
    > by setting your rekey PuTTY option to 1 minute.

    Yes. It crashes with this setting after ~15 minutes.
    I was running "watch cat /proc/colinux/stats" inside PuTTY.

    I have restarted PuTTY, and it has now been running for more than 60 minutes without this problem. That is too rare to resolve the problem.

    PuTTY 0.60
    Debian 4.0, openssl 0.9.8c-4etch5, openssh-server 4.3p2-9etch3
    coLinux 0.7.4-rc1 (20090329)
    pcap-bridge on Realtek RTL8102E Family PCI-E Fast Ethernet NIC
    Hardware Checksum disabled. Only with this option can I connect to ssh from the host, see Bug #2688891.

    With the same coLinux, connected from native Linux with "ssh -o RekeyLimit=1K user@192.168.2.104", "watch cat /proc/colinux/stats" has been running without problems for more than 60 minutes now. With tcpdump on the FLTK console I can see the key regeneration about every 20 seconds. The differences here are ssh vs. PuTTY, and that this connection goes over the wire.

    Can you check this with Cygwin's ssh on the host?

    As a workaround you can use tuntap for your PuTTY login.

    Henry

     
    I cannot get version 0.7.3 connections to crash by rekeying more often.
    I also tried Cygwin's ssh with the RekeyLimit=1k option and it does not crash. PuTTY is still crashing though. The sleep command is also still dying occasionally, but that bug is being tracked here: http://sourceforge.net/tracker/?func=detail&aid=2756909&group_id=98788&atid=622063

    Keith

     
  • Henry N.
    2009-04-13

    First scene:
    I was idle at the prompt with PuTTY on the host (no wire) and at the same time was logged in from another machine with "ssh -o RekeyLimit=1K hn@192.168.2.104", doing some compiling under coLinux. Both ssh sessions were killed at the same time: PuTTY with "Server's host key did not match the signature supplied", and ssh with:
    """
    RSA_public_decrypt failed: error:0407006A:rsa routines:RSA_padding_check_PKCS1_type_1:block type is not 01
    key_verify failed for server_host_key
    """

    Second scene:
    I was logged in with PuTTY (key regeneration option set to 1 minute) via *tuntap*. On the first nt-console I was running dblchange.c (see bug #2756909), and on the second nt-console (ALT-F2) "openssl genrsa -out /dev/null 4096". At the same moment that dblchange.c detected the error, my PuTTY was terminated with the error message "Server's host key did not match the signature supplied". The difference is that I was not using pcap-bridge or ndis-bridge.

    Keith, you are right that both bugs have the same source. But please let's follow bug #2756909. That is better than waiting for PuTTY to terminate.

     
  • Henry N.
    2009-04-15

    This bug is caused by incorrectly handled FPU save/restore during the operating system switch. It was easier to see with the test programs in Bug #2756909, and is now fixed by reverting the changes from SVN r1237 (Floating point optimizations for operating switch).

    It's committed as SVN revision r1243 (devel) and r1245 (stable).
    New snapshots are available on http://www.colinux.org/snapshots/
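
To check a new snapshot, one rough approach is to repeat a deterministic floating-point computation and count results that differ from a reference value. The sketch below is hypothetical: it is not dblchange.c (whose source is not shown in this thread), it relies on awk doing its arithmetic in doubles, and whether it would actually trigger on an affected build is an assumption.

```shell
# Hypothetical check: compute the same floating-point sum many times and
# print how many rounds produced a result different from the first one.
# Argument: number of rounds to run.
fp_consistency_check() {
    awk -v rounds="$1" 'BEGIN {
        # Reference value: sum of square roots of 1..1000.
        ref = 0
        for (i = 1; i <= 1000; i++) ref += sqrt(i)
        bad = 0
        for (r = 0; r < rounds; r++) {
            sum = 0
            for (i = 1; i <= 1000; i++) sum += sqrt(i)
            # On a healthy system the same operations give the same result.
            if (sum != ref) bad++
        }
        print bad
    }'
}

# Example use (commented out), with load in a second console as in this thread:
#   openssl genrsa -out /dev/null 4096 &
#   fp_consistency_check 100000
```

A printed value of 0 means no mismatch was observed during the run; it does not prove the FPU state is always preserved.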

    Keith, many thanks for reporting and for the helpful test environments.

     
  • Henry N.
    2009-04-15

    • labels: 1144488 --> Linux Kernel
    • assigned_to: nobody --> henryn
    • status: open-accepted --> closed-fixed