
Lost Connection

sl919
2019-08-23
2019-09-03
  • sl919

    sl919 - 2019-08-23

    Hi Dennis,

    I'm trying to figure out why we're getting a Lost Connection status on the server when trying to transfer at speeds greater than 2.5Gbps. At speeds below 2.5Gbps, the transfer completes without any issues, but once I set the speed higher than 2.5Gbps, we get a Lost Connection status on the server stating that there's no response from the client. Our current setup is two PCs running Win10 Pro on i9 processors. The uftp commands are:

    Server: uftp -B 104857600 -b 8800 -R 10000000 -t 100 filename
    Client: uftpd -B 104857600 -D dataDirectory -c 20971520

    Thanks!

     
  • sl919

    sl919 - 2019-08-26

    Hi Dennis,

    Here's some additional information. I captured Wireshark data on the server and client side at 1Gbps and 10Gbps transfer rates. Looking at the Wireshark data, it appears that the only difference between the two transfer rates is at the end of the transfer. With the 1Gbps transfer rate, the client sends the COMPLETE message, but with the 10Gbps transfer rate, the client sends an ABORT type message with the message "Transfer timed out". However, I'm not sure why the client would send this. Can you provide any information on what to look at in the Wireshark data to determine why the client would send an ABORT instead of a COMPLETE?

    Thanks!

     
  • Dennis Bush

    Dennis Bush - 2019-08-27

    A client will send either a STATUS message or a COMPLETE message in response to a DONE message depending on whether it has received all packets.

    While receiving a file, the client will send an ABORT if it hasn't received any messages from the server in ROBUST * GRTT seconds or 1 second, whichever is larger, where ROBUST is the robustness factor specified by the server via the -s option (default is 20) and GRTT is the current group round trip time. So look at the timestamps of the received packets on the client side to see if there is a gap like this.
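
    To make that rule concrete, here is a minimal sketch of the timing check (the names are illustrative, not the actual uftpd variables):

    /* Illustrative only: the receive timeout described above.
       robust defaults to 20 (set with the server's -s option) and
       grtt is the current group round trip time in seconds. */
    double abort_timeout(int robust, double grtt)
    {
        double t = robust * grtt;
        return (t > 1.0) ? t : 1.0;   /* never shorter than 1 second */
    }

    If the gap between packets received on the client side is longer than this value, that is what triggers the ABORT.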

     
  • sl919

    sl919 - 2019-08-27

    I've run some more tests with more logging enabled. Looking at the client log file, it appears that the problem is the number of NAKs. When running at 10Gbps, there are upwards of 20k NAKs.

    I'm trying to correlate the Wireshark data with the log files and I'm coming across some odd findings. The client uftpd log states that it received packet 527, but that it's out of cache range since the last packet was 518. This repeats for many packets. I then look at the Wireshark data on the client side, and in the FILESEG packets I see missing blocks. For example, I see Section=0 Block=21, then the next one is Section=0 Block=37, so it looks like blocks 22 to 36 are missing. Then I look at the Wireshark data on the server side and I see blocks 22 to 27, but then it skips to block 44.
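
    In other words, what I'm checking for by eye amounts to scanning the block numbers for jumps, along these lines (just a sketch, nothing uftp-specific; blocks[] is the sorted list of block numbers exported from the capture):

    #include <stdio.h>

    /* Print any gaps in a sorted list of received block numbers. */
    static void report_gaps(const unsigned *blocks, size_t count)
    {
        size_t i;
        for (i = 1; i < count; i++) {
            if (blocks[i] != blocks[i - 1] + 1)
                printf("missing blocks %u to %u\n",
                       blocks[i - 1] + 1, blocks[i] - 1);
        }
    }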

    From these captures, it looks like Wireshark may not have captured all the data, since the uftpd log doesn't report an issue until packet 527 (I'm guessing that the packet and block numbers are the same).

    All this was captured with log level 4. The odd thing is that if I try log level 5, the transfer completes, but the speed is around 1.8Gbps instead of the full 10Gbps of the connection. I'm using -1 for the -R option with these transfers.

    So I then created a shared folder on the server, and from the client I just copied a file from the shared folder to a local folder on the client; the transfer ran at 1.08GB/s (~8Gbps). From this, I would think that with uftp (with one client and one server), I should in theory achieve the same speed of ~8Gbps.

    Is there anything else that I can collect or settings that I can modify to figure out what's going on here?

    Thanks.

     
  • Dennis Bush

    Dennis Bush - 2019-08-28

    The "out of cache range" messages make sense because of the gaps in received packets. The purpose of the cache is to be able to write multiple contiguous blocks to disk at once, so if there's a gap you don't want to write blocks you haven't received.

    You're currently using the hardcoded max of 100MB for the UDP buffer size, but at 10 Gbps that's only enough for about 0.1 seconds of data. It might make sense to raise that limit. You can try this out on your end with a minor code change.
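
    The rough math behind that figure:

    10 Gbps / 8 = 1.25 GB/s
    100 MB / 1.25 GB/s = ~0.08 seconds

    so any stall on the receiving side longer than that means dropped packets.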

    Line 308 of client_config.c looks like this:

    if ((rcvbuf < 65536) || (rcvbuf > 104857600)) {
    

    And line 370 of server_config.c looks the same:

    if ((rcvbuf < 65536) || (rcvbuf > 104857600)) {
    

    Change both to:

    if ((rcvbuf < 65536) || (rcvbuf > 1024 * 1024 * 1024)) {
    

    This will raise the limit for the UDP buffer size to 1 GB. Make these changes, recompile, and give that a try.
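
    Once recompiled, you'd pass the larger size with -B the same way you do now; for example (carrying over the rest of your earlier options):

    Server: uftp -B 1073741824 -b 8800 -R 10000000 -t 100 filename
    Client: uftpd -B 1073741824 -D dataDirectory -c 20971520

    where 1073741824 bytes is the new 1 GB upper limit.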

    You might also want to try increasing the priority of the client using the -N option. A value of -1 sets the priority to ABOVE_NORMAL and a value of -2 sets the priority to HIGH.

     
  • sl919

    sl919 - 2019-08-29

    I tried making the change and changing the client priority, but still no luck. When the transfer completes, the throughput is only about 1.8Gbps, and other times it fails with the Lost Connection status. I'm going to try to capture some data off the switch to see if for some reason the switch is dropping packets. Do you know if there's anything else I can try? Or is it just not possible to get those types of transfer rates with Win10? Do you know if anyone has been able to?

    Thanks.

     
  • Dennis Bush

    Dennis Bush - 2019-08-29

    I don't have a 10 Gbps network to test with, and so far I haven't heard from anyone who has been able to reach those speeds. I would say at this point to set up packet captures wherever you can to see where drops are happening, and to feel free to tweak settings such as cache size, buffer size, robustness factor, GRTT limits, etc. and see what effect they have.

     
  • sl919

    sl919 - 2019-09-03

    I sort of see what my issue is, and it appears to be related to the cache size. From analyzing the logs, it appears that the client is missing packets when it needs to write the cache to disk. At first I thought that if I set the smallest cache size, the client wouldn't have to write a large amount of data at once and could keep up with the incoming packets (with a large cache it seems to spend too much time writing data and not enough processing incoming data), but that didn't work.

    So is there an option in the server to pause once it has sent enough packets to fill the cache before continuing to send data? I know this will slow things down a bit, but it might make it more robust. Of course, for a multiple-client setup, all clients would need the same cache size.

    Thanks.

     
  • Dennis Bush

    Dennis Bush - 2019-09-03

    No, there's nothing like that other than setting the rate lower. Basically you need to see at what transmission rate you start to see NAKs and how many, and then tweak things like the UDP buffer, cache, and GRTT limits and see how they affect the NAKs.
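
    For example, assuming -R is in Kbps as in your earlier commands (-R 10000000 for 10 Gbps), capping the rate at roughly half the line speed would look something like:

    uftp -B 104857600 -b 8800 -R 5000000 -t 100 filename

    and then you can step the rate up from there while watching where the NAKs start.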

     
