Hello,
I am facing an unreliable transfer.
Setup:
I have a fairly powerful laptop (1 Gbps Ethernet interface) running Manjaro Linux.
My clients are Raspberry Pi 3B+ boards (not quite sure what they can do, but I have read figures from 100 Mbps up to 340 Mbps). I am targeting less than 100 Mbps; less is okay, but I need it to be stable.
I am running tests, sending a 1 GB file from one server to two clients.
I created custom log summaries, because I am too lazy to go through more than 300 logs by hand, so Python did the summing up.
One bug to mention: "Total NAKs found 7284.0" from my custom log means that there are 7284 lines which include the string NAKs, not 7284 NAKs in total.
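In other words, the summary only counts matching lines; it is roughly equivalent to the first command below (the log file name is just a placeholder). Summing the actual NAK counts would look more like the second line:
grep -c "NAKs" uftpd.log
awk '/NAKs for section/ {sum += $5} END {print sum}' uftpd.log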
The challenge:
I did about 100 runs per setup to see if anything changes. I also did far more runs than mentioned here, but these results are representative of my overall observations.
On the client side, I just use
uftpd -d -D ~/Downloads
On the server side
uftp ~/projects/dummies/oneG.dummie -x 3 -I enp5s0f3u1u4u3 -r 0.5:0.10:15.0 -R 78000 -B 2097152 -L logs/uftp.log
Outcome:
So far so good; in 8% of the cases a timeout occurred.
Next Try
Outcome:
How is that possible? The rate is just 1k lower, and the timeouts jump up to 19%.
Next, same settings as the first run.
Outcome:
24% timed out, with the exact same settings as the first run. A swing of roughly 20 percentage points with identical settings?
Okay, let's go slower...
Outcome:
Hmm, seems to work? Let's try again just to be sure.
I did not have that much time, only 20 runs.
But the outcome is way worse.
The original logs show that I am facing timeouts, so I took a closer look at the GRTT, but I would expect a more stable result. I "own" the network; besides some ARPs there is nothing else going on. Just a laptop, one switch and two Raspberry Pis.
A GRTT minimum of 0.09 is already pretty high(?); normally the logs show an RTT of about 0.01xx.
I also tried a higher minimum value, but apart from the transfer getting slower and slower, the timeouts are still there. Sometimes more, sometimes less.
This is what an average client log looks like:
This was sent with:
What I tried so far:
Configuring the GRTT -> works; at first I was facing way more timed-out clients. Since I tweaked it a bit, it is better, but not solved.
Configuring the cache_size (on the client side) -> more cache, more drops, less throughput... but I do not really know what numbers make sense here.
Does that influence the Linux settings as well?
Configuring the buf_size (on the client side) -> more buffer, way more NAKs. Same here, I do not know what makes sense.
I also reconfigured the UDP buffer size at the system level on the clients (see the example below), but there are still a lot of NAKs and timeouts... :(
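For reference, this is the kind of system-level change I mean, with example values matched to the -B 2097152 from the server command (not tuned recommendations):
sudo sysctl -w net.core.rmem_max=2097152
sudo sysctl -w net.core.rmem_default=2097152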
How do I determine the best argument values for my setup?
How can I tune the setup and find the best way to transfer the data without testing for several days?
Does anyone have any suggestions?
Thank you
Last edit: tobias frahm 2020-09-22
Tobias,
You're getting a very high level of NAKs. Based on the total number of blocks and sections, each section has either 10324 or 10325 blocks, and there are several consecutive sections which show NAKs for all of their blocks. One thing in particular that jumped out from the log was this:
2020/09/22 10:27:23.914010: [15990A7E/00:0001]: Sent 7519 NAKs for section 14
2020/09/22 10:27:44.748964: [15990A7E/00:0001]: Sent 1279 NAKs for section 16
2020/09/22 10:27:44.751497: [15990A7E/00:0001]: Sent 10325 NAKs for section 17
2020/09/22 10:27:44.753793: [15990A7E/00:0001]: Sent 10325 NAKs for section 18
2020/09/22 10:27:44.755957: [15990A7E/00:0001]: Sent 10325 NAKs for section 19
2020/09/22 10:27:44.758151: [15990A7E/00:0001]: Sent 10325 NAKs for section 20
2020/09/22 10:27:44.760516: [15990A7E/00:0001]: Sent 10325 NAKs for section 21
2020/09/22 10:27:44.762788: [15990A7E/00:0001]: Sent 10325 NAKs for section 22
2020/09/22 10:27:44.940347: [15990A7E/00:0001]: Sent 9345 NAKs for section 23
This tells me that there was a gap on the client side where nothing was received from section 15 to section 23, and that there was a 21 second delay in between. This suggests that the client machine was very busy doing something. Did you notice a high CPU load on the client? Was there some other task running?
You could try dropping the "nice" value to increase the client's priority via the -N option. Also, given that the client device is a Raspberry Pi, you could try disabling encryption via "-Y none" on the server to see if that makes a difference.
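For example, something like the following (the -10 is only an illustrative nice value, and a negative nice value requires root; otherwise these are just your own command lines with the flags added):
uftpd -d -D ~/Downloads -N -10
uftp ~/projects/dummies/oneG.dummie -Y none -x 3 -I enp5s0f3u1u4u3 -r 0.5:0.10:15.0 -R 78000 -B 2097152 -L logs/uftp.log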
Regards,
Dennis
Hello Dennis,
thank you for your reply.
The -Y none was a good shot, since I had already figured out that encryption is indeed a bottleneck on the Raspberry. But if it changed anything, it is not really noticeable.
I logged the CPU load on the Raspberry side during the transfer.
With -Y none, the maximum CPU load was about 10% in this case.
Client:
Server:
With the default encryption, the maximum CPU load was up to 40%.
Client:
Server:
My guess is the read/write speed of the Raspberry's storage, since it is a simple SD card; but as it is a C10 card, I would expect something around 10 MB/s (80,000 kbps).
SD-Card Wiki
My guess is that the Raspberry is quite busy writing the content from the buffer to the SD card; at -R 78000 it has to sustain roughly 9.75 MB/s of writes, which is right at that limit. So I tried a run with a faster USB flash drive connected.
On the client side the logs are looking better (I did not count the NAKs, but there seem to be far fewer). So far so good. Is there any option for something like flow control? I did not find any except
-C cc_type
but if I use this, it is very, very slow. I do not think that the SD card is so slow that it cannot even handle a transfer speed of about 5 MB/s (40,000 kbps).
Log with flash drive:
Server:
Client:
Regards,
Tobi
Tobi,
The -C option is used to enable congestion control. The algorithm it uses (TFMCC, RFC4654) does tend to be quick to scale back the speed in the face of a slowdown.
You could try changing the blocksize (-b) on the server to 1024 to better fit the disk's block size. Also, try varying the write cache size on the client (-c) to see what works. Getting the right value for -c will probably do a lot for your throughput.
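For example, something along these lines (the -c values are only illustrative starting points to sweep, not recommendations):
On the server:
uftp ~/projects/dummies/oneG.dummie -b 1024 -x 3 -I enp5s0f3u1u4u3 -r 0.5:0.10:15.0 -R 78000 -B 2097152 -L logs/uftp.log
On the client:
uftpd -d -D ~/Downloads -c 1048576
and then repeat with a few different -c values (for example 2097152 or 4194304) to see where the throughput levels off.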
Regards,
Dennis
Hello Dennis,
The high amount of NAKs was due to the client being busy. Since I could not observe this directly, I thought about the storage speed. I already knew this was a bottleneck, but I expected it to work, since the SD card has a write speed of about 15 MB/s (120,000 kbps). I plugged in a USB flash drive anyway, and the transfer is way more stable with the flash drive than with the SD card.
Zero out of 625 runs in total failed. This is a good result.
Also, there are far fewer NAKs than before.
Maybe I could get even better results, but so far so good.
Thank you for your support so far.
I will look into the block/cache size as well.
Best regards,
Tobi