I have one client that refuses to get any files and I can't figure out why. Any help or insight would be greatly appreciated.
It's syncing a folder of photos and videos to 7 Windows PCs. Usually it's just a single new folder that gets added inside that root folder. The client PCs are occasionally offline, so when they come back online they need the full set of folders and files that were created while they were away, and those get re-synced. This works great on the first 6, but fails 99% of the time on the 7th, though it does occasionally work correctly at random and I have no idea why. As far as I can tell the PC is the same as the other 6, hardware- and software-wise, and it's on the same gigabit network switch.
Immediately upon launching, the client shows a read timeout in the log. Log level 5 gives me this on the client.
What does read timeout mean? What is timing out? Why?
I also get some of these:
And then sometimes a lot of NAKS and this:
The sender log looks like this at default log level.
And then sometimes
Last edit: jason znack 2021-08-20
Jason,
"read timeout" is a normal message at log level 5 that states how long the client will wait for a packet before checking if any message related timers have tripped in order to send retransmissions or to check if the user pressed CTRL-C. It's not indicative of a problem.
The incorrect file ID messages are interesting. Unless there is another session going on that shouldn't happen.
The NAKs in the log seem to indicate that the client is losing packets in bursts, possibly because some other process is taking up CPU or network resources. The long string of hex characters after the logged NAKs prints at log level 5 and represents the bitmap of NAKs that will be sent for a particular section.
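To make that bitmap concrete, here's a small Python sketch (an illustration only, not UFTP's actual C implementation; the LSB-first bit order within each byte is an assumption for the example) of how a hex NAK bitmap maps to missing block numbers:

```python
def naks_from_bitmap(hex_bitmap: str) -> list[int]:
    """Decode a hex NAK bitmap into a list of missing block numbers.

    Assumes bit i of byte j (LSB first) marks block j*8 + i as NAKed;
    the real wire format in UFTP may differ.
    """
    missing = []
    data = bytes.fromhex(hex_bitmap)
    for byte_index, byte in enumerate(data):
        for bit in range(8):
            if byte & (1 << bit):
                missing.append(byte_index * 8 + bit)
    return missing

# A bitmap of "030080" would mean blocks 0, 1 and 23 need retransmission.
print(naks_from_bitmap("030080"))  # [0, 1, 23]
```

A long run of mostly-set bytes in such a bitmap is what burst loss looks like, versus isolated set bits for random single-packet loss.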
Since you're using a block size of 8192, first verify that you have jumbo packets enabled on all machines. If not, you'll end up with IP fragmentation which will greatly increase the loss rate.
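To see why fragmentation is so costly: with a standard 1500-byte MTU, an 8192-byte datagram gets split into several IP fragments, and losing any one fragment discards the whole datagram. A back-of-the-envelope Python sketch (header overhead is ignored to keep the arithmetic simple):

```python
import math

def effective_loss(packet_loss: float, payload: int, mtu: int = 1500) -> float:
    """Probability of losing a datagram when it is split into MTU-sized
    fragments and any single lost fragment discards the whole datagram.
    Header overhead is ignored for simplicity."""
    fragments = math.ceil(payload / mtu)
    return 1 - (1 - packet_loss) ** fragments

# 1% raw loss on the wire:
print(effective_loss(0.01, 1472))   # one fragment -> stays ~1%
print(effective_loss(0.01, 8192))   # six fragments -> roughly 5.9%
```

This is why fragmented traffic degrades much faster than the raw loss rate alone would suggest.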
Next, try testing with different speeds to see where the breaking point is. Start at 100Mbps and slowly increase until it starts failing. Do the same with the client-side disk cache and maybe the UDP buffer size. Be sure to change only one setting at a time so there's no confusion about which one caused a change. Also, changing the client's priority with the -N option might help it perform more favorably.
I would also suggest running Wireshark on the problem client to get some idea of what's coming in and what gets lost, whether at the host or process level. A packet capture on the client along with the logs on both sides for one particular session would give us the most insight, and I'd be most interested in seeing a trace of when an "incorrect file_id" message is printed.
Regards,
Dennis
I'll see if I can grab more incorrect file ID messages. I have seen a lot of those.
And those go on and on and on until it times out a split second later.
Thanks for the info. I wondered how much of what I was seeing was just due to the increased logging level and how much was actual problem.
I've verified that jumbo packets are enabled. The windows PCs are all identical (at least same model anyway. Minor internal differences maybe like drive brand).
Watching the transfer, there isn't an extreme jump in CPU usage, but there is a spike in disk traffic where it seems to be checking the files on the drive for a match. Maybe by the time it's finished, it's flunked out and the sender has moved on.
Here is a simplified log of a time section of transfer from the point of view of the sender, and two clients. One client that succeeds and the problem client that almost always fails.
Here's the sender; 0xD74CB09B is the problem client. The sender couldn't get file info and was never able to communicate with 0xD74CB09B again.
Here it is from the point of view of the failing client. It checks the first file, and then there is a 4 second gap where nothing seems to happen, and the world moves on. All the other clients have already checked and received all the files by then.
And here is the same time snip from a successful client.
I haven't fired up Wireshark yet. That will be next. I'm starting to suspect a disk latency issue. The PCs use eMMC drives, most of them SanDisk brand; the problem PC's is a generic. 3 of the PCs actually use an external USB drive as the target drive, including the problem PC.
Performance seems OK in normal usage, though it may be a bit sporadic. Does this seem like a drive performance issue, perhaps?
I'll see what I can do with wireshark tomorrow.
Thanks for your help.
This definitely looks like disk latency. During that 4 second gap in the client log, the only thing the client is doing is checking the validity of the destination filename, creating directories in the target file's path if required, and reading the info for the existing file. Nothing resource intensive.
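If you want to sanity-check the disk theory outside of UFTP, here's a rough Python sketch that times the same kind of metadata work (the path and filename are placeholders; point it at the actual target drive):

```python
import os
import time

def time_metadata_ops(target_dir: str, filename: str) -> float:
    """Time the kind of metadata work the client does in that gap:
    create the directory path if needed, then stat the existing file."""
    start = time.perf_counter()
    os.makedirs(target_dir, exist_ok=True)
    try:
        os.stat(os.path.join(target_dir, filename))
    except FileNotFoundError:
        pass  # file not there yet; the lookup itself is what we're timing
    return time.perf_counter() - start

# Placeholder paths; run against the real target drive during a transfer.
elapsed = time_metadata_ops("sync_target/photos", "IMG_0001.jpg")
print(f"metadata ops took {elapsed * 1000:.1f} ms")
```

If this regularly takes seconds on the problem machine while the transfer is running, the disk is the bottleneck.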
See if there's anything else that might be using that disk. Antivirus could be a possibility. Try temporarily disabling it and see what happens. Also, increasing the client's priority might give it preferential access to the disk, so try passing in -N -1 or -N -2.

Rejecting FILESEG: got incorrect file_id 0169
What exactly is this message telling me? What file is file_id 0169? It's usually 0169 in the logs, but sometimes others as well, like 0139 or 014D or 014E or 014F or 0150. Sometimes it happens once; sometimes it repeats across hundreds of lines.
I tried -N -1 with no real change. Tried different transfer rates, buffer sizes, cache size, congestion control, etc.
I thought I had found a drive latency problem: the external hard drive takes a while to seek, with the average response time in Task Manager spiking to nearly 4500 milliseconds. But the same behaviour happens when using the internal eMMC drive, which doesn't show the response time spike at all, and the client still eventually loses the connection even after successfully receiving files.
From the sender:
From the client:
Next step: a Wireshark capture, I guess? It seems to get way more NAKs than the other clients.
Regarding the "incorrect file_id" message: each file in a session is assigned a sequential number starting with 1. This message is printed when a message comes in for a file other than the currently active one. It's not uncommon to see it when the invalid file_id is for the prior file, but the fact that the file_id you're seeing is far off from the current file, and in some cases larger, is concerning.
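As a rough illustration of that check (a Python sketch, not the actual UFTP source), the client tracks a single active file_id and rejects segments for anything else:

```python
def handle_fileseg(active_file_id: int, msg_file_id: int) -> bool:
    """Accept a FILESEG only if it belongs to the file currently being
    received; otherwise log a rejection, as the UFTP client's log shows."""
    if msg_file_id != active_file_id:
        print(f"Rejecting FILESEG: got incorrect file_id {msg_file_id:04X}")
        return False
    return True

handle_fileseg(active_file_id=0x0150, msg_file_id=0x0169)  # rejected: ahead of the active file
handle_fileseg(active_file_id=0x0150, msg_file_id=0x0150)  # accepted
```

Seeing rejections for an id larger than the active one means the sender has already moved on to later files while this client is still stuck on an earlier one.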
I'm almost positive that the disk spikes are coming from some unrelated process hogging the disk. Increasing the client's priority with -N -1 or -N -2 could help with that. Temporarily disabling antivirus could also tell you if that's the culprit, and if so you might try whitelisting the uftpd.exe process.

If you can't figure out what's hogging the disk, the other option is to increase the minimum round trip time on the server to account for delays on the client. The -g option on the server specifies the initial, minimum and maximum RTT. Start with -g 0.5:0.5:15 and also change -s to 15 or 20 to keep the waiting time down.

Last edit: Dennis Bush 2021-08-23
The only AV is the Windows malware protection. I've set up an exclusion for the uftpd.exe process and the entire external drive that it writes to. Even temporarily turning it off didn't make a difference; plus, it always turns itself back on anyway. I also added dropbox.exe to the exclusion list since it's the only other process that is a resource hog.
I've increased the priority to -N -2.
Are you sure you mean -g for RTT? The manual I've been looking at uses that for max log size: -g max_log_size (http://uftp-multicast.sourceforge.net/server_usage.txt)

It shows -r for RTT: -r init_grtt[:min_grtt:max_grtt]
Have I been using the wrong manual?
I tried with -r 0.5:0.5:15 and -s 20
and... so far so good. The first sync with these changes actually managed to transfer several files that had consistently gotten missed. A sync with no changes correctly skipped all other files just like the other clients do. And a new sync with just a new batch of files to all clients worked correctly as well.
The overall transfer time is a bit longer, I think, based on the transfer time logs I have it write. Before it was ~20 seconds; now it's ~40 seconds. If that's what it takes for all clients to sync correctly, I'll take it, though faster is always better in this case. So I might try backing off a few changes and see what breaks it again. Will report back.
Thanks so much for the help and support.
It seems a bit better with the AV and priority changes alone, without the RTT changes, but it still drops the problem client eventually.
Adding back the -r 0.5:0.5:15 seems to let it keep up and hang in there. Will see how it looks after a day of use tomorrow.

Thanks again.
Well, I can confirm now that the transfers are successfully completing. However, the average time has gone from 20 seconds to 60 seconds. If I understand what the RTT change has done, is it that the minimum time per file is now 0.5 seconds round trip instead of 0.01 seconds? And so tuning that minimum to find the point where transfers start to fail again might allow for a reduction in transfer times?
I think getting rid of that problem client entirely will be the long term solution. Or at least excluding it from the uftp transfer and maybe letting a slower separate process handle the copy for it.
Thank you Dennis. I have many more questions to try and get the most out of UFTP I will ask another day. I appreciate your time helping me.
Yes, the change to the -r option sets the minimum round trip to 0.5 seconds, and adjusting that value (specifically the second one) will help reduce the overall time.
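To make that concrete, here's a hypothetical back-of-the-envelope model; the two-exchanges-per-file figure is an assumption for illustration, not taken from the UFTP protocol, but it shows how a raised minimum GRTT multiplies across a session:

```python
def sync_overhead(num_files: int, min_grtt: float, rounds_per_file: int = 2) -> float:
    """Rough lower bound on per-session control overhead when every file
    needs at least `rounds_per_file` GRTT-paced exchanges (announce/status).
    A model for intuition only; real UFTP timing is more involved."""
    return num_files * rounds_per_file * min_grtt

# 40 files at a 0.01 s minimum GRTT vs the raised 0.5 s minimum:
print(sync_overhead(40, 0.01))  # ~0.8 s of floor overhead
print(sync_overhead(40, 0.5))   # 40.0 s of floor overhead
```

Under this model, lowering the second value of -r from 0.5 toward the point where the slow client starts failing again trades overall time against that client's reliability, which matches the tuning plan above.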