We are running some tests of large file distribution among 70~100 cluster nodes using aria2 with HTTP and torrent support.
We have noticed that when more than 56~60 aria2c clients download the same file over HTTP and share the downloaded chunks via torrent, some clients fail and only 56~60 of them complete successfully.
Unfortunately I can't obtain any significant error log or error messages; the only thing I could retrieve is:
- in syslog:
Aug 27 17:20:30 localhost aria2c[9554]: segfault at 00000001000001f7 rip 00000000004e40ac rsp 00007fff6a5c5010 error 4
while the complete log of one of the failed clients is attached.
It seems that for some mysterious reason the clients (or maybe the tracker) can't handle more than 50~60 aria2c clients getting the same file at the same time.
Thanks,
Massimo.
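For reference, the fields of the syslog line above can be pulled apart with a short script. This is only a sketch: the names `decode_segfault` and `FAULT_BITS` are made up for illustration, and the bit meanings are the standard x86 page-fault error code bits that the Linux kernel prints (in hexadecimal) after "error".

```python
import re

# Low three bits of the x86 page-fault error code printed by the kernel.
# Each entry maps a bit number to (meaning when 0, meaning when 1).
FAULT_BITS = {
    0: ("page not present", "protection violation"),
    1: ("read access", "write access"),
    2: ("kernel mode", "user mode"),
}

SEGFAULT_RE = re.compile(
    r"(?P<prog>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"rip (?P<rip>[0-9a-f]+) rsp (?P<rsp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
)


def decode_segfault(line):
    """Return the fields of a kernel segfault line, or None if it does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"), 16)
    return {
        "program": m.group("prog"),
        "pid": int(m.group("pid")),
        "fault_address": int(m.group("addr"), 16),
        "instruction_pointer": int(m.group("rip"), 16),
        "stack_pointer": int(m.group("rsp"), 16),
        "error_bits": [
            names[(err >> bit) & 1] for bit, names in sorted(FAULT_BITS.items())
        ],
    }


line = (
    "Aug 27 17:20:30 localhost aria2c[9554]: segfault at 00000001000001f7 "
    "rip 00000000004e40ac rsp 00007fff6a5c5010 error 4"
)
info = decode_segfault(line)
print(info["error_bits"])
```

Read this way, "error 4" means a user-mode read of an unmapped address, which fits a bad pointer dereference inside aria2c rather than a problem on the tracker side. The `rip` value is the address to look up in a non-stripped binary or a gdb session.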
Logged In: YES
user_id=122061
Originator: YES
Sorry, I missed something that may be of interest:
the successful clients report, at the end of the transfer, the following error:
node35: [#3 SIZE:4,603.0MiB/4,608.0MiB(99%) CN:41 SPD:17909.20KiB/s UP:15494.17KiB/s(3,569.9MiB)]
node35: [#3 SEEDING(ratio:0.7) CN:41 UP:15439.37KiB/s(3,592.2MiB)]
node35:
node35: 2008-08-27 17:23:57 NOTICE - Seeding is over.
node35:
node35: 2008-08-27 17:23:57 ERROR - CUID#198 - Download aborted. URI=http://myserver.example.com:8082/announce?info_hash=%95%8f%3b%e3%1bD%8e%0f%b0%f0%12%60%ec%d9q%9b%fc%5dU%88&peer_id=%2daria2%2dlrcZRySpMEnES&uploaded=3766716416&downloaded=4409065472&left=0&compact=1&key=TpOksBtp&numwant=0&no_peer_id=1&port=6881&event=completed&supportcrypto=1
node35: Exception: libz::inflate() failed. cause:incorrect header check
node35: [#3 SEEDING(ratio:0.7) CN:9 UP:0.00KiB/s(3,592.3MiB)]
node35: [#3 SEEDING(ratio:0.7) CN:1 UP:0.00KiB/s(3,592.3MiB)]
node35:
node35: 2008-08-27 17:23:59 ERROR - CUID#201 - Download aborted. URI=http://myserver.example.com:8082/announce?info_hash=%95%8f%3b%e3%1bD%8e%0f%b0%f0%12%60%ec%d9q%9b%fc%5dU%88&peer_id=%2daria2%2dlrcZRySpMEnES&uploaded=3766831104&downloaded=4409245696&left=0&compact=1&key=TpOksBtp&numwant=0&no_peer_id=1&port=6881&event=stopped&supportcrypto=1
node35: Exception: libz::inflate() failed. cause:incorrect header check
node35: [#3 SEEDING(ratio:0.7) CN:0 UP:0.00KiB/s(3,592.3MiB)]
node35: [#3 SEEDING(ratio:0.7) CN:0 UP:0.00KiB/s(3,592.3MiB)]
node35:
node35: 2008-08-27 17:24:01 NOTICE - Download complete: /data//huge_file.img
node35:
node35: 2008-08-27 17:24:01 NOTICE - Your share ratio was 0.7, uploaded/downloaded=3,592.3MiB/4,608.0MiB
node35:
node35: Download Results:
node35: gid|stat|path/URI
node35: ===+====+======================================================================
node35: 1| OK|[MEMORY]/huge_file.img.metalink
node35: 2| OK|[MEMORY]/huge_file.img.torrent
node35: 3| OK|/data//huge_file.img
node35:
node35: Status Legend:
node35: (OK):download completed.(ERR):error occurred.(INPR):download in-progress.
but the download was successful and the file is neither damaged nor corrupted.
Thanks,
Massimo.
Logged In: YES
user_id=1450148
Originator: NO
It seems the failed client hit a segmentation fault. I actually also experienced a segmentation fault when downloading a torrent with aria2 a few weeks ago.
Unfortunately, I didn't capture a log or run it under gdb, and it has not segfaulted since then.
If you experience the segmentation fault again, could you please send me a log (-l option) or a gdb stack trace?
It would be very useful to me.
I think the tracker is fine, since it is simpler than the client and can normally handle hundreds of peers.
The error on the successful clients is a zlib error. 1.5.0 has a bug in zlib inflate. If you use 1.5.0, please use 1.5.2.
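As an illustration of the zlib message only: "incorrect header check" is what zlib's inflate reports when the data does not start with the header it was told to expect. The sketch below reproduces that message in Python by feeding gzip-framed data to a decoder set up for plain zlib framing. Whether this is what the 1.5.0 bug actually did is an assumption, and the sample payload is a made-up bencoded tracker reply.

```python
import gzip
import zlib

# Hypothetical bencoded tracker response, compressed with gzip framing.
payload = b"d8:intervali1800e5:peers0:e"
gzipped = gzip.compress(payload)

# With the default window bits, inflate expects a zlib header and
# rejects the gzip magic bytes.
try:
    zlib.decompress(gzipped)
    message = None
except zlib.error as exc:
    message = str(exc)

# Adding 32 to the window bits lets inflate auto-detect zlib or gzip framing.
recovered = zlib.decompress(gzipped, zlib.MAX_WBITS | 32)

print(message)
print(recovered == payload)
```

Since the failing requests are only the final `completed` and `stopped` announces, an error of this kind is consistent with the download itself finishing intact.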
After some months I was able to resume my tests, and now I can give you some details about this issue.
I've run several trials and the problem arises every time.
I run 20 aria2c clients per machine on two Linux servers, downloading the same 4 GB file over HTTP and sharing the chunks via torrent. After a few minutes some clients start to die, logging something like "aria2c[8801]: segfault at 00000001000001f7 rip 00000000004ea83c rsp 00007fff99bb62f0 error" in syslog.
Only a dozen clients per machine finish successfully.
Attached you can find both the log captured with the '-l' option and the stdout/stderr of one client; the others are much the same. The last log line is often "INFO - Leecher state, 2 choke round started". I tried switching the two hosts' network configuration from DHCP to fixed IP addresses, because the segfaults seemed to occur a few seconds after an IP address renewal, but the clients continued to die. I'm now using aria2c version 1.1.2 and bittorrent tracker 5.2.0. The hosts are Dual Core AMD Opteron machines running Gentoo Linux, kernel 2.6.20.
How to reproduce:
launch at least a dozen aria2c clients downloading a file over HTTP and sharing it via torrent.
Massimo.
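The reproduction steps above could be scripted roughly as follows. This is a sketch under assumptions, not the exact setup used in the tests: the metalink URL, the working directory and the helper names `build_commands`, `launch` and `count_segfaults` are hypothetical, and it assumes each client is started from a metalink that leads to both the HTTP source and the torrent, as the download results earlier in the thread suggest. Only the aria2c options `-d`, `-l` and `--listen-port` are used.

```python
import signal
import subprocess
from pathlib import Path


def build_commands(metalink_url, work_dir, clients, base_port=6881):
    """Build one aria2c command line per client, each with its own
    download directory, log file and BitTorrent listen port."""
    commands = []
    for i in range(clients):
        client_dir = Path(work_dir) / f"client_{i:02d}"
        commands.append([
            "aria2c",
            "-d", str(client_dir),
            "-l", str(client_dir / "aria2.log"),
            f"--listen-port={base_port + i}",
            metalink_url,
        ])
    return commands


def launch(commands):
    """Start all clients at once and return their exit codes."""
    procs = []
    for cmd in commands:
        Path(cmd[2]).mkdir(parents=True, exist_ok=True)  # cmd[2] is the -d value
        procs.append(subprocess.Popen(cmd))
    return [p.wait() for p in procs]


def count_segfaults(exit_codes):
    """On POSIX, a negative exit code -N means the process died from signal N."""
    return sum(1 for code in exit_codes if code == -signal.SIGSEGV)


# Hypothetical URL and directory; nothing is started until launch() is called.
cmds = build_commands(
    "http://myserver.example.com/huge_file.img.metalink", "/tmp/aria2-repro", 12
)
```

Calling `count_segfaults(launch(cmds))` on each host would then give the number of clients killed by SIGSEGV, with one `-l` log per client left in its own directory.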
File Added: thread_01.log.gz
Log file from '-l' option
File Added: thread_01_stdout.log
Stdout and stderr of a client that died.
Another detail about this issue: it seems to affect only the torrent feature, since all HTTP downloads finish successfully.
Gdb output related to aria2 segfault.
More details on this issue: I managed to catch the segfault in GDB and have put the output in the attached text file. The version used is aria2 version 1.2.0b+20090208.
The context and trials are the same as before.
Thanks,
Massimo.
Sorry for the late response.
And thanks for the log and especially the gdb backtrace.
I found a bug in the choking algorithm that causes the problem seen in the gdb backtrace. I created a patch to fix this issue. The patch is applicable to the latest svn trunk or the latest beta.
The patch is btleecher.patch
The patch was included in the 1.2.0 release.
This Tracker item was closed automatically by the system. It was
previously set to a Pending status, and the original submitter
did not respond within 14 days (the time period specified by
the administrator of this Tracker).