From: Markus K. <koe...@gm...> - 2006-05-02 15:02:28
Hi,
I guess some people might be offended by this email, but I don't intend
to start a flamewar, and as the list has about 4 users, there is only a
slight chance somebody will take it that way.
Please take it as constructive criticism. I spent two days evaluating
the code, and more than half of that time was burned on the last
released version, which I lack the words to describe.
I expected the codebase from the end of last year's SoC to work, maybe
with minor problems, but the opposite is the case: even after a whole
year, the current CVS code still suffers from its very early design
decisions.
>> I already profiled the code, and checked where to find the deadlocks,
>> (not how to defeat them), so if these problems are not known, drop me a
>> line and I'll followup with a more complete bugreport.
>
>
> I'm working on getting rid of these before the next release which will
> come out this week. It's a known issue but it has less to do with
> deadlocks and more to do with the logic of the protocol. I've observed
> deadlock scenarios in which RTT estimation fails to approximate a
> reasonable number for example and the two endpoints fall out of sync
> and take an effectively infinite time to complete. Packets are
> streaming (if you use the -v option) but data packets aren't being
> sent. If you turn off the congestion control, transfers complete but
> with intermediate congestion collapse.
It is a threading deadlock. The protocol may have bugs too, but what I
saw was definitely a threading deadlock.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7249 me 16 0 56716 22m 692 S 90.1 2.2 3:16.98 vfer
select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
futex(0x8062988, FUTEX_WAIT, 2, NULL
and then the cpu peaks at 99% for the process.
==8071== at 0x1B949199: __lll_mutex_lock_wait (in /lib/tls/libpthread-2.3.6.so)
==8071== by 0x804F105: Control_Recv (control.c:600)
==8071== by 0x804B1ED: vfer_recv (api.c:930)
==8071== by 0x804AF44: vfer_recvfile (api.c:872)
==8071== by 0x805D3C8: main (vfer_rcp.c:407)
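
A hang like the one in this trace, a thread parked forever in a mutex
lock, is easier to pin down if the lock attempt is bounded. The
following is only a minimal sketch, assuming POSIX threads on Linux;
lock_with_deadline is a hypothetical helper and not part of vfer. It
reports a lock that never becomes free instead of waiting on it forever.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <time.h>

/* Hypothetical helper, not part of vfer: try to take `m`, but give up
 * after `seconds` instead of waiting forever.
 * Returns 0 on success, ETIMEDOUT if the mutex stayed locked. */
int lock_with_deadline(pthread_mutex_t *m, int seconds)
{
    struct timespec deadline;

    assert(m != NULL);

    /* pthread_mutex_timedlock() takes an absolute CLOCK_REALTIME time. */
    if (clock_gettime(CLOCK_REALTIME, &deadline) != 0)
        return errno;
    deadline.tv_sec += seconds;

    return pthread_mutex_timedlock(m, &deadline);
}
```

A caller that gets ETIMEDOUT back can log which lock it was waiting
for, which narrows the search far more than a bare futex wait does.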
>> From my point of view the whole threading used for the socket I/O is far
>> too complex for the task itself, incomplete and therefore pretty error
>> prone, for example
....
>
>
> Yes, this is definitely a handicap of the implementation, but it also
> has some benefits which I won't go into. We're hoping to find a
> student for Google's summer of code 2006 to experiment with non
> threaded alternatives and perform performance measurements.
I'd be pleased to hear about these benefits.
I like sockets, and was about to apply for this project, but ... 3 months
is far too little to fix a 12-month history of visions, even with code
reuse.
I already solved the first tasks: measuring performance and checking
whether the problems come from threading.
I compiled with -pg and ran gprof on the code, used valgrind and gdb,
and I still can't understand how you could start implementing the
protocol without a reliably working socket base.
With threads this will *never* run stably or deliver the performance
the protocol is capable of, so protocol measurements and improvements
are wasted time. To go without threads, somebody has to rewrite it.
Murphy's law says "Interchangeable parts won't", and the socket I/O used
in vfer was not even planned to be interchangeable, so replacing it will
be hell. One has to take a _really_ close look at all parts of the code,
draft a new layout, and restructure the existing code while rewriting
the socket I/O.
> The impl only uses blocking mutex acquisitions. The short story is
> that there are two sets of mutexes, those that control bins that
> receive packets and those that control access to the sockets array.
> Receiving packets into and reading from the same bin makes more sense
> with a blocking mutex, and socket array access likewise doesn't
> benefit from a non blocking trylock call. The sockets array mutexes
> haven't been tested in full since one socket suffices to find the
> important bugs right now. The bin mutexes have been thoroughly tested.
>
...
>
> None of these are actually being used. These are there for the
> vfer_select() call which is not fully implemented right now. Condition
> vars are not being used anywhere in the active code.
That's the problem, not the solution.
Don't take it badly, but the threading as used lacks any design idea or
knowledge of thread synchronisation, so how can you claim benefits
from using threads here?
There is no need for any blocking code; everything that has to be done
can be done nonblocking: sendmsg, recvmsg, writev, readv.
It is just pure memory operations and some computation, nothing that
could block or lock the process long enough to make threads a useful
feature.
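
To make the nonblocking idea concrete, here is a minimal
single-threaded sketch, assuming a POSIX system and a datagram socket.
The helper names set_nonblocking and drain_datagrams are hypothetical
and not taken from vfer, and plain recv() stands in for recvmsg() to
keep it short. It waits with select() and then reads everything that
is queued without ever blocking in the read itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Switch a descriptor to non-blocking mode; 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Hypothetical helper for a non-blocking datagram socket: wait up to
 * timeout_ms for data, then read every queued datagram.
 * Returns the number of datagrams read, 0 on timeout, -1 on error. */
int drain_datagrams(int fd, int timeout_ms)
{
    fd_set rfds;
    struct timeval tv;
    char buf[2048];
    int count = 0;

    assert(fd >= 0 && fd < FD_SETSIZE);

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    int ready = select(fd + 1, &rfds, NULL, NULL, &tv);
    if (ready <= 0)
        return ready;               /* 0: timeout, -1: select() failed */

    for (;;) {
        ssize_t n = recv(fd, buf, sizeof buf, 0);
        if (n >= 0) {
            count++;                /* one datagram consumed */
            continue;
        }
        if (errno == EINTR)
            continue;               /* interrupted by a signal, retry */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            break;                  /* queue is empty, nothing blocked */
        return -1;                  /* real error */
    }
    return count;
}
```

A main loop would call set_nonblocking() once per socket and then
alternate between drain_datagrams() and its own computation; sending
can be handled the same way by watching the write set in select().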
As said before, I was about to apply, but after taking a deeper look, I
won't. This won't take 3 months, it will take at least (!) one year.
And the application simply lies: it talks about improvements, but it
should state 'fix somebody's visions' and get the skill ranking
'hardco(r|d)e'.
My respect to the chosen one who applies for this task and solves it;
even if he makes me look like an arrogant liar, I'll owe him a beer.
To all those reading the ML because they want to apply for the project:
if you want your summer of code to be fun, don't apply here.
This is a summer of pain project.
Best regards,
Markus Koetter