[hpoj-devel] scan lockups with PSC500

dgun@really.hates.spam said:
> Well, I haven't had time to look at the code, but I did clean up the
> debug output and put a few comments in it.  I've attached it to this
> message.  It doesn't matter what debug level I put, it still hangs.
> The less the debug output, the longer it takes to hang.  So I just
> used the max debug level for the attached output.

> The syslog debug output was taken using hpoj-0.6 and kernel 2.2.17.
> The commands used to collect the data are in the file.  The program
> hangs during the <ESC>*f0S scan command with a timeout error (-110).
> After the timeout error, I entered the final <ESC>E command and
> received the second timeout error.  In every single lockup, the same
> messages get repeated hundreds of times just before the timeout error,
> so even though the errors occur at random times, the symptom is always
> the same.  Hope you can make something of the data and get the driver
> fixed.  Otherwise, I'll be looking at it myself when I get the time. 

Hi, Daniel.  Thanks for getting me the debug output.  Unfortunately there
was a gap right at the place where it became interesting (where it changes
from event9/get_nibble/event11/event6 to mlcpp_intr).  If you happened to
save other debug logs (or if you try this again), could you check to see if
any of the other logs show this transition better?  If you see a line such
as "get_nibble(c60b_intr c60b1160", that is a sign of a gap.

It occurred to me that after we set event 7 or 12 (nAutoFd=0) in get_nibble,
perhaps the peripheral doesn't respond with event 9 (nAck=0) until after we
drop out of the poll loop contained within the PAR_WAIT_SET_CLEAR macro.
I don't know how long it is until mlcpp_intr gets called again to retry the
poll loop, but perhaps the peripheral times out in the meantime and backs
out of event 9, such that we time out for event 9.  To test this theory,
please try increasing the value of SHORT_WAIT, which is currently 500
iterations of the loop in PAR_WAIT_SET_CLEAR.  I'm not sure what would be a
better value, but you could experiment with several, such as 5000 or 10000.
If it's too large you may start to notice that your machine gets sluggish
due to spending a lot of time in kernel mode.
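
To make the SHORT_WAIT theory concrete, here is a rough user-space sketch of
a bounded poll loop. It is not the driver's actual PAR_WAIT_SET_CLEAR macro:
sim_port, sim_status_read() and poll_set_clear() are hypothetical names, the
0x40 mask assumes the conventional PC status-register position for nAck, and
the "peripheral" is simulated as dropping nAck only after a fixed number of
status reads.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed mask: nAck in the conventional PC parallel-port status register. */
#define PAR_STATUS_NACK 0x40

/* Simulated peripheral: keeps nAck high for the first `respond_after`
 * status reads, then drops it (event 9). Purely illustrative. */
struct sim_port {
    unsigned reads;          /* status reads performed so far */
    unsigned respond_after;  /* reads before nAck goes low */
};

static unsigned char sim_status_read(struct sim_port *p)
{
    p->reads++;
    return (p->reads > p->respond_after) ? 0x00 : PAR_STATUS_NACK;
}

/* Poll at most max_iter times until every bit in `set` is high and every
 * bit in `clear` is low. Returns the number of reads used, or -1 if the
 * iteration budget ran out first. */
static int poll_set_clear(struct sim_port *p, unsigned char set,
                          unsigned char clear, int max_iter)
{
    int i;

    assert(p != NULL && max_iter > 0);
    for (i = 0; i < max_iter; i++) {
        unsigned char s = sim_status_read(p);
        if ((s & set) == set && (s & clear) == 0)
            return i + 1;
    }
    return -1;
}
```

With a simulated peripheral that needs 2000 reads to respond, a budget of 500
gives up, while a budget of 5000 sees nAck drop on read 2001. In the real
driver the cost of the larger budget is the extra time spent spinning in
kernel mode.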

Another interesting test would be to log the return code of parStatusRead()
right before setting event 7/12, and again after "SET_STATE(l,9,BUSY_TIME);"
but before "PAR_WAIT_SET_CLEAR(l,0,PAR_STATUS_NACK);".  Any status line
changes after event 7/12 might further suggest a peripheral-side timeout.
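
As a sketch of how the two logged values could be compared, the helper below
reports which status lines differ between a "before" and an "after" reading.
describe_status_change() is a hypothetical name, the bit masks assume the
conventional PC parallel-port status register layout (raw register bits, not
corrected for any hardware inversion), and in the driver the result would go
through the existing debug logging rather than a user-space buffer.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumed layout: conventional PC parallel-port status register bits. */
struct status_bit {
    unsigned char mask;
    const char *name;
};

static const struct status_bit status_bits[] = {
    { 0x08, "nFault" },
    { 0x10, "Select" },
    { 0x20, "PError" },
    { 0x40, "nAck"   },
    { 0x80, "Busy"   },
};

/* Write a space-separated list of the lines that changed, with their new
 * raw values, into buf. Returns the number of lines reported. */
static int describe_status_change(unsigned char before, unsigned char after,
                                  char *buf, size_t len)
{
    unsigned char changed = before ^ after;
    size_t used = 0;
    size_t i;
    int count = 0;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';
    for (i = 0; i < sizeof(status_bits) / sizeof(status_bits[0]); i++) {
        int n;

        if (!(changed & status_bits[i].mask))
            continue;
        n = snprintf(buf + used, len - used, "%s%s=%d",
                     count ? " " : "", status_bits[i].name,
                     (after & status_bits[i].mask) ? 1 : 0);
        if (n < 0 || (size_t)n >= len - used)
            break;  /* out of space: stop and report what fit so far */
        used += (size_t)n;
        count++;
    }
    return count;
}
```

For example, a reading of 0xC0 before event 7/12 and 0x80 after SET_STATE
would be reported as "nAck=0", while two identical readings yield an empty
string and a count of 0.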

David