From: Gregg L. <gr...@li...> - 2009-04-21 10:58:40
|
w20...@wo... wrote: > > Gregg Liming wrote: >> Marc MERLIN wrote: >> >>> On Sun, Apr 19, 2009 at 04:22:18PM -0500, w20...@wo... wrote: >>> >>> >>>> And why, after that last error message does ALL scanning stop? >>>> >>> Not that I know the code, but what I think it does is talk to the PLM to >>> know what its configs are before then talking to all other devices and scan >>> their tables too. >>> If the plm scan fails, there is no reason to go further. >>> >> >> Not exactly. Every bit of the insteon code is event-centric and, for >> the most part, asynchronous. This means that the code isn't sitting >> there--blocking everything else--awaiting the response back from the >> PLM. That would obviously be a bad strategy. So, instead, it uses the >> equivalent of "callbacks" to know what to do next. I've attempted to >> catch exceptions/problems and handle them. However, that's not >> happening when a total failure occurs (which is what is considered when >> the PLM stops responding). IMO, it should and I'll try to remember to >> take a look at this as part of the new refactor. On the other hand, a >> PLM "fail event" (which is what I consider an absence of any response) >> is pretty serious. I definitely would not trust any memory oriented >> operations (especially writes) if this is a recurring situation for you. >> >> As an aside, I wonder if you are encountering pauses? You might take a >> look at the pause log to see if anything is occurring at the point that >> you get these messages. >> >> >> > > I found no evidence of pauses in the logs for the last week or two. > > > Additional information: when the problem occurs, it always has the same > form: > > 04/21/09 03:04:19 AM Running: Scan all link tables > 04/21/09 03:04:19 AM [Scan all link tables] Now scanning: $plm (1 of 25) > 04/21/09 03:04:19 AM [Insteon_PLM] Parsing serial data: > 0269060257a20106704e000000 > 04/21/09 03:04:19 AM [Insteon_PLM] ALL-Link Record > Response:0257a20106704e000000 > 04/21/09 03:04:19 AM [Insteon_PLM] Parsing serial data: > 026a060257e288061e91010005 > 04/21/09 03:04:19 AM [Insteon_PLM] ALL-Link Record > Response:0257e288061e91010005 > ... > 04/21/09 03:04:30 AM [Insteon_PLM] Parsing serial data: > 026a060257e29106704e010000 > 04/21/09 03:04:30 AM [Insteon_PLM] ALL-Link Record > Response:0257e29106704e010000 > 04/21/09 03:04:30 AM [Insteon_PLM] Parsing serial data: > 026a060257a201033b78000000 > 04/21/09 03:04:30 AM [Insteon_PLM] ALL-Link Record > Response:0257a201033b78000000 > 04/21/09 03:04:30 AM [Insteon_PLM] num commands in 1 second exceeded > threshold. Now delaying additional transmission for 1 second > 04/21/09 03:04:30 AM [Insteon_PLM] num commands in 1 second exceeded > threshold. Now delaying additional transmission for 1 second > 04/21/09 03:04:31 AM [Insteon_PLM] Parsing serial data: > 816a060257a2010764bb000000 > 04/21/09 03:04:31 AM [Insteon_PLM] ALL-Link Record > Response:0257a2010764bb000000 > 04/21/09 03:04:31 AM [Insteon_PLM] Attempt to queue command ('026A') > already in queue at position 1; skipping ... > 04/21/09 03:04:34 AM [Insteon_PLM] No acknowledgement from PLM to last > command requires forced abort of current command (026A}). This may > reflect a problem with your environment. > > > First the "already in queue message", then the "no acknowledgment" > message a few seconds later. So, first, why would MH's scan be queueing > a message already in the queue? The garbled line appears to contain two messages--one not parsable (816a06--that probably should have been 026a06) and one that is the parsable. The second one is the all-link record response (as noted in the log). When an all-link record response is received, it then triggers a request for the next all-link record which is "026a". A better approach would be to treat the reception of an all-link record response as a means for forcing the awaiting queued command acknowledgement (which wasn't received as it was garbled) to be removed from the queue. > It happens at different point in the > scan each time I try it. Debugging shows that the message received is > garbled, beginning with 0x81 instead of 0x02, so the first large foreach > block in _parse_data() throws the command back on the command_stack2 and > then continues on to a foreach(split()) loop which matches the message > data as an "ALL-Link Record Response" and attempts to put another 026A > command on the stack even though the prior one is still there. > > Not familiar with the code enough yet to know if that's a problem or > not...but probably a degenerate case due to issues with my PLM > communication. That entire area will likely the redesigned (for other reasons). And, I will probably incorporate the approach outlined above so as to cause the scanning to continue. However, as noted, the PLM appears to be producing faulty data. So, I would be more concerned about that condition. Gregg |