Re: [mh] no acknowledgement from "PLM"?

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 454-5900

w20...@wo... wrote:
> 
> Gregg Liming wrote:
>> Marc MERLIN wrote:
>>   
>>> On Sun, Apr 19, 2009 at 04:22:18PM -0500, w20...@wo... wrote:
>>>     
>>>
>>>> And why, after that last error message does ALL scanning stop?
>>>>       
>>> Not that I know the code, but what I think it does is talk to the PLM to
>>> know what its configs are before then talking to all other devices and scan
>>> their tables too.
>>> If the plm scan fails, there is no reason to go further.
>>>     
>>
>> Not exactly.  Every bit of the insteon code is event-centric and, for 
>> the most part, asynchronous.  This means that the code isn't sitting 
>> there--blocking everything else--awaiting the response back from the 
>> PLM.  That would obviously be a bad strategy.  So, instead, it uses the 
>> equivalent of "callbacks" to know what to do next.  I've attempted to 
>> catch exceptions/problems and handle them.  However, that's not 
>> happening when a total failure occurs (which is what is considered when 
>> the PLM stops responding).  IMO, it should and I'll try to remember to 
>> take a look at this as part of the new refactor.  On the other hand, a 
>> PLM "fail event" (which is what I consider an absence of any response) 
>> is pretty serious.  I definitely would not trust any memory oriented 
>> operations (especially writes) if this is a recurring situation for you.
>>
>> As an aside, I wonder if you are encountering pauses?  You might take a 
>> look at the pause log to see if anything is occurring at the point that 
>> you get these messages.
>>
>>
>>   
> 
> I found no evidence of pauses in the logs for the last week or two.
> 
> 
> Additional information: when the problem occurs, it always has the same 
> form:
> 
> 04/21/09 03:04:19 AM Running: Scan all link tables
> 04/21/09 03:04:19 AM [Scan all link tables] Now scanning: $plm (1 of 25)
> 04/21/09 03:04:19 AM [Insteon_PLM] Parsing serial data: 
> 0269060257a20106704e000000
> 04/21/09 03:04:19 AM [Insteon_PLM] ALL-Link Record 
> Response:0257a20106704e000000
> 04/21/09 03:04:19 AM [Insteon_PLM] Parsing serial data: 
> 026a060257e288061e91010005
> 04/21/09 03:04:19 AM [Insteon_PLM] ALL-Link Record 
> Response:0257e288061e91010005
> ...
> 04/21/09 03:04:30 AM [Insteon_PLM] Parsing serial data: 
> 026a060257e29106704e010000
> 04/21/09 03:04:30 AM [Insteon_PLM] ALL-Link Record 
> Response:0257e29106704e010000
> 04/21/09 03:04:30 AM [Insteon_PLM] Parsing serial data: 
> 026a060257a201033b78000000
> 04/21/09 03:04:30 AM [Insteon_PLM] ALL-Link Record 
> Response:0257a201033b78000000
> 04/21/09 03:04:30 AM [Insteon_PLM] num commands in 1 second exceeded 
> threshold. Now delaying additional transmission for 1 second
> 04/21/09 03:04:30 AM [Insteon_PLM] num commands in 1 second exceeded 
> threshold. Now delaying additional transmission for 1 second
> 04/21/09 03:04:31 AM [Insteon_PLM] Parsing serial data: 
> 816a060257a2010764bb000000
> 04/21/09 03:04:31 AM [Insteon_PLM] ALL-Link Record 
> Response:0257a2010764bb000000
> 04/21/09 03:04:31 AM [Insteon_PLM] Attempt to queue command ('026A') 
> already in queue at position 1; skipping ...
> 04/21/09 03:04:34 AM [Insteon_PLM] No acknowledgement from PLM to last 
> command requires forced abort of current command (026A}). This may 
> reflect a problem with your environment.
> 
> 
> First the "already in queue message", then the "no acknowledgment" 
> message a few seconds later.  So, first, why would MH's scan be queueing 
> a message already in the queue?

The garbled line appears to contain two messages--one not parsable 
(816a06--that probably should have been 026a06) and one that is the 
parsable.  The second one is the all-link record response (as noted in 
the log).  When an all-link record response is received, it then 
triggers a request for the next all-link record which is "026a".  A 
better approach would be to treat the reception of an all-link record 
response as a means for forcing the awaiting queued command 
acknowledgement (which wasn't received as it was garbled) to be removed 
from the queue.

>  It happens at different point in the 
> scan each time I try it.  Debugging shows that the message received is 
> garbled, beginning with 0x81 instead of 0x02, so the first large foreach 
> block in _parse_data() throws the command back on the command_stack2 and 
> then continues on to a foreach(split()) loop which matches the message 
> data as an "ALL-Link Record Response" and attempts to put another 026A 
> command on the stack even though the prior one is still there.
> 
> Not familiar with the code enough yet to know if that's a problem or 
> not...but probably a degenerate case due to issues with my PLM 
> communication.

That entire area will likely the redesigned (for other reasons).  And, I 
will probably incorporate the approach outlined above so as to cause the 
  scanning to continue.  However, as noted, the PLM appears to be 
producing faulty data.  So, I would be more concerned about that condition.

Gregg