Re: [mh] Trying scan all links/sync all links/delete orphans

On Tue, Mar 26, 2013 at 06:10:55PM -0700, Kevin Robert Keegan wrote:
> Marc,
> 
> Let me try and address them in order.

Man, that was thorough, thanks for the detailed review.

> The link table for gar_outlights_kpl seems to go well at the start of your
> sync_links. it increments up from 16 to 27 between 22:14:44 and 22:15:21.
>  Around, 22:15:27 MH adds another link to the device (this should cause the
> link table version to bump to 28).  MH requests the version number and
> receives back CC.  This message is combined with the PLM ACK in this line:
> 
> 25/03/2013 22:15:27  [Insteon_PLM] DEBUG3: Received PLM raw data:
> 026220fb3005190006025020fb3018d4ce22cc40
> 
> Breaking these two messages apart, the second message is:
> 
> 025020fb3018d4ce22cc40
> 
> Version number CC is way off, that would mean that there were nearly 200
> writes to the devices link table when MH only expected 1.  Seeing nothing
> wrong with this message MH saves CC as the current version number.
> 
> MH later receives what appears to be the correct link table version packet.
>  It is this line:
> 
> 25/03/2013 22:15:27  [Insteon_PLM] DEBUG3: Received PLM raw data:
> 025020fb3018d4ce212800
> 
> Unfortunately, MH believes that this is merely a duplicate message or a
> corrupt message and ignores it.

Oh man, that's "interesting"
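
For anyone following the log, here is a rough sketch of how the combined
buffer above splits into two messages. It is Python rather than mh's Perl,
and it is not mh's actual parser. The frame lengths are my assumption from
the PLM serial protocol: 0x62 is the echo of a sent standard message
(9 bytes including the trailing ACK) and 0x50 is a received standard
message (11 bytes). split_plm_frames and version_byte are made-up names.

```python
# Assumed total frame lengths in bytes, keyed by PLM command byte.
# 0x62: echo of a sent *standard* message (an extended one would be longer).
# 0x50: received standard message.
FRAME_LENGTHS = {0x62: 9, 0x50: 11}


def split_plm_frames(raw_hex):
    """Split a hex string of back-to-back PLM frames into single frames."""
    data = bytes.fromhex(raw_hex)
    frames = []
    pos = 0
    while pos < len(data):
        # Every frame starts with 0x02 followed by a command byte.
        if data[pos] != 0x02 or pos + 1 >= len(data):
            raise ValueError(f"no frame start at byte {pos}")
        length = FRAME_LENGTHS.get(data[pos + 1])
        if length is None:
            raise ValueError(f"unknown command 0x{data[pos + 1]:02x}")
        if pos + length > len(data):
            raise ValueError("truncated frame")
        frames.append(data[pos:pos + length].hex())
        pos += length
    return frames


def version_byte(frame_hex):
    """Return cmd1 of a 0x50 frame, the byte read as the link table version."""
    frame = bytes.fromhex(frame_hex)
    # Layout: 02 50, from (3 bytes), to (3 bytes), flags, cmd1, cmd2
    return f"{frame[9]:02x}"


frames = split_plm_frames("026220fb3005190006025020fb3018d4ce22cc40")
print(frames)                                   # the PLM echo, then the reply
print(version_byte(frames[1]))                  # version in the first reply
print(version_byte("025020fb3018d4ce212800"))   # version in the later reply
```

The first reply carries cc in that position and the later one carries 28,
which matches what Kevin describes.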

> A few lines down when MH tries to write another link to the device around
> 22:15:30, the device table is reported as version 28 which doesn't match
> CC.  Misterhouse thinking that something funny has occurred won't allow the
> device to be written to until it is rescanned.  Sync Links is then
> prevented from adding a number of links to this device.

It sounds like I just got unlucky then, and that mh did reasonable enough
things, considering.
I'm a bit dismayed that mh got corruption from that device because it's dual
band, and my PLM should have been able to receive a purely radio copy of the
message (on the power line, it happens to be the one device the farthest
from my PLM).
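
To check my understanding of the behaviour Kevin describes, here is a small
Python sketch of the guard as I read it: mh keeps the last version it saw,
and refuses further writes once the device reports something different,
until the device is rescanned. DeviceLinkState and its methods are
hypothetical names, not mh code, and versions are kept as the strings seen
in the log so I don't have to guess at their number base.

```python
class DeviceLinkState:
    """Hypothetical per-device record of the last known link table version."""

    def __init__(self, name, version):
        self.name = name
        self.version = version        # version mh believes the device has
        self.needs_rescan = False

    def record_reported(self, reported):
        """Store whatever version the device reported after a write."""
        self.version = reported

    def can_write(self, reported):
        """Allow a write only if the device agrees with the stored version."""
        if reported != self.version:
            self.needs_rescan = True
        # Once flagged, stay blocked until rescanned() is called.
        return not self.needs_rescan

    def rescanned(self, version):
        """A rescan brings the stored version back in line."""
        self.version = version
        self.needs_rescan = False


dev = DeviceLinkState("gar_outlights_kpl", "27")
dev.record_reported("cc")       # the corrupt reply gets saved
print(dev.can_write("28"))      # device now says 28, so writes are blocked
dev.rescanned("28")
print(dev.can_write("28"))      # allowed again after the rescan
```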

> The possible solutions are:
> 1. Catch that version CC is wrong and re-request the link table version.
>  -- This would help in this case, but we could still receive corruption
> when the link table variation is much closer.  The question would then
> become, where did the corruption occur?  In the reporting of the link table
> version, or in writing the address to the device?  Do we ever try and
> re-write the link table data?
> 2. Don't ask for the link table version after writing an address.  -- In
> essence, always assume that what was supposed to be written was done
> successfully.  This would cut down on the messages sent and would
> proportionally decrease the number of potentially corrupt messages.

During a device sync, would it make sense to just blindly write everything
that should be, and then re-read the whole device's link table, see if they
match what mh expected. If not, fix what needs to be fixed?
I realize that might not plug well into the current read/write logic though.
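
Roughly what I have in mind, as a Python sketch and not a proposal for the
actual mh code: after the blind writes, read the table back and diff it
against what mh expected. The link tuples and the plan_repairs name are
made up for illustration.

```python
def plan_repairs(expected, actual):
    """Compare the links mh expected with the links read back from a device.

    Both arguments are sets of hashable link records. Returns the links
    still missing on the device and the links that should not be there.
    """
    to_add = sorted(expected - actual)
    to_delete = sorted(actual - expected)
    return to_add, to_delete


# Hypothetical records: (group, role, linked device address)
expected = {(1, "responder", "18d4ce"), (2, "responder", "18d4ce"),
            (1, "controller", "18d4ce")}
actual = {(1, "responder", "18d4ce"), (1, "controller", "18d4ce"),
          (9, "responder", "aabbcc")}

to_add, to_delete = plan_repairs(expected, actual)
print("missing on device:", to_add)
print("unexpected on device:", to_delete)
```

Only the entries in those two lists would need another write, so a single
corrupt packet costs one extra pass instead of blocking the whole sync.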

> 3. Do nothing.  In a sense, the system worked the way it should here.
>  Something went bad and so it stops before it messes anything else up.  The
> feature that is lacking is proper user notification.  The error is buried
> in a long list of sync_link logs.  Issue #73 already raises this point.  If
> MH had warned Marc that some links did not sync correctly to this device,
> then he could have fixed it rather than think that something in
> delete_orphans caused the problem.

That's simpler and more reasonable IMO. Fail early and fail often has
advantages for sure. If there is a clear message as my last log line before
things stop, it's much easier to act on it.
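
Something like this is all I'd need as the last line of the log. It is a
Python sketch with a made-up sync_summary function and message wording, not
what mh prints today.

```python
def sync_summary(results):
    """Build one final log line from a {device name: synced ok?} mapping."""
    failed = sorted(name for name, ok in results.items() if not ok)
    if not failed:
        return "sync_links: all devices synced"
    return ("sync_links: FAILED for " + ", ".join(failed)
            + "; rescan these devices and run sync again")


print(sync_summary({"gar_outlights_kpl": False, "some_other_device": True}))
```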

> N.B. Marc, to save time, you don't need to conduct a complete rescan of all
> devices.  Selecting "Scan Changed Device Links" from the PLM or selecting
> "Scan Device Links" from the specific out-of-sync device would have brought
> the link table version back in line and allowed sync-links to complete
> properly.

Yes, I'm aware of the 2 now. Because things went wrong, I didn't know who
was wrong and who was corrupted, so to be sure, I did a complete scan just
in case.

> The idea with the link table version, is that you should never need to use
> "Scan All Device Links" from the PLM again.  You can always select "Scan
> Changed Device Links" and MH will interrogate each device and only scan
> those that need to be scanned.  The "Scan All Device Links" is left in
> there as a bit of a security blanket for the time being, but can likely be
> eventually dropped.

It sounds like the aldb version numbers on the devices are supposed to be
reliable enough then, I'll do that next time.
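
If I understand the idea, "Scan Changed Device Links" boils down to
something like the following Python sketch. The names are hypothetical and
I have not checked this against the mh source.

```python
def devices_to_scan(stored, reported):
    """Pick the devices whose reported version differs from mh's record.

    stored:   {device name: version mh has on file}
    reported: {device name: version the device just reported}
    A device mh has no record for is always scanned.
    """
    return sorted(name for name, version in reported.items()
                  if stored.get(name) != version)


stored = {"gar_outlights_kpl": "cc", "some_other_device": "05"}
reported = {"gar_outlights_kpl": "28", "some_other_device": "05",
            "new_device": "01"}
print(devices_to_scan(stored, reported))
```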

> So, yes, the second run of sync-links was necessary because of the issue
> with gar_outlights_kpl discussed above, I see that it completes without any
> further errors.  (I do note, that gar_outside_kpl has a lot of links setup
> in it.  Given the scale of the link database on this device, I am not ready
> to call a single error in writing all of these links a failure.  We may
> have to live with one failure for a job this size.  We just have to make
> that failure more clear and recoverable.)

100% agreed. This kind of network, apparently even the radio side, is not
reliable. Errors definitely have to be expected and reported in an easy to
notice way.

> The third run should not have resulted in anything happening, but I see it
> is "updating" a number of links added to gar_outlights_kpl.  An "update"
> occurs when the on-level or ramp rate changes.  I have noticed this bug
> before, but have not found the source, it has something to do with a
> mismatch between what we place in the MH link table hash and then what we
> expect to be there when we read it back later.  This is a fixable bug.

I think I saw an issue on that already, so no need to open a new one, right? 

> > 3) I did a scan all links overnight, and another sync all links, and
> > 3a) it wanted to sync yet more links, even though it was done last night
> > http://marc.merlins.org/tmp/print.log-goodsync-nextday
> 
> Umm, I don't know what happened and when, but it looks like MH's record of
> your PLM link database was totally lost.  You will note all it is trying to
> add are items to the PLM.  Not sure what happened in the intervening 10

Yep, I noticed that.

> hours between the end of the last log and this one.  It was likely a
> restart of some sort, but I don't know why only the PLM data was lost.  A
> "scan link table" of the PLM only would have fixed this.

In hindsight, yes. 
Say, if mh lost the state of the PLM, shouldn't it have a different value
for the version number on the insteon device's aldb?
Would it make sense for sync links to first get the version of the aldb on
the device, and make sure insteon has the same version number (and if not,
fail?)
Or I think you said above this is already going on, but somehow didn't
happen here?

> >  Also, I used my now very long filter of normal messages to find
> > interesting ones in case they help.
> 
>  I don't see anything in your filtered logs that was not already addressed
> or is that out of the ordinary.

Great. Those messages seemed ok considering, I just took them out of the
long logs to show corner case handling code at work, and thankfully doing
the right things now.

Thanks for your long review of this.
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/