Re: [mh] I2CS Code Testing (sync links working)

Marc,
I'll break this thread up into two different topics.  Here I'll discuss the errors on your
Insteon installation and the network errors.  In a separate email I'll discuss the logs
you sent me.  See my comments below:

On Jan 18, 2013 12:24 am Marc wrote:
> On Thu, Jan 17, 2013 at 02:33:12PM -0600, Michael Stovenour wrote:
>
> > Nice..  a real world network complete with dropped packets, packet corruption, and
> > duplicate packets all in the same aldb read sequence  ;)
> 
> Oh yeah, I have all of those galore here. Gregg used to be "exited" with the
> logs I used to send him :)
> 
> > Could you do me a favor for future log captures and set your Insteon debug to
> > insteon:4?
> 
> Done. Oh my, that's a lot of logging :)
> 
> By the way, s/Depricated/Deprecated/ (whoever wrote this).

Thanks, I'll check in a fix.  That was me.  My eighth grade honors English teacher would
proclaim a loud "see, I told you so!" right now.

> 
> > That will decode the messages for me in your debug output.  Also, my output seems to
> > have timestamps in it, but I'm not sure why.  Both my print.log file and the stdout
> > seem to
> 
> I truncated the timestamps so that the lines weren't as wide and wrapped by
> some mail systems.
> I'll leave them next time.

Ok, maybe you could attach the logs rather than embed them if they are long.  They are much
easier for me to analyze in a text editor anyway.

> > Did you use another software package to configure the I2CS device $fmr_slav or did you
> > create all of the links with manual linking?  I've not seen the device id ff.ff.ff
> > entries
> 
> I used manual linking after failing to have the current code trying to link
> to it.
> 
> > in my testing so I'm curious how those entries were created.  They are not a problem,
> > as far as I know, but I'm curious about them.
> 
> I think the old code created those.

OK.  "If" I get all this code working, you might reset that one to defaults and re-sync it.

> > Also, could you explain how your PLM is cabled to your MH computer and the PLM model
> 
> [Insteon_PLM] PLM id: 18d4ce firmware: 92
> It's a USB dual band PLM

How long is your cable?  I've seen issues with the FTDI chips driving anything over 6ft.

> > number?  I'm trying to figure how you are getting corrupted packets.  I see two bad
> 
> I have KPL buttons that switch on their own from time to time, some that
> aren't even linked to anything.
> I just have random noise on my powerline, some that I haven't been able to
> filter out, and as you know some insteon traffic does not use checksums and
> therefore allows noise or corrupted data to be written.

There are checksums on all data transmitted on the power line or over RF.  See the
attached pages from the INSTEON developers guide.  Look at the last heading on the last
page.  I also know that later specification documents show the same "on the wire" message
integrity byte.  The recent i2CS devices insert an "additional" CRC in the D14 address of
the user data.

There are many reasons the CRC might not work as intended.  I think a noisy electrical
system could cause the microcontroller to misbehave.  Also, there could be message race
conditions in the microcontroller code, which may not handle the case where one transmission
steps on another.  If the PLM has such a bug, it could explain why the PLM is sending
partial and corrupted messages to MH.  But it should not; it should drop all such
messages.

I'm not sure why the packet checksum isn't enough protection, but the D14 checksum
protects the data "end-2-end" where the packet checksum protects hop-by-hop.  The gaping
hole in the original packet checksum is the link between the PC and the PLM.  The new D14
checksum protects configuration writes on that link as well.  
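
To make the D14 idea concrete, here is a minimal Python sketch (not MH code).  It assumes
the i2CS D14 value is the two's complement of the low byte of cmd1 + cmd2 + D1..D13; that
formula is an assumption not shown in this thread, and the function names are made up for
illustration.

```python
def i2cs_checksum(cmd1: int, cmd2: int, user_data: bytes) -> int:
    """Return the D14 byte for an extended message.

    Assumption: D14 is the two's complement of the low byte of
    cmd1 + cmd2 + D1..D13 (hypothetical helper, not the MH routine).
    """
    if len(user_data) != 13:
        raise ValueError("expected D1..D13 (13 bytes)")
    total = cmd1 + cmd2 + sum(user_data)
    return (-total) & 0xFF


def i2cs_checksum_ok(cmd1: int, cmd2: int, user_data14: bytes) -> bool:
    """Check a received D1..D14 field: everything must sum to 0 mod 256."""
    if len(user_data14) != 14:
        return False
    return (cmd1 + cmd2 + sum(user_data14)) & 0xFF == 0
```

Because D14 is computed by the sender and checked by the final receiver, a byte damaged
on the PC-to-PLM link is caught as well, which is the "end-2-end" property described above.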

> When I worked with Gregg on this, I found links in some of my PLMs that had
> device numbers that were off by a bit or two.

Yah, that could be a real issue.  The retransmission logic will cause lots of extra
messages on the line if the device is sending to a bad Insteon address.  I suspect this is
one of the reasons for the additional checksum in the i2CS devices.  That checksum is only
mandatory for commands that modify the device configuration (e.g. aldb write).  It would
help to reduce the chances that a device is programmed with corrupted data.  I also
think the ALDB_i1 state machine could be at fault here.  It does not handle a NACK from
the device as far as I can tell.  If a packet is corrupted on the wire, the device will
detect it and send a NACK.  That command would be ignored by the device.  If this occurred
while trying to write a sequence of bytes to the device aldb then one of the bytes could
be skipped.  I can see why the peek/poke method was obsoleted by Insteon; the code to
support it is quite fragile.
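
As a rough illustration of the NACK problem, the Python sketch below writes a sequence of
bytes and only advances to the next address after an ACK.  It is not the ALDB_i1 code; the
`send` callback, the ACK/NACK constants, and the retry count are all hypothetical.

```python
from typing import Callable, Sequence

ACK = "ack"    # hypothetical reply values returned by send()
NACK = "nack"


def write_bytes(send: Callable[[int, int], str], start: int,
                data: Sequence[int], max_tries: int = 3) -> None:
    """Write data one byte at a time, advancing only after an ACK."""
    for offset, value in enumerate(data):
        address = start + offset
        for _ in range(max_tries):
            if send(address, value) == ACK:
                break  # byte accepted, move on to the next address
        else:
            # Every attempt was NACKed: stop rather than skip the byte.
            raise RuntimeError(
                f"no ACK for address {address:#06x} after {max_tries} tries")
```

A writer that ignores the NACK would move on to the next address and leave the old byte in
place, which is the skipped-byte case described above.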

> > packets in the data below and have not seen a single one in my house.  If the packets
> > are corrupted on the power line or RF link, then I think they should be dropped by the
> > PLM and never sent to MH.  There is an overall checksum on the powerline / RF packets that
> > should prevent corruption there from being sent to MH.  I suspect the corruption is
> > between the PLM and the MH PC in your case.
> 
> I'm not certain that's the case. I'm pretty sure that some of my corruption
> is on the powerline itself. Are the I1 peek/poke packets really all
> protected by CRCs?

Yes, see the attached document.

> Can you explain how I have I1 KPL secondary buttons that occasionally will
> switch on when I have absolutely nothing linked to them?

No. But are you "sure" nothing in your house links to them?  You would need to scan the
aldb of every device in the house to be 100% sure.

> If you're very sure, we could look into the PLM/USB connection, but somehow
> I'm thinking USB data doesn't just get corrupted without checksums either,
> or does it?

Good question.  I don't know the USB protocol well enough to know if there are protections
against corruption.  Probably....

> 
> > I have more comments in line below:
> (...)
> > I just checked in a fix for that issue, but I can't test it because it requires a
> > specific message sequence I can't emulate.  There are other issues I've found with
> > sync_links() that I'm working on now.
> 
> Let me get your new code and re-run my test.
> 
> > > Am I correct that your code knows about I2 devices but still speaks I1 to
> > > them when talking to ALDB? (because this is what happened to me).
> >
> > I only changed the peek/poke procedure to 0x2F ALDB read/write for I2CS devices.  I2
> > devices support peek/poke and I left the code unmodified for those.  My reasoning is
> > that I believe most people have I2 devices and those are working under the I1 aldb
> > process.  I only want to expose people to my new, unstable code if they have an i2CS
> > device (which otherwise does not work at all with the old code).  If we get the new I2
> > aldb process working reliably, it will be easy to enable it for I2 devices as well but
> > I want to wait. You should select I2CS devices when testing the new code.
> 
> Understood, along with the rationale. I'll be happy to switch to your new
> code for I2 too though, because syncing all links is _very_ slow, and so
> slow on a remotelinc that the pairing listen mode on the remote times out
> before all the links can be sent with the I1 protocol.
> 
> > Did you scan links for $fmr_lamp before trying to sync links?  From what I can tell
> 
> I'm pretty sure I had, actually sync all links even. I'll do it again with
> your new code though.

Sync all links doesn't do a scan.  It simply calls sync links for each device.  You need
to have completed a successful scan of the device "and" the PLM before trying to sync
links on a device.  I think we might need some defensive code here to protect the user
from a mistake that isn't at all obvious.  I'll see if there are any protections already
in the code but if there is protection logic it isn't working.
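
The kind of defensive check meant here could look something like the Python sketch below.
It is only an illustration of the idea, not existing MH code; the `LinkTable` class, the
`scanned_ok` flag, and `check_ready_to_sync` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LinkTable:
    name: str
    scanned_ok: bool = False  # set True only after a complete, error-free scan


def check_ready_to_sync(plm: LinkTable, device: LinkTable) -> List[str]:
    """Return a list of problems; an empty list means sync may proceed."""
    problems = []
    for table in (plm, device):
        if not table.scanned_ok:
            problems.append(f"{table.name}: run 'scan link table' first")
    return problems
```

Sync links would call this first and refuse to write anything while the list is non-empty,
so a missing scan shows up as a clear message instead of a bad aldb write.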

> > If you suspect the mh aldb is not right then:
> > Select "scan link table" on the PLM
> > Select "scan link table" on the device
> > You shouldn't need to do this often.  MH should save / restore this data when stopping /
> > restarting / reloading MH
> 
> Yes it does. I just remember though that Gregg's code that does a full
> nightly scan is disabled for me, partially because it was sometimes hanging
> the bus and causing everything to go pear shaped after that.

You should probably keep that disabled until we get all of this working.  For now just do
a manual scan before trying to test anything that modifies the aldb.

> > When testing sync links:
> > Select "log links" on the PLM
> > Select "log links" on the test device
> > Select "sync links" on the test device
> > Select "log links" on the PLM
> > Select "log links" on the device
> >
> > That will give us the before and after snapshot of the aldb when trying to
> > troubleshoot the issues.
> 
> Yep, I've done that on occasion, I just didn't know how many hundreds of
> lines of logs you wanted :)

In my day job we get customer field and lab issues with logs that are tens of thousands of
lines long.  It turns out to take a "lot" of work to set up a phone call over IP.  Luckily
that's not my primary job but I still get pulled into plenty of troubleshooting.
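
On the before/after "log links" snapshots: with long logs, a small diff helper makes the
changes easier to spot.  This Python sketch uses the standard difflib module and assumes
each snapshot has been saved as a list of text lines; the helper name and the sample line
contents used with it are invented for illustration and are not real MH output.

```python
import difflib
from typing import List


def diff_link_logs(before: List[str], after: List[str]) -> List[str]:
    """Return unified-diff lines between two 'log links' snapshots."""
    return list(difflib.unified_diff(
        before, after, fromfile="before", tofile="after", lineterm=""))
```

Lines starting with "+" were added by sync links and lines starting with "-" were removed;
an empty result means the aldb did not change.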

> On Thu, Jan 17, 2013 at 01:28:38PM -0800, Kevin Robert Keegan wrote:
> > It looks like the individual who coded these two subroutines expected
> > get_first_empty_address to return 0 if no address could be found.  An
> > If statement in add_link is setup to return an error warning the user
> > to scan the link table if a 0 is returned.  I just submitted a pull
> > request for a simple solution that corrects this.
> 
> Thanks for that.
> 
> On Thu, Jan 17, 2013 at 05:53:54PM -0600, Michael Stovenour wrote:
> > Nice solution.  I'll test it on i2CS devices and maybe Marc can test it on i1/i2.
> >
> > Marc,
> > I've checked in Kevin's change to my repository.
> 
> Ok, let me get your new tree and try again.
> 
> First thing is scan link table doesn't work on my I2CS device anymore:

Ok, I'll take this part to another mail thread.