Let me try to address them in order.

1) I was told one device wasn't healthy anymore even though a full
"scan all links" had completed:
25/03/2013 22:15:49  [Insteon::AllLinkDatabase] Delete orphan links: skipping check for reciprocal links from $gar_outlights_kpl because health: out-of-sync. Please rescan this device!!

The link table for gar_outlights_kpl seems to go well at the start of your sync_links: it increments from 16 to 27 between 22:14:44 and 22:15:21.  Around 22:15:27, MH adds another link to the device (this should cause the link table version to bump to 28).  MH requests the version number and receives back CC.  This message is combined with the PLM ACK in this line:

25/03/2013 22:15:27  [Insteon_PLM] DEBUG3: Received PLM raw data: 026220fb3005190006025020fb3018d4ce22cc40

Breaking these two messages apart, the second message is:

025020fb3018d4ce22cc40
Version number CC is way off; that would mean there were nearly 200 writes to the device's link table when MH only expected 1.  Seeing nothing wrong with this message, MH saves CC as the current version number.
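To make the parsing above concrete, here is a minimal sketch of how that combined raw data splits into the PLM ACK and the inbound standard message, and where the suspect version byte sits.  This is illustrative Python, not MH's actual parser, and it assumes only the two fixed-length frame types seen in this log:

```python
# Combined raw data from the log line above.
RAW = "026220fb3005190006025020fb3018d4ce22cc40"

# Hex-character lengths of the two frame types involved (assumed standard,
# non-extended messages):
#   0x0262 echo: 02 62 + to(3) + flags(1) + cmd1 + cmd2 + ACK      -> 9 bytes
#   0x0250 recv: 02 50 + from(3) + to(3) + flags(1) + cmd1 + cmd2  -> 11 bytes
FRAME_LEN = {"0262": 18, "0250": 22}

def split_frames(raw):
    frames, i = [], 0
    while i < len(raw):
        length = FRAME_LEN[raw[i:i + 4]]   # raises KeyError on unknown types
        frames.append(raw[i:i + length])
        i += length
    return frames

ack, std_msg = split_frames(RAW)
# In the 0x0250 standard message, the last two bytes are cmd1 and cmd2;
# cmd1 carries the reported link-table version (here the suspect CC) and
# cmd2 the on-level (here 40, a dimmed level).
cmd1, cmd2 = std_msg[-4:-2], std_msg[-2:]
```

Splitting this way yields `026220fb3005190006` (the PLM echo/ACK) and `025020fb3018d4ce22cc40` (the inbound message), with cmd1 = CC and cmd2 = 40.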

MH later receives what appears to be the correct link table version packet.  It is this line:

25/03/2013 22:15:27  [Insteon_PLM] DEBUG3: Received PLM raw data: 025020fb3018d4ce212800

Unfortunately, MH believes that this is merely a duplicate or corrupt message and ignores it.

A few lines down, when MH tries to write another link to the device around 22:15:30, the device table is reported as version 28, which doesn't match CC.  MisterHouse, thinking that something funny has occurred, won't allow the device to be written to until it is rescanned.  Sync Links is then prevented from adding a number of links to this device.

I can't explain why we receive a link table version of CC.  It is otherwise a valid packet, although cmd2 is 40, which would suggest the light is dimmed and not off.  This looks like a classic corrupt message, but unfortunately, due to the nature of the Standard Message, we are unable to detect the corruption.

The possible solutions are:
1. Catch that version CC is wrong and re-request the link table version.  -- This would help in this case, but we could still receive corruption when the link table variation is much closer.  The question would then become: where did the corruption occur?  In the reporting of the link table version, or in writing the address to the device?  Do we ever try to re-write the link table data?
2. Don't ask for the link table version after writing an address.  -- In essence, always assume that what was supposed to be written was done successfully.  This would cut down on the messages sent and would proportionally decrease the number of potentially corrupt messages.
3. Do nothing.  In a sense, the system worked the way it should here.  Something went bad and so it stops before it messes anything else up.  The feature that is lacking is proper user notification.  The error is buried in a long list of sync_link logs.  Issue #73 already raises this point.  If MH had warned Marc that some links did not sync correctly to this device, then he could have fixed it rather than think that something in delete_orphans caused the problem.
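As a rough illustration of option 1, the version check could re-request the value a few times before trusting it or giving up.  This is a hypothetical sketch, not MH code; the function name and retry count are made up:

```python
# Minimal sketch of option 1 (hypothetical, not MH's API): if the reported
# link-table version doesn't match what we expect after a write, ask the
# device again a few times before trusting it or flagging it out-of-sync.
def confirm_aldb_delta(expected, request_version, max_retries=3):
    """request_version() stands in for querying the device for its version."""
    for _ in range(max_retries):
        reported = request_version()
        if reported == expected:
            return reported    # plausible value; safe to store
    return None                # still off after retries: flag device for rescan
```

In the scenario above, the single corrupt CC reply would be retried, and the later, correct 28 would be accepted.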

N.B. Marc, to save time, you don't need to conduct a complete rescan of all devices.  Selecting "Scan Changed Device Links" from the PLM or selecting "Scan Device Links" from the specific out-of-sync device would have brought the link table version back in line and allowed sync-links to complete properly.  

The idea with the link table version is that you should never need to use "Scan All Device Links" from the PLM again.  You can always select "Scan Changed Device Links" and MH will interrogate each device and only scan those that need to be scanned.  The "Scan All Device Links" is left in there as a bit of a security blanket for the time being, but can likely be dropped eventually.
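The "only scan what changed" idea above can be sketched as follows.  The Device class and method names here are hypothetical placeholders, not MisterHouse's real objects; the point is only the cheap-version-check-before-expensive-scan pattern:

```python
# Hedged sketch of "Scan Changed Device Links": ask each device for its
# current link-table version (one cheap status request) and run the slow
# full ALDB scan only where it differs from MH's cached value.
class Device:
    def __init__(self, name, cached_version, actual_version):
        self.name = name
        self.cached_version = cached_version
        self._actual_version = actual_version  # what the hardware would report

    def query_version(self):
        return self._actual_version            # cheap status request

    def scan_link_table(self):
        # Stand-in for the slow full link-table read; cache is current after.
        self.cached_version = self._actual_version

def scan_changed(devices):
    """Rescan only devices whose reported version differs from the cache."""
    rescanned = []
    for dev in devices:
        if dev.query_version() != dev.cached_version:
            dev.scan_link_table()
            rescanned.append(dev.name)
    return rescanned
```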

2) I had to do sync all links twice after that before it converged
See the full log:
22:14:16  [Sync all links] Starting now!

So, yes, the second run of sync-links was necessary because of the issue with gar_outlights_kpl discussed above; I see that it completes without any further errors.  (I do note that gar_outlights_kpl has a lot of links set up in it.  Given the scale of the link database on this device, I am not ready to call a single error in writing all of these links a failure.  We may have to live with one failure for a job this size; we just have to make that failure more clear and recoverable.)

The third run should not have resulted in anything happening, but I see it is "updating" a number of links added to gar_outlights_kpl.  An "update" occurs when the on-level or ramp rate changes.  I have noticed this bug before but have not found the source; it has something to do with a mismatch between what we place in the MH link table hash and what we expect to be there when we read it back later.  This is a fixable bug.
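One plausible shape for that kind of hash mismatch (purely illustrative -- I have not confirmed this is the actual cause) is a round-trip normalization problem: a value cached in one representation but read back from the device in another, so a naive comparison always reports a change:

```python
# Illustrative only, not MH code: the on-level is cached as a percentage
# but the device reports a raw 0-255 byte, so a naive comparison flags a
# spurious "update" on every run even though nothing changed.
def to_raw(percent):
    """Convert a percent on-level to the raw 0-255 byte a device reports."""
    return round(percent * 255 / 100)

cached = {"on_level": 100, "ramp_rate": 28}     # what we put in the hash
read_back = {"on_level": 255, "ramp_rate": 28}  # what the device reports

naive_changed = cached["on_level"] != read_back["on_level"]               # bogus update
normalized_changed = to_raw(cached["on_level"]) != read_back["on_level"]  # no update
```

Comparing in a single normalized representation makes the spurious update disappear.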

3) I did a scan all links overnight, and another sync all links, and
3a) it wanted to sync yet more links, even though it was done last night

Umm, I don't know what happened and when, but it looks like MH's record of your PLM link database was totally lost.  You will note that all it is trying to add are items to the PLM.  I'm not sure what happened in the intervening 10 hours between the end of the last log and this one.  It was likely a restart of some sort, but I don't know why only the PLM data was lost.  A "scan link table" of the PLM alone would have fixed this.
3b) it failed with
26/03/2013 09:16:02  [Insteon_PLM] WARN: PLM unable to complete requested PLM link table update (update/add responder record) for group: 01 and deviceid: 0fb705

This happened because the PLM already has the link, but MH doesn't know it (see the problem above).  We should probably catch this error rather than just let it die like this.  This is another fixable bug.
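Catching it might look something like this.  The PLM object and its methods here are illustrative placeholders, not MH's real interface; the point is recovering from a refused add by re-reading the PLM's table instead of dying:

```python
# Hedged sketch of handling the failed PLM add: if the PLM refuses the add,
# check whether it already holds the record and resynchronize MH's cache.
class FakePLM:
    """Toy stand-in for the real PLM, which refuses duplicate adds."""
    def __init__(self, existing):
        self.cache = {}            # MH's (possibly stale) view of the PLM table
        self._existing = set(existing)

    def add_link(self, group, device_id):
        return (group, device_id) not in self._existing

    def read_link_table(self):
        return set(self._existing)

def add_responder_link(plm, group, device_id):
    if plm.add_link(group, device_id):
        plm.cache[(group, device_id)] = True
        return "added"
    # Add failed: the PLM may already hold this link without MH knowing.
    if (group, device_id) in plm.read_link_table():
        plm.cache[(group, device_id)] = True   # bring MH's cache back in line
        return "already-present"
    return "failed"                             # genuine error: warn the user
```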

Also, I used my now very long filter of normal messages to find interesting ones
in case they help. 

I don't see anything in your filtered logs that was not already addressed or that is out of the ordinary.