Thread: [Gpsbabel-code] Garmin USB "Bad payload size" error
Brought to you by:
robertl
From: Martin B. <mb-...@gr...> - 2010-06-13 21:49:10
|
Hi, some weeks ago, I mentioned that I'm occasionally getting a "Bad payload size" error while downloading tracks from a Garmin Forerunner 305 via USB on Linux using the latest gpsbabel from CVS. Gerhard Olsson noticed similar problems on Windows. When the problem occurs, the first byte of a data packet on USB gets lost and the rest of the packet is shifted by one byte (in my tests, the following packets were OK again). In the meantime, I switched to a new PC and on that, I'm getting the error quite regularly (in about 50% of all track download attempts). While this is kind of annoying, it gave me the possibility to analyze the problem better. I noticed that the problem is timing dependent: Running gpsbabel with -D10 and thus slowing it down made the problem occur less frequently, running it under strace made it disappear completely. So I tried adding a small delay at the beginning of gusb_cmd_get() in jeeps/gpsusbcommon.c which is called for each trackpoint during download. As soon as I made this delay larger than ~500us, the problem disappeared even without -D10 or strace! So this raises a few questions: * Does anybody think that this is a suitable fix (it slows down download of ~12000 trackpoints by ~30% or 7s, but IMO it's better to have a slower but 100% reliable download)? * If so, should it be done for all Garmin USB devices or only on those known to show the problem? * Does anybody know a way to do such short delays on all supported operating systems? usleep()/nanosleep() on Linux works well with recent kernels, but older kernels rounded up the delay to 10ms which would make download way to slow. MacOS probably is similar to old Linux kernels and Sleep() on Windows doesn't even support sub-ms arguments. So we'll probably have to use busy waiting with gettimeofday() calls in the wait loop. Is it safe to assume that gettimeofday() has at least 100us resolution on all platforms? What should be used on Windows? Thanks, Martin |
From: Robert L. <rob...@gp...> - 2010-06-13 22:09:37
|
On Sun, Jun 13, 2010 at 4:33 PM, Martin Buck < mb-...@gr...> wrote: > Hi, > > some weeks ago, I mentioned that I'm occasionally getting a "Bad payload > size" error while downloading tracks from a Garmin Forerunner 305 via USB > on > Linux using the latest gpsbabel from CVS. Gerhard Olsson noticed similar > problems on Windows. When the problem occurs, the first byte of a data > packet on USB gets lost and the rest of the packet is shifted by one byte > (in my tests, the following packets were OK again). > > In the meantime, I switched to a new PC and on that, I'm getting the error > quite regularly (in about 50% of all track download attempts). While this > is > kind of annoying, it gave me the possibility to analyze the problem better. > Wow. When this came up twice in rapid succession a few weeks ago, I ran my 305 overnight in a loop reading points and never got it. So I tried adding a small delay at the beginning of gusb_cmd_get() in > jeeps/gpsusbcommon.c which is called for each trackpoint during download. > As > soon as I made this delay larger than ~500us, the problem disappeared even > Yikes. I'd sure like to see on a protocol analyzer what's going on. Can you see any kind of pattern to the failures? Is it like it's always the first packet after we send one or always during a certain command/ack sequence or something? If we could safely avoid the delay inside the read of individual trackpoints, or just pad after certain writes or something, that would help a LOT. (It could also make a a fragile problem even more fragile if it's really not a 100% fix.) * Does anybody think that this is a suitable fix (it slows down download of > ~12000 trackpoints by ~30% or 7s, but IMO it's better to have a slower but > 100% reliable download)? > * If so, should it be done for all Garmin USB devices or only on those > known > to show the problem? > I don't have any reports of this happening on anything *but* the FR305 and I'd sure hate to penalize everyone else. > * Does anybody know a way to do such short delays on all supported > operating > systems? usleep()/nanosleep() on Linux works well with recent kernels, but > older kernels rounded up the delay to 10ms which would make download way > to > FWIW, gb_sleep wraps this stuff up. The fallthrough "HAVE_SLEEP" case is used approximately never. For some classes of products, fretting over ten year old Linux kernels makes sense but for us, it really doesn't. People connecting USB devices tend to be "desktoppy" and not totally unaware that the computer industry changes. This particular kernel/library feature has been around so long that any Linux distro even vaguely supported will have it, so I wouldn't worry about it. > slow. MacOS probably is similar to old Linux kernels and Sleep() on > MacOS has had it for a long time. > Windows doesn't even support sub-ms arguments. So we'll probably have to > use busy waiting with gettimeofday() calls in the wait loop. Is it safe to > assume that gettimeofday() has at least 100us resolution on all platforms? > What should be used on Windows? gb_sleep uses Sleep. RJL |
From: Martin B. <mb-...@gr...> - 2010-06-16 09:36:15
|
On Sun, Jun 13, 2010 at 05:09:30PM -0500, Robert Lipe wrote: > Wow. When this came up twice in rapid succession a few weeks ago, I ran my > 305 overnight in a loop reading points and never got it. I'm also surprised that I'm not seeing more reports of this. I tried it on 4 PCs and I'm seeing it on 2 of them (only rarely on one, and so frequently that gpsbabel is nearly unusable for download on the other one which unfortunately is my new main PC). > Yikes. I'd sure like to see on a protocol analyzer what's going on. Me too. Unfortunately, I don't have one and my logic analyzer is too slow for USB 2.0. > Can you see any kind of pattern to the failures? Is it like it's always the > first packet after we send one or always during a certain command/ack > sequence or something? Absolutely no obvious pattern. It always happens somewhere in the middle of a long track log download. The exact location changes every time. I also added timestamps to gpsbabel's track download to see whether there are any timing changes when the problem occurs, but there's nothing obvious. The weird thing is: Since this happens in the middle of a track download, we're not sending any commands to the device at that time. We're just reading whatever the device sends us and sometimes it simply sends bogus data. Here's a small sample from a download with gettimeofday() calls before/after the gusb_llops->llop_get_intr() call in gusb_cmd_get() (1st timestamp is before, 2nd is after the get_intr corresponding to the following RX (intr) line): RX timestamps: 1275596996.971612/1275596996.972550 RX (intr) [36]:14 00 00 00 22 00 00 00 18 00 00 00 a6 0d aa 20 21 c4 56 06 1b a0 52 26 24 94 23 44 5f 51 da 45 00 ff 00 00 ..................V...R....D.Q.E....(TRKDAT ) RX timestamps: 1275596996.972587/1275596996.974575 RX (intr) [36]:14 00 00 00 22 00 00 00 18 00 00 00 f1 08 aa 20 aa c4 56 06 24 a0 52 26 44 7f 22 44 97 ab da 45 00 ff 00 00 ..................V...R.D..D...E....(TRKDAT ) RX timestamps: 1275596996.974612/1275596996.975518 RX (intr) [36]:14 00 00 00 22 00 00 00 18 00 00 00 b4 05 aa 20 d4 c5 56 06 2b a0 52 26 9c 56 23 44 5a eb da 45 00 ff 00 00 ..................V...R..V.DZ..E....(TRKDAT ) RX timestamps: 1275596996.975556/1275596996.977552 RX (intr) [36]:14 00 00 00 22 00 00 00 18 00 00 00 28 02 aa 20 ca c6 56 06 32 a0 52 26 2c 0f 24 44 cb 30 db 45 00 ff 00 00 ..................V.2.R....D.0.E....(TRKDAT ) RX timestamps: 1275596996.977590/1275596996.978572 RX (intr) [36]:14 00 00 00 22 00 00 00 18 00 00 00 8c fd a9 20 a8 c7 56 06 3a a0 52 26 d8 37 23 44 0f 8a db 45 00 ff 00 00 ..................V...R..7.D...E....(TRKDAT ) RX timestamps: 1275596996.978609/1275596996.980560 RX (intr) [36]:14 00 00 00 22 00 00 00 18 00 00 00 92 f8 a9 20 c7 c5 56 06 43 a0 52 26 a8 d1 23 44 69 ec db 45 00 ff 00 00 ..................V.C.R....Di..E....(TRKDAT ) RX timestamps: 1275596996.980597/1275596996.981579 RX (intr) [36]:14 00 00 00 22 00 00 00 18 00 00 00 ad f3 a9 20 84 c3 56 06 4c a0 52 26 3c 8a 24 44 cb 4e dc 45 00 ff 00 00 ..................V.L.R....D.N.E....(TRKDAT ) RX timestamps: 1275596996.981616/1275596996.983577 RX (intr) [35]:00 00 00 22 00 00 00 18 00 00 00 8f ef a9 20 42 c1 56 06 53 a0 52 26 44 05 25 44 6e a2 dc 45 00 ff 00 00 ...............B.V.S.R.D..Dn..E....(UNKNOWN ) [ERROR] GPS_Packet_Read_usb: Bad payload size -1895825408 As you can see, the time spent in get_intr waiting for data from the device is always slightly less than either 1ms or 2ms (I guess that granularity is coming from the USB frame time of 1ms). When the error occured, it took 2ms, but other packets before also took 2ms and had correct data. And gpsbabel always seems to take slightly less than 40us to process a packet (i.e. from the end of one get_intr request to the start of the next one). > If we could safely avoid the delay inside the read of individual > trackpoints, or just pad after certain writes or something, that would help > a LOT. > > (It could also make a a fragile problem even more fragile if it's really not > a 100% fix.) 100% agreed. I'd *love* to fix this properly, but I've run out of ideas at the moment. I also had a look at the source code from http://motionbased.sourceforge.net/ and didn't see any workarounds/fixes for this problem. They seem to talk to the USB devices in a similar way as gpsbabel. So the more I think about it, the less I understand why only Gerhard and I see this problem. > FWIW, gb_sleep wraps this stuff up. The fallthrough "HAVE_SLEEP" case is > used approximately never. For some classes of products, fretting over ten > year old Linux kernels makes sense but for us, it really doesn't. People > connecting USB devices tend to be "desktoppy" and not totally unaware that > the computer industry changes. This particular kernel/library feature has > been around so long that any Linux distro even vaguely supported will have > it, so I wouldn't worry about it. I don't know details about the high resolution timers in Linux and when they became enabled by default, but I'd guess it's closer to 2 years ago than 10 years. So I'd be a bit worried about this, ... > > Windows doesn't even support sub-ms arguments. So we'll probably have to > > use busy waiting with gettimeofday() calls in the wait loop. Is it safe to > > assume that gettimeofday() has at least 100us resolution on all platforms? > > What should be used on Windows? > > > gb_sleep uses Sleep. ...especially if it will hit *all* Windows users (their downloads will get slower by a factor 10, since 10ms seems to be the smallest granularity you'll get from Sleep()). Another possible fix might be to continue to read as fast as possible and try to fix the data afterwards by shifting the packet contents back by one byte if they look corrupted (and we know exactly what kind of corruption to expect, so this would be relatively easy). But it's difficult for me to decice which of the two possible fixes I consider less disgusting. ;-) Martin |
From: Robert L. <rob...@gp...> - 2010-06-16 20:01:59
|
> > Yikes. I'd sure like to see on a protocol analyzer what's going on. > > Me too. Unfortunately, I don't have one and my logic analyzer is too slow > for USB 2.0. It's a full-speed, not a high-speed interface. > > Can you see any kind of pattern to the failures? Is it like it's always > the > > Absolutely no obvious pattern. It always happens somewhere in the middle > of a long track log download. The exact location changes every time. I also > Gross. weird thing is: Since this happens in the middle of a track download, we're > not sending any commands to the device at that time. We're just reading > whatever the device sends us and sometimes it simply sends bogus data. > There's not even an ack/nack at that point so we can signal a retry. As you can see, the time spent in get_intr waiting for data from the device > is always slightly less than either 1ms or 2ms (I guess that granularity > is coming from the USB frame time of 1ms). When the error occured, it took > Yes, this protocol does a terrible job of efficiently saturating the bus. 2ms, but other packets before also took 2ms and had correct data. And > gpsbabel always seems to take slightly less than 40us to process a packet > There are some things we could do differently to speed up our end, but it's not at all clear to me that it's a raw performance issue. "Obvious" things include preallocation of the trackpoints and moving to a dedicated reader/writer thread talking to the GPS while the main thread is slurping out of a queue. The second is rather unpleasant from a cross-platform view with our existing infrastructure. > > If we could safely avoid the delay inside the read of individual > > trackpoints, or just pad after certain writes or something, that would > help > > a LOT. > > > > (It could also make a a fragile problem even more fragile if it's really > not > > a 100% fix.) > > 100% agreed. I'd *love* to fix this properly, but I've run out of ideas at > the moment. I also had a look at the source code from > http://motionbased.sourceforge.net/ and didn't see any workarounds/fixes > for this problem. They seem to talk to the USB devices in a similar way as > gpsbabel. So the more I think about it, the less I understand why only > I didn't know about this project. Windows has an advantage over our libusb stuff in that there's a driver running that can always catch packets. On our Mac and Linux libusb stuff, we have to make very expensive system call mode transitions to get packets on and off the bus. USB devices never send unsolicited packets; they only repond to poll packets from the host. Maybe there's something in libusb mishandling things. The common thread is "not Windows", right? > > year old Linux kernels makes sense but for us, it really doesn't. > People > > connecting USB devices tend to be "desktoppy" and not totally unaware > that > > the computer industry changes. This particular kernel/library feature > has > > I don't know details about the high resolution timers in Linux and when > they became enabled by default, but I'd guess it's closer to 2 years ago > than 10 years. So I'd be a bit worried about this, ... I wouldn't. Those versions roll so quickly and the users aggressively upgrade. It's not like an even a two year old Linux installation is getting security updates. > > > Windows doesn't even support sub-ms arguments. So we'll probably have > to > > gb_sleep uses Sleep. > > ...especially if it will hit *all* Windows users (their downloads will get > slower by a factor 10, since 10ms seems to be the smallest granularity > you'll get from Sleep()). > That's probably a deal-breaker. Another possible fix might be to continue to read as fast as possible and > try to fix the data afterwards by shifting the packet contents back by one > byte if they look corrupted (and we know exactly what kind of corruption to > expect, so this would be relatively easy). > > But it's difficult for me to decice which of the two possible fixes I > consider less disgusting. ;-) I'm actually liking that last idea. It seems the corruption is always the same; we always lose exactly the first byte, right? It's not like we need that byte and it's not like we have to differentiate various forms of corruption. I'd be much more reluctant to implement packet guessing in the serial side, but on the USB side, maybe it's not that terrible to just put that leading byte back in and let the upper layers deal with the error case if we "fixed" a packet and it still wasn't right. RJL |
From: Martin B. <mb-...@gr...> - 2010-06-19 20:31:05
Attachments:
gpsbabel-garmin-fix-corrupt-packets.diff.gz
|
On Wed, Jun 16, 2010 at 03:01:52PM -0500, Robert Lipe wrote: > It's a full-speed, not a high-speed interface. Good to know. Then I might actually give my logic analyzer a try. > There are some things we could do differently to speed up our end, but it's > not at all clear to me that it's a raw performance issue. I don't think speeding up gpsbabel would help. Gpsbabel only needs 40us for processing between trackpoints and then it has to wait nearly 1ms or 2ms for the next track point. And I don't know whose fault it is that sometimes a USB frame is wasted, but I'd assume that it's the USB stack, because gpsbabel's processing time is pretty constant. > Windows has an advantage over our libusb stuff in that there's a driver > running that can always catch packets. On our Mac and Linux libusb stuff, > we have to make very expensive system call mode transitions to get packets > on and off the bus. USB devices never send unsolicited packets; they only > repond to poll packets from the host. Maybe there's something in libusb > mishandling things. The common thread is "not Windows", right? No, Gerhard saw this on Windows, so I guess the only common thing is "Forerunner 305". And even if it were a libusb-only problem, I can't imagine what libusb could do so wrong to cause single bytes from a packet to be lost. And even the syscall-overhead shouldn't be a problem here. If I understand correctly, a device can only do one interrupt transfer per USB frame, so no matter how (in)efficient the host implementation is, as long as the device doesn't switch to bulk transfers, you wouldn't see a big difference. > I'm actually liking that last idea. It seems the corruption is always the > same; we always lose exactly the first byte, right? It's not like we need > that byte and it's not like we have to differentiate various forms of > corruption. I'd be much more reluctant to implement packet guessing in the > serial side, but on the USB side, maybe it's not that terrible to just put > that leading byte back in and let the upper layers deal with the error case > if we "fixed" a packet and it still wasn't right. OK, I gave this a try and it seems to work pretty well. To test it, I've done now 100 downloads with 12000 track points each. I got 61 corrupted track points and they all were fixed properly. To detect corruption, I'm relying on the fact that the reserved bytes in the packet header are initialized to 0 (which is guaranteed according to the Garmin spec). If the corruption occurs, some of them become nonzero, because parts of the packet ID or the packet length end up there. Without making any assumptions about which packet exactly gets corrupted (so far it always hit a track point, but that's probably just because they occur most often), that's the best I can do, but it seems to work pretty well. All this is only enabled for Forerunner 305 at the moment. What do you think about this patch? Martin Index: jeeps/gpsusbread.c =================================================================== RCS file: /cvsroot/gpsbabel/gpsbabel/jeeps/gpsusbread.c,v retrieving revision 1.8 diff -u -r1.8 gpsusbread.c --- jeeps/gpsusbread.c 26 Mar 2010 03:09:20 -0000 1.8 +++ jeeps/gpsusbread.c 19 Jun 2010 18:27:29 -0000 @@ -62,6 +62,28 @@ goto do_over; } + /* We sometimes get corrupted packets during a track log transfer + * where the first byte in a packet is lost, causing all remaining + * bytes in this packet to be shifted. So far, this has only been + * observed on a Forerunner 305 (both on Linux and Windows). The + * cause is unknown, but it seems to be timing dependent. + * We try to detect the corruption mainly by checking reserved bytes + * 3 and 7 which normally should be 0, the remaining comparisons are + * only sanity checks and they alone could also trigger in case of + * valid packets. Note: We can't detect corrupted packets with an ID + * or length that's a multiple of 256, but such corrupted packets + * haven't been observed so far. + */ + if (gps_save_id == 484 + && pkt.gusb_pkt.type == 0 && pkt.gusb_pkt.reserved1 == 0 + && pkt.gusb_pkt.reserved2 == 0 && pkt.gusb_pkt.reserved3 != 0 + && pkt.gusb_pkt.pkt_id[0] == 0 && pkt.gusb_pkt.pkt_id[1] == 0 + && pkt.gusb_pkt.reserved6 == 0 && pkt.gusb_pkt.reserved7 != 0) { + GPS_Warning("Received USB packet looks corrupted, trying to fix"); + memmove(&pkt.dbuf[1], &pkt.dbuf[0], sizeof(pkt) - 1); + pkt.gusb_pkt.type = 20; + } + /* * Populate members of serial packet from USB packet. The * copy here seems wasteful, but teaching all the callers about @@ -73,6 +95,10 @@ payload_size = le_read32(&pkt.gusb_pkt.datasz); if (payload_size<0 || payload_size>MAX_GPS_PACKET_SIZE) { + /* If you get this, the packet might have been corrupted + * by the unit. Have a look at the corruption detection + * code above. + */ GPS_Error("GPS_Packet_Read_usb: Bad payload size %d", payload_size); gps_errno = FRAMING_ERROR; return 0; |
From: Robert L. <rob...@gp...> - 2010-06-19 22:54:20
|
On Sat, Jun 19, 2010 at 3:30 PM, Martin Buck < mb-...@gr...> wrote: > On Wed, Jun 16, 2010 at 03:01:52PM -0500, Robert Lipe wrote: > > I'm actually liking that last idea. It seems the corruption is always > the > > same; we always lose exactly the first byte, right? It's not like we need > > that byte and it's not like we have to differentiate various forms of > OK, I gave this a try and it seems to work pretty well. To test it, I've > done now 100 downloads with 12000 track points each. I got 61 corrupted > track points and they all were fixed properly. > Sometimes we have to do icky things. This seems like a sound workaround given the hand we're dealt on this one. Thanx for sticking with it. I commtited it. I did drop the warning as it's not like there's anything a user can do about it. |
From: Gerhard O. <ger...@gm...> - 2010-06-20 12:37:16
|
What Packet IDs can be expected? The Garmin iop_spec says 2,5,6, but I seem to get other values. The testcase where I mostly get this problem is loading courses (The courses are read first I believe). In that case _D29 gives packet ID 0x2904, why the check for "pkt.gusb_pkt.pkt_id[0] == 0" will not trigger the fix, so it is not helping. reserved3 and pkt_id[0] should not be tested at all or at least pkt_id[0] should be tested != 0? I believe I get the error more frequently when the Forerunner 305 is charging, the unit is connected via a hub. /Gerhard RX (intr) [36]:14 00 00 00 29 04 00 00 18 00 00 00 c3 64 f9 28 dc cb 90 08 03 0d 55 24 80 10 97 41 bf 2b 42 46 00 ff 00 00 .............d........U....A..BF....(CRSTDA ) RX (intr) [35]:00 00 00 29 04 00 00 18 00 00 00 70 2b f9 28 15 eb 90 08 14 0d 55 24 00 af 87 41 18 66 44 46 00 ff 00 00 ...........p.........U....A.fDF....(UNKNOWN ) On Sat, Jun 19, 2010 at 22:30, Martin Buck <mb-...@gr...> wrote: > + && pkt.gusb_pkt.pkt_id[0] == 0 && pkt.gusb_pkt.pkt_id[1] == 0 |
From: Gerhard O. <ger...@gm...> - 2010-06-20 23:30:52
|
Answering my own question: This is protocol L001 in the iop_spec. The packet ID is below 256 for all but fitness protocols, where there are some in the range 990-1066. So MSB in packet id can be 0, 3, 4. On Sun, Jun 20, 2010 at 14:37, Gerhard Olsson <ger...@gm...> wrote: > What Packet IDs can be expected? |
From: Martin B. <mb-...@gr...> - 2010-06-21 22:49:09
|
On Sun, Jun 20, 2010 at 02:37:10PM +0200, Gerhard Olsson wrote: > What Packet IDs can be expected? The Garmin iop_spec says 2,5,6, but I > seem to get other values. No, that's only for the USB protocol layer. For link protocol 1 (which is what's used for the actual data transfer), there are many more. I promise I read the spec but got the check wrong nevertheless. And of course, I only tested with track downloads which use packet IDs in the range supported by my original patch... Robert, could you please change the "pkt_id[0] == 0" check to "pkt_id[0] <= 4"? This should support all packet IDs in the range 0...1279 which covers all the documented ones plus a lot more. Thanks, Martin |
From: Robert L. <rob...@gp...> - 2010-06-22 03:09:42
|
On Mon, Jun 21, 2010 at 5:33 PM, Martin Buck < mb-...@gr...> wrote: > On Sun, Jun 20, 2010 at 02:37:10PM +0200, Gerhard Olsson wrote: > > What Packet IDs can be expected? The Garmin iop_spec says 2,5,6, but I > > seem to get other values. > > No, that's only for the USB protocol layer. For link protocol 1 (which is > what's used for the actual data transfer), there are many more. I promise I > read the spec but got the check wrong nevertheless. And of course, I only > tested with track downloads which use packet IDs in the range supported by > my original patch... > > Robert, could you please change the "pkt_id[0] == 0" check to > "pkt_id[0] <= 4"? This should support all packet IDs in the range 0...1279 > which covers all the documented ones plus a lot more. > > Thanks, > Martin > |
From: Robert L. <rob...@gp...> - 2010-06-22 03:10:33
|
On Mon, Jun 21, 2010 at 5:33 PM, Martin Buck < mb-...@gr...> wrote: Robert, could you please change the "pkt_id[0] == 0" check to > "pkt_id[0] <= 4"? This should support all packet IDs in the range 0...1279 > Committed. (Comcast only now decided to bless me with a smattering of internets for the first time today...) |