From: Jason C. <jas...@gm...> - 2013-10-30 04:41:34
|
I have 17 Gumstix in front of me. I'm having a serious memory corruption issue with two of them. First I will describe the issue, then what I know about the hardware and software on each. I will refer to the working group of Gumstix as OK and the non-working group as BAD. The OK ones are Overo WaterStorms; the BAD ones I *think* are WaterStorms (see below). My primary goal is to figure out what is happening and why; actually fixing the BAD ones is a lower priority.

*Issue:*

I am observing a consistent RAM corruption issue while running software on the BAD Gumstix. I first noticed it when I cloned an SD card full of working software, ran it on one of the BAD ones, and after about 10 minutes the X framebuffer became corrupt (slowly displaying garbage starting at the top, as if something else was writing trash to it).

I wrote some test software that simply malloc'd memory in 4k chunks and initialized it to a pattern of alternating 00 and FF bytes, running until all free memory was used. When I did this on the two BAD devices, all signs pointed to it blowing away existing memory: the 32bpp frame buffer turned green with a bit of other data every 1k pixels (presumably heap block headers), every running process either crashed or segfaulted, the kernel started complaining about 0xFF00FF00 address accesses and dumped blocks of data containing the pattern, and the device quickly froze.

The issue becomes visible once about 128MB of RAM is used (swap is disabled). It is behaving as if physical memory addresses are inappropriately wrapping around (possibly at 128MB, but I'm not sure): while the kernel thinks it's addressing higher areas of memory, it's actually blowing away lower areas. The OK devices were well-behaved and had no issues.

*Hardware:*

The OK devices were purchased within the last two months. Their packaging is labeled "Overo(R) WaterSTORM CC GUM3703W". The product number on the SD slot sticker is "GS3703W-R3949".
The board has "PF3503-R3949" printed next to the pads where the wifi goes. The devices do not have wifi. These devices are fresh out of the box.

The BAD devices were purchased about 8 months ago. They *should* be WaterStorms. Their packaging is lost (I was not responsible for purchasing them, but I am currently trying to obtain the original invoice / packing list). The product number on the SD slot sticker is worn off; on the best one all I can read is "GS3???W-R?358" (above it, it says "W/O 28498"; not sure if that's a serial number or something that is useful for identification). The board also has "PF3503-R3949" printed next to the wifi pads. The devices do not have wifi. These devices were previously used in another project.

As an aside, in the project that used the BAD boards, we are currently attempting to assess a failure of nearly 40 devices within a 2 week period after about 4 months of running. Our initial guess was SD write failures, and on our off-site prototypes (where the two BAD Gumstix that I'm using now came from), we observed significant but still unexplained filesystem corruption after running for 3 weeks with no power cycles. After being unable to observe any actual acute SD failures, I am now wondering if this memory corruption issue is the cause of the problems in that project. (I discovered the memory corruption issue when recycling an unused prototype from this previous project for use in a new project.) Anyways...

*Software*:

Currently, all devices are being booted from an SD card containing a Linaro image created by following the instructions at http://www.b1gtuna.com/2012/08/installing-linaro-image-on-gumstix-overo/. There is our own custom software on there which is essentially just a fullscreen video player that starts with the window manager. Everything else is fairly stock.

I do not know the history of the BAD devices, but I do know that they used to be running from SD cards that had an extremely custom Angstrom image on them, built from scratch by a coworker (I am working to find out more information).

There are some differences when booting all of these devices from nand (with no SD card in them). I'm leaving lots of hopefully unimportant info out to keep from cluttering up this email with output. Things are slightly more complicated here:

All of the boards say these things on boot with no SD card inserted (sorry, this may be TMI since I usually use the SD card, and u-boot is loaded from there, but I'm trying to give some background info):

Texas Instruments X-Loader 1.5.0 (Sep 16 2011 - 12:10:02)
...
DRAM: 512 MiB
NAND: 512 MiB

However, *most* of the new OK group says:

OMAP36XX/37XX-GP ES2.1, CPU-OPP2, L3-165MHz, Max CPU Clock 1 Ghz
U-Boot 2010.12-00023-g4eb0f5e (Sep 16 2011 - 13:51:55)
NAND read: device 0 offset 0x280000, size 0x400000
Image Name: Angstrom/2.6.39/overo
Data Size: 2914708 Bytes = 2.8 MiB

*There is one exception!* I have a board in the OK group that I'm using for development that I put one of the old custom Angstrom SD cards in once. It says the same thing that the BAD group says (below) except for the processor type:

OMAP36XX/37XX-GP ES2.1, CPU-OPP2, L3-165MHz, Max CPU Clock 1 Ghz
U-Boot 2010.09 (Oct 20 2010 - 10:11:49)
NAND read: device 0 offset 0x280000, size 0x400000
Image Name: Angstrom/2.6.34/overo
Data Size: 3160068 Bytes = 3 MiB

And the BAD group says:

OMAP3530-GP ES3.1, CPU-OPP2, L3-165MHz, Max CPU Clock 720 mHz
U-Boot 2010.09 (Oct 20 2010 - 10:11:49)
NAND read: device 0 offset 0x280000, size 0x400000
Image Name: Angstrom/2.6.34/overo
Data Size: 3160068 Bytes = 3 MiB

So basically, everything that was touched by that old Angstrom image has an older version of u-boot and a kernel in nand (although note that the OK device with the old u-boot still does not have any memory issues).

Additionally, in /proc/cpuinfo, the OK group says:

Processor : ARMv7 Processor rev 2 (v7l)

The BAD group says (yes, it says rev 3 even though it's older than the OK group and has a slower clock):

Processor : ARMv7 Processor rev 3 (v7l)

There is one more oddity:

All of the devices with the older u-boot (i.e. all BAD and that one OK) hang indefinitely at "booting kernel", while the rest of the OK devices boot just fine from nand (into a 2.6.39 kernel). I'm a little confused about the u-boot stuff; it does appear that the custom Angstrom SD image is perhaps writing an old u-boot to nand (if that's even possible), although those Angstrom images were 3.something kernels, so I'm not sure why the image written to nand ended up being a 2.6.34 one (but again, a coworker set up those SD images and I have no idea what he did to create them).

Most of the info above I've given just as background, because in any case, when booting with the Linaro SDs in, all the OK and BAD devices display:

U-Boot 2012.04.01 (Jul 19 2012 - 17:31:34)
...
## Booting kernel from Legacy Image at 80000000 ...
Image Name: Ubuntu Kernel
...
Data Size: 4356000 Bytes = 4.2 MiB
## Loading init Ramdisk from Legacy Image at 81600000 ...
Image Name: Ubuntu Initrd
...
Data Size: 1997195 Bytes = 1.9 MiB

Which makes me even more confused, because while the hardware (and nand images) differ between OK and BAD, the kernel, u-boot, and software on the SD cards are identical. The memory issue seems like it has something to do with these older Gumstix, perhaps some incompatibility between the slightly older devices and the Linaro images? I don't know.

Sorry, I know this has been a really long email, but I'm a bit fried, I have no good theories to go on, and I'm not sure what is important and what isn't. Does anybody have any idea what could be going on here? Please let me know what info is important, or if there are any tests I should run, or whatever.
Like I said, my primary goal is figuring out *why* this is happening. For this newer project that I recycled one of the old devices for, it's not a major issue as I can just order a new device, but if this has something to do with the massive failures on the older project I mentioned, it's extremely important that I figure out what the cause is (the devices are in a remote, difficult to access location, and we can't spend a lot of time on site troubleshooting). Thanks, Jason |
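For readers who want to try something like the fill test described in this message, here is a rough sketch. The original program is not shown in the thread (it used malloc, so it was presumably C); this Python variant only illustrates the idea. The names `fill_memory` and `count_damaged` are made up for the example, and unlike the original it stops at a caller-supplied cap rather than running until all free memory is used.

```python
import sys

CHUNK_SIZE = 4096  # 4k chunks, as in the description above
# Alternating 00 and FF bytes, one chunk long.
PATTERN = bytes([0x00, 0xFF]) * (CHUNK_SIZE // 2)


def fill_memory(max_bytes):
    """Allocate pattern-filled 4k chunks until max_bytes is reached.

    Returns the list of chunks so the caller keeps them alive.
    """
    chunks = []
    used = 0
    while used + CHUNK_SIZE <= max_bytes:
        chunks.append(bytearray(PATTERN))  # fresh buffer holding 00 FF 00 FF ...
        used += CHUNK_SIZE
    return chunks


def count_damaged(chunks):
    """Count chunks whose contents no longer match the pattern."""
    return sum(1 for chunk in chunks if chunk != PATTERN)


if __name__ == "__main__":
    # Hypothetical usage: python fill_test.py <megabytes>
    limit_mb = int(sys.argv[1]) if len(sys.argv) > 1 else 64
    held = fill_memory(limit_mb * 1024 * 1024)
    print(f"allocated {len(held)} chunks, {count_damaged(held)} damaged")
```

Re-checking the chunks after filling is an addition that the description above does not mention. As a side note, the bytes 00 FF 00 FF read as a little-endian 32-bit word give 0xFF00FF00, which matches the addresses the kernel complained about.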
From: Jeff D. <je...@gr...> - 2013-10-30 05:38:20
|
On 10/30/2013 12:40 AM, Jason Cipriani wrote:
> *Software*:
>
> Currently, all devices are being booted from an SD card containing a Linaro
> image created by following the instructions at
> http://www.b1gtuna.com/2012/08/installing-linaro-image-on-gumstix-overo/.
> There is our own custom software on there which is essentially just a
> fullscreen video player that starts with the window manager. Everything else
> is fairly stock.

I see the year 2012 in several places. How old are your Linaro kernels? It was discovered late last year that the stock Linaro Overo kernel was missing CONFIG_ARM_ERRATA_430973=y, which would lead to crashes. The OMAP35xx processor in the Water definitely needs that enabled.

> However, /most/ of the new OK group says:
>
> OMAP36XX/37XX-GP ES2.1, CPU-OPP2, L3-165MHz, Max CPU Clock 1 Ghz

WaterSTORM processor.

> And the BAD group says:
>
> OMAP3530-GP ES3.1, CPU-OPP2, L3-165MHz, Max CPU Clock 720 mHz

Water processor.

--
Jeff DeFouw <je...@gr...>
Programmer
Grand Rapids Technologies
|
From: Ash C. <as...@gu...> - 2013-10-30 07:04:13
|
Hi Jason,

Some thoughts inline.

On Tue, Oct 29, 2013 at 9:40 PM, Jason Cipriani <jas...@gm...> wrote:
> The BAD devices were purchased about 8 months ago. They *should* be
> WaterStorms. Their packaging is lost (and I was not responsible for
> purchasing them but I am currently trying to obtain the original invoice /
> packing list). The product number on the SD slot sticker is worn off, on the
> best one all I can read is "GS3???W-R?358" (above it it says "W/O 28498",
> not sure if thats a serial number of something that is useful for
> identification). The board also has "PF3503-R3949" printed next to the wifi
> pads. The devices do not have wifi. These devices were previously used in
> another project.

W/O 28498 corresponds to the work order under which these were built (details http://pubs.gumstix.com/boards/AA_README.txt). These are GS3503-R3358---Overo Water COMs.

>
> As an aside, in the project that used the BAD boards, we are currently
> attempting to assess a failure of nearly 40 devices within a 2 week period
> after about 4 months of running. Our initial guess was SD write failures,
> and on our off-site prototypes (where my two BAD Gumstix that I'm using now
> came from), we observed significant but still unexplained filesystem
> corruption after running for 3 weeks with no power cycles. After being
> unable to observe any actual acute SD failures, I am now wondering if this
> memory corruption issue is the cause of the problems in that project. (I
> discovered the memory corruption issue when recycling an unused prototype
> from this previous project for use in a new project). Anyways...

Can you confirm the part code on the POP memory chip (i.e. the line above the FBGA code shown here: http://www.micron.com/products/support/fbga)?

>
> Software:
>
> Currently, all devices are being booted from an SD card containing a Linaro
> image created by following the instructions at
> http://www.b1gtuna.com/2012/08/installing-linaro-image-on-gumstix-overo/.
> There is our own custom software on there which is essentially just a
> fullscreen video player that starts with the window manager. Everything else
> is fairly stock.
>
> I do not know the history of the BAD devices, but I do know that they used
> to be running from SD cards that had an extremely custom Angstrom image on
> them, built from scratch by a coworker (I am working to find out more
> information).
>
> There are some differences when booting all of these devices from nand (with
> no SD card in them). I'm leaving lots of hopefully unimportant info out to
> keep from cluttering up this email with output. Things are slightly more
> complicated here:
>
> All of the boards say these things on boot with no SD card inserted (sorry
> this may be TMI since I usually use the SD card, and u-boot is loaded from
> there, but I'm trying to give some background info):
>
> Texas Instruments X-Loader 1.5.0 (Sep 16 2011 - 12:10:02)
> ...
> DRAM: 512 MiB
> NAND: 512 MiB
>
> However, most of the new OK group says:
>
> OMAP36XX/37XX-GP ES2.1, CPU-OPP2, L3-165MHz, Max CPU Clock 1 Ghz
> U-Boot 2010.12-00023-g4eb0f5e (Sep 16 2011 - 13:51:55)
> NAND read: device 0 offset 0x280000, size 0x400000
> Image Name: Angstrom/2.6.39/overo
> Data Size: 2914708 Bytes = 2.8 MiB

Some standard software has been flashed. The U-boot/MLO is a little older now but should work fine.

>
> There is one exception! I have a board in the OK group that I'm using for
> development that I put one of the old custom Angstrom SD cards in once. It
> says the same thing that the BAD groups say (below) except for the processor
> type:
>
> OMAP36XX/37XX-GP ES2.1, CPU-OPP2, L3-165MHz, Max CPU Clock 1 Ghz
> U-Boot 2010.09 (Oct 20 2010 - 10:11:49)
> NAND read: device 0 offset 0x280000, size 0x400000
> Image Name: Angstrom/2.6.34/overo
> Data Size: 3160068 Bytes = 3 MiB

Looks like an older version of software was flashed. I'd update if you are using this or just run off the microSD.

>
> And the BAD group says:
>
> OMAP3530-GP ES3.1, CPU-OPP2, L3-165MHz, Max CPU Clock 720 mHz
> U-Boot 2010.09 (Oct 20 2010 - 10:11:49)
> NAND read: device 0 offset 0x280000, size 0x400000
> Image Name: Angstrom/2.6.34/overo
> Data Size: 3160068 Bytes = 3 MiB
>
> So basically, everything that was touched by that old Angstrom image has an
> older version of uboot and a kernel in nand (although note that the OK
> device with the old u-boot still does not have any memory issues).

I suspect this is not strictly a u-boot/MLO version issue (particularly if you still see issues booting with a microSD card with recent MLO/u-boot). A kernel problem or hardware issue seems more likely.

>
> Additionally, in /proc/cpuinfo, the OK group says:
>
> Processor : ARMv7 Processor rev 2 (v7l)
>
> The BAD group says (yes it says rev 3 even though its older than the OK
> group and with a slower clock):
>
> Processor : ARMv7 Processor rev 3 (v7l)

Rev 3, but of an older processor (OMAP35xx rather than DM37xx).

>
> There is one more oddity:
>
> All of the devices with the older u-boot (i.e. all BAD and that one OK) hang
> indefinitely at "booting kernel", while the rest of the OK devices boot just
> fine from nand (into a 2.6.39 kernel). I'm a little confused about the
> u-boot stuff; it does appear that the custom Angstrom SD image is perhaps
> writing an old u-boot to nand (if that's even possible) although those
> Angstrom images where 3.something kernels, I'm not sure why the image
> written to nand ended up being a 2.6.34 one (but again, a coworker set up
> those SD images and I have no idea what he did to create them).

For kernels before 2.6.36, the default console is named /dev/ttyS2 as opposed to /dev/ttyO2 on more recent kernels.
Be sure the correct console is set in u-boot so you don't miss any boot messages:

# setenv console /dev/ttyS2,115200n8
# boot

>
> Most of the info above I've given just as background info, because in any
> case when booting with the Linaro SDs in, all the OK and BAD devices
> display:
>
> U-Boot 2012.04.01 (Jul 19 2012 - 17:31:34)
> ...
> ## Booting kernel from Legacy Image at 80000000 ...
> Image Name: Ubuntu Kernel
> ...
> Data Size: 4356000 Bytes = 4.2 MiB
> ## Loading init Ramdisk from Legacy Image at 81600000 ...
> Image Name: Ubuntu Initrd
> ...
> Data Size: 1997195 Bytes = 1.9 MiB
>
> Which makes me even more confused, because while the hardware (and nand
> images) differ between OK and BAD, the kernel, u-boot, and software on the
> SD cards is identical. The memory issue seems like it has something to do
> with these older Gumstix, perhaps some incompatibility between the slightly
> older devices and the Linaro images? I don't know.

--Ash
|
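As a small illustration of the console naming rule mentioned in this reply, the sketch below picks the serial device name from a kernel version string. The function name `console_device` is made up for the example, and the 2.6.36 cutoff is taken from the reply itself rather than checked against kernel sources, so treat it as an assumption.

```python
def console_device(kernel_version, cutoff=(2, 6, 36)):
    """Return the OMAP serial console name for a kernel version string.

    The default cutoff comes from the reply above, not from kernel sources.
    """
    parts = []
    # Keep only the leading numeric fields, e.g. "3.2.1-linaro-omap" -> 3, 2, 1
    for field in kernel_version.split("-")[0].split("."):
        if not field.isdigit():
            break
        parts.append(int(field))
    return "ttyS2" if tuple(parts) < cutoff else "ttyO2"


if __name__ == "__main__":
    for version in ("2.6.34", "2.6.39", "3.2.1-linaro-omap"):
        print(version, console_device(version))
```

The versions in the usage loop are the ones that appear in the boot output quoted in this thread.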
From: Jason C. <jas...@gm...> - 2013-10-30 17:45:18
|
On Wed, Oct 30, 2013 at 3:03 AM, Ash Charles <as...@gu...> wrote:
> On Tue, Oct 29, 2013 at 9:40 PM, Jason Cipriani <jas...@gm...> wrote:
> > The BAD devices were purchased about 8 months ago. They *should* be
> > WaterStorms. Their packaging is lost (and I was not responsible for
> > purchasing them but I am currently trying to obtain the original invoice /
> > packing list). The product number on the SD slot sticker is worn off, on the
> > best one all I can read is "GS3???W-R?358" (above it it says "W/O 28498",
> > not sure if thats a serial number of something that is useful for
> > identification). The board also has "PF3503-R3949" printed next to the wifi
> > pads. The devices do not have wifi. These devices were previously used in
> > another project.
> W/O 28498 corresponds to the work order under which these were built
> (details http://pubs.gumstix.com/boards/AA_README.txt). These are
> GS3503-R3358---Overo Water COMs.
> >
> > As an aside, in the project that used the BAD boards, we are currently
> > attempting to assess a failure of nearly 40 devices within a 2 week period
> > after about 4 months of running. Our initial guess was SD write failures,
> > and on our off-site prototypes (where my two BAD Gumstix that I'm using now
> > came from), we observed significant but still unexplained filesystem
> > corruption after running for 3 weeks with no power cycles. After being
> > unable to observe any actual acute SD failures, I am now wondering if this
> > memory corruption issue is the cause of the problems in that project. (I
> > discovered the memory corruption issue when recycling an unused prototype
> > from this previous project for use in a new project). Anyways...
> Can you confirm the part code on the POP memory chip (i.e. the line
> above the FBGA code shown here:
> http://www.micron.com/products/support/fbga)?
>

In the OK group, the WaterSTORMs:

FBGA code: JW734
Part #: MT29C4G96MAZBACJG-5 WT

In the BAD group, the Waters (thanks for identifying):

FBGA code: JW513
Part #: MT29C4G96MAZAPCJG-5 IT

> >
> > Software:
> >
> > Currently, all devices are being booted from an SD card containing a Linaro
> > image created by following the instructions at
> > http://www.b1gtuna.com/2012/08/installing-linaro-image-on-gumstix-overo/.
> > There is our own custom software on there which is essentially just a
> > fullscreen video player that starts with the window manager. Everything else
> > is fairly stock.
> >
> > I do not know the history of the BAD devices, but I do know that they used
> > to be running from SD cards that had an extremely custom Angstrom image on
> > them, built from scratch by a coworker (I am working to find out more
> > information).
> >
> > There are some differences when booting all of these devices from nand (with
> > no SD card in them). I'm leaving lots of hopefully unimportant info out to
> > keep from cluttering up this email with output. Things are slightly more
> > complicated here:
> >
> > All of the boards say these things on boot with no SD card inserted (sorry
> > this may be TMI since I usually use the SD card, and u-boot is loaded from
> > there, but I'm trying to give some background info):
> >
> > Texas Instruments X-Loader 1.5.0 (Sep 16 2011 - 12:10:02)
> > ...
> > DRAM: 512 MiB
> > NAND: 512 MiB
> >
> > However, most of the new OK group says:
> >
> > OMAP36XX/37XX-GP ES2.1, CPU-OPP2, L3-165MHz, Max CPU Clock 1 Ghz
> > U-Boot 2010.12-00023-g4eb0f5e (Sep 16 2011 - 13:51:55)
> > NAND read: device 0 offset 0x280000, size 0x400000
> > Image Name: Angstrom/2.6.39/overo
> > Data Size: 2914708 Bytes = 2.8 MiB
> Some standard software has been flashed. The U-boot/MLO is a little
> older now but should work fine.
> >
> > There is one exception!
> > I have a board in the OK group that I'm using for
> > development that I put one of the old custom Angstrom SD cards in once. It
> > says the same thing that the BAD groups say (below) except for the processor
> > type:
> >
> > OMAP36XX/37XX-GP ES2.1, CPU-OPP2, L3-165MHz, Max CPU Clock 1 Ghz
> > U-Boot 2010.09 (Oct 20 2010 - 10:11:49)
> > NAND read: device 0 offset 0x280000, size 0x400000
> > Image Name: Angstrom/2.6.34/overo
> > Data Size: 3160068 Bytes = 3 MiB
> Looks like an older version of software was flashed. I'd update if
> you are using this or just run off the microSD.
>

I'm just running off the SD so it should be fine, although I have a question about this: Do some SD-based boot images flash a new u-boot onto the nand while others don't? It appears when I boot any Gumstix from these old Angstrom SD images, it downgrades the nand u-boot to 2010.09, but when I boot from the new Linaro SD images, it doesn't touch the nand u-boot.

> >
> > And the BAD group says:
> >
> > OMAP3530-GP ES3.1, CPU-OPP2, L3-165MHz, Max CPU Clock 720 mHz
> > U-Boot 2010.09 (Oct 20 2010 - 10:11:49)
> > NAND read: device 0 offset 0x280000, size 0x400000
> > Image Name: Angstrom/2.6.34/overo
> > Data Size: 3160068 Bytes = 3 MiB
> >
> > So basically, everything that was touched by that old Angstrom image has an
> > older version of uboot and a kernel in nand (although note that the OK
> > device with the old u-boot still does not have any memory issues).
> I suspect this is not strictly a u-boot/MLO version issue
> (particularly if you still see issues booting with a microSD card with
> recent MLO/u-boot). A kernel problem or hardware issue seem more
> likely.
>

Agreed. I have confirmed that bootloader versions and the previous use of the Angstrom images seem to be unrelated to the memory corruption issue. The consistent pattern I've observed so far is that when using the Linaro 3.2.1 image, WaterSTORMs behave properly and Waters show the memory issue.
I will see if I can do a more controlled test (I will also check the stock Linaro image available from the Gumstix site, which I have been avoiding for reasons mentioned in other emails).

> >
> > Additionally, in /proc/cpuinfo, the OK group says:
> >
> > Processor : ARMv7 Processor rev 2 (v7l)
> >
> > The BAD group says (yes it says rev 3 even though its older than the OK
> > group and with a slower clock):
> >
> > Processor : ARMv7 Processor rev 3 (v7l)
> Rev 3 but of an older processor (OMAP35xx rather than DM37xx)
> >
> > There is one more oddity:
> >
> > All of the devices with the older u-boot (i.e. all BAD and that one OK) hang
> > indefinitely at "booting kernel", while the rest of the OK devices boot just
> > fine from nand (into a 2.6.39 kernel). I'm a little confused about the
> > u-boot stuff; it does appear that the custom Angstrom SD image is perhaps
> > writing an old u-boot to nand (if that's even possible) although those
> > Angstrom images where 3.something kernels, I'm not sure why the image
> > written to nand ended up being a 2.6.34 one (but again, a coworker set up
> > those SD images and I have no idea what he did to create them).
> For kernels before 2.6.36, the default console is named /dev/ttyS2 as
> opposed to /dev/ttyO2 on more recent consoles. Be sure the correct
> console is set in u-boot so you don't miss any boot messages:
> # setenv console /dev/ttyS2,115200n8
> # boot
>

Thanks; I did this on the ones that hang (the one OK board and all the BAD boards - i.e. every board that was touched by that Angstrom image), and it yielded:

Creating 5 MTD partitions on "omap2-nand.0":
0x000000000000-0x000000080000 : "xloader"
0x000000080000-0x000000240000 : "uboot"
0x000000240000-0x000000280000 : "uboot environment"
0x000000280000-0x000000680000 : "linux"
0x000000680000-0x000020000000 : "rootfs"
UBI: attaching mtd4 to ubi0
UBI: physical eraseblock size: 131072 bytes (128 KiB)
UBI: logical eraseblock size: 129024 bytes
UBI: smallest flash I/O unit: 2048
UBI: sub-page size: 512
UBI: VID header offset: 512 (aligned 512)
UBI: data offset: 2048
UBI warning: ubi_scan: 1239 PEBs are corrupted
corrupted PEBs are: ... [Jason removed list] ...
UBIFS error (pid 1): ubifs_get_sb: cannot open "ubi0:rootfs", error -19
VFS: Cannot open root device "ubi0:rootfs" or unknown-block(0,0)
Please append a correct "root=" boot option; here are the available partitions:
1f00 512 mtdblock0 (driver?)
1f01 1792 mtdblock1 (driver?)
1f02 256 mtdblock2 (driver?)
1f03 4096 mtdblock3 (driver?)
1f04 517632 mtdblock4 (driver?)
Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)

This is unrelated to the memory corruption issue I am seeing and appears to be some mismatched configuration somewhere (based on the OK device with this issue that doesn't have the memory issue). I don't know how or why this happened and I am certainly interested in helping to get to the bottom of it, although I don't want to get too distracted. :) They all boot fine off the SD cards.

> >
> > Most of the info above I've given just as background info, because in any
> > case when booting with the Linaro SDs in, all the OK and BAD devices
> > display:
> >
> > U-Boot 2012.04.01 (Jul 19 2012 - 17:31:34)
> > ...
> > ## Booting kernel from Legacy Image at 80000000 ...
> > Image Name: Ubuntu Kernel
> > ...
> > Data Size: 4356000 Bytes = 4.2 MiB
> > ## Loading init Ramdisk from Legacy Image at 81600000 ...
> > Image Name: Ubuntu Initrd
> > ...
> > Data Size: 1997195 Bytes = 1.9 MiB
> >
> > Which makes me even more confused, because while the hardware (and nand
> > images) differ between OK and BAD, the kernel, u-boot, and software on the
> > SD cards is identical. The memory issue seems like it has something to do
> > with these older Gumstix, perhaps some incompatibility between the slightly
> > older devices and the Linaro images? I don't know.
>
> --Ash
>

At the moment the only thing I have to go on is possible effects of the unpatched erratum that Jeff DeFouw pointed out. I will investigate that further unless a better theory comes up. It does seem like it's on the hardware level or some other low level somewhere. The kernel appears completely unaware that it is accessing the same physical memory through multiple addresses. I will also try and see if I can discover anything more specific about what is going on there.

Thanks,
Jason
|
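One generic way to pin down an address wraparound like the one hypothesized in this thread is to write a marker at the base address and then probe power-of-two offsets until the marker is overwritten. The sketch below demonstrates the idea against a toy model only; `WrappingMemory` and `find_wrap` are made-up names, nothing here touches real hardware, and it is not something anyone in the thread ran.

```python
PAGE = 4096


class WrappingMemory:
    """Toy model of RAM that ignores address bits above real_size (hypothetical)."""

    def __init__(self, apparent_size, real_size):
        self.apparent_size = apparent_size  # what the system believes it has
        self.real_size = real_size          # what is actually distinct
        self.cells = {}                     # physical offset -> stored tag

    def write(self, addr, tag):
        self.cells[addr % self.real_size] = tag

    def read(self, addr):
        return self.cells.get(addr % self.real_size)


def find_wrap(mem, page=PAGE):
    """Return the smallest power-of-two offset that aliases address 0, or None."""
    mem.write(0, "base")
    offset = page
    while offset < mem.apparent_size:
        mem.write(offset, "probe")
        if mem.read(0) != "base":   # the probe landed on top of the base marker
            return offset
        offset *= 2
    return None


if __name__ == "__main__":
    MIB = 1024 * 1024
    # Model the symptom described above: 512 MiB reported, wrap at 128 MiB.
    wrap = find_wrap(WrappingMemory(512 * MIB, 128 * MIB))
    print("wrap at", wrap // MIB, "MiB")
```

On a real board the probing would have to happen on physical addresses (for example from the bootloader), since user-space virtual addresses do not map to physical memory in order.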
From: Jason C. <jas...@gm...> - 2013-10-30 17:26:24
|
On Wed, Oct 30, 2013 at 1:22 AM, Jeff DeFouw <je...@gr...> wrote:
> On 10/30/2013 12:40 AM, Jason Cipriani wrote:
> > *Software*:
> >
> > Currently, all devices are being booted from an SD card containing a Linaro
> > image created by following the instructions at
> > http://www.b1gtuna.com/2012/08/installing-linaro-image-on-gumstix-overo/.
> > There is our own custom software on there which is essentially just a
> > fullscreen video player that starts with the window manager. Everything else
> > is fairly stock.
>
> I see the year 2012 in several places. How old are your Linaro kernels? It
> was discovered late last year that the stock Linaro Overo kernel was missing
> CONFIG_ARM_ERRATA_430973=y, which would lead to crashes. The OMAP35xx
> processor in the Water definitely needs that enabled.
>

The kernel is 3.2.1, dated July 2012:

3.2.1-linaro-omap #3 PREEMPT Thu Jul 26 17:05:26 PDT 2012

I can confirm that that configuration option is not enabled:

root:~# cat /boot/config-3.2.1-linaro-omap | grep ARM_ERRATA
# CONFIG_ARM_ERRATA_430973 is not set
# CONFIG_ARM_ERRATA_458693 is not set
# CONFIG_ARM_ERRATA_460075 is not set
CONFIG_ARM_ERRATA_720789=y
# CONFIG_ARM_ERRATA_743622 is not set
# CONFIG_ARM_ERRATA_751472 is not set
# CONFIG_ARM_ERRATA_754322 is not set

Although, in reading the description of that erratum (http://cateee.net/lkddb/web-lkddb/ARM_ERRATA_430973.html), I am wondering whether it can lead to the effect I am seeing. Of course it could lead to any number of strange things happening, but I'm not convinced that this isn't a red herring. The severity, predictability, and consistency of the issue I'm seeing *seem* mismatched with the effects of that erratum. There is definitely no evidence *against* it, and in any case it does mean that the kernel image I'm using isn't appropriate for the older (BAD) boards, but I still want to be sure of the cause of the memory issue.

If I can't come up with any other good theories I will see if I can reproduce it with a controlled kernel build with and without that workaround enabled (for now I am going to avoid dedicating time to that).

> > However, /most/ of the new OK group says:
> >
> > OMAP36XX/37XX-GP ES2.1, CPU-OPP2, L3-165MHz, Max CPU Clock 1 Ghz
>
> WaterSTORM processor.
>
> > And the BAD group says:
> >
> > OMAP3530-GP ES3.1, CPU-OPP2, L3-165MHz, Max CPU Clock 720 mHz
>
> Water processor.
>

Thank you for clearing that up. I also found the invoice for the older bunch and confirmed that they are indeed Waters.

Jason
|
From: Jason C. <jas...@gm...> - 2013-10-30 18:39:12
|
On Wed, Oct 30, 2013 at 1:25 PM, Jason Cipriani <jas...@gm...> wrote:
>
> On Wed, Oct 30, 2013 at 1:22 AM, Jeff DeFouw <je...@gr...> wrote:
>
>> On 10/30/2013 12:40 AM, Jason Cipriani wrote:
>> > *Software*:
>> >
>> > Currently, all devices are being booted from an SD card containing a Linaro
>> > image created by following the instructions at
>> > http://www.b1gtuna.com/2012/08/installing-linaro-image-on-gumstix-overo/.
>> > There is our own custom software on there which is essentially just a
>> > fullscreen video player that starts with the window manager. Everything else
>> > is fairly stock.
>>
>> I see the year 2012 in several places. How old are your Linaro kernels? It
>> was discovered late last year that the stock Linaro Overo kernel was missing
>> CONFIG_ARM_ERRATA_430973=y, which would lead to crashes. The OMAP35xx
>> processor in the Water definitely needs that enabled.
>>
>
> The kernel is 3.2.1, dated July 2012:
>
> 3.2.1-linaro-omap #3 PREEMPT Thu Jul 26 17:05:26 PDT 2012
>
> I can confirm that that configuration option is not enabled:
>
> root:~# cat /boot/config-3.2.1-linaro-omap | grep ARM_ERRATA
> # CONFIG_ARM_ERRATA_430973 is not set
> # CONFIG_ARM_ERRATA_458693 is not set
> # CONFIG_ARM_ERRATA_460075 is not set
> CONFIG_ARM_ERRATA_720789=y
> # CONFIG_ARM_ERRATA_743622 is not set
> # CONFIG_ARM_ERRATA_751472 is not set
> # CONFIG_ARM_ERRATA_754322 is not set
>
> Although in reading the description of that erratum (
> http://cateee.net/lkddb/web-lkddb/ARM_ERRATA_430973.html), I am wondering
> if that can lead to the effect I am seeing. Of course it could lead to any
> number of strange things happening, but I'm not convinced that this isn't a
> red herring. The severity, predictability, and consistency of the issue I'm
> seeing *seems* mismatched with the effects of that erratum.
> Definitely no evidence *against* it, and in any case that does mean that the
> kernel image I'm using isn't appropriate for the OLD boards, but I still want
> to be sure of the cause of the memory issue.
>
> If I can't come up with any other good theories I will see if I can
> reproduce it with a controlled kernel build with and without that
> workaround enabled (for now I am going to avoid dedicating time to that).
>

The BAD Gumstix, when running an Angstrom kernel:

3.2.28-rt42+ #6 PREEMPT RT Fri Sep 21 12:23:39 EDT 2012

do not exhibit the memory corruption problem. These kernels do have the erratum workaround enabled:

root:~# gunzip -c /proc/config.gz | grep ARM_ERRATA
CONFIG_ARM_ERRATA_430973=y
# CONFIG_ARM_ERRATA_458693 is not set
# CONFIG_ARM_ERRATA_460075 is not set
# CONFIG_ARM_ERRATA_720789 is not set
# CONFIG_ARM_ERRATA_743622 is not set
# CONFIG_ARM_ERRATA_751472 is not set
# CONFIG_ARM_ERRATA_754322 is not set

That does lend support to that being the cause...

Jason
|
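The two grep checks in this thread can also be scripted, which may help when comparing many boards or images. This is only a sketch: `read_kernel_config` and `option_enabled` are made-up helper names, and the default path assumes the running kernel exposes /proc/config.gz, which not every kernel does (the Linaro image above was checked through /boot/config-3.2.1-linaro-omap instead).

```python
import gzip
import sys


def read_kernel_config(path):
    """Return kernel config text from a plain or gzip-compressed file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return handle.read()


def option_enabled(config_text, option):
    """True if the option is set to y (built in) in the config text."""
    wanted = f"{option}=y"
    return any(line.strip() == wanted for line in config_text.splitlines())


if __name__ == "__main__":
    # Hypothetical usage: python check_erratum.py [config path]
    config_path = sys.argv[1] if len(sys.argv) > 1 else "/proc/config.gz"
    text = read_kernel_config(config_path)
    state = "enabled" if option_enabled(text, "CONFIG_ARM_ERRATA_430973") else "NOT enabled"
    print(f"CONFIG_ARM_ERRATA_430973 is {state} in {config_path}")
```

A commented-out line such as "# CONFIG_ARM_ERRATA_430973 is not set" does not match, because the comparison is against the exact "OPTION=y" form.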