On Wed, Oct 30, 2013 at 3:03 AM, Ash Charles <ash@gumstix.com> wrote:
On Tue, Oct 29, 2013 at 9:40 PM, Jason Cipriani
<jason.cipriani@gmail.com> wrote:
> The BAD devices were purchased about 8 months ago. They *should* be
> WaterStorms. Their packaging is lost (and I was not responsible for
> purchasing them but I am currently trying to obtain the original invoice /
> packing list). The product number on the SD slot sticker is worn off, on the
> best one all I can read is "GS3???W-R?358" (above it, it says "W/O 28498",
> not sure if that's a serial number or something that is useful for
> identification). The board also has "PF3503-R3949" printed next to the wifi
> pads. The devices do not have wifi. These devices were previously used in
> another project.
W/O 28498 corresponds to the work order under which these were built
(details http://pubs.gumstix.com/boards/AA_README.txt).  These are
GS3503-R3358---Overo Water COMs.
>
> As an aside, in the project that used the BAD boards, we are currently
> attempting to assess a failure of nearly 40 devices within a 2 week period
> after about 4 months of running. Our initial guess was SD write failures,
> and on our off-site prototypes (where my two BAD Gumstix that I'm using now
> came from), we observed significant but still unexplained filesystem
> corruption after running for 3 weeks with no power cycles. After being
> unable to observe any actual acute SD failures, I am now wondering if this
> memory corruption issue is the cause of the problems in that project. (I
> discovered the memory corruption issue when recycling an unused prototype
> from this previous project for use in a new project). Anyways...
Can you confirm the part code on the POP memory chip (i.e. the line
above the FBGA code shown here:
http://www.micron.com/products/support/fbga)?


In the OK group, the WaterSTORMs:
  FBGA code: JW734
  Part #: MT29C4G96MAZBACJG-5 WT

In the BAD group, the Waters (thanks for identifying):
  FBGA code: JW513
  Part #: MT29C4G96MAZAPCJG-5 IT

 
>
> Software:
>
> Currently, all devices are being booted from an SD card containing a Linaro
> image created by following the instructions at
> http://www.b1gtuna.com/2012/08/installing-linaro-image-on-gumstix-overo/.
> There is our own custom software on there which is essentially just a
> fullscreen video player that starts with the window manager. Everything else
> is fairly stock.
>
> I do not know the history of the BAD devices, but I do know that they used
> to be running from SD cards that had an extremely custom Angstrom image on
> them, built from scratch by a coworker (I am working to find out more
> information).
>
> There are some differences when booting all of these devices from nand (with
> no SD card in them). I'm leaving lots of hopefully unimportant info out to
> keep from cluttering up this email with output. Things are slightly more
> complicated here:
>
> All of the boards say these things on boot with no SD card inserted (sorry
> this may be TMI since I usually use the SD card, and u-boot is loaded from
> there, but I'm trying to give some background info):
>
> Texas Instruments X-Loader 1.5.0 (Sep 16 2011 - 12:10:02)
> ...
> DRAM:  512 MiB
> NAND:  512 MiB
>
> However, most of the new OK group says:
>
> OMAP36XX/37XX-GP ES2.1, CPU-OPP2, L3-165MHz, Max CPU Clock 1 Ghz
> U-Boot 2010.12-00023-g4eb0f5e (Sep 16 2011 - 13:51:55)
> NAND read: device 0 offset 0x280000, size 0x400000
>    Image Name:   Angstrom/2.6.39/overo
>    Data Size:    2914708 Bytes = 2.8 MiB
Some standard software has been flashed.  The U-boot/MLO is a little
older now but should work fine.
>
> There is one exception! I have a board in the OK group that I'm using for
> development that I put one of the old custom Angstrom SD cards in once. It
> says the same thing that the BAD groups say (below) except for the processor
> type:
>
> OMAP36XX/37XX-GP ES2.1, CPU-OPP2, L3-165MHz, Max CPU Clock 1 Ghz
> U-Boot 2010.09 (Oct 20 2010 - 10:11:49)
> NAND read: device 0 offset 0x280000, size 0x400000
>    Image Name:   Angstrom/2.6.34/overo
>    Data Size:    3160068 Bytes = 3 MiB
Looks like an older version of software was flashed.  I'd update if
you are using this or just run off the microSD.

I'm just running off the SD so it should be fine, although I have a question about this: Do some SD-based boot images flash a new u-boot onto the nand while others don't? It appears when I boot any Gumstix from these old Angstrom SD images, it downgrades the nand u-boot to 2010.09, but when I boot from the new Linaro SD images, it doesn't touch the nand u-boot.


 
>
> And the BAD group says:
>
> OMAP3530-GP ES3.1, CPU-OPP2, L3-165MHz, Max CPU Clock 720 mHz
> U-Boot 2010.09 (Oct 20 2010 - 10:11:49)
> NAND read: device 0 offset 0x280000, size 0x400000
>    Image Name:   Angstrom/2.6.34/overo
>    Data Size:    3160068 Bytes = 3 MiB
>
> So basically, everything that was touched by that old Angstrom image has an
> older version of uboot and a kernel in nand (although note that the OK
> device with the old u-boot still does not have any memory issues).
I suspect this is not strictly a u-boot/MLO version issue
(particularly if you still see issues booting with a microSD card with
recent MLO/u-boot).  A kernel problem or hardware issue seem more
likely.

Agreed. I have confirmed that bootloader versions and the previous use of the Angstrom images seem to be unrelated to the memory corruption issue. The consistent pattern I've observed so far is that when using the Linaro 3.2.1 image, WaterSTORMs behave properly and Waters show the memory issue. I will see if I can do a more controlled test (I will also check the stock Linaro image available from the Gumstix site, which I have been avoiding for reasons mentioned in other emails).


 
>
> Additionally, in /proc/cpuinfo, the OK group says:
>
> Processor    : ARMv7 Processor rev 2 (v7l)
>
> The BAD group says (yes it says rev 3 even though it's older than the OK
> group and with a slower clock):
>
> Processor    : ARMv7 Processor rev 3 (v7l)
Rev 3 but of an older processor (OMAP35xx rather than DM37xx)
>
> There is one more oddity:
>
> All of the devices with the older u-boot (i.e. all BAD and that one OK) hang
> indefinitely at "booting kernel", while the rest of the OK devices boot just
> fine from nand (into a 2.6.39 kernel). I'm a little confused about the
> u-boot stuff; it does appear that the custom Angstrom SD image is perhaps
> writing an old u-boot to nand (if that's even possible) although those
> Angstrom images were 3.something kernels, I'm not sure why the image
> written to nand ended up being a 2.6.34 one (but again, a coworker set up
> those SD images and I have no idea what he did to create them).
For kernels before 2.6.36, the default console is named /dev/ttyS2 as
opposed to /dev/ttyO2 on more recent kernels.  Be sure the correct
console is set in u-boot so you don't miss any boot messages:
# setenv console /dev/ttyS2,115200n8
# boot
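
The naming rule above can be sketched as a small helper. This is only an illustration: the function name console_name and the default port number are hypothetical, and the 2.6.36 cutoff is taken directly from the note above rather than checked against kernel sources.

```python
import re


def console_name(kernel_version: str, port: int = 2) -> str:
    """Return the serial console device name for a given kernel version.

    Assumption (from the note above): kernels before 2.6.36 use ttyS<n>,
    later kernels use ttyO<n>.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", kernel_version)
    if match is None:
        raise ValueError(f"unrecognised kernel version: {kernel_version!r}")
    # A missing patch level (e.g. "3.2") is treated as 0.
    version = tuple(int(part or 0) for part in match.groups())
    prefix = "ttyS" if version < (2, 6, 36) else "ttyO"
    return f"{prefix}{port}"


if __name__ == "__main__":
    # Kernel versions that appear in the boot logs in this thread.
    for version in ("2.6.34", "2.6.39", "3.2.1"):
        print(version, "->", console_name(version))
```

For the images mentioned in this thread, the 2.6.34 kernel in nand would need the ttyS2 name, while the 2.6.39 and 3.2.1 kernels would use ttyO2.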

Thanks; I did this on the ones that hang (the one OK board and all the BAD boards - i.e. every board that was touched by that Angstrom image), and it yielded:

Creating 5 MTD partitions on "omap2-nand.0":
0x000000000000-0x000000080000 : "xloader"
0x000000080000-0x000000240000 : "uboot"
0x000000240000-0x000000280000 : "uboot environment"
0x000000280000-0x000000680000 : "linux"
0x000000680000-0x000020000000 : "rootfs"
UBI: attaching mtd4 to ubi0
UBI: physical eraseblock size:   131072 bytes (128 KiB)
UBI: logical eraseblock size:    129024 bytes
UBI: smallest flash I/O unit:    2048
UBI: sub-page size:              512
UBI: VID header offset:          512 (aligned 512)
UBI: data offset:                2048
UBI warning: ubi_scan: 1239 PEBs are corrupted
corrupted PEBs are: ... [Jason removed list]
...
UBIFS error (pid 1): ubifs_get_sb: cannot open "ubi0:rootfs", error -19
VFS: Cannot open root device "ubi0:rootfs" or unknown-block(0,0)
Please append a correct "root=" boot option; here are the available partitions:
1f00             512 mtdblock0 (driver?)
1f01            1792 mtdblock1 (driver?)
1f02             256 mtdblock2 (driver?)
1f03            4096 mtdblock3 (driver?)
1f04          517632 mtdblock4 (driver?)
Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)

This is unrelated to the memory corruption issue I am seeing and appears to be some mismatched configuration somewhere (based on the OK device with this issue that doesn't have the memory issue). I don't know how or why this happened and I am certainly interested in helping to get to the bottom of it, although I don't want to get too distracted. :) They all boot fine off the SD cards.
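
As a sanity check on the log above, here is a rough sketch that recomputes the partition sizes and UBI geometry from the printed offsets. The offsets, eraseblock size, and data offset are copied from the kernel output; the variable and function names are made up for illustration.

```python
# (name, start, end) copied from the "Creating 5 MTD partitions" lines.
PARTITIONS = [
    ("xloader", 0x000000000000, 0x000000080000),
    ("uboot", 0x000000080000, 0x000000240000),
    ("uboot environment", 0x000000240000, 0x000000280000),
    ("linux", 0x000000280000, 0x000000680000),
    ("rootfs", 0x000000680000, 0x000020000000),
]

PEB_SIZE = 131072        # physical eraseblock size reported by UBI
DATA_OFFSET = 2048       # UBI data offset (space taken by the headers)
CORRUPTED_PEBS = 1239    # from the ubi_scan warning


def sizes_kib(partitions):
    """Partition sizes in KiB, in the same order as the mtdblock list."""
    return [(end - start) // 1024 for _name, start, end in partitions]


def logical_eraseblock_size(peb_size, data_offset):
    """Usable bytes per eraseblock once the UBI headers are subtracted."""
    return peb_size - data_offset


def peb_count(partition, peb_size):
    """Number of physical eraseblocks covered by one partition."""
    _name, start, end = partition
    return (end - start) // peb_size


if __name__ == "__main__":
    print(sizes_kib(PARTITIONS))
    print(logical_eraseblock_size(PEB_SIZE, DATA_OFFSET))
    total = peb_count(PARTITIONS[4], PEB_SIZE)
    print(f"{CORRUPTED_PEBS} of {total} rootfs PEBs flagged "
          f"({100 * CORRUPTED_PEBS / total:.1f}%)")
```

The computed sizes agree with the mtdblock list in the panic message, and the "linux" partition matches the "NAND read: device 0 offset 0x280000, size 0x400000" line from u-boot, so the partition table itself looks self-consistent. Roughly 30% of the rootfs eraseblocks are being reported as corrupted.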

 
>
> Most of the info above I've given just as background info, because in any
> case when booting with the Linaro SDs in, all the OK and BAD devices
> display:
>
> U-Boot 2012.04.01 (Jul 19 2012 - 17:31:34)
> ...
> ## Booting kernel from Legacy Image at 80000000 ...
>    Image Name:   Ubuntu Kernel
>    ...
>    Data Size:    4356000 Bytes = 4.2 MiB
> ## Loading init Ramdisk from Legacy Image at 81600000 ...
>    Image Name:   Ubuntu Initrd
>    ...
>    Data Size:    1997195 Bytes = 1.9 MiB
>
> Which makes me even more confused, because while the hardware (and nand
> images) differ between OK and BAD, the kernel, u-boot, and software on the
> SD cards is identical. The memory issue seems like it has something to do
> with these older Gumstix, perhaps some incompatibility between the slightly
> older devices and the Linaro images? I don't know.

--Ash

At the moment the only thing I have to go on is possible effects of the unpatched erratum that Jeff DeFouw pointed out. I will investigate that further unless a better theory comes up.

It does seem to be at the hardware level or some other low level. The kernel appears completely unaware that it is accessing the same physical memory through multiple addresses. I will also try to see if I can discover anything more specific about what is going on there.
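
One way to narrow down this kind of aliasing is an "address in address" test: write each word's own index into it, then read everything back and see which locations were overwritten through another address. Below is a minimal sketch of the idea run against a simulated memory, not against real hardware. The AliasedMemory class and find_aliases function are hypothetical names, and the fault model (one address bit being ignored) is only an assumption for illustration.

```python
class AliasedMemory:
    """Simulated word-addressed memory that can ignore one address bit."""

    def __init__(self, n_words, dropped_bit=None):
        self.cells = [0] * n_words
        # With a dropped bit, two different indices map to the same cell.
        self.mask = -1 if dropped_bit is None else ~(1 << dropped_bit)

    def write(self, index, value):
        self.cells[index & self.mask] = value

    def read(self, index):
        return self.cells[index & self.mask]


def find_aliases(write_word, read_word, n_words):
    """Write each index to its own location, then report mismatches.

    Returns (index, value_read) pairs; index ^ value_read shows which
    address bits differ between the two aliased locations.
    """
    for index in range(n_words):
        write_word(index, index)
    mismatches = []
    for index in range(n_words):
        value = read_word(index)
        if value != index:
            mismatches.append((index, value))
    return mismatches


if __name__ == "__main__":
    faulty = AliasedMemory(64, dropped_bit=4)
    found = find_aliases(faulty.write, faulty.read, 64)
    differing_bits = {index ^ value for index, value in found}
    print(len(found), "aliased words, differing address bits:",
          [hex(bits) for bits in differing_bits])
```

On a real board the same pattern would have to run over a physically contiguous region (for example from u-boot or a kernel module), since a userspace buffer is scattered across physical pages; the sketch only shows how the mismatch pattern points at the address bits involved.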

Thanks,
Jason