#203 programming st_nucleo_f7 (stm32f767) bank 2 consistently fails

0.9.0
new
nobody
None
2018-08-30
2018-08-20
No

The stm32f767zi (on the nucleo-f767zi board) has 2MiB of flash. When it is configured into dual bank mode (stm32f2x options_write 0 0xDFC 0x0080 0x0040, presuming all other options are left at their defaults), using the program command to program the second bank (bank1_start=0x0800_0000, bank2_start=0x0810_0000) with the command flash write_image fw.elf erase 0x100000 consistently fails with the following output:

openocd -f board/st_nucleo_f7.cfg

> flash write_image fw.elf 0x100000       
Flash write discontinued at 0x081020c4, next section at 0x08120000
timed out while waiting for target halted
target halted due to debug-request, current mode: Handler HardFault
xPSR: 0x00000003 pc: 00000000 msp: 0xffffffe0
error waiting for target flash write algorithm
error writing to flash at address 0x08000000 at offset 0x00100000

This is using the embedded ST-LINK included on the nucleo. The st-link firmware version is V2J31M21.

fw.elf has sections starting in bank1 of flash, which is why the offset is only the difference between bank1 and bank2.
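
To make the offset arithmetic concrete, here is a minimal Python sketch (not part of the original report). It applies the 0x100000 offset to the section addresses that test_image reports for fw.elf later in this thread and reproduces the two addresses in the "Flash write discontinued" message. The helper name relocate is hypothetical.

```python
# Section load addresses and lengths of fw.elf, copied from the
# test_image output posted later in this thread.
SECTIONS = [
    (0x08000000, 0x000020C4),
    (0x08020000, 0x00028AD0),
    (0x08048AD0, 0x00000E80),
    (0x08049950, 0x00000068),
]

# Offset passed to "flash write_image": the distance between the two banks.
OFFSET = 0x100000


def relocate(sections, offset):
    """Hypothetical helper: shift every section start by the write offset."""
    return [(addr + offset, length) for addr, length in sections]


moved = relocate(SECTIONS, OFFSET)
first_end = moved[0][0] + moved[0][1]   # end of the relocated first section
next_start = moved[1][0]                # start of the relocated second section

print(f"first section ends at 0x{first_end:08x}, next section at 0x{next_start:08x}")
```

The two values match the addresses in the log line above, so the "discontinued" message only marks the gap between the first two sections and is not itself the error.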

The banks referred to here are banks in the stm32f7x sense, and are not openocd flash banks.

Discussion

  • Cody Schafer

    Cody Schafer - 2018-08-20

    Sorry about reference to program in the ticket, I simplified the reproduction to write_image.

    Also, here's some log output following immediately after the above output (I had to reset before I tried again, with the same result):

    > flash write_image fw.elf 0x100000                          
    Flash write discontinued at 0x081020c4, next section at 0x08120000
    Target is already running an algorithm
    error starting target flash write algorithm
    error writing to flash at address 0x08000000 at offset 0x00100000
    
    > reset init
    Unable to match requested speed 2000 kHz, using 1800 kHz
    Unable to match requested speed 2000 kHz, using 1800 kHz
    adapter speed: 1800 kHz
    target halted due to debug-request, current mode: Thread 
    xPSR: 00000000 pc: 00000000 msp: 00000000
    Unable to match requested speed 8000 kHz, using 4000 kHz
    Unable to match requested speed 8000 kHz, using 4000 kHz
    adapter speed: 4000 kHz
    > flash write_image fw.elf 0x100000
    Flash write discontinued at 0x081020c4, next section at 0x08120000
    timed out while waiting for target halted
    target halted due to debug-request, current mode: Handler HardFault
    xPSR: 0x00000003 pc: 00000000 msp: 0xffffffe0
    error waiting for target flash write algorithm
    error writing to flash at address 0x08000000 at offset 0x00100000
    
    > 
    
     
  • Cody Schafer

    Cody Schafer - 2018-08-20

    Attached is a log of the command output:

    openocd -f board/st_nucleo_f7.cfg -c 'init' -c 'reset init' -c 'flash write_image fw.elf 0x100000' -d

     
  • Cody Schafer

    Cody Schafer - 2018-08-21

    As a side note: it isn't just the algorithm that's failing: the fallback/normal write mechanism fails too.
    Here's some log output from running with set WORKAREASIZE 0 to force non-algorithm flash writing, which also fails.

    This is kind enough to fail more quickly than the algorithm variant, which waits for a timeout (probably should try to catch hardfaults when executing an algorithm).

     
  • Andreas Bolsch

    Andreas Bolsch - 2018-08-22

    Hm, just checked with current head on Nucleo-F767ZI via integrated ST-Link:
    stm32f2x user_options 0xDFC, boot_add0 0x0080, boot_add1 0x0040,
    so in dual-bank mode, after mass erase.

    Programming the whole flash (2MBytes) with random data (flash write_bank 0 random.bin) and verify after read back (flash read_bank 0 verify.bin) works flawlessly for me.

    Writing to the second bank only works for me, too.

    I'd suggest you try again without the 'erase' (do a mass erase instead and an erase check before), and then use flash write_bank with a binary (or ihex, srec) file.

    Maybe your elf file has some 'unusual' properties.

    BTW: Any sector protection set?

     
    • Cody Schafer

      Cody Schafer - 2018-08-22

      I just tried reproducing and got the same failure when using telnet to command openocd. I've attached a telnet session (openocd -f board/st_nucleo_f7.cfg -c 'init' -c 'reset init').

      I then tried a fully automated variant immediately afterward with the same board and was not able to reproduce (flash occurred successfully): openocd -f board/st_nucleo_f7.cfg -c 'init; reset init; stm32f2x mass_erase 0; flash write_bank 0 random_1MB.bin'

      (Attached file is the failure via telnet, the content is the telnet session)

      No sector protection set (I printed it in the attached log)

       
  • Cody Schafer

    Cody Schafer - 2018-08-22

    Info on my elf file:

    program headers readelf -l fw.elf:

    Elf file type is EXEC (Executable file)
    Entry point 0x8020239
    There are 6 program headers, starting at offset 52
    
    Program Headers:
      Type           Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
      LOAD           0x010000 0x08000000 0x08000000 0x020c4 0x020c4 R   0x10000
      LOAD           0x020000 0x08020000 0x08020000 0x28ad0 0x28ad0 RWE 0x10000
      LOAD           0x055538 0x20025538 0x08048ad0 0x00e80 0x00e80 RW  0x10000
      LOAD           0x059950 0x08049950 0x08049950 0x00068 0x00068 R   0x10000
      LOAD           0x060000 0x20020000 0x20020000 0x00000 0x05538 RW  0x10000
      NOTE           0x059990 0x08049990 0x08049990 0x00024 0x00024 R   0x4
    
     Section to Segment mapping:
      Segment Sections...
       00     .bootldr 
       01     .vector_table .text .ARM .init_array .fini_array .rodata 
       02     .data 
       03     .build_info .note.gnu.build-id .build_info_suffix 
       04     .bss 
       05     .note.gnu.build-id 
    

    openocd -f board/st_nucleo_f7.cfg -c 'test_image fw.elf 0 elf':

    address 0x08000000 length 0x000020c4
    address 0x08020000 length 0x00028ad0
    address 0x08048ad0 length 0x00000e80
    address 0x08049950 length 0x00000068
    verified 178812 bytes in 0.000513s (340392.000 KiB/s)
    

    I've done some further testing with the following 2 commands:

    A. a.cfg: write random 1MB to bank 1, random 1MB to bank2 (openocd -f board/st_nucleo_f7.cfg -f a.cfg -c exit)
    B. b.cfg: write fw.elf to bank 1, random 1MB to bank 2 (openocd -f board/st_nucleo_f7.cfg -f b.cfg -c exit)

    Here's a sequence of executions with OK, ERROR 1, and ERROR 2 indicating the operation which failed (flashing bank 1 or 2)

    # plug in nucleo-f767zi's stlink to computer
    A OK
    A OK
    B ERROR 2
    B ERROR 1
    B ERROR 2
    B ERROR 1
    B ERROR 2
    B ERROR 1
    A OK
    B ERROR 2
    A ERROR 1
    A OK
    B ERROR 2
    A ERROR 1
    B ERROR 2
    A ERROR 1
    A OK
    A OK
    

    So:

    • Programming the elf file triggers this issue
    • Programming that elf file appears to break the following 2 programming attempts (at least in the bank1, bank2, bank1, bank2, ... programming sequence I tested here).
     

    Last edit: Cody Schafer 2018-08-22
  • Cody Schafer

    Cody Schafer - 2018-08-22

    fw.elf test_image with gap(s) annotated:

    address 0x08000000 length 0x000020c4
    #gap    0x080020c4 length 0x0001df3c
    address 0x08020000 length 0x00028ad0
    address 0x08048ad0 length 0x00000e80
    address 0x08049950 length 0x00000068
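
The gap annotation can be checked mechanically. Below is a minimal Python sketch (an illustration, not something from the thread) that walks the test_image section list, reports any hole between consecutive sections, and sums the lengths. The function name find_gaps is hypothetical.

```python
# Section load addresses and lengths of fw.elf, copied from the
# test_image output in this thread.
SECTIONS = [
    (0x08000000, 0x000020C4),
    (0x08020000, 0x00028AD0),
    (0x08048AD0, 0x00000E80),
    (0x08049950, 0x00000068),
]


def find_gaps(sections):
    """Hypothetical helper: return (start, length) of each hole between sections."""
    gaps = []
    ordered = sorted(sections)
    for (addr, length), (next_addr, _) in zip(ordered, ordered[1:]):
        end = addr + length
        if next_addr > end:
            gaps.append((end, next_addr - end))
    return gaps


for start, length in find_gaps(SECTIONS):
    print(f"#gap    0x{start:08x} length 0x{length:08x}")

total = sum(length for _, length in SECTIONS)
print(f"total {total} bytes")
```

With these figures there is exactly one gap, between the bootloader section and the section at 0x08020000, and the total agrees with the 178812 bytes that test_image reported as verified.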
    
     
  • Andreas Bolsch

    Andreas Bolsch - 2018-08-22

    I've created an elf file with the same section addresses/sizes, filled with garbage; test_image reports the same figures as for your file. No problem whatsoever, programming and verification work ok for me. Checked with ST-Link V2J30M19 and V2J31M21.
    Other than the chip rev. (yours is Z, mine is A), I don't see any difference.

    So either it's defective hardware, or ... your firmware does weird things like fiddling with watchdogs, clocks, interrupts, sleep mode ...
    This might explain your observations above.

    Maybe add

    reset_config srst_only srst_nogate connect_assert_srst

    to your cfg or place an infinite loop at the very beginning of your startup code (but take care not to change the length of the startup code, so that all sections remain at precisely the same offsets).

     
    • Cody Schafer

      Cody Schafer - 2018-08-23

      Thank you for trying to reproduce.

      It seems very curious that the actions of my firmware (which doesn't write to flash, etc) would affect the ability of openocd to program the chip's flash, especially given that reset init is being used here to reset & halt the target.

      I'll try out adding a loop in startup code so we can see if somehow openocd isn't managing to reset/halt the processor properly.

      My firmware does use clocks (it increases clock speed to 216MHz), interrupts (enables a bunch of them, including a few timers), and enables the watchdog (specifically, IWDG).

       
    • Cody Schafer

      Cody Schafer - 2018-08-27

      I tried the reset_config srst_only srst_nogate connect_assert_srst with fw.elf. No change in behavior (still fails every other time) was noted (test script attached).

      I also modified fw.elf to start with an infinite loop. No change in behavior was noted (still fails to flash via algorithm every-other time)

       

      Last edit: Cody Schafer 2018-08-27
  • Andreas Fritiofson

    On Wed, Aug 22, 2018 at 8:33 PM Ismail Kose ihkose@gmail.com wrote:

    I built openocd from 6060545458f6863710d576fc4bd2512d34f88f89 commit-id,
    but can't get SWD working. I get invalid command name "swd" error
    message when I run "sudo openocd -f max3263x_hdk.cfg" command on my Ubuntu
    16.04.

    You'll need to "transport select swd" after selecting the interface to get
    access to the swd commands.

    Also NEVER start openocd as root!

    /Andreas

     
  • Cody Schafer

    Cody Schafer - 2018-08-23

    I've done some more testing wrt this issue and noticed that the failure is reproduced without even using the second bank. Instead, just programming my fw.elf multiple times causes exactly every-other attempt to program fw.elf to fail. (running openocd -f board/st_nucleo_f7.cfg -f d.cfg -c exit each time)

    I've added some debug output to various code around stm32x_block_write to try and figure out what in particular is failing (see attached output & patches).

    I've added various mov r5, #0xbb (etc) instructions to try to track the progression of the algorithm, initially setting r5 to 0xaa. In the failure case, I've never seen anything but 0xaa.

    I've observed that the r0 value returned by the algorithm (which should be the flash status register) does not appear to actually be the flash status register. It has bits set that are marked as "reserved" in the stm32f7/6xxx manual and its value appears to change as I change/add to the algorithm asm. The value looks very much like a pointer to ram, possibly to the end of the code composing the algorithm.

    Edit:
    Further examination indicated r0 was exactly source->address: a pointer to the working area where the circular buffer would have been stored, which is preloaded into r0 prior to algorithm execution. This seems to indicate that the algorithm never gets started at all in the failure case, and that preloading r0 with source->address was hiding this (the algorithm should probably use an additional register for its return value, preloaded with a sentinel, to detect the "didn't execute" case).

     

    Last edit: Cody Schafer 2018-08-23
  • Cody Schafer

    Cody Schafer - 2018-08-24

    After some more digging I've seen the following:

    • reading DHCSR before & after resuming the processor indicates that when the failure occurs the processor is in lockup (S_LOCKUP is set)
    • further examination of CFSR indicates that this is an imprecise usage fault
    • the T bit in xPSR was 0, which would cause a usage fault (can't disable thumb mode on armv7m)
    • tweaking run_algorithm (in armv7m.c) to set xPSR so the T bit is set causes flashing multiple times to be reliable (have not yet tested writing to offset parts of the flash).

    Not yet clear to me why T is getting cleared in the first place, as resolving that would be ideal.
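
For readers following along with raw register dumps, here is a minimal Python sketch of the two checks described above. It assumes the ARMv7-M bit layout (T is bit 24 of xPSR, the exception number is in bits 8:0, S_LOCKUP is bit 19 of DHCSR). The function names are hypothetical, and the DHCSR value in the example is constructed for illustration, not taken from a log.

```python
# Bit positions per the ARMv7-M architecture.
XPSR_T_BIT = 1 << 24           # EPSR.T: must be 1 on ARMv7-M (Thumb only)
XPSR_EXCEPTION_MASK = 0x1FF    # IPSR: active exception number
DHCSR_S_LOCKUP = 1 << 19       # core is in lockup


def decode_xpsr(xpsr):
    """Hypothetical helper: split xPSR into the fields relevant to this issue."""
    return {
        "thumb": bool(xpsr & XPSR_T_BIT),
        "exception": xpsr & XPSR_EXCEPTION_MASK,
    }


def is_locked_up(dhcsr):
    """Hypothetical helper: True when DHCSR reports S_LOCKUP."""
    return bool(dhcsr & DHCSR_S_LOCKUP)


# xPSR value from the failing log in this ticket: T clear, exception 3 (HardFault).
failing = decode_xpsr(0x00000003)
# A typical halted-in-thread-mode value with the T bit set, for comparison.
healthy = decode_xpsr(0x01000000)

print(failing, healthy)
print(is_locked_up(DHCSR_S_LOCKUP))  # constructed value, only S_LOCKUP set
```

Decoding the xPSR printed in the original failure (0x00000003) this way shows T clear and exception number 3, which is consistent with the "Handler HardFault" line in the log.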

     
  • Tommy Murphy

    Tommy Murphy - 2018-08-24

    Any chance the CPU is executing from zeroized or garbage memory on power on reset thus causing the T bit to be cleared at some stage, and then a double fault and lockup occurring?
    However even if this was happening I would expect the debug connection and reset init to get it back into a known good state....

     
  • Cody Schafer

    Cody Schafer - 2018-08-27

    I've used the attached (4-line) patch on openocd master to work around this issue (by setting xPSR.T).

     
    • Cody Schafer

      Cody Schafer - 2018-08-28

      I've submitted a variation of this for inclusion. http://openocd.zylin.com/#/c/4658/

       
  • Cody Schafer

    Cody Schafer - 2018-08-27

    A few more details (using stlink_usb_v2_read_debug() to get values)

    • In success case, while running the flash algorithm: stlink_usb_run(), xPSR.T==1 prior to clearing C_HALT and xPSR.T==0 after clearing C_HALT.
    • In the failing case, xPSR.T==0 also prior to clearing C_HALT (theory: it immediately faults in this case).
    • Multiple algorithm executions in a single flash write_image erase fw.elf work. Even though xPSR.T==0 on read back in stlink_usb_run(), on second algorithm execution xPSR.T==1 is seen prior to clearing C_HALT.
    • Multiple algorithm executions across multiple flash write_image erase fw.elf work (no reset between). (see pf.cfg attached)
    • Multiple algorithm executions with resets between them fail (see pfr.cfg, attached).

    Theory: the reset is relevant because of the caching of register values by openocd. It's plausible that the st-link is failing to return the full xPSR in some cases, causing openocd to write back different values into xPSR, clearing the xPSR.T bit.

     

    Last edit: Cody Schafer 2018-08-27
  • Cody Schafer

    Cody Schafer - 2018-08-29

    Ah, found out why xPSR.T was 0:

    My fw.elf file's first section (address 0x08000000 length 0x000020c4) is a bootloader, which is generated/included via (approximately) the following steps:

    $CC -o boot.elf $BOOT_OBJ
    $OBJCOPY -O binary boot.elf boot.bin
    $LD -r -b binary boot.bin -o boot.o
    $OBJCOPY --rename-section .data=.bootldr,alloc,load,readonly,rom,data boot.o boot-ldr.o
    

    boot-ldr.o is then linked into the image with the following linker script snippet:

    SECTIONS
    {
      .bootldr ORIGIN(FLASH) : {
         KEEP(*(.bootldr))
      }
    }
    

    The key part is (for some reason) the section flags set when objcopying: alloc,load,readonly,rom,data. These were added to the firmware somewhat recently.

    When loading the fw.elf binary composed with the bootloader image (generated as described above), the bootloader section in fw.elf (.bootldr) appears to be filled with zeros rather than actual data. As a result, when the processor resets, it reads the second element of the interrupt vector (0, in this case), and sets pc=0, xPSR=0. This is why after a reset I would observe failures would begin to happen (as a reset would trigger loading 0 into xPSR).

    Failures only occurred every other time because erasing the flash (which happened without running a target algorithm on the device) causes the interrupt vector to contain (instead) 0xffffffff, resulting in pc=0xfffffffe, xPSR.T=1 on the next reset. This likely means that a reset between erase and writing would have also worked around the issue.
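
The reset behaviour described in the last two paragraphs can be written down as a small Python sketch. It models only the ARMv7-M rule that, on reset, the PC is loaded from the second vector table entry with bit 0 cleared and xPSR.T takes the value of that bit 0. The function name state_after_reset is hypothetical.

```python
def state_after_reset(reset_vector):
    """Hypothetical model: return (pc, thumb) as loaded from the reset vector."""
    pc = reset_vector & 0xFFFFFFFE     # bit 0 is not part of the address
    thumb = bool(reset_vector & 1)     # bit 0 becomes xPSR.T
    return pc, thumb


# Zero-filled .bootldr section: pc=0 and T=0, so the algorithm faults at once.
zeroed = state_after_reset(0x00000000)
# Erased flash reads as all ones: T=1, so the next algorithm run works.
erased = state_after_reset(0xFFFFFFFF)

print(f"zeroed: pc=0x{zeroed[0]:08x} T={int(zeroed[1])}")
print(f"erased: pc=0x{erased[0]:08x} T={int(erased[1])}")
```

This reproduces the alternating pattern: a write leaves zeros in the vector table (T=0 after the next reset, failure), and the erase at the start of the failing attempt leaves 0xffffffff (T=1 after the next reset, success).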

    Removing the section flag specification from objcopy ($OBJCOPY --rename-section .data=.bootldr boot.o boot-ldr.o) results in the values in the .bootldr section being loaded as expected (rather than being set to zero). It's not yet clear to me why the section flags are having this effect. The diff from arm-none-eabi-objdump -h fw.elf is below, and only shows that I've removed the READONLY flag by not passing my explicit flags.

    --- without-set-flags   2018-08-29 14:29:13.971991842 -0400
    +++ with-set-flags      2018-08-29 14:28:53.065266231 -0400
    @@ -4,7 +4,7 @@
     Sections:
     Idx Name          Size      VMA       LMA       File off  Algn
       0 .bootldr      000021b4  08000000  08000000  00010000  2**0
    -                  CONTENTS, ALLOC, LOAD, DATA
    +                  CONTENTS, ALLOC, LOAD, READONLY, DATA
       1 .vector_table 000001f8  08020000  08020000  00020000  2**2
                       CONTENTS, ALLOC, LOAD, READONLY, DATA
       2 .text         0001d410  080201f8  080201f8  000201f8  2**6
    

    On a related note: I discovered this while loading with gdb's load rather than using openocd's program or flash write_image.

    In any case: while it's true that xPSR.T being set to 0 is not entirely related to target algorithms, it's also the case that, given that reset halt can cause xPSR.T to be set to 0, we should explicitly set it when trying to run algorithms.

     
    • Antonio Borneo

      Antonio Borneo - 2018-08-30

      When loading the fw.elf binary composed with the bootloader image (generated as described above), the bootloader section in fw.elf (.bootldr) appears to be filled with zeros rather than actual data. As a result, when the processor resets, it reads the second element of the interrupt vector (0, in this case), and sets pc=0, xPSR=0. This is why after a reset I would observe failures would begin to happen (as a reset would trigger loading 0 into xPSR).

      Failures only occurred every other time because erasing the flash (which happened without running a target algorithm on the device) causes the interrupt vector to contain (instead) 0xffffffff, resulting in pc=0xfffffffe, xPSR.T=1 on the next reset. This likely means that a reset between erase and writing would have also worked around the issue.

      This makes sense!
      After a "reset halt" the PC is loaded from the reset vector and the thumb mode is set from the LSB of the reset vector.
      In OpenOCD there is nothing that forces thumb mode before executing an algorithm (and every algorithm for ARM in contrib/loaders/ is written in thumb).
      I have not tested your patch, but the functionality seems correct.
      But now that the root cause is clear, I suggest you update both the commit message and the comment in your patch.

       
    • Paul Fertser

      Paul Fertser - 2018-08-30

      On Wed, Aug 29, 2018 at 06:49:55PM -0000, Cody Schafer wrote:

      Ah, found out why xPSR.T was 0:
      ...

      So obvious in the hindsight but boy what a rough trip you had finding
      it! Thank you so much for your persistence and sharing the
      result.

      --
      Be free, use free (http://www.gnu.org/philosophy/free-sw.html) software!
      mailto:fercerpav@gmail.com

       
  • Cody Schafer

    Cody Schafer - 2018-08-29

    For others running into this: turns out the magical flag for objcopy is contents, without which it zeros the section's content.

     
