Q_NEW() fails and controller goes to hardfault

Manu K
2017-05-30
2017-06-09
  • Manu K

    Manu K - 2017-05-30

    Development environment:
    Controller: STM32F205
    IDE: Keil
    QM framework : 3.3.0
    QP Version: 5.6.1.

    Issue:
    Q_NEW() is invoked from the UART interrupt callback, and the resulting event is posted to the UART active object. After repeated posts from the callback, the system ends up in the hard fault handler.
    Call stack as obtained from the Keil IDE:
    HardFault_Handler <- QMPool_get() <- QF_New() <- UART_CallBack.

     
  • Panopticon

    Panopticon - 2017-05-30

    You can obtain the Program Counter (PC) of where the hard fault happened - what line of QMPool_get() is faulting? It sounds like you have memory corruption (stack or memory pools or something).

     
  • Quantum Leaps

    Quantum Leaps - 2017-05-30

    The first thing to check when you are getting Hard Faults is stack overflow. Do you have adequate stack?

    You don't care to mention which real-time kernel you are using, but note that the preemptive QK kernel requires more stack than the cooperative QV kernel, for example.

    Also, since you apparently are using STM32Cube, please note that their interrupt handling is quite inefficient (ISRs make callbacks, etc.). If you are interrupting for every byte received and perhaps every byte transmitted, you could be doing a lot of interrupting...

    Also, posting a QP event for every byte is quite inefficient. It is better to collect several bytes (e.g., 16) and post an event only when the buffer fills up or a timeout occurs. (Of course, the most efficient technique is to use DMA to fill an event for you and get an interrupt only when the DMA is done.)

    --MMS
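
A minimal sketch of the byte-batching idea above, in plain C without any QP calls. The names RxCollector, RxFlushFn and RxStats are hypothetical; the flush callback stands in for "allocate one event, copy the batch into it, post it", and the batch size of 16 is just the example figure from the post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { RX_BATCH_SIZE = 16 };   /* bytes collected before one event is posted */

/* Hypothetical sink: stands in for "allocate an event, copy the batch, post it". */
typedef void (*RxFlushFn)(uint8_t const *data, size_t len, void *ctx);

typedef struct {
    uint8_t   data[RX_BATCH_SIZE];
    size_t    len;      /* number of bytes collected so far */
    RxFlushFn flush;
    void     *ctx;
} RxCollector;

void RxCollector_init(RxCollector *me, RxFlushFn flush, void *ctx) {
    assert(flush != NULL);
    me->len   = 0U;
    me->flush = flush;
    me->ctx   = ctx;
}

/* Hand over whatever has been collected; call this from the timeout path too. */
void RxCollector_flush(RxCollector *me) {
    if (me->len > 0U) {
        me->flush(me->data, me->len, me->ctx);
        me->len = 0U;
    }
}

/* Call once per received byte, for example from the UART RX callback. */
void RxCollector_onByte(RxCollector *me, uint8_t byte) {
    me->data[me->len++] = byte;
    if (me->len == RX_BATCH_SIZE) {
        RxCollector_flush(me);   /* buffer full: one "event" for 16 bytes */
    }
}

/* Example sink that only counts; a real one would post to an active object. */
typedef struct { unsigned events; size_t bytes; } RxStats;

void RxStats_count(uint8_t const *data, size_t len, void *ctx) {
    RxStats *stats = (RxStats *)ctx;
    (void)data;
    stats->events += 1U;
    stats->bytes  += len;
}
```

With this shape, 40 received bytes produce two full batches, and the remaining 8 bytes go out when the timeout path calls RxCollector_flush().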

     
  • Gawie de Vos

    Gawie de Vos - 2017-05-31

    Hi Manu

    You are definitely experiencing some memory corruption. Stack overflow is the first thing to check, but if that is not the culprit, I can share some previous experiences that might help.

    I have encountered similar hard faults with Q_NEW()->QMPool_get() in the past. Note that the first 4 bytes of an unused item in a mempool are a pointer to the next free item in that mempool (linked list). Almost all of my issues were due to writing beyond the edge of a previously allocated mempool item and thereby corrupting the next pointer. See link for some background:
    https://sourceforge.net/p/qpc/discussion/668726/thread/fe3def78/?limit=25#cd59

    To add to what Miro said about collecting a number of bytes before posting, see the link below for some discussion about the same topic:
    https://sourceforge.net/p/qpc/discussion/668726/thread/50f10adb/#8cec

    Cheers,
    Gawie
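
The failure mode described above can be reproduced with a toy free-list pool. This is a sketch under stated assumptions, not the QP implementation: ToyPool and its functions are hypothetical names, the link is pointer-sized (4 bytes on the STM32, wider on a PC), and where the real pool asserts, this version reports the problem through a flag so that it can be run on a host.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE (sizeof(void *) + 8U)   /* link plus a little payload */

typedef struct {
    uint8_t *start;      /* first block of the pool */
    uint8_t *end;        /* last block of the pool */
    void    *free_head;  /* first free block, NULL when the pool is empty */
} ToyPool;

/* A free block stores the address of the next free block in its first bytes. */
static void *link_get(void const *block) {
    void *next;
    memcpy(&next, block, sizeof next);
    return next;
}
static void link_set(void *block, void *next) {
    memcpy(block, &next, sizeof next);
}

void ToyPool_init(ToyPool *me, uint8_t *sto, size_t nBlocks) {
    assert(nBlocks > 0U);
    me->start = sto;
    me->end   = sto + (nBlocks - 1U) * BLOCK_SIZE;
    for (size_t i = 0U; i + 1U < nBlocks; ++i) {
        link_set(sto + i * BLOCK_SIZE, sto + (i + 1U) * BLOCK_SIZE);
    }
    link_set(me->end, NULL);   /* last block terminates the list */
    me->free_head = sto;
}

bool ToyPool_inRange(ToyPool const *me, void const *p) {
    uintptr_t a = (uintptr_t)p;
    return (a >= (uintptr_t)me->start) && (a <= (uintptr_t)me->end);
}

/* Returns a block, or NULL when the pool is empty or its free list is damaged.
 * *corrupt is set when the link read from the block is out of range.
 */
void *ToyPool_get(ToyPool *me, bool *corrupt) {
    void *fb = me->free_head;
    *corrupt = false;
    if (fb != NULL) {
        void *fb_next = link_get(fb);
        if ((fb_next != NULL) && !ToyPool_inRange(me, fb_next)) {
            *corrupt = true;   /* the damage was done earlier, by someone else */
            return NULL;
        }
        me->free_head = fb_next;
    }
    return fb;
}
```

Writing BLOCK_SIZE + sizeof(void *) bytes into an allocated block overwrites the link of its neighbour. Nothing goes wrong at that moment; the problem only shows up on a later, unrelated allocation that reads the damaged link.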

     
  • Manu K

    Manu K - 2017-06-08

    Hi All,
    Thanks for the reply.

    I've checked the possibilities suggested above, but none of them has helped yet.
    I checked for stack and heap overflow by filling both with a known pattern and running the system. This came back negative: a sufficient amount of the pattern remained unwritten. I have also checked for writes beyond the edge of previously allocated memory, which did not turn up anything either.
    What I could observe is that the Q_NEW() calls from callbacks are causing the issue (attached screenshot HardFault_FunctionCallBackTrace.png). After hitting the hard fault, the call stack showed QMPool_get() with free_head pointing to invalid memory (attached screenshot PoolGetFreeHeadInvalidMemory.png).
    I had not been informing QP of ISR entry and exit with QK_ISR_ENTRY()/QK_ISR_EXIT() for all the peripherals in use. After adding these, a class gets stuck waiting on preemption in the qk_port.s file (attached screenshot PreemptionbyPendSV.png).
    The QP reference manual says to do explicit garbage collection when using "raw" event queues such as QEQueue. I'm using one in a different class, but I'm not sure whether that is causing the issue, as I'm not seeing any heap overflow. I still cannot confirm whether this is a memory corruption issue or a preemption issue. Any help would be greatly appreciated.

    Thanks.
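
For reference, the fill-and-inspect check described above can be sketched as follows. The function names and the 0xA5 fill value are assumptions for illustration, and the scan direction assumes a stack that grows downward (as on Cortex-M), so untouched bytes are counted from the lowest address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STACK_FILL 0xA5U   /* arbitrary pattern, unlikely to be written by chance */

/* Fill the whole stack region with the pattern before the system starts. */
void stack_paint(uint8_t *stack, size_t size) {
    assert(stack != NULL);
    memset(stack, (int)STACK_FILL, size);
}

/* Count pattern bytes from the low end of the region, i.e. the part of a
 * downward-growing stack that has never been written.
 */
size_t stack_unused(uint8_t const *stack, size_t size) {
    size_t n = 0U;
    assert(stack != NULL);
    while ((n < size) && (stack[n] == STACK_FILL)) {
        ++n;
    }
    return n;
}
```

A healthy margin here rules out plain stack exhaustion, but it cannot rule out a stray write through a bad pointer or an overrun inside an event pool, which is why the thread moves on to the pool contents.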

     
  • Gawie de Vos

    Gawie de Vos - 2017-06-08

    Hi Manu

    From your screenshot it is evident that your mempool is corrupted:

    A mempool is defined by its start and end pointers in memory. In your case:
    start: 0x20000A5C
    end: 0x20000CB0

    And your next free block pointer (free_head) must point to somewhere in the mempool, which it isn’t.
    free_head: 0x010100B3

    Either something is writing past the end of an allocated mempool item or something else is corrupting your mempool object itself.

    Do you see any assertions firing? This should catch free_head being corrupted when writing past the end of a mempool item:

            /* pool is not empty, so the next free block must be in range
            *
            * NOTE: the next free block pointer can fall out of range
            * when the client code writes past the memory block, thus
            * corrupting the next block.
            */
            Q_ASSERT_ID(330, QF_PTR_RANGE_(fb_next, me->start, me->end));
    

    What I had to do in the past is to copy the complete range of mempool data from a debug Memory view and analyse it by hand in a text editor. Have a look at Miro’s book ("7.9 Native QF Memory Pool") for a better understanding of mempool's inner workings.

    Good luck.
    Gawie

     
  • Manu K

    Manu K - 2017-06-09

    Hi Gawie,

    Thanks for the input. I used the assert to catch the allocation just before free_head ends up pointing to invalid memory (attached screenshot) and found that it comes from the SPI callback. But the callback only posts a signal, and I see nothing there that could corrupt the memory (attached screenshot). I've also gone through each event posting to make sure that nothing writes past a block boundary.

    Thanks,
    Manu

     
    • Gawie de Vos

      Gawie de Vos - 2017-06-12

      Hi Manu

      Great, your ASSERT is firing. BTW, you should never develop (or even release) with ASSERTs disabled!

      Your mempool is already corrupted when your SPI callback calls Q_NEW(). The ASSERT is just telling you that fb_next was just assigned a value that does not point into the memory allocated for your mempool.
      The mempool wasn’t corrupted by whatever was running when the ASSERT fired; it was corrupted by some previous operation, which can be totally unrelated, depending on what other events were created from this mempool.
      fb (me->free_head) is now pointing to the corrupted memory. The data in the mempool item just before the item that fb points to is your only clue to finding this bug. I suspect that something has written past the boundary of that mempool item. You will need to analyse this mempool data to find its event id and related data, and then track it back to where that event was created and populated.

      I suggest that you copy the complete range of mempool data (start to end) from a debug Memory view and analyse it by hand in a text editor. Refer to Miro’s book ("7.9 Native QF Memory Pool") to understand what you are looking at.

      Good luck.

      Cheers,
      Gawie

       

      Last edit: Gawie de Vos 2017-06-12
  • Quantum Leaps

    Quantum Leaps - 2017-06-09

    I'm really confused as to what you are doing. The SPIM_signalObjectEvent() code is obviously not an ISR, because it takes parameters. Yet, you have some commented-out QK_ISR_ENTRY() code with a comment from QK-nano (?). Perhaps the SPIM_signalObjectEvent() function is called from an ISR (?) Yet, you have some HAL_delay(1) code commented out, which would preclude ISR context.

    The event allocation pEvt = Q_NEW(QEvt, ...) is a prime example of a situation where you could use a "static event", which does not come from a pool. Your event has no event parameters (QEvt type), so it doesn't change. You could simply pre-allocate a static and const event and post it:

    static QEvt const spiTransferComplete = { SPI_EVENT_TRANSFER_COMPLETE_SIG, 0, 0};
    QACTIVE_POST(&l_ButtonBar, &spiTransferComplete, (void *)0);
    

    This is faster, more efficient, and does not require any space in an event pool. In fact the "static event" is allocated in ROM, so even this one event does not cost RAM.

    --MMS

     
  • Manu K

    Manu K - 2017-06-14

    Hi All,
    Thanks a lot for the reply.

    I've tried to go through events which can possibly cause the memory pool corruption.
    In one of the events I could find a uint16_t variable updated with uint32_t.

    Snippet:

    typedef struct {
        QEvt super;
        QMActive * p_senderAO;
        uint8_t * pData;
        uint16_t dataLen;
    } UARTDriverEvt;

    UARTDriverEvt * pRxEvt;

    pRxEvt->dataLen = me->handle_uart->GetRxCount();

    GetRxCount() returns a uint32_t value.

    Could the above cause the issue? My doubt is how assigning a 32-bit value to a 16-bit variable could cause it, since the variable only takes what 16 bits can hold. I'm not seeing any asserts or hard faults after the above change. But how can I be sure that it fixes the issue?

    I've also tried static events to reduce overhead, but the hard fault still occurred.
    @QL, kindly ignore the comments. It's only development code, and the QK-nano comment was put there by mistake.

    Thanks,
    Manu
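
A small host-side sketch of the question above: storing a uint32_t into a uint16_t field keeps only the low 16 bits and does not write outside the field. LenDemo and its guard member are hypothetical and exist only to show that the neighbouring memory stays untouched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint16_t dataLen;
    uint16_t guard;   /* hypothetical neighbour, only here to show it is not touched */
} LenDemo;

/* Store a 32-bit count into the 16-bit field, as in the snippet above. */
void LenDemo_setLen(LenDemo *me, uint32_t rxCount) {
    assert(me != NULL);
    me->dataLen = (uint16_t)rxCount;   /* value is reduced modulo 65536 */
}
```

So the assignment can produce a wrong length for counts above 65535, but by itself it cannot overwrite the next block in a pool. A wrong length that is later used to size a copy is a different matter.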

     
    • Gawie de Vos

      Gawie de Vos - 2017-06-14

      Hi Manu

      Regarding your question about assigning a 32bit value to a 16bit variable - the value will be truncated to 16bits. Your compiler should warn you about it.

      The event pool that you previously referred to in your screenshots had a blocksize of 4 (l_smlPoolSto), which can only be used for events with no data part. Your UARTDriverEvt has a size of 14, so either you are allocating events wrongly or this event is allocated from a different pool. If it is the latter, then you are barking up the wrong tree.

      You did not include the code where you allocate an event for pRxEvt. Is it something like this?

      UARTDriverEvt * pRxEvt;
      pRxEvt = Q_NEW(UARTDriverEvt, YOUR_SIG);
      pRxEvt->dataLen = me->handle_uart->GetRxCount();
      

      Something else - your assert handler checks for a specific file and line number before executing __breakpoint(0). It is not clear from the screenshot what happens to other asserts.

      Cheers,
      Gawie

       
  • Manu K

    Manu K - 2017-06-14

    Hi Gawie,

    Thanks for the reply.
    The event allocation for pRxEvt is the same as what you've written (attaching screenshot). I didn't understand what you mean by "you are allocating events wrongly or this event is allocated from a different pool". Do we need to state explicitly which pool the event should come from?
    I put that specific check in the assert handler because the pointer was getting corrupted after that particular assert failure (qf_mem.c, 330). One more assert that fails is number 110 in qf_actq.c. No other asserts fire.

    Thanks,
    Manu

     
    • Quantum Leaps

      Quantum Leaps - 2017-06-14

      Manu: this is probably unrelated to your corrupted event pool assertions. But in your screenshot you apparently send a pointer to a data buffer me->SestinatinData.pData inside an event. In most cases this is a bad idea, because the external data buffer is a shared resource. So, by putting a pointer to this shared buffer into the event, you bypass all safety mechanisms in the framework for delivering event data in a thread-safe manner from the producer to the consumer. Sure, the framework delivers the pointer correctly, but the actual data is not part of the event, so the framework is helpless here. I hope you see that you potentially have a race condition around the shared data buffer. By "race condition" I mean a situation where the data in the buffer is changed (e.g., by an interrupt) while the processing of this data is not finished yet.

      --MMS
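
One common way around the shared-buffer problem described above is to carry the bytes inside the event. The sketch below is only an illustration in plain C: RxDataEvt, RX_PAYLOAD_MAX and RxDataEvt_fill are hypothetical names, and in a real QP application the struct would start with a QEvt member and be allocated with Q_NEW().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { RX_PAYLOAD_MAX = 16 };   /* assumed upper bound on bytes per event */

/* Hypothetical event layout: the data travels inside the event itself. */
typedef struct {
    uint16_t sig;
    uint16_t dataLen;
    uint8_t  data[RX_PAYLOAD_MAX];
} RxDataEvt;

/* Copy at most RX_PAYLOAD_MAX bytes into the event.
 * Returns the number of bytes actually copied.
 */
size_t RxDataEvt_fill(RxDataEvt *e, uint8_t const *src, size_t len) {
    assert((e != NULL) && (src != NULL));
    if (len > (size_t)RX_PAYLOAD_MAX) {
        len = (size_t)RX_PAYLOAD_MAX;   /* never write past the event's payload */
    }
    memcpy(e->data, src, len);
    e->dataLen = (uint16_t)len;
    return len;
}
```

Once the copy is made, the producer can reuse its buffer right away and the consumer sees a stable snapshot. The cost is a larger event and one memcpy per event, and the clamp on len matters: an unchecked copy into a pool block is exactly the kind of write that damages the neighbouring block.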

       

      Last edit: Quantum Leaps 2017-06-14
  • Gawie de Vos

    Gawie de Vos - 2017-06-14

    Hi Manu

    What I am trying to say is that your mempool corruption is happening in the pool with block size = 4. CLIUARTDriverEvt requires at least 14 bytes and events of this type will be allocated from another pool. You don't have to specify the event pool when allocating an event, QPC will automatically select the correct pool by using the size of the event.

    Maybe I am not understanding you correctly, but it seems like you are ignoring some asserts - you should break on all assertions while debugging. Please correct me if I am wrong.

    Cheers,
    Gawie
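
The size-based pool selection mentioned above can be pictured as a first-fit search over pools sorted by ascending block size. This is a sketch of the idea, not the framework's code: pool_index_for is a hypothetical helper, and the block sizes in the usage note (apart from the 4-byte pool from the screenshots) are made up.

```c
#include <assert.h>
#include <stddef.h>

/* Return the index of the first pool whose block size can hold evtSize.
 * blockSizes must be sorted in ascending order.
 * Returns -1 when no pool is large enough.
 */
int pool_index_for(size_t const *blockSizes, size_t nPools, size_t evtSize) {
    assert(blockSizes != NULL);
    for (size_t i = 0U; i < nPools; ++i) {
        if (evtSize <= blockSizes[i]) {
            return (int)i;
        }
    }
    return -1;
}
```

With pools of 4, 16 and 64 bytes, a 4-byte event with no parameters lands in pool 0, while a 14-byte event lands in pool 1. An overrun in the 14-byte event would therefore damage the 16-byte pool, not the 4-byte one.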

     
