Development environment:
Controller: STM32F205
IDE: Keil
QM framework : 3.3.0
QP Version: 5.6.1.
Issue:
Q_NEW() is invoked from the UART interrupt callback to allocate an event, which is then posted to the UART active object. After repeated posts from the callback, the system ends up in the hard fault handler.
Call stack backtrace as obtained from the Keil IDE:
HardFault_Handler <- QMPool_get() <- QF_New() <- UART_CallBack.
You can obtain the Program Counter (PC) of where the hard fault happened - what line of QMPool_get() is faulting? It sounds like you have memory corruption (stack or memory pools or something).
The first thing to check when you are getting Hard Faults is stack overflow. Do you have adequate stack?
You don't care to mention which real-time kernel you are using, but note that the preemptive QK kernel requires more stack than the cooperative QV kernel, for example.
Also, since you apparently are using STM32Cube, please note that their interrupt handling is quite inefficient (ISRs make callbacks, etc.). If you are interrupting for every byte received and perhaps every byte transmitted, you could be doing a lot of interrupting...
Also, posting a QP event for every byte is quite inefficient. It is better to collect several bytes (e.g., 16) and post an event only when the buffer fills up or a timeout occurs. (Of course, the most efficient technique is to use DMA to fill an event for you and get an interrupt only when the DMA is done.)
--MMS
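The batching idea above can be sketched in plain C as follows. This is only a minimal illustration, not code from the thread: the names `RxCollector`, `RxChunkEvt`, `RxCollector_put()` and `RxCollector_flush()` are hypothetical, the chunk size of 16 is taken from the "e.g., 16" above, and the actual event allocation and posting are left to the caller.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

enum { RX_CHUNK_SIZE = 16 }; /* assumed chunk size ("e.g., 16" in the post above) */

/* Hypothetical event payload: the received bytes travel inside the event */
typedef struct {
    uint8_t  data[RX_CHUNK_SIZE];
    uint16_t len;
} RxChunkEvt;

/* Hypothetical collector that accumulates bytes between posts */
typedef struct {
    uint8_t  buf[RX_CHUNK_SIZE];
    uint16_t count;
} RxCollector;

/* Call on a receive timeout: copies a partial chunk into *out.
 * Returns false when there is nothing to post. */
bool RxCollector_flush(RxCollector *me, RxChunkEvt *out) {
    if (me->count == 0U) {
        return false;
    }
    memcpy(out->data, me->buf, me->count);
    out->len = me->count;
    me->count = 0U;
    return true;
}

/* Call for every received byte: stores the byte and returns true
 * (with *out filled) only when a complete chunk is ready to be posted. */
bool RxCollector_put(RxCollector *me, uint8_t byte, RxChunkEvt *out) {
    me->buf[me->count] = byte;
    ++me->count;
    if (me->count < RX_CHUNK_SIZE) {
        return false;
    }
    return RxCollector_flush(me, out);
}
```

With this shape, the ISR would allocate and post one event per 16 bytes (or per timeout) instead of one per byte, and the data is copied into the event rather than referenced through a pointer.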
Hi Manu
You are definitely experiencing some memory corruption. Stack overflow is the first thing to check, but if that is not the culprit, I can share some previous experiences that might help.
I have encountered similar hard faults with Q_NEW()->QMPool_get() in the past. Note that the first 4 bytes of an unused item in a mempool hold a pointer to the next free item in that mempool (a linked list). Almost all of my issues were due to writing beyond the end of a previously allocated mempool item and thereby corrupting the next pointer. See link for some background:
https://sourceforge.net/p/qpc/discussion/668726/thread/fe3def78/?limit=25#cd59
To add to what Miro said about collecting a number of bytes before posting, see the link below for some discussion about the same topic:
https://sourceforge.net/p/qpc/discussion/668726/thread/50f10adb/#8cec
Cheers,
Gawie
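To make the free-list mechanism described above concrete, here is a small host-side sketch. It is not the QP implementation: `MiniPool` and its functions are hypothetical, the pool reports corruption through a flag instead of an assertion, and the overrun in `MiniPool_demoOverrun()` is deliberate.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* A free block keeps the link to the next free block in its first bytes */
typedef struct FreeBlock { struct FreeBlock *next; } FreeBlock;

typedef struct {
    FreeBlock *start;     /* first block of the pool */
    FreeBlock *end;       /* last block of the pool */
    FreeBlock *free_head; /* head of the linked list of free blocks */
} MiniPool;

/* blockSize must be a multiple of sizeof(FreeBlock) */
void MiniPool_init(MiniPool *me, FreeBlock *sto, size_t nBlocks, size_t blockSize) {
    size_t const step = blockSize / sizeof(FreeBlock);
    FreeBlock *fb = sto;
    me->start = sto;
    me->free_head = sto;
    for (size_t i = 1U; i < nBlocks; ++i) { /* chain the blocks together */
        fb->next = fb + step;
        fb = fb->next;
    }
    fb->next = NULL; /* the last block terminates the list */
    me->end = fb;
}

static bool MiniPool_inRange(MiniPool const *me, FreeBlock const *p) {
    uintptr_t const a = (uintptr_t)p;
    return (a >= (uintptr_t)me->start) && (a <= (uintptr_t)me->end);
}

/* Returns a block, or NULL when the pool is empty or the free list is
 * found to be corrupted (in which case *corrupted is set to true). */
void *MiniPool_get(MiniPool *me, bool *corrupted) {
    FreeBlock *fb = me->free_head;
    *corrupted = false;
    if (fb == NULL) {
        return NULL;
    }
    FreeBlock *fb_next = fb->next;
    if ((fb_next != NULL) && !MiniPool_inRange(me, fb_next)) {
        *corrupted = true; /* the link stored in this block is out of range */
        return NULL;
    }
    me->free_head = fb_next;
    return fb;
}

/* Writes past the end of an allocated block on purpose and reports
 * whether the next allocation notices the damaged link. */
bool MiniPool_demoOverrun(void) {
    static FreeBlock sto[8]; /* 4 blocks of 2 pointer-sized words each */
    size_t const blockSize = 2U * sizeof(FreeBlock);
    MiniPool pool;
    bool corrupted;
    MiniPool_init(&pool, sto, 4U, blockSize);
    void *blk = MiniPool_get(&pool, &corrupted);
    /* deliberate bug: one pointer-size too many, clobbers the next block's link */
    memset(blk, 0xFF, blockSize + sizeof(FreeBlock));
    (void)MiniPool_get(&pool, &corrupted);
    return corrupted;
}
```

Note that the overrun itself goes unnoticed; the damage only shows up on a later allocation, possibly from completely unrelated code, which matches the symptom of a fault inside QMPool_get().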
Hi All,
Thanks for the reply.
I've checked the possibilities mentioned above, but they have not helped yet.
I checked for stack and heap overflow by filling the memory with known data and running the system. This came back negative for any possible overflows, as sufficient data remained unwritten. I have also checked for any possible writing beyond the end of previously allocated memory, which did not help either.
What I could observe is that the Q_NEW() calls from callbacks are causing the issue (attached screenshot HardFault_FunctionCallBackTrace.png). Checking the call stack after hitting the hard fault, I've seen QMPool_get() with free_head pointing to invalid memory (attached screenshot PoolGetFreeHeadInvalidMemory.png). I had not taken care to inform QP about ISR entry and exit for all the used peripherals using QK_ISR_ENTRY/EXIT. After adding these, execution gets stuck on the preemption waits in the qk_port.s file (attached screenshot PreemptionbyPendSV.png). I had also seen that the QP reference manual says to do explicit garbage collection when using "raw" event queues like QEQueue. I'm using one in a different class, but I'm not sure whether that is causing the issue, as I'm not seeing any heap overflow. I still could not confirm whether this is a memory corruption issue or a preemption issue. Any help would be greatly appreciated.
Thanks.
Hi Manu
From your screenshot it is evident that your mempool is corrupted:
A mempool is defined by its start and end pointers in memory. In your case:
start: 0x20000A5C
end: 0x20000CB0
And your next free block pointer (free_head) must point to somewhere in the mempool, which it isn’t.
free_head: 0x010100B3
Either something is writing past the end of an allocated mempool item or something else is corrupting your mempool object itself.
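Do you see any assertions firing? This should catch free_head being corrupted when writing past the end of a mempool item:
/* pool is not empty, so the next free block must be in range
*
* NOTE: the next free block pointer can fall out of range
* when the client code writes past the memory block, thus
* corrupting the next block.
*/
Q_ASSERT_ID(330, QF_PTR_RANGE_(fb_next, me->start, me->end));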
What I had to do in the past is to copy the complete range of mempool data from a debug Memory view and analyse it by hand in a text editor. Have a look at Miro’s book ("7.9 Native QF Memory Pool") for a better understanding of mempool's inner workings.
Good luck.
Gawie
Hi Gawie,
Thanks for the input. I've used the assert to catch the allocation just before free_head is made to point to invalid memory (attached screenshot) and found that it comes from the SPI callback. But only a signal is posted from that callback, and I see nothing there that could corrupt the memory (attached screenshot). I've also gone through each event posting to make sure that no boundary-crossing writes are happening.
Thanks,
Manu
Hi Manu
Great, your ASSERT is firing. BTW, you should never develop (or even release) with ASSERTs disabled!
Your mempool is already corrupted when your SPI callback is calling Q_NEW(). The ASSERT is just telling you that fb_next was just assigned a value that is not pointing to memory in the allocated space for your mempool.
The mempool wasn’t corrupted by the code that was running when the ASSERT fired. It was corrupted by some previous operation, which can be totally unrelated and depends on what other events were created from this mempool.
fb (me->free_head) is now pointing to the corrupted memory. The data in the mempool item just before the item that fb is pointing to is your only clue to find this bug. I suspect that something has written past the boundary of that mempool item. You will need to analyse this mempool data to find its event id and related data, and then track it back to where that event was created and populated.
I suggest that you copy the complete range of mempool data (start to end) from a debug Memory view and analyse it by hand in a text editor. Refer to Miro’s book ("7.9 Native QF Memory Pool") to understand what you are looking at.
Good luck.
Cheers,
Gawie
Last edit: Gawie de Vos 2017-06-12
I'm really confused as to what you are doing. The SPIM_signalObjectEvent() code is not an ISR, obviously, because it takes parameters. Yet, you have some commented out QK_ISR_ENTRY() code with a comment from QK-nano (?). Perhaps the SPIM_signalObjectEvent() function is called from an ISR (?) Yet, you have some HAL_delay(1) code commented out, which would preclude ISR context.
The event allocation pEvt = Q_NEW(QEvt, ...) is a prime example of a situation where you could use a "static event" which does not come from a pool. Your event has no event-parameters (QEvt type), so it doesn't change. You could simply pre-allocate a static and const event and post it:
This is faster, more efficient, and does not require any space in an event pool. In fact the "static event" is allocated in ROM, so even this one event does not cost RAM.
--MMS
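The code that accompanied this post is not preserved in the thread. As a rough, self-contained sketch of the "static event" idea only: the `QEvt` layout below is a stand-in modeled on QP/C 5.x, and `SPI_DONE_SIG`, `post_stub()`, `g_lastPosted` and `SPIM_onTransferDone()` are hypothetical names introduced for illustration.

```c
#include <stdint.h>

/* Stand-ins for the framework types so that the sketch compiles on its own.
 * In a real project these come from the QP/C headers instead. */
typedef uint16_t QSignal;
typedef struct {
    QSignal sig;              /* signal of the event */
    uint8_t poolId_;          /* 0 means: not allocated from any event pool */
    uint8_t volatile refCtr_; /* reference counter, unused for static events */
} QEvt;

enum { SPI_DONE_SIG = 4 };    /* hypothetical signal */

/* Immutable event; a const object like this is typically placed in ROM */
static QEvt const l_spiDoneEvt = { SPI_DONE_SIG, 0U, 0U };

/* Hypothetical stand-in for posting to an active object. It only records
 * the event pointer so that the sketch can be exercised on a host. */
QEvt const *g_lastPosted = (QEvt const *)0;
void post_stub(QEvt const *e) {
    g_lastPosted = e;
}

/* Callback body: no Q_NEW(), hence no event-pool access at all */
void SPIM_onTransferDone(void) {
    post_stub(&l_spiDoneEvt);
}
```

In an application, the stub would be replaced by the framework's posting macro, e.g. QACTIVE_POST() with the target active object, the address of the static event, and the sender. Manu reports further down that the hard fault persisted even with static events, which is consistent with the corruption originating elsewhere.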
Hi All,
Thanks a lot for the reply.
I've tried to go through events which can possibly cause the memory pool corruption.
In one of the events I found a uint16_t variable being assigned a uint32_t value.
Snippet:
typedef struct {
    QEvt super;
    QMActive *p_senderAO;
    uint8_t *pData;
    uint16_t dataLen;
} UARTDriverEvt;
UARTDriverEvt *pRxEvt;
pRxEvt->dataLen = me->handle_uart->GetRxCount();
GetRxCount() returns a uint32_t value.
Could the above cause the issue? My doubt is how assigning a 32-bit value to a 16-bit variable could cause it, since only as much as the 16-bit variable can hold gets stored. I'm not seeing any asserts or hard faults after changing this. But how can I be sure that this change fixes the issue?
I've also tried the static events to reduce overhead, but the hard fault still occurred.
@QL, kindly ignore the comments. It's only development code, and the comment from QK-nano was put there by mistake.
Thanks,
Manu
Hi Manu
Regarding your question about assigning a 32bit value to a 16bit variable - the value will be truncated to 16bits. Your compiler should warn you about it.
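A tiny example of what the truncation does in C. The helper names are hypothetical; the point is that the conversion keeps only the low 16 bits and stores exactly two bytes, so by itself it does not write outside the dataLen field.

```c
#include <stdbool.h>
#include <stdint.h>

/* Converting to uint16_t keeps only the low 16 bits (value modulo 65536) */
uint16_t truncate_to_u16(uint32_t value) {
    return (uint16_t)value;
}

/* Range check that could guard such an assignment */
bool fits_in_u16(uint32_t value) {
    return value <= UINT16_MAX;
}
```

For instance, 70000 becomes 4464 (70000 - 65536). A wrong length produced this way can still do harm later if it is used to size a copy, so checking the range before the assignment is a reasonable precaution.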
The event pool that you previously referred to in your screenshots had a block size of 4 (l_smlPoolSto), which can only be used for events with no data part. Your UARTDriverEvt has a size of 14, so either you are allocating events wrongly or this event is allocated from a different pool. If it is the latter, then you are barking up the wrong tree.
You did not include the code where you allocate an event for pRxEvt. Is it something like this?
Something else - your assert handler is checking for a specific file and line number before executing __breakpoint(0). It is not clear from the screenshot what is happening to other asserts?
Cheers,
Gawie
Hi Gawie,
Thanks for the reply.
The event allocation for pRxEvt is the same as what you've written (attaching screenshot). I didn't understand what you mean by "you are allocating events wrongly or this event is allocated from a different pool". Do we need to explicitly specify which pool the event has to come from?
I made the assert handler that specific because the pointer was getting corrupted after that particular assert failure (qf_mem.c, 330). One more assert failure occurs as well: qf_actq.c, assert no. 110. There are no other asserts.
Thanks,
Manu
Manu: this is probably unrelated to your corrupted event pool assertions. But in your screenshot you apparently send a pointer to a data buffer me->SestinatinData.pData inside an event. In most cases this is a bad idea, because the external data buffer is a shared resource. So, by putting a pointer to this shared buffer you bypass all safety mechanisms in the framework to deliver event data in a thread-safe manner from the producer to the consumer. Sure, the framework delivers the pointer correctly, but the actual data is not part of the event, so the framework is helpless here. I hope you see that you potentially have a race condition around the shared data buffer. By "race condition" I mean a situation where the data in the buffer will be changed (e.g., by an interrupt), while the processing of this data is not finished yet.
--MMS
Last edit: Quantum Leaps 2017-06-14
Hi Manu
What I am trying to say is that your mempool corruption is happening in the pool with block size = 4. CLIUARTDriverEvt requires at least 14 bytes, so events of this type will be allocated from another pool. You don't have to specify the event pool when allocating an event. QPC will automatically select the correct pool based on the size of the event.
Maybe I am not understanding you correctly, but it seems like you are ignoring some asserts. You should break on all assertions while debugging. Please correct me if I am wrong.
Cheers,
Gawie
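The size-based pool selection mentioned above can be pictured with the following sketch. It is a simplified illustration and not the framework source: `select_pool()` is a hypothetical helper, and it assumes the pools are listed in ascending order of block size.

```c
#include <stddef.h>

/* Returns the 1-based index of the first (smallest) pool whose block size
 * can hold an event of evtSize bytes, or 0 when no pool is large enough.
 * blockSizes[] is assumed to be sorted in ascending order. */
unsigned select_pool(size_t const *blockSizes, unsigned nPools, size_t evtSize) {
    for (unsigned i = 0U; i < nPools; ++i) {
        if (evtSize <= blockSizes[i]) {
            return i + 1U;
        }
    }
    return 0U;
}
```

With hypothetical block sizes of 4, 16 and 64 bytes, a plain 4-byte QEvt would land in the first pool, while a 14-byte event such as the one discussed here would land in the second. Corruption seen in the 4-byte pool therefore points at whatever sits next to, or writes into, that pool's storage.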