We are reproducibly getting an assertion error from QK when there is a lot of interrupt activity. The error appears to be related to the PendSV interrupt, since that seems to be the only place from which the functions that could generate this assertion are called. We have confirmed that PendSV is the only interrupt at the lowest priority level; all other interrupts (including SysTick) are at higher priority.
We have two STM32 processors, one an M0 and the other an M3. We are running QPC on both processors using the ports/arm-cm/qk/arm port. We are experiencing this problem in both systems on 5.6.3, and experienced the same problem with 5.6.2 and 5.4.1.
Is there anything we should be looking at that could be causing this problem?
Anonymous
Thank you for taking the time to report an issue. This is always highly appreciated.
The first thing I would highly recommend is to look at your interrupt priority settings. The more recent QP ports to Cortex-M3/M4 use the "selective interrupt disabling" policy, which leaves the highest-priority interrupts not disabled at all. Such interrupts are called "kernel unaware" and should never call QP services, as this would lead to corruption of internal data. (Your assertion is indicative of exactly such a situation.)
There is a special App Note "Setting ARM Cortex-M Priorities for QP 5.1 and Higher", which I highly recommend for you to read.
Please make a post to this bug report if setting the interrupt priorities helps you.
--MMS
Thank you for your pointer. The associated changes seem to have improved the issue on the M3 processor; however, the problem seems to be worse on the M0. I just corrected a bug where the PendSV and SYSTICK interrupts were both set to 192 (lowest). I have now changed SYSTICK to 128, and the problem seems to have been exacerbated. Do you have any suggestions about this same assertion on the M0, given that the M0 doesn't have a BASEPRI register?
Last edit: Kirk Wolff 2016-04-14
Please do not change the priority of the PendSV exception in the NVIC. This priority is set to 0xFF (the lowest urgency) in QK_init() and must remain the lowest urgency. All this is described in the aforementioned AppNote "Setting ARM Cortex-M Interrupt Priorities...".
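As a rough illustration of why the raw priority values matter (a sketch under assumptions, not taken from the QP sources): Cortex-M priority fields are 8 bits wide, but only the most significant bits are implemented. The helper `effective_priority()` below is hypothetical; it assumes 2 implemented bits on the Cortex-M0 and 4 bits on the STM32 Cortex-M3 parts, and shows what a written value would read back as.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not a QP or CMSIS function): returns the value an
 * 8-bit priority field holds when only the top `prio_bits` bits are
 * implemented and the remaining low bits read as zero. */
uint8_t effective_priority(uint8_t raw, unsigned prio_bits) {
    assert(prio_bits >= 1u && prio_bits <= 8u);
    uint8_t mask = (uint8_t)(0xFFu << (8u - prio_bits));
    return (uint8_t)(raw & mask);
}
```

Under these assumptions, 0xFF reads back as 0xC0 (192) with 2 implemented bits, so a SysTick priority of 192 would tie with PendSV on the M0, while 128 is strictly more urgent.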
To debug the problem on the M0, please make sure that the QF critical section actually works by inspecting the PRIMASK register while the code is inside a critical section. You don't mention which compiler you are using, so I'm not sure how your critical section is implemented.
--MMS
I didn't say I modified the PendSV; I modified the SYSTICK priority so that it was at a higher priority (lower number) than PendSV. I am using the ARM-MDK from Keil.
Before QF_CRIT_ENTRY_, primask is 0
After QF_CRIT_ENTRY_, primask is 1
After QF_CRIT_EXIT_, primask is back to 0
Last edit: Kirk Wolff 2016-04-14
Good. Do you have enough stack? (No stack overflow?)
Yes, I've been keeping a close eye on it. There is at least 40% margin by the time the assertion happens. Whatever is happening seems to be time sensitive. The system works fine for a short while but eventually asserts. One data point I've seen is from qspy, where the scheduler runs twice in a row, triggering the error.
Would it be possible for you to distill the problem to an example that could be run at Quantum Leaps? Something like that would be very helpful. Being able to reproduce a problem like that is typically more than half of the battle...
If you come up with sample code, please send it to info at state-machine.com along with instructions on how to reproduce the failure (e.g., by manually triggering interrupts or any other method). An example of such a procedure is provided in the AppNotes about the QP ports to the ARM Cortex-M, where in the section about QK you can find the preemption testing procedures.
--MMS
Last edit: Quantum Leaps 2016-04-15
Hi, I'm a developer that works with Kirk.
It seems that QK_sched_(p) is being called with (p == QK_currPrio_) => true. So an AO is preempting itself. I just wrapped the do{}while() in QK_sched_ with a conditional to make sure it is skipped when (p == pin). The assertion never happens anymore and the application has been running smoothly for around 20 minutes now (before it wouldn't make it past a minute).
I'm guessing there is a guard protecting the scheduler from being called with a priority that should not preempt the current one. Somehow we're getting past that and shouldn't be. I'll keep digging.
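A minimal host-side sketch of the kind of guard described above. This is not the QK source; `model_sched`, `curr_prio` and `activations` are hypothetical names, and the check used here (`p <= pin`) is slightly broader than the `p == pin` condition mentioned in the post.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for kernel state (not the real QK variables). */
uint_fast8_t curr_prio = 0;   /* priority currently running, 0 = idle */
unsigned activations = 0;     /* how many run-to-completion steps started */

/* Simplified model of a preemptive scheduler entry. `nested` (if non-zero)
 * models a second scheduler request arriving while priority p is running.
 * Returns true if the step at priority p was actually started. */
bool model_sched(uint_fast8_t p, uint_fast8_t nested) {
    uint_fast8_t pin = curr_prio;     /* priority on entry */
    assert(p > 0);                    /* assumption: 0 is reserved for idle */
    if (p <= pin) {                   /* guard: nothing to preempt */
        return false;
    }
    curr_prio = p;                    /* start the step at priority p */
    ++activations;
    if (nested != 0) {
        (void)model_sched(nested, 0); /* request arriving during the step */
    }
    curr_prio = pin;                  /* restore the preempted priority */
    return true;
}
```

In this model, `model_sched(3, 3)` starts only one step, because the nested request at the same priority is rejected, whereas `model_sched(3, 5)` starts two.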
-Chris
Thank you Chris. Your findings are very interesting and disturbing at the same time. We will also keep digging...
But this comes at a particularly busy time for us here, so please allow about a week for us to catch up. I sincerely apologize for the delay.
--MMS
I think I found it.
In the PendSV handler, the PendSV pending bit is cleared by hardware on entry, but it could be set again before interrupts are disabled to invoke the scheduler. The result is that, once interrupts are enabled again, PendSV fires a second time and the scheduler is called again with the same nextPrio.
Wanted to get this to you quick for feedback, so below is my diff. I think this ensures that only one scheduler invocation happens per priority preemption, but please correct me if I'm wrong. We're still wrestling with some other problems here, but I think they are unrelated.
Thanks,
Chris
Yes, I also came to the same conclusion. There is a time window at the beginning of PendSV, before interrupts get disabled. This bug was introduced when we changed the PendSV implementation to free up the SVCall exception used previously.
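The time window can be illustrated with a small host-side model. This is only a sketch under assumptions: the `Model` struct and all function names are hypothetical, and the "un-pend inside the critical section" option is one plausible mitigation (on Cortex-M it would correspond to writing the PENDSVCLR bit, ICSR[27]), not necessarily what any posted diff or the eventual fix does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     pendsv_pending;   /* models the PendSV pending bit */
    uint8_t  next_prio;        /* models QK_nextPrio_ */
    uint8_t  last_sched_prio;  /* priority of the previous scheduler call */
    unsigned sched_calls;      /* total scheduler invocations */
    unsigned repeated_calls;   /* invocations repeating the same priority */
} Model;

/* An ISR that makes a higher-priority AO ready and pends PendSV. */
static void isr_post(Model *m, uint8_t prio) {
    m->next_prio = prio;
    m->pendsv_pending = true;
}

/* One PendSV activation. isr_in_window: an ISR fires after entry but
 * before interrupts are disabled. unpend_in_crit: clear the pending bit
 * once interrupts are disabled. */
static void pendsv_handler(Model *m, bool isr_in_window, bool unpend_in_crit) {
    assert(m != 0);
    m->pendsv_pending = false;          /* hardware clears the bit on entry */
    if (isr_in_window) {
        isr_post(m, m->next_prio);      /* the race: bit is set again */
    }
    /* --- interrupts disabled from here --- */
    if (unpend_in_crit) {
        m->pendsv_pending = false;      /* discard the spurious request */
    }
    if (m->sched_calls > 0 && m->next_prio == m->last_sched_prio) {
        ++m->repeated_calls;            /* scheduler re-run at same priority */
    }
    m->last_sched_prio = m->next_prio;
    ++m->sched_calls;
    /* --- interrupts re-enabled --- */
}

/* Runs the scenario and returns how many scheduler calls repeated the
 * priority of the previous call. */
unsigned run_scenario(bool unpend_in_crit) {
    Model m = { false, 0u, 0u, 0u, 0u };
    bool first = true;
    isr_post(&m, 3u);                   /* initial request at priority 3 */
    while (m.pendsv_pending) {
        pendsv_handler(&m, first, unpend_in_crit);
        first = false;
    }
    return m.repeated_calls;
}
```

Without the un-pend step, the model runs the handler twice and the second scheduler call repeats priority 3; with it, the handler runs once.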
Thank you for reporting the bug and for performing this thorough analysis. Apparently, you know Cortex-M very well.
The bug fix is on the way. Stay tuned...
--MMS
There may be another bug along the same lines. I'm not 100% sure of this one because I haven't tried to step through the scenario completely. It would explain the symptoms of our remaining problem (but I can't yet rule out that the problem is on our side).
You are using the PendSV when nextPrio==0 as a special case to return to the previously preempted task, whose exception frame sits above the "fake" frame used to "return" to the scheduler, right?
If:
1. PendSV bit is set for this purpose with nextPrio=0 at the end of the sched_ret.
2. The PendSV_Handler is entered
3. An interrupt occurs which updates the nextPrio to something >0 (before interrupts are disabled)
Then I think the request to return to the preempted task is lost and no priorities lower than the current one will ever resume.
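The three steps above can be sketched as a tiny host-side model. It is an illustration only: `pendsv_path` and the `PendSvPath` enum are hypothetical names, and the model assumes the handler chooses its path solely from the value of nextPrio sampled after interrupts are disabled.

```c
#include <stdint.h>

typedef enum {
    PATH_RETURN_TO_PREEMPTED,  /* nextPrio == 0: resume the preempted task */
    PATH_RUN_SCHEDULER         /* nextPrio > 0: launch the scheduler */
} PendSvPath;

/* Models one PendSV activation that was requested with nextPrio == 0.
 * late_isr_prio != 0 models an interrupt that fires after handler entry
 * but before interrupts are disabled, and overwrites nextPrio. */
PendSvPath pendsv_path(uint8_t late_isr_prio) {
    uint8_t next_prio = 0;             /* step 1: request a return */
    /* step 2: handler entered, interrupts still enabled */
    if (late_isr_prio != 0) {
        next_prio = late_isr_prio;     /* step 3: ISR overwrites nextPrio */
    }
    /* interrupts disabled; the handler now samples nextPrio */
    return (next_prio == 0) ? PATH_RETURN_TO_PREEMPTED : PATH_RUN_SCHEDULER;
}
```

In this model, a late interrupt turns the "return" activation into a "run scheduler" activation, and nothing re-issues the original return request.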
We're running into cases now where low-priority tasks are getting queue overflow asserts or pool depletion, and qspy indicates that some low-priority tasks never get resumed, despite nothing else happening on the processor. I think the debugger usually shows the processor stuck in the "branch to self" at the end of the sched_ret with a currPrio >0.
Again, I'm not 100% on this one. I just noticed the symptoms seem to match.
-Chris
I think the problem described above is happening. I fixed it by turning the end of QK_sched_ret into a loop. I think this should work because when an attempt to trigger PendSV with nextPrio==0 gets preempted, the preemption will eventually return back to the loop with nextPrio==0 (from the preempting task's QK_sched_ret).
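The retry idea can be sketched on the host as follows. This is an illustrative model, not the actual diff: `sched_ret_with_retry` is a hypothetical name, and the parameter simply says how many consecutive attempts get preempted by an interrupt that raises nextPrio above 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Models the end of a "scheduler return" sequence that keeps re-issuing
 * the nextPrio == 0 request until PendSV actually takes the return path.
 * Returns the number of attempts that were needed. */
unsigned sched_ret_with_retry(unsigned preempted_attempts) {
    unsigned attempts = 0;
    bool returned = false;
    while (!returned) {
        uint8_t next_prio = 0;          /* request: return to preempted task */
        ++attempts;
        assert(attempts < 1000u);       /* safety bound for the model only */
        if (attempts <= preempted_attempts) {
            next_prio = 4;              /* ISR in the window overwrites it */
        }
        if (next_prio == 0) {
            returned = true;            /* PendSV takes the return path */
        }
        /* otherwise PendSV ran the scheduler for next_prio, and control
         * eventually comes back here, so the request is issued again */
    }
    return attempts;
}
```

With no preemption the model needs one attempt; with two preempted attempts it needs three, but it always terminates on the return path.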
For reference, here's my diff again.
Last edit: Anonymous 2016-04-20
I have encountered the same behaviour in version:
; Last Updated for Version: 5.6.0
; Date of the Last Update: 2015-12-11
It seems I found the explanation in the late-arrival and tail-chaining mechanisms of the Cortex-M.
See the Technical Reference Manual for late arrival and tail chaining.
0) An instruction pends the PendSV interrupt:
STR r1,[r0] ; ICSR[28] := 1 (pend PendSV)
1) Assume some exception "ISR X" occurs before the first instruction of the PendSV handler executes. Late arrival happens: ISR X executes first and PendSV stays pending.
2) Assume ISR X changes QK_nextPrio_ from 0 and pends PendSV, which is already in the pending state.
So PendSV will be executed only once and will clear the frame from the stack only once, and execution returns to a loop instead of to the preempted thread. Your fix will not help for the M4F variant, because in that case "R0, LR" is pushed to the stack at the entry of PendSV, and the missed "tail" of PendSV (which properly pops "R0, LR") causes stack damage.
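The key point, that pending an already-pending exception does not queue a second activation, can be modeled in a few lines. This is a host-side sketch with hypothetical names (`PendSv`, `pend`, `service`), not Cortex-M or QP code.

```c
#include <stdbool.h>

typedef struct {
    bool     pending;        /* a single pending bit, not a counter */
    unsigned handler_runs;   /* how many times the handler executed */
} PendSv;

/* Setting a bit that is already set changes nothing. */
void pend(PendSv *s) { s->pending = true; }

/* The handler runs only if the bit is set, and clears it on entry. */
void service(PendSv *s) {
    if (s->pending) {
        s->pending = false;
        ++s->handler_runs;
    }
}

/* Step 0 pends PendSV; ISR X late-arrives and pends it again (step 2)
 * before the handler has started. */
unsigned runs_after_two_requests(void) {
    PendSv s = { false, 0u };
    pend(&s);       /* request made with nextPrio == 0 */
    pend(&s);       /* ISR X: nextPrio != 0, bit already set */
    service(&s);    /* the only activation */
    service(&s);    /* nothing left to service */
    return s.handler_runs;
}
```

Two requests collapse into a single handler run in this model, which is why code that relies on one activation per request can miss the second one.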
You report that you are using the version 5.6.0, while the comment to this bug filed on 2016-05-01 states that the problem has been fixed in version 5.6.4. Any chance that you might upgrade to that version, or better yet, to the latest QP/C/C++ version?
--MMS
Last edit: Quantum Leaps 2019-12-21
Hi Kirk and Chris,
I wonder if you would be interested in beta-testing the new QK-port that fixes the bug(s) you've found.
If so, please contact Quantum Leaps directly at: info@state-machine.com
--MMS
This bug has been fixed in QP 5.6.4 (in all QP types: QP/C, QP/C++, and QP-nano).
The new Application Note "QP and ARM Cortex-M" explains the updated QK implementation on ARM Cortex-M. The Application Note also explains the QV and QXK kernels.
--MMS