
#128 QK Q_ASSERT_ID(0,...) from QACTIVE_EQUEUE_WAIT_ in QActive_get_ on CortexM

Milestone: QP
Status: closed
Labels: QK (5) CM (3)
Priority: 1
Updated: 2024-08-01
Created: 2016-04-14
Creator: Kirk Wolff
Private: No

We are reproducibly getting an assertion error from QK when there is a lot of interrupt activity. The error appears to be related to the PendSV interrupt, since this seems to be the only place from which the functions that could generate this assertion are called. We have confirmed that PendSV is the only interrupt at the lowest priority level; all other interrupts (including SysTick) are at higher priority.

We have two STM32 processors, one an M0 and the other an M3. We are running QP/C on both processors using the ports/arm-cm/qk/arm port. We are experiencing this problem on both systems with 5.6.3, and had experienced the same problem with 5.6.2 and 5.4.1.

Is there anything we should be looking at that could be causing this problem?

Discussion

  • Quantum Leaps

    Quantum Leaps - 2016-04-14

    Thank you for taking the time to report an issue. This is always highly appreciated.

    The first thing I would highly recommend is to look at your interrupt priority settings. The more recent QP ports to Cortex-M3/M4 use the "selective interrupt disabling" policy, which never disables the highest-priority interrupts. Such interrupts are called "kernel unaware" and should never call QP services, as this would lead to corruption of internal data. (Your assertion is indicative of exactly such a situation.)

    There is a special App Note "Setting ARM Cortex-M Priorities for QP 5.1 and Higher", which I highly recommend for you to read.
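
As a rough illustration of the "kernel aware" versus "kernel unaware" distinction on Cortex-M3/M4, the sketch below models how BASEPRI masking works: when BASEPRI is non-zero, it blocks every interrupt whose priority value is numerically greater than or equal to it. The threshold `EXAMPLE_BASEPRI` and the function `is_kernel_aware` are hypothetical names chosen for this example; the real threshold is defined by the QP port and should be taken from the App Note.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical threshold for illustration only. The actual value is set by
 * the QP port and depends on the number of implemented priority bits. */
#define EXAMPLE_BASEPRI 0x40U

/* An interrupt is masked by the kernel critical section ("kernel aware")
 * only if its priority value is numerically >= the BASEPRI threshold.
 * Interrupts with a smaller value (higher urgency) are never masked and
 * therefore must not call QP services. */
bool is_kernel_aware(uint8_t irq_prio) {
    return irq_prio >= EXAMPLE_BASEPRI;
}
```

Under this model, an interrupt configured with priority value 0x00 is kernel unaware, while one configured with 0xC0 is kernel aware. Cortex-M0 has no BASEPRI register, so this distinction does not apply there.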

    Please make a post to this bug report if setting the interrupt priorities helps you.

    --MMS

     
  • Kirk Wolff

    Kirk Wolff - 2016-04-14

    Thank you for your pointer. Associated changes seemed to have improved the issue on the M3 processor, however the problem seems to be worse on the M0. I just corrected a bug where the PendSV and SYSTICK interrupts were both set to 192 (lowest), and now I've changed the SYSTICK to 128, and the problem seems to have been exacerbated. Do you have any suggestion about this same assertion on the M0, given that the M0 doesn't have a BASEPRI register?

     

    Last edit: Kirk Wolff 2016-04-14
  • Anonymous

    Anonymous - 2016-04-14

    Please do not change the priority of the PendSV exception in the NVIC. This priority is set to 0xFF (the lowest urgency) in QK_init() and must remain the lowest urgency. All this is described in the aforementioned AppNote "Setting ARM Cortex-M Interrupt Priorities...".
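
The values 0xFF and 192 mentioned in this thread can refer to the same priority level. Cortex-M cores implement only the most significant bits of each 8-bit priority field, and the unimplemented low bits read back as zero. The sketch below shows the arithmetic; `effective_priority` is a hypothetical helper, and the bit counts (2 bits on Cortex-M0, 4 bits assumed for the M3 part) are assumptions to check against the device's reference manual.

```c
#include <stdint.h>

/* Value that reads back from an 8-bit priority field when only the top
 * `prio_bits` bits (1..8) are implemented; the low bits read as zero. */
uint8_t effective_priority(uint8_t written, unsigned prio_bits) {
    uint8_t mask = (uint8_t)(0xFFU << (8U - prio_bits));
    return (uint8_t)(written & mask);
}
```

With 2 priority bits, writing 0xFF yields 0xC0 (192), so on such a part "0xFF" and "192" denote the same lowest-urgency level, and 128 is one level more urgent.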

    To debug the problem on the M0, please make sure that the QF critical section actually works by inspecting the PRIMASK register while the code is inside a critical section. You don't mention which compiler you are using, so I'm not sure how your critical section is implemented.

    --MMS

     
  • Kirk Wolff

    Kirk Wolff - 2016-04-14

    I didn't say I modified the PendSV, I modified the SYSTICK priority so that it was at a higher priority (lower number) than PendSV. I am using the ARM-MDK from Keil.

    Before QF_CRIT_ENTRY_, primask is 0
    After QF_CRIT_ENTRY_, primask is 1
    After QF_CRIT_EXIT_, primask is back to 0

     

    Last edit: Kirk Wolff 2016-04-14
  • Quantum Leaps

    Quantum Leaps - 2016-04-14

    Good. Do you have enough stack? (No stack overflow?)

     
  • Anonymous

    Anonymous - 2016-04-15

    Yes I've been keeping a close eye on it. There is at least 40% margin by the time the assertion happens. Whatever is happening seems to be time sensitive. The system works fine for a short while but eventually asserts. One data point I've seen is from qspy where the scheduler runs twice in a row triggering the error.

     
  • Quantum Leaps

    Quantum Leaps - 2016-04-15

    Would it be possible for you to distill the problem to an example that could be run at Quantum Leaps? Something like that would be very helpful. Being able to reproduce a problem like that is typically more than half of the battle...

    If you come up with sample code, please send it to info at state-machine.com along with any instructions on how to reproduce the failure (e.g., by manually triggering interrupts or any other method). An example of such a procedure is provided in the AppNotes about the QP ports to the ARM Cortex-M, where in the section about QK you can find the preemption testing procedures.

    --MMS

     

    Last edit: Quantum Leaps 2016-04-15
  • Anonymous

    Anonymous - 2016-04-15

    Hi, I'm a developer that works with Kirk.

    It seems that QK_sched_(p) is being called with (p == QK_currPrio_) => true. So an AO is preempting itself. I just wrapped the do{}while() in QK_sched_ with a conditional to make sure it is skipped when (p == pin). The assertion never happens anymore and the application has been running smoothly for around 20 minutes now (before it wouldn't make it past a minute).

    I'm guessing there is a guard protecting the scheduler from being called with a priority that should not preempt the current one. Somehow we're getting past that and shouldn't be. I'll keep digging.

    -Chris

     
  • Quantum Leaps

    Quantum Leaps - 2016-04-15

    Thank you Chris. Your findings are very interesting and disturbing at the same time. We will also keep digging...

    But this comes at a particularly busy time for us here, so please allow about a week for us to catch up. I sincerely apologize for the delay.

    --MMS

     
  • Anonymous

    Anonymous - 2016-04-19

    I think I found it.

    In PendSV handler, the PendSV pending bit is cleared by hardware on entry, but could be set again before interrupts are disabled to invoke the scheduler. The result is that PendSV and the scheduler will be called again with the same nextPrio once it enables interrupts.

    Wanted to get this to you quick for feedback, so below is my diff. I think this ensures that only one scheduler invocation happens per priority preemption, but please correct me if I'm wrong. We're still wrestling with some other problems here, but I think they are unrelated.

    Thanks,
    Chris

    Index: ports/arm-cm/qk/arm/qk_port.s
    ===================================================================
    --- ports/arm-cm/qk/arm/qk_port.s   (revision 1241)
    +++ ports/arm-cm/qk/arm/qk_port.s   (working copy)
    @@ -98,20 +98,30 @@
     PendSV_Handler
    
       IF {TARGET_ARCH_THUMB} == 3 ; Cortex-M0/M0+/M1 (v6-M, v6S-M)?
         CPSID   i                 ; disable interrupts (set PRIMASK)
       ELSE                        ; M3/M4/M7
         MOVS    r0,#QF_BASEPRI
         MSR     BASEPRI,r0        ; selectively disable interrupts
       ENDIF                       ; M3/M4/M7
         ISB                       ; reset the instruction pipeline
    
    +    ; The PENDSV bit will be cleared by hardware on entry, but could
    +    ; have been set again before disabling interrupts above. Make sure
    +    ; it is clear inside the critical section so the scheduler doesn't
    +    ; get invoked again at the same priority once it enables interrupts
    +    ; to execute the run to completion step.
    +    LDR     r0,=0xE000ED04    ; Interrupt Control and State Register
    +    MOVS    r1,#1
    +    LSLS    r1,r1,#27         ; r1 := (1 << 27) (PENDSVCLR bit)
    +    STR     r1,[r0]           ; ICSR[27] := 1 (clear PendSV)
    +
         LDR     r0,=QK_nextPrio_
         LDR     r0,[r0]
         CMP     r0,#0
         BNE.N   PendSV_sched      ; if QK_nextPrio_ != 0, branch to scheduler
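
The window described in this post can be reproduced in a small host-side model. This is a simplified sketch, not the QK port itself: `Model`, `isr_post`, `pendsv_entry`, and `run` are hypothetical names, and the model assumes that the ISR firing in the window requests the same priority that is already recorded in the variable standing in for QK_nextPrio_.

```c
#include <stdbool.h>

typedef struct {
    bool pendsv_pending;  /* models the PendSV pending state in ICSR */
    int  next_prio;       /* models QK_nextPrio_ */
    int  sched_calls;     /* how many times the scheduler was entered */
} Model;

/* An ISR that needs scheduling: record the priority and pend PendSV. */
void isr_post(Model *m, int prio) {
    m->next_prio = prio;
    m->pendsv_pending = true;
}

/* One entry into the PendSV handler.
 * isr_in_window: an ISR fires after entry but before interrupts are disabled.
 * clear_in_crit: apply the fix (clear the pending bit inside the
 *                critical section). */
void pendsv_entry(Model *m, bool isr_in_window, bool clear_in_crit) {
    m->pendsv_pending = false;          /* hardware clears pending on entry */
    if (isr_in_window) {
        isr_post(m, m->next_prio);      /* same priority gets pended again */
    }
    /* interrupts are disabled from here on */
    if (clear_in_crit) {
        m->pendsv_pending = false;      /* models the PENDSVCLR write */
    }
    if (m->next_prio != 0) {
        m->sched_calls++;               /* branch to the scheduler */
    }
}

/* Returns the number of scheduler invocations for one preemption request. */
int run(bool clear_in_crit) {
    Model m = { false, 0, 0 };
    bool first = true;
    isr_post(&m, 3);
    while (m.pendsv_pending) {
        pendsv_entry(&m, first, clear_in_crit);
        first = false;
    }
    return m.sched_calls;
}
```

In this model `run(false)` enters the scheduler twice for a single request, which corresponds to the reported symptom of the scheduler running twice in a row at the same priority, while `run(true)` enters it once.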
    
     
  • Quantum Leaps

    Quantum Leaps - 2016-04-20

    Yes, I also came to the same conclusion. There is a time window at the beginning of PendSV, before interrupts get disabled. This bug was introduced when we changed the PendSV implementation to free up the SVCall exception used previously.

    Thank you for reporting the bug and for performing this thorough analysis. Apparently, you know Cortex-M very well.

    The bug fix is on the way. Stay tuned...

    --MMS

     
  • Anonymous

    Anonymous - 2016-04-20

    There may be another bug along the same lines. I'm not 100% sure of this one because I haven't tried to step through the scenario completely. It would explain the symptoms of our remaining problem (but I can't yet rule out that the cause is on our side either).

    You are using the PendSV when nextPrio==0 as a special case to return to the previously preempted task, whose exception frame sits above the "fake" frame used to "return" to the scheduler, right?

    If:
    1. PendSV bit is set for this purpose with nextPrio=0 at the end of the sched_ret.
    2. The PendSV_Handler is entered
    3. An interrupt occurs which updates the nextPrio to something >0 (before interrupts are disabled)

    Then I think the request to return to the preempted task is lost and no priorities lower than the current one will ever resume.

    We're running into cases now where low priority tasks are getting queue overflow asserts or pool depletion, and qspy indicates that some low priority tasks never get resumed, despite nothing else happening on the processor. I think the debugger usually shows the processor stuck in the "branch to self" at the end of the sched_ret with a currPrio >0.

    Again, I'm not 100% on this one. I just noticed the symptoms seem to match.

    -Chris
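
The three-step scenario above can be traced with a small host-side model. This is only a sketch under stated assumptions: `Cpu`, `pendsv`, and `low_prio_task_resumes` are hypothetical names, `frames` counts scheduler frames stacked above the preempted task, and the `retry_pend` flag models replacing the "branch to self" with a loop that pends PendSV again.

```c
#include <stdbool.h>

typedef struct {
    int  frames;     /* scheduler frames stacked above the preempted task */
    int  next_prio;  /* models QK_nextPrio_ */
    bool pending;    /* models the PendSV pending state */
} Cpu;

/* PendSV handler body: either launch the scheduler or pop one frame. */
void pendsv(Cpu *c) {
    if (!c->pending) {
        return;                 /* nothing pended, handler does not run */
    }
    c->pending = false;
    if (c->next_prio != 0) {
        c->frames++;            /* branch to scheduler: a new task runs */
    } else {
        c->frames--;            /* return to the preempted context */
    }
}

/* Returns true if the originally preempted low-priority task resumes. */
bool low_prio_task_resumes(bool retry_pend) {
    Cpu c = { 1, 0, false };    /* task A is done; its frame is still stacked */

    c.pending = true;           /* step 1: A pends PendSV with next_prio == 0 */
    c.next_prio = 5;            /* step 3: an ISR in the window requests prio 5 */
    pendsv(&c);                 /* handler launches task B; A's request is lost */

    c.next_prio = 0;            /* task B is done ... */
    c.pending = true;           /* ... and pends PendSV to return */
    pendsv(&c);                 /* pops B's frame: back at A's wait point */

    /* A's wait point: "B ." never pends again; the loop variant does. */
    while (retry_pend && c.frames > 0) {
        c.pending = true;       /* re-pend with next_prio still 0 */
        pendsv(&c);
    }
    return c.frames == 0;
}
```

In this model the low-priority task never resumes when the wait point is a plain branch to self, and it resumes when the wait point pends PendSV again.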

     
  • Anonymous

    Anonymous - 2016-04-20

    I think the problem described above is happening. I fixed it by turning the end of QK_sched_ret into a loop. I think this should work because when an attempt to trigger PendSV with nextPrio==0 gets preempted, the preemption will eventually return back to the loop with nextPrio==0 (from the preempting task's QK_sched_ret).

    For reference, here's my diff again.

    Index: ports/arm-cm/qk/arm/qk_port.s
    ===================================================================
    --- ports/arm-cm/qk/arm/qk_port.s   (revision 1241)
    +++ ports/arm-cm/qk/arm/qk_port.s   (working copy)
    @@ -105,6 +105,16 @@
       ENDIF                       ; M3/M4/M7
         ISB                       ; reset the instruction pipeline
    
    +    ; The PENDSV bit will be cleared by hardware on entry, but could
    +    ; have been set again before disabling interrupts above. Make sure
    +    ; it is clear inside the critical section so the scheduler doesn't
    +    ; get invoked again at the same priority once it enables interrupts
    +    ; to execute the run to completion step.
    +    LDR     r0,=0xE000ED04    ; Interrupt Control and State Register
    +    MOVS    r1,#1
    +    LSLS    r1,r1,#27         ; r1 := (1 << 27) (PENDSVCLR bit)
    +    STR     r1,[r0]           ; ICSR[27] := 1 (clear PendSV)
    +
         LDR     r0,=QK_nextPrio_
         LDR     r0,[r0]
         CMP     r0,#0
    @@ -166,12 +176,15 @@
         MSR     BASEPRI,r0        ; enable interrupts (clear BASEPRI)
       ENDIF                       ; M3/M4/M7
    
    -    ; trigger PendSV to return to preempted task...
    +PendSV_sched_ret_pend
    +    ; trigger PendSV to return to preempted task. If this gets lost
    +    ; by a preemption that sets QK_nextPrio >0, then it will return
    +    ; back to here with QK_nextPrio==0 from its PendSV_sched_ret.
         LDR     r0,=0xE000ED04    ; Interrupt Control and State Register
         MOVS    r1,#1
         LSLS    r1,r1,#28         ; r0 := (1 << 28) (PENDSVSET bit)
         STR     r1,[r0]           ; ICSR[28] := 1 (pend PendSV)
    -    B       .                 ; wait for preemption by PendSV
    +    B       PendSV_sched_ret_pend
    
         ALIGN                     ; make sure the END is properly aligned
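
For readers more comfortable with C than with assembly, the two ICSR writes used in the diff can be sketched as follows. The macro and function names are hypothetical (a CMSIS-based project would normally use the definitions from its device header instead), and the register is passed in as a pointer so the sketch can also be exercised on a host machine.

```c
#include <stdint.h>

/* Interrupt Control and State Register, as used in the diff above. */
#define EXAMPLE_ICSR_ADDR       0xE000ED04UL
#define EXAMPLE_ICSR_PENDSVCLR  (1UL << 27)   /* write 1 to clear pending PendSV */
#define EXAMPLE_ICSR_PENDSVSET  (1UL << 28)   /* write 1 to pend PendSV */

/* Both bits are write-1-to-act, so a plain store of the single bit is
 * enough; no read-modify-write of the register is needed. */
void icsr_write(volatile uint32_t *icsr, uint32_t bits) {
    *icsr = (uint32_t)bits;
}
```

On the target, the call would look like `icsr_write((volatile uint32_t *)EXAMPLE_ICSR_ADDR, EXAMPLE_ICSR_PENDSVCLR);`, which corresponds to the `STR r1,[r0]` instruction with r1 holding `1 << 27`.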
    
     

    Last edit: Anonymous 2016-04-20
    • Anonymous

      Anonymous - 2019-12-21

      I have run into the same behaviour in version:
      ; Last Updated for Version: 5.6.0
      ; Date of the Last Update: 2015-12-11

      It seems I have found the explanation in the late-arrival and tail-chaining mechanisms of the Cortex-M.

      See the Technical Reference Manual for late arrival and tail chaining.

      0) An instruction raises the PendSV interrupt:
      STR r1,[r0] ; ICSR[28] := 1 (pend PendSV)
      1) Assume some exception "ISR X" occurs before the first instruction of the PendSV handler. Late arrival happens, so ISR X executes first and PendSV stays pending.
      2) Assume ISR X changes QK_nextPrio_ from 0 and raises PendSV, which is already in the pending state.

      So PendSV will be executed only once and will clear the frame from the stack only once. It then returns to a loop instead of to the preempted thread. Your fix will not help for the M4F variant, because in that case "R0, LR" is pushed to the stack at the entry of PendSV, and the missed "tail" of PendSV (which properly pops "R0, LR") causes stack damage.

       
      • Quantum Leaps

        Quantum Leaps - 2019-12-21

        You report that you are using the version 5.6.0, while the comment to this bug filed on 2016-05-01 states that the problem has been fixed in version 5.6.4. Any chance that you might upgrade to that version, or better yet, to the latest QP/C/C++ version?

        --MMS

         

        Last edit: Quantum Leaps 2019-12-21
  • Quantum Leaps

    Quantum Leaps - 2016-04-22

    Hi Kirk and Chris,
    I wonder if you would be interested in beta-testing the new QK-port that fixes the bug(s) you've found.

    If so, please contact Quantum Leaps directly at: info@state-machine.com

    --MMS

     
  • Quantum Leaps

    Quantum Leaps - 2016-05-01

    This bug has been fixed in QP 5.6.4 (in all QP types: QP/C, QP/C++, and QP-nano).

    The new Application Note "QP and ARM Cortex-M" explains the updated QK implementation on ARM Cortex-M. The Application Note also explains the QV and QXK kernels.

    --MMS

     
