I'm using the lwIP module from the 2.0.x stable branch on a SAM7X256. While running a small stress test, I found what looks like a stack problem (?) when CH_OPTIMIZE_SPEED is enabled and the code is compiled at -O2. After working properly for some time, my app ends up stuck in chSchReadyI(): both Thread *cp and tp are 0x0 in the debugger, while in the calling function the pointer to tp is at a valid address. Interestingly, I don't see a fault; the do/while loop in chSchReadyI() just keeps looping over what seems to be invalid memory. If I disable CH_OPTIMIZE_SPEED or compile with -O0, I have not observed this problem.
I haven't verified which state the processor is in at this point, but I tried doubling the IRQ stack with no success. Any suggestions?
Hm - small update, I've seen this problem now with CH_OPTIMIZE_SPEED disabled as well.
OK, another observation: I think this only happens when using the CodeSourcery toolchain. By the way, binary builds of the CodeSourcery chain for Linux and OS X are available at http://github.com/jsnyder/arm-eabi-toolchain/downloads, which is what I was using.
I reverted to the plain GCC toolchain and it seems to be running fine.
What you described can only happen if the ready list integrity is compromised. That could happen because of a stack overflow corrupting a Thread structure or the ready list header itself.
The only other cause could be a compiler problem. CodeSourcery is definitely *not* a normal GCC compiler because it has a lot of internal optimisations (try running a benchmark using CS and normal GCC), so it is plausible that it also has its own problems.
I'll try to run some tests here.
Q: do -Os, -O0 and -O1 work reliably?
Giovanni
I just came back after letting my app run overnight and saw that it ended up in the same state, even after a build with plain GCC at -O2. I know that the CodeSourcery binaries resulted in the same issue at -Os but I will try -O1 and report back.
Also, when verifying that my threads have not overflowed their stacks, I see several bytes (11?) at the bottom of the thread working area that are not filled with 0x55. The rest of the working area seems to be consumed from the top down, and I have plenty of room (0x55's) between the consumed stack space and these 11 bytes. Is this normal?
Same problem at -O1 CodeSourcery as well.
Do you use chThdResume() or chSchReadyI() in your code? It is possible to break the ready list by readying a thread that is already ready.
About stacks: it is possible to have gaps filled with 0x55 if the functions do not use all their local variables, for example large buffers declared as automatic variables. The stack may have overflowed even if you see fillers. Those non-filler values immediately after the thread structure are suspect.
If you have RAM available, I would recommend allocating more space to those threads and seeing what happens. Be especially careful with threads invoking code you don't know in detail.
Giovanni
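The filler check discussed above can be sketched in plain C. This is only an illustration that runs on a host against a copy of the memory: the function name, the `STACK_FILL` macro and the idea of passing the lowest stack address (the byte just above the Thread structure) are assumptions, not part of the ChibiOS API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL 0x55u  /* filler value mentioned in the thread */

/* Counts the contiguous filler bytes starting at the lowest address of a
 * stack region. The stack grows downward from stack_low + size, so these
 * are the bytes the stack has never reached. (Hypothetical helper.) */
size_t stack_unused_bytes(const uint8_t *stack_low, size_t size)
{
    assert(stack_low != NULL);
    size_t n = 0;
    while (n < size && stack_low[n] == STACK_FILL)
        n++;
    return n;
}
```

A result of zero, or of only a few bytes, suggests the stack reached the bottom of its region. As noted above, a large result is not proof of safety on its own, because a function may skip over part of its frame without writing to it.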
No, I don't use chThdResume() or chSchReadyI() in my own code. I do notice, however, that this only seems to happen when I'm running the networking system (lwip + mac driver) so I'm wondering if it could be some problem in there.
I've tried increasing stacks previously, but I will double check that and continue to investigate. Thanks for your suggestions.
All my stacks are about 4 times larger now, and I'm still seeing this problem. I've pasted my .map file at http://pastebin.com/Up7HBBZs
When I see this problem, I see that Thread *otp is located at address 0x209804 in chSchDoRescheduleI() once I've stopped it in the debugger, which is located beyond __heap_base__. Looking at that spot in memory, I see the same 11 bytes starting at 0x209804 followed by some (200 or so) 0x55's and later some other filled in values. Is this indeed a thread structure, and if so, why is it located here?
Also, I see those 11 non-filler bytes at the bottom of each Thread structure immediately after it is created - it does not seem that this space is being overwritten. Are some of the Thread member values stored at the bottom of the working area?
It is the Thread structure itself then; it is located at the bottom of the thread working area. Note that the Thread structure size is not fixed, it can change depending on the settings in chconf.h.
Did you also verify the size of the IRQ and main thread stacks? If you use the FIQ interrupt, also make sure to increase the FIQ stack size.
Giovanni
OK, that makes sense that the Thread structure is at the bottom of the working area. Do you have any thoughts on why there might be what looks like a Thread + Working Area structure in memory, beyond the __heap_base__ address, as I mentioned previously?
I have increased my IRQ and main thread stacks to 4x what they were (0x400 and 0x800 respectively) and I'm still seeing this issue. I'm not currently using the FIQ interrupt.
The only way to have a thread working area allocated beyond the __heap_base__ address is to use chThdCreateFromHeap() or chThdCreateFromMemoryPool(). Do you use dynamic threads in your system? If so, it may be an issue of correct thread management (references, memory release and so on; there is a protocol to follow).
If you use lwIP, then the internal thread created using the sys_thread_new() wrapper function is also allocated beyond __heap_base__. Did you also increase that thread's stack size? It is set in the lwIP configuration file (TCPIP_THREAD_STACKSIZE).
Giovanni
I only use the chThdCreateStatic() API, but I am using lwIP so that explains the thread beyond the __heap_base__ address. I did confirm that I also gave 4x the default stack for the lwip thread - 2048 instead of 512. Thanks again for your patience.
OK, better verify whether the problem is still there using -O0. If it is not a stack problem, this thing becomes "interesting".
Do you have all the debug options activated?
Giovanni
Well, I was slightly disappointed to discover that my application ran successfully for the last several hours when built at -O0. I keep thinking that I must have a silly error somewhere, but have not found it yet.
I have almost all the debug options activated - my chconf.h looks like http://pastebin.com/evnpSUid
Well, if it does not happen at -O0 then it is bad… it will be hard to find an error that happens so randomly and so rarely.
Using the map file, are you able to reconstruct the ready list (a doubly linked list) when it happens? It could give a hint of what triggered it. You should also verify whether the priority field of the ready list is still set to zero or has been corrupted.
Giovanni
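The ready list check suggested above could look roughly like the following sketch. It assumes a circular doubly linked list whose header carries a priority of zero, as described in the post. The `node_t` type and `ready_list_check()` are hypothetical, simplified stand-ins; the real Thread and ready list layouts depend on the kernel version and on chconf.h.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the queue links of a thread. */
typedef struct node {
    struct node *next;
    struct node *prev;
    unsigned     prio;
} node_t;

/* Walks the list from its header. Returns the number of queued elements,
 * or -1 if the header priority is not zero, a link is NULL, a back pointer
 * does not match, or more than max_nodes elements are visited (a likely
 * cycle that never returns to the header). */
int ready_list_check(const node_t *head, int max_nodes)
{
    assert(head != NULL);
    if (head->prio != 0)
        return -1;
    const node_t *p = head;
    int count = 0;
    do {
        const node_t *n = p->next;
        if (n == NULL || n->prev != p)
            return -1;
        p = n;
        if (p != head && ++count > max_nodes)
            return -1;
    } while (p != head);
    return count;
}
```

The back pointer test is what catches most corruption: an element that was linked twice, or overwritten, usually leaves a `prev` that no longer points at its predecessor.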
If I remember well, lwIP allocates and frees semaphores very often (too often). This suggests two possible causes:
1 - A bug in the heap allocator.
2 - lwIP using a freed semaphore.
Both things could lead to corruption of the OS linked lists.
We could rule out #1 by not using the heap allocator in sys_arch.c (replacing it with its own allocator or using the pools allocator instead).
#2 would be harder to verify.
A bug in the MAC driver should not create the problem you described; it just resets a static semaphore and broadcasts an event.
Giovanni
OK, good idea - and I think you're right, lwIP definitely does a lot of (de)allocation of semaphores. I'll try re-implementing sys_arch.c to use the pools allocator and see if that helps.
I haven't started my re-implementation yet, but I thought I would share a few more details. When my app arrives at chSchReadyI() in the corrupted state, it's looping between only two threads (while I have 4 in my system): one created on the heap (via sys_thread_new() in the lwIP sys_arch.c) at 0x209804, and an invalid pointer only 40 bytes below it, at 0x002097dc. I don't have any other objects at this address in the map file, so it looks like it has clearly been overwritten, although this address is also beyond __heap_base__, which is at 0x20974c. Not sure if this makes an allocation problem a more likely suspect?
It is still in the heap. You have two kinds of objects there: the lwIP thread and the lwIP semaphores.
Before you go for the re-implementation, you could also try disabling mutexes in chconf.h and see if this makes a difference; disabling mutexes forces the heap allocator to use semaphores instead of mutexes for mutual exclusion.
I would proceed as follows:
1 - Disable mutexes; if the problem persists, this rules out a mutual exclusion problem.
2 - Use the pool allocator instead of the heap allocator; if the problem persists, this rules out a problem in the heap allocator.
If none of the above works, then we will have to find the problem among: MAC driver, lwIP, lwIP-ChibiOS layer, application.
Giovanni
BTW, you can tell whether it is a heap object by looking at the previous 8 bytes; those are the object header (pointer to the heap object and size).
Giovanni
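The header tip above can be applied to a memory dump taken from the debugger. The sketch below is an illustration only: it assumes a little-endian 32-bit target and that the 8 bytes are laid out as a 4-byte heap pointer followed by a 4-byte size, in the order the post lists them. The type and function names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout of the 8-byte header that precedes a heap object. */
typedef struct {
    uint32_t heap;  /* address of the owning heap object */
    uint32_t size;  /* block size in bytes */
} heap_header_t;

/* Reads a little-endian 32-bit value from a byte buffer. */
static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Decodes the 8 bytes located just below target address `addr` from a RAM
 * dump that starts at target address `dump_base`. Returns 0 on success,
 * -1 if those bytes are not inside the dump. */
int decode_heap_header(const uint8_t *dump, uint32_t dump_base,
                       size_t dump_len, uint32_t addr, heap_header_t *out)
{
    assert(dump != NULL && out != NULL);
    if (addr < dump_base)
        return -1;
    uint32_t off = addr - dump_base;
    if (off < 8u || off > dump_len)
        return -1;
    out->heap = le32(dump + off - 8u);
    out->size = le32(dump + off - 4u);
    return 0;
}
```

If the decoded pointer matches the address of the heap object found in the map file and the size is plausible, the object at `addr` was most likely allocated from the heap.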
OK, good idea - I'll try disabling mutexes first. And thanks for the tip about heap objects, I didn't know that :)
This failure just occurred again with mutexes disabled.
This somehow is a positive thing :-)
Let's try the pools; it is a better solution for the lwIP layer anyway, since it uses less memory and is faster than heaps. You can make the pool have a fixed number of elements or make it feed from the core automatically.
Giovanni
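To illustrate the fixed-size pool idea, here is a minimal, generic sketch in plain C. It is not the ChibiOS pools API and not the sys_arch.c code: `sem_obj_t`, `pool_t`, `POOL_OBJECTS` and the three functions are hypothetical names, and there is no locking, so on a real target the alloc and free calls would have to run inside a critical section.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_OBJECTS 8  /* fixed number of elements, chosen arbitrarily */

typedef struct { int count; } sem_obj_t;  /* stand-in for a semaphore */

/* A slot holds either a free-list link or a live object, never both. */
typedef union pool_slot {
    union pool_slot *next;
    sem_obj_t        obj;
} pool_slot_t;

typedef struct {
    pool_slot_t  slots[POOL_OBJECTS];
    pool_slot_t *free_list;
} pool_t;

/* Threads every slot onto the free list. */
void pool_init(pool_t *p)
{
    assert(p != NULL);
    p->free_list = NULL;
    for (size_t i = 0; i < POOL_OBJECTS; i++) {
        p->slots[i].next = p->free_list;
        p->free_list = &p->slots[i];
    }
}

/* Pops one slot in constant time; returns NULL when the pool is empty. */
sem_obj_t *pool_alloc(pool_t *p)
{
    pool_slot_t *s = p->free_list;
    if (s == NULL)
        return NULL;
    p->free_list = s->next;
    return &s->obj;
}

/* Pushes a slot back; the object must come from this pool. */
void pool_free(pool_t *p, sem_obj_t *o)
{
    assert(o != NULL);
    pool_slot_t *s = (pool_slot_t *)o;
    s->next = p->free_list;
    p->free_list = s;
}
```

Because every element has the same size, there are no block headers to walk, split or merge, which is why a pool tends to be smaller and faster than a general heap for objects such as semaphores that are created and destroyed often.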