I am developing a kernel driver for the LTC244x series of ADCs. It is an SPI driver. While developing it, I think I have found a bug in the OMAP SPI driver.
My driver is quite simple. In the probe function I start an hrtimer. Every 6ms (more or less) "timer.function" is called (I restart the hrtimer using hrtimer_forward_now() in my timer.function). This function sends/receives 4 bytes via an spi_async call. My "complete" function reads rx_buffer and processes it. I have an irq function, but as of now I have disabled it. I simply want a constant time between SPI reads (sort of; I know this is not real time etc.).
When I insert the module, everything works fine. Communication proceeds, with 6ms (varying) between messages. Each message lasts 51us. I have an SPI sniffer connected to my Chestnut 40-pin header and I can see the SPI messages.
I run some "dd if=/dev/urandom of=/dev/null" via ssh (2 different ssh sessions) to stress-test the hardware during SPI communication. The problem manifests itself after some minutes: SPI starts sending messages without any delay between them. This generates too much traffic on SPI and eventually hangs my Overo. I double checked my code and added a simple GPIO toggle (GPIO 147, for the record) to my "timer.function" (which is the one called every 6ms). The oscilloscope shows a steady 25Hz on that pin, so my function is called correctly at that frequency. On the SPI sniffer, however, messages arrive with no delay: a repeated message is sent every 52us.
I am pretty sure that they are repeated messages because the first byte my driver sends in every message normally cycles through 0xA0, 0xA1, 0xA2, 0xA3. In the error condition, the same message (for example 0xA0) is repeated many times; in fact it is repeated for the whole interval between two timer.function calls (which is still called at 25Hz). After that, it switches to 0xA1 and goes on repeating that one.
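A minimal userspace sketch in C of how the repetition could be confirmed from a sniffer capture. It assumes the first byte of each captured message has already been exported into an array; the function name longest_repeat_run and that capture format are hypothetical and are not part of the driver or of any sniffer tool.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return the length of the longest run of consecutive messages that
 * share the same first byte. first_bytes[i] is the first byte of the
 * i-th captured SPI message (hypothetical capture format).
 */
size_t longest_repeat_run(const uint8_t *first_bytes, size_t n)
{
    size_t best = 0;
    size_t run = 0;

    for (size_t i = 0; i < n; i++) {
        if (i > 0 && first_bytes[i] == first_bytes[i - 1])
            run++;      /* same first byte as the previous message */
        else
            run = 1;    /* a new value starts a new run */

        if (run > best)
            best = run;
    }
    return best;
}
```

With the normal 0xA0, 0xA1, 0xA2, 0xA3 cycle the function returns 1, since no two consecutive messages start with the same byte. A capture taken in the error condition would return a much larger value, roughly the number of messages sent between two timer.function calls.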
I'm no kernel expert, but it seems that the problem is not related to hrtimers.
I am running kernel 2.6.33 from the Sakoman git tree, built with OE. I tried omap2_mcspi.c from head (2.6.37-rc1?) but the problem is still there. Maybe it is related to
https://github.com/scottellis/overo-adc-mcp3002 cleanup null ref.
Has anyone already seen this problem, or does anyone know a solution?
After about a minute of SPI running flat out, this message appears on the console:
BUG: soft lockup - CPU#0 stuck for 61s! [omap2_mcspi:11]
Modules linked in: ltc244x
Pid: 11, comm: omap2_mcspi
CPU: 0 Not tainted (2.6.33 #2)
PC is at mcspi_wait_for_reg_bit+0x40/0x58
LR is at msecs_to_jiffies+0x20/0x28
pc : [<c0265d78>] lr : [<c0064368>] psr: 20000013
sp : cf93bf00 ip : 00000000 fp : fa09804c
r10: fa098044 r9 : fa098050 r8 : ced44c43
r7 : fa098044 r6 : 00000001 r5 : c056eb20 r4 : 0000a2c0
r3 : 00000002 r2 : 00000064 r1 : 00000028 r0 : 0000a324
Flags: nzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment kernel
Control: 10c5387d Table: 8ed38019 DAC: 00000017
Missed ticks 8