Re: [RTnet-developers] "Suspending kernel thread 'rtcfg-rx' after exception" on ns9xxx


(take care of CCs)

James Kilts wrote:
>> rtcfg_main_state_off should not be called, even if the frame is totally
>> corrupted (yeah, we are lacking basic range checks here):
>> RTCFG_FRM_STAGE_1_CFG is added to the header ID, so you
>> normally end up in the frame event handlers.
> 
> While it's true that RTCFG_FRM_STAGE_1_CFG is added to the header ID for the
> event_id, this has no effect on device[1].state in rtcfg_do_main_event().
> The value of device[1].state defaults to 0, which means that the function is
> invoked as "(*state[0])(...)" which in the state function array is
> "rtcfg_main_state_off()".  Verified by rtdm_printk(), rtcfg_main_state_off()
> is getting called.  In the end, rtcfg_main_state_off() puts it into the
> correct state (RTCFG_MAIN_CLIENT_0), but not before potentially corrupting
> memory.

Now I got your path.
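
To make that path concrete, here is a minimal user-space sketch of a state-indexed dispatch. All names in it (`struct device`, `state_off`, `state_client_0`, `do_main_event`, the `MAIN_*` constants) are hypothetical stand-ins, not the actual RTnet definitions, and the unconditional transition in `state_off` is a simplification. It only illustrates that the handler is selected by the device state alone, so a zero-initialized state always lands in the "off" handler regardless of the event ID, and it shows one way a basic range check could guard the table lookup.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; these are NOT the RTnet definitions. */
enum main_state { MAIN_OFF = 0, MAIN_CLIENT_0 = 1, MAIN_STATE_COUNT };

struct device {
    enum main_state state;   /* zero-initialized => MAIN_OFF */
    int off_calls;           /* counters, only here to observe the dispatch */
    int client_calls;
};

typedef int (*state_fn)(struct device *dev, int event_id, void *event_data);

/* "Off" handler: in this sketch it always moves on to MAIN_CLIENT_0. */
static int state_off(struct device *dev, int event_id, void *event_data)
{
    (void)event_id;
    (void)event_data;        /* never dereferenced in the off state */
    dev->off_calls++;
    dev->state = MAIN_CLIENT_0;
    return 0;
}

static int state_client_0(struct device *dev, int event_id, void *event_data)
{
    (void)event_id;
    (void)event_data;
    dev->client_calls++;
    return 0;
}

/* One handler per state; the index is the current state, not the event ID. */
static const state_fn state_table[MAIN_STATE_COUNT] = {
    state_off,
    state_client_0,
};

/* Dispatch by state, with a range check before indexing the table. */
static int do_main_event(struct device *dev, int event_id, void *event_data)
{
    assert(dev != NULL);
    if ((unsigned int)dev->state >= MAIN_STATE_COUNT)
        return -1;           /* corrupted state: refuse to dispatch */
    return state_table[dev->state](dev, event_id, event_data);
}
```

With a zero-initialized `struct device`, the first call to `do_main_event()` reaches `state_off()` whatever `event_id` is passed; only later calls reach `state_client_0()`.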

> 
> Anyway, even when rtcfg_main_state_off() only calls rtcfg_next_main_state()
> and rtdm_mutex_unlock() in the RTCFG_CMD_CLIENT case, this does not seem to
> be the source of my problems.
> 

Right. An unconfigured RTcfg node receiving RTcfg frames will spit out
warnings if debugging is on, or will simply ignore those frames. But it
won't dereference the event_data under an incorrect type.

> 
>> What is the header ID when things go wrong?
> 
> frm_head->id == 0.  It seems that the exception happens the first time
> calling rtcfg_do_main_event() in rtcfg_rx_task().
> 

Where precisely it happens is important.

> 
>> start with the frame as the stack finds it in memory after
>> the driver handed it over. If that one is correct, we have a bug in the
>> higher layers. But I strongly suspect the frame is corrupted.
> 
> I think the frame is correct... here are some of the values from the rtskb
> in rtcfg_main_state_client_0()  (where event_id == RTCFG_FRM_STAGE_1_CFG):
> 
> rtskb->nh.iph->saddr == 0x44542442
> rtskb->nh.iph->daddr == 0x4643414d
> rtskb->protocol == 8848
> rtskb->pkt_type == 1
> rtskb->priority == 0
> rtskb->len == 83
> rtskb->data[0] == 0x00
> rtskb->data[1] == 0x01
> rtskb->data[2] == 0x0a
> rtskb->data[3] == 0x65
> rtskb->data[4] == 0x01
> rtskb->data[5] == 0x7e
> rtskb->data[6] == 0x0a
> rtskb->data[7] == 0x65
> 
> 
> My debugging environment is limited at the moment.  When the exception
> occurs, the stack is lost due to the exception handler, so the hardware
> debugger is of little or no value.

You know the call path quite well already, and this path is taken for
the first time here, so setting proper breakpoints should be no big deal,
provided you actually have a suitable hardware debugger. Do you?

>  And when I use too many rtdm_printk()
> calls, the entire system hangs before the exception.  Because of the last
> issue, I can't use the built-in debugging features of RTnet.  Any
> suggestions for debugging this?

Xenomai reports the failing instruction address in the suspension
message. Collect the module's load address (/proc/modules), compile the
module with debug symbols (CONFIG_DEBUG_INFO), and process the binary with
objdump or gdb. That should give you the instruction and its environment
(i.e. the source code line).

Jan