Thread: [Linux-hls-devel] Processes freezing up
Status: Pre-Alpha
Brought to you by:
lucabe
|
From: Tony L. <tl...@au...> - 2003-09-03 09:31:01
|
Hi Luca/John,
=20
I'm a work colleague of Paul Koufalas and have been looking with him at
running our application with the HLS scheduler.
=20
I've encountered an issue where processes seem to freeze up after
running for any amount of time. Sometimes it's my application and I've
also noticed it happen with top.
=20
I've included the dmesg output below.
=20
HLS MP initializing (HLS_DBG_PRINT_LEVEL =3D 1).
[-1961706252], 833 : sched 'ROOT' registered in slot 0
[-1885129302], 833 : sched 'JOIN' registered in slot 1
[-1808565821], 833 : sched 'TH' registered in slot 2
[-1734729187], 833 : sched 'RR' registered in slot 3
[-1660892643], 833 : sched 'PS' registered in slot 4
[-1587056079], 833 : sched 'RES' registered in slot 5
already PRIVATE_DATA !=3D NULL???
already PRIVATE_DATA !=3D NULL???
hls_ctl: Moving to res1
...and setting the parameters!
bug: kernel timer added twice at e24c1689.
HLSUnblockThreadHook --- WAI =3D 2 [512]
HLSUnblockThreadHook --- WAI =3D 2 [512]
HLS ERROR: Task 687 has rt_priority =3D 100 and state =3D 1
HLSUnblockThreadHook --- WAI =3D 5 [512]
HLSUnblockThreadHook --- WAI =3D 5 [512]
HLS ERROR: Task 687 has rt_priority =3D 100 and state =3D 1
HLS ERROR: Task 687 has rt_priority =3D 100 and state =3D 1
HLSUnblockThreadHook --- WAI =3D 5 [512]
HLSUnblockThreadHook --- WAI =3D 5 [512]
HLS ERROR: Task 687 has rt_priority =3D 100 and state =3D 1
HLS ERROR!!! Task state changed during the hook??? 0 !=3D 1!!!
Correcting...
HLS ERROR: Task 512 has rt_priority =3D 100 and state =3D 1
HLS ERROR: Task 512 has rt_priority =3D 100 and state =3D 0
HLS ERROR: Task 687 has rt_priority =3D 100 and state =3D 1
HLSUnblockThreadHook --- WAI =3D 5 [512]
HLSUnblockThreadHook --- WAI =3D 5 [512]
HLS ERROR: Task 687 has rt_priority =3D 100 and state =3D 1
HLS ERROR: Task 507 has rt_priority =3D 100 and state =3D 1
HLSUnblockThreadHook --- WAI =3D 5 [512]
HLSUnblockThreadHook --- WAI =3D 5 [512]
HLS ERROR: Task 507 has rt_priority =3D 100 and state =3D 1
=20
Here are the processes that are in error:
=20
root 687 685 0 18:37 ? 00:00:00 in.telnetd: dm3
root 507 1 0 18:37 ? 00:00:00 syslogd -m 0
root 512 1 0 18:37 ? 00:00:00 klogd -x
=20
On a previous run, I was seeing error messages with the init process, ie
..
=20
Sep 3 17:58:16 wally kernel: HLS ERROR: Task 1 has rt_priority =3D 100
and state =3D 1
Sep 3 17:58:16 wally kernel: HLS ERROR: Task 1 has rt_priority =3D 100
and state =3D 1
Sep 3 17:58:16 wally kernel: HLSUnblockThreadHook --- WAI =3D 5 =
[511]
Sep 3 17:58:16 wally kernel: HLSUnblockThreadHook --- WAI =3D 5 =
[511]
Sep 3 17:58:16 wally kernel: HLS ERROR: Task 1 has rt_priority =3D 100
and state =3D 1
Sep 3 17:58:16 wally kernel: HLS ERROR!!! Task state changed during the
hook??? 0 !=3D 1!!! Correcting...
=20
And also, when my process freezes, I'm unable to kill it, even with kill
-9 <pid>!
=20
I don't know if it's related, but I've also seen the following messages
in the syslog:
=20
Sep 3 18:50:17 wally kernel: hls_ctl: Moving to res1
Sep 3 18:50:17 wally kernel: ...and setting the parameters!
Sep 3 18:50:22 wally kernel: bug: kernel timer added twice at e24c1689.
=20
Any help would be appreciated, and let me know if I can enable any more
debugging to give more information.
=20
Thanks and regards,
Tony
|
|
From: Luca A. <luc...@em...> - 2003-09-03 10:08:58
|
Hi Tony,
first of all, welcome!!!
> I've included the dmesg output below.
>
> HLS MP initializing (HLS_DBG_PRINT_LEVEL = 1).
> [-1961706252], 833 : sched 'ROOT' registered in slot 0
> [-1885129302], 833 : sched 'JOIN' registered in slot 1
> [-1808565821], 833 : sched 'TH' registered in slot 2
> [-1734729187], 833 : sched 'RR' registered in slot 3
> [-1660892643], 833 : sched 'PS' registered in slot 4
> [-1587056079], 833 : sched 'RES' registered in slot 5
> already PRIVATE_DATA != NULL???
> already PRIVATE_DATA != NULL???
> hls_ctl: Moving to res1
> ...and setting the parameters!
Everything is ok, until here...
> bug: kernel timer added twice at e24c1689.
This is very bad, and could be the root of the problem :(
How much time does it pass between the program start and this message?
Does the application freeze immediately after this message?
> HLSUnblockThreadHook --- WAI = 2 [512]
> HLSUnblockThreadHook --- WAI = 2 [512]
> HLS ERROR: Task 687 has rt_priority = 100 and state = 1
> HLSUnblockThreadHook --- WAI = 5 [512]
> HLSUnblockThreadHook --- WAI = 5 [512]
> HLS ERROR: Task 687 has rt_priority = 100 and state = 1
> HLS ERROR: Task 687 has rt_priority = 100 and state = 1
[...]
Ok, HLS is having serious problems here... :( It seems that a hook is
interrupting another hook, and that some tasks are not currently
descheduled. I must have a better look...
Can you write down a simple example that triggers this bug? That would
help a lot.
> And also, when my process freezes, I'm unable to kill it, even with kill
> -9 <pid>!
Uhmm... This could be due to the fact that HLS sets its priority to a
very low value. Removing the HLS module should permit to kill the task
(if it does not crash the system...).
> Any help would be appreciated, and let me know if I can enable any more
> debugging to give more information.
The only idea that I have right now is that the problem can be related
to a bad interaction with the preemption patch... Are you using it? If
yes, can you try to remove the kernel preemption patch (and eventually
the low-latency patch) and try to reproduce the problem?
Thanks,
Luca
--
_____________________________________________________________________________
Copy this in your signature, if you think it is important:
N O W A R ! ! !
--
Email.it, the professional e-mail, gratis per te: http://www.email.it/f
Sponsor:
Al Garden Center Peraga sta per prendere forma un evento straordinario: il labirinto delle pannocchie. Unoccasione unica per divertirsi in unoasi di pace...
Clicca qui: http://adv.email.it/cgi-bin/foclick.cgi?mid=1616&d=3-9
|
|
From: Luca A. <luc...@em...> - 2003-09-05 05:38:06
Attachments:
debug.patch
|
Hi guys,
I had a look at the problem, and here is my interpretation:
[...]
> hls_ctl: Moving to res1
> ...and setting the parameters!
> bug: kernel timer added twice at e24c1689.
> HLSUnblockThreadHook --- WAI = 2 [512]
> HLSUnblockThreadHook --- WAI = 2 [512]
Looks like someone is unblocking, and its scheduler posts the scheduler
timer when it is already active... :(
This causes the kernel to scream, performing a printk() that wakes up
the log daemon klogd ---> klogd unblocks during a hook.
From this point, the HLS internal state is screwed, and crazy things
happen. Hence, first of all we need to understand why a scheduler is
posting its timer twice. Unfortunately, I am not able to reproduce the
problem here, hence I'll need your help.
Can you please apply the attached patch and reproduce the problem,
sending me the dmesg output? The patch makes the module should be a
little more verbose, giving information about the scheduler that is
causing the problem.
Also, there are some #ifdef __DO_CLI__ in linux/hls_timers.c that are
currently commented out... Can you try to remove the comments, and see
if this fixes the problem?
Finally, are you usning the preemptive patch? If yes, can you confirm
that the problem still happens without the patch installed? If it still
happens, it likely is a locking problem...
Thanks,
Luca
--
_____________________________________________________________________________
Copy this in your signature, if you think it is important:
N O W A R ! ! !
--
Email.it, the professional e-mail, gratis per te: http://www.email.it/f
Sponsor:
Il nostro catalogo completo a casa tua, gratis! Vieni da Peraga, tanti prodotti introvabili per te.
Clicca qui: http://adv.email.it/cgi-bin/foclick.cgi?mid=450&d=5-9 |