Thread: [SSI-devel] Regarding DRBD on 1.9
From: Gopalakrishna NM <go...@hp...> - 2005-05-27 10:29:09
|
Hi Roger,

DRBD on 1.9 (built from OPENSSI-DEBIAN, May 27 12:58) with the kernel
(built from OPENSSI-DEBIAN, May 27 12:58) results in a kernel oops in
the function sock_recvmsg. Until yesterday I could reproduce it
consistently (with the kernel and drbd built on May 25), but it suddenly
disappeared when I built new versions of the drbdadm and drbdsetup
commands and copied them to the test system. After a few setup reboots,
it resulted in an oops again! Today, with the new kernel and DRBD
module, it oopses consistently. Of course, I built new drbdadm and
drbdsetup commands and copied them to the test system.

The console messages are attached at the end of this mail.

Looking at the DRBD code and generating a few debugging messages, it is
clear that the following mdev (drbd_dev) fields are somehow getting
corrupted or interchanged:

mdev->conf.my_addr_len = 1
mdev->conf.other_addr_len = 2
mdev->conf.this_nodenum = 16
mdev->conf.other_nodenum = 16

The function "drbd_wait_for_connect" could not successfully complete
"bind" and returns error (-22) EINVAL. Because of this, the connect
fails in the function "drbd_try_connect" (this is expected). The reason
for the bind failure is that the address is somehow not proper, which
happens because "my_addr_len" and "other_addr_len" hold 1 and 2
respectively instead of 16 (the proper length). The node numbers
"this_nodenum" and "other_nodenum", however, show 16 instead of 1 and 2
respectively. Since the address length field is 1, bind might be getting
only the first and second character of the address, and that would
result in the bind failure.

Once it started working yesterday (after copying the new drbdadm and
drbdsetup commands), I later copied the old commands back, but it was
still working! So I am not even convinced it is a command problem. When
I built the new kernel and DRBD module today, I again faced the same
problem consistently, with both the new and the old commands.

Does it have anything to do with a race? Any idea about the relationship
between drbdsetup and the initialization of drbd? Or anything to do with
alignment? Any timing issues? My understanding is that at this stage of
boot-up drbdsetup would not have any problem.

Your inputs would be very helpful.

Thanks and regards,
Gopal.

======================CONSOLE messg=============
drbd: module cleanup done.
modprobe -k drbd minor_count=1
drbd: initialised. Version: 0.7.10 (api:77/proto:74)
drbd: SVN Revision: 1743 build by go...@ha..., 2005-05-27 12:58:01
drbd: registered as block device major 147
Starting DRBD resource:
drbd0: resync bitmap: bits=1151992 words=36000
drbd0: size = 4499 MB (4607968 KB)
drbd0: 4499 MB marked out-of-sync by on disk bit-map.
drbd0: Found 6 transactions (230 active extents) in activity log.
drbd0: Marked additional 0 KB as out-of-sync based on AL.
drbd0: drbdsetup [66084]: cstate Unconfigured --> StandAlone
drbd0: drbdsetup [66087]: cstate StandAlone --> Unconnected
drbd0: drbd0_receiver [66088]: cstate Unconnected --> WFConnection
drbd0: Unabl
# WARNING: Do not type 'yes' while waiting for DRBD connection
# unless you know what you are doing! You have been warned!
# The only exception is when setting up DRBD first time.
#
e to bind (-22)
drbd0: Registering drbd0 with CLMS subsystem
dr<b4>d0dr: bdd0r:b d0T_ryriencegi tveo r co[6nn60ec88t ]:ag
caisnta.<t1e >WUnFCaoblnene tcot iohan nd--le> kUencrnonelne cNtUeLLd
pointer dereference at virtual address 00000008
 printing eip:
c039ae5c
*pde = 00000000
Oops: 0000 [#1]
SMP
Modules linked in: drbd sd_mod sym53c8xx scsi_transport_spi cciss scsi_mod tg3 eepro100 e100 mii
CPU: 1
EIP: 0060:[<c039ae5c>] Not tainted VLI
EFLAGS: 00010246 (2.6.10-ssi-686-smp)
EIP is at sock_recvmsg+0xac/0xf0
eax: f7036f20 ebx: 00000000 ecx: f71157f0 edx: 00000008
esi: 00004100 edi: f7036e74 ebp: f7036f00 esp: f7036e10
ds: 007b es: 007b ss: 0068
Process drbd0_receiver (pid: 66088, threadinfo=f7036000 task=f71157f0)
Stack: 000000fc 00000000 f7036e7c ffffffff c0465060 20008790 36303838 00004100
       00000008 00000000 00000000 00000000 f7036f20 f712a040 f7036e60 c03db8da
       f712a040 f712a150 f712a040 f712a040 00000282 c03e108a f712a040 f712a040
Call Trace:
 [<c010687f>] show_stack+0x7f/0xa0
 [<c0106a34>] show_registers+0x164/0x220
 [<c0106dc4>] die+0xf4/0x1c0
 [<c011f325>] do_page_fault+0x375/0x695
 [<c01064d3>] error_code+0x2b/0x30
 [<f898e0e9>] drbd_recv+0x89/0x190 [drbd]
 [<f898e99a>] drbd_recv_header+0x2a/0xf0 [drbd]
 [<f8991f2c>] drbdd+0x2c/0x160 [drbd]
 [<f8992b18>] drbdd_init+0x78/0x410 [drbd]
 [<f8998e3e>] drbd_thread_setup+0x7e/0xf0 [drbd]
 [<c01022e5>] kernel_thread_helper+0x5/0x10
Code: 85 24 ff ff ff 89 45 e8 31 c0 89 85 3c ff ff ff 8b 45 0c 89 95 30
ff ff ff 89 9d 34 ff ff ff 89 85 40 ff ff ff 89 b5 2c ff ff ff <8b> 43
08 89 74 24 10 89 54 24 0c 8b 55 0c 89 5c 24 04 89 3c 24

Entering kdb (current=0xf71157f0, pid 66088) on processor 1 Oops: Oops due to oops @ 0xc039ae5c
eax = 0xf7036f20 ebx = 0x00000000 ecx = 0xf71157f0 edx = 0x00000008
esi = 0x00004100 edi = 0xf7036e74 esp = 0xf7036e10 eip = 0xc039ae5c
ebp = 0xf7036f00 xss = 0xc0390068 xcs = 0x00000060 eflags = 0x00010246
xds = 0x0000007b xes = 0x0000007b origeax = 0xffffffff &regs = 0xf7036ddc
[1]kdb>
|
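A note on reading the oops in the console log: the byte marked <8b> in the Code: line is where the faulting instruction starts. The following is only a minimal sketch in Python, not a general disassembler. It decodes the single x86 form "8b /r disp8" (a 32-bit load from register plus 8-bit displacement) and combines it with the register values printed in the oops. The helper names faulting_bytes and decode_mov_load_disp8 are made up for illustration.

```python
# Register names in x86 ModRM encoding order (index 0..7).
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def faulting_bytes(code_line, count=3):
    """Return `count` byte values starting at the <xx> marker of an oops Code: line."""
    tokens = code_line.split()
    start = next(i for i, tok in enumerate(tokens) if tok.startswith("<"))
    return [int(tok.strip("<>"), 16) for tok in tokens[start:start + count]]


def decode_mov_load_disp8(raw):
    """Decode only `8b /r disp8`, i.e. mov r32, [r32 + disp8].

    Any other encoding is rejected instead of being guessed at.
    """
    opcode, modrm, disp = raw
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if opcode != 0x8B or mod != 0b01 or rm == 0b100:
        raise ValueError("not a simple mov r32, [r32 + disp8]")
    if disp >= 0x80:          # disp8 is a signed byte
        disp -= 0x100
    return REG32[reg], REG32[rm], disp


# Tail of the Code: line and the register values, copied from the oops above.
code_tail = "89 b5 2c ff ff ff <8b> 43 08 89 74 24 10"
regs = {"eax": 0xF7036F20, "ebx": 0x00000000, "ecx": 0xF71157F0, "edx": 0x00000008}

dst, base, disp = decode_mov_load_disp8(faulting_bytes(code_tail))
fault_addr = (regs[base] + disp) & 0xFFFFFFFF

print(f"mov {dst}, [{base}{disp:+#x}]  faulting address {fault_addr:#010x}")
```

The computed address agrees with the "virtual address 00000008" line in the log, which is consistent with a NULL structure pointer in ebx being read at offset 8. Whether that pointer is the socket left unset after the failed bind is an inference from the call trace, not something the log states.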
From: Roger T. <rog...@gm...> - 2005-05-27 15:27:49
|
Hi Gopal,

Alright I'll take a look at the code. I don't seem to be having this
problem on OPENSSI-FC-1-2-STABLE and the code is synced to the trunk
which also happens to be the OPENSSI-DEBIAN branch. So we're using
the same (drbd) code, just on different kernels.

-Roger
|
From: Roger T. <rog...@gm...> - 2005-05-27 16:21:18
|
Okay I think I found a bug in drbdsetup.c <<cmd_net_config>>. It seems
the first tuple of the local IP address had been assigned to
cn.config.this_nodenum on line 1155. I'll see what to do about this.

-Roger
|
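For illustration, here is one plausible way a node-number field could end up holding the first part of an IP address: a C-style integer parse (strtol or atoi) stops at the first non-digit, so a dotted address handed to it yields only the leading number. This is a sketch in Python under that assumption, not the actual code at line 1155 of drbdsetup.c. The helper leading_int and both addresses are hypothetical, with the addresses chosen only so that they begin with 16 like the reported value.

```python
import re


def leading_int(text):
    """Mimic C strtol/atoi: parse the leading decimal digits, ignore the rest."""
    match = re.match(r"\s*([+-]?\d+)", text)
    return int(match.group(1)) if match else 0


# Hypothetical addresses; only the leading "16" is modelled on the report.
local_addr = "16.0.0.1"
remote_addr = "16.0.0.2"

# If the address strings are parsed where node numbers were expected,
# both node numbers collapse to the first part of the address.
this_nodenum = leading_int(local_addr)
other_nodenum = leading_int(remote_addr)

print(this_nodenum, other_nodenum)
```

Under these assumptions both node numbers come out as 16, which would match the this_nodenum and other_nodenum values in the original report.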
From: Roger T. <rog...@gm...> - 2005-05-27 17:16:59
Attachments:
patch-drbdsetup.c
|
Gopal,

Try the attached patch.

-Roger
|
From: Roger T. <rog...@gm...> - 2005-05-27 17:50:59
Attachments:
patch-drbdadm_main.c
|
Gopal,

Apparently drbdadm has been working properly with the old drbdsetup
because the order of arguments passed to drbdsetup matches that
expected by drbdsetup. If you try the drbdsetup patch I sent, you
also have to use this drbdadm patch attached.

I think you hit the drbdsetup bug because most users don't use
drbdsetup directly. Let me know.

-Roger
|
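The dependency between the two patches can be pictured with a small Python sketch: a caller that emits positional arguments and a parser that reads them by position only agree as long as both use the same order. The field names come from the thread, but the two orderings, the helper names build_argv and parse_argv, and the example values are all hypothetical. This is not the real drbdsetup command line.

```python
from collections import namedtuple

NetConf = namedtuple("NetConf", "local_addr remote_addr this_nodenum other_nodenum")

# Two hypothetical positional layouts for the same four values.
ORDER_A = ("local_addr", "remote_addr", "this_nodenum", "other_nodenum")
ORDER_B = ("this_nodenum", "other_nodenum", "local_addr", "remote_addr")


def build_argv(values, order):
    """Caller side: emit the values as positional strings in `order`."""
    return [str(values[name]) for name in order]


def parse_argv(argv, order):
    """Parser side: assign positional strings to fields, trusting `order`."""
    return NetConf(**dict(zip(order, argv)))


values = {
    "local_addr": "16.0.0.1",    # made-up addresses
    "remote_addr": "16.0.0.2",
    "this_nodenum": 1,
    "other_nodenum": 2,
}

# Caller and parser agree on the order: every field gets its own value.
matched = parse_argv(build_argv(values, ORDER_A), ORDER_A)

# Only one side changed its order: addresses and node numbers trade places.
mismatched = parse_argv(build_argv(values, ORDER_A), ORDER_B)

print(matched)
print(mismatched)
```

In the mismatched case the node-number fields receive address strings and the address fields receive node numbers. That is the general shape of the symptom reported at the start of the thread, and it is why a change to the order on one side has to be mirrored on the other.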
From: Gopalakrishna NM <go...@hp...> - 2005-05-30 10:07:56
|
Hi Roger,

At first sight, with the patch and the recent DRBD check-ins, the problem has been resolved. I am now trying the latest kernel and doing some failover testing. I will update you further.

Regards,
Gopal.

Roger Tsang wrote:
> Gopal,
>
> Apparently drbdadm has been working properly with the old drbdsetup
> because the order of arguments passed to drbdsetup matches that
> expected by drbdsetup. If you try the drbdsetup patch I sent, you
> also have to use this drbdadm patch attached.
>
> I think you hit the drbdsetup bug because most users don't use
> drbdsetup directly. Let me know.
>
> -Roger
>
> On 5/27/05, Roger Tsang <rog...@gm...> wrote:
>
>>Hi Gopal,
>>
>>Alright I'll take a look at the code. I don't seem to be having this
>>problem on OPENSSI-FC-1-2-STABLE and the code is synced to the trunk
>>which also happens to be the OPENSSI-DEBIAN branch. So we're using
>>the same (drbd) code, just on different kernels.
>>
>>-Roger
|
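Roger's explanation, that the two tools must agree on the order of arguments, would account for the swapped values reported at the start of the thread (address lengths of 1 and 2, node numbers of 16 and 16). Below is a minimal sketch of that kind of failure. It is not the drbdsetup or drbdadm code: the struct, the slot order and the function names are all invented for illustration, and only the four field names and the observed values come from the thread.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the four mdev->conf fields named in the thread. */
struct conf_sketch {
    int my_addr_len;
    int other_addr_len;
    int this_nodenum;
    int other_nodenum;
};

/* Positional order the (hypothetical) consumer expects. */
enum {
    SLOT_MY_ADDR_LEN,
    SLOT_OTHER_ADDR_LEN,
    SLOT_THIS_NODENUM,
    SLOT_OTHER_NODENUM,
    SLOT_COUNT
};

/* Consumer side: trusts that args[] arrives in the slot order above. */
struct conf_sketch parse_positional(const char *const args[])
{
    struct conf_sketch c;
    assert(args != NULL);
    c.my_addr_len    = atoi(args[SLOT_MY_ADDR_LEN]);
    c.other_addr_len = atoi(args[SLOT_OTHER_ADDR_LEN]);
    c.this_nodenum   = atoi(args[SLOT_THIS_NODENUM]);
    c.other_nodenum  = atoi(args[SLOT_OTHER_NODENUM]);
    return c;
}

/* Caller that agrees with the consumer: lengths first, node numbers second. */
struct conf_sketch call_matching(void)
{
    const char *const args[SLOT_COUNT] = { "16", "16", "1", "2" };
    return parse_positional(args);
}

/* Caller built against a different order: node numbers first. */
struct conf_sketch call_mismatched(void)
{
    const char *const args[SLOT_COUNT] = { "1", "2", "16", "16" };
    return parse_positional(args);
}

/* The thread gives 16 as the proper address length for both ends. */
int addr_len_plausible(const struct conf_sketch *c)
{
    return c->my_addr_len == 16 && c->other_addr_len == 16;
}
```

Nothing fails at parse time in the mismatched case; the values simply land in the wrong fields, and a check like `addr_len_plausible()` is the first place the mistake becomes visible. That would also fit the observation that a matching old drbdadm/old drbdsetup pair works while a mixed pair does not.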
From: Gopalakrishna NM <go...@hp...> - 2005-05-30 12:51:33
|
Hi Roger,

First I brought up the DRBD primary and then the DRBD secondary. When I bring down the primary node for a reboot, the secondary node simply hangs. (It is not completely hung: it responds to ping, but I can't log in to the node or type any command. Even pressing the Enter key does not show the prompt.)

When I reboot the primary node again, it identifies node 2 as root and waits continuously for node 2 to join (message: "Searching for an existing root node...Found node 2 as the root node.").

I have attached the messages from the node 2 console when node 1 is going down. Any tips for debugging would be helpful.

Regards,
Gopal.

drbd0: PingAck did not arrive in time.
drbd0: drbd0_asender [131629]: cstate Connected --> NetworkFailure
drbd0: asender terminated
drbd0: drbd0_receiver [131622]: cstate NetworkFailure --> BrokenPipe
drbd0: short read expecting header on sock: r=-512
drbd0: worker terminated
drbd0: drbd0_receiver [131622]: cstate BrokenPipe --> Unconnected
drbd0: Connection lost.
drbd0: drbd0_receiver [131622]: cstate Unconnected --> WFConnection
Taking over master from node 1.
Node 1 has gone down!!!
passed the first scan in ipcname_pull_data
num_objects[MSG] = 0
num_objects[SEM] = 0
num_objects[SHM] = 0
ipcnameserver ready completed
drbd0: drbd_nodedown: Signaling receiver thread.
drbd0: drbd_set_state: (mdev->this_bdev->bd_contains == 0) in /usr/src/modules/drbd/drbd/drbd_fs.c:702
drbd0: Secondary/Unknown --> Primary/Unknown
drbd0: Doing CLMS nodedown callback for service 9
|
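One thing that can be read off the node 2 console log is the order in which the services reacted to the node-down event: the ipcnameserver lines come before the drbd0 Secondary-to-Primary transition. A rough way to check that kind of ordering on a captured log is sketched below. The function names are invented for this note, the log array holds only a few lines copied from the message above, and matching is by plain substring, so it is a debugging aid rather than anything DRBD or CLMS provides.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Index of the first line containing marker, or -1 if no line matches. */
int first_line_with(const char *const lines[], size_t n, const char *marker)
{
    for (size_t i = 0; i < n; i++)
        if (strstr(lines[i], marker) != NULL)
            return (int)i;
    return -1;
}

/*
 * 1 if `earlier` first shows up on a line before `later` does,
 * 0 if it does not, -1 if either marker is missing from the log.
 */
int appears_before(const char *const lines[], size_t n,
                   const char *earlier, const char *later)
{
    int a = first_line_with(lines, n, earlier);
    int b = first_line_with(lines, n, later);
    if (a < 0 || b < 0)
        return -1;
    return a < b;
}

/* A few lines copied, in order, from the node 2 console log in this message. */
const char *const node2_log[] = {
    "Taking over master from node 1.",
    "Node 1 has gone down!!!",
    "ipcnameserver ready completed",
    "drbd0: drbd_nodedown: Signaling receiver thread.",
    "drbd0: Secondary/Unknown --> Primary/Unknown",
    "drbd0: Doing CLMS nodedown callback for service 9",
};
const size_t node2_log_len = sizeof(node2_log) / sizeof(node2_log[0]);
```

Calling `appears_before(node2_log, node2_log_len, "ipcnameserver ready", "Secondary/Unknown --> Primary/Unknown")` reports that the IPC name server finished before DRBD became primary on node 2. Whether that ordering is the cause of the hang is a separate question; the check only makes the sequence explicit.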
From: Roger T. <rog...@gm...> - 2005-05-30 20:32:42
|
Gopal,

The messages below show that ipc failed over before DRBD. This is not supposed to happen. On SSI-1.2, DRBD failover completes before any of the key services. I'm occupied right now, but hopefully by later tonight I'll give you a patch which changes DRBD's priority in CLMS to see if it fixes your problem.

-Roger

On 5/30/05, Gopalakrishna NM <go...@hp...> wrote:
> Hi Roger,
> First I brought up the DRBD primary and next DRBD secondary. When I
> bring down the primary node for reboot, secondary node simply hung(It is
> not completely hung. It responds to ping. But I can't login to this node
> or type any command. Even just press enter key is not showing the
> prompt ) .
> When I reboot the primary node again, it identifies the node 2 as root
> and continuously wait node 2 to join(Message: Searching for an existing
> root node...Found node 2 as the root node.)
>
> I have attached the message from the node 2 console when node 1 is going
> down. Any tips for debugging would be helpful.
>
> Regards,
> Gopal.
>
> drbd0: PingAck did not arrive in time.
> drbd0: drbd0_asender [131629]: cstate Connected --> NetworkFailure
> drbd0: asender terminated
> drbd0: drbd0_receiver [131622]: cstate NetworkFailure --> BrokenPipe
> drbd0: short read expecting header on sock: r=-512
> drbd0: worker terminated
> drbd0: drbd0_receiver [131622]: cstate BrokenPipe --> Unconnected
> drbd0: Connection lost.
> drbd0: drbd0_receiver [131622]: cstate Unconnected --> WFConnection
> Taking over master from node 1.
> Node 1 has gone down!!!
> passed the first scan in ipcname_pull_data
> num_objects[MSG] = 0
> num_objects[SEM] = 0
> num_objects[SHM] = 0
> ipcnameserver ready completed
> drbd0: drbd_nodedown: Signaling receiver thread.
> drbd0: drbd_set_state: (mdev->this_bdev->bd_contains == 0) in
> /usr/src/modules/drbd/drbd/drbd_fs.c:702
> drbd0: Secondary/Unknown --> Primary/Unknown
> drbd0: Doing CLMS nodedown callback for service 9
|
From: Roger T. <rog...@gm...> - 2005-05-30 22:45:36
Attachments:
drbd-priority.patch
|
See what you get with this patch.

-Roger
|
From: Roger T. <rog...@gm...> - 2005-05-31 05:43:18
Attachments:
drbd-priority-2.patch
|
Try this one instead.

On 5/30/05, Roger Tsang <rog...@gm...> wrote:
> See what you get with this patch.
>
> -Roger
>
> On 5/30/05, Roger Tsang <rog...@gm...> wrote:
> > Gopal,
> >
> > The messages below show that ipc failed over before DRBD. This is not
> > suppose to happen. On SSI-1.2 DRBD failover completes before any of
> > the key services. I'm occupied right now, but hopefully by later
> > tonight I'll give you a patch which changes DRBD's priority in CLMS to
> > see if it fixes your problem.
> >
> > -Roger
> >
> >
> > On 5/30/05, Gopalakrishna NM <go...@hp...> wrote:
> > > Hi Roger,
> > > First I brought up the DRBD primary and next DRBD secondary. When I
> > > bring down the primary node for reboot, secondary node simply hung(It is
> > > not completely hung. It responds to ping. But I can't login to this node
> > > or type any command. Even just press enter key is not showing the
> > > prompt ) .
> > > When I reboot the primary node again, it identifies the node 2 as root
> > > and continuously wait node 2 to join(Message: Searching for an existing
> > > root node...Found node 2 as the root node.)
> > >
> > > I have attached the message from the node 2 console when node 1 is going
> > > down. Any tips for debugging would be helpful.
> > >
> > > Regards,
> > > Gopal.
> > >
> > > drbd0: PingAck did not arrive in time.
> > > drbd0: drbd0_asender [131629]: cstate Connected --> NetworkFailure
> > > drbd0: asender terminated
> > > drbd0: drbd0_receiver [131622]: cstate NetworkFailure --> BrokenPipe
> > > drbd0: short read expecting header on sock: r=-512
> > > drbd0: worker terminated
> > > drbd0: drbd0_receiver [131622]: cstate BrokenPipe --> Unconnected
> > > drbd0: Connection lost.
> > > drbd0: drbd0_receiver [131622]: cstate Unconnected --> WFConnection
> > > Taking over master from node 1.
> > > Node 1 has gone down!!!
> > > passed the first scan in ipcname_pull_data
> > > num_objects[MSG] = 0
> > > num_objects[SEM] = 0
> > > num_objects[SHM] = 0
> > > ipcnameserver ready completed
> > > drbd0: drbd_nodedown: Signaling receiver thread.
> > > drbd0: drbd_set_state: (mdev->this_bdev->bd_contains == 0) in
> > > /usr/src/modules/drbd/drbd/drbd_fs.c:702
> > > drbd0: Secondary/Unknown --> Primary/Unknown
> > > drbd0: Doing CLMS nodedown callback for service 9
> > >
> > > Gopalakrishna NM wrote:
> > > > Hi Roger,
> > > > At first sight , with the patch and recent DRBD checkins , problem has
> > > > been resolved. I am trying with Latest kernel and doing some failover
> > > > testing. I will update you further.
> > > >
> > > > Regards,
> > > > Gopal.
> > > >
> > > > Roger Tsang wrote:
> > > >
> > > >> Gopal,
> > > >>
> > > >> Apparently drbdadm has been working properly with the old drbdsetup
> > > >> because the order of arguments passed to drbdsetup matches that
> > > >> expected by drbdsetup. If you try the drbdsetup patch I sent, you
> > > >> also have to use this drbdadm patch attached.
> > > >>
> > > >> I think you hit the drbdsetup bug because most users don't use
> > > >> drbdsetup directly. Let me know.
> > > >>
> > > >> -Roger
> > > >>
> > > >> On 5/27/05, Roger Tsang <rog...@gm...> wrote:
> > > >>
> > > >>> Hi Gopal,
> > > >>>
> > > >>> Alright I'll take a look at the code. I don't seem to be having this
> > > >>> problem on OPENSSI-FC-1-2-STABLE and the code is synced to the trunk
> > > >>> which also happens to be the OPENSSI-DEBIAN branch. So we're using
> > > >>> the same (drbd) code, just on different kernels.
> > > >>>
> > > >>> -Roger
> > > >>>
> > > >>>
> > > >>> On 5/27/05, Gopalakrishna NM <go...@hp...> wrote:
> > > >>>
> > > >>>> Hi Roger,
> > > >>>> DRBD , 1.9 (built from OPENSSI-DEBIAN, May 27 12:58) and the kernel (
> > > >>>> built from OPENSSI-DEBIAN, May 27 12:58) resulting in the kernel
> > > >>>> oops in
> > > >>>> the function sock_recvmsg. Till yesterday I could reproduce
> > > >>>> consistently(with the kernel & drbd built on May 25 ), but it suddenly
> > > >>>> disappeared when I built the new version of command drbdadm and
> > > >>>> drbdsetup and copied to test system. After few set up reboot, it
> > > >>>> resulted in oops again!. Today with new kernel and DRBD module, it is
> > > >>>> resulting in oops consistently. Ofcourse, I built new drbdadm and
> > > >>>> drbdsetup commands and copied to test system.
> > > >>>>
> > > >>>> The console message I am attaching at the end of this mail. .
> > > >>>>
> > > >>>> Looking at the DRBD code and generating few debugging message it is
> > > >>>> clear that following mdev(drbd_dev) fields are some how
> > > >>>> corrupting/interchanging.
> > > >>>> mdev->conf.my_addr_len = 1
> > > >>>> mdev->conf.other_addr_len = 2
> > > >>>> mdev->conf.this_nodenum = 16
> > > >>>> mdev->conf.other_nodenum = 16
> > > >>>>
> > > >>>> The function "drbd_wait_for_connect" could not successfully complete
> > > >>>> "bind" and resulting in error (-22) EINVAL. Because of this connect
> > > >>>> would fail in the function "drbd_try_connect" (This is expected). The
> > > >>>> reason for bind failure is some how address is not proper and it is
> > > >>>> happening because of "my_addr_len & other_addr_len " is having 1 and 2
> > > >>>> respectively instead of 16(This is proper length). But node number
> > > >>>> "this_nodenum & "other_nodenum" is showing 16 instead of 1 and 2
> > > >>>> respectively. Since address len field is 1, during bind it might be
> > > >>>> getting first and second character of the address and it would resulted
> > > >>>> in bind.
> > > >>>>
> > > >>>> Once it start working yesterday (After copying new copy of command
> > > >>>> drbdadm and drbdsetup), later I copied old command , but it was
> > > >>>> working!. So I am not even convinced it is command problem. when I
> > > >>>> built
> > > >>>> new kernel and DRBD module today , again I am facing same problem
> > > >>>> consistently with new and old command.
> > > >>>>
> > > >>>> Does it any thing to do with race?. Any idea the relationship between
> > > >>>> drbdsetup and initialization of drbd?. or anything to do with
> > > >>>> alignment?
> > > >>>> Any timing issues? My understanding is at this stage of boot up
> > > >>>> drbdsetup would not have any problem.
> > > >>>>
> > > >>>> Your inputs would be very helpful.
> > > >>>>
> > > >>>> Thanks and regards,
> > > >>>> Gopal.
> > > >>>>
> > > >>>> ======================CONSOLE messg=============
> > > >>>> drbd: module cleanup done.
> > > >>>> modprobe -k drbd minor_count=1
> > > >>>> drbd: initialised. Version: 0.7.10 (api:77/proto:74)
> > > >>>> drbd: SVN Revision: 1743 build by go...@ha...,
> > > >>>> 2005-05-27 12:58:01
> > > >>>> drbd: registered as block device major 147
> > > >>>> Starting DRBD resource:
> > > >>>> drbd0: resync bitmap: bits=1151992 words=36000
> > > >>>> drbd0: size = 4499 MB (4607968 KB)
> > > >>>> drbd0: 4499 MB marked out-of-sync by on disk bit-map.
> > > >>>> drbd0: Found 6 transactions (230 active extents) in activity log.
> > > >>>> drbd0: Marked additional 0 KB as out-of-sync based on AL.
> > > >>>> drbd0: drbdsetup [66084]: cstate Unconfigured --> StandAlone
> > > >>>> drbd0: drbdsetup [66087]: cstate StandAlone --> Unconnected
> > > >>>> drbd0: drbd0_receiver [66088]: cstate Unconnected --> WFConnection
> > > >>>> drbd0: Unabl
> > > >>>> # WARNING: Do not type 'yes' while waiting for DRBD connection
> > > >>>> # unless you know what you are doing! You have been warned!
> > > >>>> # The only exception is when setting up DRBD first time.
> > > >>>> #
> > > >>>> e to bind (-22)
> > > >>>> drbd0: Registering drbd0 with CLMS subsystem
> > > >>>> dr<b4>d0dr: bdd0r:b d0T_ryriencegi tveo r co[6nn60ec88t ]:ag
> > > >>>> caisnta.<t1e >WUnFCaoblnene tcot iohan nd--le> kUencrnonelne cNtUeLLd
> > > >>>> pointer dereference at virtual address 00000008
> > > >>>> printing eip:
> > > >>>> c039ae5c
> > > >>>> *pde = 00000000
> > > >>>> Oops: 0000 [#1]
> > > >>>> SMP
> > > >>>> Modules linked in: drbd sd_mod sym53c8xx scsi_transport_spi cciss
> > > >>>> scsi_mod tg3 eepro100 e100 mii
> > > >>>> CPU: 1
> > > >>>> EIP: 0060:[<c039ae5c>] Not tainted VLI
> > > >>>> EFLAGS: 00010246 (2.6.10-ssi-686-smp)
> > > >>>> EIP is at sock_recvmsg+0xac/0xf0
> > > >>>> eax: f7036f20 ebx: 00000000 ecx: f71157f0 edx: 00000008
> > > >>>> esi: 00004100 edi: f7036e74 ebp: f7036f00 esp: f7036e10
> > > >>>> ds: 007b es: 007b ss: 0068
> > > >>>> Process drbd0_receiver (pid: 66088, threadinfo=f7036000 task=f71157f0)
> > > >>>> Stack: 000000fc 00000000 f7036e7c ffffffff c0465060 20008790 36303838
> > > >>>> 00004100
> > > >>>> 00000008 00000000 00000000 00000000 f7036f20 f712a040 f7036e60
> > > >>>> c03db8da
> > > >>>> f712a040 f712a150 f712a040 f712a040 00000282 c03e108a f712a040
> > > >>>> f712a040
> > > >>>> Call Trace:
> > > >>>> [<c010687f>] show_stack+0x7f/0xa0
> > > >>>> [<c0106a34>] show_registers+0x164/0x220
> > > >>>> [<c0106dc4>] die+0xf4/0x1c0
> > > >>>> [<c011f325>] do_page_fault+0x375/0x695
> > > >>>> [<c01064d3>] error_code+0x2b/0x30
> > > >>>> [<f898e0e9>] drbd_recv+0x89/0x190 [drbd]
> > > >>>> [<f898e99a>] drbd_recv_header+0x2a/0xf0 [drbd]
> > > >>>> [<f8991f2c>] drbdd+0x2c/0x160 [drbd]
> > > >>>> [<f8992b18>] drbdd_init+0x78/0x410 [drbd]
> > > >>>> [<f8998e3e>] drbd_thread_setup+0x7e/0xf0 [drbd]
> > > >>>> [<c01022e5>] kernel_thread_helper+0x5/0x10
> > > >>>> Code: 85 24 ff ff ff 89 45 e8 31 c0 89 85 3c ff ff ff 8b 45 0c 89 95 30
> > > >>>> ff ff ff 89 9d 34 ff ff ff 89 85 40 ff ff ff 89 b5 2c ff ff ff <8b> 43
> > > >>>> 08 89 74 24 10 89 54 24 0c8b 55 0c 89 5c 24 04 89 3c 24
> > > >>>>
> > > >>>> Entering kdb (current=0xf71157f0, pid 66088) on processor 1 Oops: Oops
> > > >>>> due to oops @ 0xc039ae5c
> > > >>>> eax = 0xf7036f20 ebx = 0x00000000 ecx = 0xf71157f0 edx = 0x00000008
> > > >>>> esi = 0x00004100 edi = 0xf7036e74 esp = 0xf7036e10 eip = 0xc039ae5c
> > > >>>> ebp = 0xf7036f00 xss = 0xc0390068 xcs = 0x00000060 eflags = 0x00010246
> > > >>>> xds = 0x0000007b xes = 0x0000007b origeax = 0xffffffff &regs =
> > > >>>> 0xf7036ddc
> > > >>>> [1]kdb>
> > > >>>>
> > > >
> > >
> >
>
|
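The quoted exchange above mentions that drbdadm and drbdsetup must agree on the order of arguments, and the reported values (address lengths of 1 and 2, node numbers of 16) look like two pairs of fields that traded places. The following is only a sketch of how such a swap could arise from a positional-order mismatch. The two field orders are hypothetical, chosen to reproduce the reported numbers; the thread does not show the real argument layout or confirm that this is the cause.

```python
# Hypothetical field orders, NOT taken from the drbdadm/drbdsetup sources.
SENDER_ORDER = ("this_nodenum", "other_nodenum", "my_addr_len", "other_addr_len")
RECEIVER_ORDER = ("my_addr_len", "other_addr_len", "this_nodenum", "other_nodenum")


def pack(values, order):
    """Flatten named values into a positional argument list."""
    return [values[name] for name in order]


def unpack(args, order):
    """Interpret a positional argument list using the given field order."""
    return dict(zip(order, args))


# What the sender means to configure: nodes 1 and 2, 16-byte addresses.
intended = {
    "this_nodenum": 1,
    "other_nodenum": 2,
    "my_addr_len": 16,
    "other_addr_len": 16,
}

# The receiver reads the same positions with a different idea of the order.
seen = unpack(pack(intended, SENDER_ORDER), RECEIVER_ORDER)
print(seen)
```

When both sides use the same order, the values round-trip unchanged, which would be consistent with the observation that a matching pair of tools works.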
From: Aneesh K. <ane...@gm...> - 2005-05-31 07:14:46
|
On 5/31/05, Roger Tsang <rog...@gm...> wrote:
> Try this one instead.
>

I don't think you should use CLMS priority -1. As per the
documentation http://ci-linux.sourceforge.net/enhancing.shtml, -1
means no ordering is ensured. Which CLMS service do you think is
happening along with DRBD? I see CFS failover happening at 1. That
means 0 should be OK for DRBD.

-aneesh
|
From: Roger T. <rog...@gm...> - 2005-05-31 12:43:45
|
Yeah I know. Look at the _entire_ patch. It manages to guarantee
failover completion before all priority 0 services.

-Roger

On 5/31/05, Aneesh Kumar <ane...@gm...> wrote:
> On 5/31/05, Roger Tsang <rog...@gm...> wrote:
> > Try this one instead.
> >
>
> I don't think you should use CLMS priority -1. As per the
> documentation http://ci-linux.sourceforge.net/enhancing.shtml -1
> means no ordering is ensured. What is the CLMS service do you think is
> happening along with DRBD. I see CFS failover happening at 1 . That
> means 0 should be ok for DRBD.
>
> -aneesh
>
|
From: Aneesh K. <ane...@gm...> - 2005-05-31 12:55:45
|
On 5/31/05, Roger Tsang <rog...@gm...> wrote:
> Yeah I know. Look at the _entire_ patch. It manages to guarantee
> failover completion before all priority 0 services.
>
> -

IIUC the use of completion will only make sure things are waited for
correctly within drbd. I guess using CLMS priority -1 means (I may be
wrong here) that other CLMS priorities need not wait for this service to
finish. That means we may try a CFS failover before DRBD failover, which
is not what we want.

-aneesh
|
From: Roger T. <rog...@gm...> - 2005-05-31 13:04:03
|
Drbd doesn't return to the caller until completion. So clms_svcmgmt
waits (for drbd) before calling priority 0,1 services.

-Roger

On 5/31/05, Aneesh Kumar <ane...@gm...> wrote:
> On 5/31/05, Roger Tsang <rog...@gm...> wrote:
> > Yeah I know. Look at the _entire_ patch. It manages to guarantee
> > failover completion before all priority 0 services.
> >
> > -
>
> IIUC the use of completion will only make sure things are waited
> correctly within drbd. I guess using CLMS priority -1 means ( I may be
> wrong here ) other CLMS priority need not wait for this service to
> over. That means we may try a CFS failover before DRBD failover which
> is not what we want.
>
> -aneesh
>
|
From: Gopalakrishna NM <go...@hp...> - 2005-05-31 07:31:22
|
Hi Roger,
The problem has not been resolved yet. The bt and dmesg output is
attached. I think John is still working on CFS failover.

I was going through the CLMS code in "clms_svcmgmt.c" and found that -1
might not be an appropriate priority. Sometimes it may work and
sometimes it may not.

See below
2601 * Subsystems with band priority -1 have no dependencies on other
2602 * subsystems and therefore are executed in parallel with all other
2603 * priority band processing.

Regards,
Gopal.

========dmesg and kdb 'bt' output===

Node 1 has gone down!!!
passed the first scan in ipcname_pull_data
drbd0: drbd_nodedown: Signaling receiver thread.
drbd0: drbd_set_state: (mdev->this_bdev->bd_contains == 0) in
/usr/src/modules/drbd/drbd/drbd_fs.c:702
drbd0: Secondary/Unknown --> Primary/Unknown
drbd0: Doing CLMS nodedown callback for service 9
num_objects[MSG] = 0
num_objects[SEM] = 0
num_objects[SHM] = 0
ipcnameserver ready completed

Entering kdb (current=0xc1bf1050, pid 0) on processor 1 due to Keyboard Entry
[1]kdb> BT
BT = 0x0000000b
[1]kdb> bt
Stack traceback for pid 0
0xc1bf1050 0 0 1 1 I 0xc1bf1230 *swapper
EBP EIP Function (args)
Starting on an alternate kernel stack
0xdfdcefa8 0xc010206c default_idle+0x2c (0x4, 0xc0753a70, 0xdfdcef74, 0x0, 0x4)
0xc06ebfcc 0xc01496d3 handle_IRQ_event+0x33 (0x4, 0xc0678e00, 0x1, 0xc0679018, 0xf7d63720)
0xc06ebff8 0xc01497ef __do_IRQ+0xdf
=======================
0xc01080d1 do_IRQ+0x61
[1]kdb>
|
From: Gopalakrishna NM <go...@hp...> - 2005-05-31 12:34:59
|
Hi Roger/John,
The problem has been resolved. This time I made a mistake: I forgot to
include the "ext3" module in the initrd image (this was the file system
that I created for the DRBD device on both primary and secondary). When
the primary node goes down, it tries to find ext3 (request_module in
get_fs_type) during CFS failover. It could not complete the call after
locking the whole kernel, so failover did not complete and the kernel
appeared to hang.

I rebuilt the initrd image with the "ext3" module included and the
problem has been resolved. Now the secondary node responds and retains
its connection once the primary goes down. I will continue testing how
the other failovers go.

Thanks and regards,
Gopal.

Gopalakrishna NM wrote:
> Hi Roger,
> Problem has not been resolved yet. The bt and demesg out put is
> attached. I think John is still working on CFS failover.
>
> . I was going through the CLMS code of "clms_svcmgmt.c", I found that -1
> might not be an appropriate priority. Some time it may work and some
> time it may not.
>
> Sel below
> 2601 * Subsystems with band priority -1 have no dependencies on
> other
> 2602 * subsystems and therefore are executed in parallel with
> all other
> 2603 * priority band processing.
>
> Regards,
> Gopal.
>
>
> ========dmesg and kdb 'bt' output===
>
> Node 1 has gone down!!!
> passed the first scan in ipcname_pull_data
> drbd0: drbd_nodedown: Signaling receiver thread.
> drbd0: drbd_set_state: (mdev->this_bdev->bd_contains == 0) in
> /usr/src/modules/drbd/drbd/drbd_fs.c:702
> drbd0: Secondary/Unknown --> Primary/Unknown
> drbd0: Doing CLMS nodedown callback for service 9
> num_objects[MSG] = 0
> num_objects[SEM] = 0
> num_objects[SHM] = 0
> ipcnameserver ready completed
>
>
> Entering kdb (current=0xc1bf1050, pid 0) on processor 1 due to Keyboard
> Entry
> [1]kdb> BT
> BT = 0x0000000b
> [1]kdb> bt
> Stack traceback for pid 0
> 0xc1bf1050 0 0 1 1 I 0xc1bf1230 *swapper
> EBP EIP Function (args)
> Starting on an alternate kernel stack
> 0xdfdcefa8 0xc010206c default_idle+0x2c (0x4, 0xc0753a70, 0xdfdcef74,
> 0x0, 0x4)
> 0xc06ebfcc 0xc01496d3 handle_IRQ_event+0x33 (0x4, 0xc0678e00, 0x1,
> 0xc0679018, 0xf7d63720)
> 0xc06ebff8 0xc01497ef __do_IRQ+0xdf
> =======================
> 0xc01080d1 do_IRQ+0x61
> [1]kdb>
>
|
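The root cause reported above was a filesystem module missing from the initrd. A check along the following lines might catch that before a failover test. It is a minimal sketch that assumes the initrd image has already been unpacked into a directory and that modules are stored as `.ko` files; the function name and the demo directory are hypothetical, and the thread does not describe how the image was actually inspected or rebuilt.

```python
import os
import tempfile


def missing_modules(initrd_root, required):
    """Return the required module names with no matching .ko file under initrd_root."""
    present = set()
    for _dirpath, _dirnames, filenames in os.walk(initrd_root):
        for name in filenames:
            if name.endswith(".ko"):
                present.add(name[: -len(".ko")])
    return sorted(set(required) - present)


# Demo on a throwaway directory standing in for an unpacked initrd.
with tempfile.TemporaryDirectory() as root:
    moddir = os.path.join(root, "lib", "modules")
    os.makedirs(moddir)
    # Only drbd.ko is "installed"; ext3.ko is deliberately left out.
    open(os.path.join(moddir, "drbd.ko"), "w").close()
    print(missing_modules(root, ["drbd", "ext3"]))
```

The required list would need to name whatever filesystem sits on the DRBD device, plus any modules that filesystem depends on.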
From: Gopalakrishna NM <go...@hp...> - 2005-05-31 13:16:13
|
Roger Tsang wrote:
> Drbd doesn't return to the caller until completion. So clms_svcmgmt
> waits (for drbd) before calling priority 0,1 services.

Not sure. Are priority -1 tasks handled by a different thread, in
parallel with the thread that executes the tasks in priority bands
0, 1, etc.?

Regards,
Gopal.

>
> -Roger
>
> On 5/31/05, Aneesh Kumar <ane...@gm...> wrote:
>
>>On 5/31/05, Roger Tsang <rog...@gm...> wrote:
>>
>>>Yeah I know. Look at the _entire_ patch. It manages to guarantee
>>>failover completion before all priority 0 services.
>>>
>>>-
>>
>>IIUC the use of completion will only make sure things are waited
>>correctly within drbd. I guess using CLMS priority -1 means ( I may be
>>wrong here ) other CLMS priority need not wait for this service to
>>over. That means we may try a CFS failover before DRBD failover which
>>is not what we want.
>>
>>-aneesh
>>
>
|
From: Roger T. <rog...@gm...> - 2005-05-31 13:50:28
|
Last time I looked, all I saw was it going through the bands, calling
out to all services and expecting a return of 0 before moving on to the
next one. I haven't verified whether the caller would make separate
threads. If the service is not priority -1, it would wait for the
callback. Because it has to wait for the callback, I naturally assumed
there wouldn't (need to) be more than one thread.

Also (the docs say?) the nodedown function of these services must
return 0 to the caller regardless of priority (not mentioned), but
these services still need to call back. So I doubt the caller is a
separate thread, considering the callback mechanism in the first place.

-Roger

On 5/31/05, Gopalakrishna NM <go...@hp...> wrote:
> Roger Tsang wrote:
> > Drbd doesn't return to the caller until completion. So clms_svcmgmt
> > waits (for drbd) before calling priority 0,1 services.
>
> Not sure. does Priority -1 task handled with different thread in
> parallel (i.e the thread executes tasks in the priority band 0,1 .. etc.)
>
> Regards,
> Gopal.
> >
> > -Roger
> >
> > On 5/31/05, Aneesh Kumar <ane...@gm...> wrote:
> >
> >>On 5/31/05, Roger Tsang <rog...@gm...> wrote:
> >>
> >>>Yeah I know. Look at the _entire_ patch. It manages to guarantee
> >>>failover completion before all priority 0 services.
> >>>
> >>>-
> >>
> >>IIUC the use of completion will only make sure things are waited
> >>correctly within drbd. I guess using CLMS priority -1 means ( I may be
> >>wrong here ) other CLMS priority need not wait for this service to
> >>over. That means we may try a CFS failover before DRBD failover which
> >>is not what we want.
> >>
> >>-aneesh
> >>
> >
>
|
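The open question in this exchange is whether a priority -1 nodedown handler is called inline by the same thread that walks bands 0, 1, and so on, or is handed to its own thread. The toy model below shows why the answer matters. It is not the clms_svcmgmt.c logic: `run_nodedown`, the handler names, and the 0.2 second delay are all made up for illustration, and a "slow" DRBD failover is simulated with a sleep.

```python
import threading
import time


def run_nodedown(services, unordered_in_parallel):
    """Call every nodedown handler and return the log the handlers wrote.

    services: list of (band, handler) pairs; each handler receives the log list.
    unordered_in_parallel: if True, band -1 handlers run in their own thread
    and nothing waits for them before the next band is processed. If False,
    they are called inline, so the caller blocks until they return.
    """
    log = []
    background = []
    for band, handler in sorted(services, key=lambda s: s[0]):
        if band == -1 and unordered_in_parallel:
            worker = threading.Thread(target=handler, args=(log,))
            worker.start()
            background.append(worker)
        else:
            handler(log)  # blocks until the handler returns
    for worker in background:
        worker.join()
    return log


def drbd_nodedown(log):
    time.sleep(0.2)  # stand-in for a failover that takes a while
    log.append("drbd failover done")


def cfs_nodedown(log):
    log.append("cfs failover done")


services = [(1, cfs_nodedown), (-1, drbd_nodedown)]
print(run_nodedown(services, unordered_in_parallel=False))
print(run_nodedown(services, unordered_in_parallel=True))
```

With inline calls, a handler that blocks until its own completion is enough to order DRBD ahead of CFS. With a separate thread it is not, which is the concern raised about band -1.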
From: Roger T. <rog...@gm...> - 2005-05-30 21:16:44
|
> drbd0: drbd_nodedown: Signaling receiver thread.
> drbd0: drbd_set_state: (mdev->this_bdev->bd_contains == 0) in
> /usr/src/modules/drbd/drbd/drbd_fs.c:702
> drbd0: Secondary/Unknown --> Primary/Unknown
> drbd0: Doing CLMS nodedown callback for service 9
>

Actually I think the IPC has nothing to do with this. :) Can you go
into kdb next time this happens and send me the bt? Also dmesg.
Thanks.

-Roger
|
From: Roger T. <rog...@gm...> - 2005-05-30 21:22:47
|
Hi,

And one more important thing I forgot to ask. Has SSI-1.9 failover
been verified to work reliably (without DRBD)? I haven't heard from
anyone saying that failover works on SSI-1.9. I just want to rule out
whether it's a fs failover or DRBD issue because the console messages
say DRBD already called back - ie. its failover completed. In any case
your stack trace would help.

-Roger

On 5/30/05, Roger Tsang <rog...@gm...> wrote:
> > drbd0: drbd_nodedown: Signaling receiver thread.
> > drbd0: drbd_set_state: (mdev->this_bdev->bd_contains == 0) in
> > /usr/src/modules/drbd/drbd/drbd_fs.c:702
> > drbd0: Secondary/Unknown --> Primary/Unknown
> > drbd0: Doing CLMS nodedown callback for service 9
> >
>
> Actually I think the IPC has nothing to do with this. :) Can you go
> into kdb next time this happens and send me the bt? Also dmesg.
> Thanks.
>
> -Roger
>
|