Re: [SSI-devel] Re: Regarding DRBD on 1.9
From: Gopalakrishna NM <go...@hp...> - 2005-05-30 12:51:33
Hi Roger,

First I brought up the DRBD primary and then the DRBD secondary. When I bring down the primary node for a reboot, the secondary node hangs. (It is not completely hung: it responds to ping, but I can't log in to it or type any command. Even pressing Enter does not show the prompt.) When the primary node comes back up, it identifies node 2 as the root and keeps waiting for node 2 to join (message: "Searching for an existing root node...Found node 2 as the root node.").

I have attached the messages from the node 2 console as node 1 goes down. Any tips for debugging would be helpful.

Regards,
Gopal.

drbd0: PingAck did not arrive in time.
drbd0: drbd0_asender [131629]: cstate Connected --> NetworkFailure
drbd0: asender terminated
drbd0: drbd0_receiver [131622]: cstate NetworkFailure --> BrokenPipe
drbd0: short read expecting header on sock: r=-512
drbd0: worker terminated
drbd0: drbd0_receiver [131622]: cstate BrokenPipe --> Unconnected
drbd0: Connection lost.
drbd0: drbd0_receiver [131622]: cstate Unconnected --> WFConnection
Taking over master from node 1.
Node 1 has gone down!!!
passed the first scan in ipcname_pull_data
num_objects[MSG] = 0
num_objects[SEM] = 0
num_objects[SHM] = 0
ipcnameserver ready completed
drbd0: drbd_nodedown: Signaling receiver thread.
drbd0: drbd_set_state: (mdev->this_bdev->bd_contains == 0) in /usr/src/modules/drbd/drbd/drbd_fs.c:702
drbd0: Secondary/Unknown --> Primary/Unknown
drbd0: Doing CLMS nodedown callback for service 9

Gopalakrishna NM wrote:

> Hi Roger,
> At first sight, with the patch and the recent DRBD checkins, the problem has
> been resolved. I am trying the latest kernel and doing some failover
> testing. I will update you further.
>
> Regards,
> Gopal.
>
> Roger Tsang wrote:
>
>> Gopal,
>>
>> Apparently drbdadm has been working properly with the old drbdsetup
>> because the order of arguments passed to drbdsetup matches that
>> expected by drbdsetup.
>> If you try the drbdsetup patch I sent, you
>> also have to use this drbdadm patch attached.
>>
>> I think you hit the drbdsetup bug because most users don't use
>> drbdsetup directly. Let me know.
>>
>> -Roger
>>
>> On 5/27/05, Roger Tsang <rog...@gm...> wrote:
>>
>>> Hi Gopal,
>>>
>>> Alright, I'll take a look at the code. I don't seem to be having this
>>> problem on OPENSSI-FC-1-2-STABLE, and the code is synced to the trunk,
>>> which also happens to be the OPENSSI-DEBIAN branch. So we're using
>>> the same (drbd) code, just on different kernels.
>>>
>>> -Roger
>>>
>>>
>>> On 5/27/05, Gopalakrishna NM <go...@hp...> wrote:
>>>
>>>> Hi Roger,
>>>> DRBD, 1.9 (built from OPENSSI-DEBIAN, May 27 12:58) and the kernel
>>>> (built from OPENSSI-DEBIAN, May 27 12:58) result in a kernel
>>>> oops in
>>>> the function sock_recvmsg. Until yesterday I could reproduce it
>>>> consistently (with the kernel and drbd built on May 25), but it suddenly
>>>> disappeared when I built new versions of the drbdadm and
>>>> drbdsetup commands and copied them to the test system. After a few
>>>> setup reboots, it resulted in an oops again! Today, with the new kernel
>>>> and DRBD module, it oopses consistently. Of course, I built new drbdadm
>>>> and drbdsetup commands and copied them to the test system.
>>>>
>>>> I am attaching the console messages at the end of this mail.
>>>>
>>>> Looking at the DRBD code and adding a few debugging messages, it is
>>>> clear that the following mdev (drbd_dev) fields are somehow being
>>>> corrupted or interchanged:
>>>> mdev->conf.my_addr_len = 1
>>>> mdev->conf.other_addr_len = 2
>>>> mdev->conf.this_nodenum = 16
>>>> mdev->conf.other_nodenum = 16
>>>>
>>>> The function "drbd_wait_for_connect" cannot complete "bind"
>>>> successfully and returns error (-22) EINVAL. Because of this, the connect
>>>> fails in the function "drbd_try_connect" (this is expected). The
The >>>> reason for bind failure is some how address is not proper and it is >>>> happening because of "my_addr_len & other_addr_len " is having 1 and 2 >>>> respectively instead of 16(This is proper length). But node number >>>> "this_nodenum & "other_nodenum" is showing 16 instead of 1 and 2 >>>> respectively. Since address len field is 1, during bind it might be >>>> getting first and second character of the address and it would resulted >>>> in bind. >>>> >>>> Once it start working yesterday (After copying new copy of command >>>> drbdadm and drbdsetup), later I copied old command , but it was >>>> working!. So I am not even convinced it is command problem. when I >>>> built >>>> new kernel and DRBD module today , again I am facing same problem >>>> consistently with new and old command. >>>> >>>> Does it any thing to do with race?. Any idea the relationship between >>>> drbdsetup and initialization of drbd?. or anything to do with >>>> alignment? >>>> Any timing issues? My understanding is at this stage of boot up >>>> drbdsetup would not have any problem. >>>> >>>> Your inputs would be very helpful. >>>> >>>> Thanks and regards, >>>> Gopal. >>>> >>>> ======================CONSOLE messg============= >>>> drbd: module cleanup done. >>>> modprobe -k drbd minor_count=1 >>>> drbd: initialised. Version: 0.7.10 (api:77/proto:74) >>>> drbd: SVN Revision: 1743 build by go...@ha..., >>>> 2005-05-27 12:58:01 >>>> drbd: registered as block device major 147 >>>> Starting DRBD resource: >>>> drbd0: resync bitmap: bits=1151992 words=36000 >>>> drbd0: size = 4499 MB (4607968 KB) >>>> drbd0: 4499 MB marked out-of-sync by on disk bit-map. >>>> drbd0: Found 6 transactions (230 active extents) in activity log. >>>> drbd0: Marked additional 0 KB as out-of-sync based on AL. 
>>>> drbd0: drbdsetup [66084]: cstate Unconfigured --> StandAlone
>>>> drbd0: drbdsetup [66087]: cstate StandAlone --> Unconnected
>>>> drbd0: drbd0_receiver [66088]: cstate Unconnected --> WFConnection
>>>> drbd0: Unabl
>>>> # WARNING: Do not type 'yes' while waiting for DRBD connection
>>>> # unless you know what you are doing! You have been warned!
>>>> # The only exception is when setting up DRBD first time.
>>>> #
>>>> e to bind (-22)
>>>> drbd0: Registering drbd0 with CLMS subsystem
>>>> dr<b4>d0dr: bdd0r:b d0T_ryriencegi tveo r co[6nn60ec88t ]:ag
>>>> caisnta.<t1e >WUnFCaoblnene tcot iohan nd--le> kUencrnonelne cNtUeLLd
>>>> pointer dereference at virtual address 00000008
>>>> printing eip:
>>>> c039ae5c
>>>> *pde = 00000000
>>>> Oops: 0000 [#1]
>>>> SMP
>>>> Modules linked in: drbd sd_mod sym53c8xx scsi_transport_spi cciss
>>>> scsi_mod tg3 eepro100 e100 mii
>>>> CPU: 1
>>>> EIP: 0060:[<c039ae5c>] Not tainted VLI
>>>> EFLAGS: 00010246 (2.6.10-ssi-686-smp)
>>>> EIP is at sock_recvmsg+0xac/0xf0
>>>> eax: f7036f20 ebx: 00000000 ecx: f71157f0 edx: 00000008
>>>> esi: 00004100 edi: f7036e74 ebp: f7036f00 esp: f7036e10
>>>> ds: 007b es: 007b ss: 0068
>>>> Process drbd0_receiver (pid: 66088, threadinfo=f7036000 task=f71157f0)
>>>> Stack: 000000fc 00000000 f7036e7c ffffffff c0465060 20008790 36303838
>>>> 00004100
>>>> 00000008 00000000 00000000 00000000 f7036f20 f712a040 f7036e60
>>>> c03db8da
>>>> f712a040 f712a150 f712a040 f712a040 00000282 c03e108a f712a040
>>>> f712a040
>>>> Call Trace:
>>>> [<c010687f>] show_stack+0x7f/0xa0
>>>> [<c0106a34>] show_registers+0x164/0x220
>>>> [<c0106dc4>] die+0xf4/0x1c0
>>>> [<c011f325>] do_page_fault+0x375/0x695
>>>> [<c01064d3>] error_code+0x2b/0x30
>>>> [<f898e0e9>] drbd_recv+0x89/0x190 [drbd]
>>>> [<f898e99a>] drbd_recv_header+0x2a/0xf0 [drbd]
>>>> [<f8991f2c>] drbdd+0x2c/0x160 [drbd]
>>>> [<f8992b18>] drbdd_init+0x78/0x410 [drbd]
>>>> [<f8998e3e>] drbd_thread_setup+0x7e/0xf0 [drbd]
>>>> [<c01022e5>] kernel_thread_helper+0x5/0x10
>>>> Code: 85 24 ff ff ff 89 45 e8 31 c0 89 85 3c ff ff ff 8b 45 0c 89 95 30
>>>> ff ff ff 89 9d 34 ff ff ff 89 85 40 ff ff ff 89 b5 2c ff ff ff <8b> 43
>>>> 08 89 74 24 10 89 54 24 0c8b 55 0c 89 5c 24 04 89 3c 24
>>>>
>>>> Entering kdb (current=0xf71157f0, pid 66088) on processor 1 Oops: Oops
>>>> due to oops @ 0xc039ae5c
>>>> eax = 0xf7036f20 ebx = 0x00000000 ecx = 0xf71157f0 edx = 0x00000008
>>>> esi = 0x00004100 edi = 0xf7036e74 esp = 0xf7036e10 eip = 0xc039ae5c
>>>> ebp = 0xf7036f00 xss = 0xc0390068 xcs = 0x00000060 eflags = 0x00010246
>>>> xds = 0x0000007b xes = 0x0000007b origeax = 0xffffffff &regs =
>>>> 0xf7036ddc
>>>> [1]kdb>
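
Editor's illustration (not part of the original thread): the field values reported in the quoted message (my_addr_len = 1, other_addr_len = 2, this_nodenum = 16, other_nodenum = 16) are what one would see if a writer and a reader disagreed about the order of four integers, which fits Roger's remark about the order of arguments expected by drbdsetup. The sketch below only demonstrates that general effect. The four-integer layout, the two field orders, and the helper names pack_config and unpack_config are assumptions made up for this example; they are not the real DRBD configuration structure or the drbdadm/drbdsetup interface.

```python
import struct

# Hypothetical layout: four native-endian 32-bit ints.
# This is NOT the real DRBD config struct, only an illustration.
LAYOUT = "=4i"

# Order in which the (hypothetical) writer emits the fields.
WRITER_FIELDS = ("my_addr_len", "other_addr_len", "this_nodenum", "other_nodenum")

# Order in which the (hypothetical) reader expects them.
READER_FIELDS = ("this_nodenum", "other_nodenum", "my_addr_len", "other_addr_len")


def pack_config(values):
    """Serialize the fields in the writer's order."""
    return struct.pack(LAYOUT, *(values[name] for name in WRITER_FIELDS))


def unpack_config(blob):
    """Deserialize the same bytes using the reader's order."""
    return dict(zip(READER_FIELDS, struct.unpack(LAYOUT, blob)))


# Values the sender intended: 16-byte socket addresses, node numbers 1 and 2.
intended = {
    "my_addr_len": 16,
    "other_addr_len": 16,
    "this_nodenum": 1,
    "other_nodenum": 2,
}

# What the reader ends up with after the order mismatch.
seen = unpack_config(pack_config(intended))

if __name__ == "__main__":
    for name in WRITER_FIELDS:
        print(f"{name}: intended={intended[name]} seen={seen[name]}")
```

With these assumed orders, the reader sees address lengths of 1 and 2 and node numbers of 16 and 16, the same pattern as in the debugging output above. An address length that small would also explain the EINVAL from bind. Whether the real bug followed this path is for the drbdsetup and drbdadm patches mentioned in the thread to confirm.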