From: Thomas S H. <tha...@gm...> - 2011-03-23 05:10:35
I am having some trouble with a chunkserver: it errors out, then stops working and reports 0% on the mfs cgi page.

Here is the error in the logs:

2011-03-22T16:18:38+00:00 node10 mfschunkserver[25905]: set gid to 70003
2011-03-22T16:18:38+00:00 node10 mfschunkserver[25905]: set uid to 70003
2011-03-22T16:18:38+00:00 node10 mfschunkserver[25783]: closing 172.11.1.110:9422
2011-03-22T16:18:47+00:00 node10 mfschunkserver[25905]: main server module: listen on 172.11.1.110:9422
2011-03-22T16:18:48+00:00 node10 mfschunkserver[25905]: connecting ...
2011-03-22T16:18:48+00:00 node10 mfschunkserver[25905]: stats file has been loaded
2011-03-22T16:18:48+00:00 node10 mfschunkserver[25905]: open files limit: 10000
2011-03-22T16:18:48+00:00 node10 mfschunkserver[25905]: connected to Master
2011-03-22T16:18:52+00:00 node10 mfschunkserver[25905]: testing chunk: /mnt/moose1/11/chunk_00000000019A9511_00000001.mfs
2011-03-22T16:18:52+00:00 node10 mfschunkserver[25905]: chunk_readcrc: file:/mnt/moose1/11/chunk_00000000019A9511_00000001.mfs - wrong id/version in header (00000000019A9511_00000000)
2011-03-22T16:18:52+00:00 node10 mfschunkserver[25905]: hdd_io_begin: file:/mnt/moose1/11/chunk_00000000019A9511_00000001.mfs - read error: Unknown error
2011-03-22T16:19:02+00:00 node10 mfschunkserver[25905]: testing chunk: /mnt/moose1/4A/chunk_0000000001EFFE4A_00000001.mfs
2011-03-22T16:19:03+00:00 node10 mfschunkserver[25905]: chunk_readcrc: file:/mnt/moose1/4A/chunk_0000000001EFFE4A_00000001.mfs - wrong id/version in header (0000000001EFFE4A_00000000)
2011-03-22T16:19:03+00:00 node10 mfschunkserver[25905]: hdd_io_begin: file:/mnt/moose1/4A/chunk_0000000001EFFE4A_00000001.mfs - read error: Unknown error
2011-03-22T16:19:13+00:00 node10 mfschunkserver[25905]: testing chunk: /mnt/moose1/83/chunk_0000000001776783_00000001.mfs
2011-03-22T16:19:13+00:00 node10 mfschunkserver[25905]: chunk_readcrc: file:/mnt/moose1/83/chunk_0000000001776783_00000001.mfs - wrong id/version in header (0000000001776783_00000000)
2011-03-22T16:19:13+00:00 node10 mfschunkserver[25905]: hdd_io_begin: file:/mnt/moose1/83/chunk_0000000001776783_00000001.mfs - read error: Unknown error
2011-03-22T16:19:13+00:00 node10 mfschunkserver[25905]: 3 errors occurred in 60 seconds on folder: /mnt/moose1/
2011-03-22T16:19:15+00:00 node10 mfschunkserver[25905]: replicator: hdd_create status: 21

What do these errors mean? And what is the best way to recover?

If worse comes to worst, we of course have replicated chunks, so we can format the chunkserver and start it back up, but I am very curious how best to approach the situation.

-Thomas S Hatch
From: Michal B. <mic...@ge...> - 2011-03-23 12:13:13
Hi Thomas!

You have bad chunk headers (but we don't know why). You can just erase the bad chunks, or temporarily change these constants:

#define LASTERRSIZE 3
#define LASTERRTIME 60

to:

#define LASTERRSIZE 10
#define LASTERRTIME 1

in the mfschunkserver/hddspacemgr.c file, recompile the chunkserver (CS), and run it again. The CS will then stop "unlinking" the disks and will remove the bad chunks by itself.

Regards
-Michal
From: Thomas S H. <tha...@gm...> - 2011-03-23 14:48:32
Thanks Michal!

We were having some hardware issues on the node, and I suspect that this is a residual problem. I will give your suggestion a try!
From: Michal B. <mic...@ge...> - 2011-03-23 16:04:16
So if you know something happened on the hardware side, just delete the broken chunks.

Regards
Michal
From: Thomas S H. <tha...@gm...> - 2011-03-23 16:07:11
Yep, that worked! Thanks!