Re: [Jfs-discussion] Corrupt JFS root nodes on volumes with 500+ top level directories
From: Tim N. <jfs...@ib...> - 2010-11-17 06:12:25
All,

A quick update on this issue...

Our application creates 1 new top level directory each day and after about 500 days *all* of the servers I've checked have corrupt root nodes. Even more troubling, after we repair a volume by running jfs_fsck and recovering data from lost+found (see below), the problem reoccurs after about a month of creating new directories. However, if no new top level directories are created, and only changes lower down in the hierarchy are made, the problem does not reoccur.

Does anyone have any theories about what is going on here? Is there anything we can do to prevent this from happening? Would moving all the data down one level (e.g. nested in a single root directory) help, or is the root node like any other node and 500+ nested directories at any level too much for JFS?

Because these are older machines, they are all running Debian 4 with a backported 2.6.26 kernel. Is there any chance upgrading to Debian 5 and a newer kernel would help?

Thanks in advance for any help :-)

Tim

On Aug 25, 2010, at 2:16 PM, Tim Nufire wrote:

> Hello,
>
> I've got a problem that I'm hoping someone on this list can help me with...
>
> Read-only fsck.jfs checks on my oldest volumes are reporting an alarming number of corrupted root nodes despite the fact that these volumes appear to be healthy when mounted read-only. Here's the error that I'm getting...
>
> fsck.jfs -n -v /dev/md/10
> fsck.jfs version 1.1.14, 06-Apr-2009
> processing started: 8/13/2010 10.9.6
> The current device is: /dev/md/10
> Open(...READONLY...) returned rc = 0
> Primary superblock is valid.
> The type of file system for the device is JFS.
> Block size in bytes: 4096
> Filesystem size in blocks: 4756914448
> **Phase 1 - Check Blocks, Files/Directories, and Directory Entries
> Invalid data format detected in root directory.
> CANNOT CONTINUE.
> ERRORS HAVE BEEN DETECTED. Run fsck with the -f parameter to repair.
> processing terminated: 8/13/2010 10:10:05 with return code: 10062 exit code: 4.
>
> Despite the catastrophic-sounding error above, mounting the file system read-only and listing the directory from the command line works fine...
>
> ls
> 20090110 20090303 20090418 20090605 20090721 20090914 20091030 20091215 20100130 20100317 20100502 20100617
> 20090111 20090304 20090419 20090606 20090722 20090915 20091031 20091216 20100131 20100318 20100503 20100618
> 20090113 20090305 20090420 20090607 20090723 20090916 20091101 20091217 20100201 20100319 20100504 20100619
> 20090114 20090306 20090421 20090608 20090724 20090917 20091102 20091218 20100202 20100320 20100505 20100620
> 20090115 20090307 20090422 20090609 20090725 20090918 20091103 20091219 20100203 20100321 20100506 20100622
> 20090116 20090308 20090423 20090610 20090727 20090919 20091104 20091220 20100204 20100322 20100507 20100623
> 20090117 20090309 20090424 20090611 20090728 20090920 20091105 20091221 20100205 20100323 20100508 20100624
> 20090118 20090310 20090425 20090612 20090729 20090921 20091106 20091222 20100206 20100324 20100509 20100625
> 20090119 20090311 20090426 20090613 20090730 20090922 20091107 20091223 20100207 20100325 20100510 20100626
> 20090120 20090312 20090427 20090614 20090731 20090923 20091108 20091224 20100208 20100326 20100511 20100627
> 20090121 20090313 20090428 20090615 20090801 20090924 20091109 20091225 20100209 20100327 20100512 20100628
> 20090122 20090314 20090429 20090616 20090802 20090925 20091110 20091226 20100210 20100328 20100513 20100629
> 20090123 20090315 20090430 20090617 20090803 20090926 20091111 20091227 20100211 20100329 20100514 20100630
> 20090126 20090316 20090501 20090618 20090804 20090927 20091112 20091228 20100212 20100330 20100515 20100701
> 20090127 20090317 20090502 20090619 20090805 20090928 20091113 20091229 20100213 20100331 20100516 20100702
> 20090128 20090318 20090503 20090620 20090809 20090929 20091114 20091230 20100214 20100401 20100517 20100703
> 20090129 20090319 20090504 20090621 20090810 20090930 20091115 20091231 20100215 20100402 20100518 20100704
> 20090130 20090320 20090505 20090622 20090811 20091001 20091116 20100101 20100216 20100403 20100519 20100705
> 20090202 20090321 20090506 20090623 20090812 20091002 20091117 20100102 20100217 20100404 20100520 20100706
> 20090204 20090322 20090507 20090624 20090813 20091003 20091118 20100103 20100218 20100405 20100521 20100707
> 20090205 20090323 20090508 20090625 20090814 20091004 20091119 20100104 20100219 20100406 20100522 20100708
> 20090206 20090324 20090509 20090626 20090815 20091005 20091120 20100105 20100220 20100407 20100523 20100709
> 20090207 20090325 20090510 20090627 20090816 20091006 20091121 20100106 20100221 20100408 20100524 20100710
> 20090208 20090326 20090511 20090628 20090817 20091007 20091122 20100107 20100222 20100409 20100525 20100711
> 20090209 20090327 20090512 20090629 20090818 20091008 20091123 20100108 20100223 20100410 20100526 20100712
> 20090210 20090328 20090513 20090630 20090819 20091009 20091124 20100109 20100224 20100411 20100527 20100713
> 20090211 20090329 20090514 20090701 20090820 20091010 20091125 20100110 20100225 20100412 20100528 20100714
> 20090212 20090330 20090515 20090702 20090821 20091011 20091126 20100111 20100226 20100413 20100529 20100715
> 20090213 20090331 20090516 20090703 20090822 20091012 20091127 20100112 20100227 20100414 20100530 20100716
> 20090214 20090401 20090517 20090704 20090823 20091013 20091128 20100113 20100228 20100415 20100531 20100717
> 20090215 20090402 20090518 20090705 20090824 20091014 20091129 20100114 20100301 20100416 20100601 20100718
> 20090216 20090403 20090519 20090706 20090825 20091015 20091130 20100115 20100302 20100417 20100602 20100719
> 20090217 20090404 20090520 20090707 20090826 20091016 20091201 20100116 20100303 20100418 20100603 20100720
> 20090218 20090405 20090521 20090708 20090827 20091017 20091202 20100117 20100304 20100419 20100604 20100721
> 20090219 20090406 20090522 20090709 20090828 20091018 20091203 20100118 20100305 20100420 20100605 20100722
> 20090220 20090407 20090523 20090710 20090901 20091019 20091204 20100119 20100306 20100421 20100606 20100723
> 20090221 20090408 20090524 20090711 20090902 20091020 20091205 20100120 20100307 20100422 20100607 20100724
> 20090222 20090409 20090527 20090712 20090903 20091021 20091206 20100121 20100308 20100423 20100608 20100725
> 20090223 20090410 20090528 20090713 20090904 20091022 20091207 20100122 20100309 20100424 20100609 20100726
> 20090224 20090411 20090529 20090714 20090905 20091023 20091208 20100123 20100310 20100425 20100610 20100727
> 20090225 20090412 20090530 20090715 20090906 20091024 20091209 20100124 20100311 20100426 20100611 20100728
> 20090226 20090413 20090531 20090716 20090907 20091025 20091210 20100125 20100312 20100427 20100612 20100729
> 20090227 20090414 20090601 20090717 20090908 20091026 20091211 20100126 20100313 20100428 20100613 mount_check
> 20090228 20090415 20090602 20090718 20090909 20091027 20091212 20100127 20100314 20100429 20100614
> 20090301 20090416 20090603 20090719 20090912 20091028 20091213 20100128 20100315 20100430 20100615
> 20090302 20090417 20090604 20090720 20090913 20091029 20091214 20100129 20100316 20100501 20100616
>
> Running fsck.jfs read-write reinitializes the root node and moves all of its former contents into lost+found. I can recover the data from lost+found so this is not fatal but still something I would like to fix/avoid.
>
> I have not repaired the above volume yet but have repaired others... Here's the fsck.jfs output for a read-write repair on a volume that had the same errors as those described above.
>
> fsck.jfs -v /dev/md10
> fsck.jfs version 1.1.14, 06-Apr-2009
> processing started: 4/23/2010 4.32.24
> Using default parameter: -p
> The current device is: /dev/md10
> Open(...READ/WRITE EXCLUSIVE...) returned rc = 0
> Primary superblock is valid.
> The type of file system for the device is JFS.
> Block size in bytes: 4096
> Filesystem size in blocks: 4756914448
> **Phase 0 - Replay Journal Log
> LOGREDO: Log record for Sync Point at: 0x05774f34
> LOGREDO: Beginning to update the Inode Allocation Map.
> LOGREDO: Done updating the Inode Allocation Map.
> LOGREDO: Beginning to update the Block Map.
> LOGREDO: Incorrect leaf index detected (k=(d) 0, j=(d) 0, idx=(d) 0) while writing Block Map.
> LOGREDO: Write Block Map control page failed in UpdateMaps().
> LOGREDO: Unable to update map(s).
> logredo failed (rc=-231). fsck continuing.
> **Phase 1 - Check Blocks, Files/Directories, and Directory Entries
> Root directory has a corrupt tree.
> Initialized tree created for root directory.
> The root directory has an invalid data format. Will correct.
> **Phase 2 - Count links
> **Phase 3 - Duplicate Block Rescan and Directory Connectedness
> **Phase 4 - Report Problems
> **Phase 5 - Check Connectivity
> **Phase 6 - Perform Approved Corrections
> Superblock marked dirty because repairs are about to be written.
> No \lost+found directory found in the filesystem.
> Directory inode 18661404 has been reconnected to /lost+found/.
> Directory inode 18637982 has been reconnected to /lost+found/.
> Directory inode 18614880 has been reconnected to /lost+found/.
> Directory inode 18595359 has been reconnected to /lost+found/.
> Directory inode 18581312 has been reconnected to /lost+found/.
> Directory inode 18556038 has been reconnected to /lost+found/.
> .
> .
> .
> Directory inode 448971 has been reconnected to /lost+found/.
> File inode 443531 has been reconnected to /lost+found/.
> Directory inode 442414 has been reconnected to /lost+found/.
> .
> .
> .
> Directory inode 2320 has been reconnected to /lost+found/.
> Directory inode 101 has been reconnected to /lost+found/.
> Directory inode 32 has been reconnected to /lost+found/.
> 622 directories reconnected to /lost+found/.
> 1 file reconnected to /lost+found/.
> **Phase 7 - Rebuild File/Directory Allocation Maps
> **Phase 8 - Rebuild Disk Allocation Maps
> **Phase 9 - Reformat File System Log
> logformat returned rc = 0
> Filesystem Summary:
> Blocks in use for inodes: 2276956
> Inode count: 18215648
> File count: 16453081
> Directory count: 1529882
> Block count: 4756914448
> Free block count: 655162544
> 19027657792 kilobytes total disk space.
> 6342069 kilobytes in 1529882 directories.
> 16397493672 kilobytes in 16453081 user files.
> 0 kilobytes in extended attributes
> 0 kilobytes in access control lists
> 15856013 kilobytes reserved for system use.
> 2620650176 kilobytes are available for use.
> Filesystem is clean.
> All observed inconsistencies have been repaired.
> Filesystem has been marked clean.
> **** Filesystem was modified. ****
> processing terminated: 4/23/2010 9:08:55 with return code: 0 exit code: 1.
>
> This problem appears to be related to age and/or the number of directories in the root node. It's hard to distinguish between these two attributes in our environment because the root node of each data volume contains one directory for each day the volume has been in use. The tipping point appears to be around 500 days/directories.
>
> Is this a known issue? Is there really a problem with the root node or does fsck.jfs have an analysis bug? In any event, since the OS can list the contents of the root node, fsck.jfs should be able to do better than just dumping all the contents into lost+found.
>
> I've also seen corruption in my allocation maps which could be related... How can I help debug this further?
>
> Thanks!
>
> Tim
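
For anyone scripting the read-only checks described in this thread across many servers, here is a minimal Python sketch that scans saved fsck.jfs output for the root-directory errors and pulls out the codes from the "processing terminated" line. It is not part of jfsutils: the function name `summarize_fsck_log` is hypothetical, the message strings are copied from the fsck.jfs 1.1.14 output quoted above, and other versions may word things differently. The sketch only extracts the numbers; it does not try to interpret what a given return code means.

```python
import re

# Message strings copied verbatim from the fsck.jfs 1.1.14 output quoted in
# this thread. Assumption: other jfsutils versions may use different wording.
ROOT_ERRORS = (
    "Invalid data format detected in root directory.",
    "Root directory has a corrupt tree.",
)

# Matches the tail of the "processing terminated: ..." line.
TERMINATED = re.compile(r"with return code:\s*(-?\d+)\s+exit code:\s*(\d+)")


def summarize_fsck_log(text):
    """Return (root_error_seen, return_code, exit_code) for one saved log.

    The two codes are None if no "processing terminated" line is found.
    """
    root_error = any(msg in text for msg in ROOT_ERRORS)
    match = TERMINATED.search(text)
    if match is None:
        return root_error, None, None
    return root_error, int(match.group(1)), int(match.group(2))


# Excerpt of the read-only check output quoted in the original message.
sample = """\
**Phase 1 - Check Blocks, Files/Directories, and Directory Entries
Invalid data format detected in root directory.
CANNOT CONTINUE.
ERRORS HAVE BEEN DETECTED. Run fsck with the -f parameter to repair.
processing terminated: 8/13/2010 10:10:05 with return code: 10062 exit code: 4.
"""

print(summarize_fsck_log(sample))  # (True, 10062, 4)
```

Feeding it the output of `fsck.jfs -n -v` captured from each volume would give a quick list of which ones already report the root-directory error, without having to read every log by hand.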