Re: [Jfs-discussion] JFS-Questions
From: Dave K. <sh...@au...> - 2003-09-22 22:26:01
Sorry it's taken me this long to respond.

On Tue, 2003-09-16 at 22:57, mt...@am... wrote:
> Hello Jfs-discussion,
>
> I've been using JFS for quite a while and have gained some experience
> to share, to help estimate how critical (data-loss-wise) the observed
> problems are.
>
> Systems used: i386 with SuSE 7.3, 8.0, 8.1 and 8.2
>
> - Deleted file space (several gigs) is not made available, even after
>   repeated syncs; fsck-ing/remounting/rebooting helped. Problem: it
>   required a system reboot because the affected filesystem was the
>   root filesystem.

I introduced this problem with a performance enhancement that went a
little too far in avoiding too-frequent writes to the journal. I have
some ideas about how to fix it without losing the performance gain, but
I haven't fixed it yet.

> - A partition was checked as clean on restart after a crash;
>   nonetheless the filesystem had errors. It required a manual, forced
>   full JFS check to get rid of them.
>   -> How reliable is the clean/dirty detection really, and what could
>   make it fail?

The clean/dirty detection by design assumes that the partition is clean
after replaying the journal, unless an error is detected. Otherwise,
rebooting after a power loss or crash would not be fast. However, there
are several places in the code where JFS sees a problem and doesn't
handle it properly. We have some work in progress to have JFS correctly
mark the superblock dirty in these instances, which would force fsck to
check the whole partition.

> - After an HD sector failure I was able to rescue most of the disk's
>   sectors with dd_rescue, but not all sectors could be copied
>   correctly.
>   -> Does JFS employ block/sector checksums in order to be able to
>   detect integrity errors (or is this possible)? I was worried about
>   which files contained data from erroneous sectors.
> Fortunately the system survived the failure quite well and no
> essential files seem to have been damaged. Nonetheless I asked myself
> what would happen if it went worse: how could I detect which files
> were OK and which were not?

No, JFS has no mechanisms in place to ensure data integrity. The main
design goal was to ensure metadata integrity, allowing quick recovery
after a crash or power failure. It is recommended that important data
be backed up periodically.

> - A friend of mine lost all data on one partition due to "invalid
>   superblocks" - fsck.jfs printed this message and gave up... Great -
>   all sectors still contained valid files, data, etc. Maybe it was
>   some geometry offset due to changing hardware.
>   -> This is what worries me most - is there a practical solution to
>   recover data if the superblocks get corrupted or slightly
>   byte-offset (+/-1 errors)?
>   -> For example: is it possible to scan all sectors of the partition
>   and (at least partially) reconstruct the filesystem? (When
>   hex-dumping the partition header, it showed a normal-looking JFS
>   signature and following blocks.)

JFS was designed to be able to recover from any single point of failure
in the metadata. If both the primary and secondary superblocks are
lost, it is probable that other important metadata is lost as well. It
is unlikely that, under normal circumstances, both superblocks would be
lost. I am curious how this happened.

> - A similar problem to the one above happened to me - a disk which I
>   had formatted, set up, and used in a removable bay refused to mount
>   when connected via an external interface (FireWire - but
>   hex-dumping the sector contents worked fine!). The disk mounted
>   fine when it was placed back into the internal bay again... Really
>   weird! (Note: the /etc/fstab entries were correct :))
>   Formatting the partition while hooked up over FireWire seems to
>   have solved the problem.

I don't have a clue here.
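Since JFS stores no per-block checksums, detecting which files picked
up bad sectors after a rescue has to be done above the filesystem. A
minimal sketch, assuming a digest listing was captured before the
failure (e.g. by a periodic backup job; the function names here are
illustrative, not part of any JFS tooling):

```python
import hashlib
import os

def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    manifest = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            manifest[os.path.relpath(path, root)] = digest
    return manifest

def damaged_files(before, after):
    """Paths whose digest changed, or which disappeared entirely,
    between two manifests."""
    return sorted(p for p, d in before.items() if after.get(p) != d)
```

Running `build_manifest` periodically and diffing after an incident
with `damaged_files` narrows inspection down to the files that actually
changed.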
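On the question of scanning the partition: the JFS superblock begins
with the ASCII magic "JFS1", with the primary copy conventionally at
byte offset 0x8000 (32768), so a raw image can at least be searched for
superblock candidates before deciding whether reconstruction is
feasible. A rough sketch, assuming a readable raw image of the
partition (the function name is illustrative):

```python
JFS_MAGIC = b"JFS1"   # s_magic of the JFS superblock
SECTOR = 512          # scan granularity; superblocks are sector-aligned

def find_superblock_candidates(image_path, limit=None):
    """Scan a raw partition image for sectors beginning with the JFS
    magic, returning their byte offsets.

    On a healthy filesystem the primary superblock conventionally sits
    at offset 0x8000; any other hits may be the secondary superblock or
    stale copies, and a hit at an unexpected offset can hint at the
    kind of geometry shift described above.
    """
    hits = []
    with open(image_path, "rb") as f:
        offset = 0
        while True:
            sector = f.read(SECTOR)
            if not sector:
                break
            if sector[:4] == JFS_MAGIC:
                hits.append(offset)
                if limit and len(hits) >= limit:
                    break
            offset += SECTOR
    return hits
```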
> - fsck once "trashed" my home directory and all the files inside
>   (SuSE 8.2). All files were renamed and dumped into lost+found...
>   -> I don't know how this happened, but it was quite horrible to
>   manually rename and move the files back!
>   Fortunately for me the locate database still held the original
>   names and structure, but it took quite a while to inspect the
>   contents.
>   -> Why accept the overhead of journaling when fsck won't recover
>   the filenames? There should be a way to preserve more of the
>   original information during fsck.

fsck found something wrong in your home directory that it couldn't fix,
so, rather than leaving a broken directory, it removed it and put the
contents in lost+found. Admittedly, fsck should probably do a better
job when it finds a problem, and might be able to avoid losing the
entire directory, but that's the way it works today. The journaling
itself is there to prevent this kind of damage from happening, and it
usually does. However, there are still bugs, and other conditions, that
occasionally cause problems fsck can't fix. Backing up data is always a
good idea.

> Best regards!
>
> PS - my personal JFS feature wishlist:
> - shrinking
> - a built-in encryption layer (as loop-AES would render journaling
>   useless, hence the wish to have it inside the FS)

Hopefully we'll get to these, but they aren't in our short-term plans.
-- 
David Kleikamp
IBM Linux Technology Center
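Renaming lost+found entries back into place can be scripted when a
listing of the original tree survives. The sketch below assumes a
digest manifest (original path -> SHA-256) captured before the crash,
rather than the locate database the poster used; the function and
parameter names are illustrative:

```python
import hashlib
import os

def restore_names(lostfound_dir, manifest, dest_root):
    """Move files out of lost+found back to their recorded paths.

    `manifest` maps original relative paths to SHA-256 digests. Files
    whose content matches a recorded digest are moved under
    `dest_root`; everything else is left in lost+found for manual
    inspection.
    """
    by_digest = {d: p for p, d in manifest.items()}
    restored = []
    for name in os.listdir(lostfound_dir):
        src = os.path.join(lostfound_dir, name)
        if not os.path.isfile(src):
            continue
        with open(src, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        rel = by_digest.get(digest)
        if rel is None:
            continue  # no match; leave it for a human to look at
        dst = os.path.join(dest_root, rel)
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        os.rename(src, dst)
        restored.append(rel)
    return sorted(restored)
```

Note this matches by content only, so identical files are
interchangeable and renamed files cannot be distinguished; it is a
recovery aid, not a substitute for backups.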