Hi everyone,
This list has been quite dormant, so I am forwarding this email, which
discusses the NTFS TNG driver a bit, in the hope of stimulating some
discussion on this list. (-:
At 20:06 11/11/2001, Jeff Garzik <jg...@ma...> wrote:
>At the moment I'm looking at ntfs-TNG... You should check out zisofs
>in 2.4.15-pre2... zisofs_readpage handles the case where compression
>causes you to read in more pages than just the one requested. It
>injects the other pages into the page cache along with the one
>requested.
>
>Just something to think about :)
Yes, I know. Al Viro and HPA pointed that out to me already on linux-fsdevel
when I asked which fs exist that support transparent compression and use
the page cache.
I don't have problems injecting pages. I do that already if you look in TNG
mft.c::map_mft_record_for_read(). I do that there from the ground up so to
speak (I allocate the page and do all the magic myself rather than calling
grab_cache_page_no_wait()). I guess the zisofs approach is cleaner because
it is less complex, and I might simplify my function along those lines some
day. For the moment it works, so there is no point in wasting time making it
look nicer.
>Also, two questions:
I am assuming we are talking TNG driver in all the below...
>1) for the paranoid, is it ok to perform fixup-on-write (create an
>update sequence) always? ie. enable fixups for all inodes on write, if
>they do not already use fixups. I was thinking of that as an option,
>defaulting to off.
It's impossible they don't use fixups. By definition all inodes (MFT
records) and all INDEX_BLOCKS ($INDEX_ALLOCATION attribute data in other
words), MUST be protected by fixups.
If you look in the above mentioned map_mft_record_for_read() you will see
that it reads in all mft records belonging to a page and it uses its own
end_async_io() handler, which in turn removes the fixups via
post_read_mst_fixup() before unlocking the page.
Thus anyone calling map_mft_record_for_read() is guaranteed to see data
clean from fixups.
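To make the fixup mechanics concrete, here is a minimal user-space sketch of
applying and removing the update sequence protection on one record. It is not
the TNG code: the names sketch_pre_write_fixup() and sketch_post_read_fixup()
are made up for illustration, the 512-byte sector size and the header offsets
(usa_ofs at byte 4, usa_count at byte 6) follow the usual NTFS record layout,
and skipping 0 when the sequence number wraps is just a choice made here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SIZE 512u

/* Read/write a little-endian 16-bit value at byte offset ofs. */
static uint16_t get_le16(const uint8_t *p, size_t ofs)
{
	return (uint16_t)(p[ofs] | (p[ofs + 1] << 8));
}

static void put_le16(uint8_t *p, size_t ofs, uint16_t v)
{
	p[ofs] = (uint8_t)(v & 0xff);
	p[ofs + 1] = (uint8_t)(v >> 8);
}

/*
 * Sanity check the update sequence array (USA) described by the record
 * header. Returns the number of protected sectors, or 0 if invalid.
 */
static size_t usa_sectors(const uint8_t *rec, size_t size, size_t *usa_ofs)
{
	size_t ofs, count;

	if (size < SECTOR_SIZE || size % SECTOR_SIZE)
		return 0;
	ofs = get_le16(rec, 4);		/* usa_ofs */
	count = get_le16(rec, 6);	/* usa_count: 1 + number of sectors */
	if (count < 2 || count - 1 != size / SECTOR_SIZE)
		return 0;
	/* The USA must sit after the header and before the first sector tail. */
	if (ofs < 8 || ofs % 2 || ofs + 2 * count > SECTOR_SIZE - 2)
		return 0;
	*usa_ofs = ofs;
	return count - 1;
}

/* Before writing: save each sector's last two bytes, stamp the new number. */
int sketch_pre_write_fixup(uint8_t *rec, size_t size)
{
	size_t usa_ofs, n, i;
	uint16_t usn;

	n = usa_sectors(rec, size, &usa_ofs);
	if (!n)
		return -1;
	usn = (uint16_t)(get_le16(rec, usa_ofs) + 1);
	if (usn == 0)			/* assumption: never use 0 */
		usn = 1;
	put_le16(rec, usa_ofs, usn);
	for (i = 0; i < n; i++) {
		size_t tail = (i + 1) * SECTOR_SIZE - 2;

		put_le16(rec, usa_ofs + 2 * (i + 1), get_le16(rec, tail));
		put_le16(rec, tail, usn);
	}
	return 0;
}

/* After reading: verify every sector tail, then restore the saved bytes. */
int sketch_post_read_fixup(uint8_t *rec, size_t size)
{
	size_t usa_ofs, n, i;
	uint16_t usn;

	n = usa_sectors(rec, size, &usa_ofs);
	if (!n)
		return -1;
	assert(usa_ofs + 2 * (n + 1) <= SECTOR_SIZE - 2);
	usn = get_le16(rec, usa_ofs);
	for (i = 0; i < n; i++)		/* a mismatch means a torn write */
		if (get_le16(rec, (i + 1) * SECTOR_SIZE - 2) != usn)
			return -1;
	for (i = 0; i < n; i++)
		put_le16(rec, (i + 1) * SECTOR_SIZE - 2,
			 get_le16(rec, usa_ofs + 2 * (i + 1)));
	return 0;
}
```

The read side checks all sector tails before touching anything, so a record
that fails verification is left exactly as it came off the disk.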
In the commit to disk path, a dirty page will be locked, marked not
uptodate and all mft records/inodes will have their mrec_lock semaphores
downed for writing (not necessarily in this order obviously, the order has
to match the one used by map_mft_record_for_read/write() so it doesn't
deadlock...), then the fixups will be applied via pre_write_mst_fixup(),
the page written synchronously, the fast __post_read_fixup() function will
remove the fixups we just applied again, the page will be marked uptodate
and unlocked and the commit to disk is done.
The only downside to this I can see is that we always flush to disk all mft
records in a dirty page even if only one of them is dirty. I suppose this
could be handled more nicely if we have a dirty bit in each mft
record/inode as well, so we only take the mrec_lock semaphore for writing
if the record is marked dirty in the first place. But not taking it could
lead to races, because we would have to mark the page not uptodate and lock
it, so having unlocked mft records in it might kill us somewhere (maybe,
haven't thought about it enough yet).
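The bookkeeping side of the per-record dirty bit could look roughly like the
sketch below. Everything in it is hypothetical (struct sketch_mft_page, the
function names, and the assumption of four 1 KiB records per 4 KiB page), and
it deliberately ignores locking, which is exactly the part that still needs
thought.

```c
#include <assert.h>
#include <stdint.h>

#define RECORDS_PER_PAGE 4u	/* assumption: 4 KiB page, 1 KiB mft records */

struct sketch_mft_page {
	uint32_t dirty_mask;	/* bit i set => record i needs writing */
};

void sketch_mark_record_dirty(struct sketch_mft_page *pg, unsigned idx)
{
	assert(pg);
	if (idx < RECORDS_PER_PAGE)
		pg->dirty_mask |= (uint32_t)1 << idx;
}

/*
 * Call write_one() for each dirty record only. A record's bit is cleared
 * when its write succeeds (returns 0). Returns the number written.
 */
unsigned sketch_commit_page(struct sketch_mft_page *pg,
			    int (*write_one)(unsigned idx, void *ctx),
			    void *ctx)
{
	unsigned idx, written = 0;

	assert(pg && write_one);
	for (idx = 0; idx < RECORDS_PER_PAGE; idx++) {
		uint32_t bit = (uint32_t)1 << idx;

		if (!(pg->dirty_mask & bit))
			continue;	/* clean record: skip it entirely */
		if (write_one(idx, ctx) == 0) {
			pg->dirty_mask &= ~bit;
			written++;
		}
	}
	return written;
}

/* Example callback: just notes which records were written in *ctx. */
int sketch_note_write(unsigned idx, void *ctx)
{
	*(uint32_t *)ctx |= (uint32_t)1 << idx;
	return 0;
}
```

In the real commit path write_one() would be where the mrec_lock is taken and
the fixups are applied, so clean records would never be locked at all.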
I have pretty much only been focused on getting reading to work. To finish
it off we need: 1) attribute list support and 2) support for compressed
files. I have been thinking for quite a while of how best to implement
both of these... The reverse engineered functions for attribute list
searches are plain scary (I know I reverse engineered them but I still
don't understand the significance of some of the variables in them). You
can look at the raw reverse engineering results in
linux-ntfs/include/attrib_RE.h and linux-ntfs/libntfs/attrib_RE.c if you like.
I would very much like to have a stateful search context that can support
attribute list searches, just like the reverse engineered functions do, but
it's not straightforward. The current TNG code has just what I would like it
to have
in the find_attr() and find_first_attr() functions in
tng/linux/fs/ntfs/attrib.c. They employ a nice and easy search context and
allow (re)commencing searches from any point. The find_attr() function is a
slightly simplified and optimized result from reverse engineering the
windows NT4 counterpart. If you have a look you will notice how complex the
collation of attributes actually is (all the comparison functions called
are found in unistr.c, again mostly reverse engineered stuff, especially
the comparison "magic" table and functions accessing it).
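For anyone who has not looked at attrib.c, the idea of a resumable search
context can be sketched as follows. This is not the real find_attr()
interface: the struct and function names are invented, attribute names and
collation are left out completely, and the only layout facts relied on are
that each attribute starts with a little-endian 32-bit type followed by a
32-bit length, that attributes are sorted by type, and that 0xffffffff marks
the end.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_AT_END 0xffffffffu

struct sketch_search_ctx {
	const uint8_t *buf;	/* attribute area of one mft record */
	size_t size;		/* number of valid bytes in buf */
	size_t pos;		/* offset of the next attribute to examine */
};

static uint32_t get_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
	       (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

void sketch_init_search(struct sketch_search_ctx *ctx,
			const uint8_t *buf, size_t size)
{
	ctx->buf = buf;
	ctx->size = size;
	ctx->pos = 0;
}

/*
 * Return the next attribute of the given type at or after ctx->pos, or
 * NULL if there is none. The context is advanced past a match, so calling
 * again continues the search instead of restarting it.
 */
const uint8_t *sketch_find_attr(struct sketch_search_ctx *ctx, uint32_t type)
{
	assert(ctx && ctx->buf);
	while (ctx->pos + 4 <= ctx->size) {
		const uint8_t *a = ctx->buf + ctx->pos;
		uint32_t a_type = get_le32(a);
		uint32_t a_len;

		if (a_type == SKETCH_AT_END || ctx->pos + 8 > ctx->size)
			break;
		a_len = get_le32(a + 4);
		/* Reject lengths that would loop forever or overrun. */
		if (a_len < 8 || a_len > ctx->size - ctx->pos)
			break;
		/* Sorted by type: once we are past it, it is not there. */
		if (a_type > type)
			break;
		ctx->pos += a_len;
		if (a_type == type)
			return a;
	}
	return NULL;
}
```

Because a failed search stops at the first attribute with a larger type, a
follow-up search for a later type can carry on from there with the same
context.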
>2) have you seen any code or routines for dealing with the B+tree
>directories? Reading is pretty easy but I want to make sure updating
>directories is optimal.
You mean not counting the lookup() function in TNG? Then I only know of the
windows NTFS driver itself.
The HPFS driver should be using them as well, since NTFS was designed on
the basis of it initially but I haven't had a close look.
I started reverse engineering the windows ones and
tng/linux/fs/ntfs/dir.c::ntfs_lookup_ino_by_name() is the result, but I have
modified the logic heavily as I never finished the reverse engineering. The
functions are horrendously complex and a lot of SoftIce use will be
required to finish them off. They call different collation functions using
function pointer tables, so they look at some values and then call
do_some_special_comparison[magic_index]. I haven't figured out how to
determine which function is used when yet, at least not from just reading
the code.
My lookup function seems to be doing a good job though so it might be good
enough, only time will tell, once we start writing things...
For writing, my idea is to do a directory index lookup() and get returned
either the found value or the location of where to insert it (function will
be mostly identical to what
tng/linux/fs/ntfs/dir.c::ntfs_lookup_ino_by_name() does). Once that is
done, call an index insertion function, which in turn will do the hard work
of creating space in the right position, and if that is not possible it
will make a decision what to do depending on the circumstances or how we
want to implement it... It could either add the entry in a new index block
and just link the next entry to this index block or it could split the
index block in two equal (for some value of equal) parts and then add itself
to the appropriate part or (fill in with your own ideas)...
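To illustrate the lookup-then-insert idea with the split variant, here is a
toy sketch. It is purely hypothetical: a node is a small sorted array of ints
standing in for index entries, all names are invented, and the step of
linking the new block into the parent (or promoting a separator entry) is
left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_NODE_CAP 4	/* tiny on purpose so splits happen quickly */

struct sketch_node {
	size_t count;
	int keys[SKETCH_NODE_CAP];	/* sorted ascending */
};

/*
 * Binary search. Returns 1 and sets *pos to the match, or returns 0 and
 * sets *pos to the position where key would have to be inserted.
 */
int sketch_lookup(const struct sketch_node *n, int key, size_t *pos)
{
	size_t lo = 0, hi = n->count;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (n->keys[mid] == key) {
			*pos = mid;
			return 1;
		}
		if (n->keys[mid] < key)
			lo = mid + 1;
		else
			hi = mid;
	}
	*pos = lo;
	return 0;
}

/* Make room at pos by shifting the tail up one slot, then store key. */
static void insert_at(struct sketch_node *n, size_t pos, int key)
{
	assert(n->count < SKETCH_NODE_CAP && pos <= n->count);
	memmove(&n->keys[pos + 1], &n->keys[pos],
		(n->count - pos) * sizeof n->keys[0]);
	n->keys[pos] = key;
	n->count++;
}

/*
 * Insert key into n. If n is full, its upper half is moved into *right
 * (an empty node supplied by the caller) and key goes into whichever half
 * it belongs to. Returns 0 (no split), 1 (split), or -1 (duplicate).
 */
int sketch_insert(struct sketch_node *n, struct sketch_node *right, int key)
{
	size_t pos, half;

	if (sketch_lookup(n, key, &pos))
		return -1;
	if (n->count < SKETCH_NODE_CAP) {
		insert_at(n, pos, key);
		return 0;
	}
	half = n->count / 2;
	right->count = n->count - half;
	memcpy(right->keys, &n->keys[half],
	       right->count * sizeof n->keys[0]);
	n->count = half;
	if (pos <= half)
		insert_at(n, pos, key);
	else
		insert_at(right, pos - half, key);
	return 1;
}
```

The point is only that the lookup already tells the insertion code where the
entry belongs, so the insert function never has to search again.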
As far as I see this, there is no right or wrong way. It's just a B+ tree
and as long as the order of entries is correct we will work just fine with
the Windows NTFS driver. Once we have it all working we can always switch to
more sophisticated algorithms for dealing with the trees. Maybe we could
balance a tree when we detect it is all over the place for example but that
is all not required. We could always reverse engineer Windows some more if
we want to do the same thing but I don't see the need - and it's a lot of
work for something we don't absolutely have to do...
Anton
--
"I've not lost my mind. It's backed up on tape somewhere." - Unknown
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Linux NTFS Maintainer / WWW: http://linux-ntfs.sf.net/
ICQ: 8561279 / WWW: http://www-stu.christs.cam.ac.uk/~aia21/