Thread: Re: [Lse-tech] [PATCH] New dcache / inode hash tuning patch

lse-tech

Re: [Lse-tech] [PATCH] New dcache / inode hash tuning patch

From: Paul M. <Pau...@us...> - 2003-02-27 14:10:13

> On Wed, Feb 26, 2003 at 10:23:09AM -0800, Martin J. Bligh wrote:
> > It actually seems a fraction slower (see systimes for Kernbench-16,
> > for instance).
>
> Can you play a bit with the hash table sizes? Perhaps double the
> dcache hash and half the inode hash ?
>
> I suspect it really just needs a better hash function. I'll cook
> something up based on FNV hash.

One other possibility here...  On x86, you have 4-byte pointers,
which gives you 8 such pointers per 32-byte cache line.  This means
that an insertion at the head of one hash chain will force an L2
miss for any other CPU attempting to traverse any of these 8 chains.

Now, if there is per-CPU locality of reference (and there might
or might not be, depending on the workload, the filesystem layout,
and the hash function), then inserting the new element after the
first element in the chain would cause only those CPUs traversing
that particular chain to see the L2 miss.

Just a thought...

                                    Thanx, Paul

[Lse-tech] Re: [PATCH] New dcache / inode hash tuning patch

From: Andi K. <ak...@su...> - 2003-02-28 08:34:28

tor...@tr... (Linus Torvalds) writes:

[please read the whole post before answering. thanks]

>    But quite frankly, if the hash list heads are actually noticeable
>    memory users, the hash is likely to be _way_ too big. The list heads
>    are _much_ smaller than the entries they point to, the hash list just
>    shouldn't be big enough that it really matters.

On big machines it is currently 1+MB. IMHO this is way too big.

> 
>  - hash list cache footprint
> 
>    Again: the hash head array itself is at least dense in the cache, and
>    each entry is _much_ smaller than the actual data structures it
>    points to.  So even if you improve the hash heads to be better from a
>    cache standpoint, you're only getting a very small percentage of the
>    real cache costs. 

It would be possible to cache line optimize the layout of struct dentry
in addition. May be an interesting add-on project for someone...
But for lookup walking even one cache line - the one containing d_hash -
should be needed. Unless d_hash is unlucky enough to cross a cache
line for its two members ... but I doubt that.

> 
>    So let's say that the cache costs of the dcache is 4% (according to
>    the oprofile run), then saving a few procent of that is not actually
>    going to be noticeable at a user level.
> 
> And the downsides of the hash list is that addition/removal is costlier
> due to the conditionals, and a non-conditional version (a common

But the list walking is faster.  Testing for NULL generates much better 
code on i386 than having to dedicate a register for storing the head
to test against. List walking happens more often than insertion/deletion.

I believe the conditionals are completely left in the noise compared
to the cache misses the two pointer head version causes. You can execute
a lot of conditionals in the time needed to serve one cache miss!

Please take a look at the x86 assembly generated by list_for_each
vs hlist_for_each. hlist_for_each looks much nicer, especially when you
can use the register normally wasted on the head for something else
in the loop body.

In case of dcache rcu it also made things simpler/faster because it didn't
require the complicated is_bucket race breaker check.

> In other words: it may be that our current dentry hashes are too big,
> and that is certainly worth fixing if so.  But the "hlist" approach very
> fundamentally cannot really help the _real_ problem very much, and it
> will (slightly) hurt the case where the hashes are actually cached. 

I admit it is a kind of micro optimization, but I believe it is an useful
one. Frankly wasting two pointers for a hash bucket in a potentially
big hash table is just s*d.

> 
> So I really think that the only longterm fix is to make the lookup data
> structures be "local" to the base of the lookup, in order to get away
> from the inherently non-local nature of the current hash lookups. 

Yes, that may be a good idea. I don't have time to work on this though.

Still even with local lookups single pointer buckets will likely help ;)

Also isn't it a bit late in the 2.5 cycle to think about such radical
changes like local lookup? It sounds more like a nice 2.7 project.  I
believe my patch with a bit more tweaking (my current 64K hash table
seems to be too small) is suitable even for an soon to be stable
kernel.

Also my patch had some other changes that I believe should be included
anyways because they're independent and improvement. It replaces the
max_dentries race break hack with a better algorithm to detect cycles on walk.

Also it does more prefetches while list walking which I believe to be 
useful.

-Andi

Re: [Lse-tech] Re: [PATCH] New dcache / inode hash tuning patch

From: Paul M. <pm...@en...> - 2003-02-28 10:28:50

>
>It would be possible to cache line optimize the layout of struct dentry
>in addition. May be an interesting add-on project for someone...

I played with this a few months ago, and sent a preliminary patch to
linux-kernel, but in my (fairly brief) testing on a 4-way system I
wasn't able to produce a measurable benefit from it. I think part of
this may have been due to contention on dcache_lock, so maybe dcache_rcu
will help there.

The main changes were to bring together all the data needed for checking
a non-matching dcache hash entry (modulo hash collisions) into one
cacheline, and to separate out the mostly-read-only fields from the 
change-frequently fields.

The original patch is at 

http://marc.theaimsgroup.com/?l=linux-kernel&m=102650654002932&w=2

>But for lookup walking even one cache line - the one containing d_hash -
>should be needed. Unless d_hash is unlucky enough to cross a cache
>line for its two members ... but I doubt that.

No, but on a 32-byte cache line system, d_parent, d_hash and d_name are
all on different cache lines, and they're used when checking each entry.
On 64-byte systems, d_parent and d_hash will be on the same line, but
d_name is still on a separate line and d_name.hash gets checked before
d_parent. So bringing these three fields on to the same cacheline
would theoretically be a win.

Paul

Re: [Lse-tech] Re: [PATCH] New dcache / inode hash tuning patch

From: Andi K. <ak...@su...> - 2003-02-28 10:41:05

On Fri, Feb 28, 2003 at 02:27:27AM -0800, Paul Menage wrote:
> >But for lookup walking even one cache line - the one containing d_hash -
> >should be needed. Unless d_hash is unlucky enough to cross a cache
> >line for its two members ... but I doubt that.
> 
> No, but on a 32-byte cache line system, d_parent, d_hash and d_name are
> all on different cache lines, and they're used when checking each entry.

... and dcache RCU checks d_bucket and d_move_count too in the hash 
walking loop.


> On 64-byte systems, d_parent and d_hash will be on the same line, but
> d_name is still on a separate line and d_name.hash gets checked before
> d_parent. So bringing these three fields on to the same cacheline
> would theoretically be a win.

Ok you're right. Optimizing the layout a bit would be probably a good 
idea. I won't include it in the hash patchkit for now to not do too
many things with the same patch.

-Andi

[Lse-tech] Re: [PATCH] New dcache / inode hash tuning patch

From: Linus T. <tor...@tr...> - 2003-02-28 18:50:30

On 28 Feb 2003, Andi Kleen wrote:
>
> Also isn't it a bit late in the 2.5 cycle to think about such radical
> changes like local lookup?

Look again at my suggestion: forcing locality by just changing the hash
function to _intentionally_ bunch directory hashes together.

Yes, it's "pseudo-locality" (and my example hash was truly broken: we 
must _not_ use the directory dentry hash value as part of the hash, as the 
hash needs to be stable over the whole lifetime of the directory dentry), 
but the point is that by just changing the hashing algorithm you can 
potentially get a good portion of the locality we want.

Right now the dcache hash is often something like 17 bits - and we could
easily make it so that roughly "half" the bits would be based purely on
the directory. That would still give each directory ~8 bits worth of
"local hashing", which is fairly reasonable.

> It sounds more like a nice 2.7 project.

It sounds more like changing two lines of code to me.

> I believe my patch with a bit more tweaking (my current 64K hash table
> seems to be too small) is suitable even for an soon to be stable
> kernel.

Quite frankly, right now the only report I've seen about your patch is 
that it made things slightly _slower_.

For a patch that is supposed to speed stuff up, that's a damn bad track 
record. Sorry.

I'd suggest you drop the hash size changes, and try with _just_ the hlist 
stuff, and once that is verified to perform well, _then_ worry about 
hashing changes. Because quite frankly, I suspect my "local hash" thing 
performs better than "make the hashes smaller". And the hash algorithm and 
size is _totally_ independent from whether it uses the regular lists or 
the hlists..

			Linus

[Lse-tech] Re: [PATCH] New dcache / inode hash tuning patch

From: Andi K. <ak...@su...> - 2003-02-28 18:59:17

On Fri, Feb 28, 2003 at 10:47:38AM -0800, Linus Torvalds wrote:
> Right now the dcache hash is often something like 17 bits - and we could
> easily make it so that roughly "half" the bits would be based purely on
> the directory. That would still give each directory ~8 bits worth of
> "local hashing", which is fairly reasonable.

Ok I will see if that helps.
> 
> > I believe my patch with a bit more tweaking (my current 64K hash table
> > seems to be too small) is suitable even for an soon to be stable
> > kernel.
> 
> Quite frankly, right now the only report I've seen about your patch is 
> that it made things slightly _slower_.

Actually that's not quite true. The report had a completely different
profile (lots of other functions had different percentages), so it likely
wasn't a comparable workload. I also don't think the NUMAQs are a good test
platform for this because they have 2MB of fast cache per CPU, while
the typical linux multiprocessor machine has much less. Yes you can 
fit an 1MB hash table into a 2 MB cache....

I'll generate some new numbers here locally over the weekend on P4,
but I only have a dual to test on and see how it performs.

-Andi

Re: [Lse-tech] [PATCH] New dcache / inode hash tuning patch

From: Andi K. <ak...@su...> - 2003-02-27 16:06:51

On Thu, Feb 27, 2003 at 06:02:55AM -0800, Paul McKenney wrote:
> One other possibility here...  On x86, you have 4-byte pointers,
> which gives you 8 such pointers per 32-byte cache line.  This means
> that an insertion at the head of one hash chain will force an L2
> miss for any other CPU attempting to traverse any of these 8 chains.
> 
> Now, if there is per-CPU locality of reference (and there might
> or might not be, depending on the workload, the filesystem layout,
> and the hash function), then inserting the new element after the
> first element in the chain would cause only those CPUs traversing
> that particular chain to see the L2 miss.

I would have expected that with a 16K entry hash table such conflicts are 
statistically rare. Of course it really depends on the hash function.

It may be possible to not insert at the head,
but behind the second entry in the bucket to avoid the problem you
describe.  But I'm not sure it is worth it.

Just making the hash function better would be probably the biggest win.

I coded up FNV now and a hash bucket statistics patch now and after
some reasonable load (lots of find /, many parallel kernel compiles) 
I get this distribution:

worst case bucket length 32 
average bucket length    13.4637 
total number of entries  220388

average length of 13 is not very good 

Probably need a bigger table again. 16K entries seems to be too small.

The hash is actually not pure FNV, but combined r5 hash for the string
and then combination using FNV with the parent address.
it may be worth trying to put both into the same FNV.

Unfortunately I'm running out of time on this so won't be able to do
further experiments on that for now.

Latest patch including FNV and hashtable dumper attached in case anybody
wants to play further with this.

-Andi


diff -X ../KDIFX -burpN linux/fs/dcache.c linux-2.5.63-work/fs/dcache.c
--- linux/fs/dcache.c	2003-02-21 12:13:53.000000000 +0100
+++ linux-2.5.63-work/fs/dcache.c	2003-02-27 16:25:27.000000000 +0100
@@ -25,9 +25,15 @@
 #include <linux/module.h>
 #include <linux/mount.h>
 #include <asm/uaccess.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
 
 #define DCACHE_PARANOIA 1
 /* #define DCACHE_DEBUG 1 */
+#ifdef CONFIG_PROC_FS
+#define DCACHE_HASH_STATISTICS 1
+#endif
+#define DCACHE_FNV_HASH 1
 
 spinlock_t dcache_lock __cacheline_aligned_in_smp = SPIN_LOCK_UNLOCKED;
 rwlock_t dparent_lock __cacheline_aligned_in_smp = RW_LOCK_UNLOCKED;
@@ -41,23 +47,24 @@ static kmem_cache_t *dentry_cache; 
  *
  * This hash-function tries to avoid losing too many bits of hash
  * information, yet avoid using a prime hash-size or similar.
+ *
+ * AK: using a prime hash with a prime near some power-of-two would be 
+ * likely better. Any hash guru interested? Same for the inode hash.
+ *
+ * We probably should have CONFIG_SMALL and CONFIG_LARGE for this.
+ * Don't scale it by memory size, otherwise big systems are eaten
+ * by the cache misses.
+ * 
+ * Sizes need to be power-of-two for now.
  */
-#define D_HASHBITS     d_hash_shift
-#define D_HASHMASK     d_hash_mask
+#define D_NUMBUCKETS	(16*1024) 	/* 64K RAM on a 32bit machine */
+#define D_HASHBITS      13              /* = log2(D_NUMBUCKETS) */
+#define D_HASHMASK      (D_NUMBUCKETS-1)
 
-static unsigned int d_hash_mask;
-static unsigned int d_hash_shift;
-static struct list_head *dentry_hashtable;
+static struct hlist_head *dentry_hashtable;
 static LIST_HEAD(dentry_unused);
-static int max_dentries;
 static void * hashtable_end;
 
-static inline int is_bucket(void * addr)
-{
-	return ((addr < (void *)dentry_hashtable)
-			|| (addr > hashtable_end) ? 0 : 1);
-}	
-
 /* Statistics gathering. */
 struct dentry_stat_t dentry_stat = {
 	.age_limit = 45,
@@ -292,6 +299,7 @@ struct dentry * d_find_alias(struct inod
 	while (next != head) {
 		tmp = next;
 		next = tmp->next;
+		prefetch(next);
 		alias = list_entry(tmp, struct dentry, d_alias);
  		if (!d_unhashed(alias)) {
 			if (alias->d_flags & DCACHE_DISCONNECTED)
@@ -378,6 +386,7 @@ static void prune_dcache(int count)
 		if (tmp == &dentry_unused)
 			break;
 		list_del_init(tmp);
+		prefetch(dentry_unused.prev);
  		dentry_stat.nr_unused--;
 		dentry = list_entry(tmp, struct dentry, d_lru);
 
@@ -603,15 +612,15 @@ void shrink_dcache_parent(struct dentry 
  * done under dcache_lock.
  *
  */
-void shrink_dcache_anon(struct list_head *head)
+void shrink_dcache_anon(struct hlist_head *head)
 {
-	struct list_head *lp;
+	struct hlist_node *lp;
 	int found;
 	do {
 		found = 0;
 		spin_lock(&dcache_lock);
-		list_for_each(lp, head) {
-			struct dentry *this = list_entry(lp, struct dentry, d_hash);
+		hlist_for_each(lp, head) {
+			struct dentry *this = hlist_entry(lp, struct dentry, d_hash);
 			list_del(&this->d_lru);
 
 			/* don't add non zero d_count dentries 
@@ -727,7 +736,7 @@ struct dentry * d_alloc(struct dentry * 
 	dentry->d_mounted = 0;
 	dentry->d_cookie = NULL;
 	dentry->d_bucket = NULL;
-	INIT_LIST_HEAD(&dentry->d_hash);
+	INIT_HLIST_NODE(&dentry->d_hash);
 	INIT_LIST_HEAD(&dentry->d_lru);
 	INIT_LIST_HEAD(&dentry->d_subdirs);
 	INIT_LIST_HEAD(&dentry->d_alias);
@@ -797,12 +806,105 @@ struct dentry * d_alloc_root(struct inod
 	return res;
 }
 
-static inline struct list_head * d_hash(struct dentry * parent, unsigned long hash)
+#ifdef DCACHE_FNV_HASH
+
+/* FNV hash
+ * By Bob Jenkins, 1996.  bob...@bu....  You may use this
+ * code any way you wish, private, educational, or commercial.  It's free.
+ *
+ * See http://burtleburtle.net/bob/hash/evahash.html
+ * This version taken from the Oracle OCFS2 sources which was
+ * Copyright (C) 2002 Oracle Corporation.  All rights reserved.
+ * and hacked up by AK.
+ * 
+ * Only use this for the mixing of the parent with the name hash
+ * for now. name hash uses the r5 hash from reiserfs. (CHECKME)
+ * 
+ * On 32bit most of this function should be optimized away; even
+ * on 64bit (x86-64) it generates reasonable and completely jumpless
+ * code with gcc 3.2.
+ */ 
+
+#define mix(a,b,c)              \
+{                               \
+  a -= b; a -= c; a ^= (c>>13); \
+  b -= c; b -= a; b ^= (a<<8);  \
+  c -= a; c -= b; c ^= (b>>13); \
+  a -= b; a -= c; a ^= (c>>12); \
+  b -= c; b -= a; b ^= (a<<16); \
+  c -= a; c -= b; c ^= (b>>5);  \
+  a -= b; a -= c; a ^= (c>>3);  \
+  b -= c; b -= a; b ^= (a<<10); \
+  c -= a; c -= b; c ^= (b>>15); \
+}
+
+static struct hlist_head * d_hash(struct dentry * parent, unsigned long hash)
+{
+	extern void __bad_size(void);
+	union { 
+		struct { 
+			struct dentry *parent; 
+			unsigned long hash; 
+		} s __attribute__((packed)); 
+		u8 bytes[sizeof(parent)+sizeof(hash)];
+	} v;
+	u32 a, b, c, len;
+
+	/* Set up the internal state */
+	a = b = 0x9e3779b9;	/* the golden ratio; an arbitrary value */
+	c = 0x12345678;		/* another arbitary value. same as ocfs. */
+
+	v.s.parent = parent;
+	v.s.hash = hash; 
+	
+	if (sizeof(v.bytes) == 16) { /* 64bit */ 
+		a += ((u32)(v.bytes[11])) << 0;  
+		a += ((u32)(v.bytes[12])) << 8;  
+		a += ((u32)(v.bytes[13])) << 16;  
+		a += ((u32)(v.bytes[14])) << 24;  
+		b += v.bytes[15];
+		mix (a, b, c);	/* c is constant and could be precomputed */
+	} else if (sizeof(v.bytes) > 16) 
+		__bad_size();
+	len = min_t(u32, 11, sizeof(v.bytes)); 
+	c += len; 
+	switch (len) {	/* all the case statements fall through */
+	    case 11:
+		    c += ((__u32) v.bytes[10] << 24);
+	    case 10:
+		    c += ((__u32) v.bytes[9] << 16);
+	    case 9:
+		    c += ((__u32) v.bytes[8] << 8);
+		    /* the first byte of c is reserved for the length */
+	    case 8:
+		    b += ((__u32) v.bytes[7] << 24);
+	    case 7:
+		    b += ((__u32) v.bytes[6] << 16);
+	    case 6:
+		    b += ((__u32) v.bytes[5] << 8);
+	    case 5:
+		    b += v.bytes[4];
+	    case 4:
+		    a += ((__u32) v.bytes[3] << 24);
+	    case 3:
+		    a += ((__u32) v.bytes[2] << 16);
+	    case 2:
+		    a += ((__u32) v.bytes[1] << 8);
+	    case 1:
+		    a += v.bytes[0];
+		    /* case 0: nothing left to add */
+	}
+	mix (a, b, c);
+	return dentry_hashtable + (c & D_HASHMASK);
+}
+#else
+static inline struct hlist_head * d_hash(struct dentry * parent, unsigned long hash)
 {
 	hash += (unsigned long) parent / L1_CACHE_BYTES;
 	hash = hash ^ (hash >> D_HASHBITS);
 	return dentry_hashtable + (hash & D_HASHMASK);
 }
+#endif
 
 /**
  * d_alloc_anon - allocate an anonymous dentry
@@ -860,7 +962,7 @@ struct dentry * d_alloc_anon(struct inod
 			res->d_flags |= DCACHE_DISCONNECTED;
 			res->d_vfs_flags &= ~DCACHE_UNHASHED;
 			list_add(&res->d_alias, &inode->i_dentry);
-			list_add(&res->d_hash, &inode->i_sb->s_anon);
+			hlist_add_head(&res->d_hash, &inode->i_sb->s_anon);
 			spin_unlock(&res->d_lock);
 		}
 		inode = NULL; /* don't drop reference */
@@ -947,21 +1049,23 @@ struct dentry * d_lookup(struct dentry *
 	unsigned int len = name->len;
 	unsigned int hash = name->hash;
 	const unsigned char *str = name->name;
-	struct list_head *head = d_hash(parent,hash);
+	struct hlist_head *head = d_hash(parent,hash);
 	struct dentry *found = NULL;
-	struct list_head *tmp;
-	int lookup_count = 0;
+	struct hlist_node *node;
+	int loop_count = 0;
+	struct dentry *loop_marker = NULL;
 
 	rcu_read_lock();
 	
-	/* lookup is terminated when flow reaches any bucket head */
-	for(tmp = head->next; !is_bucket(tmp); tmp = tmp->next) {
+	hlist_for_each (node, head) { 
 		struct dentry *dentry; 
 		unsigned long move_count;
 		struct qstr * qstr;
 		
+		prefetch(node->next);
+
 		smp_read_barrier_depends();
-		dentry = list_entry(tmp, struct dentry, d_hash);
+		dentry = hlist_entry(node, struct dentry, d_hash);
 
 		/* if lookup ends up in a different bucket 
 		 * due to concurrent rename, fail it
@@ -969,11 +1073,28 @@ struct dentry * d_lookup(struct dentry *
 		if (unlikely(dentry->d_bucket != head))
 			break;
 
-		/* to avoid race if dentry keep coming back to original
-		 * bucket due to double moves
+		/* to avoid a race if the dentry keeps coming back to the 
+		 * original bucket due to double moves. Check for a cycle
+		 * in each power of two. It could be still a false
+		 * positive due to memory reuse, so also recheck
+		 * again using an exclusive lock. This should happen
+		 * only in exceptional cases. Recursion is bounded to 
+		 * one because the race cannot occur with dcache lock
+		 * hold.
+		 *
+		 * UNTESTED CURRENTLY AS I CANNOT REPRODUCE THIS RACE!
 		 */
-		if (unlikely(++lookup_count > max_dentries))
-			break;
+		if (unlikely(loop_count & (loop_count - 1))) {
+			if (unlikely(dentry == loop_marker)) { 
+				rcu_read_unlock();
+				spin_lock(&dcache_lock);
+				dentry = d_lookup(parent, name);		
+				spin_unlock(&dcache_lock);
+				return dentry;
+			}
+			loop_marker = dentry; 
+		}
+		loop_count++;
 
 		/*
 		 * We must take a snapshot of d_move_count followed by
@@ -1031,7 +1152,8 @@ int d_validate(struct dentry *dentry, st
 	unsigned long dent_addr = (unsigned long) dentry;
 	unsigned long min_addr = PAGE_OFFSET;
 	unsigned long align_mask = 0x0F;
-	struct list_head *base, *lhp;
+	struct hlist_head *base;
+	struct hlist_node *lhp;
 
 	if (dent_addr < min_addr)
 		goto out;
@@ -1047,12 +1169,13 @@ int d_validate(struct dentry *dentry, st
 		goto out;
 
 	spin_lock(&dcache_lock);
-	lhp = base = d_hash(dparent, dentry->d_name.hash);
-	while ((lhp = lhp->next) != base) {
+	base = d_hash(dparent, dentry->d_name.hash);
+	hlist_for_each(lhp,base) { 
+		prefetch(lhp->next);
 		/* read_barrier_depends() not required for d_hash list
 		 * as it is parsed under dcache_lock
 		 */
-		if (dentry == list_entry(lhp, struct dentry, d_hash)) {
+		if (dentry == hlist_entry(lhp, struct dentry, d_hash)) {
 			dget(dentry);
 			spin_unlock(&dcache_lock);
 			return 1;
@@ -1113,12 +1236,11 @@ void d_delete(struct dentry * dentry)
  
 void d_rehash(struct dentry * entry)
 {
-	struct list_head *list = d_hash(entry->d_parent, entry->d_name.hash);
+	struct hlist_head *list = d_hash(entry->d_parent, entry->d_name.hash);
 	spin_lock(&dcache_lock);
- 	if (!list_empty(&entry->d_hash) && !d_unhashed(entry)) BUG();
  	entry->d_vfs_flags &= ~DCACHE_UNHASHED;
 	entry->d_bucket = list;
- 	list_add_rcu(&entry->d_hash, list);
+ 	hlist_add_head_rcu(&entry->d_hash, list);
 	spin_unlock(&dcache_lock);
 }
 
@@ -1171,10 +1293,6 @@ static inline void switch_names(struct d
  * We could be nicer about the deleted file, and let it show
  * up under the name it got deleted rather than the name that
  * deleted it.
- *
- * Careful with the hash switch. The hash switch depends on
- * the fact that any list-entry can be a head of the list.
- * Think about it.
  */
  
 /**
@@ -1197,8 +1315,8 @@ void d_move(struct dentry * dentry, stru
 	/* Move the dentry to the target hash queue, if on different bucket */
 	if (dentry->d_bucket != target->d_bucket) {
 		dentry->d_bucket = target->d_bucket;
-		list_del_rcu(&dentry->d_hash);
-		list_add_rcu(&dentry->d_hash, &target->d_hash);
+		hlist_del_rcu(&dentry->d_hash);
+		hlist_add_head_rcu(&dentry->d_hash, target->d_bucket);
 	}
 
 	/* Unhash the target: dput() will then get rid of it */
@@ -1281,6 +1399,7 @@ static char * __d_path( struct dentry *d
 			continue;
 		}
 		parent = dentry->d_parent;
+		prefetch(parent);
 		namelen = dentry->d_name.len;
 		buflen -= namelen + 1;
 		if (buflen < 0)
@@ -1500,9 +1619,6 @@ out:
 
 static void __init dcache_init(unsigned long mempages)
 {
-	struct list_head *d;
-	unsigned long order;
-	unsigned int nr_hash;
 	int i;
 
 	/* 
@@ -1521,49 +1637,17 @@ static void __init dcache_init(unsigned 
 	if (!dentry_cache)
 		panic("Cannot create dentry cache");
 	
-	/* approximate maximum number of dentries in one hash bucket */
-	max_dentries = (mempages * (PAGE_SIZE / sizeof(struct dentry)));
-
 	set_shrinker(DEFAULT_SEEKS, shrink_dcache_memory);
 
-#if PAGE_SHIFT < 13
-	mempages >>= (13 - PAGE_SHIFT);
-#endif
-	mempages *= sizeof(struct list_head);
-	for (order = 0; ((1UL << order) << PAGE_SHIFT) < mempages; order++)
-		;
-
-	do {
-		unsigned long tmp;
-
-		nr_hash = (1UL << order) * PAGE_SIZE /
-			sizeof(struct list_head);
-		d_hash_mask = (nr_hash - 1);
-
-		tmp = nr_hash;
-		d_hash_shift = 0;
-		while ((tmp >>= 1UL) != 0UL)
-			d_hash_shift++;
-
-		dentry_hashtable = (struct list_head *)
-			__get_free_pages(GFP_ATOMIC, order);
-	} while (dentry_hashtable == NULL && --order >= 0);
-
-	printk(KERN_INFO "Dentry cache hash table entries: %d (order: %ld, %ld bytes)\n",
-			nr_hash, order, (PAGE_SIZE << order));
-
+	dentry_hashtable = (struct hlist_head *)
+		__get_free_pages(GFP_ATOMIC, 
+			 get_order(D_NUMBUCKETS * sizeof(struct hlist_head)));
 	if (!dentry_hashtable)
 		panic("Failed to allocate dcache hash table\n");
 
-	hashtable_end = dentry_hashtable + nr_hash;
-
-	d = dentry_hashtable;
-	i = nr_hash;
-	do {
-		INIT_LIST_HEAD(d);
-		d++;
-		i--;
-	} while (i);
+	for (i = 0; i < D_NUMBUCKETS; i++) 
+		INIT_HLIST_HEAD(&dentry_hashtable[i]);
+	hashtable_end = dentry_hashtable + i; 
 }
 
 /* SLAB cache for __getname() consumers */
@@ -1577,6 +1661,62 @@ EXPORT_SYMBOL(d_genocide);
 extern void bdev_cache_init(void);
 extern void cdev_cache_init(void);
 
+#ifdef DCACHE_HASH_STATISTICS
+
+static int dcache_seq_show(struct seq_file *seq, void *v)
+{ 
+	char buf[20];
+	struct hlist_node *nd;
+	int cnt = 0;
+	rcu_read_lock(); 
+	hlist_for_each (nd, (struct hlist_head *)v)
+		cnt++;
+	rcu_read_unlock(); 
+	sprintf(buf, "%lu: %d", ((struct hlist_head *)v) - dentry_hashtable, cnt);
+	seq_printf(seq, "%-19s\n", buf);  
+	return 0;
+} 
+
+static void *dcache_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	if (*pos < D_NUMBUCKETS*20) { 
+		void *r = dentry_hashtable + (*pos)/20; 
+		(*pos)+=20;
+		return r;
+	}
+	return NULL;	
+} 
+
+
+static void *dcache_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	return dcache_seq_next(seq, NULL, pos);
+} 
+
+
+static void dcache_seq_stop(struct seq_file *swap, void *v)
+{
+}
+
+static struct seq_operations dcache_seq_ops = {
+	.start = dcache_seq_start,
+	.next =	dcache_seq_next,
+	.stop = dcache_seq_stop,
+	.show =	dcache_seq_show
+};
+
+static int dcache_seq_open(struct inode *inode, struct file *file)
+{
+	return seq_open(file, &dcache_seq_ops);
+}
+
+static struct file_operations dcache_proc_fops = {
+	.open = dcache_seq_open,
+	.read = seq_read,
+	.llseek = seq_lseek,
+};
+#endif
+
 void __init vfs_caches_init(unsigned long mempages)
 {
 	names_cachep = kmem_cache_create("names_cache", 
@@ -1597,4 +1737,14 @@ void __init vfs_caches_init(unsigned lon
 	mnt_init(mempages);
 	bdev_cache_init();
 	cdev_cache_init();
+
+#ifdef DCACHE_HASH_STATISTICS
+ {
+	 struct proc_dir_entry *pe;
+	 pe = create_proc_entry("dcache", 0444, &proc_root);
+	 if (pe) { 
+		 pe->proc_fops = &dcache_proc_fops; 
+	 } 
+ }
+#endif
 }
diff -X ../KDIFX -burpN linux/fs/fs-writeback.c linux-2.5.63-work/fs/fs-writeback.c
--- linux/fs/fs-writeback.c	2003-02-10 19:39:17.000000000 +0100
+++ linux-2.5.63-work/fs/fs-writeback.c	2003-02-26 13:53:13.000000000 +0100
@@ -90,7 +90,7 @@ void __mark_inode_dirty(struct inode *in
 		 * Only add valid (hashed) inodes to the superblock's
 		 * dirty list.  Add blockdev inodes as well.
 		 */
-		if (list_empty(&inode->i_hash) && !S_ISBLK(inode->i_mode))
+		if (hnode_empty(&inode->i_hash) && !S_ISBLK(inode->i_mode))
 			goto out;
 
 		/*
diff -X ../KDIFX -burpN linux/fs/hugetlbfs/inode.c linux-2.5.63-work/fs/hugetlbfs/inode.c
--- linux/fs/hugetlbfs/inode.c	2003-02-15 10:37:04.000000000 +0100
+++ linux-2.5.63-work/fs/hugetlbfs/inode.c	2003-02-26 13:54:11.000000000 +0100
@@ -189,7 +189,7 @@ void truncate_hugepages(struct address_s
 
 static void hugetlbfs_delete_inode(struct inode *inode)
 {
-	list_del_init(&inode->i_hash);
+	hlist_del_init(&inode->i_hash);
 	list_del_init(&inode->i_list);
 	inode->i_state |= I_FREEING;
 	inodes_stat.nr_inodes--;
@@ -208,7 +208,7 @@ static void hugetlbfs_forget_inode(struc
 {
 	struct super_block *super_block = inode->i_sb;
 
-	if (list_empty(&inode->i_hash))
+	if (hnode_empty(&inode->i_hash))
 		goto out_truncate;
 
 	if (!(inode->i_state & (I_DIRTY|I_LOCK))) {
@@ -223,7 +223,7 @@ static void hugetlbfs_forget_inode(struc
 
 	/* write_inode_now() ? */
 	inodes_stat.nr_unused--;
-	list_del_init(&inode->i_hash);
+	hlist_del_init(&inode->i_hash);
 out_truncate:
 	list_del_init(&inode->i_list);
 	inode->i_state |= I_FREEING;
diff -X ../KDIFX -burpN linux/fs/inode.c linux-2.5.63-work/fs/inode.c
--- linux/fs/inode.c	2003-02-10 19:39:00.000000000 +0100
+++ linux-2.5.63-work/fs/inode.c	2003-02-26 19:07:19.000000000 +0100
@@ -49,11 +49,9 @@
  * Inode lookup is no longer as critical as it used to be:
  * most of the lookups are going to be through the dcache.
  */
-#define I_HASHBITS	i_hash_shift
-#define I_HASHMASK	i_hash_mask
-
-static unsigned int i_hash_mask;
-static unsigned int i_hash_shift;
+#define I_NUMBUCKETS	(8*1024)
+#define I_HASHBITS	12	/* = log2(I_NUMBUCKETS) */ 
+#define I_HASHMASK	(I_NUMBUCKETS-1)
 
 /*
  * Each inode can be on two separate lists. One is
@@ -69,8 +67,8 @@ static unsigned int i_hash_shift;
 
 LIST_HEAD(inode_in_use);
 LIST_HEAD(inode_unused);
-static struct list_head *inode_hashtable;
-static LIST_HEAD(anon_hash_chain); /* for inodes with NULL i_sb */
+static struct hlist_head *inode_hashtable;
+static HLIST_HEAD(anon_hash_chain); /* for inodes with NULL i_sb */
 
 /*
  * A simple spinlock to protect the list manipulations.
@@ -162,7 +160,7 @@ void destroy_inode(struct inode *inode) 
 void inode_init_once(struct inode *inode)
 {
 	memset(inode, 0, sizeof(*inode));
-	INIT_LIST_HEAD(&inode->i_hash);
+	INIT_HLIST_NODE(&inode->i_hash);
 	INIT_LIST_HEAD(&inode->i_data.clean_pages);
 	INIT_LIST_HEAD(&inode->i_data.dirty_pages);
 	INIT_LIST_HEAD(&inode->i_data.locked_pages);
@@ -284,7 +282,7 @@ static int invalidate_list(struct list_h
 			continue;
 		invalidate_inode_buffers(inode);
 		if (!atomic_read(&inode->i_count)) {
-			list_del_init(&inode->i_hash);
+			hlist_del_init(&inode->i_hash);
 			list_del(&inode->i_list);
 			list_add(&inode->i_list, dispose);
 			inode->i_state |= I_FREEING;
@@ -422,7 +420,7 @@ static void prune_icache(int nr_to_scan)
 			if (!can_unuse(inode))
 				continue;
 		}
-		list_del_init(&inode->i_hash);
+		hlist_del_init(&inode->i_hash);
 		list_move(&inode->i_list, &freeable);
 		inode->i_state |= I_FREEING;
 		nr_pruned++;
@@ -460,50 +458,42 @@ static int shrink_icache_memory(int nr, 
  * by hand after calling find_inode now! This simplifies iunique and won't
  * add any additional branch in the common code.
  */
-static struct inode * find_inode(struct super_block * sb, struct list_head *head, int (*test)(struct inode *, void *), void *data)
+static struct inode * find_inode(struct super_block * sb, struct hlist_head *head, int (*test)(struct inode *, void *), void *data)
 {
-	struct list_head *tmp;
-	struct inode * inode;
+	struct hlist_node *node;
+	struct inode * inode = NULL;
 
-	tmp = head;
-	for (;;) {
-		tmp = tmp->next;
-		inode = NULL;
-		if (tmp == head)
-			break;
-		inode = list_entry(tmp, struct inode, i_hash);
+	hlist_for_each (node, head) { 
+		prefetch(node->next);
+		inode = hlist_entry(node, struct inode, i_hash);
 		if (inode->i_sb != sb)
 			continue;
 		if (!test(inode, data))
 			continue;
 		break;
 	}
-	return inode;
+	return node ? inode : NULL;
 }
 
 /*
  * find_inode_fast is the fast path version of find_inode, see the comment at
  * iget_locked for details.
  */
-static struct inode * find_inode_fast(struct super_block * sb, struct list_head *head, unsigned long ino)
+static struct inode * find_inode_fast(struct super_block * sb, struct hlist_head *head, unsigned long ino)
 {
-	struct list_head *tmp;
-	struct inode * inode;
+	struct hlist_node *node;
+	struct inode * inode = NULL;
 
-	tmp = head;
-	for (;;) {
-		tmp = tmp->next;
-		inode = NULL;
-		if (tmp == head)
-			break;
-		inode = list_entry(tmp, struct inode, i_hash);
+	hlist_for_each (node, head) {
+		prefetch(node->next);
+		inode = list_entry(node, struct inode, i_hash);
 		if (inode->i_ino != ino)
 			continue;
 		if (inode->i_sb != sb)
 			continue;
 		break;
 	}
-	return inode;
+	return node ? inode : NULL;
 }
 
 /**
@@ -553,7 +543,7 @@ EXPORT_SYMBOL(unlock_new_inode);
  * We no longer cache the sb_flags in i_flags - see fs.h
  *	-- rm...@ar...
  */
-static struct inode * get_new_inode(struct super_block *sb, struct list_head *head, int (*test)(struct inode *, void *), int (*set)(struct inode *, void *), void *data)
+static struct inode * get_new_inode(struct super_block *sb, struct hlist_head *head, int (*test)(struct inode *, void *), int (*set)(struct inode *, void *), void *data)
 {
 	struct inode * inode;
 
@@ -570,7 +560,7 @@ static struct inode * get_new_inode(stru
 
 			inodes_stat.nr_inodes++;
 			list_add(&inode->i_list, &inode_in_use);
-			list_add(&inode->i_hash, head);
+			hlist_add_head(&inode->i_hash, head);
 			inode->i_state = I_LOCK|I_NEW;
 			spin_unlock(&inode_lock);
 
@@ -603,7 +593,7 @@ set_failed:
  * get_new_inode_fast is the fast path version of get_new_inode, see the
  * comment at iget_locked for details.
  */
-static struct inode * get_new_inode_fast(struct super_block *sb, struct list_head *head, unsigned long ino)
+static struct inode * get_new_inode_fast(struct super_block *sb, struct hlist_head *head, unsigned long ino)
 {
 	struct inode * inode;
 
@@ -618,7 +608,7 @@ static struct inode * get_new_inode_fast
 			inode->i_ino = ino;
 			inodes_stat.nr_inodes++;
 			list_add(&inode->i_list, &inode_in_use);
-			list_add(&inode->i_hash, head);
+			hlist_add_head(&inode->i_hash, head);
 			inode->i_state = I_LOCK|I_NEW;
 			spin_unlock(&inode_lock);
 
@@ -670,7 +660,7 @@ ino_t iunique(struct super_block *sb, in
 {
 	static ino_t counter = 0;
 	struct inode *inode;
-	struct list_head * head;
+	struct hlist_head * head;
 	ino_t res;
 	spin_lock(&inode_lock);
 retry:
@@ -724,7 +714,7 @@ struct inode *igrab(struct inode *inode)
  * Note, @test is called with the inode_lock held, so can't sleep.
  */
 static inline struct inode *ifind(struct super_block *sb,
-		struct list_head *head, int (*test)(struct inode *, void *),
+		struct hlist_head *head, int (*test)(struct inode *, void *),
 		void *data)
 {
 	struct inode *inode;
@@ -756,7 +746,7 @@ static inline struct inode *ifind(struct
  * Otherwise NULL is returned.
  */
 static inline struct inode *ifind_fast(struct super_block *sb,
-		struct list_head *head, unsigned long ino)
+		struct hlist_head *head, unsigned long ino)
 {
 	struct inode *inode;
 
@@ -794,7 +784,7 @@ static inline struct inode *ifind_fast(s
 struct inode *ilookup5(struct super_block *sb, unsigned long hashval,
 		int (*test)(struct inode *, void *), void *data)
 {
-	struct list_head *head = inode_hashtable + hash(sb, hashval);
+	struct hlist_head *head = inode_hashtable + hash(sb, hashval);
 
 	return ifind(sb, head, test, data);
 }
@@ -816,7 +806,7 @@ EXPORT_SYMBOL(ilookup5);
  */
 struct inode *ilookup(struct super_block *sb, unsigned long ino)
 {
-	struct list_head *head = inode_hashtable + hash(sb, ino);
+	struct hlist_head *head = inode_hashtable + hash(sb, ino);
 
 	return ifind_fast(sb, head, ino);
 }
@@ -848,7 +838,7 @@ struct inode *iget5_locked(struct super_
 		int (*test)(struct inode *, void *),
 		int (*set)(struct inode *, void *), void *data)
 {
-	struct list_head *head = inode_hashtable + hash(sb, hashval);
+	struct hlist_head *head = inode_hashtable + hash(sb, hashval);
 	struct inode *inode;
 
 	inode = ifind(sb, head, test, data);
@@ -881,7 +871,7 @@ EXPORT_SYMBOL(iget5_locked);
  */
 struct inode *iget_locked(struct super_block *sb, unsigned long ino)
 {
-	struct list_head *head = inode_hashtable + hash(sb, ino);
+	struct hlist_head *head = inode_hashtable + hash(sb, ino);
 	struct inode *inode;
 
 	inode = ifind_fast(sb, head, ino);
@@ -907,11 +897,11 @@ EXPORT_SYMBOL(iget_locked);
  
 void __insert_inode_hash(struct inode *inode, unsigned long hashval)
 {
-	struct list_head *head = &anon_hash_chain;
+	struct hlist_head *head = &anon_hash_chain;
 	if (inode->i_sb)
 		head = inode_hashtable + hash(inode->i_sb, hashval);
 	spin_lock(&inode_lock);
-	list_add(&inode->i_hash, head);
+	hlist_add_head(&inode->i_hash, head);
 	spin_unlock(&inode_lock);
 }
 
@@ -925,7 +915,7 @@ void __insert_inode_hash(struct inode *i
 void remove_inode_hash(struct inode *inode)
 {
 	spin_lock(&inode_lock);
-	list_del_init(&inode->i_hash);
+	hlist_del_init(&inode->i_hash);
 	spin_unlock(&inode_lock);
 }
 
@@ -933,7 +923,7 @@ void generic_delete_inode(struct inode *
 {
 	struct super_operations *op = inode->i_sb->s_op;
 
-	list_del_init(&inode->i_hash);
+	hlist_del_init(&inode->i_hash);
 	list_del_init(&inode->i_list);
 	inode->i_state|=I_FREEING;
 	inodes_stat.nr_inodes--;
@@ -962,7 +952,7 @@ static void generic_forget_inode(struct 
 {
 	struct super_block *sb = inode->i_sb;
 
-	if (!list_empty(&inode->i_hash)) {
+	if (!hnode_empty(&inode->i_hash)) {
 		if (!(inode->i_state & (I_DIRTY|I_LOCK))) {
 			list_del(&inode->i_list);
 			list_add(&inode->i_list, &inode_unused);
@@ -974,7 +964,7 @@ static void generic_forget_inode(struct 
 		write_inode_now(inode, 1);
 		spin_lock(&inode_lock);
 		inodes_stat.nr_unused--;
-		list_del_init(&inode->i_hash);
+		hlist_del_init(&inode->i_hash);
 	}
 	list_del_init(&inode->i_list);
 	inode->i_state|=I_FREEING;
@@ -1220,48 +1210,18 @@ void wake_up_inode(struct inode *inode)
  */
 void __init inode_init(unsigned long mempages)
 {
-	struct list_head *head;
-	unsigned long order;
-	unsigned int nr_hash;
 	int i;
-
 	for (i = 0; i < ARRAY_SIZE(i_wait_queue_heads); i++)
 		init_waitqueue_head(&i_wait_queue_heads[i].wqh);
 
-	mempages >>= (14 - PAGE_SHIFT);
-	mempages *= sizeof(struct list_head);
-	for (order = 0; ((1UL << order) << PAGE_SHIFT) < mempages; order++)
-		;
-
-	do {
-		unsigned long tmp;
-
-		nr_hash = (1UL << order) * PAGE_SIZE /
-			sizeof(struct list_head);
-		i_hash_mask = (nr_hash - 1);
-
-		tmp = nr_hash;
-		i_hash_shift = 0;
-		while ((tmp >>= 1UL) != 0UL)
-			i_hash_shift++;
-
-		inode_hashtable = (struct list_head *)
-			__get_free_pages(GFP_ATOMIC, order);
-	} while (inode_hashtable == NULL && --order >= 0);
-
-	printk("Inode-cache hash table entries: %d (order: %ld, %ld bytes)\n",
-			nr_hash, order, (PAGE_SIZE << order));
-
+	inode_hashtable = (struct hlist_head *)
+		__get_free_pages(GFP_ATOMIC, 
+		     get_order(I_NUMBUCKETS*sizeof(struct hlist_head))); 
 	if (!inode_hashtable)
 		panic("Failed to allocate inode hash table\n");
 
-	head = inode_hashtable;
-	i = nr_hash;
-	do {
-		INIT_LIST_HEAD(head);
-		head++;
-		i--;
-	} while (i);
+	for (i = 0; i < I_NUMBUCKETS; i++)
+		INIT_HLIST_HEAD(&inode_hashtable[i]);
 
 	/* inode slab cache */
 	inode_cachep = kmem_cache_create("inode_cache", sizeof(struct inode),
diff -X ../KDIFX -burpN linux/fs/super.c linux-2.5.63-work/fs/super.c
--- linux/fs/super.c	2003-02-10 19:38:30.000000000 +0100
+++ linux-2.5.63-work/fs/super.c	2003-02-26 13:51:33.000000000 +0100
@@ -63,7 +63,7 @@ static struct super_block *alloc_super(v
 		INIT_LIST_HEAD(&s->s_io);
 		INIT_LIST_HEAD(&s->s_files);
 		INIT_LIST_HEAD(&s->s_instances);
-		INIT_LIST_HEAD(&s->s_anon);
+		INIT_HLIST_HEAD(&s->s_anon);
 		init_rwsem(&s->s_umount);
 		sema_init(&s->s_lock, 1);
 		down_write(&s->s_umount);
diff -X ../KDIFX -burpN linux/include/linux/dcache.h linux-2.5.63-work/include/linux/dcache.h
--- linux/include/linux/dcache.h	2003-02-21 12:13:54.000000000 +0100
+++ linux-2.5.63-work/include/linux/dcache.h	2003-02-26 13:16:27.000000000 +0100
@@ -80,8 +80,8 @@ struct dentry {
 	unsigned long d_move_count;	/* to indicated moved dentry while lockless lookup */
 	struct inode  * d_inode;	/* Where the name belongs to - NULL is negative */
 	struct dentry * d_parent;	/* parent directory */
-	struct list_head * d_bucket;	/* lookup hash bucket */
-	struct list_head d_hash;	/* lookup hash list */
+	struct hlist_head * d_bucket;	/* lookup hash bucket */
+	struct hlist_node d_hash;	/* lookup hash list */
 	struct list_head d_lru;		/* LRU list */
 	struct list_head d_child;	/* child of parent list */
 	struct list_head d_subdirs;	/* our children */
@@ -171,7 +171,7 @@ extern rwlock_t dparent_lock;
 static __inline__ void __d_drop(struct dentry * dentry)
 {
 	dentry->d_vfs_flags |= DCACHE_UNHASHED;
-	list_del_rcu(&dentry->d_hash);
+	hlist_del_rcu(&dentry->d_hash);
 }
 
 static __inline__ void d_drop(struct dentry * dentry)
@@ -198,7 +198,7 @@ extern struct dentry * d_alloc_anon(stru
 extern struct dentry * d_splice_alias(struct inode *, struct dentry *);
 extern void shrink_dcache_sb(struct super_block *);
 extern void shrink_dcache_parent(struct dentry *);
-extern void shrink_dcache_anon(struct list_head *);
+extern void shrink_dcache_anon(struct hlist_head *);
 extern int d_invalidate(struct dentry *);
 
 /* only used at mount-time */
diff -X ../KDIFX -burpN linux/include/linux/fs.h linux-2.5.63-work/include/linux/fs.h
--- linux/include/linux/fs.h	2003-02-26 12:55:30.000000000 +0100
+++ linux-2.5.63-work/include/linux/fs.h	2003-02-26 13:16:27.000000000 +0100
@@ -353,7 +353,7 @@ struct block_device {
 };
 
 struct inode {
-	struct list_head	i_hash;
+	struct hlist_node	i_hash;
 	struct list_head	i_list;
 	struct list_head	i_dentry;
 	unsigned long		i_ino;
@@ -601,7 +601,7 @@ struct super_block {
 
 	struct list_head	s_dirty;	/* dirty inodes */
 	struct list_head	s_io;		/* parked for writeback */
-	struct list_head	s_anon;		/* anonymous dentries for (nfs) exporting */
+	struct hlist_head	s_anon;		/* anonymous dentries for (nfs) exporting */
 	struct list_head	s_files;
 
 	struct block_device	*s_bdev;
diff -X ../KDIFX -burpN linux/include/linux/list.h linux-2.5.63-work/include/linux/list.h
--- linux/include/linux/list.h	2003-02-10 19:37:56.000000000 +0100
+++ linux-2.5.63-work/include/linux/list.h	2003-02-26 13:19:48.000000000 +0100
@@ -319,6 +319,95 @@ static inline void list_splice_init(stru
 	for (pos = (head)->next, n = pos->next; pos != (head); \
 		pos = n, ({ read_barrier_depends(); 0;}), n = pos->next)
 
+/* 
+ * Double linked lists with a single pointer list head. 
+ * Mostly useful for hash tables where the two pointer list head is 
+ * too wasteful.
+ * You lose the ability to access the tail in O(1).
+ */ 
+
+struct hlist_head { 
+	struct hlist_node *first; 
+}; 
+
+struct hlist_node { 
+	struct hlist_node *next, **pprev; 
+}; 
+
+#define HLIST_HEAD_INIT { first: NULL } 
+#define HLIST_HEAD(name) struct hlist_head name = {  first: NULL }
+#define INIT_HLIST_HEAD(ptr) ((ptr)->first = NULL) 
+#define INIT_HLIST_NODE(ptr) ((ptr)->next = NULL, (ptr)->pprev = NULL)
+
+/* This is really misnamed */
+static __inline__ int hnode_empty(struct hlist_node *h) 
+{ 
+	return h->pprev==0;
+} 
+
+static __inline__ int hlist_empty(struct hlist_head *h) 
+{ 
+	return !h->first;
+} 
+
+static __inline__ void hlist_del(struct hlist_node *n) 
+{
+	/* The if could be avoided by a common dummy pprev target. */
+	if (!n->pprev) 
+		return; 
+	*(n->pprev) = n->next;  
+	if (n->next) 
+		n->next->pprev = n->pprev;
+}  
+
+#define hlist_del_rcu hlist_del  /* list_del_rcu is identical too? */
+
+static __inline__ void hlist_del_init(struct hlist_node *n) 
+{
+	/* The if could be avoided by a common dummy pprev target. */
+	if (!n->pprev) 
+		return; 
+	*(n->pprev) = n->next;  
+	if (n->next) 
+		n->next->pprev = n->pprev;
+	INIT_HLIST_NODE(n);
+}  
+
+static __inline__ void hlist_add_head(struct hlist_node *n, struct hlist_head *h) 
+{ 
+	n->next = h->first; 
+	if (h->first) 
+		h->first->pprev = &n->next;
+	h->first = n; 
+	n->pprev = &h->first; 
+} 
+
+static __inline__ void hlist_add_head_rcu(struct hlist_node *n, struct hlist_head *h) 
+{ 
+	n->next = h->first;
+	n->pprev = &h->first; 
+	smp_wmb();
+	if (h->first) 
+		h->first->pprev = &n->next;
+	h->first = n; 
+} 
+
+/* next must be != NULL */
+static __inline__ void hlist_add_before(struct hlist_node *n, struct hlist_node *next)
+{
+	n->pprev = next->pprev;
+	n->next = next; 
+	next->pprev = &n->next; 
+	*(n->pprev) = n;
+}
+
+#define hlist_entry(ptr, type, member) container_of(ptr,type,member)
+
+/* Cannot easily do prefetch unfortunately */
+#define hlist_for_each(pos, head) \
+	for (pos = (head)->first; pos; \
+	     pos = pos->next) 
+
 #else
 #warning "don't include kernel headers in userspace"
 #endif /* __KERNEL__ */

Re: [Lse-tech] [PATCH] New dcache / inode hash tuning patch

From: Martin J. B. <mb...@ar...> - 2003-02-28 22:07:19

>> One other possibility here...  On x86, you have 4-byte pointers,
>> which gives you 8 such pointers per 32-byte cache line.  This means
>> that an insertion at the head of one hash chain will force an L2
>> miss for any other CPU attempting to traverse any of these 8 chains.
>> 
>> Now, if there is per-CPU locality of reference (and there might
>> or might not be, depending on the workload, the filesystem layout,
>> and the hash function), then inserting the new element after the
>> first element in the chain would cause only those CPUs traversing
>> that particular chain to see the L2 miss.
> 
> I would have expected that with a 16K entry hash table such conflicts are 
> statistically rare. Of course it really depends on the hash function.
> 
> It may be possible to not insert at the head,
> but behind the second entry in the bucket to avoid the problem you
> describe.  But I'm not sure it is worth it.
> 
> Just making the hash function better would be probably the biggest win.
> 
> I coded up FNV now and a hash bucket statistics patch now and after
> some reasonable load (lots of find /, many parallel kernel compiles) 
> I get this distribution:
> 
> worst case bucket length 32 
> average bucket length    13.4637 
> total number of entries  220388
> 
> average length of 13 is not very good 
> 
> Probably need a bigger table again. 16K entries seems to be too small.
> 
> The hash is actually not pure FNV, but combined r5 hash for the string
> and then combination using FNV with the parent address.
> it may be worth trying to put both into the same FNV.
> 
> Unfortunately I'm running out of time on this so won't be able to do
> further experiments on that for now.
> 
> Latest patch including FNV and hashtable dumper attached in case anybody
> wants to play further with this.

 fs/dcache.c: In function `dcache_seq_show':
fs/dcache.c:1675: warning: long unsigned int format, unsigned int arg (arg 3)
make[1]: warning: jobserver unavailable: using -j1.  Add `+' to parent make rule.
fs/built-in.o(.text+0x164da): In function `dcache_seq_next':
: undefined reference to `__divdi3'
make: *** [.tmp_vmlinux1] Error 1

http://www.aracnet.com/~fletch/linux/config.25

Re: [Lse-tech] [PATCH] New dcache / inode hash tuning patch

From: Andi K. <ak...@su...> - 2003-02-28 22:32:37

>  fs/dcache.c: In function `dcache_seq_show':
> fs/dcache.c:1675: warning: long unsigned int format, unsigned int arg (arg 3)
> make[1]: warning: jobserver unavailable: using -j1.  Add `+' to parent make rule.
> fs/built-in.o(.text+0x164da): In function `dcache_seq_next':
> : undefined reference to `__divdi3'
> make: *** [.tmp_vmlinux1] Error 1

I only tested on x86-64 sorry which has 64bit division of course.

Just casting the *pos in dcache_seq_next to u32 should do the trick as we 
don't care about large file support for that file


-Andi

Re: [Lse-tech] [PATCH] New dcache / inode hash tuning patch

From: Martin J. B. <mb...@ar...> - 2003-03-01 19:17:39

--On Friday, February 28, 2003 23:32:32 +0100 Andi Kleen <ak...@su...> wrote:

>>  fs/dcache.c: In function `dcache_seq_show':
>> fs/dcache.c:1675: warning: long unsigned int format, unsigned int arg (arg 3)
>> make[1]: warning: jobserver unavailable: using -j1.  Add `+' to parent make rule.
>> fs/built-in.o(.text+0x164da): In function `dcache_seq_next':
>> : undefined reference to `__divdi3'
>> make: *** [.tmp_vmlinux1] Error 1
> 
> I only tested on x86-64 sorry which has 64bit division of course.
> 
> Just casting the *pos in dcache_seq_next to u32 should do the trick as we 
> don't care about large file support for that file

OK, that fixed it - thanks. Pretty much equivalent, if anything, it's
a tad slower, but hardly measurable.

Kernbench-2: (make -j N vmlinux, where N = 2 x num_cpus)
                              Elapsed        User      System         CPU
              2.5.63-mjb2       44.43      557.16       95.31     1467.83
        2.5.63-mjb2-andi2       44.34      556.89       96.60     1473.33

Kernbench-16: (make -j N vmlinux, where N = 16 x num_cpus)
                              Elapsed        User      System         CPU
              2.5.63-mjb2       45.39      560.26      117.25     1492.33
        2.5.63-mjb2-andi2       45.50      559.74      118.44     1490.00

DISCLAIMER: SPEC(tm) and the benchmark name SDET(tm) are registered
trademarks of the Standard Performance Evaluation Corporation. This 
benchmarking was performed for research purposes only, and the run results
are non-compliant and not-comparable with any published results.

Results are shown as percentages of the first set displayed

SDET 1  (see disclaimer)
                           Throughput    Std. Dev
              2.5.63-mjb2       100.0%         0.9%
        2.5.63-mjb2-andi2        96.3%         4.3%

SDET 2  (see disclaimer)
                           Throughput    Std. Dev
              2.5.63-mjb2       100.0%         6.5%
        2.5.63-mjb2-andi2       101.9%         5.9%

SDET 4  (see disclaimer)
                           Throughput    Std. Dev
              2.5.63-mjb2       100.0%         2.3%
        2.5.63-mjb2-andi2        99.1%         1.4%

SDET 8  (see disclaimer)
                           Throughput    Std. Dev
              2.5.63-mjb2       100.0%         3.3%
        2.5.63-mjb2-andi2        97.7%         3.7%

SDET 16  (see disclaimer)
                           Throughput    Std. Dev
              2.5.63-mjb2       100.0%         2.7%
        2.5.63-mjb2-andi2       103.9%         2.9%

SDET 32  (see disclaimer)
                           Throughput    Std. Dev
              2.5.63-mjb2       100.0%         0.6%
        2.5.63-mjb2-andi2        96.2%         1.1%

SDET 64  (see disclaimer)
                           Throughput    Std. Dev
              2.5.63-mjb2       100.0%         1.2%
        2.5.63-mjb2-andi2        98.2%         0.8%

Re: [Lse-tech] [PATCH] New dcache / inode hash tuning patch

From: Andi K. <ak...@su...> - 2003-03-01 19:30:41

On Sat, Mar 01, 2003 at 11:17:31AM -0800, Martin J. Bligh wrote:
> 
> 
> --On Friday, February 28, 2003 23:32:32 +0100 Andi Kleen <ak...@su...> wrote:
> 
> >>  fs/dcache.c: In function `dcache_seq_show':
> >> fs/dcache.c:1675: warning: long unsigned int format, unsigned int arg (arg 3)
> >> make[1]: warning: jobserver unavailable: using -j1.  Add `+' to parent make rule.
> >> fs/built-in.o(.text+0x164da): In function `dcache_seq_next':
> >> : undefined reference to `__divdi3'
> >> make: *** [.tmp_vmlinux1] Error 1
> > 
> > I only tested on x86-64 sorry which has 64bit division of course.
> > 
> > Just casting the *pos in dcache_seq_next to u32 should do the trick as we 
> > don't care about large file support for that file
> 
> OK, that fixed it - thanks. Pretty much equivalent, if anything, it's
> a tad slower, but hardly measurable.

Is that comparing to the previous patch or straight 2.5.63 ? 

If it's "only in the noise slower" I would consider it a win already
because it uses much less memory than the old code.

-Andi

Re: [Lse-tech] [PATCH] New dcache / inode hash tuning patch

From: Martin J. B. <mb...@ar...> - 2003-03-01 19:37:22

>> >>  fs/dcache.c: In function `dcache_seq_show':
>> >> fs/dcache.c:1675: warning: long unsigned int format, unsigned int arg (arg 3)
>> >> make[1]: warning: jobserver unavailable: using -j1.  Add `+' to parent make rule.
>> >> fs/built-in.o(.text+0x164da): In function `dcache_seq_next':
>> >> : undefined reference to `__divdi3'
>> >> make: *** [.tmp_vmlinux1] Error 1
>> > 
>> > I only tested on x86-64 sorry which has 64bit division of course.
>> > 
>> > Just casting the *pos in dcache_seq_next to u32 should do the trick as we 
>> > don't care about large file support for that file
>> 
>> OK, that fixed it - thanks. Pretty much equivalent, if anything, it's
>> a tad slower, but hardly measurable.
> 
> Is that comparing to the previous patch or straight 2.5.63 ? 
> 
> If it's "only in the noise slower" I would consider it a win already
> because it uses much less memory than the old code.

Was 63-mjb2 vs 63-mjb2+your new patch (ie nothing touching the old
patch). Is it possible to split out the shrinkage from the new cache
algorithm? Maybe we should auto-tune sizes based on the amount of mem
in the machine or something ?

M.

Re: [Lse-tech] [PATCH] New dcache / inode hash tuning patch

From: Martin J. B. <mb...@ar...> - 2003-03-01 19:44:45

>>> OK, that fixed it - thanks. Pretty much equivalent, if anything, it's
>>> a tad slower, but hardly measurable.
>> 
>> Is that comparing to the previous patch or straight 2.5.63 ? 
>> 
>> If it's "only in the noise slower" I would consider it a win already
>> because it uses much less memory than the old code.
> 
> Was 63-mjb2 vs 63-mjb2+your new patch (ie nothing touching the old
> patch). Is it possible to split out the shrinkage from the new cache
> algorithm? Maybe we should auto-tune sizes based on the amount of mem
> in the machine or something ?

Oh, I forgot to include the diff (+ worse with your patch, - better)

       252     1.5% total
       235     5.0% default_idle
        71     9.2% d_lookup
        34     2.4% do_anonymous_page
        27    17.8% path_lookup
        23     0.0% d_hash
        17     3.7% vm_enough_memory
        10    11.1% link_path_walk
        10     7.0% do_schedule
        10    22.7% do_page_cache_readahead
...
       -10   -13.0% do_wp_page
       -14    -6.8% file_move
       -35   -12.8% __fput
       -96    -4.5% .text.lock.file_table

Re: [Lse-tech] [PATCH] New dcache / inode hash tuning patch

From: Andi K. <ak...@su...> - 2003-03-01 19:48:04

> Oh, I forgot to include the diff (+ worse with your patch, - better)
> 
>        252     1.5% total
>        235     5.0% default_idle
>         71     9.2% d_lookup

That's probably noise then. 


-Andi

Re: [Lse-tech] [PATCH] New dcache / inode hash tuning patch

From: Martin J. B. <mb...@ar...> - 2003-03-01 19:50:06

>>>> OK, that fixed it - thanks. Pretty much equivalent, if anything, it's
>>>> a tad slower, but hardly measurable.
>>> 
>>> Is that comparing to the previous patch or straight 2.5.63 ? 
>>> 
>>> If it's "only in the noise slower" I would consider it a win already
>>> because it uses much less memory than the old code.
>> 
>> Was 63-mjb2 vs 63-mjb2+your new patch (ie nothing touching the old
>> patch). Is it possible to split out the shrinkage from the new cache
>> algorithm? Maybe we should auto-tune sizes based on the amount of mem
>> in the machine or something ?
> 
> Oh, I forgot to include the diff (+ worse with your patch, - better)
> 
>        252     1.5% total
>        235     5.0% default_idle
>         71     9.2% d_lookup
>         34     2.4% do_anonymous_page
>         27    17.8% path_lookup
>         23     0.0% d_hash
>         17     3.7% vm_enough_memory
>         10    11.1% link_path_walk
>         10     7.0% do_schedule
>         10    22.7% do_page_cache_readahead
> ...
>        -10   -13.0% do_wp_page
>        -14    -6.8% file_move
>        -35   -12.8% __fput
>        -96    -4.5% .text.lock.file_table

Sorry ... must be sleepy still ... above was for kernbench. Below is 
for SDET 64 (the dcache stuff does seem significantly more expensive).

      1508     3.4% total
       750     2.3% default_idle
       221    10.8% __down
       117    18.5% __wake_up
       115    44.4% d_lookup
        55    10.8% zap_pte_range
        42    22.8% atomic_dec_and_lock
        37    20.1% __copy_to_user_ll
        34     7.9% page_remove_rmap
        27     3.3% do_schedule
        25   138.9% .text.lock.dcache
        25    22.9% path_lookup
        24     8.7% release_pages
        23    67.6% buffered_rmqueue
        23     9.1% get_empty_filp
        23    16.8% __fput
        22    34.9% copy_process
        22    37.3% pte_alloc_one
        21    26.2% clear_page_tables
        19   118.8% vma_link
        19    35.8% __brelse
        19    28.8% .text.lock.dec_and_lock
        16    28.1% __find_get_block_slow
        15    23.1% copy_mm
        14     0.0% d_rehash
        13    34.2% kmem_cache_free
        13    37.1% __insert_inode_hash
        13   108.3% prune_dcache
...
       -10   -16.7% vm_enough_memory
       -11   -15.3% link_path_walk
       -11   -10.8% free_pages_and_swap_cache
       -11   -26.2% file_ra_state_init
       -11    -9.3% do_page_fault
       -12   -36.4% dentry_open
       -13   -39.4% generic_file_aio_write_nolock
       -14   -15.6% file_move
       -14   -35.9% exit_notify
       -15   -14.2% do_no_page
       -16   -22.5% follow_mount
       -17   -11.3% do_wp_page
       -20   -26.0% filemap_nopage
       -33    -9.0% find_get_page
       -41    -9.4% .text.lock.file_table
       -43   -38.7% __find_get_block
       -45   -67.2% current_kernel_time

Re: [Lse-tech] [PATCH] New dcache / inode hash tuning patch

From: Andi K. <ak...@su...> - 2003-03-01 19:55:16

> 
> Sorry ... must be sleepy still ... above was for kernbench. Below is 
> for SDET 64 (the dcache stuff does seem significantly more expensive).
> 
>       1508     3.4% total
>        750     2.3% default_idle
>        221    10.8% __down
>        117    18.5% __wake_up
>        115    44.4% d_lookup

No + or - entry for d_lookup. This means it didn't change at all ?

-Andi

Re: [Lse-tech] [PATCH] New dcache / inode hash tuning patch

From: Martin J. B. <mb...@ar...> - 2003-03-01 20:44:05

>> Sorry ... must be sleepy still ... above was for kernbench. Below is 
>> for SDET 64 (the dcache stuff does seem significantly more expensive).
>> 
>>       1508     3.4% total
>>        750     2.3% default_idle
>>        221    10.8% __down
>>        117    18.5% __wake_up
>>        115    44.4% d_lookup
> 
> No + or - entry for d_lookup. This means it didn't change at all ?

No ... + is implicit, ie it was 115 more ticks (at 100Hz) ... 44% more 
ticks with your patch than without.

M.

Re: [Lse-tech] [PATCH] New dcache / inode hash tuning patch

From: Andi K. <ak...@su...> - 2003-03-01 20:57:00

On Sat, Mar 01, 2003 at 12:44:00PM -0800, Martin J. Bligh wrote:
> >> Sorry ... must be sleepy still ... above was for kernbench. Below is 
> >> for SDET 64 (the dcache stuff does seem significantly more expensive).
> >> 
> >>       1508     3.4% total
> >>        750     2.3% default_idle
> >>        221    10.8% __down
> >>        117    18.5% __wake_up
> >>        115    44.4% d_lookup
> > 
> > No + or - entry for d_lookup. This means it didn't change at all ?
> 
> No ... + is implicit, ie it was 115 more ticks (at 100Hz) ... 44% more 
> ticks with your patch than without.

64K seems to be definitely too small for you.

128 or 256 ?

-Andi

Re: [Lse-tech] [PATCH] New dcache / inode hash tuning patch

From: Martin J. B. <mb...@ar...> - 2003-03-02 00:15:32

>> >> Sorry ... must be sleepy still ... above was for kernbench. Below is 
>> >> for SDET 64 (the dcache stuff does seem significantly more expensive).
>> >> 
>> >>       1508     3.4% total
>> >>        750     2.3% default_idle
>> >>        221    10.8% __down
>> >>        117    18.5% __wake_up
>> >>        115    44.4% d_lookup
>> > 
>> > No + or - entry for d_lookup. This means it didn't change at all ?
>> 
>> No ... + is implicit, ie it was 115 more ticks (at 100Hz) ... 44% more 
>> ticks with your patch than without.
> 
> 64K seems to be definitely too small for you.
> 
> 128 or 256 ?

Tried 256. Times about the same.

diffprofile between straight -mjb2 and -mjb2 with patch&256K 
(+ worse with patch, - better). Hmmm... maybe current_kernel_time diff
is me switching off atime recently, though I thought that was before the 
baseline ... 

      1146     2.4% total
      1007     2.9% default_idle
        45    16.5% d_lookup
        39    51.3% grab_block
        31     4.7% __wake_up
        29    14.9% __copy_to_user_ll
        17     0.0% d_hash
        16    29.1% .text.lock.dec_and_lock
        16     0.7% __down
        11    10.4% __find_get_block
        10    30.3% .text.lock.dcache
        10     2.1% copy_page_range
...
        -9   -11.4% filemap_nopage
        -9    -7.8% do_page_fault
       -10    -7.0% do_wp_page
       -10   -50.0% mark_buffer_dirty
       -11    -2.5% .text.lock.file_table
       -11   -10.1% do_no_page
       -16   -15.5% ext2_free_blocks
       -18   -18.8% dput
       -25   -33.8% __find_get_block_slow
       -40   -11.1% find_get_page

diff from 64K to 256K ... (+ worse with 256K, - better).

       669     1.9% default_idle
       338     0.7% total
        61   113.0% grab_block
        44    60.3% __find_get_block
        12     5.7% __copy_to_user_ll
...
       -10    -2.2% .text.lock.file_table
       -10   -18.9% .text.lock.dcache
       -10   -18.5% path_release
       -10   -38.5% prune_dcache
       -10    -1.8% zap_pte_range
       -14    -2.8% copy_page_range
       -15   -17.4% .text.lock.dec_and_lock
       -17    -5.1% find_get_page
       -18   -11.2% __fput
       -19   -19.2% clear_page_tables
       -24   -11.1% atomic_dec_and_lock
       -25   -33.8% __find_get_block_slow
       -25   -18.5% path_lookup
       -37    -5.1% __wake_up
       -81   -20.3% d_lookup
       -95    -4.2% __down

Re: [Lse-tech] [PATCH] New dcache / inode hash tuning patch

From: Andi K. <ak...@su...> - 2003-03-01 19:47:18

> Was 63-mjb2 vs 63-mjb2+your new patch (ie nothing touching the old
> patch). Is it possible to split out the shrinkage from the new cache
> algorithm? Maybe we should auto-tune sizes based on the amount of mem
> in the machine or something ?

I would prefer to only have CONFIG_SMALL (very small hash table
for embedded stuff etc.) and the normal 64-128K hash table (64K =16K
buckets) More memory in the machine doesn't usually mean more cache, and 
it is as well a cache size thing.

Your experiments seem to suggest that the 64K table performs similar
to the 1MB table that plain 2.5.63 uses on your box. I was toying
with the idea of increasing it to 128K, need to benchmark that, but
definitely not above that because 256K on 64bit machines would 
be already excessive.

-Andi

Re: [Lse-tech] [PATCH] New dcache / inode hash tuning patch

From: Martin J. B. <mb...@ar...> - 2003-03-01 20:45:41

>> Was 63-mjb2 vs 63-mjb2+your new patch (ie nothing touching the old
>> patch). Is it possible to split out the shrinkage from the new cache
>> algorithm? Maybe we should auto-tune sizes based on the amount of mem
>> in the machine or something ?
> 
> I would prefer to only have CONFIG_SMALL (very small hash table
> for embedded stuff etc.) and the normal 64-128K hash table (64K =16K
> buckets) More memory in the machine doesn't usually mean more cache, and 
> it is as well a cache size thing.
> 
> Your experiments seem to suggest that the 64K table performs similar
> to the 1MB table that plain 2.5.63 uses on your box. I was toying
> with the idea of increasing it to 128K, need to benchmark that, but
> definitely not above that because 256K on 64bit machines would 
> be already excessive.

OK, I can try your patch with the 1Mb original hash table size ... I'd
rather bench the changes to hash size seperately from those to the hash
algorithm.

M.