[Squashfs-devel] [PATCH] squashfs: add config for metadata cache size

The metadata cache size is hard-coded to 8 metadata blocks.  A large
parallel workload can cause heavy spinlock thrashing in
squashfs_cache_get when the number of cached metadata blocks is smaller
than the number of parallel metadata reads: decompression time can keep
the metadata cache full, and squashfs_cache_get uses a simple spinlock
to synchronize the cache.  Allow the cache size to be tuned by adding
CONFIG_SQUASHFS_METADATA_CACHE_SIZE, which defaults to the old
hard-coded value of 8.  A good setting for systems with plenty of
memory is a bit larger than the expected number of parallel readers on
a single squashfs.  For highly memory-constrained systems, a smaller
setting may be appropriate.
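
As a rough guide to the memory side of that trade-off, the sketch below
multiplies a candidate cache size by the 8 KiB metadata block size.  The
function name cache_kib is made up for illustration, and the estimate
assumes one 8 KiB buffer per cache entry per mounted squashfs, ignoring
any bookkeeping overhead.

```shell
# Hypothetical helper: estimate metadata cache memory per mounted squashfs.
# Assumption: one 8 KiB buffer per cache entry; overhead is ignored.
cache_kib() {
    echo $(( $1 * 8 ))
}

# Print the estimate for a few candidate cache sizes.
for n in 8 16 32 64; do
    echo "entries=$n -> $(cache_kib "$n") KiB"
done
```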

This issue was discovered on an embedded system, where a large increase
in boot time was noticed when the system was moved from an 8 core
(4 physical) machine to a 16 core (8 physical) machine.  Much of the CPU
time was being spent spinning in the spin_lock call in
squashfs_cache_get.  The metadata cache is fixed at 8 entries, and
having more cores allowed more parallel file system walks (which happen
to be part of one of our service start scripts for each of our many
parallel services).  Each time a cache entry is released, all waiting
cores are woken to try to grab a cache entry, so those cores fight over
the spinlock only to find out that they are not going to get one.

While this commit isn't a general solution, it does provide a simple way
for one to configure their kernel to alleviate the performance issue.

A better solution would be to use a less CPU-intensive, preemptible
synchronization method, and to wake only one waiter when a cache entry
becomes available.

Others have pointed out this issue:

http://lkml.iu.edu/hypermail/linux/kernel/1805.0/01702.html

The following describes a similar issue on the data cache and points
out many of the same technical problems with squashfs_cache_get:

https://chrisdown.name/2018/04/17/kernel-adventures-the-curious-case-of-squashfs-stalls.html

A simple way to reproduce and measure the time for various parallel
workloads (assuming a fairly large number of directories and files in
the squashfs):

    time ( N=16; for ((i=0;i<N;++i)); \
    do find /path/to/mounted/squash/ -print > /dev/null & done; \
    wait )

On one system, with N=8, the loop above takes 1 second of elapsed time,
but on the same system with N=16 it takes 13 seconds (about 2 seconds
would be expected if the time scaled linearly).
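
To see where the slowdown sets in on a given machine, the same
measurement can be swept over several values of N.  This is a sketch
only: the function names run_finds and sweep are invented here, the path
in the example is a placeholder, and elapsed time is taken from
date +%s, so the resolution is one second.

```shell
# Hypothetical helpers for sweeping the reproduction over several N.

# run_finds DIR N: start N parallel walks of DIR and wait for all of them.
run_finds() {
    rf_dir=$1
    rf_n=$2
    rf_i=0
    while [ "$rf_i" -lt "$rf_n" ]; do
        find "$rf_dir" -print > /dev/null &
        rf_i=$((rf_i + 1))
    done
    wait
}

# sweep DIR N...: print one "N=<n> elapsed=<seconds>s" line per value of N.
sweep() {
    sw_dir=$1
    shift
    for sw_n in "$@"; do
        sw_start=$(date +%s)
        run_finds "$sw_dir" "$sw_n"
        sw_end=$(date +%s)
        echo "N=$sw_n elapsed=$((sw_end - sw_start))s"
    done
}

# Example: sweep /path/to/mounted/squash/ 4 8 16 32
```

Comparing runs on kernels built with different
CONFIG_SQUASHFS_METADATA_CACHE_SIZE values should show whether the jump
moves to a higher N.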
---
 fs/squashfs/Kconfig       | 20 ++++++++++++++++++++
 fs/squashfs/squashfs_fs.h |  2 +-
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/fs/squashfs/Kconfig b/fs/squashfs/Kconfig
index 60fc98bdf4212..311e3141df9af 100644
--- a/fs/squashfs/Kconfig
+++ b/fs/squashfs/Kconfig
@@ -264,3 +264,23 @@ config SQUASHFS_FRAGMENT_CACHE_SIZE
 
 	  Note there must be at least one cached fragment.  Anything
 	  much more than three will probably not make much difference.
+
+config SQUASHFS_METADATA_CACHE_SIZE
+	int "Number of metadata blocks cached" if SQUASHFS_EMBEDDED
+	depends on SQUASHFS
+	default "8"
+	help
+	  By default SquashFS caches the last 8 metadata blocks read from the
+	  filesystem.  A metadata block is 8KiB.  Increasing this amount may
+	  mean SquashFS has to re-read metadata less often from disk, at the
+	  expense of extra system memory.  Decreasing this amount will mean
+	  SquashFS uses less memory at the expense of extra reads from disk.
+
+	  Note there must be at least one cached metadata block.  A setting too
+	  low with a large parallel workload can cause a lot of spinlock
+	  thrashing in squashfs_cache_get.  A good setting for the metadata
+	  cache size is something a bit larger than the number of expected
+	  parallel metadata reads.  When booting with multiple services on a
+	  single squashfs on a machine with a lot of cores, a higher setting
+	  than the default will net a large performance improvement by avoiding
+	  spinlock thrashing.
diff --git a/fs/squashfs/squashfs_fs.h b/fs/squashfs/squashfs_fs.h
index 95f8e89017689..c4e32358f922c 100644
--- a/fs/squashfs/squashfs_fs.h
+++ b/fs/squashfs/squashfs_fs.h
@@ -202,7 +202,7 @@ static inline int squashfs_block_size(__le32 raw)
 #define SQUASHFS_XATTR_OFFSET(A)	((unsigned int) ((A) & 0xffff))
 
 /* cached data constants for filesystem */
-#define SQUASHFS_CACHED_BLKS		8
+#define SQUASHFS_CACHED_BLKS		CONFIG_SQUASHFS_METADATA_CACHE_SIZE
 
 /* meta index cache */
 #define SQUASHFS_META_INDEXES	(SQUASHFS_METADATA_SIZE / sizeof(unsigned int))
-- 
2.34.1