New I/O Layer, Sparse Files and Sub-Second Timestamps
New I/O Layer
A completely new I/O layer abstraction replaces the ubiquitous use of memory-mapped files across the DwarFS code base. Memory mapping is still the default, but processing is done in "segments" rather than in whole files. This required a significant number of changes (this release adds or touches more than 5,000 lines of code and almost 10,000 lines of new tests) in almost every part of the code that processes file data, code that previously assumed any file could simply be accessed as a contiguous piece of memory.
In the new abstraction layer, backends are pluggable and configurable through the DWARFS_IOLAYER_OPTS environment variable, letting you:
- Configure the size up to which files are mapped "eagerly", i.e. as a whole and not in segments. This is mostly relevant for 32-bit systems, on which this is set to a reasonable default (32 MiB).
- Switch from mmap() to classic read() for maximum robustness on unreliable storage or faulty hardware.
The latter is relevant if you're seeing "bus errors" (SIGBUS), as many users have in the past ([#45], [#50], [#108], [#163], [#213]). You can switch to the read-based backend using:
$ export DWARFS_IOLAYER_OPTS=open_mode=read
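If you would rather not export the variable for the whole shell session, it can also be set for a single command. The following is a minimal sketch; the helper name run_with_read_io is hypothetical, and the image and output names in the usage comment are placeholders.

```shell
# Run one command with the read-based I/O backend, leaving the rest of
# the shell session untouched. The function name is illustrative only.
run_with_read_io() {
  DWARFS_IOLAYER_OPTS=open_mode=read "$@"
}

# Usage (placeholder image and output directory):
#   run_with_read_io dwarfsextract -i image.dwarfs -o outdir
```

Because the assignment is a prefix of the command, it only affects the environment of that one invocation.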
Sparse File Support
This release includes end-to-end sparse file support:
- mkdwarfs detects holes in files and preserves sparseness in the image.
- The FUSE driver exposes sparsity via lseek() where supported (Linux, FreeBSD).
- dwarfsextract writes sparse files and preserves them when targeting archive formats that support holes (e.g., tar).
Compatibility: images containing sparse files require DwarFS ≥ 0.14.0. You can use --no-sparse-files to explicitly treat sparse files as non-sparse and keep compatibility with older versions. If your input does not contain sparse files, the images remain backwards-compatible even without that flag.
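To get a feel for whether a given input file is sparse, you can compare its apparent size with the space actually allocated for it. The sketch below assumes GNU coreutils (truncate, stat -c) on Linux and uses a temporary demo file; it is a rough heuristic, not the detection logic used by mkdwarfs.

```shell
# Create a 16 MiB file that consists of a single hole (GNU truncate).
demo=$(mktemp)
truncate -s 16M "$demo"

# Apparent size in bytes, and the space actually allocated on disk
# (%b = number of allocated blocks, %B = size of each such block).
apparent=$(stat -c %s "$demo")
allocated=$(( $(stat -c %b "$demo") * $(stat -c %B "$demo") ))

echo "apparent:  $apparent bytes"
echo "allocated: $allocated bytes"

# Allocated space below the apparent size suggests the file has holes.
if [ "$allocated" -lt "$apparent" ]; then
  echo "file looks sparse"
fi

rm -f "$demo"
```

Whether the demo file is actually stored sparsely depends on the underlying filesystem, and filesystems with transparent compression can also report less allocated space for non-sparse files.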
Sparse file support matrix
| Feature / OS | Linux | FreeBSD | macOS | Windows |
|---|---|---|---|---|
| Reading sparse files (mkdwarfs) | ✅ | ✅ | ✅ | ✅ |
| Writing sparse files (dwarfsextract) | ✅ | ✅ | ✅ | ✅ |
| Exposing sparse files via FUSE layer | ✅ | ✅ | ❌ | ❌ |
FUSE-level sparseness requires lseek() support in the FUSE implementation (currently Linux and FreeBSD). On macOS and Windows, files are exposed via FUSE as non-sparse, even though dwarfsextract can still write sparse files when extracting. Missing lseek() support is tracked in separate issues for Windows and for macOS.
Sub-second timestamps
Timestamp resolution is now configurable down to nanoseconds using --time-resolution. The default remains one second. This is fully backwards-compatible: older DwarFS versions can read images with sub-second resolution, but will ignore the sub-second part.
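Before asking for a finer resolution, it can help to check whether the source filesystem stores sub-second timestamps at all. The following is a small sketch assuming GNU date; the helper name mtime_ns is hypothetical.

```shell
# Print a file's modification time as seconds.nanoseconds (GNU date).
mtime_ns() {
  date -r "$1" +%s.%N
}

# Demonstrate on a freshly created temporary file.
sample=$(mktemp)
ts=$(mtime_ns "$sample")
echo "mtime: $ts"
rm -f "$sample"
```

If the fractional part is all zeros for every file you check, the source filesystem probably stores whole seconds only, and a finer resolution would not capture any additional information.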
Bug fixes
- Leading dots in --input-list file paths were incorrectly treated as literal directory names instead of being expanded. This has been fixed. Fixes [#292].
- The SPDX license identifier in GPL-licensed source files was incorrectly specified as GPL-3.0-only instead of GPL-3.0-or-later. This has been corrected. Fixes [#275].
- Fixed an off-by-one error when recovering self_index fields in metadata, which could cause the sentinel directory to have a non-zero self_entry. While harmless by itself (since that entry is never actually used), this would cause the metadata consistency check to fail. The fix covers three aspects: correcting the off-by-one error; ensuring the self_entry recovery code does not run for the sentinel directory; and changing the metadata consistency check to only warn about a non-zero self_entry rather than fail. Running mkdwarfs with --rebuild-metadata will also reset a non-zero sentinel self_entry to zero.
- Fixed the implementation of the read operation in the FUSE driver to send positive error code values to libfuse. This was likely never triggered in practice, but in cases where parts of the filesystem image vanish while being accessed (which previously caused SIGBUS crashes), libfuse would not understand the negative error codes.
- Moved the FUSE driver binaries from sbin to bin and kept only the mount.dwarfs/mount.dwarfs2 symlinks in sbin. This better aligns with user expectations, other FUSE drivers, and the fact that the man pages are installed in section 1. (Thanks to Ahmad Khalifa for the fix.)
- The dwarfs2 binary was broken in builds using shared libraries. (Thanks to Ahmad Khalifa for the fix.)
- When setting CPU thread affinity for worker group threads via DWARFS_WORKER_GROUP_AFFINITY, the code did not CPU_ZERO the cpu_set_t structure before setting individual CPUs. This could pin threads to random CPUs in addition to the requested ones.
- The FITS categorizer would scan entire files for the end-of-header marker if their size was a multiple of 2880 bytes, causing significant slowdowns on large non-FITS files. Additional checks now ensure scanning only continues if the data truly looks like a standards-compliant FITS header.
- GCC caught a potential null-pointer dereference on error when opening a file in mkdwarfs. This has been fixed.
- Numerous fixes for 32-bit architectures, mostly related to integer overflows with file sizes larger than 4 GiB.
- Another off-by-one error caused the first regular file inode to be excluded from the file-size cache. This would be hard to notice unless that file was highly fragmented. The cache will be fixed when rebuilding the metadata.
- The FUSE driver’s enable_nlink option is now the default behavior and cannot be disabled. The previous optimization skipped building a table of hardlink counts, which produced inherently incorrect file status information (hardlinked files share an inode, so reporting a link count of 1 is wrong). The hardlink table is now stored in the metadata by default; if there are no hardlinks, it consumes no space. You can still omit the hardlink table with --no-hardlink-table, at the cost of building it on-the-fly when the filesystem image is loaded (typically fast, e.g., ~300 ms for 14 million files).
- Fixed a typo in dwarfs-format.md. (Thanks to Dennis Brakhane for spotting this and sending a PR.)
Features
- New I/O layer abstraction that supports “classic” mmap-based file access, granular mmap-based access on 32-bit systems, and fully mmap-less access if desired. This applies to all DwarFS tools. By default, tools use the most efficient method: memory-mapping whole files on 64-bit systems and mapping file segments on 32-bit systems (to conserve address space). This can be controlled via the new DWARFS_IOLAYER_OPTS environment variable described in dwarfs-env(7).
- Full support for sparse files. mkdwarfs now detects and efficiently processes sparse files, skipping holes where possible and preserving them in the filesystem image. This is supported on all platforms. The FUSE driver implements lseek() where supported by the FUSE library (currently Linux and FreeBSD); Windows and macOS fall back to showing files as non-sparse. dwarfsextract extracts sparse files as such and preserves sparse representations when extracting to archive formats that support them (e.g., tar). Note: Sparse file support is not backwards compatible; images containing sparse files cannot be processed by DwarFS versions prior to 0.14.0. By default, mkdwarfs enables sparse file support if it detects sparse input. Use --no-sparse-files to disable it and ensure compatibility with older versions.
- Support for subsecond timestamp resolution. The default remains one second, but finer resolutions (down to nanoseconds) can be specified with --time-resolution. mkdwarfs will warn if the requested resolution is finer than the native filesystem resolution. This is fully backwards compatible: older DwarFS versions will handle such images but ignore the subsecond parts. Fixes [#294].
- Desktop integration for Linux. A new --auto-mountpoint option automatically creates or selects a mount-point directory, making it easier to mount DwarFS images from file managers. Desktop files and MIME type definitions are now installed to enable double-click mounting of .dwarfs files. (Thanks to Ahmad Khalifa for the implementation.)
- Shell completion for mkdwarfs (bash and zsh). (Thanks to Ahmad Khalifa for the contribution.)
- Improved error handling when DwarFS tools encounter SIGBUS (usually caused by accessing memory-mapped files on unreliable or faulty storage like network shares or flaky USB drives). When SIGBUS is caught, tools now print an error suggesting switching from mmap- to read-based I/O via DWARFS_IOLAYER_OPTS.
- dwarfsck now checks metadata consistency by default (unless --no-check is given), improving detection of filesystem image corruption.
- If sparse files are supported by the FUSE library, the FUSE driver exposes new options cache_sparse and no_cache_sparse to control whether sparse files should be cached in the kernel page cache. See dwarfs(1) for details.
- The JSON output from dwarfsck now contains a complete raw metadata dump when the detail level includes metadata_full_dump.
- dwarfsck no longer artificially limits string sizes when dumping metadata. (Thanks to Dennis Brakhane for the contribution.)
- Accelerated search for the start of a DwarFS image in files with custom headers; the new code is about four times faster, scanning at more than 6 GiB/s on a modern CPU.
- The cache size can now be configured for dwarfsck, useful with the --checksum option.
- Both dwarfsck and dwarfsextract now limit the amount of data requested from the filesystem image at once to avoid exhausting memory (and virtual address space on 32-bit systems).
- Improved self-extracting binary stub with better compatibility for qemu, binfmt_misc, and old kernels. The stub now works on Linux kernels as old as 2.6.21 (and possibly older), and it now uses nanoprintf to further reduce binary size.
- The FUSE driver will now show the name of the mounted file system image in the mount point listing (e.g., in df or mount output) on Linux, FreeBSD and macOS, as well as the filesystem subtype (dwarfs) on Linux and FreeBSD.
Compatibility
- The accepted minor version for the DwarFS image format has been incremented. Release v0.16.0 will also increment the written minor version. This means images produced with v0.16.0 will not be readable by DwarFS tools prior to v0.14.0. See the “Features” section in dwarfs-format(7) for details.
- The (no_)cache_image option has been removed from the FUSE driver.
Docs
- Added documentation on manual FSST decoding to dwarfs-format.md. (Thanks to Dennis Brakhane for the PR.)
- Several cleanups and additions to dwarfs-format.md, including a glossary of terms, clarification of blocks vs. sections, and descriptions of compatibility handling via features, plus details on the representation of sparse files and hardlinks.
- New manual page dwarfs-env(7) documenting DwarFS-specific environment variables.
Build
- Removed the dependency on Boost.Iostreams and the hard dependency on Boost.System.
- Removed the hard dependency on the date library, which caused build issues on distributions that no longer bundle it (e.g., SUSE).
- The build system now creates symbolic mount links at install time rather than in the build directory.
Test
- Significantly improved test coverage.
New Contributors
- @thekhalifa made their first contribution in [#277]
- @brakhane made their first contribution in [#296]
Full Changelog: https://github.com/mhx/dwarfs/compare/v0.13.0...v0.14.0