I've got a fairly substantial APFS file (compressed to 15 GB in 7z) that by default isn't recognized by 7-Zip as an APFS file. What's the best way to send you this file?
I debugged it down to two places in the APFS handler code:
The first, starting at line 2823 (your "is this possible" comment in the code): the answer is yes, it's possible. Uncommenting the continue gets me past this issue and onto the next one.

The next issue is here in Open2() around line 2382: sb.max_file_systems is 100, and apparently once i gets to 3, we come to OpenVolume around line 2563. Here index is -1, and it returns false, so the entire file is deemed unrecognizable as APFS despite there being multiple volumes that parse OK. If I allow the code to continue past the loop and ignore the S_FALSE error return, the operations complete and I can actually parse and retrieve the contents of the APFS file.
The APFS handler seems to decide that any error at all, even a minor one, results in a "not an archive" outcome, and you get nothing.
Can you make or find a "minimal reproducible example" of this issue?
If you have VMware, just run a Mac image, create a small volume, add some files to it, then delete a few files, and then add some more. Just enough to scramble the volume's btree a bit. If you want a real challenge, make a snapshot and then update some files. Then you'll have multiple INODEs with the same OID and different XIDs.
One of the fundamental problems is that the APFS filesystem isn't supposed to be read top-down from the start of the btree to the end. It's a hopscotch thing. You start at the two known INODEs (2 and 3), read their DIR_REC entries, recurse in for directories, and for non-directories, you read each INODE and its props.
Non-directory INODEs have a private_id which, if it's not the same as the INODE's OID, points to the OID where you'll find the DSTREAM file extents.
There are also four kinds of compression algorithms (zlib, LZVN, LZFSE, LZBITMAP) to handle if you find a decmpfs attribute record.
Those attribute records can also point to DSTREAMs externally if the data is too big.
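As a sketch of how that decmpfs dispatch might look (the numeric method IDs, type names, and helper below are my own reading of the decmpfs conventions, not taken from the 7-Zip handler): odd/even pairs distinguish a payload stored inline in the attribute from one stored in the resource fork.

```cpp
#include <cassert>
#include <cstdint>

enum class Algo { None, Zlib, Lzvn, Lzfse, LzBitmap };

// Hypothetical mapping from a decmpfs compression_type to the algorithm
// and to where the payload lives (true = resource fork, false = inline
// in the decmpfs xattr). Verify these IDs against the handler before
// relying on them.
struct Method { Algo algo; bool inResourceFork; };

Method DecodeMethod(uint32_t type)
{
  switch (type)
  {
    case 3:  return { Algo::Zlib,     false };
    case 4:  return { Algo::Zlib,     true  };
    case 7:  return { Algo::Lzvn,     false };
    case 8:  return { Algo::Lzvn,     true  };
    case 11: return { Algo::Lzfse,    false };
    case 12: return { Algo::Lzfse,    true  };
    case 13: return { Algo::LzBitmap, false };
    case 14: return { Algo::LzBitmap, true  };
    default: return { Algo::None,     false };
  }
}
```

Under this reading, the "method = 8" failure discussed later in this thread would be LZVN data stored in the resource fork.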
In addition to symlink attribute records, you also need to look for DREC_EXT_TYPE_SIBLING_ID records on a DIR_REC entry, which tell you that it's a hardlink. For example, a DIR_REC entry might name a file "123.png" and point to OID 9987, but INODE 9987's filename is "generic_image.png", and you find that several other DIR_REC entries in other directories also point to that same INODE 9987 under the filenames "789.png" and "456.png".
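The hardlink situation above can be sketched as grouping every DIR_REC name by the inode it targets (the record types here are illustrative stand-ins, not the handler's):

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Illustrative directory entry: a name pointing at an inode by OID,
// optionally carrying a DREC_EXT_TYPE_SIBLING_ID extension.
struct DirRec { std::string name; uint64_t inodeOid; bool hasSiblingId; };

// Group directory-entry names by target inode. An inode referenced by
// several DIR_RECs (each with a sibling id) is a hardlink, and each
// entry's name can differ from the name stored on the inode itself.
std::map<uint64_t, std::vector<std::string>>
NamesByInode(const std::vector<DirRec> &recs)
{
  std::map<uint64_t, std::vector<std::string>> names;
  for (const DirRec &r : recs)
    names[r.inodeOid].push_back(r.name);
  return names;
}
```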
Without writing too much pseudo-code, a recursion loop for reading the entire structure into a CObjectVector of Inode entries would look like this:
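A minimal sketch of that recursion, with hypothetical record types standing in for the real ones (this is my reading of the walk described above, not the 7-Zip code): start at the two known inodes, recurse through DIR_REC entries for directories, and collect the inode for everything else.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Illustrative stand-ins for the filesystem records described above.
struct DirRec { std::string name; uint64_t childOid; bool isDir; };
struct InodeRec { uint64_t oid; uint64_t privateId; };

struct Fs
{
  std::map<uint64_t, std::vector<DirRec>> dirEntries; // dir OID -> children
  std::map<uint64_t, InodeRec> inodes;                // OID -> inode record
};

// Depth-first walk: recurse into directories; for files, read the inode
// (whose private_id, when it differs from the OID, names the object that
// holds the DSTREAM extents) and collect it.
static void Walk(const Fs &fs, uint64_t oid, std::vector<InodeRec> &out)
{
  auto dir = fs.dirEntries.find(oid);
  if (dir == fs.dirEntries.end())
    return;
  for (const DirRec &d : dir->second)
  {
    if (d.isDir)
      Walk(fs, d.childOid, out);
    else
    {
      auto ino = fs.inodes.find(d.childOid);
      if (ino != fs.inodes.end())
        out.push_back(ino->second);
    }
  }
}

std::vector<InodeRec> ReadAll(const Fs &fs)
{
  std::vector<InodeRec> out;
  // Inodes 2 and 3 are the two known starting points mentioned above.
  Walk(fs, 2, out);
  Walk(fs, 3, out);
  return out;
}
```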
For the btree itself, as you're reading it, you'll take the OID with the highest XID.
Around line 1939:
We have two matching OIDs in here with different values, which causes the loop to end prematurely; that's why volume index 3 in the loop returns -1 and fails. I was missing about 400k items from the APFS file because of this.
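The selection rule from above — when several entries share an OID, keep the one with the largest XID rather than stopping at the first duplicate — can be sketched like this (hypothetical types, not the 7-Zip code; a real reader would also cap the XID at the snapshot's transaction id):

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical omap entry: one (oid, xid) pair mapping to a physical block.
struct OmapEntry { uint64_t oid; uint64_t xid; uint64_t paddr; };

// For each OID, keep only the entry with the highest XID, so duplicate
// OIDs from snapshots are resolved instead of aborting the scan.
std::map<uint64_t, OmapEntry> LatestByOid(const std::vector<OmapEntry> &entries)
{
  std::map<uint64_t, OmapEntry> best;
  for (const OmapEntry &e : entries)
  {
    auto it = best.find(e.oid);
    if (it == best.end() || e.xid > it->second.xid)
      best[e.oid] = e;
  }
  return best;
}
```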
Line 2542:

    ReadObjectMap(apfs.omap_oid, &vol, map.Omap);

doesn't check the return value, so it didn't detect the early abort.
I don't want to download a big archive file.
So now I'm not ready to debug or change that code.
What size APFS file could I construct that'd be acceptable?
The APFS spec definitely talks about multiple transaction IDs assigned to a single OID, and says to take the highest one under normal circumstances.
So ... making some progress, but hit another barrier. This time, when decompressing resource forks.
For method = 8, we hit the LzfseDecoder, and the decompression loop exits properly, but the final 7-extra-byte check fails on line 244:

    // LZVN encoder writes 7 additional zero bytes
    if (packSize != 7)
      return S_FALSE;
    do
    {
      Byte b;
      if (!m_InStream.ReadByte(b))
        return S_FALSE;
      packSize--;
      if (b != 0)
        return S_FALSE;
    }
    while (packSize != 0);
    return S_OK;
On entry, unpackSize = 2396, packSize = 1622
On exiting the loop and checking afterward, unpackSize = 0 and packSize = 63
On exiting, the packSize remainder is different for each file, but it's consistently not 7.
I suppose I already changed that code after v23.01.
The APFS file I'm testing looks like a hot mess. I have DIR_REC entries before INODEs, FILE_EXTENT records and DSTREAMs with no INODE, INODEs with a private_id less than, equal to, and greater than the current ID ... it's just crazy.
I also have FILE_EXTENT records with valid length and type, but a position of 0.
What way was that APFS created?
What software was used?
I've actually had a big breakthrough just now ... All this time I've been opening the file through the VMDK, and operating on the file within 7z through the VMDK handler. I extracted out the APFS file instead just a few minutes ago, and suddenly things are looking much better and I'm not getting the failures I was seeing before.
Did you use 7-Zip to extract from the VMDK?
Yes.
grrr ... may be a red herring ... still digging. I had a defect in my code where it wasn't asking for streams until it got to subarchives, and the APFS file has no subarchives.
Bah, still a hot mess. I'll be less spammy in the future.
Still a hot mess. Attached are my code changes so far, which I'm sure still aren't right. I've also included a log from running on that APFS file with SHOW_DEBUG_INFO enabled.
I found a reference implementation of APFS here: https://github.com/Paragon-Software-Group/paragon_apfs_sdk_ce
It's pretty clean, and seems to work really well. If anything, it's answered some questions where the Apple APFS documentation is open to interpretation.