Thread: [ext2resize] Re: Odd stride issues
From: Andreas D. <ad...@cl...> - 2002-09-11 23:42:52
[ note - I've CC'd the discussion to ext...@li... ]

On Sep 11, 2002 15:08 -0700, Robert Walsh wrote:
> If I create a filesystem without specifying a stride value, ext2online
> prints out warning messages:
>
> # mke2fs -j -R resize=19398656 /dev/sdc1 38797312
> # mount /dev/sdc1 /t
> # ext2online /t 19398656
> ext2online v1.1.18 - 2001/03/18 for EXT2FS 0.5b
> This is odd, the RAID stride is not constant at -1540!
> This is odd, the RAID stride is negative (-1540)!
> Using a RAID stride value of 0
>
> Sometimes the stride number is not negative, but it almost always
> complains that it's not constant.

Can you look at the output from dumpe2fs to see when/where/why the block
and inode bitmaps come after the inode table (which in my books means
that a stride was used). If not, then the "stride detection" code in
ext2online is broken for some reason... It's not supposed to be critical
stuff, mind you, just trying to keep the new groups laid out as the
original creator intended.

> The resize almost never completes when it prints the above errors and
> almost always completes when it doesn't. It gives this kernel message
> when it fails:
>
> verify_group_input: Block bitmap (11240450) in GDT table
> (11239424-11240451)

That would be badness then. Probably the user-space calcs are bad, and
not the check here, so the kernel is getting bad data. If you could
trace through where these calculations are going wrong, I can tell you
what the intention of them is, if you have questions.

> Another thing: the parser that picks apart the -R option arbitrarily
> breaks when certain parameter sequences are given to it. For example,
> before I modified it, it was choking on -R resize=19398656,stride=8 but
> working OK when you switched the order. The working order broke when I
> changed to a different stride value, etc. It seems like strtoul is not
> behaving itself and crapping on the end_ptr parameter inconsistently. I
> believe it's a piece of inlined code, so maybe it's doing the wrong
> thing.

That parameter doesn't actually work yet, for a reason that currently
escapes me. Something about us needing data (like blocksize, or blocks
per group, or something) that we don't have until inside libext2 where
the filesystem is being created... We need to go from "-R resize=foo"
to a number of group descriptor blocks that are passed in the
superblock to the libext2 create routines.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/
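For reference, an order-independent parse of a "-R" argument such as "resize=19398656,stride=8" could look roughly like the sketch below. This is not the mke2fs parser; struct r_opts and parse_r_opts() are hypothetical names invented for illustration. The only point it makes is that checking the strtoul() end pointer and errno after every value keeps the result independent of sub-option order.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical holder for the two -R sub-options named in the thread. */
struct r_opts {
    unsigned long resize;   /* 0 when not given */
    unsigned long stride;   /* 0 when not given */
};

/*
 * Parse e.g. "resize=19398656,stride=8" (either order).
 * Returns 0 on success, -1 on a malformed or unknown sub-option.
 */
int parse_r_opts(const char *arg, struct r_opts *out)
{
    char buf[128];
    char *tok;

    assert(arg != NULL && out != NULL);
    if (strlen(arg) >= sizeof(buf))
        return -1;
    strcpy(buf, arg);               /* work on a private, writable copy */
    out->resize = 0;
    out->stride = 0;

    tok = buf;
    while (tok != NULL) {
        char *next = strchr(tok, ',');
        char *eq, *end;
        unsigned long val;

        if (next != NULL)
            *next++ = '\0';         /* terminate this token */

        eq = strchr(tok, '=');
        if (eq == NULL || !isdigit((unsigned char)eq[1]))
            return -1;              /* no value, or value is not a number */
        *eq = '\0';                 /* split "key=value" */

        errno = 0;
        val = strtoul(eq + 1, &end, 10);
        if (errno != 0 || *end != '\0')
            return -1;              /* overflow or trailing junk */

        if (strcmp(tok, "resize") == 0)
            out->resize = val;
        else if (strcmp(tok, "stride") == 0)
            out->stride = val;
        else
            return -1;              /* unknown sub-option */
        tok = next;
    }
    return 0;
}
```

Calling parse_r_opts("stride=8,resize=19398656", &o) and parse_r_opts("resize=19398656,stride=8", &o) should fill o identically, which is the property the report above says was missing.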
From: Andreas D. <ad...@cl...> - 2002-09-13 19:38:11
On Sep 12, 2002 16:56 -0700, Robert Walsh wrote:
> Should the reserved blocks be reserved even in a block group that does
> not contain a super block backup? Or are they available for use in the
> block group?

The reserved blocks are only used in the groups that have a superblock
backup (and hence also a group descriptor table backup). They are not
touched (allocated or used) in groups that do not have backups.

In theory, the ext2resize/ext2online code should be able to handle having
arbitrary locations for the inode table and the bitmaps, but it may be
getting confused by the fact that there are no free blocks before the
inode table at all, in the case of a zero stride.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/
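The rule described above can be sketched in C as follows. It assumes the filesystem uses the sparse_super feature, under which backups live in groups 0 and 1 and in groups that are powers of 3, 5 or 7; without that feature every group carries a backup. The function names are hypothetical and are not the libext2fs API (the library's own check is ext2fs_bg_has_super(), mentioned elsewhere in this thread).

```c
#include <assert.h>

/* Return 1 if n is a power of base (base^0 == 1 counts), else 0. */
static int is_power_of(unsigned int n, unsigned int base)
{
    assert(base > 1);
    if (n == 0)
        return 0;
    while (n % base == 0)
        n /= base;
    return n == 1;
}

/* Assumption: sparse_super layout (groups 0, 1, 3^k, 5^k, 7^k). */
int group_has_backup(unsigned int group)
{
    if (group <= 1)
        return 1;
    return is_power_of(group, 3) || is_power_of(group, 5) ||
           is_power_of(group, 7);
}

/*
 * Reserved GDT blocks actually set aside in a given group: all of them
 * in a backup-holding group, none elsewhere.
 */
unsigned int reserved_gdt_in_group(unsigned int group,
                                   unsigned int s_reserved_gdt_blocks)
{
    return group_has_backup(group) ? s_reserved_gdt_blocks : 0;
}
```

As a plausibility check only: if the failing filesystem had 32768 blocks per group (an assumption, the thread does not state it), block 11239424 from the kernel message would be the first block of group 343 = 7^3, which this rule marks as a backup-holding group.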
From: Robert W. <rj...@du...> - 2002-09-13 21:06:33
So, here's what I've done so far:

I haven't touched mke2fs at all - I'm going to let it place the stuff
wherever it thinks it should.

I commented out the stride calculation code in ext2online. It now
assumes a stride of 0.

I twiddled ext2online's idea of where to put the block bitmap and inode
bitmap in groups with a sb backup. They now get placed right after the
inode table, which is exactly where mke2fs puts them.

It now works, although this exercise was really to validate my ideas and
I don't really consider it a solution. The real solution involves fixing
mke2fs so that it keeps a consistent layout, modifying the stride
calculations in ext2online so that the stride gets calculated correctly,
and modifying the subsequent layout so that it more closely matches the
fixed layout from mke2fs. I'll also try to ensure that even if it gets a
weird layout from mke2fs, ext2online will still be able to handle it.

I'll keep you informed of my progress and send you diffs to mke2fs when
I get something sane working.

Regards,
Robert.
--
Robert Walsh
Amalgamated Durables, Inc. - "We don't make the things you buy."
Email: rj...@du...
From: Robert W. <rj...@du...> - 2002-09-12 18:49:16
Attachments:
has_resize.txt
no_resize.txt
> Can you look at the output from dumpe2fs to see when/where/why the block
> and inode bitmaps come after the inode table (which in my books means
> that a stride was used). If not, then the "stride detection" code in
> ext2online is broken for some reason... It's not supposed to be critical
> stuff, mind you, just trying to keep the new groups laid out as the
> original creator intended.

Attached is the output of two different mke2fs's on a small loopback
file. One has the resize inode. The other doesn't. Neither had a
stride specified. As you can see, the one that has the resize inode has
a different layout depending on whether there is supposed to be a backup
super block there or not, which is what is causing ext2online to get
upset.

> > verify_group_input: Block bitmap (11240450) in GDT table
> > (11239424-11240451)
>
> That would be badness then. Probably the user-space calcs are bad, and
> not the check here, so the kernel is getting bad data. If you could
> trace through where these calculations are going wrong, I can tell you
> what the intention of them is, if you have questions.

Sure. I'm going to trace through this today, so I'll let you know as
soon as I get some interesting results.

I don't know if I made it clear yesterday, but this message never occurs
when the stride is specified _AND_ set insanely huge. Looking at the
mke2fs code, when the stride is set to an insanely huge value, it should
result in the start_blk for each block group being set back to the first
block in the block group. Hmm.

Regards,
Robert.
From: Andreas D. <ad...@cl...> - 2002-09-12 19:26:03
On Sep 12, 2002 11:49 -0700, Robert Walsh wrote:
> Attached is the output of two different mke2fs's on a small loopback
> file. One has the resize inode. The other doesn't. Neither had a
> stride specified. As you can see, the one that has the resize inode has
> a different layout depending on whether there is supposed to be a backup
> super block there or not, which is what is causing ext2online to get
> upset.

This may have something to do with how the libext2 code does block
allocation. I would think that it puts all of the inode tables at
the same offsets, but it might be missing a "reserved_gdt" count in
its initial calculations.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/
From: Robert W. <rj...@du...> - 2002-09-12 20:30:42
On Thu, 2002-09-12 at 12:24, Andreas Dilger wrote:
> On Sep 12, 2002 11:49 -0700, Robert Walsh wrote:
> > Attached is the output of two different mke2fs's on a small loopback
> > file. One has the resize inode. The other doesn't. Neither had a
> > stride specified. As you can see, the one that has the resize inode has
> > a different layout depending on whether there is supposed to be a backup
> > super block there or not, which is what is causing ext2online to get
> > upset.
>
> This may have something to do with how the libext2 code does block
> allocation. I would think that it puts all of the inode tables at
> the same offsets, but it might be missing a "reserved_gdt" count in
> its initial calculations.

Well, there is this in the code that parses the resize parameter:

    /* XXX param->s_res_gdt_blocks = resize - existing
    cur_groups = (resize - sb->s_first_data_block +
                  EXT2_BLOCKS_PER_GROUP(super) - 1) /bpg;
    cur_gdb = (cur_groups + gdpb - 1) / gdpb;
    */

I assume this was an attempt to set this up, right? But the problem is
figuring out what "existing" is. I also assume this was the missing
information you mentioned in your email yesterday:

> That parameter doesn't actually work yet, for a reason that currently
> escapes me. Something about us needing data (like blocksize, or blocks
> per group, or something) that we don't have until inside libext2 where
> the filesystem is being created... We need to go from "-R resize=foo"
> to a number of group descriptor blocks that are passed in the
> superblock to the libext2 create routines.

Anyway, I'm going to see if I can get the above bit working and if that
has an effect on the inode table offsets.

Regards,
Robert.
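The arithmetic in that commented-out fragment can be written out as a small stand-alone sketch. The function names are hypothetical, and reading "resize - existing" as "descriptor blocks needed at the target size minus descriptor blocks needed now" is an interpretation, not something the fragment states. The 32-byte descriptor size is that of the classic ext2 group descriptor.

```c
#include <assert.h>

#define GROUP_DESC_SIZE 32UL   /* bytes per classic ext2 group descriptor */

/* Number of block groups needed to cover 'blocks' blocks (rounded up). */
unsigned long groups_for_blocks(unsigned long blocks,
                                unsigned long first_data_block,
                                unsigned long bpg)
{
    assert(bpg > 0 && blocks >= first_data_block);
    return (blocks - first_data_block + bpg - 1) / bpg;
}

/* Number of group descriptor blocks needed for 'groups' groups. */
unsigned long gdt_blocks_for_groups(unsigned long groups,
                                    unsigned long block_size)
{
    unsigned long gdpb = block_size / GROUP_DESC_SIZE; /* descs per block */

    assert(gdpb > 0);
    return (groups + gdpb - 1) / gdpb;
}

/*
 * One possible meaning of "resize - existing": extra descriptor blocks
 * to reserve so the filesystem can grow from cur_blocks to resize_blocks.
 */
unsigned long reserved_gdt_blocks(unsigned long resize_blocks,
                                  unsigned long cur_blocks,
                                  unsigned long first_data_block,
                                  unsigned long bpg,
                                  unsigned long block_size)
{
    unsigned long want = gdt_blocks_for_groups(
        groups_for_blocks(resize_blocks, first_data_block, bpg), block_size);
    unsigned long have = gdt_blocks_for_groups(
        groups_for_blocks(cur_blocks, first_data_block, bpg), block_size);

    return want > have ? want - have : 0;
}
```

Assuming 4kB blocks, 32768 blocks per group and a first data block of 0 (none of which the thread states), 19398656 blocks come to exactly 592 groups and 5 descriptor blocks, and 38797312 blocks come to 1184 groups and 10 descriptor blocks.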
From: Andreas D. <ad...@cl...> - 2002-09-12 20:47:43
On Sep 12, 2002 13:30 -0700, Robert Walsh wrote:
> On Thu, 2002-09-12 at 12:24, Andreas Dilger wrote:
> > On Sep 12, 2002 11:49 -0700, Robert Walsh wrote:
> > > Attached is the output of two different mke2fs's on a small loopback
> > > file. One has the resize inode. The other doesn't. Neither had a
> > > stride specified. As you can see, the one that has the resize inode has
> > > a different layout depending on whether there is supposed to be a backup
> > > super block there or not, which is what is causing ext2online to get
> > > upset.
> >
> > This may have something to do with how the libext2 code does block
> > allocation. I would think that it puts all of the inode tables at
> > the same offsets, but it might be missing a "reserved_gdt" count in
> > its initial calculations.
>
> Well, there is this in the code that parses the resize parameter:
>
> /* XXX param->s_res_gdt_blocks = resize - existing
> cur_groups = (resize - sb->s_first_data_block +
> EXT2_BLOCKS_PER_GROUP(super) - 1) /bpg;
> cur_gdb = (cur_groups + gdpb - 1) / gdpb;
> */
>
> I assume this was an attempt to set this up, right? But the problem is
> figuring out what "existing" is. I also assume this was the missing
> information you mentioned in your email yesterday:
>
> > That parameter doesn't actually work yet, for a reason that currently
> > escapes me. Something about us needing data (like blocksize, or blocks
> > per group, or something) that we don't have until inside libext2 where
> > the filesystem is being created... We need to go from "-R resize=foo"
> > to a number of group descriptor blocks that are passed in the
> > superblock to the libext2 create routines.
>
> Anyway, I'm going to see if I can get the above bit working and if that
> has an effect on the inode table offsets.

I don't think this will have an effect. If you specify "-O resize_inode"
to mke2fs, it will pick a default of 1024x the specified filesystem size,
and the libext2 code will create a resize inode with enough blocks,
after it has determined all of the needed parameters (ext2fs_initialize()
maybe). I _think_ there is a calculation in that same function which
determines the inode table offset, based on the group blocks + 3 (super,
block, inode bitmaps). It probably needs s_reserved_gdt_blocks added in
there and we will be happy.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/
From: Robert W. <rj...@du...> - 2002-09-13 04:14:30
> I don't think this will have an effect. If you specify "-O resize_inode"
> to mke2fs, it will pick a default of 1024x the specified filesystem size,
> and the libext2 code will create a resize inode with enough blocks,
> after it has determined all of the needed parameters (ext2fs_initialize()
> maybe). I _think_ there is a calculation in that same function which
> determines the inode table offset, based on the group blocks + 3 (super,
> block, inode bitmaps). It probably needs s_reserved_gdt_blocks added in
> there and we will be happy.

Is this the one?

    /*
     * Overhead is the number of bookkeeping blocks per group. It
     * includes the superblock backup, the group descriptor
     * backups, the inode bitmap, the block bitmap, and the inode
     * table.
     */
    overhead = (int) (2 + fs->inode_blocks_per_group);
    if (ext2fs_bg_has_super(fs, fs->group_desc_count - 1))
        overhead += 1 + fs->desc_blocks + super->s_reserved_gdt_blocks;

Looks like it only does this if there's a backup superblock there.
Would this work:

    /*
     * Overhead is the number of bookkeeping blocks per group. It
     * includes the superblock backup, the group descriptor
     * backups, the inode bitmap, the block bitmap, and the inode
     * table.
     */
    overhead = (int) (2 + fs->inode_blocks_per_group);
    /* if (ext2fs_bg_has_super(fs, fs->group_desc_count - 1)) */
        overhead += 1 + fs->desc_blocks + super->s_reserved_gdt_blocks;

Would that be safe?

Regards,
Robert.
From: Andreas D. <ad...@cl...> - 2002-09-13 23:26:49
On Sep 12, 2002 15:41 -0700, Robert Walsh wrote:
> > I don't think this will have an effect. If you specify "-O resize_inode"
> > to mke2fs, it will pick a default of 1024x the specified filesystem size,
> > and the libext2 code will create a resize inode with enough blocks,
> > after it has determined all of the needed parameters (ext2fs_initialize()
> > maybe). I _think_ there is a calculation in that same function which
> > determines the inode table offset, based on the group blocks + 3 (super,
> > block, inode bitmaps). It probably needs s_reserved_gdt_blocks added in
> > there and we will be happy.
>
> Is this the one?
>
> /*
> * Overhead is the number of bookkeeping blocks per group. It
> * includes the superblock backup, the group descriptor
> * backups, the inode bitmap, the block bitmap, and the inode
> * table.
> */
> overhead = (int) (2 + fs->inode_blocks_per_group);
>
> if (ext2fs_bg_has_super(fs, fs->group_desc_count - 1))
> overhead += 1 + fs->desc_blocks + super->s_reserved_gdt_blocks;

Nope, this value is only used to determine whether the last group is too
small to be included in the filesystem. The real culprit appears to be
in ext2fs_allocate_group_table(), where it is calculating the start_blk
value. It isn't including the reserved GDT blocks there, just before it
allocates the inode table.

Since I changed the above "overhead" calculation to not require the
reserved blocks for non-backup-holding groups, the code in
ext2fs_allocate_group_table() probably needs to special-case the
situation where (the corrected) start_blk + fs->inode_blocks_per_group
is more than last_blk, and move the inode table down until it fits.

Or, we could revert the "overhead" calculation changes (and put the
comment that "being clever is tricky" back in) like you suggested, so
that it always includes the backup and reserved GDT.

The only problem is as follows: say you are reserving for a huge
filesystem, and you want 1024 reserved group blocks (the maximum, for a
4kB block filesystem, which would give you an upper filesystem size of
16TB[*]). If it turns out that the last group in your filesystem does
not contain backups (probably true), but it is smaller than ~4MB, you
will lose that space until you resize slightly larger than 4MB, even
though you don't need to store a group descriptor table into that group
ever.

Maybe it's not a big deal (4MB is in the noise these days ;-), and you
are already "wasting" at least 8MB of space for the initial reserved GDT
blocks in groups 0 and 1. It's not like you don't get that space back
later when you resize, and presumably you are going to resize if you are
reserving blocks.

Cheers, Andreas

[*] coincidentally also the current maximum 4kB filesystem size because
of the 2^32 block limit.
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/
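The 16TB, 4MB and 8MB figures above can be checked with a few lines of C. This sketch assumes the classic 32-byte ext2 group descriptor and the usual limit of 8 * block_size blocks per group (one bitmap block per group); the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define EXT2_DESC_SIZE 32u   /* bytes per classic ext2 group descriptor */

/* Largest filesystem, in blocks, that 'gdt_blocks' descriptor blocks cover. */
uint64_t max_fs_blocks(uint64_t gdt_blocks, uint64_t block_size)
{
    uint64_t desc_per_block, blocks_per_group;

    assert(block_size >= 1024);
    desc_per_block = block_size / EXT2_DESC_SIZE;
    blocks_per_group = 8 * block_size;  /* one bitmap block, 8 bits per byte */
    return gdt_blocks * desc_per_block * blocks_per_group;
}

/* Space taken by 'gdt_blocks' descriptor blocks in one group, in bytes. */
uint64_t gdt_bytes(uint64_t gdt_blocks, uint64_t block_size)
{
    return gdt_blocks * block_size;
}
```

With 4kB blocks, 1024 descriptor blocks hold 1024 * 128 = 131072 descriptors, each covering 32768 blocks, for 2^32 blocks or 16TB in total. The same 1024 blocks occupy 4MB per backup-holding group, hence 8MB across groups 0 and 1.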
From: Robert W. <rj...@du...> - 2002-09-13 04:32:30
Hi Andreas,

Should the reserved blocks be reserved even in a block group that does
not contain a super block backup? Or are they available for use in the
block group?

Regards,
Robert.
From: Robert W. <rj...@du...> - 2002-09-13 17:49:03
Hi,

It appears to me that what's happening is that ext2online is trying to
guess the stride value using the differences in the block bitmap
position in each block group. This doesn't always give valid values,
but whether it does or it doesn't, it then tries to calculate the inode
table offset using its location in the first block group. It then
creates a bunch of new groups to add using the inode table offset as
its key. It places the block bitmap and inode bitmap 2 blocks and 1
block before the inode table, respectively. On block groups without a
super block backup, this is fine. On those with a super block backup,
this causes the bitmaps to fall into the range of the reserved gdt
blocks because in those circumstances, the inode table generated by
mke2fs happens to fall before the bitmaps.

I guess that mke2fs is producing a legal layout, or is it? It certainly
works for us, but it doesn't produce anything that looks like the layout
in your ols2002 paper. One thought might be to rearrange the order in
which it allocates blocks, so that it always allocates the bitmaps first
and then the inode table. This will at least guarantee the ordering
will be correct. I need to look at this a little more to be sure,
though.

Regards,
Robert.
--
Robert Walsh
Amalgamated Durables, Inc. - "We don't make the things you buy."
Email: rj...@du...
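To make the failure mode concrete, here is a deliberately simplified sketch of "guess the stride from per-group differences". It is not the ext2online code: detect_stride() is a hypothetical function, and it assumes each group's block bitmap position has already been expressed as an offset from that group's first block. It only shows why a layout that differs between backup and non-backup groups yields a step that is "not constant".

```c
#include <assert.h>
#include <stddef.h>

/*
 * bitmap_offset[i] is the block bitmap's offset within group i.
 * Returns 1 and stores the step in *stride_out when the offsets change
 * by one constant, non-negative step from group to group.
 * Returns 0 (leaving *stride_out at 0) when the step varies or is negative.
 */
int detect_stride(const long *bitmap_offset, size_t ngroups, long *stride_out)
{
    long step;
    size_t i;

    assert(bitmap_offset != NULL && stride_out != NULL);
    *stride_out = 0;
    if (ngroups < 2)
        return 1;                   /* nothing to compare: stride 0 */

    step = bitmap_offset[1] - bitmap_offset[0];
    for (i = 2; i < ngroups; i++) {
        if (bitmap_offset[i] - bitmap_offset[i - 1] != step)
            return 0;               /* "stride is not constant" */
    }
    if (step < 0)
        return 0;                   /* "stride is negative" */

    *stride_out = step;
    return 1;
}
```

Under this model, a filesystem whose bitmaps sit after the inode table only in backup-holding groups produces offsets that jump up and back down between neighbouring groups, so no single step fits and the caller falls back to a stride of 0.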
From: Andreas D. <ad...@cl...> - 2002-09-13 19:57:56
On Sep 13, 2002 10:49 -0700, Robert Walsh wrote:
> It appears to me that what's happening is that ext2online is trying to
> guess the stride value using the differences in the block bitmap
> position in each block group. This doesn't always give valid values,
> but whether it does or it doesn't, it then tries to calculate the inode
> table offset using its location in the first block group. It then
> creates a bunch of new groups to add using the inode table offset as
> its key. It places the block bitmap and inode bitmap 2 blocks and 1
> block before the inode table, respectively. On block groups without a
> super block backup, this is fine. On those with a super block backup,
> this causes the bitmaps to fall into the range of the reserved gdt
> blocks because in those circumstances, the inode table generated by
> mke2fs happens to fall before the bitmaps.

Good detective work. Yes, this is pretty much a summary of how things
happen on the ext2online side, based on how previous ext2 filesystems
were set up. I think mke2fs is changing its behaviour slightly because
the reserved group descriptors are using up the space that it had saved
for the inode and block bitmaps, and the inode table is using them up
first.

> I guess that mke2fs is producing a legal layout, or is it?

Yes, it is totally legal, but just not very common. However, I have also
learned recently that Ted's resize2fs tool also will move only the
bitmap blocks to after the inode table in the case it needs to (offline)
resize, while ext2resize will work to keep the old ordering.

> It certainly works for us, but it doesn't produce anything that
> looks like the layout in your ols2002 paper. One thought might be to
> rearrange the order in which it allocates blocks, so that it always
> allocates the bitmaps first and then the inode table. This will at
> least guarantee the ordering will be correct. I need to look at this
> a little more to be sure, though.

Would this be for mke2fs or ext2online? There might be two issues here.

1) mke2fs should try to keep the old "standard" layout as much as
   possible, because there are likely other tools which don't know
   anything about the fact that bitmaps can move (GNU parted might be
   one of those, because it uses a very old version of the ext2resize
   code). This simply means making sure that the location of the inode
   table is offset enough to compensate for the reserved group
   descriptor blocks as we previously discussed.

2) ext2online shouldn't pass bad data (i.e. invalid bitmap locations)
   to the kernel. Rather than picking the offset of the first group's
   inode table blindly, it should calculate itself what the offset
   needs to be:

       sb + (2 bitmaps if not "striped") + in-use GDT blk + reserved GDT blk

   and use the maximum of that and the "current" inode table offset.
   I suppose it is almost doing this, but is getting confused because
   it assumes that if _any_ bitmaps are after the inode table, then it
   is "striped" and they will _all_ be after the inode table. The bug
   happens when it can't find a regular striping pattern and uses
   stride = 0: it then assumes that there is enough space before the
   inode table to hold the bitmaps.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/
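The offset rule from point 2 can be sketched directly. The function and parameter names below are hypothetical and this is not the ext2online code; it simply transcribes "sb + (2 bitmaps if not striped) + in-use GDT blocks + reserved GDT blocks, then take the maximum with the current inode table offset", with all quantities counted in blocks from the start of the group.

```c
#include <assert.h>

/*
 * Minimum inode table offset within a group, in blocks.
 * 'striped' is 1 when the bitmaps live after the inode table, so no
 * room has to be left for them in front of it.
 */
unsigned long required_itable_offset(int striped,
                                     unsigned long gdt_blocks,
                                     unsigned long reserved_gdt_blocks)
{
    unsigned long off = 1;          /* superblock (or its backup) */

    assert(striped == 0 || striped == 1);
    if (!striped)
        off += 2;                   /* block bitmap + inode bitmap */
    off += gdt_blocks;              /* in-use group descriptor blocks */
    off += reserved_gdt_blocks;     /* blocks reserved for future growth */
    return off;
}

/* Never move the table closer to the group start than it already is. */
unsigned long choose_itable_offset(unsigned long current_offset,
                                   int striped,
                                   unsigned long gdt_blocks,
                                   unsigned long reserved_gdt_blocks)
{
    unsigned long need = required_itable_offset(striped, gdt_blocks,
                                                reserved_gdt_blocks);

    return current_offset > need ? current_offset : need;
}
```

For example, with 5 in-use and 1024 reserved descriptor blocks and no striping, the table would start no earlier than block 1 + 2 + 5 + 1024 = 1032 of the group, even if the first group's table were found at a smaller offset.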