ext2resize-devel Mailing List for GNU ext2resize (Page 7)
Status: Inactive
Brought to you by:
adilger
From: Andreas D. <ad...@cl...> - 2002-11-04 01:28:29

There is now (at very long last) a patch for ext3 online resizing on 2.5 kernels. Sadly, it was built against 2.5.44, and I now see that it does not apply cleanly to 2.5.current because of the EA/ACL and persistent-mount-option changes from Ted. Those should not really affect the operation though, so it should be easy to fix it up for the current tree.

Patch is in CVS, and also on the SF "patches" page.

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/

From: Robert W. <rj...@du...> - 2002-11-02 08:59:05

Hi all,

I have posted new ext3 online resize patches to linux-2.4.18 and linux-2.4.19. New in these patches is a fix to a problem where the wrong entry from the resize inode dind block was being zeroed out after a new gdb was created. This caused the resize to fail at seemingly arbitrary points, and eventually caused corruption of the filesystem.

I have also posted new patches to e2fsprogs-1.28 and e2fsprogs-1.29. New in these patches is a fix to a bug in the resize inode size calculations to handle really large resize inodes and removal of the hack to e2fsck to work around this bug. I haven't tested the patch to 1.29 yet, but it does compile; caveat emptor.

Note: I just noticed that e2fsprogs-1.30 has been released. My initial attempt to apply the 1.29 patch failed, and the port looks nasty, actually. I'll try get to that in the middle of November, if I have the time. Please use 1.28 (or 1.29, if you feel brave) for the moment.

The upshot of the above patches is that resize should now work for resizing any arbitrary sized filesystem to any new (larger) size. It also means that the resulting filesystem passes fsck without any problems and no hacky workarounds.

I have tested the patches on 2.4.19 growing various sized filesystems from 1GB to 1TB. Next week, I should have a chance to test them on filesystems from 1TB to 4TB (with large block device support.) In the next month or two, I'm going to try scrape enough disks together to test up to 16TB - I'll let you know how it goes.

All of the above patches are in the CVS tree and posted on the ext2resize SourceForge patch tracker page. Let me know if you have any problems.

Regards, Robert.
--
Robert Walsh
Amalgamated Durables, Inc. - "We don't make the things you buy."
Email: rj...@du...

From: Andreas D. <ad...@cl...> - 2002-10-23 19:04:27

On Oct 23, 2002 11:43 -0700, Robert Walsh wrote:
> > have you done any work with porting the ext3 kernel patches to 2.5?
> > If not, I will give it a shot this weekend if I have time.
>
> Nope - too many other things to worry about :-) I haven't even
> downloaded a 2.5.* release yet...

Oh well, I got stuck on the journal_flush() issue before I started porting to 2.5. At least I got UML+2.5 working, so when I ever get a round tuit I can start hacking on 2.5.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/

From: Robert W. <rj...@du...> - 2002-10-23 18:43:42

> have you done any work with porting the ext3 kernel patches to 2.5?
> If not, I will give it a shot this weekend if I have time.

Nope - too many other things to worry about :-) I haven't even downloaded a 2.5.* release yet...

From: Robert W. <rj...@du...> - 2002-10-23 14:13:47

Hi all,

I checked in some new patches for ext3 online resize yesterday. The patches are against 2.4.18 and 2.4.19. I'll roll a 2.4.20 patch as soon as 2.4.20 is finalized.

These patches should solve the oops that occurs when resizing and doing I/O at the same time. Please let me know if you see any other problems.

The patches are also available on the ext2resize sourceforge patch page.

Regards, Robert.
--
Robert Walsh
Amalgamated Durables, Inc. - "We don't make the things you buy."
Email: rj...@du...

From: Andrew M. <ak...@di...> - 2002-10-12 17:13:55

Andreas Dilger wrote:
>
> Robert,
> have you done any work with porting the ext3 kernel patches to 2.5?
> If not, I will give it a shot this weekend if I have time.
>
> Andrew,
> would you be willing to accept them into your tree and/or do you think
> they would make it into 2.5 if I were to submit them?

Well that would involve Stephen and Ted of course. But I would support it. The requirement makes sense, it doesn't break the fs for current users and you support people well...

From: Andreas D. <ad...@cl...> - 2002-10-12 15:46:16

Robert,
have you done any work with porting the ext3 kernel patches to 2.5? If not, I will give it a shot this weekend if I have time.

Andrew,
would you be willing to accept them into your tree and/or do you think they would make it into 2.5 if I were to submit them?

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/

From: Andreas D. <ad...@cl...> - 2002-10-09 08:42:02

On Oct 03, 2002 10:58 -0700, Robert Walsh wrote:
> > I could probably do this in a couple of hours if you would be willing
> > to do the testing, or is the description enough? Of course, if you
> > are not running multiple resizers at one time (by accident normally,
> > of course) then the only place you actually need the sb lock is at
> > the critical region previously mentioned in ext3_group_add() because
> > we are updating the free counts, which are also updated by other parts
> > of the code.
>
> Running multiple resizers at once sounds like a weird situation -
> probably a mistake on the users behalf, right? However, if it's
> possible to do it, then it shouldn't result in corrupt data or an oops,
> so it should be accounted for.
>
> Sounds like you've got a better idea about how to do this than I do, so
> why don't you go ahead and I'll definitely give it a good testing.

Sorry for the delay - the following patch is a new version of the kernel patch. (We have some 1000-node acceptance tests we are supposed to run for Lustre, but are hitting some bugs as we've never had access to that many nodes before ;-). It is basically just an edited version of patches/online-ext3-2.4.18.diff so it is possible that it doesn't even compile, but it should be close. The diff against that patch is fairly small, if you want to see just the changes.

The lock_super() is moved to be strictly after journal_start() (to avoid the oops you were having), and is not held for cases where it is not needed. The critical area for adding groups ended up larger than I originally thought, since we need to keep anything from changing in the superblock after we have deleted the pointers to the backup group descriptors, but I don't think that is a big deal.

As a side benefit, the moving of the superblock lock removes a bit of unpleasantness around ext3_free_blocks() (dropping the lock and getting it again).

Cheers, Andreas
=========================================================================
diff -rNu linux-2.4.18-orig/Documentation/Configure.help linux-2.4.18/Documentation/Configure.help
--- linux-2.4.18-orig/Documentation/Configure.help	Mon Feb 25 11:37:51 2002
+++ linux-2.4.18/Documentation/Configure.help	Tue Sep 10 11:18:06 2002
@@ -14078,6 +14078,20 @@
   generated. To turn debugging off again, do
   "echo 0 > /proc/sys/fs/jbd-debug".
 
+Online resize for ext3 filesystems
+CONFIG_EXT3_RESIZE
+  This option gives you the ability to increase the size of an ext3
+  filesystem while it is mounted (in use). In order to do this, you
+  must also be able to resize the underlying disk partition, probably
+  via a Logical Volume Manager (LVM), metadevice (MD), or hardware
+  RAID device - none of that capability is included in this feature.
+  If you do not know what any of these things are, or you have not
+  configured your kernel for them, you should probably say N here. If
+  you choose Y, then your kernel will be about 3k larger, and you need
+  to get some more software (http://ext2resize.sourceforge.net/) in
+  order to actually resize your filesystem, otherwise this feature
+  will just sit unused inside the kernel.
+
 Buffer Head tracing (DEBUG)
 CONFIG_BUFFER_DEBUG
   If you are a kernel developer working with file systems or in the
diff -rNu linux-2.4.18-orig/fs/Config.in linux-2.4.18/fs/Config.in
--- linux-2.4.18-orig/fs/Config.in	Mon Feb 25 11:38:07 2002
+++ linux-2.4.18/fs/Config.in	Tue Sep 10 11:18:06 2002
@@ -27,6 +27,7 @@
 # dep_tristate ' Journal Block Device support (JBD for ext3)' CONFIG_JBD $CONFIG_EXT3_FS
 define_bool CONFIG_JBD $CONFIG_EXT3_FS
 dep_mbool ' JBD (ext3) debugging support' CONFIG_JBD_DEBUG $CONFIG_JBD
+dep_mbool ' Online ext3 resize support (DANGEROUS)' CONFIG_EXT3_RESIZE $CONFIG_EXT3_FS $CONFIG_EXPERIMENTAL
 
 # msdos file systems
 tristate 'DOS FAT fs support' CONFIG_FAT_FS
diff -rNu linux-2.4.18-orig/fs/ext3/Makefile linux-2.4.18/fs/ext3/Makefile
--- linux-2.4.18-orig/fs/ext3/Makefile	Fri Dec 21 09:41:55 2001
+++ linux-2.4.18/fs/ext3/Makefile	Tue Sep 10 11:18:06 2002
@@ -11,6 +11,7 @@
 
 obj-y := balloc.o bitmap.o dir.o file.o fsync.o ialloc.o inode.o \
 		ioctl.o namei.o super.o symlink.o
+obj-$(CONFIG_EXT3_RESIZE) += resize.o
 obj-m := $(O_TARGET)
 
 include $(TOPDIR)/Rules.make
diff -rNu linux-2.4.18-orig/fs/ext3/balloc.c linux-2.4.18/fs/ext3/balloc.c
--- linux-2.4.18-orig/fs/ext3/balloc.c	Mon Feb 25 11:38:08 2002
+++ linux-2.4.18/fs/ext3/balloc.c	Tue Sep 10 11:18:06 2002
@@ -423,7 +423,7 @@
 error_return:
 	ext3_std_error(sb, err);
 	unlock_super(sb);
-	if (dquot_freed_blocks)
+	if (dquot_freed_blocks && !(EXT3_I(inode)->i_state & EXT3_STATE_RESIZE))
 		DQUOT_FREE_BLOCK(inode, dquot_freed_blocks);
 	return;
 }
@@ -821,13 +821,13 @@
 
 unsigned long ext3_count_free_blocks (struct super_block * sb)
 {
-#ifdef EXT3FS_DEBUG
 	struct ext3_super_block * es;
 	unsigned long desc_count, bitmap_count, x;
 	int bitmap_nr;
 	struct ext3_group_desc * gdp;
 	int i;
-
+
+	if (test_opt(sb, DEBUG)) {
 	lock_super (sb);
 	es = sb->u.ext3_sb.s_es;
 	desc_count = 0;
@@ -848,13 +848,12 @@
 			i, le16_to_cpu(gdp->bg_free_blocks_count), x);
 		bitmap_count += x;
 	}
-	printk("ext3_count_free_blocks: stored = %lu, computed = %lu, %lu\n",
+	printk(__FUNCTION__": stored = %u, computed gdt = %lu, bitmap = %lu\n",
 	       le32_to_cpu(es->s_free_blocks_count), desc_count, bitmap_count);
 	unlock_super (sb);
 	return bitmap_count;
-#else
+	} else
 	return le32_to_cpu(sb->u.ext3_sb.s_es->s_free_blocks_count);
-#endif
 }
 
 static inline int block_in_use (unsigned long block,
diff -rNu linux-2.4.18-orig/fs/ext3/inode.c linux-2.4.18/fs/ext3/inode.c
--- linux-2.4.18-orig/fs/ext3/inode.c	Mon Feb 25 11:38:08 2002
+++ linux-2.4.18/fs/ext3/inode.c	Tue Sep 10 11:18:06 2002
@@ -1984,36 +1984,33 @@
 int ext3_get_inode_loc (struct inode *inode, struct ext3_iloc *iloc)
 {
 	struct buffer_head *bh = 0;
+	struct super_block *sb = inode->i_sb;
+	struct ext3_sb_info *sbi = EXT3_SB(inode->i_sb);
+	unsigned long ino = inode->i_ino;
 	unsigned long block;
 	unsigned long block_group;
 	unsigned long group_desc;
 	unsigned long desc;
 	unsigned long offset;
 	struct ext3_group_desc * gdp;
-
-	if ((inode->i_ino != EXT3_ROOT_INO &&
-		inode->i_ino != EXT3_ACL_IDX_INO &&
-		inode->i_ino != EXT3_ACL_DATA_INO &&
-		inode->i_ino != EXT3_JOURNAL_INO &&
-		inode->i_ino < EXT3_FIRST_INO(inode->i_sb)) ||
-		inode->i_ino > le32_to_cpu(
-			inode->i_sb->u.ext3_sb.s_es->s_inodes_count)) {
-		ext3_error (inode->i_sb, "ext3_get_inode_loc",
-			    "bad inode number: %lu", inode->i_ino);
+
+	if ((ino != EXT3_ROOT_INO && ino != EXT3_ACL_IDX_INO &&
+	     ino != EXT3_ACL_DATA_INO && ino != EXT3_JOURNAL_INO &&
+	     ino != EXT3_RESIZE_INO && ino < EXT3_FIRST_INO(sb)) ||
+	    ino > le32_to_cpu(sbi->s_es->s_inodes_count)) {
+		ext3_error(sb, __FUNCTION__, "bad inode number: %lu", ino);
 		goto bad_inode;
 	}
-	block_group = (inode->i_ino - 1) / EXT3_INODES_PER_GROUP(inode->i_sb);
-	if (block_group >= inode->i_sb->u.ext3_sb.s_groups_count) {
-		ext3_error (inode->i_sb, "ext3_get_inode_loc",
-			    "group >= groups count");
+	block_group = (ino - 1) / sbi->s_inodes_per_group;
+	if (block_group >= sbi->s_groups_count) {
+		ext3_error(sb, __FUNCTION__, "group >= groups count");
 		goto bad_inode;
 	}
-	group_desc = block_group >> EXT3_DESC_PER_BLOCK_BITS(inode->i_sb);
-	desc = block_group & (EXT3_DESC_PER_BLOCK(inode->i_sb) - 1);
-	bh = inode->i_sb->u.ext3_sb.s_group_desc[group_desc];
+	group_desc = block_group >> sbi->s_desc_per_block_bits;
+	desc = block_group & (sbi->s_desc_per_block - 1);
+	bh = sbi->s_group_desc[group_desc];
 	if (!bh) {
-		ext3_error (inode->i_sb, "ext3_get_inode_loc",
-			    "Descriptor not loaded");
+		ext3_error(sb, __FUNCTION__, "Descriptor not loaded");
 		goto bad_inode;
 	}
 
@@ -2021,17 +2018,16 @@
 	/*
 	 * Figure out the offset within the block group inode table
 	 */
-	offset = ((inode->i_ino - 1) % EXT3_INODES_PER_GROUP(inode->i_sb)) *
-		EXT3_INODE_SIZE(inode->i_sb);
+	offset = ((ino - 1) % sbi->s_inodes_per_group) * sbi->s_inode_size;
 	block = le32_to_cpu(gdp[desc].bg_inode_table) +
-		(offset >> EXT3_BLOCK_SIZE_BITS(inode->i_sb));
-	if (!(bh = sb_bread(inode->i_sb, block))) {
-		ext3_error (inode->i_sb, "ext3_get_inode_loc",
-			    "unable to read inode block - "
-			    "inode=%lu, block=%lu", inode->i_ino, block);
+		(offset >> EXT3_BLOCK_SIZE_BITS(sb));
+	if (!(bh = sb_bread(sb, block))) {
+		ext3_error(sb, __FUNCTION__,
+			   "unable to read inode block - inode=%lu, block=%lu",
+			   ino, block);
 		goto bad_inode;
 	}
-	offset &= (EXT3_BLOCK_SIZE(inode->i_sb) - 1);
+	offset &= EXT3_BLOCK_SIZE(sb) - 1;
 
 	iloc->bh = bh;
 	iloc->raw_inode = (struct ext3_inode *) (bh->b_data + offset);
diff -rNu linux-2.4.18-orig/fs/ext3/ioctl.c linux-2.4.18/fs/ext3/ioctl.c
--- linux-2.4.18-orig/fs/ext3/ioctl.c	Fri Nov 9 14:25:04 2001
+++ linux-2.4.18/fs/ext3/ioctl.c	Tue Sep 10 11:19:00 2002
@@ -11,6 +11,8 @@
 #include <linux/jbd.h>
 #include <linux/ext3_fs.h>
 #include <linux/ext3_jbd.h>
+#include <linux/locks.h>
+#include <linux/smp_lock.h>
 #include <linux/sched.h>
 #include <asm/uaccess.h>
 
@@ -140,6 +142,51 @@
 		ext3_journal_stop(handle, inode);
 		return err;
 	}
+#ifdef CONFIG_EXT3_RESIZE
+	case EXT3_IOC_GROUP_EXTEND: {
+		unsigned long n_blocks_count;
+		struct super_block *sb = inode->i_sb;
+		int err;
+
+		if (!capable(CAP_SYS_RESOURCE))
+			return -EACCES;
+
+		if (sb->s_flags & MS_RDONLY)
+			return -EROFS;
+
+		if (get_user(n_blocks_count, (__u32 *)arg))
+			return -EFAULT;
+
+		lock_kernel();
+		err = ext3_group_extend(sb, EXT3_SB(sb)->s_es, n_blocks_count);
+		unlock_kernel();
+		journal_flush(EXT3_SB(sb)->s_journal);
+
+		return err;
+	}
+	case EXT3_IOC_GROUP_ADD: {
+		struct ext3_new_group_data input;
+		struct super_block *sb = inode->i_sb;
+		int err;
+
+		if (!capable(CAP_SYS_RESOURCE))
+			return -EACCES;
+
+		if (inode->i_sb->s_flags & MS_RDONLY)
+			return -EROFS;
+
+		if (copy_from_user(&input, (struct ext3_new_group_input *)arg,
+				sizeof(input)))
+			return -EFAULT;
+
+		lock_kernel();
+		err = ext3_group_add(sb, &input);
+		unlock_kernel();
+		journal_flush(EXT3_SB(sb)->s_journal);
+
+		return err;
+	}
+#endif /* CONFIG_EXT3_RESIZE */
 #ifdef CONFIG_JBD_DEBUG
 	case EXT3_IOC_WAIT_FOR_READONLY:
 		/*
diff -rNu linux-2.4.18-orig/fs/ext3/resize.c linux-2.4.18/fs/ext3/resize.c
--- linux-2.4.18-orig/fs/ext3/resize.c	Wed Dec 31 16:00:00 1969
+++ linux-2.4.18/fs/ext3/resize.c	Tue Sep 10 11:19:04 2002
@@ -0,0 +1,958 @@
+/*
+ * linux/fs/ext3/resize.c
+ *
+ * Support for resizing an ext3 filesystem while it is mounted.
+ *
+ * Copyright (C) 2001, 2002 Andreas Dilger <ad...@cl...>
+ *
+ * This could probably be made into a module, because it is not often in use.
+ */
+
+#include <linux/config.h>
+
+#define EXT3FS_DEBUG
+
+#include <linux/sched.h>
+#include <linux/smp_lock.h>
+#include <linux/ext3_jbd.h>
+
+#include <linux/errno.h>
+#include <linux/locks.h>
+#include <linux/slab.h>
+
+
+#define outside(b, first, last)	((b) < (first) || (b) >= (last))
+#define inside(b, first, last)	((b) >= (first) && (b) < (last))
+
+static int verify_group_input(struct super_block *sb,
+			      struct ext3_new_group_data *input)
+{
+	struct ext3_sb_info *sbi = EXT3_SB(sb);
+	struct ext3_super_block *es = sbi->s_es;
+	unsigned start = le32_to_cpu(es->s_blocks_count);
+	unsigned end = start + input->blocks_count;
+	unsigned group = input->group;
+	unsigned itend = input->inode_table + EXT3_SB(sb)->s_itb_per_group;
+	unsigned overhead = ext3_bg_has_super(sb, group) ?
+		(1 + ext3_bg_num_gdb(sb, group) +
+		 le16_to_cpu(es->s_reserved_gdt_blocks)) : 0;
+	unsigned metaend = start + overhead;
+	struct buffer_head *bh;
+	int free_blocks_count;
+	int err = -EINVAL;
+
+	input->free_blocks_count = free_blocks_count =
+		input->blocks_count - 2 - overhead - sbi->s_itb_per_group;
+
+	if (test_opt(sb, DEBUG))
+		printk("EXT3-fs: adding %s group %u: %u blocks "
+		       "(%d free, %u reserved)\n",
+		       ext3_bg_has_super(sb, input->group) ? "normal" :
+		       "no-super", input->group, input->blocks_count,
+		       free_blocks_count, input->reserved_blocks);
+
+	if (group != sbi->s_groups_count)
+		ext3_warning(sb, __FUNCTION__,
+			     "Cannot add at group %u (only %lu groups)",
+			     input->group, sbi->s_groups_count);
+	else if ((start - le32_to_cpu(es->s_first_data_block)) %
+		 EXT3_BLOCKS_PER_GROUP(sb))
+		ext3_warning(sb, __FUNCTION__, "Last group not full");
+	else if (input->reserved_blocks > input->blocks_count / 5)
+		ext3_warning(sb, __FUNCTION__, "Reserved blocks too high (%u)",
+			     input->reserved_blocks);
+	else if (free_blocks_count < 0)
+		ext3_warning(sb, __FUNCTION__, "Bad blocks count %u",
+			     input->blocks_count);
+	else if (!(bh = sb_bread(sb, end - 1)))
+		ext3_warning(sb, __FUNCTION__, "Cannot read last block (%u)",
+			     end - 1);
+	else if (outside(input->block_bitmap, start, end))
+		ext3_warning(sb, __FUNCTION__,
+			     "Block bitmap not in group (block %u)",
+			     input->block_bitmap);
+	else if (outside(input->inode_bitmap, start, end))
+		ext3_warning(sb, __FUNCTION__,
+			     "Inode bitmap not in group (block %u)",
+			     input->inode_bitmap);
+	else if (outside(input->inode_table, start, end) ||
+		 outside(itend - 1, start, end))
+		ext3_warning(sb, __FUNCTION__,
+			     "Inode table not in group (blocks %u-%u)",
+			     input->inode_table, itend - 1);
+	else if (input->inode_bitmap == input->block_bitmap)
+		ext3_warning(sb, __FUNCTION__,
+			     "Block bitmap same as inode bitmap (%u)",
+			     input->block_bitmap);
+	else if (inside(input->block_bitmap, input->inode_table, itend))
+		ext3_warning(sb, __FUNCTION__,
+			     "Block bitmap (%u) in inode table (%u-%u)",
+			     input->block_bitmap, input->inode_table, itend-1);
+	else if (inside(input->inode_bitmap, input->inode_table, itend))
+		ext3_warning(sb, __FUNCTION__,
+			     "Inode bitmap (%u) in inode table (%u-%u)",
+			     input->inode_bitmap, input->inode_table, itend-1);
+	else if (inside(input->block_bitmap, start, metaend))
+		ext3_warning(sb, __FUNCTION__,
+			     "Block bitmap (%u) in GDT table (%u-%u)",
+			     input->block_bitmap, start, metaend - 1);
+	else if (inside(input->inode_bitmap, start, metaend))
+		ext3_warning(sb, __FUNCTION__,
+			     "Inode bitmap (%u) in GDT table (%u-%u)",
+			     input->inode_bitmap, start, metaend - 1);
+	else if (inside(input->inode_table, start, metaend) ||
+		 inside(itend - 1, start, metaend))
+		ext3_warning(sb, __FUNCTION__,
+			     "Inode table (%u-%u) overlaps GDT table (%u-%u)",
+			     input->inode_table, itend - 1, start, metaend - 1);
+	else {
+		brelse(bh);
+		err = 0;
+	}
+
+	return err;
+}
+
+static struct buffer_head *bclean(handle_t *handle, struct super_block *sb,
+				  unsigned long blk)
+{
+	struct buffer_head *bh;
+	int err;
+
+	bh = sb_getblk(sb, blk);
+	mark_buffer_uptodate(bh, 1);
+	if ((err = ext3_journal_get_write_access(handle, bh))) {
+		brelse(bh);
+		bh = ERR_PTR(err);
+	} else
+		memset(bh->b_data, 0, sb->s_blocksize);
+
+	return bh;
+}
+
+/*
+ * To avoid calling the atomic setbit hundreds or thousands of times, we only
+ * need to use it within a single byte (to ensure we get endianness right).
+ * We can use memset for the rest of the bitmap as there are no other users.
+ */
+static void mark_bitmap_end(int start_bit, int end_bit, char *bitmap)
+{
+	int i;
+
+	if (start_bit >= end_bit)
+		return;
+
+	ext3_debug("mark end bits +%d through +%d used\n", start_bit, end_bit);
+	for (i = start_bit; i < ((start_bit + 7) & ~7UL); i++)
+		ext3_set_bit(i, bitmap);
+	if (i < end_bit)
+		memset(bitmap + (i >> 3), 0xff, (end_bit - i) >> 3);
+}
+
+/*
+ * Set up the block and inode bitmaps, and the inode table for the new group.
+ * This doesn't need to be part of the main transaction, since we are only
+ * changing blocks outside the actual filesystem. We still do journaling to
+ * ensure the recovery is correct in case of a failure just after resize.
+ * If any part of this fails, we simply abort the resize.
+ *
+ * We only pass inode because of the ext3 journal wrappers.
+ */
+static int setup_new_group_blocks(struct super_block *sb, struct inode *inode,
+				  struct ext3_new_group_data *input)
+{
+	struct ext3_sb_info *sbi = EXT3_SB(sb);
+	unsigned long start = input->group * sbi->s_blocks_per_group +
+		le32_to_cpu(sbi->s_es->s_first_data_block);
+	int reserved_gdb = ext3_bg_has_super(sb, input->group) ?
+		le16_to_cpu(sbi->s_es->s_reserved_gdt_blocks) : 0;
+	unsigned long gdblocks = ext3_bg_num_gdb(sb, input->group);
+	struct buffer_head *bh;
+	handle_t *handle;
+	unsigned long block;
+	int bit;
+	int i;
+	int err = 0, err2;
+
+	handle = ext3_journal_start(inode, reserved_gdb + gdblocks +
+				    2 + sbi->s_itb_per_group);
+	if (IS_ERR(handle))
+		return PTR_ERR(handle);
+
+	lock_super(sb);
+	if (input->group != sbi->s_groups_count) {
+		err = -EBUSY;
+		goto exit_journal;
+	}
+
+	if (IS_ERR(bh = bclean(handle, sb, input->block_bitmap))) {
+		err = PTR_ERR(bh);
+		goto exit_journal;
+	}
+
+	if (ext3_bg_has_super(sb, input->group)) {
+		ext3_debug("mark backup superblock %#04lx (+0)\n", start);
+		ext3_set_bit(0, bh->b_data);
+	}
+
+	/* Copy all of the GDT blocks into the backup in this group */
+	for (i = 0, bit = 1, block = start + 1;
+	     i < gdblocks; i++, block++, bit++) {
+		struct buffer_head *gdb;
+
+		ext3_debug("update backup group %#04lx (+%d)\n", block, bit);
+
+		gdb = sb_getblk(sb, block);
+		mark_buffer_uptodate(gdb, 1);
+		if ((err = ext3_journal_get_write_access(handle, gdb))) {
+			brelse(gdb);
+			goto exit_bh;
+		}
+		memcpy(gdb->b_data, sbi->s_group_desc[i], bh->b_size);
+		ext3_journal_dirty_metadata(handle, gdb);
+		ext3_set_bit(bit, bh->b_data);
+		brelse(gdb);
+	}
+
+	/* Zero out all of the reserved backup group descriptor table blocks */
+	for (i = 0, bit = gdblocks + 1, block = start + bit;
+	     i < reserved_gdb; i++, block++, bit++) {
+		struct buffer_head *gdb;
+
+		ext3_debug("clear reserved block %#04lx (+%d)\n", block, bit);
+
+		if (IS_ERR(gdb = bclean(handle, sb, block))) {
+			err = PTR_ERR(bh);
+			goto exit_bh;
+		}
+		ext3_journal_dirty_metadata(handle, gdb);
+		ext3_set_bit(bit, bh->b_data);
+		brelse(gdb);
+	}
+	ext3_debug("mark block bitmap %#04x (+%ld)\n", input->block_bitmap,
+		   input->block_bitmap - start);
+	ext3_set_bit(input->block_bitmap - start, bh->b_data);
+	ext3_debug("mark inode bitmap %#04x (+%ld)\n", input->inode_bitmap,
+		   input->inode_bitmap - start);
+	ext3_set_bit(input->inode_bitmap - start, bh->b_data);
+
+	/* Zero out all of the inode table blocks */
+	for (i = 0, block = input->inode_table, bit = block - start;
+	     i < sbi->s_itb_per_group; i++, bit++, block++) {
+		struct buffer_head *it;
+
+		ext3_debug("clear inode block %#04x (+%ld)\n", block, bit);
+		if (IS_ERR(it = bclean(handle, sb, block))) {
+			err = PTR_ERR(it);
+			goto exit_bh;
+		}
+		ext3_journal_dirty_metadata(handle, it);
+		brelse(it);
+		ext3_set_bit(bit, bh->b_data);
+	}
+	mark_bitmap_end(input->blocks_count, EXT3_BLOCKS_PER_GROUP(sb),
+			bh->b_data);
+	ext3_journal_dirty_metadata(handle, bh);
+	brelse(bh);
+
+	/* Mark unused entries in inode bitmap used */
+	ext3_debug("clear inode bitmap %#04x (+%ld)\n",
+		   input->inode_bitmap, input->inode_bitmap - start);
+	if (IS_ERR(bh = bclean(handle, sb, input->inode_bitmap))) {
+		err = PTR_ERR(bh);
+		goto exit_journal;
+	}
+
+	mark_bitmap_end(EXT3_INODES_PER_GROUP(sb), EXT3_BLOCKS_PER_GROUP(sb),
+			bh->b_data);
+	ext3_journal_dirty_metadata(handle, bh);
+exit_bh:
+	brelse(bh);
+
+exit_journal:
+	unlock_super(sb);
+	if ((err2 = ext3_journal_stop(handle, inode)) && !err)
+		err = err2;
+
+	return err;
+}
+
+/*
+ * Iterate through the groups which hold BACKUP superblock/GDT copies in an
+ * ext3 filesystem. The counters should be initialized to 1, 5, and 7 before
+ * calling this for the first time. In a sparse filesystem it will be the
+ * sequence of powers of 3, 5, and 7: 1, 3, 5, 7, 9, 25, 27, 49, 81, ...
+ * For a non-sparse filesystem it will be every group: 1, 2, 3, 4, ...
+ */
+unsigned ext3_list_backups(struct super_block *sb, unsigned *three,
+			   unsigned *five, unsigned *seven)
+{
+	unsigned *min = three;
+	int mult = 3;
+	unsigned ret;
+
+	if (!EXT3_HAS_RO_COMPAT_FEATURE(sb,
+					EXT3_FEATURE_RO_COMPAT_SPARSE_SUPER)) {
+		ret = *min;
+		*min += 1;
+		return ret;
+	}
+
+	if (*five < *min) {
+		min = five;
+		mult = 5;
+	}
+	if (*seven < *min) {
+		min = seven;
+		mult = 7;
+	}
+
+	ret = *min;
+	*min *= mult;
+
+	return ret;
+}
+
+/*
+ * Check that all of the backup GDT blocks are held in the primary GDT block.
+ * It is assumed that they are stored in group order. Returns the number of
+ * groups in current filesystem that have BACKUPS, or -ve error code.
+ */
+static int verify_reserved_gdb(struct super_block *sb,
+			       struct buffer_head *primary)
+{
+	const unsigned long blk = primary->b_blocknr;
+	const unsigned long end = EXT3_SB(sb)->s_groups_count;
+	unsigned three = 1;
+	unsigned five = 5;
+	unsigned seven = 7;
+	unsigned grp;
+	__u32 *p = (__u32 *)primary->b_data;
+	int gdbackups = 0;
+
+	while ((grp = ext3_list_backups(sb, &three, &five, &seven)) < end) {
+		if (le32_to_cpu(*p++) != grp * EXT3_BLOCKS_PER_GROUP(sb) + blk){
+			ext3_warning(sb, __FUNCTION__,
+				     "reserved GDT %ld missing grp %d (%ld)\n",
+				     blk, grp,
+				     grp * EXT3_BLOCKS_PER_GROUP(sb) + blk);
+			return -EINVAL;
+		}
+		if (++gdbackups > EXT3_ADDR_PER_BLOCK(sb))
+			return -EFBIG;
+	}
+
+	return gdbackups;
+}
+
+/*
+ * Called when we need to bring a reserved group descriptor table block into
+ * use from the resize inode. The primary copy of the new GDT block currently
+ * is an indirect block (under the double indirect block in the resize inode).
+ * The new backup GDT blocks will be stored as leaf blocks in this indirect
+ * block, in group order. Even though we know all the block numbers we need,
+ * we check to ensure that the resize inode has actually reserved these blocks.
+ *
+ * Don't need to update the block bitmaps because the blocks are still in use.
+ *
+ * We get all of the error cases out of the way, so that we are sure to not
+ * fail once we start modifying the data on disk, because JBD has no rollback.
+ */
+static int add_new_gdb(handle_t *handle, struct inode *inode,
+		       struct ext3_new_group_data *input,
+		       struct buffer_head **primary)
+{
+	struct super_block *sb = inode->i_sb;
+	struct ext3_super_block *es = EXT3_SB(sb)->s_es;
+	unsigned long gdb_num = input->group / EXT3_DESC_PER_BLOCK(sb);
+	unsigned long gdb_off = input->group % EXT3_DESC_PER_BLOCK(sb);
+	unsigned long gdblock = EXT3_SB(sb)->s_sbh->b_blocknr + 1 + gdb_num;
+	struct buffer_head **o_group_desc, **n_group_desc;
+	struct buffer_head *dind;
+	int gdbackups;
+	struct ext3_iloc iloc;
+	__u32 *data;
+	int err;
+
+	if (test_opt(sb, DEBUG))
+		printk("EXT3-fs: ext3_add_new_gdb: adding group block %lu\n",
+		       gdb_num);
+
+	/*
+	 * If we are not using the primary superblock/GDT copy don't resize,
+	 * because the user tools have no way of handling this. Probably a
+	 * bad time to do it anyways.
+	 */
+	if (EXT3_SB(sb)->s_sbh->b_blocknr !=
+	    le32_to_cpu(EXT3_SB(sb)->s_es->s_first_data_block)) {
+		ext3_warning(sb, __FUNCTION__,
+			"won't resize using backup superblock at %lu\n",
+			EXT3_SB(sb)->s_sbh->b_blocknr);
+		return -EPERM;
+	}
+
+	*primary = sb_bread(sb, gdblock);
+	if (!*primary)
+		return -EIO;
+
+	if ((gdbackups = verify_reserved_gdb(sb, *primary)) < 0) {
+		err = gdbackups;
+		goto exit_bh;
+	}
+
+	data = EXT3_I(inode)->i_data + EXT3_DIND_BLOCK;
+	dind = sb_bread(sb, le32_to_cpu(*data));
+	if (!dind) {
+		err = -EIO;
+		goto exit_bh;
+	}
+
+	data = (__u32 *)dind->b_data;
+	if (le32_to_cpu(data[gdb_num % EXT3_ADDR_PER_BLOCK(sb)]) != gdblock) {
+		ext3_warning(sb, __FUNCTION__,
+			     "new group %u GDT block %lu not reserved\n",
+			     input->group, gdblock);
+		err = -EINVAL;
+		goto exit_dind;
+	}
+
+	if ((err = ext3_journal_get_write_access(handle, EXT3_SB(sb)->s_sbh)))
+		goto exit_dind;
+
+	if ((err = ext3_journal_get_write_access(handle, *primary)))
+		goto exit_sbh;
+
+	if ((err = ext3_journal_get_write_access(handle, dind)))
+		goto exit_primary;
+
+	/* ext3_reserve_inode_write() gets a reference on the iloc */
+	if ((err = ext3_reserve_inode_write(handle, inode, &iloc)))
+		goto exit_dindj;
+
+	n_group_desc = (struct buffer_head **)kmalloc((gdb_num + 1) *
+			sizeof(struct buffer_head *), GFP_KERNEL);
+	if (!n_group_desc) {
+		err = -ENOMEM;
+		ext3_warning (sb, __FUNCTION__,
+			      "not enough memory for %lu groups", gdb_num + 1);
+		goto exit_inode;
+	}
+
+	/*
+	 * Finally, we have all of the possible failures behind us...
+	 *
+	 * Remove new GDT block from inode double-indirect block and clear out
+	 * the new GDT block for use (which also "frees" the backup GDT blocks
+	 * from the reserved inode). We don't need to change the bitmaps for
+	 * these blocks, because they are marked as in-use from being in the
+	 * reserved inode, and will become GDT blocks (primary and backup).
+	 */
+	/*
+	printk("removing block %d = %ld from dindir %ld[%ld]\n",
+	       ((__u32 *)(dind->b_data))[gdb_off], gdblock, dind->b_blocknr,
+	       gdb_num); */
+	data[gdb_off] = 0;
+	ext3_journal_dirty_metadata(handle, dind);
+	brelse(dind);
+	inode->i_blocks -= (gdbackups + 1) * sb->s_blocksize >> 9;
+	ext3_mark_iloc_dirty(handle, inode, &iloc);
+	memset((*primary)->b_data, 0, sb->s_blocksize);
+	ext3_journal_dirty_metadata(handle, *primary);
+
+	o_group_desc = EXT3_SB(sb)->s_group_desc;
+	memcpy(n_group_desc, o_group_desc,
+	       EXT3_SB(sb)->s_gdb_count * sizeof(struct buffer_head *));
+	n_group_desc[gdb_num] = *primary;
+	EXT3_SB(sb)->s_group_desc = n_group_desc;
+	EXT3_SB(sb)->s_gdb_count++;
+	kfree(o_group_desc);
+
+	es->s_reserved_gdt_blocks =
+		cpu_to_le16(le16_to_cpu(es->s_reserved_gdt_blocks) - 1);
+	ext3_journal_dirty_metadata(handle, EXT3_SB(sb)->s_sbh);
+
+	return 0;
+
+exit_inode:
+	//ext3_journal_release_buffer(handle, iloc.bh);
+	brelse(iloc.bh);
+exit_dindj:
+	//ext3_journal_release_buffer(handle, dind);
+exit_primary:
+	//ext3_journal_release_buffer(handle, *primary);
+exit_sbh:
+	//ext3_journal_release_buffer(handle, *primary);
+exit_dind:
+	brelse(dind);
+exit_bh:
+	brelse(*primary);
+
+	ext3_debug("leaving with error %d\n", err);
+	return err;
+}
+
+/*
+ * Called when we are adding a new group which has a backup copy of each of
+ * the GDT blocks (i.e. sparse group) and there are reserved GDT blocks.
+ * We need to add these reserved backup GDT blocks to the resize inode, so
+ * that they are kept for future resizing and not allocated to files.
+ *
+ * Each reserved backup GDT block will go into a different indirect block.
+ * The indirect blocks are actually the primary reserved GDT blocks,
+ * so we know in advance what their block numbers are. We only get the
+ * double-indirect block to verify it is pointing to the primary reserved
+ * GDT blocks so we don't overwrite a data block by accident. The reserved
+ * backup GDT blocks are stored in their reserved primary GDT block.
+ */
+static int reserve_backup_gdb(handle_t *handle, struct inode *inode,
+			      struct ext3_new_group_data *input)
+{
+	struct super_block *sb = inode->i_sb;
+	int reserved_gdb =le16_to_cpu(EXT3_SB(sb)->s_es->s_reserved_gdt_blocks);
+	struct buffer_head **primary;
+	struct buffer_head *dind;
+	struct ext3_iloc iloc;
+	unsigned long blk;
+	__u32 *data, *end;
+	int gdbackups = 0;
+	int res, i;
+	int err;
+
+	primary = kmalloc(reserved_gdb * sizeof(*primary), GFP_KERNEL);
+	if (!primary)
+		return -ENOMEM;
+
+	data = EXT3_I(inode)->i_data + EXT3_DIND_BLOCK;
+	dind = sb_bread(sb, le32_to_cpu(*data));
+	if (!dind) {
+		err = -EIO;
+		goto exit_free;
+	}
+
+	blk = EXT3_SB(sb)->s_sbh->b_blocknr + 1 + EXT3_SB(sb)->s_gdb_count;
+	data = (__u32 *)dind->b_data + EXT3_SB(sb)->s_gdb_count;
+	end = (__u32 *)dind->b_data + EXT3_ADDR_PER_BLOCK(sb);
+
+	/* Get each reserved primary GDT block and verify it holds backups */
+	for (res = 0; res < reserved_gdb; res++, blk++) {
+		if (le32_to_cpu(*data) != blk) {
+			ext3_warning(sb, __FUNCTION__,
+				     "reserved block %lu not at offset %d\n",
+				     blk, data - (__u32 *)dind->b_data);
+			err = -EINVAL;
+			goto exit_bh;
+		}
+		primary[res] = sb_bread(sb, blk);
+		if (!primary[res]) {
+			err = -EIO;
+			goto exit_bh;
+		}
+		if ((gdbackups = verify_reserved_gdb(sb, primary[res])) < 0) {
+			brelse(primary[res]);
+			err = gdbackups;
+			goto exit_bh;
+		}
+		if (++data >= end)
+			data = (__u32 *)dind->b_data;
+	}
+
+	for (i = 0; i < reserved_gdb; i++) {
+		if ((err = ext3_journal_get_write_access(handle, primary[i]))) {
+			/*
+			int j;
+			for (j = 0; j < i; j++)
+				ext3_journal_release_buffer(handle, primary[j]);
+			 */
+			goto exit_bh;
+		}
+	}
+
+	if ((err = ext3_reserve_inode_write(handle, inode, &iloc)))
+		goto exit_bh;
+
+	/*
+	 * Finally we can add each of the reserved backup GDT blocks from
+	 * the new group to its reserved primary GDT block.
+	 */
+	blk = input->group * EXT3_BLOCKS_PER_GROUP(sb);
+	for (i = 0; i < reserved_gdb; i++) {
+		int err2;
+		data = (__u32 *)primary[i]->b_data;
+		/* printk("reserving backup %lu[%u] = %lu\n",
+		       primary[i]->b_blocknr, gdbackups,
+		       blk + primary[i]->b_blocknr); */
+		data[gdbackups] = cpu_to_le32(blk + primary[i]->b_blocknr);
+		err2 = ext3_journal_dirty_metadata(handle, primary[i]);
+		if (!err)
+			err = err2;
+	}
+	inode->i_blocks += reserved_gdb * sb->s_blocksize >> 9;
+	ext3_mark_iloc_dirty(handle, inode, &iloc);
+
+exit_bh:
+	while (--res >= 0)
+		brelse(primary[res]);
+	brelse(dind);
+
+exit_free:
+	kfree(primary);
+
+	return err;
+}
+
+/*
+ * Update the backup copies of the ext3 metadata. These don't need to be part
+ * of the main resize transaction, because e2fsck will re-write them if there
+ * is a problem (basically only OOM will cause a problem). However, we
+ * _should_ update the backups if possible, in case the primary gets trashed
+ * for some reason and we need to run e2fsck from a backup superblock. The
+ * important part is that the new block and inode counts are in the backup
+ * superblocks, and the location of the new group metadata in the GDT backups.
+ *
+ * We do not need lock_super() for this, because these blocks are not
+ * otherwise touched by the filesystem code when it is mounted. We don't
+ * need to worry about last changing from sbi->s_groups_count, because the
+ * worst that can happen is that we do not copy the full number of backups
+ * at this time. The resize which changed s_groups_count will backup again.
+ *
+ * We only pass inode because of the ext3 journal wrappers.
+ */ +static void update_backups(struct super_block *sb, struct inode *inode, + int blk_off, char *data, int size) +{ + struct ext3_sb_info *sbi = EXT3_SB(sb); + const unsigned long last = sbi->s_groups_count; + const int bpg = EXT3_BLOCKS_PER_GROUP(sb); + unsigned three = 1; + unsigned five = 5; + unsigned seven = 7; + unsigned group; + int rest = sb->s_blocksize - size; + handle_t *handle; + int err = 0, err2; + + handle = ext3_journal_start(inode, EXT3_MAX_TRANS_DATA); + if (IS_ERR(handle)) { + group = 1; + err = PTR_ERR(handle); + goto exit_err; + } + + while ((group = ext3_list_backups(sb, &three, &five, &seven)) < last) { + struct buffer_head *bh; + + /* Out of journal space, and can't get more - abort - so sad */ + if (handle->h_buffer_credits == 0 && + ext3_journal_extend(handle, EXT3_MAX_TRANS_DATA) && + (err = ext3_journal_restart(handle, EXT3_MAX_TRANS_DATA))) + break; + + bh = sb_getblk(sb, group * bpg + blk_off); + mark_buffer_uptodate(bh, 1); + ext3_debug(sb, __FUNCTION__, "update metadata backup %#04lx\n", + bh->b_blocknr); + if ((err = ext3_journal_get_write_access(handle, bh))) + break; + memcpy(bh->b_data, data, size); + if (rest) + memset(bh->b_data + size, 0, rest); + ext3_journal_dirty_metadata(handle, bh); + brelse(bh); + } + if ((err2 = ext3_journal_stop(handle, inode)) && !err) + err = err2; + + /* + * Ugh! Need to have e2fsck write the backup copies. It is too + * late to revert the resize, we shouldn't fail just because of + * the backup copies (they are only needed in case of corruption). + * + * However, if we got here we have a journal problem too, so we + * can't really start a transaction to mark the superblock. + * Chicken out and just set the flag on the hope it will be written + * to disk, and if not - we will simply wait until next fsck. 
+ */ +exit_err: + if (err) { + ext3_warning(sb, __FUNCTION__, + "can't update backup for group %d (err %d), " + "forcing fsck on next reboot\n", group, err); + sbi->s_mount_state &= ~EXT3_VALID_FS; + sbi->s_es->s_state &= ~cpu_to_le16(EXT3_VALID_FS); + mark_buffer_dirty(sbi->s_sbh); + } +} + +/* Add group descriptor data to an existing or new group descriptor block. + * Ensure we handle all possible error conditions _before_ we start modifying + * the filesystem, because we cannot abort the transaction and not have it + * write the data to disk. + * + * If we are on a GDT block boundary, we need to get the reserved GDT block. + * Otherwise, we may need to add backup GDT blocks for a sparse group. + * + * We only need to hold the superblock lock while we are actually adding + * in the new group's counts to the superblock. Prior to that we have + * not really "added" the group at all. We re-check that we are still + * adding in the last group in case things have changed since verifying. + */ +int ext3_group_add(struct super_block *sb, struct ext3_new_group_data *input) +{ + struct ext3_sb_info *sbi = EXT3_SB(sb); + struct ext3_super_block *es = sbi->s_es; + int reserved_gdb = ext3_bg_has_super(sb, input->group) ? 
+ le16_to_cpu(es->s_reserved_gdt_blocks) : 0; + struct buffer_head *primary = NULL; + struct ext3_group_desc *gdp; + struct inode *inode = NULL; + struct inode bogus; + handle_t *handle; + int gdb_off, gdb_num; + int err, err2; + + gdb_num = input->group / EXT3_DESC_PER_BLOCK(sb); + gdb_off = input->group % EXT3_DESC_PER_BLOCK(sb); + + if (gdb_off == 0 && !EXT3_HAS_RO_COMPAT_FEATURE(sb, + EXT3_FEATURE_RO_COMPAT_SPARSE_SUPER)) { + ext3_warning(sb, __FUNCTION__, + "Can't resize non-sparse filesystem further\n"); + return -EPERM; + } + + if (reserved_gdb || gdb_off == 0) { + if (!EXT3_HAS_COMPAT_FEATURE(sb, + EXT3_FEATURE_COMPAT_RESIZE_INODE)){ + ext3_warning(sb, __FUNCTION__, + "No reserved GDT blocks, can't resize\n"); + return -EPERM; + } + inode = iget(sb, EXT3_RESIZE_INO); + if (!inode || is_bad_inode(inode)) { + ext3_warning(sb, __FUNCTION__, + "Error opening resize inode\n"); + iput(inode); + return -ENOENT; + } + } else { + /* Used only for ext3 journal wrapper functions to get sb */ + inode = &bogus; + bogus.i_sb = sb; + } + + if ((err = verify_group_input(sb, input))) + goto exit_put; + + if ((err = setup_new_group_blocks(sb, inode, input))) + goto exit_put; + + /* + * We will always be modifying at least the superblock and a GDT + * block. If we are adding a group past the last current GDT block, + * we will also modify the inode and the dindirect block. If we + * are adding a group with superblock/GDT backups we will also + * modify each of the reserved GDT dindirect blocks. + */ + handle = ext3_journal_start(inode, ext3_bg_has_super(sb, input->group) ? 
+ 3 + reserved_gdb : 4); + if (IS_ERR(handle)) { + err = PTR_ERR(handle); + goto exit_put; + } + + lock_super(sb); + if (input->group != EXT3_SB(sb)->s_groups_count) { + ext3_warning(sb, __FUNCTION__, + "multiple resizers run on filesystem!\n"); + err = -EBUSY; + goto exit_journal; + } + + if ((err = ext3_journal_get_write_access(handle, sbi->s_sbh))) + goto exit_journal; + + /* + * We will only either add reserved group blocks to a backup group + * or remove reserved blocks for the first group in a new group block. + * Doing both would be mean more complex code, and sane people don't + * use non-sparse filesystems anymore. This is already checked above. + */ + if (gdb_off) { + primary = sbi->s_group_desc[gdb_num]; + if ((err = ext3_journal_get_write_access(handle, primary))) + goto exit_journal; + + if (reserved_gdb && ext3_bg_num_gdb(sb, input->group) && + (err = reserve_backup_gdb(handle, inode, input))) + goto exit_journal; + } else if ((err = add_new_gdb(handle, inode, input, &primary))) + goto exit_journal; + + /* Finally update group descriptor block for new group */ + gdp = (struct ext3_group_desc *)primary->b_data + gdb_off; + + gdp->bg_block_bitmap = cpu_to_le32(input->block_bitmap); + gdp->bg_inode_bitmap = cpu_to_le32(input->inode_bitmap); + gdp->bg_inode_table = cpu_to_le32(input->inode_table); + gdp->bg_free_blocks_count = cpu_to_le16(input->free_blocks_count); + gdp->bg_free_inodes_count = cpu_to_le16(EXT3_INODES_PER_GROUP(sb)); + + EXT3_SB(sb)->s_groups_count++; + ext3_journal_dirty_metadata(handle, primary); + + /* Update superblock with new block counts */ + es->s_blocks_count = cpu_to_le32(le32_to_cpu(es->s_blocks_count) + + input->blocks_count); + es->s_free_blocks_count = + cpu_to_le32(le32_to_cpu(es->s_free_blocks_count) + + input->free_blocks_count); + es->s_r_blocks_count = cpu_to_le32(le32_to_cpu(es->s_r_blocks_count) + + input->reserved_blocks); + es->s_inodes_count = cpu_to_le32(le32_to_cpu(es->s_inodes_count) + + 
EXT3_INODES_PER_GROUP(sb)); + es->s_free_inodes_count = + cpu_to_le32(le32_to_cpu(es->s_free_inodes_count) + + EXT3_INODES_PER_GROUP(sb)); + ext3_journal_dirty_metadata(handle, EXT3_SB(sb)->s_sbh); + sb->s_dirt = 1; + +exit_journal: + unlock_super(sb); + handle->h_sync = 1; + if ((err2 = ext3_journal_stop(handle, inode)) && !err) + err = err2; + if (!err) { + update_backups(sb, inode, sbi->s_sbh->b_blocknr, (char *)es, + sizeof(struct ext3_super_block)); + update_backups(sb, inode, primary->b_blocknr, primary->b_data, + primary->b_size); + } +exit_put: + if (inode != &bogus) + iput(inode); + return err; +} /* ext3_group_add */ + +/* Extend the filesystem to the new number of blocks specified. This entry + * point is only used to extend the current filesystem to the end of the last + * existing group. It can be accessed via ioctl, or by "remount,resize=<size>" + * for emergencies (because it has no dependencies on reserved blocks). + * + * If we _really_ wanted, we could use default values to call ext3_group_add() + * allow the "remount" trick to work for arbitrary resizing, assuming enough + * GDT blocks are reserved to grow to the desired size. + */ +int ext3_group_extend(struct super_block *sb, struct ext3_super_block *es, + unsigned long n_blocks_count) +{ + unsigned long o_blocks_count; + unsigned long o_groups_count; + unsigned long last; + int add; + struct inode *inode; + struct buffer_head * bh; + handle_t *handle; + int err; + + o_blocks_count = le32_to_cpu(es->s_blocks_count); + o_groups_count = EXT3_SB(sb)->s_groups_count; + + if (test_opt(sb, DEBUG)) + printk("EXT3-fs: extending last group from %lu to %lu blocks\n", + o_blocks_count, n_blocks_count); + + if (n_blocks_count == 0 || n_blocks_count == o_blocks_count) + return 0; + + if (n_blocks_count < o_blocks_count) { + ext3_warning(sb, __FUNCTION__, + "can't shrink FS - resize aborted"); + return -EBUSY; + } + + /* Handle the remaining blocks in the last group only. 
*/ + last = (o_blocks_count - le32_to_cpu(es->s_first_data_block)) % + EXT3_BLOCKS_PER_GROUP(sb); + + if (last == 0) { + ext3_warning(sb, __FUNCTION__, + "need to use ext2online to resize further\n"); + return -EPERM; + } + + add = EXT3_BLOCKS_PER_GROUP(sb) - last; + + if (o_blocks_count + add > n_blocks_count) + add = n_blocks_count - o_blocks_count; + + if (o_blocks_count + add < n_blocks_count) + ext3_warning(sb, __FUNCTION__, + "will only finish group (%lu blocks, %u new)", + o_blocks_count + add, add); + + /* See if the device is actually as big as what was requested */ + bh = sb_bread(sb, o_blocks_count + add -1); + if (!bh) { + ext3_warning(sb, __FUNCTION__, + "can't read last block, resize aborted"); + return -ENOSPC; + } + brelse(bh); + + if (!(inode = get_empty_inode())) { + ext3_warning(sb, __FUNCTION__, + "error getting dummy resize inode"); + return -ENOMEM; + } + + /* Fake out an inode to "free" the new blocks in this group. */ + inode->i_sb = sb; + inode->i_ino = 0; + EXT3_I(inode)->i_state = EXT3_STATE_RESIZE; + + /* We will update the superblock, one block bitmap, and + * one group descriptor via ext3_free_blocks(). 
+ */ + handle = ext3_journal_start(inode, 3); + if (IS_ERR(handle)) { + err = PTR_ERR(handle); + ext3_warning(sb, __FUNCTION__, "error %d on journal start",err); + goto exit_put; + } + + lock_super(sb); + if (o_blocks_count != le32_to_cpu(es->s_blocks_count)) { + ext3_warning(sb, __FUNCTION__, + "multiple resizers run on filesystem!\n"); + goto exit_put; + } + + if ((err = ext3_journal_get_write_access(handle, + EXT3_SB(sb)->s_sbh))) { + ext3_warning(sb, __FUNCTION__, + "error %d on journal write access", err); + unlock_super(sb); + ext3_journal_stop(handle, inode); + goto exit_put; + } + es->s_blocks_count = cpu_to_le32(o_blocks_count + add); + ext3_journal_dirty_metadata(handle, EXT3_SB(sb)->s_sbh); + sb->s_dirt = 1; + unlock_super(sb); + ext3_debug("freeing blocks %ld through %ld\n", o_blocks_count, + o_blocks_count + add); + ext3_free_blocks(handle, inode, o_blocks_count, add); + ext3_debug("freed blocks %ld through %ld\n", o_blocks_count, + o_blocks_count + add); + if ((err = ext3_journal_stop(handle, inode))) + goto exit_put; + if (test_opt(sb, DEBUG)) + printk("EXT3-fs: extended group to %u blocks\n", + le32_to_cpu(es->s_blocks_count)); + update_backups(sb, inode, EXT3_SB(sb)->s_sbh->b_blocknr, (char *)es, + sizeof(struct ext3_super_block)); +exit_put: + iput(inode); + + return err; +} /* ext3_group_extend */ diff -rNu linux-2.4.18-orig/fs/ext3/super.c linux-2.4.18/fs/ext3/super.c --- linux-2.4.18-orig/fs/ext3/super.c Mon Feb 25 11:38:08 2002 +++ linux-2.4.18/fs/ext3/super.c Tue Sep 10 11:18:06 2002 @@ -496,6 +496,7 @@ static int parse_options (char * options, unsigned long * sb_block, struct ext3_sb_info *sbi, unsigned long * inum, + unsigned long *n_blocks_count, int is_remount) { unsigned long *mount_options = &sbi->s_mount_opt; @@ -566,6 +567,27 @@ else if (!strcmp (this_char, "nogrpid") || !strcmp (this_char, "sysvgroups")) clear_opt (*mount_options, GRPID); +#ifdef CONFIG_EXT3_RESIZE + else if (!strcmp(this_char, "resize")) { + printk("EXT3-fs: 
parse_options: resize=%s\n", value); + if (!n_blocks_count) { + printk("EXT3-fs: resize option only available " + "for remount\n"); + return 0; + } + if (!value || !*value) { + printk("EXT3-fs: resize requires number of " + "blocks\n"); + return 0; + } + *n_blocks_count = simple_strtoul(value, &value, 0); + if (*value) { + printk("EXT3-fs: invalid resize option: %s\n", + value); + return 0; + } + } +#endif /* CONFIG_EXT3_RESIZE */ else if (!strcmp (this_char, "resgid")) { unsigned long v; if (want_numeric(value, "resgid", &v)) @@ -921,7 +943,8 @@ sbi->s_mount_opt = 0; sbi->s_resuid = EXT3_DEF_RESUID; sbi->s_resgid = EXT3_DEF_RESGID; - if (!parse_options ((char *) data, &sb_block, sbi, &journal_inum, 0)) { + if (!parse_options ((char *) data, &sb_block, sbi, &journal_inum, + NULL, 0)) { sb->s_dev = 0; goto out_fail; } @@ -1621,6 +1644,7 @@ { struct ext3_super_block * es; struct ext3_sb_info *sbi = EXT3_SB(sb); + unsigned long n_blocks_count = 0; unsigned long tmp; clear_ro_after(sb); @@ -1628,7 +1652,7 @@ /* * Allow the "check" option to be passed as a remount option. 
*/ - if (!parse_options(data, &tmp, sbi, &tmp, 1)) + if (!parse_options(data, &tmp, sbi, &tmp, &n_blocks_count, 1)) return -EINVAL; if (sbi->s_mount_opt & EXT3_MOUNT_ABORT) @@ -1636,7 +1660,8 @@ es = sbi->s_es; - if ((*flags & MS_RDONLY) != (sb->s_flags & MS_RDONLY)) { + if ((*flags & MS_RDONLY) != (sb->s_flags & MS_RDONLY) || + n_blocks_count > le32_to_cpu(es->s_blocks_count)) { if (sbi->s_mount_opt & EXT3_MOUNT_ABORT) return -EROFS; @@ -1675,6 +1700,8 @@ */ ext3_clear_journal_err(sb, es); sbi->s_mount_state = le16_to_cpu(es->s_state); + if ((ret = ext3_group_extend(sb, es, n_blocks_count))) + return ret; if (!ext3_setup_super (sb, es, 0)) sb->s_flags &= ~MS_RDONLY; } diff -rNu linux-2.4.18-orig/include/linux/ext3_fs.h linux-2.4.18/include/linux/ext3_fs.h --- linux-2.4.18-orig/include/linux/ext3_fs.h Mon Feb 25 11:38:13 2002 +++ linux-2.4.18/include/linux/ext3_fs.h Tue Sep 10 11:18:06 2002 @@ -213,20 +213,50 @@ */ #define EXT3_STATE_JDATA 0x00000001 /* journaled data exists */ #define EXT3_STATE_NEW 0x00000002 /* inode is newly created */ +#define EXT3_STATE_RESIZE 0x00000004 /* fake inode for resizing */ /* * ioctl commands */ -#define EXT3_IOC_GETFLAGS _IOR('f', 1, long) -#define EXT3_IOC_SETFLAGS _IOW('f', 2, long) -#define EXT3_IOC_GETVERSION _IOR('f', 3, long) -#define EXT3_IOC_SETVERSION _IOW('f', 4, long) -#define EXT3_IOC_GETVERSION_OLD _IOR('v', 1, long) -#define EXT3_IOC_SETVERSION_OLD _IOW('v', 2, long) + +/* Used to pass group descriptor data when online resize is done */ +struct ext3_new_group_input { + __u32 group; /* Group number for this data */ + __u32 block_bitmap; /* Absolute block number of block bitmap */ + __u32 inode_bitmap; /* Absolute block number of inode bitmap */ + __u32 inode_table; /* Absolute block number of inode table start */ + __u32 blocks_count; /* Total number of blocks in this group */ + __u16 reserved_blocks; /* Number of reserved blocks in this group */ + __u16 unused; +}; + +/* The struct ext3_new_group_input in kernel 
space, with free_blocks_count */ +struct ext3_new_group_data { + __u32 group; + __u32 block_bitmap; + __u32 inode_bitmap; + __u32 inode_table; + __u32 blocks_count; + __u16 reserved_blocks; + __u16 unused; + __u32 free_blocks_count; +}; + +#define EXT3_IOC_GETFLAGS _IOR('f', 1, long) +#define EXT3_IOC_SETFLAGS _IOW('f', 2, long) +#define EXT3_IOC_GETVERSION_NEW _IOR('f', 3, long) +#define EXT3_IOC_SETVERSION_NEW _IOW('f', 4, long) +#define EXT3_IOC_GROUP_EXTEND _IOW('f', 7, unsigned long) +#define EXT3_IOC_GROUP_ADD _IOW('f', 8,struct ext3_new_group_input) +#define EXT3_IOC_GETVERSION_OLD _IOR('v', 1, long) +#define EXT3_IOC_SETVERSION_OLD _IOW('v', 2, long) #ifdef CONFIG_JBD_DEBUG #define EXT3_IOC_WAIT_FOR_READONLY _IOR('f', 99, long) #endif +#define EXT3_IOC_SETVERSION EXT3_IOC_SETVERSION_NEW +#define EXT3_IOC_GETVERSION EXT3_IOC_GETVERSION_NEW + /* * Structure of an inode on the disk */ @@ -429,7 +459,7 @@ */ __u8 s_prealloc_blocks; /* Nr of blocks to try to preallocate*/ __u8 s_prealloc_dir_blocks; /* Nr to preallocate for dirs */ - __u16 s_padding1; + __u16 s_reserved_gdt_blocks; /* Per group desc for online growth */ /* * Journaling support valid if EXT3_FEATURE_COMPAT_HAS_JOURNAL set. */ @@ -651,6 +681,17 @@ extern int ext3_orphan_add(handle_t *, struct inode *); extern int ext3_orphan_del(handle_t *, struct inode *); +/* resize.c */ +#ifdef CONFIG_EXT3_RESIZE +extern int ext3_group_add(struct super_block *sb, + struct ext3_new_group_data *input); +extern int ext3_group_extend(struct super_block *sb, + struct ext3_super_block *es, + unsigned long n_blocks_count); +#else +#define ext3_group_extend(sb, es, n_blocks_count) 0 +#endif + /* super.c */ extern void ext3_error (struct super_block *, const char *, const char *, ...) __attribute__ ((format (printf, 3, 4))); -- Andreas Dilger http://www-mddsp.enel.ucalgary.ca/People/adilger/ http://sourceforge.net/projects/ext2resize/ |
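The tail-of-group arithmetic that ext3_group_extend() performs in the patch above can be tried out in userspace. The following is only a sketch of that calculation: the function name tail_blocks_to_add() is hypothetical and not part of the patch, and it returns 0 where the patch would return -EPERM because the last group is already full.

```c
#include <assert.h>

/*
 * Hypothetical helper (not in the patch): how many blocks can be added
 * to finish the last, partially filled group, capped at the requested
 * new size. Mirrors the "last"/"add" computation in ext3_group_extend().
 */
static unsigned long tail_blocks_to_add(unsigned long o_blocks_count,
					unsigned long n_blocks_count,
					unsigned long first_data_block,
					unsigned long blocks_per_group)
{
	unsigned long last, add;

	assert(blocks_per_group > 0);
	assert(o_blocks_count >= first_data_block);
	assert(n_blocks_count >= o_blocks_count);

	/* Blocks already used in the last group; 0 means it is full. */
	last = (o_blocks_count - first_data_block) % blocks_per_group;
	if (last == 0)
		return 0;	/* the patch returns -EPERM here */

	/* Room left in the last group... */
	add = blocks_per_group - last;

	/* ...but never more than the caller asked for. */
	if (o_blocks_count + add > n_blocks_count)
		add = n_blocks_count - o_blocks_count;

	return add;
}
```

For example, with 32768 blocks per group, growing from 100000 blocks towards 200000 only finishes the current group (31072 blocks); anything beyond that needs ext3_group_add().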
From: Brian J. M. <064...@in...> - 2002-10-03 18:07:54
|
On Thu, Oct 03, 2002 at 10:58:13AM -0700, Robert Walsh wrote:
>
> Running multiple resizers at once sounds like a weird situation -
> probably a mistake on the user's behalf, right? However, if it's
> possible to do it, then it shouldn't result in corrupt data or an oops,
> so it should be accounted for.

Putting aside the "least surprise" principle, think of a situation where you have a daemon watching the free space on all filesystems and automagically expanding them when they reach a threshold. It is then possible, through no mistake, for two resizing operations to take place simultaneously.

You are correct; whether or not it is advisable to do parallel resizing operations, it should either not break when you do it or it should not be possible (i.e. locking out subsequent simultaneous resize requests).

b.

--
Brian J. Murrell |
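One possible reading of "locking out subsequent simultaneous resize requests" is a try-lock that rejects a second resizer instead of making it wait. This is a userspace sketch with C11 atomics, not what the patch does (the patch re-checks the group count under lock_super()); struct resize_gate, resize_begin() and resize_end() are hypothetical names. The -EBUSY return matches the error the patch uses for multiple resizers.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical per-filesystem gate; initialise with ATOMIC_FLAG_INIT. */
struct resize_gate {
	atomic_flag busy;
};

/* Returns 0 if this caller now owns the resize, -EBUSY otherwise. */
static int resize_begin(struct resize_gate *gate)
{
	assert(gate != NULL);
	/* test_and_set returns the previous value: true means already held. */
	return atomic_flag_test_and_set(&gate->busy) ? -EBUSY : 0;
}

/* Must only be called by the caller for which resize_begin() returned 0. */
static void resize_end(struct resize_gate *gate)
{
	assert(gate != NULL);
	atomic_flag_clear(&gate->busy);
}
```

A free-space daemon and a manual resize racing each other would then see one success and one -EBUSY, rather than two interleaved resizes.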
From: Robert W. <rj...@du...> - 2002-10-03 17:58:23
|
> I could probably do this in a couple of hours if you would be willing
> to do the testing, or is the description enough? Of course, if you
> are not running multiple resizers at one time (by accident normally,
> of course) then the only place you actually need the sb lock is at
> the critical region previously mentioned in ext3_group_add() because
> we are updating the free counts, which are also updated by other parts
> of the code.

Running multiple resizers at once sounds like a weird situation - probably a mistake on the user's behalf, right? However, if it's possible to do it, then it shouldn't result in corrupt data or an oops, so it should be accounted for.

Sounds like you've got a better idea about how to do this than I do, so why don't you go ahead and I'll definitely give it a good testing.

Regards,
Robert. |
From: Andreas D. <ad...@cl...> - 2002-10-03 03:20:39
|
On Oct 02, 2002 17:32 -0700, Robert Walsh wrote:
>
> > The intention is that other processes cannot start using the new space
> > until it is completely set up. We want to avoid a situation where a
> > file is allocated in the new space, and then we say "oops, there was a
> > problem, let's yank that new space back" == missing data.
> >
> > I think this can be done fairly cleanly by ensuring everything is set up
> > before updating the superblock (in memory and on disk), and only holding
> > the lock over that update. For extend-the-end resizing, we already have
> > to drop the lock for ext3_free_blocks(), because it gets the lock itself.
>
> Until we get around to doing this, do you think the existing solution of
> dropping the lock before calling journal_start or journal_stop and
> reacquiring it after is sufficient?

Probably not - a quick look at ext3_group_extend already shows races w.r.t. multiple resizers if the lock were dropped. It should be relatively straightforward to rearrange the code to enter without the sb lock, get the journal handle first, and do the rest of the operations with the sb lock held (until ext3_free_blocks is called). update_backups can (and should, for the same reason) be called without the sb lock.

For ext3_group_add(), we have a bit more complexity, but nothing fatal. We can (and probably should) hold i_sem for protection against change of the reserved inode. As long as you re-check that this is still the last group being added after getting the sb lock, the only critical region where we need to hold the sb lock is from "gdp->bg_block_bitmap = ..." through "es->s_free_inodes_count = ...".

I could probably do this in a couple of hours if you would be willing to do the testing, or is the description enough? Of course, if you are not running multiple resizers at one time (by accident normally, of course) then the only place you actually need the sb lock is at the critical region previously mentioned in ext3_group_add() because we are updating the free counts, which are also updated by other parts of the code.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/ |
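The ordering described in the message above (journal handle first, then the sb lock, then a re-check that the group being added is still the last one) can be modelled in a few lines of userspace C. This is only a sketch of the ordering, not kernel code: struct model_sb and the model_*() functions are hypothetical stand-ins for the superblock, lock_super()/unlock_super() and the ext3 journal wrappers.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical userspace stand-in for the superblock state. */
struct model_sb {
	unsigned long groups_count;	/* models sbi->s_groups_count */
	int sb_locked;			/* models lock_super()/unlock_super() */
	int handle_open;		/* models an open journal handle */
};

static int model_journal_start(struct model_sb *sb)
{
	/* Asking for a handle while holding the sb lock is the wrong order. */
	if (sb->sb_locked)
		return -EDEADLK;
	sb->handle_open = 1;
	return 0;
}

static void model_journal_stop(struct model_sb *sb)
{
	sb->handle_open = 0;
}

static void model_lock_super(struct model_sb *sb)
{
	assert(!sb->sb_locked);
	sb->sb_locked = 1;
}

static void model_unlock_super(struct model_sb *sb)
{
	assert(sb->sb_locked);
	sb->sb_locked = 0;
}

static int model_group_add(struct model_sb *sb, unsigned long group)
{
	int err;

	err = model_journal_start(sb);	/* 1. handle first, no sb lock held */
	if (err)
		return err;

	model_lock_super(sb);		/* 2. only now take the sb lock */
	if (group != sb->groups_count)	/* 3. still adding the last group? */
		err = -EBUSY;
	else
		sb->groups_count++;	/* the short critical region */
	model_unlock_super(sb);

	model_journal_stop(sb);		/* 4. stop the handle without the lock */
	return err;
}
```

A second caller that prepared the same group number loses the re-check in step 3 and gets -EBUSY, which is the behaviour the "multiple resizers" warning in the patch is after.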
From: Robert W. <rj...@du...> - 2002-10-03 00:31:42
|
> The intention is that other processes cannot start using the new space
> until it is completely set up. We want to avoid a situation where a
> file is allocated in the new space, and then we say "oops, there was a
> problem, let's yank that new space back" == missing data.
>
> I think this can be done fairly cleanly by ensuring everything is set up
> before updating the superblock (in memory and on disk), and only holding
> the lock over that update. For extend-the-end resizing, we already have
> to drop the lock for ext3_free_blocks(), because it gets the lock itself.

Until we get around to doing this, do you think the existing solution of dropping the lock before calling journal_start or journal_stop and reacquiring it after is sufficient?

Regards,
Robert.

--
Robert Walsh
Amalgamated Durables, Inc. - "We don't make the things you buy."
Email: rj...@du... |
From: Andreas D. <ad...@cl...> - 2002-10-02 21:16:33
|
On Oct 02, 2002 13:43 -0700, Robert Walsh wrote:
> [ Andreas: just for context - I was getting an oops while doing I/O on a
> filesystem that was being resized. ]
>
> > Is this ext3? Probably you have someone doing a journal_start()
> > while holding the superblock lock. That's the wrong order.
>
> That was it. The resize code does a lock_super/unlock_super around the
> two resize entry points in ioctl.c and calls journal_start/journal_stop
> in several places while doing the resize. This was causing the oops
> while doing I/O and resizing simultaneously.
>
> Since I don't know when the superblock actually really needs to be
> locked, I err'd on the side of caution and put an
> unlock_super/lock_super around each call to journal_start and
> journal_stop. This seemed to make the problem go away but it's a lot of
> locking/unlocking to be going on.
>
> Is it really necessary for the resize code to hold onto the superblock
> locks for such a long length of time? Would it be possible to minimize
> the amount of locking needed here?

The superblock lock is needed in a few places for resizing:
- to protect s_gdb_count, s_groups_count, s_group_desc (for adding new groups only)
- to protect the on-disk values s_*_blocks_count, s_*inodes_count (for all resizes)

The intention is that other processes cannot start using the new space until it is completely set up. We want to avoid a situation where a file is allocated in the new space, and then we say "oops, there was a problem, let's yank that new space back" == missing data.

I think this can be done fairly cleanly by ensuring everything is set up before updating the superblock (in memory and on disk), and only holding the lock over that update. For extend-the-end resizing, we already have to drop the lock for ext3_free_blocks(), because it gets the lock itself.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/ |
From: Robert W. <rj...@du...> - 2002-10-02 20:42:50
|
[ Andreas: just for context - I was getting an oops while doing I/O on a filesystem that was being resized. ]

> Is this ext3? Probably you have someone doing a journal_start()
> while holding the superblock lock. That's the wrong order.

That was it. The resize code does a lock_super/unlock_super around the two resize entry points in ioctl.c and calls journal_start/journal_stop in several places while doing the resize. This was causing the oops while doing I/O and resizing simultaneously.

Since I don't know when the superblock actually really needs to be locked, I err'd on the side of caution and put an unlock_super/lock_super around each call to journal_start and journal_stop. This seemed to make the problem go away but it's a lot of locking/unlocking to be going on.

Is it really necessary for the resize code to hold onto the superblock locks for such a long length of time? Would it be possible to minimize the amount of locking needed here?

I can create a new patch with the fix, but if there's a better way of doing this, I'd like to wait...

Regards,
Robert.

--
Robert Walsh
Amalgamated Durables, Inc. - "We don't make the things you buy."
Email: rj...@du... |
From: Andreas D. <ad...@cl...> - 2002-09-13 23:26:49
|
On Sep 12, 2002 15:41 -0700, Robert Walsh wrote:
> > I don't think this will have an effect. If you specify "-O resize_inode"
> > to mke2fs, it will pick a default of 1024x the specified filesystem size,
> > and the libext2 code will create a resize inode with enough blocks,
> > after it has determined all of the needed parameters (ext2fs_initialize()
> > maybe). I _think_ there is a calculation in that same function which
> > determines the inode table offset, based on the group blocks + 3 (super,
> > block, inode bitmaps). It probably needs s_reserved_gdt_blocks added in
> > there and we will be happy.
>
> Is this the one?
>
> /*
>  * Overhead is the number of bookkeeping blocks per group. It
>  * includes the superblock backup, the group descriptor
>  * backups, the inode bitmap, the block bitmap, and the inode
>  * table.
>  */
> overhead = (int) (2 + fs->inode_blocks_per_group);
>
> if (ext2fs_bg_has_super(fs, fs->group_desc_count - 1))
>         overhead += 1 + fs->desc_blocks + super->s_reserved_gdt_blocks;

Nope, this value is only used to determine whether the last group is too small to be included in the filesystem. The real culprit appears to be in ext2fs_allocate_group_table(), where it is calculating the start_blk value. It isn't including the reserved GDT blocks there, just before it allocates the inode table.

Since I changed the above "overhead" calculation to not require the reserved blocks for non-backup-holding groups, the code in ext2fs_allocate_group_table() probably needs to special-case the situation where (the corrected) start_blk + fs->inode_blocks_per_group is more than last_blk, and move the inode table down until it fits.

Or, we could revert the "overhead" calculation changes (and put the comment that "being clever is tricky" back in) like you suggested, so that it always includes the backup and reserved GDT.

The only problem is as follows: say you are reserving for a huge filesystem, and you want 1024 reserved group blocks (the maximum for a 4kB block filesystem, which would give you an upper filesystem size of 16TB[*]). If it turns out that the last group in your filesystem does not contain backups (probably true), but it is smaller than ~4MB, you will lose that space until you resize slightly larger than 4MB, even though you don't need to store a group descriptor table into that group ever.

Maybe it's not a big deal (4MB is in the noise these days ;-), and you are already "wasting" at least 8MB of space for the initial reserved GDT blocks in groups 0 and 1. It's not like you don't get that space back later when you resize, and presumably you are going to resize if you are reserving blocks.

Cheers, Andreas

[*] (coincidentally also the current maximum 4kB filesystem size, because of the 2^32 block limit)
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/ |
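The 16TB and 8MB figures in the message above can be checked with a little arithmetic. The sketch below assumes the classic ext2/ext3 geometry, which the message does not spell out: 32-byte group descriptors and 8 * blocksize blocks per group (one bitmap block per group). The function names are hypothetical, and gdt_blocks counts all descriptor blocks rather than only the reserved ones.

```c
#include <assert.h>
#include <stdint.h>

#define GROUP_DESC_SIZE 32u	/* assumed: bytes per ext2/ext3 group descriptor */

/* Largest filesystem (in bytes) describable by gdt_blocks descriptor blocks. */
static uint64_t max_fs_bytes(uint32_t block_size, uint32_t gdt_blocks)
{
	uint64_t descs_per_block, blocks_per_group, groups;

	assert(block_size >= 1024 && (block_size & (block_size - 1)) == 0);

	descs_per_block = block_size / GROUP_DESC_SIZE;	/* 128 at 4kB */
	blocks_per_group = 8ull * block_size;		/* 32768 at 4kB */
	groups = descs_per_block * gdt_blocks;

	return groups * blocks_per_group * block_size;
}

/* Size limit imposed by 32-bit block numbers. */
static uint64_t block_limit_bytes(uint32_t block_size)
{
	return ((uint64_t)1 << 32) * block_size;
}

/* Space held by reserved GDT blocks across the groups that carry them. */
static uint64_t reserved_gdt_bytes(uint32_t block_size, uint32_t reserved,
				   uint32_t groups_with_backup)
{
	return (uint64_t)block_size * reserved * groups_with_backup;
}
```

With 4kB blocks, 1024 descriptor blocks describe 2^17 groups of 2^15 blocks each, i.e. 2^32 blocks or 16TB, which is also where the 32-bit block number limit sits; 1024 reserved blocks in each of groups 0 and 1 come to 8MB.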
From: Robert W. <rj...@du...> - 2002-09-13 21:06:33
|
So, here's what I've done so far:

I haven't touched mke2fs at all - I'm going to let it place the stuff wherever it thinks it should.

I commented out the stride calculation code in ext2online. It now assumes a stride of 0.

I twiddled ext2online's idea of where to put the block bitmap and inode bitmap in groups with a sb backup. They now get placed right after the inode table, which is exactly where mke2fs puts them.

It now works, although this exercise was really to validate my ideas and I don't really consider it a solution. The real solution involves fixing mke2fs so that it keeps a consistent layout, modifying the stride calculations in ext2online so that the stride gets calculated correctly, and modifying the subsequent layout so that it more closely matches the fixed layout from mke2fs. I'll also try to ensure that even if it gets a weird layout from mke2fs, ext2online will still be able to handle it.

I'll keep you informed of my progress and send you diffs to mke2fs when I get something sane working.

Regards,
Robert.

--
Robert Walsh
Amalgamated Durables, Inc. - "We don't make the things you buy."
Email: rj...@du... |
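The placement change described above can be written down as a small function. This is a sketch and not the ext2online code: place_bitmaps() and struct bitmap_pos are hypothetical names, the inode table position is taken as an input (wherever mke2fs put it), and the order of the two bitmaps after the inode table (block bitmap first) is an assumption. The rule for groups without a backup (bitmaps 2 blocks and 1 block before the inode table) is the one discussed elsewhere in this thread.

```c
#include <assert.h>
#include <stdint.h>

struct bitmap_pos {
	uint32_t block_bitmap;
	uint32_t inode_bitmap;
};

/*
 * Hypothetical helper: choose bitmap locations for a new group, given
 * where its inode table starts and how many blocks the table occupies.
 */
static struct bitmap_pos place_bitmaps(uint32_t inode_table,
				       uint32_t itable_blocks,
				       int has_sb_backup)
{
	struct bitmap_pos pos;

	assert(itable_blocks > 0);

	if (has_sb_backup) {
		/*
		 * Backup groups: put both bitmaps right after the inode
		 * table so they stay clear of the reserved GDT blocks.
		 */
		pos.block_bitmap = inode_table + itable_blocks;
		pos.inode_bitmap = pos.block_bitmap + 1;
	} else {
		/* Other groups: bitmaps sit just before the inode table. */
		assert(inode_table >= 2);
		pos.block_bitmap = inode_table - 2;
		pos.inode_bitmap = inode_table - 1;
	}
	return pos;
}
```

Taking the inode table position as given keeps the sketch independent of the stride question, which the message above leaves open.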
From: Andreas D. <ad...@cl...> - 2002-09-13 19:57:56
|
On Sep 13, 2002 10:49 -0700, Robert Walsh wrote: > It appears to me that what's happening is that ext2online is trying to > guess the stride value using the differences in the block bitmap > position in each block group. This doesn't always give valid values, > but whether it does or it doesn't, it then tries to calculate the inode > table offset using its location in the first block group. It then > creates a bunch of new groups to add using the inode table offset as > its key. It places the block bitmap and inode bitmap 2 blocks and 1 > block before the inode table, respectively. On block groups without a > super block backup, this is fine. On those with a super block backup, > this causes the bitmaps to fall into the range of the reserved gdt > blocks because in those circumstances, the inode table generated by > mke2fs happens to fall before the bitmaps. Good detective work. Yes this is pretty much a summary of how things happen on the ext2online side, based on how previous ext2 filesystems were set up. I think mke2fs is changing its behaviour slightly because the reserved group descriptors are using up the space that it had saved for the inode and block bitmaps, and the inode table is using them up first. > I guess that mke2fs is producing a legal layout, or is it? Yes, it is totally legal, but just not very common. However, I have also learned recently that Ted's resize2fs tool also will move only the bitmap blocks to after the inode table in the case it needs to (offline) resize, while ext2resize will work to keep the old ordering. > It certainly works for us, but it doesn't produce anything that > looks like the layout in your ols2002 paper. One thought might be to > rearrange the order in which it allocates blocks, so that it always > allocates the bitmaps first and then the inode table. This will at > least guarantee the ordering will be correct. I need to look at this > a little more to be sure, though. Would this be for mke2fs or ext2online? 
There might be two issues here. 1) mke2fs should try to keep the old "standard" layout as much as possible, because there are likely other tools which don't know anything about the fact that bitmaps can move (GNU parted might be one of those, because it uses a very old version of the ext2resize code). This simply means making sure that the location of the inode table is offset enough to compensate for the reserved group descriptor blocks as we previously discussed. 2) ext2online shouldn't pass bad data (i.e. invalid bitmap locations) to the kernel. Rather than picking the offset of the first group's inode table blindly, it should calculate itself what the offset needs to be sb + (2 bitmaps if not "striped") + in-use GDT blk + reserved GDT blk and use the maximum of that and the "current" inode table offset. I suppose it is almost doing this, but is getting confused because it assumes that if _any_ bitmaps are after the inode table, then it is "striped" and they will _all_ be after the inode table. The bug happens when it can't find a regular striping pattern and uses stride = 0, it assumes that there is enough space before the inode table to hold the bitmaps. Cheers, Andreas -- Andreas Dilger http://www-mddsp.enel.ucalgary.ca/People/adilger/ http://sourceforge.net/projects/ext2resize/ |
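Point 2 above amounts to a small formula: superblock + (2 bitmaps if not "striped") + in-use GDT blocks + reserved GDT blocks, and then the maximum of that and the current inode table offset. A minimal sketch of that rule follows; the function name and parameters are hypothetical and this is not the ext2online source.

```c
#include <stdint.h>

/*
 * Hypothetical helper: offset (in blocks from the start of a group that
 * carries a superblock backup) at which the inode table can safely start.
 *
 *   cur_offset          offset of the inode table observed in group 0
 *   used_gdt_blocks     group descriptor blocks currently in use
 *   reserved_gdt_blocks descriptor blocks reserved for future growth
 *   striped             non-zero if the bitmaps live after the inode table
 */
uint32_t safe_itable_offset(uint32_t cur_offset,
                            uint32_t used_gdt_blocks,
                            uint32_t reserved_gdt_blocks,
                            int striped)
{
    uint32_t need = 1;                  /* the superblock backup itself */

    if (!striped)
        need += 2;                      /* block bitmap + inode bitmap */
    need += used_gdt_blocks + reserved_gdt_blocks;

    /* Never move the table earlier than where the filesystem already has it. */
    return need > cur_offset ? need : cur_offset;
}
```

With 1 in-use and 1024 reserved descriptor blocks and no striping, the computed minimum is 1028 blocks, so an observed offset of 3 would be raised to 1028 while an observed offset of 2000 would be kept.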
From: Andreas D. <ad...@cl...> - 2002-09-13 19:38:11
|
On Sep 12, 2002 16:56 -0700, Robert Walsh wrote: > Should the reserved blocks be reserved even in a block group that does > not contain a super block backup? Or are they available for use in the > block group? The reserved blocks are only used in the groups that have a superblock backup (and hence also a group descriptor table backup). They are not touched (allocated or used) in groups that do not have backups. In theory, the ext2resize/ext2online code should be able to handle having arbitrary locations for the inode table and the bitmaps, but it may be getting confused by the fact that there are no free blocks before the inode table for the table at all, in the case of a zero stride. Cheers, Andreas -- Andreas Dilger http://www-mddsp.enel.ucalgary.ca/People/adilger/ http://sourceforge.net/projects/ext2resize/ |
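To make "groups that have a superblock backup" concrete, here is a small sketch. It assumes the sparse_super feature is enabled, in which case backups are kept in groups 0 and 1 and in groups whose number is a power of 3, 5 or 7; without that feature every group carries a backup. The thread itself does not spell this out, and the function names are illustrative, not the e2fsprogs API.

```c
/* Returns 1 if n is an exact power of base (n >= 1), otherwise 0. */
static int is_power_of(unsigned int n, unsigned int base)
{
    while (n > 1) {
        if (n % base != 0)
            return 0;
        n /= base;
    }
    return n == 1;
}

/* Assumption: sparse_super layout (groups 0, 1, and powers of 3, 5, 7). */
int group_has_backup(unsigned int group)
{
    if (group <= 1)
        return 1;
    return is_power_of(group, 3) || is_power_of(group, 5) ||
           is_power_of(group, 7);
}

/* Reserved GDT blocks only occupy space in groups that carry a backup. */
unsigned int reserved_gdt_in_group(unsigned int group,
                                   unsigned int reserved_gdt_blocks)
{
    return group_has_backup(group) ? reserved_gdt_blocks : 0;
}
```

Under that assumption, groups 0, 1, 3, 5, 7, 9, 25, 27 and 49 hold reserved blocks, while groups such as 2, 4, 6 or 15 leave those blocks available as ordinary data blocks.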
From: Robert W. <rj...@du...> - 2002-09-13 17:49:03
|
Hi, It appears to me that what's happening is that ext2online is trying to guess the stride value using the differences in the block bitmap position in each block group. This doesn't always give valid values, but whether it does or it doesn't, it then tries to calculate the inode table offset using its location in the first block group. It then creates a bunch of new groups to add using the inode table offset as its key. It places the block bitmap and inode bitmap 2 blocks and 1 block before the inode table, respectively. On block groups without a super block backup, this is fine. On those with a super block backup, this causes the bitmaps to fall into the range of the reserved gdt blocks because in those circumstances, the inode table generated by mke2fs happens to fall before the bitmaps. I guess that mke2fs is producing a legal layout, or is it? It certainly works for us, but it doesn't produce anything that looks like the layout in your ols2002 paper. One thought might be to rearrange the order in which it allocates blocks, so that it always allocates the bitmaps first and then the inode table. This will at least guarantee the ordering will be correct. I need to look at this a little more to be sure, though. Regards, Robert. -- Robert Walsh Amalgamated Durables, Inc. - "We don't make the things you buy." Email: rj...@du... |
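The placement described in this message, and the way it collides with the descriptor area, can be reproduced in a few lines. This is a sketch under assumptions, not the ext2online code: the struct, the function names and the idea that the descriptor area starts one block after the group start are all illustrative.

```c
#include <stdint.h>

struct group_layout {
    uint64_t block_bitmap;
    uint64_t inode_bitmap;
    uint64_t inode_table;
};

/*
 * Place the bitmaps 2 and 1 blocks before the inode table, as described.
 * itable_offset must be at least 2 or the subtraction would wrap around.
 */
struct group_layout place_before_itable(uint64_t group_start,
                                        uint64_t itable_offset)
{
    struct group_layout g;

    g.inode_table  = group_start + itable_offset;
    g.block_bitmap = g.inode_table - 2;
    g.inode_bitmap = g.inode_table - 1;
    return g;
}

/*
 * Assumed layout of a backup group: superblock at group_start, then
 * gdt_blocks descriptor blocks (in use plus reserved). Pass 0 for a group
 * without a backup; the range is then empty and nothing can collide.
 */
int bitmaps_hit_gdt(const struct group_layout *g,
                    uint64_t group_start, uint64_t gdt_blocks)
{
    uint64_t first = group_start + 1;
    uint64_t last  = group_start + gdt_blocks;

    return (g->block_bitmap >= first && g->block_bitmap <= last) ||
           (g->inode_bitmap >= first && g->inode_bitmap <= last);
}
```

For example, if the inode table sits directly after 1024 descriptor blocks (offset 1025), the two bitmaps land on the last two descriptor blocks of a backup group, while the same placement is harmless in a group without a backup.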
From: Robert W. <rj...@du...> - 2002-09-13 04:32:30
|
Hi Andreas, Should the reserved blocks be reserved even in a block group that does not contain a super block backup? Or are they available for use in the block group? Regards, Robert. |
From: Robert W. <rj...@du...> - 2002-09-13 04:14:30
|
> I don't think this will have an effect. If you specify "-O resize_inode"
> to mke2fs, it will pick a default of 1024x the specified filesystem size,
> and the libext2 code will create a resize inode with enough blocks,
> after it has determined all of the needed parameters (ext2fs_initialize()
> maybe). I _think_ there is a calculation in that same function which
> determines the inode table offset, based on the group blocks + 3 (super,
> block, inode bitmaps). It probably needs s_reserved_gdt_blocks added in
> there and we will be happy.

Is this the one?

    /*
     * Overhead is the number of bookkeeping blocks per group. It
     * includes the superblock backup, the group descriptor
     * backups, the inode bitmap, the block bitmap, and the inode
     * table.
     */
    overhead = (int) (2 + fs->inode_blocks_per_group);
    if (ext2fs_bg_has_super(fs, fs->group_desc_count - 1))
        overhead += 1 + fs->desc_blocks + super->s_reserved_gdt_blocks;

Looks like it only does this if there's a backup superblock there. Would this work:

    /*
     * Overhead is the number of bookkeeping blocks per group. It
     * includes the superblock backup, the group descriptor
     * backups, the inode bitmap, the block bitmap, and the inode
     * table.
     */
    overhead = (int) (2 + fs->inode_blocks_per_group);
    /* if (ext2fs_bg_has_super(fs, fs->group_desc_count - 1)) */
    overhead += 1 + fs->desc_blocks + super->s_reserved_gdt_blocks;

Would that be safe?

Regards,
Robert. |
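To compare the two variants side by side, the overhead calculation can be lifted into a standalone sketch that uses plain integers instead of the libext2fs structures. The struct and function names here are hypothetical; only the arithmetic is taken from the quoted snippet, and the sketch says nothing about whether the change is safe inside mke2fs.

```c
#include <stdint.h>

/* Stand-ins for the fields used by the quoted snippet. */
struct layout {
    uint32_t inode_blocks_per_group;
    uint32_t desc_blocks;           /* descriptor blocks in use */
    uint32_t reserved_gdt_blocks;   /* s_reserved_gdt_blocks */
};

/* Original form: backup-related blocks counted only if the group has one. */
uint32_t overhead_conditional(const struct layout *l, int group_has_super)
{
    uint32_t overhead = 2 + l->inode_blocks_per_group;

    if (group_has_super)
        overhead += 1 + l->desc_blocks + l->reserved_gdt_blocks;
    return overhead;
}

/* Proposed form: backup-related blocks always counted. */
uint32_t overhead_unconditional(const struct layout *l)
{
    return 2 + l->inode_blocks_per_group +
           1 + l->desc_blocks + l->reserved_gdt_blocks;
}
```

With 512 inode table blocks, 1 descriptor block and 1023 reserved blocks, the conditional form gives 514 blocks for a group without a backup, whereas the unconditional form always gives 1539; the two agree only for groups that do carry a backup.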
From: Andreas D. <ad...@cl...> - 2002-09-12 20:47:43
|
On Sep 12, 2002 13:30 -0700, Robert Walsh wrote: > On Thu, 2002-09-12 at 12:24, Andreas Dilger wrote: > > On Sep 12, 2002 11:49 -0700, Robert Walsh wrote: > > > Attached is the output of two different mke2fs's on a small loopback > > > file. One has the resize inode. The other doesn't. Neither had a > > > stride specified. As you can see, the one that has the resize inode has > > > a different layout depending on whether there is supposed to be a backup > > > super block there or not, which is what is causing ext2online to get > > > upset. > > > > This may have something to do with how the libext2 code does block > > allocation. I would think that it puts all of the inode tables at > > the same offsets, but it might be missing a "reserved_gdt" count in > > its initial calculations. > > Well, there is this in the code that parses the resize parameter: > > /* XXX param->s_res_gdt_blocks = resize - existing > cur_groups = (resize - sb->s_first_data_block + > EXT2_BLOCKS_PER_GROUP(super) - 1) /bpg; > cur_gdb = (cur_groups + gdpb - 1) / gdpb; > */ > > I assume this was an attempt to set this up, right? But the problem is > figuring out what "existing" is. I also assume this was the missing > information you mentioned in your email yesterday: > > > That parameter doesn't actually work yet, for a reason that currently > > escapes me. Something about us needing data (like blocksize, or blocks > > per group, or something) that we don't have until inside libext2 where > > the filesystem is being created... We need to go from "-R resize=foo" > > to a number of group descriptor blocks that are passed in the > > superblock to the libext2 create routines. > > Anyway, I'm going to see if I can get the above bit working and if that > has an effect on the inode table offsets. I don't think this will have an effect. 
If you specify "-O resize_inode" to mke2fs, it will pick a default of 1024x the specified filesystem size, and the libext2 code will create a resize inode with enough blocks, after it has determined all of the needed parameters (ext2fs_initialize() maybe). I _think_ there is a calculation in that same function which determines the inode table offset, based on the group blocks + 3 (super, block, inode bitmaps). It probably needs s_reserved_gdt_blocks added in there and we will be happy. Cheers, Andreas -- Andreas Dilger http://www-mddsp.enel.ucalgary.ca/People/adilger/ http://sourceforge.net/projects/ext2resize/ |
From: Robert W. <rj...@du...> - 2002-09-12 20:30:42
|
On Thu, 2002-09-12 at 12:24, Andreas Dilger wrote:
> On Sep 12, 2002 11:49 -0700, Robert Walsh wrote:
> > Attached is the output of two different mke2fs's on a small loopback
> > file. One has the resize inode. The other doesn't. Neither had a
> > stride specified. As you can see, the one that has the resize inode has
> > a different layout depending on whether there is supposed to be a backup
> > super block there or not, which is what is causing ext2online to get
> > upset.
>
> This may have something to do with how the libext2 code does block
> allocation. I would think that it puts all of the inode tables at
> the same offsets, but it might be missing a "reserved_gdt" count in
> its initial calculations.

Well, there is this in the code that parses the resize parameter:

    /* XXX param->s_res_gdt_blocks = resize - existing
    cur_groups = (resize - sb->s_first_data_block +
                  EXT2_BLOCKS_PER_GROUP(super) - 1) /bpg;
    cur_gdb = (cur_groups + gdpb - 1) / gdpb;
    */

I assume this was an attempt to set this up, right? But the problem is figuring out what "existing" is. I also assume this was the missing information you mentioned in your email yesterday:

> That parameter doesn't actually work yet, for a reason that currently
> escapes me. Something about us needing data (like blocksize, or blocks
> per group, or something) that we don't have until inside libext2 where
> the filesystem is being created... We need to go from "-R resize=foo"
> to a number of group descriptor blocks that are passed in the
> superblock to the libext2 create routines.

Anyway, I'm going to see if I can get the above bit working and if that has an effect on the inode table offsets.

Regards,
Robert. |
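The commented-out fragment is two round-up divisions: blocks to groups, then groups to descriptor blocks. One possible reading of "resize - existing" is the difference between the descriptor blocks needed at the target size and those needed at the current size. The sketch below follows that reading, which is an interpretation rather than anything confirmed in the thread; the function names are invented for the example.

```c
#include <stdint.h>

/* Integer division rounding up; d must be non-zero. */
uint64_t div_round_up(uint64_t n, uint64_t d)
{
    return (n + d - 1) / d;
}

/* Number of block groups needed to cover a filesystem of `blocks` blocks. */
uint64_t groups_for(uint64_t blocks, uint64_t first_data_block, uint64_t bpg)
{
    return div_round_up(blocks - first_data_block, bpg);
}

/* Descriptor blocks needed for `groups` groups, gdpb descriptors per block. */
uint64_t gdt_blocks_for(uint64_t groups, uint64_t gdpb)
{
    return div_round_up(groups, gdpb);
}

/*
 * Assumed meaning of "resize - existing": descriptor blocks to reserve so
 * the filesystem can grow from existing_blocks to resize_blocks.
 */
uint64_t reserved_gdt_for(uint64_t resize_blocks, uint64_t existing_blocks,
                          uint64_t first_data_block,
                          uint64_t bpg, uint64_t gdpb)
{
    uint64_t want = gdt_blocks_for(
        groups_for(resize_blocks, first_data_block, bpg), gdpb);
    uint64_t have = gdt_blocks_for(
        groups_for(existing_blocks, first_data_block, bpg), gdpb);

    return want > have ? want - have : 0;
}
```

For a 4kB-block filesystem (32768 blocks per group, 128 descriptors per block, both assumed), growing from 1GB to 1TB means going from 8 groups and 1 descriptor block to 8192 groups and 64 descriptor blocks, so 63 blocks would be reserved under this reading.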
From: Andreas D. <ad...@cl...> - 2002-09-12 19:26:03
|
On Sep 12, 2002 11:49 -0700, Robert Walsh wrote: > Attached is the output of two different mke2fs's on a small loopback > file. One has the resize inode. The other doesn't. Neither had a > stride specified. As you can see, the one that has the resize inode has > a different layout depending on whether there is supposed to be a backup > super block there or not, which is what is causing ext2online to get > upset. This may have something to do with how the libext2 code does block allocation. I would think that it puts all of the inode tables at the same offsets, but it might be missing a "reserved_gdt" count in its initial calculations. Cheers, Andreas -- Andreas Dilger http://www-mddsp.enel.ucalgary.ca/People/adilger/ http://sourceforge.net/projects/ext2resize/ |
From: Robert W. <rj...@du...> - 2002-09-12 18:49:16
|
> Can you look at the output from dumpe2fs to see when/where/why the block > and inode bitmaps come after the inode table (which in my books means > that a stride was used). If not, then the "stride detection" code in > ext2online is broken for some reason... It's not supposed to be critical > stuff, mind you, just trying to keep the new groups laid out as the > original creator intended. Attached is the output of two different mke2fs's on a small loopback file. One has the resize inode. The other doesn't. Neither had a stride specified. As you can see, the one that has the resize inode has a different layout depending on whether there is supposed to be a backup super block there or not, which is what is causing ext2online to get upset. > > verify_group_input: Block bitmap (11240450) in GDT table > > (11239424-11240451) > > That would be badness then. Probably the user-space calcs are bad, and > not the check here, so the kernel is getting bad data. If you could > trace through where these calculations are going wrong, I can tell you > what the intention of them is, if you have questions. Sure. I'm going to trace through this today, so I'll let you know as soon as I get some interesting results. I don't know if I made it clear yesterday, but this message never occurs when the stride is specified _AND_ set insanely huge. Looking at the mke2fs code, when the stride is set to an insanely huge value, it should result in the start_blk for each block group being set back to the first block in the block group. Hmm. Regards, Robert. |