From: Alex G. <ag...@is...> - 2016-08-05 23:35:46
On Tuesday, August 2, 2016, Ilya Dryomov <idr...@gm...> wrote:
> On Tue, Aug 2, 2016 at 3:49 PM, Alex Gorbachev <ag...@is...> wrote:
> > On Mon, Aug 1, 2016 at 11:03 PM, Vladislav Bolkhovitin <vs...@vl...> wrote:
> >> Alex Gorbachev wrote on 08/01/2016 04:05 PM:
> >>> Hi Ilya,
> >>>
> >>> On Mon, Aug 1, 2016 at 3:07 PM, Ilya Dryomov <idr...@gm...> wrote:
> >>>> On Mon, Aug 1, 2016 at 7:55 PM, Alex Gorbachev <ag...@is...> wrote:
> >>>>> RBD illustration showing RBD ignoring discard until a certain
> >>>>> threshold - why is that? This behavior is unfortunately incompatible
> >>>>> with ESXi discard (UNMAP) behavior.
> >>>>>
> >>>>> Is there a way to lower the discard sensitivity on RBD devices?
> >>>>>
> >>> <snip>
> >>>>>
> >>>>> root@e1:/var/log# blkdiscard -o 0 -l 4096000 /dev/rbd28
> >>>>> root@e1:/var/log# rbd diff spin1/testdis | awk '{ SUM += $2 } END {
> >>>>> print SUM/1024 " KB" }'
> >>>>> 819200 KB
> >>>>>
> >>>>> root@e1:/var/log# blkdiscard -o 0 -l 40960000 /dev/rbd28
> >>>>> root@e1:/var/log# rbd diff spin1/testdis | awk '{ SUM += $2 } END {
> >>>>> print SUM/1024 " KB" }'
> >>>>> 782336 KB
> >>>>
> >>>> Think about it in terms of underlying RADOS objects (4M by default).
> >>>> There are three cases:
> >>>>
> >>>>   discard range  | command
> >>>>   ---------------+---------
> >>>>   whole object   | delete
> >>>>   object's tail  | truncate
> >>>>   object's head  | zero
> >>>>
> >>>> Obviously, only delete and truncate free up space. In all of your
> >>>> examples except the last one, you are attempting to discard the head
> >>>> of the (first) object.
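The three cases in the table above can be sketched in shell. This is only an illustrative mock of the per-object decision, not actual rbd code; the `classify` helper is an assumption for the example, and offsets are bytes relative to the start of a single 4 MiB object:

```shell
# Mock of how a discard range maps onto one 4M RADOS object.
# Illustration only -- not rbd/OSD code.
OBJ_SIZE=$((4 << 20))   # 4 MiB default object size

classify() {  # classify <start> <end>  (end is exclusive, bytes)
    if [ "$1" -eq 0 ] && [ "$2" -ge "$OBJ_SIZE" ]; then
        echo delete     # whole object covered: object removed, space freed
    elif [ "$2" -ge "$OBJ_SIZE" ]; then
        echo truncate   # tail covered: object shrinks, space freed
    else
        echo zero       # head (or middle) covered: zeroed, NOT freed
    fi
}

classify 0 "$OBJ_SIZE"                    # -> delete
classify $((OBJ_SIZE - 512)) "$OBJ_SIZE"  # -> truncate (a single sector)
classify 0 4096000                        # -> zero (head only, as in the
                                          #    blkdiscard runs above)
```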
> >>>>
> >>>> You can free up as little as a sector, as long as it's the tail:
> >>>>
> >>>>   Offset  Length   Type
> >>>>   0       4194304  data
> >>>>
> >>>> # blkdiscard -o $(((4 << 20) - 512)) -l 512 /dev/rbd28
> >>>>
> >>>>   Offset  Length   Type
> >>>>   0       4193792  data
> >>>
> >>> Looks like ESXi is sending in each discard/unmap with a fixed
> >>> granularity of 8192 sectors, which is passed through verbatim by
> >>> SCST. There is a slight reduction in size per the rbd diff method,
> >>> but now I understand that an actual truncate only takes effect when
> >>> the discard happens to clip the tail of an object.
> >>>
> >>> So far, looking at
> >>> https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2057513
> >>>
> >>> ...the only variable we can control is the count of 8192-sector
> >>> chunks and not their size. Which means that most of the ESXi discard
> >>> commands will be disregarded by Ceph.
> >>>
> >>> Vlad, is 8192 sectors coming from ESXi, as in the debug:
> >>>
> >>> Aug  1 19:01:36 e1 kernel: [168220.570332] Discarding (start_sector
> >>> 1342099456, nr_sects 8192)
> >>
> >> Yes, correct. However, to make sure that VMware is not (erroneously)
> >> forced into this, you need to perform one more check:
> >>
> >> 1. Run cat /sys/block/rbd28/queue/discard*. Ceph should report the
> >>    correct granularity and alignment here (4M, I guess?)
> >
> > This seems to reflect the granularity (4194304), which matches the
> > 8192 sectors (8192 x 512 = 4194304). However, there is no alignment
> > value.
> >
> > Can discard_alignment be specified with RBD?
>
> It's exported as a read-only sysfs attribute, just like
> discard_granularity:
>
> # cat /sys/block/rbd0/discard_alignment
> 4194304

Is there a way to perhaps increase the discard granularity?

The way I see it, based on the discussion so far, here is why
discard/unmap is failing to work with VMware:

- RBD provides space in 4MB objects, which must be discarded entirely,
  or at least have their tails clipped, in order to free space
- SCST communicates to ESXi that the discard alignment is 4MB and the
  discard granularity is also 4MB
- ESXi's VMFS5 is aligned on 1MB, so 4MB discards never actually free
  anything

What if it were possible to set a 6MB discard granularity?

Thank you,
Alex

> Thanks,
>
>    Ilya

--
--
Alex Gorbachev
Storcium
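The mismatch described above comes down to arithmetic that can be checked in shell. This is a sketch: the 8192-sector chunk size and the 4 MiB object size come from the thread, while the sample of 16 offsets is an arbitrary choice for illustration:

```shell
# ESXi's UNMAP chunk matches RBD's reported granularity exactly:
SECTORS=8192
CHUNK=$((SECTORS * 512))
echo "$CHUNK"             # 4194304 bytes = 4 MiB

# But VMFS5 places data on 1 MiB boundaries, and a discard only frees
# space when it reaches the end of a 4 MiB object. Count how many
# 1 MiB-aligned offsets also land on a 4 MiB object boundary:
MB=$((1 << 20)); OBJ=$((4 << 20))
aligned=0; i=0
while [ "$i" -lt 16 ]; do
    [ $((i * MB % OBJ)) -eq 0 ] && aligned=$((aligned + 1))
    i=$((i + 1))
done
echo "$aligned of 16"     # 4 of 16: three quarters of the 1 MiB-aligned
                          # offsets miss an object boundary entirely
```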