Re: [PATCH] au_proc_getmntent: optimize reading, retry
From: Kirill K. <kol...@gm...> - 2019-05-23 06:57:20
On Wed, 22 May 2019 at 21:15, <sf...@us...> wrote:

> Kirill Kolyshkin:
> > Can you tell me what is your reason to not retry reading by default?
> > The code has just checked that this is an aufs mount, so it should
> > definitely be present in /proc/mounts. Unless, of course, this mount
> > was unmounted by someone in between statfs() and reading. If you
> > have this exact case in mind (I can't think of anything else) and
> > don't want to retry because of efficiency, you can add another
> > statfs() after reading /proc/mounts and not finding the mount --
> > that way you can be sure that the mount is still there but it eluded
> > /proc/mounts.
>
> Yes, such a race was in my mind.
> In other words, it is hard to identify the reason why /proc/mounts
> doesn't show the entry. A problem of /proc/mounts, or someone else
> unmounted? The only way to make sure is to retry reading /proc/mounts.

Or do stat+statfs (exactly as you suggested below) to check that this
is indeed the aufs root directory (and then you still need to retry
reading /proc/mounts).

> Additionally, I guess (hope) such parallel mounts/unmounts are rare.
> And I wonder, is "2" the absolutely correct solution? "3" cannot
> happen? Never?

If you're asking about the number of retries, I chose 3 in my patch
just to be on the safe side. Let's assume the probability of not
finding a specific mount in one read is 0.01, or 1%; then the chance
that a retry misses it too is 0.01^2 = 0.0001, or 0.01%, and with three
reads it drops to 0.01^3, or 0.0001%. In practice I have never seen the
second retry fail in my tests, but maybe we can do more testing to
figure out what a good number of retries should be.

> Statfs(), you say, won't help, I am afraid. Even if it tells us that
> the dir is aufs, it is not proof of the aufs mountpoint. It can be a
> subdir of another aufs mount.
> An extra stat(2) call may help at this point. It will tell us the
> inode number, and if it is AUFS_ROOT_INO, then that path is the aufs
> mountpoint. But I wonder, do we really have to issue stat(2) and
> statfs(2) just to make sure the aufs mount is still there? Isn't it
> rather heavy and racy?

It is racy, though not very heavy, and I suggest doing it only after we
have tried to find the mount in /proc/mounts and failed, before
retrying. Reading /proc/mounts is super heavy compared to a couple of
syscalls, so it looks like a decent price to pay to avoid re-reading it.
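Roughly, the check I have in mind, sketched here in Go for brevity (the
real aufs-util code is C). This is untested; the package name is made
up, and 0x61756673 and 2 are what I believe AUFS_SUPER_MAGIC and
AUFS_ROOT_INO to be in the aufs headers, please correct me if I got
them wrong:

    package aufsutil // hypothetical

    import "syscall"

    const (
        aufsSuperMagic = 0x61756673 // "aufs" in ASCII, AUFS_SUPER_MAGIC
        aufsRootIno    = 2          // AUFS_ROOT_INO
    )

    // isAufsRoot reports whether path is (still) the root of an aufs
    // mount: statfs(2) proves the filesystem type, and stat(2) proves
    // it is the mountpoint itself, not a subdir of another aufs mount.
    func isAufsRoot(path string) bool {
        var sfs syscall.Statfs_t
        if err := syscall.Statfs(path, &sfs); err != nil ||
            sfs.Type != aufsSuperMagic {
            return false
        }
        var st syscall.Stat_t
        if err := syscall.Stat(path, &st); err != nil {
            return false
        }
        return st.Ino == aufsRootIno
    }

If this returns false after a failed /proc/mounts lookup, the mount is
really gone and we can give up right away; if it returns true, the
entry eluded /proc/mounts and a retry is justified.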
> > I have also taken a deeper look at that other error I mentioned
> > earlier. Found out it's a race in au_xino_create(). If the xino
> > mount option is provided (we use xino=/dev/shm/aufs.xino in Docker),
> > and multiple mounts (note: for different mount points) are performed
> > in parallel, one mount can reach this line of code:
> >
> > file = vfsub_filp_open(fpath, O_RDWR | O_CREAT | O_EXCL | O_LARGEFILE, 0666);
> >
> > while another mount has already created that file, but hasn't
> > unlinked it yet.
> >
> > As a result, we have errors like these in the kernel log:
> >
> > [2233986.956753] aufs au_xino_create:767:dockerd[17144]: open /dev/shm/aufs.xino(-17)
> > [2233988.732636] aufs au_xino_create:767:dockerd[17518]: open /dev/shm/aufs.xino(-17)
>
> Thank you very much for the report.
> Here -17 means the EEXIST "File exists" error.
> It is an expected behaviour (and I am glad to know it is working
> expectedly). As you might know, the default path of XINO files is the
> top dir of the first writable branch, and a writable branch is not
> sharable between multiple aufs mounts. So by default XINO files are
> dedicated to a single aufs mount. Not shared, so no conflict happens.

This is correct for read-write mounts, but what about read-only ones?
The doc says that if there is no writable branch, the path used will be
/tmp/.aufs.xino. Sure, the aufs kernel module creates and then removes
this file, so another mount will create another file (with the same
name), but I see a clear possibility of a race in this case (such as
multiple read-only mounts happening at the same time).

> > Currently, I am working around this unfortunate issue by calling
> > mount(2) under an exclusive lock, to make sure no two aufs mounts
> > (again, for different mount points) are performed in parallel, but
> > perhaps there is a better way?
> >
> > I am going to mitigate this race by adding a random suffix to the
> > xino file name; do you think it is a decent workaround?
>
> If your first writable branch is somewhere on /dev/shm, then you can
> remove the "xino=" option. In this case, the XINO files will be
> created under /dev/shm and not shared. Moreover, the "xino=" option is
> generally something like a last resort. If the filesystem of your
> first writable branch doesn't support XINO, or you want a little gain
> in the aufs internal XINO handling, you may want "xino=". Otherwise
> you can omit it.

My guess is that whoever wrote Docker's aufs support thought that using
tmpfs for the xino file is faster. Or maybe they saw advice (in some
old documentation, such as
http://aufs.sourceforge.net/aufs2/brsync/README.txt) to never put the
xino file on an SSD (I'm sure many Docker users use SSD drives).

> Of course adding a random/unique suffix is a good idea. If I were you,
> I'd use $$ in shell-script manner, like
>   mount -t aufs -obr=...,xino=/dev/shm/aufs.xino.$$ ...

My code is in Go, so I'm patching it this way:

- opts := "dio,xino=/dev/shm/aufs.xino"
+ rnd := strconv.FormatInt(int64(1e9+rand.Uint32()%1e9), 36)[1:5]
+ opts := "dio,xino=/dev/shm/aufs."+rnd

(rnd is some 4 random characters in the 0-9a-z range)
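For completeness, the same change as a self-contained snippet (the
function name and the main() wrapper are mine, not the actual Docker
code):

    package main

    import (
        "fmt"
        "math/rand"
        "strconv"
    )

    // aufsMountOpts returns mount options with a randomized xino path,
    // so parallel aufs mounts never race on the same /dev/shm file.
    // The global math/rand source is deterministic across process
    // restarts unless seeded, which is acceptable here: the kernel
    // unlinks the xino file right after creating it, so only mounts
    // running concurrently within one process could ever collide.
    func aufsMountOpts() string {
        // 1e9+n for n in [0,1e9) always formats to 6 base-36 digits;
        // [1:5] keeps 4 of them, giving 36^4 (~1.7M) possible suffixes.
        rnd := strconv.FormatInt(int64(1e9+rand.Uint32()%1e9), 36)[1:5]
        return "dio,xino=/dev/shm/aufs." + rnd
    }

    func main() {
        fmt.Println(aufsMountOpts()) // e.g. "dio,xino=/dev/shm/aufs.k3x9"
    }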