Re: [PATCH] au_proc_getmntent: optimize reading, retry
From: Kirill K. <kol...@gm...> - 2019-05-23 06:57:20
On Wed, 22 May 2019 at 21:15, <sf...@us...> wrote:

> Kirill Kolyshkin:
> > Can you tell me what is your reason to not retry reading by default?
> > The code has just checked that this is an aufs mount, so it should
> > definitely be present in /proc/mounts. Unless, of course, this mount
> > was unmounted by someone in between statfs() and reading. If you
> > have this exact case in mind (I can't think of anything else) and
> > don't want to retry because of efficiency, you can add another
> > statfs() after reading /proc/mounts and not finding the mount --
> > that way you can be sure that the mount is still there but it eluded
> > /proc/mounts.
>
> Yes, such a race was in my mind.
> In other words, it is hard to identify the reason why /proc/mounts
> doesn't show the entry. A problem of /proc/mounts, or someone else
> unmounted? The only way to make sure is to retry reading /proc/mounts.

Or do stat+statfs (exactly as you suggested below) to check that this
is indeed the aufs root directory (and then you still need to retry
reading /proc/mounts).

> Additionally, I guess (hope) such parallel mounts/unmounts are rare.
> And I wonder, is "2" the absolutely correct solution? "3" cannot
> happen? Never?

If you're asking about the number of retries, I chose 3 in my patch
just to be on the safe side. Let's assume the probability of not
finding a specific mount in one read is 0.01, or 1%; then the chance
that a retry misses it too is 0.01^2 = 0.0001, or 0.01%, and with three
reads it drops to 0.01^3, or 0.0001%. In practice I have never seen the
second retry fail in my tests, but maybe we can do more testing to
figure out what a good number of retries should be.

> Statfs(), you say, won't help, I am afraid. Even if it tells us that
> the dir is aufs, it is not proof of the aufs mountpoint. It can be a
> subdir of another aufs mount.
> An extra stat(2) call may help at this point. It will tell us the
> inode number, and if it is AUFS_ROOT_INO, then that path is the aufs
> mountpoint. But I wonder, do we really have to issue stat(2) and
> statfs(2) just to make sure the aufs mount is still there? Isn't it
> rather heavy and racy?

It is racy, though not very heavy, and I suggest doing it only after we
have tried to find the mount in /proc/mounts and failed, before
retrying. Reading /proc/mounts is super heavy compared to a couple of
syscalls, so it looks like a decent price to pay to avoid re-reading it.
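Roughly, the check I have in mind, sketched here in Go for brevity (the
real aufs-util code is C). This is untested; the package name is made
up, and 0x61756673 and 2 are what I believe AUFS_SUPER_MAGIC and
AUFS_ROOT_INO to be in the aufs headers, please correct me if I got
them wrong:

    package aufsutil // hypothetical

    import "syscall"

    const (
        aufsSuperMagic = 0x61756673 // "aufs" in ASCII, AUFS_SUPER_MAGIC
        aufsRootIno    = 2          // AUFS_ROOT_INO
    )

    // isAufsRoot reports whether path is (still) the root of an aufs
    // mount: statfs(2) proves the filesystem type, and stat(2) proves
    // it is the mountpoint itself, not a subdir of another aufs mount.
    func isAufsRoot(path string) bool {
        var sfs syscall.Statfs_t
        if err := syscall.Statfs(path, &sfs); err != nil ||
            sfs.Type != aufsSuperMagic {
            return false
        }
        var st syscall.Stat_t
        if err := syscall.Stat(path, &st); err != nil {
            return false
        }
        return st.Ino == aufsRootIno
    }

If this returns false after a failed /proc/mounts lookup, the mount is
really gone and we can give up right away; if it returns true, the
entry eluded /proc/mounts and a retry is justified.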
> > I have also taken a deeper look at that other error I mentioned
> > earlier. Found out it's a race in au_xino_create(). If the xino
> > mount option is provided (we use xino=/dev/shm/aufs.xino in Docker),
> > and multiple mounts (note: for different mount points) are performed
> > in parallel, one mount can reach this line of code:
> >
> > file = vfsub_filp_open(fpath, O_RDWR | O_CREAT | O_EXCL | O_LARGEFILE, 0666);
> >
> > while another mount has already created that file, but hasn't
> > unlinked it yet.
> >
> > As a result, we have errors like these in the kernel log:
> >
> > [2233986.956753] aufs au_xino_create:767:dockerd[17144]: open /dev/shm/aufs.xino(-17)
> > [2233988.732636] aufs au_xino_create:767:dockerd[17518]: open /dev/shm/aufs.xino(-17)
>
> Thank you very much for the report.
> Here -17 means the EEXIST "File exists" error.
> It is an expected behaviour (and I am glad to know it is working
> expectedly). As you might know, the default path of XINO files is the
> top dir of the first writable branch, and a writable branch is not
> sharable between multiple aufs mounts. So by default XINO files are
> dedicated to a single aufs mount. Not shared, so no conflict happens.

This is correct for read-write mounts, but what about read-only ones?
The doc says that if there is no writable branch, the path used will be
/tmp/.aufs.xino. Sure, the aufs kernel module creates and then removes
this file, so another mount will create another file (with the same
name), but I see a clear possibility of a race in this case (such as
multiple read-only mounts happening at the same time).

> > Currently, I am working around this unfortunate issue by calling
> > mount(2) under an exclusive lock, to make sure no two aufs mounts
> > (again, for different mount points) are performed in parallel, but
> > perhaps there is a better way?
> >
> > I am going to mitigate this race by adding a random suffix to the
> > xino file name; do you think it is a decent workaround?
>
> If your first writable branch is somewhere on /dev/shm, then you can
> remove the "xino=" option. In this case, the XINO files will be
> created under /dev/shm and not shared. Moreover, the "xino=" option is
> generally something like a last resort. If the filesystem of your
> first writable branch doesn't support XINO, or you want a little gain
> in the aufs internal XINO handling, you may want "xino=". Otherwise
> you can omit it.

My guess is that whoever wrote Docker's aufs support thought that using
tmpfs for the xino file is faster. Or maybe they saw advice (in some
old documentation, such as
http://aufs.sourceforge.net/aufs2/brsync/README.txt) to never put the
xino file on an SSD (I'm sure many Docker users use SSD drives).

> Of course adding a random/unique suffix is a good idea. If I were you,
> I'd use $$ in shell-script manner, like
>   mount -t aufs -obr=...,xino=/dev/shm/aufs.xino.$$ ...

My code is in Go, so I'm patching it this way:

- opts := "dio,xino=/dev/shm/aufs.xino"
+ rnd := strconv.FormatInt(int64(1e9+rand.Uint32()%1e9), 36)[1:5]
+ opts := "dio,xino=/dev/shm/aufs."+rnd

(rnd is some 4 random characters in the 0-9a-z range)
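For completeness, the same change as a self-contained snippet (the
function name and the main() wrapper are mine, not the actual Docker
code):

    package main

    import (
        "fmt"
        "math/rand"
        "strconv"
    )

    // aufsMountOpts returns mount options with a randomized xino path,
    // so parallel aufs mounts never race on the same /dev/shm file.
    // The global math/rand source is deterministic across process
    // restarts unless seeded, which is acceptable here: the kernel
    // unlinks the xino file right after creating it, so only mounts
    // running concurrently within one process could ever collide.
    func aufsMountOpts() string {
        // 1e9+n for n in [0,1e9) always formats to 6 base-36 digits;
        // [1:5] keeps 4 of them, giving 36^4 (~1.7M) possible suffixes.
        rnd := strconv.FormatInt(int64(1e9+rand.Uint32()%1e9), 36)[1:5]
        return "dio,xino=/dev/shm/aufs." + rnd
    }

    func main() {
        fmt.Println(aufsMountOpts()) // e.g. "dio,xino=/dev/shm/aufs.k3x9"
    }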