From mboxrd@z Thu Jan  1 00:00:00 1970
From: john at keeping.me.uk (John Keeping)
Date: Wed, 20 Jun 2018 00:02:59 +0100
Subject: cache-size implementation downsides
In-Reply-To: <20180616154621.GA1922@john.keeping.me.uk>
References: <20180613190241.GC11657@chatter>
 <20180616154621.GA1922@john.keeping.me.uk>
Message-ID: <20180619230259.GF1922@john.keeping.me.uk>

On Sat, Jun 16, 2018 at 04:46:21PM +0100, John Keeping wrote:
> On Wed, Jun 13, 2018 at 03:02:42PM -0400, Konstantin Ryabitsev wrote:
> > 2. I have witnessed cache corruption due to collisions (which is
> > a bug in itself). One of our frontends was hit by a lot of aggressive
> > crawling of snapshots that raised the load to 60+ (many, many gzip
> > processes). After we blackholed the bot, some of the cache objects for
> > non-snapshot URLs had trailing gzip junk in them, meaning that either
> > two instances were writing to the same file, or something else resulted
> > in cache corruption. This is probably a race condition somewhere in the
> > locking code.
> 
> I've had a look at this, and I think we might end up dropping our lock
> too early thanks to this code (in fill_slot()):
> 
> 	/* Restore stdout */
> 	if (dup2(tmp, STDOUT_FILENO) == -1) {
> 
> Before this line, STDOUT_FILENO refers to lock_fd which has a POSIX
> advisory record lock on the entire file.  However, the documentation for
> that says:
> 
>        *  If a process closes any file descriptor referring to a file, then all
>           of the process's locks on that file are released, regardless of the
>           file descriptor(s)  on  which  the  locks  were  obtained.  This is
>           bad: it means that a process can lose its locks on a file such as
>           /etc/passwd or  /etc/mtab  when for some reason a library function
>           decides to open, read, and close the same file.
> 
> I haven't verified this, but I suspect that dup'ing the original stdout
> over STDOUT_FILENO is equivalent to closing a file descriptor referring
> to our lock file.  And thus the lock is released at this point, which is
> before we rename the lock file over the cache file.
> 
> If that is correct, then there is a window during which a new process
> can open the lock file to write new content and successfully acquire the
> lock on that file even though it is still being used by another process.
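
The effect can be reproduced outside cgit.  The following is a minimal
sketch, not cgit code: the function names, the /tmp path template and
the fork()-based probe are all assumptions made for illustration.  It
takes a whole-file write lock, redirects stdout to the locked file the
way fill_slot() does, then restores stdout with dup2() and asks a child
process whether the lock is still held.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Take a write lock on the whole file (l_len == 0 means "to EOF"). */
int lock_whole_file(int fd)
{
	struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };

	return fcntl(fd, F_SETLK, &fl);
}

/*
 * Probe from a child process: record locks are not inherited across
 * fork(), so the child sees the parent's lock as a conflicting one.
 * Returns 1 if the file is locked, 0 if it is not, -1 on error.
 */
int is_locked(const char *path)
{
	int status;
	pid_t pid = fork();

	if (pid == -1)
		return -1;
	if (pid == 0) {
		struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
		int fd = open(path, O_RDWR);

		if (fd == -1 || fcntl(fd, F_GETLK, &fl) == -1)
			_exit(2);
		_exit(fl.l_type != F_UNLCK);
	}
	if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status) ||
	    WEXITSTATUS(status) > 1)
		return -1;
	return WEXITSTATUS(status);
}

/*
 * Hypothetical demo of the fill_slot() sequence.  Stores the lock state
 * seen while stdout points at the lock file in *before, and the state
 * seen after stdout has been restored in *after.  Returns 0 on success.
 */
int stdout_restore_demo(int *before, int *after)
{
	char path[] = "/tmp/lockdemo-XXXXXX";	/* assumed writable */
	int lock_fd, saved;

	assert(before && after);
	lock_fd = mkstemp(path);
	if (lock_fd == -1)
		return -1;

	fflush(stdout);
	saved = dup(STDOUT_FILENO);		/* preserve stdout */
	if (saved == -1 || lock_whole_file(lock_fd) == -1 ||
	    dup2(lock_fd, STDOUT_FILENO) == -1) {
		if (saved != -1)
			close(saved);
		close(lock_fd);
		unlink(path);
		return -1;
	}

	*before = is_locked(path);

	/* Restoring stdout implicitly closes an FD on the lock file. */
	dup2(saved, STDOUT_FILENO);
	*after = is_locked(path);

	close(saved);
	close(lock_fd);
	unlink(path);
	return 0;
}
```

Calling stdout_restore_demo() should report the file as locked before
the restoring dup2() and unlocked after it, even though lock_fd itself
is still open at that point.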

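I confirmed this behaviour with trace-cmd.  Before the patch, the lock
is removed (locks_remove_posix) ahead of the rename; with the patch
applied, the rename completes first and the lock is removed afterwards.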

Before:
            cgit-7291  : posix_lock_inode:     fl=0x0xffff88020faa6258 dev=0x8:0x14 ino=0x4e3e13 fl_next=0x(nil) fl_owner=0x0xffff88007a8f2680 fl_pid=7291 fl_flags=FL_POSIX fl_type=F_WRLCK fl_start=0 fl_end=9223372036854775807 ret=0
            cgit-7291  : fcntl_setlk:          fl=0x0xffff88020faa6258 dev=0x8:0x14 ino=0x4e3e13 fl_next=0x(nil) fl_owner=0x0xffff88007a8f2680 fl_pid=7291 fl_flags=FL_POSIX fl_type=F_WRLCK fl_start=0 fl_end=9223372036854775807 ret=0
            cgit-7291  : locks_get_lock_context: dev=0x8:0x14 ino=0x4e3e13 type=F_UNLCK ctx=0xffff8801beeb7930
            cgit-7291  : posix_lock_inode:     fl=0x0xffffc90003627da8 dev=0x8:0x14 ino=0x4e3e13 fl_next=0x(nil) fl_owner=0x0xffff88007a8f2680 fl_pid=7291 fl_flags=FL_POSIX|FL_CLOSE fl_type=F_UNLCK fl_start=0 fl_end=9223372036854775807 ret=0
            cgit-7291  : locks_remove_posix:   fl=0x0xffffc90003627da8 dev=0x8:0x14 ino=0x4e3e13 fl_next=0x(nil) fl_owner=0x0xffff88007a8f2680 fl_pid=7291 fl_flags=FL_POSIX|FL_CLOSE fl_type=F_UNLCK fl_start=0 fl_end=9223372036854775807 ret=0
            cgit-7291  : sys_enter_rename:     oldname: 0x559bd0bd5830, newname: 0x559bd0bd57d0
            cgit-7291  : sys_exit_rename:      0x0

After:
            cgit-7488  : posix_lock_inode:     fl=0x0xffff8802122c43e8 dev=0x8:0x14 ino=0x4e3e13 fl_next=0x(nil) fl_owner=0x0xffff8800310958c0 fl_pid=7488 fl_flags=FL_POSIX fl_type=F_WRLCK fl_start=0 fl_end=9223372036854775807 ret=0
            cgit-7488  : fcntl_setlk:          fl=0x0xffff8802122c43e8 dev=0x8:0x14 ino=0x4e3e13 fl_next=0x(nil) fl_owner=0x0xffff8800310958c0 fl_pid=7488 fl_flags=FL_POSIX fl_type=F_WRLCK fl_start=0 fl_end=9223372036854775807 ret=0
            cgit-7488  : sys_enter_rename:     oldname: 0x56512cd7f830, newname: 0x56512cd7f7d0
            cgit-7488  : sys_exit_rename:      0x0
            cgit-7488  : locks_get_lock_context: dev=0x8:0x14 ino=0x4e3e13 type=F_UNLCK ctx=0xffff8802144daa10
            cgit-7488  : posix_lock_inode:     fl=0x0xffffc900038d7da8 dev=0x8:0x14 ino=0x4e3e13 fl_next=0x0xffff880006f5b780 fl_owner=0x0xffff8800310958c0 fl_pid=7488 fl_flags=FL_POSIX|FL_CLOSE fl_type=F_UNLCK fl_start=0 fl_end=9223372036854775807 ret=0
            cgit-7488  : locks_remove_posix:   fl=0x0xffffc900038d7da8 dev=0x8:0x14 ino=0x4e3e13 fl_next=0x0xffff880006f5b780 fl_owner=0x0xffff8800310958c0 fl_pid=7488 fl_flags=FL_POSIX|FL_CLOSE fl_type=F_UNLCK fl_start=0 fl_end=9223372036854775807 ret=0


I'm planning to queue the patch below on jk/for-jason and send a PR in
the next day or two, but it would be nice to get a reviewed-by before I
do that.

> -- >8 --
> Subject: [PATCH] cache: close race window when unlocking slots
> 
> We use POSIX advisory record locks to control access to cache slots, but
> these have an unhelpful behaviour in that they are released when any
> file descriptor referencing the file is closed by this process.
> 
> Mostly this is okay, since we know we won't be opening the lock file
> anywhere else, but there is one place that it does matter: when we
> restore stdout we dup2() over a file descriptor referring to the file,
> thus closing that descriptor.
> 
> Since we restore stdout before unlocking the slot, this creates a window
> during which the slot content can be overwritten.  The fix is reasonably
> straightforward: simply restore stdout after unlocking the slot, but the
> diff is a bit bigger because this requires us to move the temporary
> stdout FD into struct cache_slot.
> 
> Signed-off-by: John Keeping <john at keeping.me.uk>
> ---
>  cache.c | 37 ++++++++++++++-----------------------
>  1 file changed, 14 insertions(+), 23 deletions(-)
> 
> diff --git a/cache.c b/cache.c
> index 0901e6e..2c70be7 100644
> --- a/cache.c
> +++ b/cache.c
> @@ -29,6 +29,7 @@ struct cache_slot {
>  	cache_fill_fn fn;
>  	int cache_fd;
>  	int lock_fd;
> +	int stdout_fd;
>  	const char *cache_name;
>  	const char *lock_name;
>  	int match;
> @@ -197,6 +198,13 @@ static int unlock_slot(struct cache_slot *slot, int replace_old_slot)
>  	else
>  		err = unlink(slot->lock_name);
>  
> +	/* Restore stdout and close the temporary FD. */
> +	if (slot->stdout_fd >= 0) {
> +		dup2(slot->stdout_fd, STDOUT_FILENO);
> +		close(slot->stdout_fd);
> +		slot->stdout_fd = -1;
> +	}
> +
>  	if (err)
>  		return errno;
>  
> @@ -208,42 +216,24 @@ static int unlock_slot(struct cache_slot *slot, int replace_old_slot)
>   */
>  static int fill_slot(struct cache_slot *slot)
>  {
> -	int tmp;
> -
>  	/* Preserve stdout */
> -	tmp = dup(STDOUT_FILENO);
> -	if (tmp == -1)
> +	slot->stdout_fd = dup(STDOUT_FILENO);
> +	if (slot->stdout_fd == -1)
>  		return errno;
>  
>  	/* Redirect stdout to lockfile */
> -	if (dup2(slot->lock_fd, STDOUT_FILENO) == -1) {
> -		close(tmp);
> +	if (dup2(slot->lock_fd, STDOUT_FILENO) == -1)
>  		return errno;
> -	}
>  
>  	/* Generate cache content */
>  	slot->fn();
>  
>  	/* Make sure any buffered data is flushed to the file */
> -	if (fflush(stdout)) {
> -		close(tmp);
> +	if (fflush(stdout))
>  		return errno;
> -	}
>  
>  	/* update stat info */
> -	if (fstat(slot->lock_fd, &slot->cache_st)) {
> -		close(tmp);
> -		return errno;
> -	}
> -
> -	/* Restore stdout */
> -	if (dup2(tmp, STDOUT_FILENO) == -1) {
> -		close(tmp);
> -		return errno;
> -	}
> -
> -	/* Close the temporary filedescriptor */
> -	if (close(tmp))
> +	if (fstat(slot->lock_fd, &slot->cache_st))
>  		return errno;
>  
>  	return 0;
> @@ -393,6 +383,7 @@ int cache_process(int size, const char *path, const char *key, int ttl,
>  	strbuf_addstr(&lockname, ".lock");
>  	slot.fn = fn;
>  	slot.ttl = ttl;
> +	slot.stdout_fd = -1;
>  	slot.cache_name = filename.buf;
>  	slot.lock_name = lockname.buf;
>  	slot.key = key;
> -- 
> 2.17.1