List for cgit developers and users
 help / color / mirror / Atom feed
From: konstantin at linuxfoundation.org (Konstantin Ryabitsev)
Subject: cache-size implementation downsides
Date: Wed, 13 Jun 2018 15:02:42 -0400	[thread overview]
Message-ID: <20180613190241.GC11657@chatter> (raw)

Hi, all:

TL;DR: The way cache-size is implemented is pretty clever and efficient,
but has important downsides due to filename collisions.

Long:

The way caching is implemented in cgit, it uses the path of the URL
request (plus any query strings) as the cache key. That key is then
passed through a fast one-way hashing function to arrive to a
deterministic integer between 0 and the value of cache-size. That
integer is then used to generate the name of the file that will be
stored in the cache-root directory.

This is clever and fast, because this means cgit doesn't need to list
the contents of the cache-root directory to make sure there are no more
than cache-size number of files in it (directory listing is expensive).
However, this also means that if the value of cache-size is set
sufficiently low, there will inevitably be collisions. To make sure that
we don't serve the wrong cache, cgit stores the full key (path+query
string) inside the cache file at offset 0. When there is a cache file in
place matching the request, cgit makes sure that it is for the correct
key before serving it. If the keys differ, the old cache value is
thrown out and the new response is generated and cached.

Like I said, this is clever and fast because we get enforcement of the
maximum number of files basically for free. However, collisions have
important downsides on a busy site like git.kernel.org:

1. If two popular but expensive URLs have the same cache-file object,
they can negate a lot of cache savings. E.g. if a very expensive diff
has the same cache-file as another very expensive diff, and both are
alternately hit with sufficient frequency, it results in a lot of cache
misses and high load.

2. I have witnessed cache corruption due to collisions (which is
a bug in itself). One of our frontends was hit by a lot of agressive
crawling of snapshots that raised the load to 60+ (many, many gzip
processes). After we blackholed the bot, some of the cache objects for
non-snapshot URLs had trailing gzip junk in them, meaning that either
two instances were writing to the same file, or something else resulted
in cache corruption. This is probably a race condition somewhere in the
locking code.

3. If we raise the value of cache-size higher in hopes to avoid
collisions, we risk running out of disk space due to large snapshots
accumulating in cache-root, since nothing cleans them up.

#3 is our interim solution, and we use a cronjob [1] that runs
every 5 minutes to clean up old snapshots static URLs, attempting to
respect the cache-ttl values. However, I wonder if a more robust cache
backend implementation can help resolve some of these problems,
especially the high chance of collisions when cache-size is set
sufficiently low.

.. [1] https://gist.github.com/mricon/28dd36fa093f9cb0748f63756f0f0a8d

-K
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 228 bytes
Desc: not available
URL: <http://lists.zx2c4.com/pipermail/cgit/attachments/20180613/ce149cb0/attachment.asc>


             reply	other threads:[~2018-06-13 19:02 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-06-13 19:02 konstantin [this message]
2018-06-16 15:46 ` john
2018-06-19 23:02   ` john
2018-06-20  6:01   ` list
2018-06-20  7:44     ` john
2018-06-20  7:55       ` list

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180613190241.GC11657@chatter \
    --to=cgit@lists.zx2c4.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).