From mboxrd@z Thu Jan 1 00:00:00 1970
From: konstantin at linuxfoundation.org (Konstantin Ryabitsev)
Date: Wed, 13 Jun 2018 15:02:42 -0400
Subject: cache-size implementation downsides
Message-ID: <20180613190241.GC11657@chatter>

Hi, all:

TL;DR: The way cache-size is implemented is pretty clever and
efficient, but has important downsides due to filename collisions.

Long:

Cgit uses the path of the URL request (plus any query strings) as the
cache key. That key is passed through a fast one-way hashing function
to arrive at a deterministic integer between 0 and the value of
cache-size. That integer is then used to generate the name of the file
that will be stored in the cache-root directory.

This is clever and fast, because it means cgit doesn't need to list
the contents of the cache-root directory to make sure there are no
more than cache-size number of files in it (directory listing is
expensive). However, it also means that if the value of cache-size is
set sufficiently low, there will inevitably be collisions. To make
sure that we don't serve the wrong cache, cgit stores the full key
(path+query string) inside the cache file at offset 0. When there is a
cache file in place matching the request, cgit makes sure that it is
for the correct key before serving it. If the keys differ, the old
cache entry is thrown out and the new response is generated and
cached.

Like I said, this is clever and fast because we get enforcement of the
maximum number of files basically for free. However, collisions have
important downsides on a busy site like git.kernel.org:

1. If two popular but expensive URLs map to the same cache-file
   object, they can negate a lot of the cache savings. E.g. if a very
   expensive diff shares a cache file with another very expensive
   diff, and both are alternately hit with sufficient frequency, the
   result is a lot of cache misses and high load.

2. I have witnessed cache corruption due to collisions (which is a bug
   in itself). One of our frontends was hit by a lot of aggressive
   crawling of snapshots that raised the load to 60+ (many, many gzip
   processes). After we blackholed the bot, some of the cache objects
   for non-snapshot URLs had trailing gzip junk in them, meaning that
   either two instances were writing to the same file, or something
   else resulted in cache corruption. This is probably a race
   condition somewhere in the locking code.

3. If we raise the value of cache-size in hopes of avoiding
   collisions, we risk running out of disk space due to large
   snapshots accumulating in cache-root, since nothing cleans them up.

Our interim solution to #3 is a cronjob [1] that runs every 5 minutes
to clean up old snapshot/static URLs, attempting to respect the
cache-ttl values. However, I wonder if a more robust cache backend
implementation could help resolve some of these problems, especially
the high chance of collisions when cache-size is set sufficiently low.

.. [1] https://gist.github.com/mricon/28dd36fa093f9cb0748f63756f0f0a8d

-K
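
For illustration, here is a rough sketch of the slot-selection scheme
described in the message above. This is not cgit's actual code: the
hash function (FNV-1a), the constants, and all the names are
stand-ins. It only demonstrates how hashing the full key modulo
cache-size yields a fixed set of possible file names, and therefore
why two different keys can end up sharing a cache file.

    /*
     * Illustrative sketch only -- not cgit's actual implementation.
     * It mimics the scheme described above: hash the full key
     * (path + query string), reduce it modulo cache-size, and use
     * the result as the file name under cache-root.
     */
    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    #define CACHE_SIZE 1000              /* example cache-size value */
    #define CACHE_ROOT "/var/cache/cgit" /* example cache-root value */

    /* A fast one-way hash (FNV-1a), standing in for cgit's hash. */
    static uint64_t hash_key(const char *key)
    {
            uint64_t h = 0xcbf29ce484222325ULL;

            for (; *key; key++) {
                    h ^= (unsigned char)*key;
                    h *= 0x100000001b3ULL;
            }
            return h;
    }

    /* Map a request key to its slot file, e.g. "/var/cache/cgit/42". */
    static void slot_path(const char *key, char *buf, size_t len)
    {
            snprintf(buf, len, "%s/%" PRIu64, CACHE_ROOT,
                     hash_key(key) % CACHE_SIZE);
    }

    int main(void)
    {
            const char *key = "/linux.git/diff/?id=abc"; /* made-up key */
            char path[4096];

            /*
             * With only CACHE_SIZE possible file names, two different
             * keys can map to the same file; that is why the full key
             * is stored at offset 0 of the slot file and compared
             * before the cached body is served.
             */
            slot_path(key, path, sizeof(path));
            printf("%s -> %s\n", key, path);
            return 0;
    }

Note that nothing in this sketch ever enumerates cache-root, which is
why the file count stays bounded essentially for free, and also why
nothing ever evicts a large snapshot once requests for it stop coming.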