To: 9fans@cse.psu.edu
Subject: Re: [9fans] fossil/venti/manber
From: "Russ Cox"
Date: Mon, 28 Apr 2003 17:55:34 -0400

The reference in the Venti paper was only to the use of
content-dependent block boundaries, not the subsequent file
similarity search.  Venti deals with that by coalescing duplicate
writes instead.

These are some notes I wrote for myself about whether it would make
sense in Fossil and also in a Venti-based backup system I wrote.  The
backup system stores entire file system images (FFS, in the one case
I have running).  I think the notes happen to answer your questions,
mainly because I think they clear up your misunderstandings about how
applying Manber's idea would actually work.  If you have questions
after reading the notes, I'd be happy to address them directly, but I
think we're not on the same page at the moment.

Russ

Adding LBFS (a.k.a. Manber) chunking to Venti

* Content-dependent boundaries

Manber's scheme for detecting similar files [Manber] introduced the
idea of using content-dependent block boundaries to break a file into
chunks.  Then you hash the chunks to come up with short names for
them and look for files sharing chunks.  The big idea of the scheme
is the content-dependent boundaries.  In most systems, block
boundaries are offset-dependent (usually you insert a boundary at
multiples of the constant block size); in such a system, inserting or
removing a single byte at the beginning of the file changes every
block in the file.  In Manber's scheme, the effects of inserting or
removing a single byte are felt only by the surrounding blocks.
Figure 1 of [LBFS] illustrates this well.

LBFS is a low-bandwidth file system that has recently repopularized
Manber's idea.  It uses content-dependent chunking combined with
SHA1-naming of blocks to avoid sending a chunk over the wire more
than once.  I'm going to refer to ``LBFS chunking'' for the rest of
these notes, because the details I have in mind align more closely
with the LBFS scheme than with Manber's, though in spirit the two are
the same.

* Venti support

In theory, Venti is just a SHA1-addressed block store, so it would be
up to higher-level applications like Vac and Fossil to chunk the
stream however they want.  In practice, these applications provide
extra block type information to Venti to allow general-purpose
applications like venti/copy to walk around in byte streams.
Specifically, the blocks form a Merkle hash tree as illustrated at
the top of page 2 of [Fossil].  The data stream is the bottom row of
blocks.  The internal blocks in the tree are just lists of pointers
to blocks in the next level of the tree.

Because block boundaries are currently offset-dependent, seeking to
an arbitrary offset in the stream at the bottom of the hash tree is
easy -- the sequence of pointers to follow can be computed before
even beginning to walk the tree.  If we were going to use LBFS
chunking with Venti, implementing fast arbitrary seeks in the stream
would require changing the format of the internal blocks.  We'd have
to make them lists of (score, size) pairs, where size gives the
number of bytes in the stream represented by the tree rooted at
score.  The pointers would now be 26 or 28 bytes instead of 20.
These new internal blocks could be VtManberType0, etc.
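
To make the seek question concrete, here is a rough sketch of what a
lookup at an arbitrary offset in a (score, size) tree might look
like.  None of this is real Venti code -- the struct layout, the
readblock helper, and the names are made up for illustration -- but
it shows the shape of the walk: instead of computing the whole path
up front from the offset, you subtract subtree sizes as you descend.

#include <stddef.h>
#include <stdint.h>

typedef struct Ptr Ptr;
struct Ptr {
	uint8_t		score[20];	/* SHA1 score of the child block */
	uint64_t	size;		/* bytes of stream under that child */
};

typedef struct Block Block;
struct Block {
	int		leaf;		/* 1 if this block is a data chunk */
	Ptr		*ptr;		/* internal block: child pointers */
	int		nptr;
	uint8_t		*data;		/* leaf block: chunk data */
	uint64_t	ndata;
};

Block	*readblock(uint8_t score[20]);	/* hypothetical: fetch and parse one block */

/* return the chunk containing stream offset off, and the offset within it */
Block*
seekchunk(uint8_t root[20], uint64_t off, uint64_t *chunkoff)
{
	Block *b;
	int i;

	b = readblock(root);
	while(b != NULL && !b->leaf){
		for(i = 0; i < b->nptr; i++){
			if(off < b->ptr[i].size)
				break;
			off -= b->ptr[i].size;
		}
		if(i == b->nptr)
			return NULL;	/* offset is past the end of the stream */
		b = readblock(b->ptr[i].score);
	}
	if(b != NULL)
		*chunkoff = off;
	return b;
}

With today's fixed-size blocks, the loop over b->ptr collapses to a
division, which is why the whole path can be computed before the walk
starts.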
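
As for picking the boundaries themselves, here is a rough sketch of
the sort of loop involved.  It is not the LBFS code: LBFS uses Rabin
fingerprints over a 48-byte window, and a simpler multiplicative
rolling hash stands in for that below; the constants are made-up
values in the LBFS ballpark (2kB minimum, 8kB expected, 64kB maximum
chunk sizes).

#include <stddef.h>
#include <stdint.h>

enum {
	Window	= 48,		/* bytes of context that determine a boundary */
	MinChunk	= 2*1024,	/* never cut a chunk shorter than this */
	AvgChunk	= 8*1024,	/* cut when the low 13 bits of the hash match */
	MaxChunk	= 64*1024,	/* always cut by this point */
	Magic	= AvgChunk - 1,	/* the boundary pattern for those bits */
};

#define Prime	16777619ULL	/* any odd multiplier will do for a sketch */

/* call emit(p, len) once for each chunk of buf[0..n) */
void
chunkstream(uint8_t *buf, size_t n, void (*emit)(uint8_t*, size_t))
{
	uint64_t hash, pow;
	size_t i, start;
	int j;

	pow = 1;			/* Prime^Window, to roll the oldest byte out */
	for(j = 0; j < Window; j++)
		pow *= Prime;

	hash = 0;
	start = 0;
	for(i = 0; i < n; i++){
		hash = hash*Prime + buf[i];
		if(i >= start+Window)
			hash -= pow*buf[i-Window];
		if(i+1 - start < MinChunk)
			continue;
		if((hash & (AvgChunk-1)) == Magic || i+1 - start == MaxChunk){
			emit(buf+start, i+1 - start);
			start = i+1;
			hash = 0;
		}
	}
	if(start < n)
		emit(buf+start, n - start);	/* whatever is left is the last chunk */
}

The minimum and maximum sizes just keep chunks from getting
pathologically small or large when the hash is unlucky; the
insert/delete behavior comes entirely from the boundary test
depending only on the last Window bytes of data.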
Because Venti ignores repeated writes of a block, the space
coalescing happens for free in the Venti+Manber scheme -- the Venti
mechanisms for ignoring repeated writes handle ignoring common blocks
too.

* Who should use it?

Applications could choose whether to use the old offset-dependent
block scheme or the new content-dependent blocks, as appropriate.
But what does ``as appropriate'' mean?  In general the scheme works
well on streams where insert and delete are common operations.  LBFS
[LBFS] was a big win because they were dealing with big files (both
text and binary) that had this property.

Plan 9 source files tend to be quite small already.  The median
source file size in the Plan 9 kernel sources right now is under
16kB, so LBFS chunking won't help -- we'll still be changing two
blocks for each operation.  Object and binary files might have the
property, though I'm not sure.  Plan 9's binaries are statically
linked, so deleting one byte of code in the middle of the file will
change all calls and jumps across that byte.  The dynamic linking
used in FreeBSD (where LBFS was built) might avoid this problem.  The
median Plan 9 kernel object file is still only around 16kB.  Troff
and TeX documents, which tend to be big, will probably work well with
this, but they change less often.  It looks like LBFS-style chunking
might be useful for some files, but perhaps not for enough of them to
make it worth implementing.

There's also the question of when LBFS chunking might really hurt.
Most databases (and also Fossil's directory streams) work in terms of
power-of-two-sized blocks, under the assumption that a
power-of-two-sized block will map to some integral number of disk
blocks.  If we do LBFS chunking under such a representation, then
each block change also affects the chunk to its left and right, a
factor of three when the average LBFS chunk size is close to the
application block size.  That could be a big lose.  Then again, if
only a little is changing, then only three littles change in the LBFS
version, and if everything is changing, then everything changes in
the LBFS version.  It only really is a big deal when something like
every third block is changing, and how often does that really happen?

If we added LBFS chunking to libvac, then it might make sense to have
Fossil use it for file data but not directories.  In general, it
could only help.  In some cases, though, it seems like it could hurt
a lot, and unfortunately those cases are also the ones that tend to
involve lots of data and be more sensitive to performance.  We'd have
to do the experiment, perhaps with seanq's traces, to see how much of
a win it would be for the little guys, and then separately figure out
how much of a lose it would be for the big guys.

In the backup system, it's almost certainly a lose, unless FFS is
managing to lay out files contiguously lots of the time.  I kind of
doubt this, but don't have any hard evidence.  If it doesn't, there's
no benefit (the file systems use 8kB or 16kB blocks, so you don't get
any chunking benefits inside the blocks), and the interference with
the surrounding blocks will triple the incremental cost of backups.

* Links

[Manber] http://citeseer.nj.nec.com/manber94finding.html
[LBFS] http://citeseer.nj.nec.com/muthitacharoen01lowbandwidth.html
[Fossil] http://plan9.bell-labs.com/sys/doc/fossil.pdf