To: 9fans@cse.psu.edu
Subject: Re: [9fans] fossil/venti/manber
From: "Russ Cox"
Date: Mon, 28 Apr 2003 17:55:34 -0400

The reference in the Venti paper was only to the use of
content-dependent block boundaries, not the subsequent file
similarity search.  Venti deals with that by coalescing duplicate
writes instead.

These are some notes I wrote for myself about whether it would make
sense in Fossil and also in a Venti-based backup system I wrote.  The
backup system stores entire file system images (FFS, in the one case
I have running).  I think the notes happen to answer your questions,
mainly because I think they clear up your misunderstandings about how
applying Manber's idea would actually work.  If you have questions
after reading the notes, I'd be happy to address them directly, but I
think we're not on the same page at the moment.

Russ

Adding LBFS (a.k.a. Manber) chunking to Venti

* Content-dependent boundaries

Manber's scheme for detecting similar files [Manber] introduced the
idea of using content-dependent block boundaries to break a file into
chunks.  Then you hash the chunks to come up with short names for
them and look for files sharing chunks.  The big idea of the scheme
is the content-dependent boundaries.  In most systems, block
boundaries are offset-dependent (usually you insert a boundary at
multiples of the constant block size); in such a system, inserting or
removing a single byte at the beginning of the file changes every
block in the file.  In Manber's scheme, the effects of inserting or
removing a single byte are felt only by the surrounding blocks.
Figure 1 of [LBFS] illustrates this well.

LBFS is a low-bandwidth file system that has recently repopularized
Manber's idea.  It uses content-dependent chunking combined with
SHA1-naming of blocks to avoid sending a chunk over the wire more
than once.  I'm going to refer to ``LBFS chunking'' for the rest of
these notes, because the details I have in mind align more closely
with the LBFS scheme than with Manber's, though in spirit the two are
the same.

* Venti support

In theory, Venti is just a SHA1-addressed block store, so it would be
up to higher-level applications like Vac and Fossil to chunk the
stream however they want.  In practice, these applications provide
extra block type information to Venti to allow general-purpose
applications like venti/copy to walk around in byte streams.
Specifically, the blocks form a Merkle hash tree as illustrated at
the top of page 2 of [Fossil].  The data stream is the bottom row of
blocks.  The internal blocks in the tree are just lists of pointers
to blocks in the next level of the tree.

Because block boundaries are currently offset-dependent, seeking to
an arbitrary offset in the stream at the bottom of the hash tree is
easy -- the sequence of pointers to follow can be computed before
even beginning to walk the tree.  If we were going to use LBFS
chunking with Venti, implementing fast arbitrary seeks in the stream
would require changing the format of the internal blocks.  We'd have
to make them lists of (score, size) pairs, where size gives the
number of bytes in the stream represented by the tree rooted at
score.  The pointers would now be 26 or 28 bytes instead of 20.
These new internal blocks could be VtManberType0, etc.
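
To make the seek question concrete, here is a rough sketch of what a
lookup at an arbitrary offset in a (score, size) tree might look
like.  None of this is real Venti code -- the struct layout, the
readblock helper, and the names are made up for illustration -- but
it shows the shape of the walk: instead of computing the whole path
up front from the offset, you subtract subtree sizes as you descend.

#include <stddef.h>
#include <stdint.h>

typedef struct Ptr Ptr;
struct Ptr {
	uint8_t		score[20];	/* SHA1 score of the child block */
	uint64_t	size;		/* bytes of stream under that child */
};

typedef struct Block Block;
struct Block {
	int		leaf;		/* 1 if this block is a data chunk */
	Ptr		*ptr;		/* internal block: child pointers */
	int		nptr;
	uint8_t		*data;		/* leaf block: chunk data */
	uint64_t	ndata;
};

Block	*readblock(uint8_t score[20]);	/* hypothetical: fetch and parse one block */

/* return the chunk containing stream offset off, and the offset within it */
Block*
seekchunk(uint8_t root[20], uint64_t off, uint64_t *chunkoff)
{
	Block *b;
	int i;

	b = readblock(root);
	while(b != NULL && !b->leaf){
		for(i = 0; i < b->nptr; i++){
			if(off < b->ptr[i].size)
				break;
			off -= b->ptr[i].size;
		}
		if(i == b->nptr)
			return NULL;	/* offset is past the end of the stream */
		b = readblock(b->ptr[i].score);
	}
	if(b != NULL)
		*chunkoff = off;
	return b;
}

With today's fixed-size blocks, the loop over b->ptr collapses to a
division, which is why the whole path can be computed before the walk
starts.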
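
As for picking the boundaries themselves, here is a rough sketch of
the sort of loop involved.  It is not the LBFS code: LBFS uses Rabin
fingerprints over a 48-byte window, and a simpler multiplicative
rolling hash stands in for that below; the constants are made-up
values in the LBFS ballpark (2kB minimum, 8kB expected, 64kB maximum
chunk sizes).

#include <stddef.h>
#include <stdint.h>

enum {
	Window	= 48,		/* bytes of context that determine a boundary */
	MinChunk	= 2*1024,	/* never cut a chunk shorter than this */
	AvgChunk	= 8*1024,	/* cut when the low 13 bits of the hash match */
	MaxChunk	= 64*1024,	/* always cut by this point */
	Magic	= AvgChunk - 1,	/* the boundary pattern for those bits */
};

#define Prime	16777619ULL	/* any odd multiplier will do for a sketch */

/* call emit(p, len) once for each chunk of buf[0..n) */
void
chunkstream(uint8_t *buf, size_t n, void (*emit)(uint8_t*, size_t))
{
	uint64_t hash, pow;
	size_t i, start;
	int j;

	pow = 1;			/* Prime^Window, to roll the oldest byte out */
	for(j = 0; j < Window; j++)
		pow *= Prime;

	hash = 0;
	start = 0;
	for(i = 0; i < n; i++){
		hash = hash*Prime + buf[i];
		if(i >= start+Window)
			hash -= pow*buf[i-Window];
		if(i+1 - start < MinChunk)
			continue;
		if((hash & (AvgChunk-1)) == Magic || i+1 - start == MaxChunk){
			emit(buf+start, i+1 - start);
			start = i+1;
			hash = 0;
		}
	}
	if(start < n)
		emit(buf+start, n - start);	/* whatever is left is the last chunk */
}

The minimum and maximum sizes just keep chunks from getting
pathologically small or large when the hash is unlucky; the
insert/delete behavior comes entirely from the boundary test
depending only on the last Window bytes of data.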
Because Venti ignores repeated writes of a block, the space
coalescing happens for free in the Venti+Manber scheme -- the Venti
mechanisms for ignoring repeated writes handle ignoring common blocks
too.

* Who should use it?

Applications could choose whether to use the old offset-dependent
block scheme or the new content-dependent blocks, as appropriate.
But what does ``as appropriate'' mean?  In general the scheme works
well on streams where insert and delete are common operations.  LBFS
[LBFS] was a big win because they were dealing with big files (both
text and binary) that had this property.

Plan 9 source files tend to be quite small already.  The median
source file size in the Plan 9 kernel sources right now is under
16kB, so LBFS chunking won't help -- we'll still be changing two
blocks for each operation.  Object and binary files might have the
property, though I'm not sure.  Plan 9's binaries are statically
linked, so deleting one byte of code in the middle of the file will
change all calls and jumps across that byte.  The dynamic linking
used in FreeBSD (where LBFS was built) might avoid this problem.  The
median Plan 9 kernel object file is still only around 16kB.  Troff
and TeX documents, which tend to be big, will probably work well with
this, but they change less often.  It looks like LBFS-style chunking
might be useful for some files, but perhaps not for enough of them to
make it worth implementing.

There's also the question of when LBFS chunking might really hurt.
Most databases (and also Fossil's directory streams) work in terms of
power-of-two-sized blocks, under the assumption that a
power-of-two-sized block will map to some integral number of disk
blocks.  If we do LBFS chunking under such a representation, then
each block change also affects the chunk to its left and right, a
factor of three when the average LBFS chunk size is close to the
application block size.  That could be a big lose.  Then again, if
only a little is changing, then only three littles change in the LBFS
version, and if everything is changing, then everything changes in
the LBFS version.  It only really is a big deal when something like
every third block is changing, and how often does that really happen?

If we added LBFS chunking to libvac, then it might make sense to have
Fossil use it for file data but not directories.  In general, it
could only help.  In some cases, though, it seems like it could hurt
a lot, and unfortunately those cases are also the ones that tend to
involve lots of data and be more sensitive to performance.  We'd have
to do the experiment, perhaps with seanq's traces, to see how much of
a win it would be for the little guys, and then separately figure out
how much of a lose it would be for the big guys.

In the backup system, it's almost certainly a lose, unless FFS is
managing to lay out files contiguously lots of the time.  I kind of
doubt this, but don't have any hard evidence.  If it doesn't, there's
no benefit (the file systems use 8kB or 16kB blocks, so you don't get
any chunking benefits inside the blocks), and the interference with
the surrounding blocks will triple the incremental cost of backups.

* Links

[Manber] http://citeseer.nj.nec.com/manber94finding.html
[LBFS] http://citeseer.nj.nec.com/muthitacharoen01lowbandwidth.html
[Fossil] http://plan9.bell-labs.com/sys/doc/fossil.pdf