9fans - fans of the OS Plan 9 from Bell Labs
* [9fans] fossil/venti/manber
From: FODEMESI Gergely @ 2003-04-28 20:17 UTC
  To: plan9 mailing list

Hi,

In the Venti paper there is a reference to Manber's algorithm as a
possible future development for Venti/Fossil (Udi Manber: Finding
Similar Files in a Large File System).

Has anyone given this possible development a second thought?

I'd like to explore the possibility of using this algorithm with
Venti.  Could somebody correct me if any of the following comments
are wrong?

1. Anchors would be needed to synchronize to block boundaries.

2. In order to detect possibly similar bit-streams, Venti must know
more about the meta-information on these bit-streams
(files/directories).  By this I mean the Venti format would have to
be extended with meta-information about files.

3. Venti would have to implement a method of generating candidate
anchors linking possibly similar bit-streams to "new" (i.e. freshly
stored) bit-streams.  This should probably be done in parallel with
storing new blocks.  "Lazy anchoring?"

4. Except for databases with dynamically changing sizes (are there any?),
what kind of bit-streams could such a method be used for?

5. Depending on the answers to 4, could anybody imagine changing the
Venti format in order to provide such a seemingly marginally useful
feature?  See Russ's comment on possibly never changing the Venti
format.

 thanks for listening: gergo



* Re: [9fans] fossil/venti/manber
From: Russ Cox @ 2003-04-28 21:55 UTC
  To: 9fans

The reference in the Venti paper was only to the use of
content-dependent block boundaries, not the subsequent
file similarity search.  Venti deals with that by coalescing
duplicate writes instead.  These are some notes I wrote for
myself about whether it would make sense in Fossil and also
in a Venti-based backup system I wrote.  The backup system
stores entire file system images (FFS, in the one case I have
running).

I think the notes happen to answer your questions, mainly
because I think they clear up your misunderstandings about
how applying Manber's idea would actually work.
If you have questions after reading the notes, I'd be happy
to address them directly, but I think we're not on the same
page at the moment.

Russ


Adding LBFS (a.k.a.  Manber) chunking to Venti

* Content-dependent boundaries

Manber's scheme for detecting similar files [Manber] introduced the
idea of using content-dependent block boundaries to break a file into
chunks.  Then you hash the chunks to come up with short names for them
and look for files sharing chunks.

The big idea of the scheme is the content-dependent boundaries.  In
most systems, block boundaries are offset-dependent (usually you
insert a boundary at multiples of the constant block size); in such a
system, inserting or removing a single byte at the beginning of the
file changes every block in the file.  In Manber's scheme, the effects
of inserting or removing a single byte are felt only by the
surrounding blocks.  Figure 1 of [LBFS] illustrates this well.

LBFS is a low-bandwidth file system that has recently repopularized
Manber's idea.  It uses content-dependent chunking combined with
SHA1-naming of blocks to avoid sending a chunk over the wire more than
once.  I'm going to refer to ``LBFS chunking'' for the rest of this
because the details I have in mind align more closely with the LBFS
scheme than with Manber's, though in spirit the two are the same.
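
To make the mechanics concrete, here is a minimal sketch of
content-defined chunking.  The window size, mask, and cut condition
are illustrative assumptions, and the simple polynomial hash below
stands in for the Rabin fingerprint LBFS actually uses.

	#include <stdio.h>

	enum {
		Window   = 48,		/* sliding window, bytes (assumption) */
		Maskbits = 13,		/* ~8kB average chunk */
		Minchunk = 2048,
		Maxchunk = 65536,
	};

	/*
	 * Print chunk boundaries (as byte offsets) for the data on stdin.
	 * A boundary is declared where the low Maskbits bits of a rolling
	 * hash over the last Window bytes equal a fixed value, so an
	 * insert or delete only disturbs the chunks around it.
	 */
	int
	main(void)
	{
		unsigned char win[Window];
		unsigned long long h, pow, mask;
		long i, off, start, len;
		int c;

		pow = 1;
		for(i = 0; i < Window; i++)
			pow *= 31;
		mask = (1ULL<<Maskbits) - 1;

		h = 0;
		off = 0;
		start = 0;
		while((c = getchar()) != EOF){
			h = h*31 + c;
			if(off >= Window)
				h -= pow * win[off%Window];	/* drop byte leaving the window */
			win[off%Window] = c;
			off++;
			len = off - start;
			if((len >= Minchunk && (h&mask) == mask) || len >= Maxchunk){
				printf("chunk %ld..%ld\n", start, off);
				start = off;
			}
		}
		if(off > start)
			printf("chunk %ld..%ld\n", start, off);
		return 0;
	}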


* Venti support

In theory, Venti is just a SHA1-addressed block store, so it would be
up to the higher level applications like Vac and Fossil to chunk the
stream how they want.  In practice, these applications provide extra
block type information to Venti to allow general-purpose applications
like venti/copy to walk around in byte streams.

Specifically, the blocks form a Merkle hash tree as illustrated at the
top of page 2 of [Fossil].  The data stream is the bottom row of
blocks.  The internal blocks in the tree are just lists of pointers to
blocks in the next level of the tree.  Because block boundaries are
currently offset-dependent, seeking to an arbitrary offset in the
stream at the bottom of the hash tree is easy -- the sequence of
pointers to follow can be computed before even beginning to walk the
tree.
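
As a rough sketch of that arithmetic (the function and parameter
names are made up for illustration, not taken from the Venti
sources), computing the pointer to follow at each level needs no
block reads at all:

	#include <stdio.h>

	enum { Scoresize = 20 };	/* SHA1 score, bytes */

	/*
	 * With offset-dependent (fixed-size) blocks, the path from the
	 * root of the hash tree to the data block holding byte `offset'
	 * is pure arithmetic.  dsize is the data block size, psize the
	 * internal (pointer) block size, depth the number of internal
	 * levels (up to 16 in this sketch).
	 */
	void
	path(unsigned long long offset, int dsize, int psize, int depth)
	{
		unsigned long long blockno;
		int i, fanout, idx[16];

		fanout = psize / Scoresize;	/* pointers per internal block */
		blockno = offset / dsize;	/* which leaf holds the byte */
		for(i = depth-1; i >= 0; i--){
			idx[i] = blockno % fanout;
			blockno /= fanout;
		}
		for(i = 0; i < depth; i++)
			printf("level %d: follow pointer %d\n", i, idx[i]);
		printf("byte %llu within the leaf\n", offset % (unsigned long long)dsize);
	}

	int
	main(void)
	{
		/* e.g. byte 10,000,000 of a stream with 8kB blocks, two internal levels */
		path(10000000ULL, 8192, 8192, 2);
		return 0;
	}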

If we were going to use LBFS chunking with Venti, implementing fast
arbitrary seeks in the stream would require changing the format of the
internal blocks.  We'd have to make them lists of (score, size)
pairs, where size gives the number of bytes in the stream represented
by the tree rooted at score.  The pointers would now be 26 or 28 bytes
instead of 20.  These new internal blocks could be VtManberType0, etc.
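
Roughly, the new entries and the seek they imply might look like
this.  VtManberEntry and manberseek are assumed names for
illustration, and the real encoding would pack the size into 6 or 8
bytes on disk rather than store a C struct directly:

	#include <stdio.h>

	enum { Scoresize = 20 };

	typedef struct VtManberEntry VtManberEntry;
	struct VtManberEntry {
		unsigned char		score[Scoresize];	/* SHA1 of the child block */
		unsigned long long	size;			/* bytes of stream under that child */
	};

	/*
	 * Pick which of the n children of an internal block covers byte
	 * `offset'; the offset remaining within that child comes back
	 * through *rem.  Returns -1 if offset is past the end of the stream.
	 * With variable-size chunks the sizes have to be scanned; the
	 * path can no longer be computed by arithmetic alone.
	 */
	int
	manberseek(VtManberEntry *e, int n, unsigned long long offset, unsigned long long *rem)
	{
		int i;

		for(i = 0; i < n; i++){
			if(offset < e[i].size){
				*rem = offset;
				return i;
			}
			offset -= e[i].size;
		}
		return -1;
	}

	int
	main(void)
	{
		VtManberEntry e[3] = { {{0}, 5000}, {{0}, 7000}, {{0}, 3000} };
		unsigned long long rem;
		int i;

		i = manberseek(e, 3, 11000ULL, &rem);
		printf("child %d, offset %llu within it\n", i, rem);	/* child 1, offset 6000 */
		return 0;
	}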

Because Venti ignores repeated writes of a block, the space coalescing
happens for free in the Venti+Manber scheme -- the Venti mechanisms
for ignoring repeated writes handle ignoring common blocks too.
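
A toy illustration of why the coalescing comes for free: writes are
keyed by the hash of the block's contents, so a second write of
identical data finds the score already present and stores nothing.
FNV-1a and the flat table below stand in for SHA1 and Venti's index;
none of this is Venti's actual interface.

	#include <stdio.h>

	enum { Nblock = 1024 };		/* toy store capacity (assumption) */

	static unsigned long long stored[Nblock];
	static int nstored;

	/* FNV-1a over the block contents, standing in for SHA1 */
	static unsigned long long
	hashblock(unsigned char *p, int n)
	{
		unsigned long long h;
		int i;

		h = 14695981039346656037ULL;
		for(i = 0; i < n; i++){
			h ^= p[i];
			h *= 1099511628211ULL;
		}
		return h;
	}

	/* returns 1 if the block was already stored, i.e. the write coalesces */
	int
	writeblock(unsigned char *p, int n)
	{
		unsigned long long h;
		int i;

		h = hashblock(p, n);
		for(i = 0; i < nstored; i++)
			if(stored[i] == h)
				return 1;
		stored[nstored++] = h;
		return 0;
	}

	int
	main(void)
	{
		unsigned char buf[] = "the same chunk written twice";
		int first, second;

		first = writeblock(buf, sizeof buf);
		second = writeblock(buf, sizeof buf);
		printf("%d %d\n", first, second);	/* prints 0 1: the second write coalesces */
		return 0;
	}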


* Who should use it?

Applications could choose whether to use the old offset-dependent
blocks scheme or the new content-dependent blocks, as appropriate.
But what does ``as appropriate'' mean?

In general the scheme works well on streams with insert and delete as
common operations.  LBFS [LBFS] was a big win because they were
dealing with big files (both text and binary) that had this property.
Plan 9 source files tend to be quite small already.  The median source
file size in the Plan 9 kernel sources right now is under 16kB, so
LBFS chunking won't help -- we'll still be changing two blocks for
each operation.  Object and binary files would have the property,
though I'm not sure.  Plan 9's binaries are statically linked, so
deleting one byte of code in the middle of the file will change all
calls and jumps across that byte.  The dynamic linking used in FreeBSD
(where LBFS was built) might avoid this problem.  The median Plan 9
kernel object file is still only around 16kB.  Troff and TeX
documents, which tend to be big, will probably work well with this,
but they change less often.  It looks like LBFS-style chunking might
be useful for some files but perhaps not for enough to make it worth
implementing.

There's also the question of when LBFS chunking might really hurt.
Most databases (and also Fossil's directory streams) work in terms of
power-of-two-sized blocks, under the assumption that a power-of-two
sized block will map to some integral number of disk blocks.
If we do LBFS chunking under such a representation, then
each block change also affects the chunk to its left and right, a
factor of three when the average LBFS chunk size is close to the
application block size.  That could be a big lose.  Then again, if
only a little is changing, then only three littles change in the LBFS
version, and if everything is changing, then everything changes in the
LBFS version.  It only really is a big deal when something like every
third block is changing, and how often does that really happen?

If we added LBFS chunking to libvac, then it might make sense to have
Fossil use it for file data but not directories.  In general, it could
only help.  In some cases, though, it seems like it could hurt a lot,
and unfortunately those cases are also the ones that tend to involve
lots of data and be more sensitive to performance.
We'd have to do the experiment, perhaps with seanq's traces, to see
how much of a win it would be for the little guys, and then separately
figure out how much of a lose it would be for the big guys.

In the backup system, it's almost certainly a lose, unless FFS is
managing to lay out files contiguously lots of the time.  I kind of
doubt this, but don't have any hard evidence.  In that case, there's
no benefit (the file systems use 8kB or 16kB blocks, so you don't get
any chunking benefits inside the blocks), and the interference with
the surrounding blocks will triple the incremental cost of backups.

* Links

[Manber] http://citeseer.nj.nec.com/manber94finding.html
[LBFS] http://citeseer.nj.nec.com/muthitacharoen01lowbandwidth.html
[Fossil] http://plan9.bell-labs.com/sys/doc/fossil.pdf




* Re: [9fans] fossil/venti/manber
From: William Josephson @ 2003-04-28 22:13 UTC
  To: 9fans

On Mon, Apr 28, 2003 at 05:55:34PM -0400, Russ Cox wrote:
> I think the notes happen to answer your questions, mainly
> because I think they clear up your misunderstandings about
> how applying Manber's idea would actually work.
> If you have questions after reading the notes, I'd be happy
> to address them directly, but I think we're not on the same
> page at the moment.

<snip good discussion of chunking>

For what it is worth, on common datasets (e.g. /usr and /home on Linux
and FreeBSD boxes, the root disk on Windows boxes, and sets containing
a majority of CAD and graphics files), using chunking turns out to be
a significant win compared to fixed size blocks, at least over time.
For the first run (i.e. first write of a set of files) and for small
files, the difference is usually fairly trivial.  The chunking
approach is not as big a win as using content-addressed blocks in the
first place, but still a win.  Whether or not it makes sense to go to
the effort of implementing it for Plan 9 is another question, of
course.  I'm afraid I can't give numbers as I'd like to, but
suffice it to say that this idea is being pursued commercially....



