9fans - fans of the OS Plan 9 from Bell Labs
* [9fans] thoughts about venti+fossil
@ 2008-03-05  4:00 Enrico Weigelt
  2008-03-05  4:11 ` Roman Shaposhnik
                   ` (4 more replies)
  0 siblings, 5 replies; 67+ messages in thread
From: Enrico Weigelt @ 2008-03-05  4:00 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs


Hi folks,


some thoughts about venti that go around in my mind:

1. how stable is the keying? sha-1 has only 160 bits, while
   data blocks may be up to 56k long. so the mapping is only
   unique in one direction (not one-to-one). how can we be
   *really sure* that - even on very large stores (TB or
   even PB) - the data for each key is always unique?

2. what approximate compression level could be expected on
   large stores from coalescing equal data blocks (for several
   kinds of data, e.g. typical office documents vs. media)?

3. what happens to space consumption if venti is used as
   storage for heavily rw filesystems over a longer time
   (not as a permanent archive) - how much space will be wasted?
   should we add some method for block expiry (e.g. timeouts
   on reference counts)?

4. assuming #1 can be answered 100% yes - would it suit a
   very large (several PB), heavily distributed store
   (e.g. for some kind of distributed, redundant filesystem)?
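For question 1, the usual back-of-the-envelope answer is the birthday bound. A minimal sketch in Python - the block size and store size here are illustrative assumptions, not venti's fixed parameters:

```python
# Birthday-bound estimate: probability that any two distinct blocks
# in a store of the given size share one sha-1 score.
def collision_probability(store_bytes, block_bytes=8 * 1024, hash_bits=160):
    # P(at least one collision) ~ n*(n-1)/2 / 2^hash_bits
    # for n blocks with uniformly distributed scores
    n = store_bytes // block_bytes
    return n * (n - 1) / 2 / 2 ** hash_bits

p = collision_probability(10 ** 15)   # a hypothetical 1 PB store
print(p)                              # on the order of 5e-27
```

Even at petabyte scale the bound stays far below the failure rates of the surrounding hardware.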


cu
--
---------------------------------------------------------------------
 Enrico Weigelt    ==   metux IT service - http://www.metux.de/
---------------------------------------------------------------------
 Please visit the OpenSource QM Taskforce:
 	http://wiki.metux.de/public/OpenSource_QM_Taskforce
 Patches / Fixes for dozens of packages in dozens of versions:
	http://patches.metux.de/
---------------------------------------------------------------------


* [9fans] thoughts about venti+fossil
@ 2008-03-05 14:03 erik quanstrom
  2008-03-05 16:00 ` Russ Cox
  0 siblings, 1 reply; 67+ messages in thread
From: erik quanstrom @ 2008-03-05 14:03 UTC (permalink / raw)
  To: 9fans

> >> http://www.nmt.edu/~val/review/hash/index.html
> >>
> >> Not that this analysis is without flaws, though.
> >
> > have you invented the 9fans.net effect?
> 
> Meaning? I guess the reference went over my head.

the link was inaccessible when i tried to access it.
i figured the combined traffic of 9fans brought it down. ;-).

> > this link may or may not be similar.  but it is on point:
> > http://www.valhenson.org/review/hash.pdf
> 
> I believe it to be exactly the same paper.
> 
> > do you care to elaborate on the flaws of this analysis?
> 
> I tend to agree with the counter-arguments published here:
>     http://monotone.ca/docs/Hash-Integrity.html
> I'm not an expert in this field (although I dabbled
> in cryptography somewhat, given my math background) and
> thus I would love it if somebody could show that the
> counter-arguments don't stand.
> 

the analysis in §4.1 is just wrong.  pedantically, i can't get
past the fact that the paper talks about "sha-1(1)" and
"sha-1(x), x>0".  i'm not sure what that means, since sha-1
operates on blocks, not integers.  but the real problem is
that the author doesn't appear to understand
"cryptographically strong".  the invented function may have
the same probability of collision as sha-1, but it is not
cryptographically strong.

also, i think the author doesn't fully appreciate the
power of really big numbers.  you'd need 10^(12+3.2)/2
tb hard drives *full of data* to have a reasonable chance
of a hash collision with 8k blocks.  at $250 each, this
would cost 2.48e17 dollars.

i'm pretty sure that there are other limits in venti that
kick in before 9000 yottabytes.  that's not in standard
si form because yotta is the biggest si prefix i can find.
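the even-odds point can be rechecked mechanically; a quick sketch (my numbers, using the 8k blocks mentioned above):

```python
# rough recheck: with 8k blocks, even odds of a single sha-1
# collision needs about n^2 / 2^161 = 1/2, i.e. n = 2^80 blocks.
n_blocks = 2 ** 80
store_bytes = n_blocks * 8 * 1024     # 2^93 bytes of unique data
yottabytes = store_bytes / 10 ** 24
print(int(yottabytes))                # roughly the 9000 yottabytes above
```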

i think that venti has a different problem.  indexing by
sha-1 hash trades time and index lookups for space.
but disk space is cheap relative to our needs, and the table
lookups and fragmentation that venti implies result
in a lot of random i/o.  modern disks are at least 25x
faster doing sequential i/o.

- erik


* Re: [9fans] thoughts about venti+fossil
@ 2008-03-06 19:09 Brian L. Stuart
  2008-03-06 19:50 ` Charles Forsyth
  0 siblings, 1 reply; 67+ messages in thread
From: Brian L. Stuart @ 2008-03-06 19:09 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

From: "Russ Cox" <rsc@swtch.com>
> sure.  use sha-256 and your probability of collision goes
> down even further.  but *you* (probably) still won't be *sure*.

I should probably not put my 2 cents worth in here,
but my resistance is weak...

It is true that you cannot be sure that there won't be a
collision in venti, regardless of what hashing function
you use.  It is probabilistic: a collision could happen
tomorrow, or it could fail to happen before the sun
burns out.  But it seems to me that there's
a bigger picture.  The reason we would not want a collision
is that it would, in effect, be a form of data corruption.
But it's only one possible source.  It's possible that
network communication could be corrupted but still
pass the CRC checks (if they're even present).  It's
possible that the disk could be corrupted in such a
way that a block is in error, but still passes the
ECC check.  It's possible that a bit in the main
memory might flip (or two bits if we have parity
memory).  In the end, we have to rely on the fact
that these are all very unlikely to happen; their
probabilities are quite low.  A higher probability
of damage comes from a potential fire in the machine
room.  We often add some form of off-site backup
to handle this.  But it can't make us sure that
an earthquake won't hit the off-site backup location
at the same time we have a fire locally.  Rather,
the probability of both is low enough we accept it.
The amount of effort we put into mitigating an error
is proportional to the probability of that error
occurring and the amount of harm the error would
cause.

What does all this mean for venti?  If we want to
reduce the overall probability of data corruption,
we want to put our efforts into addressing the one
with the highest probability.  Making the others
better won't appreciably help the overall probability.
And a venti collision is not the one with the
highest probability among those I've listed.  In
fact, I'd suspect it's the one with the lowest
probability.  So putting attention on making it
less likely is really misplaced effort from a
practical standpoint.

The truth is that the first time I read the venti
papers, I was bothered the same way.  Yes, there
can be problems, but generally we design systems
where the design itself doesn't contain any known
sources of failure.  In venti, we have one.  And it
bugged me for quite a while.  But when I finally
realized objectively that the probability of a
hardware failure is orders of magnitude greater
than a collision, I started to accept that venti
is a very well-designed system, and is as reliable
as any other form of archive.

BLS


* Re: [9fans] thoughts about venti+fossil
@ 2015-04-21 18:30 hruodr
  2015-04-21 19:46 ` Russ Cox
  0 siblings, 1 reply; 67+ messages in thread
From: hruodr @ 2015-04-21 18:30 UTC (permalink / raw)
  To: 9fans


Dear Sirs!

I have a question about this old discussion:

http://www.mail-archive.com/9fans@cse.psu.edu/msg17960.html

And about the following answer:

> if you change lump.c to say
>
>       int verifywrites = 1;
>
> then venti will check every block as it is written to make
> sure there is no hash collision.  this is not the default (anymore).

Does this mean that Plan 9 by default supposes that A=B if hash(A)=hash(B)?

That this was not the default before, but it is now?

That I still have the possibility of a "full check" of A=B (rather than
supposing it after checking hash(A)=hash(B)) by setting "int verifywrites = 1;"
in "lump.c"?

I do not want to revive the discussion, because I have the feeling
that discussion of this topic easily becomes very ideological.
I had it just some time ago; here is my last word:

http://thread.gmane.org/gmane.os.openbsd.misc/206951/focus=207340

In fact I feel bad declaring A=B when hash(A)=hash(B), but I know that
the probability of an error is, at least in my experience, very small,
negligible (although I have scruples saying so). It is just a bad feeling,
perhaps more against this kind of "probabilistic programming" than
against the probability of error (which I do not exactly know).

My question is hence not about the theory behind it, nor about a justification
of the test with hash functions. I just want to know why the
default was changed, why originally A=B was checked even if one was sure
that the probability of error from only checking hash(A)=hash(B) is
negligible, and why the possibility of changing this default exists.

Rodrigo.



* Re: [9fans] thoughts about venti+fossil
@ 2015-04-23  7:21 hruodr
  0 siblings, 0 replies; 67+ messages in thread
From: hruodr @ 2015-04-23  7:21 UTC (permalink / raw)
  To: 9fans


On Tue, 21 Apr 2015, Russ Cox wrote:

> My paper with Sean Rhea and Alex Pesterev documents the performance 
> effect of double-checking the equality in some detail.
> https://www.usenix.org/legacy/event/usenix08/tech/full_papers/rhea/rhea.pdf

Very nice paper, especially from chapter 3 on.

Reading it, my original question arises again:

****
  Foundation’s CAS layer is modeled on the Venti [34]
  content-addressed storage server, but we have adapted the
  Venti algorithms for use in a single-disk system and also
  optionally eliminated the assumption that SHA-1 is free
  of collisions, producing two operating modes for Foundation:
  compare-by-hash and compare-by-value. (Page 4)
****


And the question is answered in the paper:

****
While we originally investigated this mode [Compare-by-value] 
due to (in our opinion, unfounded) concerns about cryptographic
hash collisions (see [5, 16] for a lively debate), we
were surprised to find that its overall write performance
was close to that of compare-by-hash mode, despite the
added comparisons. Moreover, compare-by-value is always
faster for reads, as naming blocks by their log offsets
completely eliminates index lookups during reads. (page 7)
****


> (Caveat: in the usual academic tradition, the paper uses "Venti" to mean
> the system described in the original paper, not the system in Plan 9 today.
> The current Plan 9 implementation is much closer to what the paper calls 
> "Foundation: Compare by Hash".)

New question: and if I compile it with "int verifywrites = 1", is it
closer to "Compare-by-Value"?

I mean using the offset as the handle.
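My reading of "offset as handle", as a hedged sketch - simplified, hypothetical code, not the paper's actual Foundation implementation, which keeps the log on disk:

```python
import hashlib

class LogStore:
    """Toy compare-by-value store: handles are log offsets, not scores."""
    def __init__(self):
        self.log = bytearray()    # append-only data log
        self.index = {}           # sha-1 score -> (offset, length)

    def write(self, data):
        score = hashlib.sha1(data).digest()
        hit = self.index.get(score)
        # compare-by-value: on an index hit, verify the actual bytes
        if hit is not None and self.log[hit[0]:hit[0] + hit[1]] == data:
            return hit            # duplicate block, reuse its handle
        off = len(self.log)
        self.log += data
        self.index[score] = (off, len(data))
        return (off, len(data))

    def read(self, handle):
        # no index lookup at all: the handle already locates the bytes
        off, length = handle
        return bytes(self.log[off:off + length])

ls = LogStore()
h = ls.write(b"block")
assert ls.read(h) == b"block"
assert ls.write(b"block") == h    # dedup still works on writes
```

This is why the paper finds compare-by-value always faster for reads: the read path never consults the hash index.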


> Hope this helps.

I wanted to know if the optional compilation with the full check was in
consideration of people who have concerns about the (in)correctness of
compare-by-hash. One can disagree about the risk of using compare-by-hash,
but one cannot disagree with the fact that one disagrees. :)

I think everyone should decide for himself whether he uses "compare-by-hash"
and where he uses it. In some applications I would even take much more
risk than compare-by-hash does. And I find the experiments with
hash functions, including compare-by-hash, interesting.

I appreciate that the option of "compare-by-value" is there. Documentation
about where "compare-by-hash" is used is important so that people
may decide for themselves.

Also interesting would be the possibility of easily changing the hash
function. As you note in the paper, this is important in "compare-by-value"
for increasing performance.
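Making the hash function swappable can be as simple as parameterizing the scorer; a hypothetical sketch using Python's hashlib:

```python
import hashlib

def make_scorer(algorithm="sha1"):
    """Return a scoring function for the named hashlib algorithm."""
    # changing "sha1" to "sha256" here re-keys the whole store
    def score(data):
        h = hashlib.new(algorithm)
        h.update(data)
        return h.digest()
    return score

assert len(make_scorer("sha1")(b"x")) == 20     # 160-bit score
assert len(make_scorer("sha256")(b"x")) == 32   # 256-bit score
```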

The problem I have with "compare-by-hash" is not only the probability of
hash collisions, but that it seems to rely on empirical knowledge about
the hash function used. People used to analytical arguments may find
empirical arguments and empirical programming gruesome. If the empirical
knowledge changes, if one discovers that the hash function used does not
distribute its domain homogeneously enough over its range, then one will
want (especially in the case of compare-by-hash) to change the hash
function for a better one. Trial and error is the empirical method
of solving (and making) problems.

Rodrigo.






Thread overview: 67+ messages
2008-03-05  4:00 [9fans] thoughts about venti+fossil Enrico Weigelt
2008-03-05  4:11 ` Roman Shaposhnik
2008-03-05  4:43   ` erik quanstrom
2008-03-05  5:09     ` Roman Shaposhnik
2008-03-05  5:52   ` Enrico Weigelt
2008-03-05  6:24     ` geoff
2008-03-05  6:35     ` Taj Khattra
     [not found]     ` <7f575fa27b41329b9ae24f40e6e5a3cd@plan9.bell-labs.com>
2008-03-06  4:04       ` Enrico Weigelt
2008-03-06  4:13         ` Bruce Ellis
2008-03-06  4:15         ` andrey mirtchovski
2008-03-06  4:31           ` Bruce Ellis
2008-03-06  6:16             ` Enrico Weigelt
2008-03-06 18:50               ` ron minnich
2008-03-06 19:43                 ` Charles Forsyth
2008-03-06 19:45               ` Paul Lalonde
2008-03-06 20:18                 ` Bruce Ellis
2008-03-06 21:39                   ` Paul Lalonde
2008-03-08  9:06                     ` Enrico Weigelt
2008-03-06 22:10                   ` Martin Harriss
2008-03-06  6:40           ` Enrico Weigelt
2008-03-06 14:35             ` erik quanstrom
2008-03-06 14:58             ` Tom Lieber
2008-03-06 15:09             ` Charles Forsyth
2008-03-06 17:09               ` Robert Raschke
2008-03-10 10:19               ` sqweek
2008-03-10 12:29                 ` Gorka Guardiola
2008-03-10 13:20                 ` erik quanstrom
2008-03-10 19:00                   ` Wes Kussmaul
2008-03-10 19:27                     ` erik quanstrom
2008-03-10 20:55                       ` Bakul Shah
2008-03-11  2:04                       ` Wes Kussmaul
2008-03-11  2:10                         ` erik quanstrom
2008-03-11  6:03                           ` Bruce Ellis
2008-03-10 16:18                 ` Russ Cox
2008-03-10 18:06                   ` Bruce Ellis
2008-03-10 18:31                     ` Eric Van Hensbergen
2008-03-10 18:40                       ` Bruce Ellis
2008-03-10 18:46                     ` Geoffrey Avila
2008-03-10 20:28                       ` Charles Forsyth
2008-03-10 21:35                     ` Charles Forsyth
2008-03-06  9:54           ` Wilhelm B. Kloke
2008-03-08  9:37             ` Enrico Weigelt
2008-03-08  9:57               ` Bruce Ellis
2008-03-08 10:46               ` Charles Forsyth
2008-03-08 15:37               ` erik quanstrom
2008-03-06  4:40         ` cummij
2008-03-06  5:15           ` Bruce Ellis
2008-03-06  5:40         ` Uriel
2008-03-06  5:55           ` Bruce Ellis
2008-03-11 18:34             ` Uriel
2008-03-06 12:26           ` erik quanstrom
2008-03-05  5:04 ` geoff
2008-03-05  8:43 ` Charles Forsyth
2008-03-05  9:05   ` Gorka Guardiola
2008-03-05 14:33 ` Russ Cox
2008-03-06 12:39   ` Enrico Weigelt
2008-03-06 16:58     ` Russ Cox
2008-03-06 18:16       ` andrey mirtchovski
     [not found] ` <a553f487750f88281db1cce3378577c7@terzarima.net>
2008-03-06  5:38   ` Enrico Weigelt
2008-03-06  9:44     ` Joel C. Salomon
2008-03-05 14:03 erik quanstrom
2008-03-05 16:00 ` Russ Cox
2008-03-06 19:09 Brian L. Stuart
2008-03-06 19:50 ` Charles Forsyth
2015-04-21 18:30 hruodr
2015-04-21 19:46 ` Russ Cox
2015-04-23  7:21 hruodr

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).