From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Thu, 23 Apr 2015 07:21:38 +0000
From: hruodr@gmail.com
To: 9fans@9fans.net
Message-ID: <55389d82.M9BOTEwLjblmxdW/%hruodr@gmail.com>
User-Agent: Heirloom mailx 12.4 7/29/08
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
Subject: Re: [9fans] thoughs about venti+fossil
Topicbox-Message-UUID: 4e402ccc-ead9-11e9-9d60-3106f5b1d025


On Tue, 21 Apr 2015, Russ Cox wrote:

> My paper with Sean Rhea and Alex Pesterev documents the performance=20
> effect of double-checking the equality in some detail.
> https://www.usenix.org/legacy/event/usenix08/tech/full_papers/rhea/rhea=
.pdf

Very nice paper. Specialy from chapter 3.

By reading it, my same original question arises again:

****
  Foundation=E2=80=99s CAS layer is modeled on the Venti [34]
  content-addressed storage server, but we have adapted the
  Venti algorithms for use in a single-disk system and also
  optionally eliminated the assumption that SHA-1 is free
  of collisions, producing two operating modes for Foundation:
  compare-by-hash and compare-by-value. (Page 4)
****


And the question is answered in the paper:

****
While we originally investigated this mode [Compare-by-value]=20
due to (in our opinion, unfounded) concerns about cryptographic
hash collisions (see [5, 16] for a lively debate), we
were surprised to find that its overall write performance
was close to that of compare-by-hash mode, despite the
added comparisons. Moreover, compare-by-value is always
faster for reads, as naming blocks by their log offsets
completely eliminates index lookups during reads. (page 7)
****


> (Caveat: in the usual academic tradition, the paper uses "Venti" to mea=
n
> the system described in the original paper, not the system in Plan 9 to=
day.
> The current Plan 9 implementation is much closer to what the paper call=
s=20
> "Foundation: Compare by Hash".)

New question: and if I compile it with "int verifywrites =3D 1", is it
closer to "Compare-by-Value"?

I mean offset as handle.


> Hope this helps.

I wanted to know if the optional compiling with full check was in
consideration of people that have concerns about the (in)correctness of
compare-by-hash. One can disagree about the risk of using compare-by-hash=
,
but one cannot disagree in the fact that one disagrees. :)

I think, everyone should decide himself if he uses "compare-as-hash"=20
and where he uses it. In some applications I would even take much more=20
risk than compare-by-hash. And I find interesting the experiments with
hash functions, including compare-by-hash.

I appreciate that the option of "compare-by-value" is there. Documentatio=
n=20
about where "compare-by-hash" is used, is important in orther that people=
=20
may decide by themselve.

Interesting would also be the possibility of easily changing the hash
functions. As you note in the paper, this is important in "compare-by-val=
ue"
for increasing performance.

The problem I have with "compare-by-hash" is not only the probability of
hash colisions, but that it seems to rely on empirical knowledge about=20
the used hash function. People used to analytical arguments may find
empirical arguments and empirical programming gruesome. If the empirical=20
knowledge changes, if one discovers that the used hash function does not=20
distribute homogenously enough its domain in its range, then one will=20
want (specially in the case of compare by hash) to change the hash=20
function with a better one. Trial and error is the empirical method=20
of solving (and making) problems.

Rodrigo.