From mboxrd@z Thu Jan 1 00:00:00 1970 Date: Thu, 23 Apr 2015 07:21:38 +0000 From: hruodr@gmail.com To: 9fans@9fans.net Message-ID: <55389d82.M9BOTEwLjblmxdW/%hruodr@gmail.com> User-Agent: Heirloom mailx 12.4 7/29/08 MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Subject: Re: [9fans] thoughs about venti+fossil Topicbox-Message-UUID: 4e402ccc-ead9-11e9-9d60-3106f5b1d025 On Tue, 21 Apr 2015, Russ Cox wrote: > My paper with Sean Rhea and Alex Pesterev documents the performance=20 > effect of double-checking the equality in some detail. > https://www.usenix.org/legacy/event/usenix08/tech/full_papers/rhea/rhea= .pdf Very nice paper. Specialy from chapter 3. By reading it, my same original question arises again: **** Foundation=E2=80=99s CAS layer is modeled on the Venti [34] content-addressed storage server, but we have adapted the Venti algorithms for use in a single-disk system and also optionally eliminated the assumption that SHA-1 is free of collisions, producing two operating modes for Foundation: compare-by-hash and compare-by-value. (Page 4) **** And the question is answered in the paper: **** While we originally investigated this mode [Compare-by-value]=20 due to (in our opinion, unfounded) concerns about cryptographic hash collisions (see [5, 16] for a lively debate), we were surprised to find that its overall write performance was close to that of compare-by-hash mode, despite the added comparisons. Moreover, compare-by-value is always faster for reads, as naming blocks by their log offsets completely eliminates index lookups during reads. (page 7) **** > (Caveat: in the usual academic tradition, the paper uses "Venti" to mea= n > the system described in the original paper, not the system in Plan 9 to= day. > The current Plan 9 implementation is much closer to what the paper call= s=20 > "Foundation: Compare by Hash".) New question: and if I compile it with "int verifywrites =3D 1", is it closer to "Compare-by-Value"? I mean offset as handle. > Hope this helps. I wanted to know if the optional compiling with full check was in consideration of people that have concerns about the (in)correctness of compare-by-hash. One can disagree about the risk of using compare-by-hash= , but one cannot disagree in the fact that one disagrees. :) I think, everyone should decide himself if he uses "compare-as-hash"=20 and where he uses it. In some applications I would even take much more=20 risk than compare-by-hash. And I find interesting the experiments with hash functions, including compare-by-hash. I appreciate that the option of "compare-by-value" is there. Documentatio= n=20 about where "compare-by-hash" is used, is important in orther that people= =20 may decide by themselve. Interesting would also be the possibility of easily changing the hash functions. As you note in the paper, this is important in "compare-by-val= ue" for increasing performance. The problem I have with "compare-by-hash" is not only the probability of hash colisions, but that it seems to rely on empirical knowledge about=20 the used hash function. People used to analytical arguments may find empirical arguments and empirical programming gruesome. If the empirical=20 knowledge changes, if one discovers that the used hash function does not=20 distribute homogenously enough its domain in its range, then one will=20 want (specially in the case of compare by hash) to change the hash=20 function with a better one. Trial and error is the empirical method=20 of solving (and making) problems. Rodrigo.