zsh-users
From: Charles Blake <charlechaud@gmail.com>
To: Bart Schaefer <schaefer@brasslantern.com>
Cc: Zsh Users <zsh-users@zsh.org>
Subject: Re: find duplicate files
Date: Mon, 8 Apr 2019 07:17:56 -0400	[thread overview]
Message-ID: <CAKiz1a90DsdjXmrE1wEviN5R9hP=225tVfQPFTJkC73f9pQ74A@mail.gmail.com> (raw)
In-Reply-To: <CAH+w=7YKTPfQdkA3c-FUKbZmJnpZuBzBQdaCpSxoeY=SQaJtMw@mail.gmail.com>


>I find that a LOT more understandable than the python code.

Understandability is, of course, somewhat subjective (e.g. some might say
"every 15th field" is unclear relative to a named label), but the Python
also does more, and would be quite a bit simpler without any parallelism,
while the Zsh code would be more complex with a setup for it.  So it's a
somewhat apples & oranges/strawman comparison.  No offense intended.  I
don't dislike your solution, or a simpler Python one.  Simplicity is good.
As elaborated below, everything sort of "just depends".

Some hash functions are really fast; some IO systems are really slow; etc.
To really do the parallel IO/hashing right (i.e. be nice to competition
both from yourself & other programs on the system) you probably want to
start with a parallelism of 1 CPU and dynamically adjust based on CPU
utilization (say, after each file, via getrusage in some parallel-work
coordinator).  Such a scheme would be more complex still, but as we both
mentioned it would be more efficient in a lower-level language.
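
A rough, untested Python sketch of that feedback loop (the 50%/80%
utilization thresholds, the batch-per-adjustment structure, and the 1 MiB
reads are all arbitrary assumptions, not anything I posted):

    import os, time, resource, hashlib
    from concurrent.futures import ThreadPoolExecutor

    def hash_one(path):                  # hashlib releases the GIL on big
        h = hashlib.sha1()               # buffers, so threads can overlap
        with open(path, 'rb') as f:
            for blk in iter(lambda: f.read(1 << 20), b''):
                h.update(blk)
        return path, h.hexdigest()

    def hash_all(paths):
        workers, out = 1, []             # start at parallelism of 1 CPU
        it = iter(paths)
        t0 = time.monotonic()
        r0 = resource.getrusage(resource.RUSAGE_SELF)
        while True:
            batch = [p for _, p in zip(range(workers), it)]
            if not batch:
                return out
            with ThreadPoolExecutor(max_workers=workers) as ex:
                out.extend(ex.map(hash_one, batch))
            t1 = time.monotonic()
            r1 = resource.getrusage(resource.RUSAGE_SELF)
            cpu = (r1.ru_utime + r1.ru_stime) - (r0.ru_utime + r0.ru_stime)
            util = cpu / max(t1 - t0, 1e-9)    # CPU-sec per wall-sec
            if util < 0.5 * workers:           # IO-bound: try more workers
                workers = min(workers + 1, os.cpu_count() or 1)
            elif util > 0.8 * workers and workers > 1:
                workers -= 1                   # CPU-saturated: be nice
            t0, r0 = t1, r1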


>This should be vanishingly small if the sizes are the same?

Of course, if you have a hash with no known collision attack and no
reason to suspect a filesystem collision attack, then yes, a vanishingly
small probability.  I did not mean to overemphasize the risk so much as
use it to introduce the "truncated IO/hash" optimization idea, trading
some false-positive risk for performance.  I personally used the report
for rare-case manual intervention, not automated response.  So, cmp at the
end makes total sense.  I don't think there is any disagreement here.
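
In Python that final confirmation is one library call; e.g., for a
(usually tiny) group of hash-matched candidates:

    import filecmp

    def confirmed_duplicates(group):     # byte-for-byte check removes any
        pairs = []                       # residual false-positive risk
        for i, a in enumerate(group):
            for b in group[i + 1:]:
                if filecmp.cmp(a, b, shallow=False):
                    pairs.append((a, b))
        return pairs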

Another motivation for the truncated IO/hash optimization that may be
underappreciated is that if your report is empty then you are done.
False negatives don't happen.  That might well be the common case after
some situation is cleaned up and you want a "quick check" that there are
no duplicates, or when you otherwise "suspect but do not know" that a
file set has none.  If that quick check fails then you can do a slower,
more precise job.  With large files or slow IO this can help a lot.
Call it an "is a duplicate possible?" operation if you will.
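
A minimal sketch of that quick check (the 64 KiB prefix is an arbitrary
choice): group by size, then by a hash of just the first bytes.  An empty
result proves no duplicates; a non-empty one only flags candidates.

    import os, hashlib
    from collections import defaultdict

    PREFIX = 64 * 1024                   # arbitrary truncation point

    def maybe_duplicates(paths):
        by_size = defaultdict(list)
        for p in paths:
            by_size[os.stat(p).st_size].append(p)
        cand = defaultdict(list)
        for size, grp in by_size.items():
            if len(grp) < 2:             # a unique size cannot duplicate
                continue
            for p in grp:
                with open(p, 'rb') as f:
                    head = f.read(PREFIX)
                cand[(size, hashlib.sha1(head).hexdigest())].append(p)
        return [g for g in cand.values() if len(g) > 1]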

Partial hashes can also be a jumping-off point for detecting "corrupt"
or near-duplicate files.  The many ways corruption can happen generate a
list of questions/approaches which is off topic.  One way it can happen
is non-size-changing scribbling over (or non-population of) just a few
bytes, which does relate to partial hashing, though obviously partial
hashing is not perfectly tuned for such an application.
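
To gesture at the connection (a toy sketch, beyond this thread's scope):
per-block digests of two same-size files localize where they diverge, and
such digests could also be computed and stored ahead of time.  The 4 KiB
block size is arbitrary.

    import hashlib

    def differing_blocks(a, b, blk=4096):
        bad, i = [], 0
        with open(a, 'rb') as fa, open(b, 'rb') as fb:
            while True:
                x, y = fa.read(blk), fb.read(blk)
                if not x and not y:
                    break
                if hashlib.sha1(x).digest() != hashlib.sha1(y).digest():
                    bad.append(i)        # i-th block was scribbled over
                i += 1
        return bad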


>I haven't dealt with the files having strange characters in the names
>that might prevent them being easily used as hash keys, that's left as
>an exercise

Part of my purpose in writing the Python was indeed being able to deal
with weird path names, which I also generally try hard to avoid, but
sometimes that's beyond one's control/authority.  I almost added a user
parameter for the hard TAB and NEWLINE chars used as output delimiters,
but tried to keep it simple for exposition.  With utf8 these days...
truly general filenames are decidedly a PITA.
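
For what it's worth, NUL-terminated records sidestep the delimiter
problem entirely, since NUL is the one byte a POSIX path cannot contain
(the same idea as find -print0).  A sketch:

    import os, sys

    def emit_group(paths):               # one duplicate-group per record
        for p in paths:
            sys.stdout.buffer.write(os.fsencode(p) + b'\0')
        sys.stdout.buffer.write(b'\0')   # empty path ends the group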


>unless you're NOT going to consider linked files as duplicates you
>might as well just compare sizes.  (It would be faster to get inodes

What may have been underappreciated is that handling hard-link identity
also lets you skip recomputing hashes over a hard-link cluster, which
must all be the same content.  (Or redoing the IO, if the time between
hitting that i-node is long enough to lose buffer caching.)  One
thinkable, possibly even automatic, resolution of duplicates is
conversion into hard links.  If a lot of duplication were resolved this
way then this would save time on a "second pass" over such file
hierarchies.  I'm just defending that admittedly likely minor
optimization (which just wasn't so hard to add in Python).
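
The mechanism is just grouping paths by (st_dev, st_ino) before hashing,
so each i-node's data is read and hashed once; a sketch:

    import os, hashlib
    from collections import defaultdict

    def hash_clusters(paths):
        clusters = defaultdict(list)
        for p in paths:
            st = os.stat(p)
            clusters[(st.st_dev, st.st_ino)].append(p)
        digests = {}
        for grp in clusters.values():
            h = hashlib.sha1()
            with open(grp[0], 'rb') as f:     # one read per i-node
                for blk in iter(lambda: f.read(1 << 20), b''):
                    h.update(blk)
            for p in grp:                     # hard links share content
                digests[p] = h.hexdigest()
        return digests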

Most claims in this whole space depend upon application context in some
way.  Some are hard, like files being updated during the duplicate scan.
Yeah... you can look at timestamps, and maybe re-compute things, but ick.
If things are changing who knows what version the user wants assessed?
Duplicates are probably rare.  Large trees of heavily hard-linked files
are probably a much smaller optimization than using file size as a weak
hash to organize the calculation.  You can probably find a hash function
faster than your IO.  Or maybe you have files on some quad M.2 NVMe
RAID setup delivering 13 gigabytes/sec, and it is harder to match that
with a crypto hash function.  Almost everything you say needs a "probably/maybe"
qualifier.  I don't think you disagree.  I'm just elaborating a little
for passers by.

P.S.:
    Two of my side-comments got auto-wrapped at 76 text columns.  Should
be an easy fix for anyone cutting & pasting that Python code.  Sorry!

