Re: find duplicate files


From: Bart Schaefer <schaefer@brasslantern.com>
To: Charles Blake <charlechaud@gmail.com>
Cc: Zsh Users <zsh-users@zsh.org>
Subject: Re: find duplicate files
Date: Sun, 7 Apr 2019 14:32:38 -0700
Message-ID: <CAH+w=7YKTPfQdkA3c-FUKbZmJnpZuBzBQdaCpSxoeY=SQaJtMw@mail.gmail.com>
In-Reply-To: <CAKiz1a_yfNUY87FX15H3=M5P46icqeH7m=1MM7JWazqVqz4bYw@mail.gmail.com>

On Sun, Apr 7, 2019 at 4:19 AM Charles Blake <charlechaud@gmail.com> wrote:
>
>
> Zeroth, you may want to be careful of zero length files which are all
> identical, but also take up little to no space beyond their pathname.

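This is just **/*(.L+0) in zsh (the dot restricts matches to plain
files, so directories are ignored; L+0 keeps only files larger than
zero bytes, whereas lowercase l would test the link count instead).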

> Another zeroth order concern is
> file identity.  I-node/file numbers are better here than path names.

You can get all these numbers reasonably fast (unless some sort of
networked storage is involved) with

zmodload zsh/stat
zstat -A stats **/*(.L+0)

Of course if you're talking really huge numbers of files you might not
want to collect this all at once.

> First, and most importantly, the file size itself acts as a weak hash

That's also snarfled up by zstat -A.  It doesn't really take any
longer to get the file size from stat than it does the inode, so
unless you're NOT going to consider linked files as duplicates you
might as well just compare sizes.  (It would be faster to get inodes
from readdir, but there's no shell-level readdir.)  More on this
later.
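
If linked files should not count as duplicates, one way to tell a hard
link apart from a genuine copy is the -ef test, which succeeds only
when both paths refer to the same device and inode.  A minimal sketch;
same_file is a hypothetical helper name, not something zsh provides:

```shell
# same_file (hypothetical helper): succeed only if both paths refer to
# the same underlying file, i.e. the same device and inode number.
# Note that -ef follows symlinks, so a symlink to a file also matches it.
same_file() {
  [ "$1" -ef "$2" ]
}
```

Something like "same_file $a $b || print Possible duplicate: $a $b"
would then skip pairs that are merely two names for one file.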

> Second, if you have a fast IO system (eg., your data all fits in RAM)
> then time to strongly hash likely dominates the calculation, but that
> also parallelizes easily.

You're probably not going to beat built-in lightweight threading and
shared memory with shell process-level "threading", so if you have a
large number of files that are the same size but with different
contents then something like python or C might be the way to go.

> Third, in theory, even strong crypto hashes have a small but non-zero
> chance of a collision.  So, you may need to deal with the possibility
> of false positives anyway.

This should be vanishingly small if the sizes are also the same.  False
positives can be checked by using "cmp -s", which just compares the
files byte-by-byte until it finds a difference, but this is a pretty
hefty penalty on the true positives, because identical files have to
be read all the way to the end.
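
As a rough sketch of that check (confirm_dup is a hypothetical helper
name, and it assumes the two files were already matched on size and
checksum):

```shell
# confirm_dup (hypothetical helper): exit status 0 only if the two
# files are byte-for-byte identical.  -s suppresses all output, so the
# result is carried by the exit status alone.
confirm_dup() {
  cmp -s -- "$1" "$2"
}
```

The reporting step could then become
"confirm_dup $a $b && print Duplicate: $a $b", at the cost of reading
every true duplicate pair in full.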

Anyway, here's zsh code.  I haven't dealt with files that have strange
characters in their names, which might prevent them being easily used
as hash keys; that's left as an exercise for somebody who has files
with strange names:

zmodload zsh/stat
zstat -nA stats **/*(.L+0)
# With -n, each file yields its name plus 14 stat fields (15 elements),
# so we pick out every 15th
names=( ${(e):-'${stats['{1..${#stats}..15}']}'} ) # name is element 1
sizes=( ${(e):-'${stats['{9..${#stats}..15}']}'} ) # size is element 9
# Zip the two arrays to make a mapping
typeset -A clusters sizemap=( ${names:^sizes} )
# Compute clusters of same-sized files
for i in {1..$#sizes}
do
  same=( ${(k)sizemap[(R)$sizes[i]]} )
  (( $#same > 0 )) || continue
  # Delete entries we've seen so we don't find them again
  unset 'sizemap['${^same}']'
  (( $#same > 1 )) && clusters[$sizes[i]]=${(@qq)same}
done
# Calculate checksums by cluster and report duplicates
typeset -A sums
for f in ${(v)clusters}
do
  # Inner loop could be put in a function and outer loop replaced by zargs -P
  # cksum prints "checksum bytecount name", so $blocks holds the byte count
  for sum blocks file in $( eval cksum $f )
  do
    if (( ${+sums[$sum.$blocks]} ))
    then print Duplicate: ${sums[$sum.$blocks]} $file
    else sums[$sum.$blocks]=$file
    fi
  done
done

I find that a LOT more understandable than the python code.
