zsh-users
* find duplicate files
@ 2019-04-06  5:40 Emanuel Berg
  2019-04-06 13:02 ` Paul Hoffman
  0 siblings, 1 reply; 22+ messages in thread
From: Emanuel Berg @ 2019-04-06  5:40 UTC (permalink / raw)
  To: zsh-users

Is this any good? Can it be done linearly?

TIA

#! /bin/zsh

find-duplicates () {
    local -a files
    [[ $# = 0 ]] && files=("${(@f)$(ls)}") || files=($@)

    local dups=0

    # files
    local a
    local b

    for a in $files; do
        for b in $files; do
            if [[ $a != $b ]]; then
                diff $a $b > /dev/null
                if [[ $? = 0 ]]; then
                    echo $a and $b are the same
                    dups=1
                fi
            fi
        done
    done
    [[ $dups = 0 ]] && echo "no duplicates"
}
alias dups=find-duplicates

-- 
underground experts united
http://user.it.uu.se/~embe8573


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: find duplicate files
  2019-04-06  5:40 find duplicate files Emanuel Berg
@ 2019-04-06 13:02 ` Paul Hoffman
  2019-04-06 14:44   ` Emanuel Berg
  0 siblings, 1 reply; 22+ messages in thread
From: Paul Hoffman @ 2019-04-06 13:02 UTC (permalink / raw)
  To: zsh-users

On Sat, Apr 06, 2019 at 07:40:59AM +0200, Emanuel Berg wrote:
> Is this any good? Can it be done linearly?
> 
> TIA
> 
> #! /bin/zsh
> 
> find-duplicates () {
>     local -a files
>     [[ $# = 0 ]] && files=("${(@f)$(ls)}") || files=($@)
> 
>     local dups=0
> 
>     # files
>     local a
>     local b
> 
>     for a in $files; do
>         for b in $files; do
>             if [[ $a != $b ]]; then
>                 diff $a $b > /dev/null
>                 if [[ $? = 0 ]]; then
>                     echo $a and $b are the same
>                     dups=1
>                 fi
>             fi
>         done
>     done
>     [[ $dups = 0 ]] && echo "no duplicates"
> }
> alias dups=find-duplicates

Your function keeps comparing even after it finds duplicates, so its 
runtime will be O(N^2), i.e., proportional to the square of the sum of 
file sizes (N).  Here's one that calculates MD5 checksums, and compares 
those, and so is O(N) + O(M^2), i.e., proportional to the sum of file 
sizes plus the square of the number of files (M).

#!/bin/zsh
find-duplicates () {
    (( # > 0 )) || set -- *(.N)
    local dups=0
    md5sum $@ | sort | uniq -c |
    grep -qv '^  *1 ' | wc -l | read dups
    (( dups == 0 )) && echo "no duplicates"
}

A better solution would use an associative array (local -A NAME), would 
*not* sort, and would stop as soon as it found a duplicate, but I'll 
leave that as an exercise for the reader.  :-) 
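
For the curious, one way that exercise might be sketched (an untested
sketch, not Paul's implementation; it assumes GNU md5sum and uses
`typeset -A`, which recent bash also understands):

```shell
# One pass, no sort: remember each checksum in an associative array and
# stop at the first duplicate found.  Sketch only; md5sum is assumed.
find_duplicates_assoc () {
    typeset -A seen            # checksum -> first file seen with it
    local f sum
    for f in "$@"; do
        sum=$(md5sum -- "$f") || return 2
        sum=${sum%% *}
        if [ -n "${seen[$sum]}" ]; then
            echo "${seen[$sum]} and $f are the same"
            return 1
        fi
        seen[$sum]="$f"
    done
    echo "no duplicates"
}
```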

Paul.

-- 
Paul Hoffman <nkuitse@nkuitse.com>


* Re: find duplicate files
  2019-04-06 13:02 ` Paul Hoffman
@ 2019-04-06 14:44   ` Emanuel Berg
  2019-04-06 19:11     ` zv
  0 siblings, 1 reply; 22+ messages in thread
From: Emanuel Berg @ 2019-04-06 14:44 UTC (permalink / raw)
  To: zsh-users

Paul Hoffman wrote:

> #!/bin/zsh
> find-duplicates () {
>     (( # > 0 )) || set -- *(.N)
>     local dups=0
>     md5sum $@ | sort | uniq -c |
>     grep -qv '^  *1 ' | wc -l | read dups
>     (( dups == 0 )) && echo "no duplicates"
> }

Cool, but doesn't seem to work?

-- 
underground experts united
http://user.it.uu.se/~embe8573



* Re: find duplicate files
  2019-04-06 14:44   ` Emanuel Berg
@ 2019-04-06 19:11     ` zv
  2019-04-06 19:42       ` Emanuel Berg
  2019-04-07 11:16       ` Charles Blake
  0 siblings, 2 replies; 22+ messages in thread
From: zv @ 2019-04-06 19:11 UTC (permalink / raw)
  To: zsh-users

On 4/6/19 7:44 AM, Emanuel Berg wrote:
> Paul Hoffman wrote:
> 
>> #!/bin/zsh
>> find-duplicates () {
>>     (( # > 0 )) || set -- *(.N)
>>     local dups=0
>>     md5sum $@ | sort | uniq -c |
>>     grep -qv '^  *1 ' | wc -l | read dups
>>     (( dups == 0 )) && echo "no duplicates"
>> }
> 
> Cool, but doesn't seem to work?
> 

Forgot to ignore the second field.

    #!/bin/zsh
    find-duplicates () {
         (( # > 0 )) || set -- *(.N)
         local dups=0
         md5sum $@ | sort | awk '{ print $2,$1 }' | uniq -c -f1 | \
                     grep -v '^  *1 ' | wc -l | read dups
         (( dups == 0 )) && echo "no duplicates"
    }



* Re: find duplicate files
  2019-04-06 19:11     ` zv
@ 2019-04-06 19:42       ` Emanuel Berg
  2019-04-08 14:37         ` Paul Hoffman
  2019-04-07 11:16       ` Charles Blake
  1 sibling, 1 reply; 22+ messages in thread
From: Emanuel Berg @ 2019-04-06 19:42 UTC (permalink / raw)
  To: zsh-users

zv wrote:
 
>> Cool, but doesn't seem to work?
>> 
>
> Forgot to ignore the second field.
>
>     #!/bin/zsh
>     find-duplicates () {
>          (( # > 0 )) || set -- *(.N)
>          local dups=0
>          md5sum $@ | sort | awk '{ print $2,$1 }' | uniq -c -f1 | \
>                      grep -v '^  *1 ' | wc -l | read dups
>          (( dups == 0 )) && echo "no duplicates"
>     }

Still nothing :)

What's with the third line BTW? Looks like
a comment to me and my editor interprets it that
way as well.

But changing it to $# doesn't help :(

-- 
underground experts united
http://user.it.uu.se/~embe8573



* Re: find duplicate files
  2019-04-06 19:11     ` zv
  2019-04-06 19:42       ` Emanuel Berg
@ 2019-04-07 11:16       ` Charles Blake
  2019-04-07 21:32         ` Bart Schaefer
  1 sibling, 1 reply; 22+ messages in thread
From: Charles Blake @ 2019-04-07 11:16 UTC (permalink / raw)
  To: zsh-users


Several optimization points bear mentioning.  The performance impact of
each depends on the nature of your file collection.  Taken together
these optimizations may compound to do thousands of times less work;
in my experience it is rare for them to add no benefit at all.

Zeroeth, you may want to be careful of zero length files which are all
identical, but also take up little to no space beyond their pathname.
These may or may not be of interest in your specific use case, but are
common enough to warrant mentioning.  Another zeroth order concern is
file identity.  I-node/file numbers are better here than path names.
Some separate "map out hard links" utility is better for those.

First, and most importantly, the file size itself acts as a weak hash
that is maintained by the OS/FS.  You do not need to hash or compare
files with different sizes since a different size already tells you the
content is different.  For many file collections, the size itself is a
strong enough hash which makes the running time very similar to the time
of "find".

Your files could all be identical sizes in which case this does not
help, but this is a very cheap preprocessing step, often barely more
expensive than computing a set of files of interest.  It results in a
set of clusters of possible duplicates which can then be tested more
precisely.  In my peculiar file sets I got a speed-up of 10x .. 100x
from this preprocessing step.
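
As a rough illustration of that prefilter (a sketch, not part of the
original mail; `wc -c` stands in for a proper stat call, and file names
containing newlines are not handled):

```shell
# Print only the files whose byte size collides with some other file's
# size -- the candidates that still need a content hash.
dup_size_candidates () {
    local f
    for f in "$@"; do
        printf '%s\t%s\n' "$(wc -c < "$f")" "$f"
    done | awk -F'\t' '{ n[$1]++; size[NR] = $1; name[NR] = $2 }
                       END { for (i = 1; i <= NR; i++)
                                 if (n[size[i]] > 1) print name[i] }'
}
```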

Second, if you have a fast IO system (e.g., your data all fits in RAM)
then time to strongly hash likely dominates the calculation, but that
also parallelizes easily.  In my situations, I found parallel speed-up
to be better if you just hash over a full list of files in any maybe-
duplicated cluster than trying to only hash "cluster by cluster", but
I mostly was finding clusters small compared to my CPU core count.

Larger clusters or more need for as-you-go progress reporting could
alter the desirability of this aspect of hashing parallelization.
The speed-up of this step depends on how many CPU cores you have and
how fast hashing is relative to IO.  For well-cached-in-RAM file sets
it can be on the order of a factor of 10.

Third, in theory, even strong crypto hashes have a small but non-zero
chance of a collision.  So, you may need to deal with the possibility
of false positives anyway.  If you have a way to deal with a small false
positive rate then you may not need to strongly hash the WHOLE file.
If any well-defined subrange of files (first or last 64 KiB, say) is
effective at distinguishing files already known to be the same size
then limiting hashing to that range saves a factor of average-file-
size/limit-size (in both IO and hashing time - so, a similar factor
whichever is your throughput bottleneck).
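
That limit is cheap to express in the shell; a sketch (64 KiB assumed;
equal partial digests are only candidates and still need a full check):

```shell
# Digest only the first 64 KiB of a file.  Files that differ only past
# that point will (intentionally) collide, so re-check matches later.
partial_hash () {
    head -c 65536 < "$1" | md5sum | cut -d' ' -f1
}
```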

There is some interaction between hash-curtailing and parallelization
in that very small limits can make the bottleneck become non-hashing
costs like copying data/task switching, but for file sets with large
files these are mostly independent optimizations.  There can also be
interaction between spinning-rust disk seeking and parallelization.

Putting these all together, you can get a prototype program (in Python,
not zsh) that looks like this (should work on any Python version).
--
from sys      import argv, stderr as err, stdout as out, exc_info
from os       import environ, listdir, stat, lstat ; E = environ.get
from hashlib  import *  # md5, sha1, sha256, sha512, blake2b, etc.
#This is the command-interface; Could be fancier
roots = argv[1:] if len(argv) > 1 else [ "." ]
DIGEST = eval(E("DIGEST", "sha1"))  #Can be any algo exported by hashlib
MIN    =  int(E("MIN"   , "1"))     #Exclude File Smaller Than This (Bytes)
MAX    =  int(E("MAX"   , "-1"))    #Limit IO/Digest work to first MAX Bytes
WASTED = E("WASTED", "")            #Set to non-Empty to get a wasted report
from multiprocessing import Pool, cpu_count  #$PAR controls max parallelism

from stat import S_ISDIR, S_ISREG
def findR(ino2st_paths, root, keep, follow):
    prefix = root + "/"             #Recursive body of find()
    for entry in listdir(root):
        if entry == "." or entry == "..": continue
        path = prefix + entry
        try: st = stat(path) if follow else lstat(path)
        except OSError:
            err.write("findR: %s: %s\n" % (root, exc_info()[1]))
            continue
        if S_ISDIR(st.st_mode):     #Dir ==> Recurse
            findR(ino2st_paths, path, keep, follow)
        if keep(st.st_mode):
            try   : ino2st_paths[st.st_ino].append( path )
            except: ino2st_paths[st.st_ino] = [ st, path ]

def find(roots, keep=S_ISREG, follow=False):
    ino2st_paths = {}
    for root in roots: findR(ino2st_paths, root, keep, follow)
    return ino2st_paths

def Digest(path):
    try: return path, DIGEST(open(path, "rb").read(MAX)).digest()
    except:
        err.write("Cannot digest "+ path + "\n")
        return path, ""
## THE MAIN EVENT ##
path2size = {}                      #path -> size to avoid stat() calls
size2paths = {}                     #size -> paths of *first* hardlinks
for sp in find(roots).values():     #find() gives { ino,[stat,path1,..] }
    if sp[0].st_size < MIN:
        continue
    for p in sp[1:]:
        path2size[p] = sp[0].st_size
    try   : size2paths[sp[0].st_size].append(sp[1]) #only append 1st hardlink
    except: size2paths[sp[0].st_size] = [ sp[1] ]   #which is sp[1]; see findR

size_coll = [ ]                     #Build worklist of files w/sz collides.
for paths in size2paths.values():   #Could Digest() here, but get better
    if len(paths) > 1:              #..parallel scale-up to do worklist all
        size_coll.extend(paths)     #..at once if most colls are "few-way".

digest = {}         #Build path -> digest
if MAX == 0:
    for path in size_coll: digest[path] = path2size[path]
else:               #Hashing can be slow; Do most work in a proc pool
    pool = Pool(int(E("PAR", cpu_count())))
    for (path, d) in pool.imap_unordered(Digest, size_coll):
        digest[path] = d + str(path2size[path]).encode()

paths = {}                          #Invert digest[path] to get collisions
for path in size_coll:
    try   : paths[ digest[path] ].append(path)  #Duplicate digest detected
    except: paths[ digest[path] ] = [ path ]    #First occurrence of digest

ans = []; sz = 0
for paths in paths.values():        #Finally, print sorted answer. Also
    if len(paths) > 1:              #..sort rows internally by mtime.
        ans.append("\t".join(sorted(paths, key=lambda p: stat(p).st_mtime)))
        sz += (len(paths) - 1) * (path2size[paths[0]])
ans.sort()                          #sort alphabetically by first hardlink
out.write("\n".join(ans)); out.write("\n")
if WASTED: out.write("Wasted Space From Duplicates: %d bytes\n" % sz)
--

To re-connect topically with Zsh, you can study the impact of the file
size "weak hash prefix" idea by saving the above program as "dups" and
then doing:
    diff =(dups) =(MAX=0 dups)
or
    diff =(dups) =(MAX=65536 dups)
on a file collection using Zsh's process substitution to study effects of
these ideas.

Alternatively, some ambitious Python hater/Zsh lover could try to re-write
the above with Zsh associative arrays, xargs -P, temp files and the like.
I'm not sure I would recommend that.  A C re-write would likely do better
at minimizing hash curtailing and parallelizing interactions or detecting
when disk seeks cause thrashing and you want to disable parallel IO.


* Re: find duplicate files
  2019-04-07 11:16       ` Charles Blake
@ 2019-04-07 21:32         ` Bart Schaefer
  2019-04-08 11:17           ` Charles Blake
  0 siblings, 1 reply; 22+ messages in thread
From: Bart Schaefer @ 2019-04-07 21:32 UTC (permalink / raw)
  To: Charles Blake; +Cc: Zsh Users

On Sun, Apr 7, 2019 at 4:19 AM Charles Blake <charlechaud@gmail.com> wrote:
>
>
> Zeroeth, you may want to be careful of zero length files which are all
> identical, but also take up little to no space beyond their pathname.

This is just **/*(.L+0) in zsh (the dot is to ignore directories).

> Another zeroth order concern is
> file identity.  I-node/file numbers are better here than path names.

You can get all these numbers reasonably fast (unless some sort of
networked storage is involved) with

zmodload zsh/stat
zstat -A stats **/*(.L+0)

Of course if you're talking really huge numbers of files you might not
want to collect this all at once.

> First, and most importantly, the file size itself acts as a weak hash

That's also snarfled up by zstat -A.  It doesn't really take any
longer to get the file size from stat than it does the inode, so
unless you're NOT going to consider linked files as duplicates you
might as well just compare sizes.  (It would be faster to get inodes
from readdir, but there's no shell-level readdir.)  More on this
later.

> Second, if you have a fast IO system (eg., your data all fits in RAM)
> then time to strongly hash likely dominates the calculation, but that
> also parallelizes easily.

You're probably not going to beat built-in lightweight threading and
shared memory with shell process-level "threading", so if you have a
large number of files that are the same size but with different
contents then something like python or C might be the way to go.

> Third, in theory, even strong crypto hashes have a small but non-zero
> chance of a collision.  So, you may need to deal with the possibility
> of false positives anyway.

This should be vanishingly small if the sizes are the same?  False
positives can be checked by using "cmp -s" which just compares the
files byte-by-byte until it finds a difference, but this is a
pretty hefty penalty on the true positives.
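
That confirmation step is a one-liner; a sketch (not from the original
mail) for two files whose checksums matched:

```shell
# Re-check a checksum match byte-by-byte; cmp -s stops at the first
# differing byte, so a hash false positive is often detected quickly.
confirm_duplicate () {
    if cmp -s "$1" "$2"; then
        echo "duplicate: $1 $2"
    else
        echo "false positive: $1 $2"
    fi
}
```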

Anyway, here's zsh code; I haven't dealt with the files having strange
characters in the names that might prevent them being easily used as
hash keys, that's left as an exercise for somebody who has files with
strange names:

zmodload zsh/stat
zstat -nA stats **/*(.L+0)
# Every stat struct has 15 elements, so we pick out every 15th
names=( ${(e):-'${stats['{1..${#stats}..15}']}'} ) # name is element 1
sizes=( ${(e):-'${stats['{9..${#stats}..15}']}'} ) # size is element 9
# Zip the two arrays to make a mapping
typeset -A clusters sizemap=( ${names:^sizes} )
# Compute clusters of same-sized files
for i in {1..$#sizes}
do
  same=( ${(k)sizemap[(R)$sizes[i]]} )
  (( $#same > 0 )) || continue
  # Delete entries we've seen so we don't find them again
  unset 'sizemap['${^same}']'
  (( $#same > 1 )) && clusters[$sizes[i]]=${(@qq)same}
done
# Calculate checksums by cluster and report duplicates
typeset -A sums
for f in ${(v)clusters}
do
  # Inner loop could be put in a function and outer loop replaced by zargs -P
  for sum blocks file in $( eval cksum $f )
  do
    if (( ${+sums[$sum.$blocks]} ))
    then print Duplicate: ${sums[$sum.$blocks]} $file
    else sums[$sum.$blocks]=$file
    fi
  done
done

I find that a LOT more understandable than the python code.


* Re: find duplicate files
  2019-04-07 21:32         ` Bart Schaefer
@ 2019-04-08 11:17           ` Charles Blake
  2019-04-08 17:14             ` Bart Schaefer
  0 siblings, 1 reply; 22+ messages in thread
From: Charles Blake @ 2019-04-08 11:17 UTC (permalink / raw)
  To: Bart Schaefer; +Cc: Zsh Users


>I find that a LOT more understandable than the python code.

Understandability is, of course, somewhat subjective (e.g. some might say
every 15th field is unclear relative to a named label), but the Python
also does more, and would be quite a bit simpler without any parallelism,
while the Zsh code would be more complex with a set-up for it.  So, it's
a somewhat apples
& oranges/strawman comparison.  No offense intended.  I don't dislike your
solution or a simpler Python one.  Simplicity is good.  As elaborated below,
everything sort of "just depends".

Some hash functions are really fast..some IO systems really slow, etc.
To really do the parallel IO/hashing right (be nice to competition both
from self & programs on the system) you probably want to start with a
parallelism of 1 CPU and dynamically adjust based on CPU utilization
(say after each file, via getrusage in some parallel thread coordinator).
Such would be more complex still, but as we both mentioned will be more
efficient in a lower level language.


>This should be vanishingly small if the sizes are the same?

Of course, if you have a hash with no collision attack or no reason to
suspect a filesystem collision attack, then yes, a vanishingly small
probability.  I did not mean to overemphasize the risk as much as use it
to introduce the "IO/hash truncated" optimization idea trading off some
false positive risk with performance.  I personally used the report for
rare-case manual intervention, not automated response.  So, cmp at the
end makes total sense.  I don't think there is any disagreement here.

Another motivation on the truncated IO/hash optimization that may be
underappreciated is that if your report is empty then you are done.
False negatives don't happen.  That might well be the common case after
some situation is cleaned up and you want a "quick check" that there are
no duplicates or if you otherwise "suspect but do not know" no duplicates
for some file set.  If that quick check fails then you can do a slower,
more precise job.  With large files or slow IO this can help a lot.
Call it an 'is it possible there's a duplicate' operation if you will.

Partial hashes can also be a jumping off point to detection of "corrupt"
or near-duplicate files.  The many ways corruption can happen generates
a list of questions/approaches which is off topic.  One way it can happen
is non-size-changing scribbling over/non-population of just a few bytes
which does relate to partial hashing, though obviously it's not perfectly
tuned for such an application.


>I haven't dealt with the files having strange characters in the names
>that might prevent them being easily used as hash keys, that's left as
>an exercise

Part of my purpose writing the Python was indeed being able to deal with
weird path names which I also generally try hard to avoid, but sometimes
it's beyond one's control/authority.  I almost added a user parameter for
the hard TAB and NEWLINE chars used for output delimiters, but tried to
keep it simple for exposition.  With utf8 these days...truly general
filenames are decidedly a PITA.


>unless you're NOT going to consider linked files as duplicates you
>might as well just compare sizes.  (It would be faster to get inodes

It may be underappreciated that handling hard-link identity also
lets you skip recomputing hashes over a hard link cluster which must be the
same.  (Or re-doing the IO if the time between hitting that i-node is long
enough to lose buffer caching.)  One thinkable, possibly even automatic,
resolution of duplicates is conversion into hard links.  If there was a
lot of duplication resolved this way then this would save time on a "second
pass" over such file hierarchies.  I'm just defending that admittedly
likely minor optimization (which just wasn't so hard to add in Python).

Most claims in this whole space depend upon application context in some
way.  Some are hard, like files being updated during the duplicate scan.
Yeah..You can look at time stamps, and maybe re-compute things, but ick.
If things are changing who knows what version the user wants assessed?
Duplicates are probably rare.  Large trees of heavily hard-linked files
are probably a much smaller optimization than using file size as a weak
hash to organize the calculation.  You can probably find a hash function
faster than your IO.  Or maybe you have files on some quad m.2 NVMe raid
setup delivering 13 gigabytes/sec and it is harder to match that with a
crypto hash function.  Almost everything you say needs a "probably/maybe"
qualifier.  I don't think you disagree.  I'm just elaborating a little
for passers by.

P.S.:
    Two of my side-comments got auto-wrapped at 76 text columns.  Should
be an easy fix for anyone cutting & pasting that python code.  Sorry!


* Re: find duplicate files
  2019-04-06 19:42       ` Emanuel Berg
@ 2019-04-08 14:37         ` Paul Hoffman
  2019-04-08 14:58           ` Ray Andrews
  2019-04-08 21:26           ` Emanuel Berg
  0 siblings, 2 replies; 22+ messages in thread
From: Paul Hoffman @ 2019-04-08 14:37 UTC (permalink / raw)
  To: zsh-users

On Sat, Apr 06, 2019 at 09:42:51PM +0200, Emanuel Berg wrote:
> zv wrote:
>  
> >> Cool, but doesn't seem to work?
> >> 
> >
> > Forgot to ignore the second field.

Oops, my bad!

> >     #!/bin/zsh
> >     find-duplicates () {
> >          (( # > 0 )) || set -- *(.N)
> >          local dups=0
> >          md5sum $@ | sort | awk '{ print $2,$1 }' | uniq -c -f1 | \
> >                      grep -v '^  *1 ' | wc -l | read dups
> >          (( dups == 0 )) && echo "no duplicates"
> >     }
> 
> Still nothing :)

What were you expecting?  It exits with status 0 if there were no 
duplicates; otherwise, it exits with status 1.

> What's with the third line BTW? Looks like
> a comment to me and my editor interprets it that
> way as well.
> 
> But changing it to $# doesn't help :(

This might make it clearer:

    (( n == 0 )) || echo "n = $n, expected 0"
    (( # == 0 )) || echo "# = $#, expected 0"

# is the name of a parameter (a "parameter" is zsh's term for a variable).  Its 
value is the number of positional parameters, i.e., the number of 
elements in $argv.  Within (( ... )) you don't need the dollar sign 
before a variable name, but (for the most part) it doesn't hurt to use 
it, and so you can write (( # > 0 )) or (( $# > 0 )) or (( #argv > 0 )) 
or (( $#argv > 0 )) all with the same effect.
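
A quick demonstration (the `$#` spelling below also works in other
Bourne-style shells; the bare `#` form is zsh-only):

```shell
# $# is the number of positional parameters.
count_args () {
    if (( $# > 0 )); then
        echo "got $# args"
    else
        echo "got none"
    fi
}
```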

Paul.

-- 
Paul Hoffman <nkuitse@nkuitse.com>


* Re: find duplicate files
  2019-04-08 14:37         ` Paul Hoffman
@ 2019-04-08 14:58           ` Ray Andrews
  2019-04-08 15:14             ` Volodymyr Khomchak
                               ` (4 more replies)
  2019-04-08 21:26           ` Emanuel Berg
  1 sibling, 5 replies; 22+ messages in thread
From: Ray Andrews @ 2019-04-08 14:58 UTC (permalink / raw)
  To: zsh-users

Pardon a comment from the peanut gallery, but it seems strange to me 
that the issue of duplicate files would be something that would not have 
been solved definitively  50 years ago.  I'd have thought that for 
donkey's years there would be utilities that would let you find 
duplicates at various levels of rigour because it's so obviously an 
issue that's bound to come up.  No?



* Re: find duplicate files
  2019-04-08 14:58           ` Ray Andrews
@ 2019-04-08 15:14             ` Volodymyr Khomchak
  2019-04-08 15:24             ` Peter Stephenson
                               ` (3 subsequent siblings)
  4 siblings, 0 replies; 22+ messages in thread
From: Volodymyr Khomchak @ 2019-04-08 15:14 UTC (permalink / raw)
  To: zsh-users

Why wouldn't a simple find work for this purpose?
This example can do what you need:

    find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d |
        xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum |
        sort | uniq -w32 --all-repeated=separate

If you want something dedicated then look at the *fdupes* utility.

On 4/8/19 5:58 PM, Ray Andrews wrote:
> Pardon a comment from the peanut gallery, but it seems strange to me that the issue of duplicate files would be 
> something that would not have been solved definitively  50 years ago.  I'd have thought that for donkey's years there 
> would be utilities that would let you find duplicates at various levels of rigour because it's so obviously an issue 
> that's bound to come up.  No?
>


* Re: find duplicate files
  2019-04-08 14:58           ` Ray Andrews
  2019-04-08 15:14             ` Volodymyr Khomchak
@ 2019-04-08 15:24             ` Peter Stephenson
  2019-04-08 15:32             ` Andrew J. Rech
                               ` (2 subsequent siblings)
  4 siblings, 0 replies; 22+ messages in thread
From: Peter Stephenson @ 2019-04-08 15:24 UTC (permalink / raw)
  To: zsh-users

On Mon, 2019-04-08 at 07:58 -0700, Ray Andrews wrote:
> Pardon a comment from the peanut gallery, but it seems strange to me 
> that the issue of duplicate files would be something that would not have 
> been solved definitively  50 years ago.  I'd have thought that for 
> donkey's years there would be utilities that would let you find 
> duplicates at various levels of rigour because it's so obviously an 
> issue that's bound to come up.  No?

Slightly philosophical answer.  (The right answer is probably that
actually there are such utilities but a lot of us don't know them.)
Nothing really to do with zsh.

One of the answers may be the way issues of this kind typically arise.
For example: you've copied a whole pile of files somewhere else to
rationalise an interface, or make a new project, or something like
that.  Then the problem isn't so much looking for individual files that
happen to be the same as looking for a bunch of stuff that got copied
together and could really be re-rationalised to avoid the copy.  But
often it's a modified copy so the files are similar but not quite the
same.

On the other hand, if you're looking not for piles of stuff that got
copied but shouldn't, but individual files that happen to be the same,
then you're quite likely to get lots of false positives with e.g.  files
based on a template that are logically different in the sense they are
generated with different arguments but happen to turn out the same
(that's just one random example off the top of my head) and you probably
wouldn't want to rationalise them down to one.

So the use case here maybe isn't quite as common as you might think.

pws




* Re: find duplicate files
  2019-04-08 14:58           ` Ray Andrews
  2019-04-08 15:14             ` Volodymyr Khomchak
  2019-04-08 15:24             ` Peter Stephenson
@ 2019-04-08 15:32             ` Andrew J. Rech
  2019-04-08 15:47             ` Oliver Kiddle
  2019-04-09  1:08             ` Jason L Tibbitts III
  4 siblings, 0 replies; 22+ messages in thread
From: Andrew J. Rech @ 2019-04-08 15:32 UTC (permalink / raw)
  To: Ray Andrews; +Cc: zsh-users

Not 50, but?:

https://github.com/jbruchon/jdupes

"jdupes is a program for identifying and taking actions upon duplicate
files."



> On Apr 8, 2019, at 10:58, Ray Andrews <rayandrews@eastlink.ca> wrote:
> 
> Pardon a comment from the peanut gallery, but it seems strange to me that the issue of duplicate files would be something that would not have been solved definitively  50 years ago.
> 



* Re: find duplicate files
  2019-04-08 14:58           ` Ray Andrews
                               ` (2 preceding siblings ...)
  2019-04-08 15:32             ` Andrew J. Rech
@ 2019-04-08 15:47             ` Oliver Kiddle
  2019-04-08 16:29               ` Ray Andrews
  2019-04-08 21:30               ` Emanuel Berg
  2019-04-09  1:08             ` Jason L Tibbitts III
  4 siblings, 2 replies; 22+ messages in thread
From: Oliver Kiddle @ 2019-04-08 15:47 UTC (permalink / raw)
  To: Ray Andrews; +Cc: zsh-users

Ray Andrews wrote:
> Pardon a comment from the peanut gallery, but it seems strange to me
> that the issue of duplicate files would be something that would not have
> been solved definitively 50 years ago.  I'd have thought that for
> donkey's years there would be utilities that would let you find
> duplicates at various levels of rigour because it's so obviously an
> issue that's bound to come up.  No?

There are multiple. I tend to use duff (DUplicate File Finder) for the
purpose. There's also samefile. And I've come across one or two others
over the years.

Oliver


* Re: find duplicate files
  2019-04-08 15:47             ` Oliver Kiddle
@ 2019-04-08 16:29               ` Ray Andrews
  2019-04-08 16:45                 ` Bart Schaefer
  2019-04-08 21:30               ` Emanuel Berg
  1 sibling, 1 reply; 22+ messages in thread
From: Ray Andrews @ 2019-04-08 16:29 UTC (permalink / raw)
  To: zsh-users

On 2019-04-08 8:47 a.m., Oliver Kiddle wrote:

Thanks all.

> There are multiple. I tend to use duff (DUplicate File Finder) for the
> purpose. There's also samefile. And I've come across one or two others
> over the years.
>
> Oliver

So then why do we need a new zsh-based solution?  Not that I'm 
complaining; maybe a 'zdupe' would be good to have, but my light reading 
of the thread suggests that this is essentially a plain vanilla search 
and existing solutions would be fine.  Mind you, I suppose there are so 
many things that might or might not be considered that custom solutions 
are sometimes needed, and for those custom code would be wanted.  I 
particularly note what Peter said about files that are logically 
different but coincidentally the same.




* Re: find duplicate files
  2019-04-08 16:29               ` Ray Andrews
@ 2019-04-08 16:45                 ` Bart Schaefer
  0 siblings, 0 replies; 22+ messages in thread
From: Bart Schaefer @ 2019-04-08 16:45 UTC (permalink / raw)
  To: Ray Andrews; +Cc: Zsh Users

On Mon, Apr 8, 2019 at 9:30 AM Ray Andrews <rayandrews@eastlink.ca> wrote:
>
> So then why do we need a new zsh based solution?

The most typical answer would be that you have a shell and the
standard utilities but not the ability or permission necessary to
install a new specialized utility.

Another answer is that it's an interesting algorithmic question and
demonstrates techniques that could be applied to other problems.

> particularly note what Peter said about files that are logically
> different but coincidentally the same.

Yes, if you run my sample script in an active zsh build tree it will
find a lot of those.


* Re: find duplicate files
  2019-04-08 11:17           ` Charles Blake
@ 2019-04-08 17:14             ` Bart Schaefer
  0 siblings, 0 replies; 22+ messages in thread
From: Bart Schaefer @ 2019-04-08 17:14 UTC (permalink / raw)
  To: Charles Blake; +Cc: Zsh Users

On Mon, Apr 8, 2019 at 4:18 AM Charles Blake <charlechaud@gmail.com> wrote:
>
> >I find that a LOT more understandable than the python code.
>
> Understandability is, of course, somewhat subjective (e.g. some might say
> every 15th field is unclear relative to a named label)

Yes, lack of multi-dimensional data structures is a limitation on the
shell implementation.

I could have done it this way:

names=( **/*(.l+0) )
zstat -tA stats $names
sizes=( ${(M)stats:#size *} )

I chose the other way so the name and size would be directly connected
in the stats array rather than rely on implicit ordering (to one of
your later points, bad things happen with the above if a file is
removed between generating the list of names and collecting the file
stats).

> >unless you're NOT going to consider linked files as duplicates you
> >might as well just compare sizes.  (It would be faster to get inodes
>
> It may have been underappreciated that handling hard-link identity also
> lets you skip recomputing hashes over a hard-link cluster

Yes, this could be used to reduce the number of names passed to
"cksum" or the equivalent.
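
For the archives, here is a minimal sketch of that inode-collapsing step
(my own illustration, not code from jdupes; it assumes all files live on
a single filesystem and have whitespace-free pathnames):

```shell
# Collapse hard-link clusters: keep one representative pathname per inode
# so each cluster is hashed only once.  Demo setup in a temp directory.
dir=$(mktemp -d) && cd "$dir"
echo data > file_a
ln file_a file_b              # hard link: same inode as file_a
echo other > file_c

reps=""
seen_inodes=""
for f in file_a file_b file_c; do
    ino=$(ls -i -- "$f" | awk '{print $1}')   # inode number of $f
    case " $seen_inodes " in
        *" $ino "*) ;;                        # another link to a kept file
        *) seen_inodes="$seen_inodes $ino"
           reps="$reps $f" ;;
    esac
done
echo "hash only:$reps"
```

With the demo input this should leave only file_a and file_c to be
hashed, since file_b is just another name for file_a's data.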

> Almost everything you say needs a "probably/maybe"
> qualifier.  I don't think you disagree.  I'm just elaborating a little
> for passers by.

Absolutely.  The flip side of this is that shells and utilities are
generally optimized for the average case, not for the extremes.


* Re: find duplicate files
  2019-04-08 14:37         ` Paul Hoffman
  2019-04-08 14:58           ` Ray Andrews
@ 2019-04-08 21:26           ` Emanuel Berg
  1 sibling, 0 replies; 22+ messages in thread
From: Emanuel Berg @ 2019-04-08 21:26 UTC (permalink / raw)
  To: zsh-users

Paul Hoffman wrote:

>> >     #!/bin/zsh
>> >     find-duplicates () {
>> >          (( # > 0 )) || set -- *(.N)
>> >          local dups=0
>> >          md5sum $@ | sort | awk '{ print $2,$1 }' | uniq -c -f1 | \
>> >                      grep -v '^  *1 ' | wc -l | read dups
>> >          (( dups == 0 )) && echo "no duplicates"
>> >     }
>> 
>> Still nothing :)
>
> What were you expecting? It exits with status
> 0 if there were no duplicates; otherwise, it
> exits with status 1.

OK! Yes, it works. Thank you. I put your name
next to it in the file [1]. Only I expected it
to tell what files are duplicates. Otherwise it
isn't so useful :)
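
For what it's worth, a small variant of that pipeline does print the
duplicated files, grouped by checksum (a sketch; -w and --all-repeated
are GNU uniq extensions, so this is not strictly portable):

```shell
# Like the earlier one-liner, but keep md5sum's output and let uniq print
# every member of each duplicate group (matching on the 32-char md5 prefix).
find_duplicates_verbose () {
    md5sum -- "$@" | sort | uniq -w32 --all-repeated=separate
}

# demo in a temp directory: a and b are duplicates, c is not
dir=$(mktemp -d) && cd "$dir"
echo same > a; echo same > b; echo diff > c
find_duplicates_verbose a b c
```

The demo should print the md5 lines for a and b only, with blank lines
separating any further duplicate groups.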

> # is the name of a variable (a "variable" is
> zsh terminology). Its value is the number of
> positional parameters, i.e., the number of
> elements in $argv. Within (( ... )) you don't
> need the dollar sign before a variable name,
> but (for the most part) it doesn't hurt to
> use it

OK, thanks. Better to use the dollar sign so
that it doesn't break the font-lock and look
like a comment.


[1] http://user.it.uu.se/~embe8573/conf/.zsh/files-fs

-- 
underground experts united
http://user.it.uu.se/~embe8573



* Re: find duplicate files
  2019-04-08 15:47             ` Oliver Kiddle
  2019-04-08 16:29               ` Ray Andrews
@ 2019-04-08 21:30               ` Emanuel Berg
  1 sibling, 0 replies; 22+ messages in thread
From: Emanuel Berg @ 2019-04-08 21:30 UTC (permalink / raw)
  To: zsh-users

Oliver Kiddle wrote:

> There are multiple. I tend to use duff
> (DUplicate File Finder) for the purpose.

duff(1) works great, thanks!

-- 
underground experts united
http://user.it.uu.se/~embe8573



* Re: find duplicate files
  2019-04-08 14:58           ` Ray Andrews
                               ` (3 preceding siblings ...)
  2019-04-08 15:47             ` Oliver Kiddle
@ 2019-04-09  1:08             ` Jason L Tibbitts III
  2019-04-09  1:28               ` Ray Andrews
  2019-04-09  9:28               ` Charles Blake
  4 siblings, 2 replies; 22+ messages in thread
From: Jason L Tibbitts III @ 2019-04-09  1:08 UTC (permalink / raw)
  To: Ray Andrews; +Cc: zsh-users

>>>>> "RA" == Ray Andrews <rayandrews@eastlink.ca> writes:

RA> I'd have thought that for donkey's years there would be utilities
RA> that would let you find duplicates at various levels of rigour
RA> because it's so obviously an issue that's bound to come up.

What makes you think that such utilities don't exist?  The fact that
people discuss pure zsh ways to do something doesn't say much about
whether non-zsh things exist.

For example, I've used "duff" for this.  http://duff.dreda.org/

 - J<


* Re: find duplicate files
  2019-04-09  1:08             ` Jason L Tibbitts III
@ 2019-04-09  1:28               ` Ray Andrews
  2019-04-09  9:28               ` Charles Blake
  1 sibling, 0 replies; 22+ messages in thread
From: Ray Andrews @ 2019-04-09  1:28 UTC (permalink / raw)
  To: zsh-users

On 2019-04-08 6:08 p.m., Jason L Tibbitts III wrote:
>
> RA> I'd have thought that for donkey's years there would be utilities
> RA> that would let you find duplicates at various levels of rigour
> RA> because it's so obviously an issue that's bound to come up.
>
> What makes you think that such utilities don't exist?
I thought that they do exist, and many have been named.
> The fact that
> people discuss pure zsh ways to do something doesn't say much about
> whether non-zsh things exist.
Sure, as an exercise or some customization, but in this case it seemed 
like there was a genuine need; the problem didn't seem exotic, so I 
thought I'd ask.  Looks to me like an official 'zdupe' is in order and 
90% already cooked.
>
> For example, I've used "duff" for this.  http://duff.dreda.org/

What a fine document.  But why is so much computer documentation so 
bloody awful?




* Re: find duplicate files
  2019-04-09  1:08             ` Jason L Tibbitts III
  2019-04-09  1:28               ` Ray Andrews
@ 2019-04-09  9:28               ` Charles Blake
  1 sibling, 0 replies; 22+ messages in thread
From: Charles Blake @ 2019-04-09  9:28 UTC (permalink / raw)
  To: Zsh Users


Apologies if my 70 line Python script post added to that confusion.
I thought it better to include working code rather than merely a list
of theoretical optimizations to consider in a pure Zsh implementation.
In my experience lately, availability of Python is greater than that
of either Zsh or a C compiler.  So, there is also that, I suppose.

In timings I just did, my Python was only 1.25x slower than 'duff' (in
MAX=-1/hash everything mode).  Most of the time is in the crypto hashing
which the Py just calls fast impls for.  So, that's not so surprising.
Volodymyr's pipeline took about 20X longer and Bart's pure Zsh took 10X
longer.  jdupes was 20X faster than the Python but missed 87% of my dups.
[ I have not tried to track down why..conceivably wonky pathnames are making
it abort early.  That seems the most likely culprit. ] The unix duff
seems better designed with a -0 option, FWIW.

One other optimization (not easily exhibited without a low level lang) is
to not use a cryptographic hash at all, but to use a super fast one and
always do the equivalent of a "cmp" only on matching hashes (or maybe a
slow hash only after a fast hash).  That is the jdupes approach.  I think
going parallel adds more value than that, though.
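
As a rough shell illustration of that two-stage idea (my own sketch, not
jdupes' code: cksum's CRC stands in for the fast hash, and
whitespace-free filenames are assumed):

```shell
# Stage 1: cheap CRC+size key from cksum(1) to find candidate pairs;
# stage 2: byte-for-byte cmp(1) only on candidates sharing a key.
dir=$(mktemp -d) && cd "$dir"
printf hello > x1; printf hello > x2; printf world > y

dups=$(
    for f in x1 x2 y; do cksum "$f"; done | sort -n |
    awk '{ key = $1 " " $2                 # CRC and byte size
           if (key == prev) print prevfile, $3
           prev = key; prevfile = $3 }' |
    while read -r a b; do
        cmp -s "$a" "$b" && echo "$a and $b are identical"
    done
)
echo "$dups"
```

Note this only pairs adjacent matches after the sort; clusters of three
or more equal files would need a little more bookkeeping in the awk.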

At least in server/laptop/desktop CPUs from around Intel's Nehalem onward,
just one core has been unable to saturate DIMM BW.  I usually get 2X-8X
more DIMM BW going multi-core.  So, even infinitely fast hashes don't
mean parallelism could not speed things by a big factor for RAM resident
or superfast IO backed sets of (equally sized) files.  At those speeds,
Python's cPickle BW would likely suck enough compared to hash/cmp-ing
that you'd need a C-like impl to max out your performance.
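
By way of illustration, even from the shell the hashing stage can be
fanned out across cores (a sketch; xargs -0 and -P are GNU/BSD
extensions rather than strict POSIX):

```shell
# Hash files with up to 4 parallel md5sum processes, then sort so that
# equal digests end up adjacent for the usual uniq/awk grouping.
dir=$(mktemp -d) && cd "$dir"
echo one > f1; echo two > f2; echo one > f3

hashes=$(printf '%s\0' f1 f2 f3 |
         xargs -0 -n 8 -P 4 md5sum |
         sort)
echo "$hashes"
```

With the demo files, f1 and f3 should come back with the same digest and
f2 with a different one, ready for the grouping step.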

This may *seem* to be drifting more off topic, but actually does speak to
Ray's original question.  Hardware has evolved enough over the last
50 years that a tuned implementation from yester-decade had concerns
different from a tuned implementation today.  RAM is giant compared to
(many) file sets, IO fast compared to RAM, CPU cores abundant.  Of the
5 solutions discussed (pure Zsh, Volodymyr's, Py, duff, jdupes), only my
Python one used parallelism to any good effect.  Then there is variation
in what optimization applies to what file sets..so, a good tool would
probably provide all of them.

So, even though there may be dozens to hundreds of such tools, there may
well *still* be room for some new tuned impl in a fast language taking all
the optimizations I've mentioned into account instead of just a subset,
where any single optimization might, depending upon deployment context,
make the interesting procedure "several to many" times faster. [ Maybe
some Rust person already did it.  They seem to have a lot of energy. ;-) ]
I agree the original question from Emanuel Berg was probably just unaware
of any tool, though, or asked out of curiosity.  Anyway, enough about the
most optimal approaches.  It was probably always too large a topic.

Cheers


end of thread, back to index

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-06  5:40 find duplicate files Emanuel Berg
2019-04-06 13:02 ` Paul Hoffman
2019-04-06 14:44   ` Emanuel Berg
2019-04-06 19:11     ` zv
2019-04-06 19:42       ` Emanuel Berg
2019-04-08 14:37         ` Paul Hoffman
2019-04-08 14:58           ` Ray Andrews
2019-04-08 15:14             ` Volodymyr Khomchak
2019-04-08 15:24             ` Peter Stephenson
2019-04-08 15:32             ` Andrew J. Rech
2019-04-08 15:47             ` Oliver Kiddle
2019-04-08 16:29               ` Ray Andrews
2019-04-08 16:45                 ` Bart Schaefer
2019-04-08 21:30               ` Emanuel Berg
2019-04-09  1:08             ` Jason L Tibbitts III
2019-04-09  1:28               ` Ray Andrews
2019-04-09  9:28               ` Charles Blake
2019-04-08 21:26           ` Emanuel Berg
2019-04-07 11:16       ` Charles Blake
2019-04-07 21:32         ` Bart Schaefer
2019-04-08 11:17           ` Charles Blake
2019-04-08 17:14             ` Bart Schaefer

zsh-users

Archives are clonable: git clone --mirror http://inbox.vuxu.org/zsh-users

Newsgroup available over NNTP:
	nntp://inbox.vuxu.org/vuxu.archive.zsh.users


AGPL code for this site: git clone https://public-inbox.org/ public-inbox