* find duplicate files
@ 2019-04-06  5:40 Emanuel Berg
  2019-04-06 13:02 ` Paul Hoffman
  0 siblings, 1 reply; 22+ messages in thread

From: Emanuel Berg @ 2019-04-06 5:40 UTC (permalink / raw)
To: zsh-users

Is this any good? Can it be done linearly?

TIA

#! /bin/zsh

find-duplicates () {
    local -a files
    [[ $# = 0 ]] && files=("${(@f)$(ls)}") || files=($@)

    local dups=0

    # files
    local a
    local b

    for a in $files; do
        for b in $files; do
            if [[ $a != $b ]]; then
                diff $a $b > /dev/null
                if [[ $? = 0 ]]; then
                    echo $a and $b are the same
                    dups=1
                fi
            fi
        done
    done
    [[ $dups = 0 ]] && echo "no duplicates"
}
alias dups=find-duplicates

--
underground experts united
http://user.it.uu.se/~embe8573

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: find duplicate files
  2019-04-06  5:40 find duplicate files Emanuel Berg
@ 2019-04-06 13:02 ` Paul Hoffman
  2019-04-06 14:44   ` Emanuel Berg
  0 siblings, 1 reply; 22+ messages in thread

From: Paul Hoffman @ 2019-04-06 13:02 UTC (permalink / raw)
To: zsh-users

On Sat, Apr 06, 2019 at 07:40:59AM +0200, Emanuel Berg wrote:
> Is this any good? Can it be done linearly?
>
> TIA
>
> #! /bin/zsh
>
> find-duplicates () {
>     local -a files
>     [[ $# = 0 ]] && files=("${(@f)$(ls)}") || files=($@)
>
>     local dups=0
>
>     # files
>     local a
>     local b
>
>     for a in $files; do
>         for b in $files; do
>             if [[ $a != $b ]]; then
>                 diff $a $b > /dev/null
>                 if [[ $? = 0 ]]; then
>                     echo $a and $b are the same
>                     dups=1
>                 fi
>             fi
>         done
>     done
>     [[ $dups = 0 ]] && echo "no duplicates"
> }
> alias dups=find-duplicates

Your function keeps comparing even after it finds duplicates, so its
runtime will be O(N^2), i.e., proportional to the square of the sum of
file sizes (N).

Here's one that calculates MD5 checksums and compares those, and so is
O(N) + O(M^2), i.e., proportional to the sum of file sizes plus the
square of the number of files (M).

#!/bin/zsh
find-duplicates () {
    (( # > 0 )) || set -- *(.N)
    local dups=0
    md5sum $@ | sort | uniq -c |
        grep -qv '^ *1 ' | wc -l | read dups
    (( dups == 0 )) && echo "no duplicates"
}

A better solution would use an associative array (local -A NAME), would
*not* sort, and would stop as soon as it found a duplicate, but I'll
leave that as an exercise for the reader. :-)

Paul.

--
Paul Hoffman <nkuitse@nkuitse.com>

^ permalink raw reply	[flat|nested] 22+ messages in thread
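[Editor's sketch of the exercise left to the reader above: an
associative (digest -> first path) map, no sorting, stopping at the
first duplicate. Python rather than zsh is used here only for easy
illustration; the function name and whole-file MD5 are assumptions, not
anything from the thread.]

```python
import hashlib

def first_duplicate(paths):
    """Map each digest to the first file that produced it (the
    "associative array"), never sort, and stop as soon as a digest
    repeats.  Returns (earlier_path, duplicate_path) or None."""
    seen = {}
    for path in paths:
        with open(path, "rb") as f:
            digest = hashlib.md5(f.read()).hexdigest()
        if digest in seen:
            return seen[digest], path   # stop at the first duplicate
        seen[digest] = path
    return None
```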
* Re: find duplicate files
  2019-04-06 13:02 ` Paul Hoffman
@ 2019-04-06 14:44   ` Emanuel Berg
  2019-04-06 19:11     ` zv
  0 siblings, 1 reply; 22+ messages in thread

From: Emanuel Berg @ 2019-04-06 14:44 UTC (permalink / raw)
To: zsh-users

Paul Hoffman wrote:

> #!/bin/zsh
> find-duplicates () {
>     (( # > 0 )) || set -- *(.N)
>     local dups=0
>     md5sum $@ | sort | uniq -c |
>         grep -qv '^ *1 ' | wc -l | read dups
>     (( dups == 0 )) && echo "no duplicates"
> }

Cool, but doesn't seem to work?

--
underground experts united
http://user.it.uu.se/~embe8573

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: find duplicate files
  2019-04-06 14:44 ` Emanuel Berg
@ 2019-04-06 19:11   ` zv
  2019-04-06 19:42     ` Emanuel Berg
  2019-04-07 11:16     ` Charles Blake
  0 siblings, 2 replies; 22+ messages in thread

From: zv @ 2019-04-06 19:11 UTC (permalink / raw)
To: zsh-users

On 4/6/19 7:44 AM, Emanuel Berg wrote:
> Paul Hoffman wrote:
>
>> #!/bin/zsh
>> find-duplicates () {
>>     (( # > 0 )) || set -- *(.N)
>>     local dups=0
>>     md5sum $@ | sort | uniq -c |
>>         grep -qv '^ *1 ' | wc -l | read dups
>>     (( dups == 0 )) && echo "no duplicates"
>> }
>
> Cool, but doesn't seem to work?

Forgot to ignore the second field.

#!/bin/zsh
find-duplicates () {
    (( # > 0 )) || set -- *(.N)
    local dups=0
    md5sum $@ | sort | awk '{ print $2,$1 }' | uniq -c -f1 | \
        grep -v '^ *1 ' | wc -l | read dups
    (( dups == 0 )) && echo "no duplicates"
}

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: find duplicate files
  2019-04-06 19:11 ` zv
@ 2019-04-06 19:42   ` Emanuel Berg
  2019-04-08 14:37     ` Paul Hoffman
  1 sibling, 1 reply; 22+ messages in thread

From: Emanuel Berg @ 2019-04-06 19:42 UTC (permalink / raw)
To: zsh-users

zv wrote:

>> Cool, but doesn't seem to work?
>>
>
> Forgot to ignore the second field.
>
> #!/bin/zsh
> find-duplicates () {
>     (( # > 0 )) || set -- *(.N)
>     local dups=0
>     md5sum $@ | sort | awk '{ print $2,$1 }' | uniq -c -f1 | \
>         grep -v '^ *1 ' | wc -l | read dups
>     (( dups == 0 )) && echo "no duplicates"
> }

Still nothing :)

What's with the third line BTW? Looks like a comment to me and my
editor interprets it that way as well.

But changing it to $# doesn't help :(

--
underground experts united
http://user.it.uu.se/~embe8573

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: find duplicate files
  2019-04-06 19:42 ` Emanuel Berg
@ 2019-04-08 14:37   ` Paul Hoffman
  2019-04-08 14:58     ` Ray Andrews
  2019-04-08 21:26     ` Emanuel Berg
  1 sibling, 2 replies; 22+ messages in thread

From: Paul Hoffman @ 2019-04-08 14:37 UTC (permalink / raw)
To: zsh-users

On Sat, Apr 06, 2019 at 09:42:51PM +0200, Emanuel Berg wrote:
> zv wrote:
>
> >> Cool, but doesn't seem to work?
> >>
> >
> > Forgot to ignore the second field.

Oops, my bad!

> > #!/bin/zsh
> > find-duplicates () {
> >     (( # > 0 )) || set -- *(.N)
> >     local dups=0
> >     md5sum $@ | sort | awk '{ print $2,$1 }' | uniq -c -f1 | \
> >         grep -v '^ *1 ' | wc -l | read dups
> >     (( dups == 0 )) && echo "no duplicates"
> > }
>
> Still nothing :)

What were you expecting? It exits with status 0 if there were no
duplicates; otherwise, it exits with status 1.

> What's with the third line BTW? Looks like a comment to me and my
> editor interprets it that way as well.
>
> But changing it to $# doesn't help :(

This might make it clearer:

    (( n == 0 )) || echo "n = $n, expected 0"
    (( # == 0 )) || echo "# = $#, expected 0"

# is the name of a variable (a "variable" is zsh terminology). Its
value is the number of positional parameters, i.e., the number of
elements in $argv. Within (( ... )) you don't need the dollar sign
before a variable name, but (for the most part) it doesn't hurt to use
it, and so you can write

    (( # > 0 ))
or
    (( $# > 0 ))
or
    (( #argv > 0 ))
or
    (( $#argv > 0 ))

all with the same effect.

Paul.

--
Paul Hoffman <nkuitse@nkuitse.com>

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: find duplicate files
  2019-04-08 14:37 ` Paul Hoffman
@ 2019-04-08 14:58   ` Ray Andrews
  2019-04-08 15:14     ` Volodymyr Khomchak
  ` (4 more replies)
  1 sibling, 5 replies; 22+ messages in thread

From: Ray Andrews @ 2019-04-08 14:58 UTC (permalink / raw)
To: zsh-users

Pardon a comment from the peanut gallery, but it seems strange to me
that the issue of duplicate files would be something that would not
have been solved definitively 50 years ago. I'd have thought that for
donkey's years there would be utilities that would let you find
duplicates at various levels of rigour because it's so obviously an
issue that's bound to come up. No?

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: find duplicate files
  2019-04-08 14:58 ` Ray Andrews
@ 2019-04-08 15:14   ` Volodymyr Khomchak
  2019-04-08 15:24   ` Peter Stephenson
  ` (3 subsequent siblings)
  4 siblings, 0 replies; 22+ messages in thread

From: Volodymyr Khomchak @ 2019-04-08 15:14 UTC (permalink / raw)
To: zsh-users

Why wouldn't simple find work for this purpose? This example can do
what you need:

find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d |
    xargs -I{} -n1 find -type f -size {}c -print0 |
    xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate

If you want something dedicated then look at the *fdupes* utility.

On 4/8/19 5:58 PM, Ray Andrews wrote:
> Pardon a comment from the peanut gallery, but it seems strange to me
> that the issue of duplicate files would be something that would not
> have been solved definitively 50 years ago. I'd have thought that for
> donkey's years there would be utilities that would let you find
> duplicates at various levels of rigour because it's so obviously an
> issue that's bound to come up. No?

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: find duplicate files
  2019-04-08 14:58 ` Ray Andrews
  2019-04-08 15:14   ` Volodymyr Khomchak
@ 2019-04-08 15:24   ` Peter Stephenson
  2019-04-08 15:32   ` Andrew J. Rech
  ` (2 subsequent siblings)
  4 siblings, 0 replies; 22+ messages in thread

From: Peter Stephenson @ 2019-04-08 15:24 UTC (permalink / raw)
To: zsh-users

On Mon, 2019-04-08 at 07:58 -0700, Ray Andrews wrote:
> Pardon a comment from the peanut gallery, but it seems strange to me
> that the issue of duplicate files would be something that would not
> have been solved definitively 50 years ago. I'd have thought that for
> donkey's years there would be utilities that would let you find
> duplicates at various levels of rigour because it's so obviously an
> issue that's bound to come up. No?

Slightly philosophical answer. (The right answer is probably that
actually there are such utilities but a lot of us don't know them.)
Nothing really to do with zsh.

One of the answers may be the way issues of this kind typically arise.
For example: you've copied a whole pile of files somewhere else to
rationalise an interface, or make a new project, or something like
that. Then the problem isn't so much looking for individual files that
happen to be the same as looking for a bunch of stuff that got copied
together and could really be re-rationalised to avoid the copy. But
often it's a modified copy, so the files are similar but not quite the
same.

On the other hand, if you're looking not for piles of stuff that got
copied but shouldn't have been, but for individual files that happen to
be the same, then you're quite likely to get lots of false positives
with e.g. files based on a template that are logically different in the
sense that they are generated with different arguments but happen to
turn out the same (that's just one random example off the top of my
head), and you probably wouldn't want to rationalise them down to one.

So the use case here maybe isn't quite as common as you might think.

pws

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: find duplicate files
  2019-04-08 14:58 ` Ray Andrews
  2019-04-08 15:14   ` Volodymyr Khomchak
  2019-04-08 15:24   ` Peter Stephenson
@ 2019-04-08 15:32   ` Andrew J. Rech
  2019-04-08 15:47   ` Oliver Kiddle
  2019-04-09  1:08   ` Jason L Tibbitts III
  4 siblings, 0 replies; 22+ messages in thread

From: Andrew J. Rech @ 2019-04-08 15:32 UTC (permalink / raw)
To: Ray Andrews; +Cc: zsh-users

Not 50, but:

https://github.com/jbruchon/jdupes

"jdupes is a program for identifying and taking actions upon duplicate
files."

> On Apr 8, 2019, at 10:58, Ray Andrews <rayandrews@eastlink.ca> wrote:
>
> Pardon a comment from the peanut gallery, but it seems strange to me
> that the issue of duplicate files would be something that would not
> have been solved definitively 50 years ago.

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: find duplicate files
  2019-04-08 14:58 ` Ray Andrews
  ` (2 preceding siblings ...)
  2019-04-08 15:32 ` Andrew J. Rech
@ 2019-04-08 15:47 ` Oliver Kiddle
  2019-04-08 16:29   ` Ray Andrews
  2019-04-08 21:30   ` Emanuel Berg
  4 siblings, 2 replies; 22+ messages in thread

From: Oliver Kiddle @ 2019-04-08 15:47 UTC (permalink / raw)
To: Ray Andrews; +Cc: zsh-users

Ray Andrews wrote:
> Pardon a comment from the peanut gallery, but it seems strange to me
> that the issue of duplicate files would be something that would not
> have been solved definitively 50 years ago. I'd have thought that for
> donkey's years there would be utilities that would let you find
> duplicates at various levels of rigour because it's so obviously an
> issue that's bound to come up. No?

There are multiple. I tend to use duff (DUplicate File Finder) for the
purpose. There's also samefile. And I've come across one or two others
over the years.

Oliver

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: find duplicate files
  2019-04-08 15:47 ` Oliver Kiddle
@ 2019-04-08 16:29   ` Ray Andrews
  2019-04-08 16:45     ` Bart Schaefer
  1 sibling, 1 reply; 22+ messages in thread

From: Ray Andrews @ 2019-04-08 16:29 UTC (permalink / raw)
To: zsh-users

On 2019-04-08 8:47 a.m., Oliver Kiddle wrote:

Thanks all.

> There are multiple. I tend to use duff (DUplicate File Finder) for the
> purpose. There's also samefile. And I've come across one or two others
> over the years.
>
> Oliver

So then why do we need a new zsh-based solution? Not that I'm
complaining; maybe 'zdupe' would be good to have, but my light reading
of the thing is that this is essentially a plain-vanilla search and
existing solutions would be fine. Mind, I suppose there are so many
things that might or might not be considered that maybe custom
solutions are needed, and of course for that custom code would be
wanted. I particularly note what Peter said about files that are
logically different but coincidentally the same.

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: find duplicate files
  2019-04-08 16:29 ` Ray Andrews
@ 2019-04-08 16:45   ` Bart Schaefer
  0 siblings, 0 replies; 22+ messages in thread

From: Bart Schaefer @ 2019-04-08 16:45 UTC (permalink / raw)
To: Ray Andrews; +Cc: Zsh Users

On Mon, Apr 8, 2019 at 9:30 AM Ray Andrews <rayandrews@eastlink.ca> wrote:
>
> So then why do we need a new zsh based solution?

The most typical answer would be that you have a shell and the standard
utilities but not the ability or permission necessary to install a new
specialized utility. Another answer is that it's an interesting
algorithmic question and demonstrates techniques that could be applied
to other problems.

> particularly note what Peter said about files that are logically
> different but coincidentally the same.

Yes, if you run my sample script in an active zsh build tree it will
find a lot of those.

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: find duplicate files
  2019-04-08 15:47 ` Oliver Kiddle
  2019-04-08 16:29   ` Ray Andrews
@ 2019-04-08 21:30   ` Emanuel Berg
  1 sibling, 0 replies; 22+ messages in thread

From: Emanuel Berg @ 2019-04-08 21:30 UTC (permalink / raw)
To: zsh-users

Oliver Kiddle wrote:

> There are multiple. I tend to use duff (DUplicate File Finder) for
> the purpose.

duff(1) works great, thanks!

--
underground experts united
http://user.it.uu.se/~embe8573

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: find duplicate files
  2019-04-08 14:58 ` Ray Andrews
  ` (3 preceding siblings ...)
  2019-04-08 15:47 ` Oliver Kiddle
@ 2019-04-09  1:08 ` Jason L Tibbitts III
  2019-04-09  1:28   ` Ray Andrews
  2019-04-09  9:28   ` Charles Blake
  4 siblings, 2 replies; 22+ messages in thread

From: Jason L Tibbitts III @ 2019-04-09 1:08 UTC (permalink / raw)
To: Ray Andrews; +Cc: zsh-users

>>>>> "RA" == Ray Andrews <rayandrews@eastlink.ca> writes:

RA> I'd have thought that for donkey's years there would be utilities
RA> that would let you find duplicates at various levels of rigour
RA> because it's so obviously an issue that's bound to come up.

What makes you think that such utilities don't exist? The fact that
people discuss pure zsh ways to do something doesn't say much about
whether non-zsh things exist.

For example, I've used "duff" for this. http://duff.dreda.org/

 - J<

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: find duplicate files
  2019-04-09  1:08 ` Jason L Tibbitts III
@ 2019-04-09  1:28   ` Ray Andrews
  0 siblings, 0 replies; 22+ messages in thread

From: Ray Andrews @ 2019-04-09 1:28 UTC (permalink / raw)
To: zsh-users

On 2019-04-08 6:08 p.m., Jason L Tibbitts III wrote:
>
> RA> I'd have thought that for donkey's years there would be utilities
> RA> that would let you find duplicates at various levels of rigour
> RA> because it's so obviously an issue that's bound to come up.
>
> What makes you think that such utilities don't exist?

I thought that they do exist, and many have been named.

> The fact that people discuss pure zsh ways to do something doesn't
> say much about whether non-zsh things exist.

Sure, as an exercise or some customization, but in this case it seemed
like there was a genuine need, and the problem didn't seem exotic, so I
thought I'd ask. Looks to me like an official 'zdupe' is in order and
90% already cooked.

> For example, I've used "duff" for this. http://duff.dreda.org/

What a fine document. But why is so much computer documentation so
bloody awful?

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: find duplicate files
  2019-04-09  1:08 ` Jason L Tibbitts III
  2019-04-09  1:28   ` Ray Andrews
@ 2019-04-09  9:28   ` Charles Blake
  1 sibling, 0 replies; 22+ messages in thread

From: Charles Blake @ 2019-04-09 9:28 UTC (permalink / raw)
To: Zsh Users

[-- Attachment #1: Type: text/plain, Size: 3003 bytes --]

Apologies if my 70-line Python script post added to that confusion.
I thought it better to include working code rather than merely a list
of theoretical optimizations to consider in a pure Zsh implementation.
In my experience lately, availability of Python is greater than that of
either Zsh or a C compiler. So, there is also that, I suppose.

In timings I just did, my Python was only 1.25x slower than 'duff' (in
MAX=-1/hash-everything mode). Most of the time is in the crypto
hashing, which the Python just delegates to fast implementations, so
that's not so surprising. Volodymyr's pipeline took about 20x longer
and Bart's pure Zsh took 10x longer. jdupes was 20x faster than the
Python but missed 87% of my dups. [I have not tried to track down why;
conceivably wonky pathnames made it abort early. That seems the most
likely culprit.] The Unix duff seems better designed, with a -0 option,
FWIW.

One other optimization (not easily exhibited without a low-level
language) is to not use a cryptographic hash at all, but to use a super
fast one and always do the equivalent of a "cmp" only on matching
hashes (or maybe a slow hash only after a fast hash). That is the
jdupes approach. I think going parallel adds more value than that,
though. At least in server/laptop/desktop CPUs from around Intel's
Nehalem onward, just one core has been unable to saturate DIMM
bandwidth. I usually get 2x-8x more DIMM bandwidth going multi-core.
So, even infinitely fast hashes don't mean parallelism could not speed
things up by a big factor for RAM-resident or super-fast-IO-backed sets
of (equally sized) files.

At those speeds, Python's cPickle bandwidth would likely suck enough
compared to hash/cmp-ing that you'd need a C-like implementation to max
out your performance.

This may *seem* to be drifting more off topic, but actually does speak
to Ray's original question. Hardware has evolved enough over the last
50 years that a tuned implementation from yester-decade had concerns
different from a tuned implementation today. RAM is giant compared to
(many) file sets, IO fast compared to RAM, CPU cores abundant. Of the
five solutions discussed (pure Zsh, Volodymyr's, Python, duff, jdupes),
only my Python one used parallelism to any good effect. Then there is
variation in which optimization applies to which file sets; so, a good
tool would probably provide all of them.

So, even though there may be dozens to hundreds of such tools, there
may well *still* be room for some new tuned implementation in a fast
language taking all the optimizations I've mentioned into account
instead of just a subset, where any single optimization might,
depending upon deployment context, make the interesting procedure
"several to many" times faster. [Maybe some Rust person already did it.
They seem to have a lot of energy. ;-)]

I agree the original question from Emanuel Berg was probably just asked
unaware of any tool, though, or out of curiosity. Anyway, enough about
the most optimal approaches. It was probably always too large a topic.

Cheers

^ permalink raw reply	[flat|nested] 22+ messages in thread
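[Editor's sketch of the "super fast hash first, cmp only on matching
hashes" optimization described above, in the same Python vein as the
thread's script. crc32 stands in for whatever fast hash jdupes actually
uses (an assumption, not its documented choice), and the function name
is invented for illustration.]

```python
import filecmp, os, zlib
from collections import defaultdict

def duplicate_pairs(paths):
    """Three-stage filter: group by size (the free weak hash), then by
    a fast non-cryptographic hash (crc32), then confirm survivors with
    a full byte-by-byte comparison so hash collisions can never yield
    false positives."""
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)
    pairs = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue                      # unique size => unique content
        by_crc = defaultdict(list)
        for p in same_size:
            with open(p, "rb") as f:
                by_crc[zlib.crc32(f.read())].append(p)
        for group in by_crc.values():
            for i in range(len(group)):
                for j in range(i + 1, len(group)):
                    # the "cmp" step: cheap hash matched, now be exact
                    if filecmp.cmp(group[i], group[j], shallow=False):
                        pairs.append((group[i], group[j]))
    return pairs
```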
* Re: find duplicate files
  2019-04-08 14:37 ` Paul Hoffman
  2019-04-08 14:58   ` Ray Andrews
@ 2019-04-08 21:26   ` Emanuel Berg
  1 sibling, 0 replies; 22+ messages in thread

From: Emanuel Berg @ 2019-04-08 21:26 UTC (permalink / raw)
To: zsh-users

Paul Hoffman wrote:

>> > #!/bin/zsh
>> > find-duplicates () {
>> >     (( # > 0 )) || set -- *(.N)
>> >     local dups=0
>> >     md5sum $@ | sort | awk '{ print $2,$1 }' | uniq -c -f1 | \
>> >         grep -v '^ *1 ' | wc -l | read dups
>> >     (( dups == 0 )) && echo "no duplicates"
>> > }
>>
>> Still nothing :)
>
> What were you expecting? It exits with status 0 if there were no
> duplicates; otherwise, it exits with status 1.

OK! Yes, it works. Thank you. I put your name next to it in the
file [1].

Only I expected it to tell what files are duplicates. Otherwise it
isn't so useful :)

> # is the name of a variable (a "variable" is zsh terminology). Its
> value is the number of positional parameters, i.e., the number of
> elements in $argv. Within (( ... )) you don't need the dollar sign
> before a variable name, but (for the most part) it doesn't hurt to
> use it

OK, thanks. Better to use the dollar sign so that it doesn't break the
font lock and look like a comment.

[1] http://user.it.uu.se/~embe8573/conf/.zsh/files-fs

--
underground experts united
http://user.it.uu.se/~embe8573

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: find duplicate files
  2019-04-06 19:11 ` zv
  2019-04-06 19:42   ` Emanuel Berg
@ 2019-04-07 11:16   ` Charles Blake
  2019-04-07 21:32     ` Bart Schaefer
  1 sibling, 1 reply; 22+ messages in thread

From: Charles Blake @ 2019-04-07 11:16 UTC (permalink / raw)
To: zsh-users

[-- Attachment #1: Type: text/plain, Size: 7485 bytes --]

Several optimization points bear mentioning. The performance impact of
all of these depends upon the nature of your file collection. Taken
together these optimizations may compound to do thousands of times less
work. Adding no benefit at all is rare in my experience.

Zeroeth, you may want to be careful of zero-length files, which are all
identical but also take up little to no space beyond their pathname.
These may or may not be of interest in your specific use case, but are
common enough to warrant mentioning. Another zeroth-order concern is
file identity. I-node/file numbers are better here than path names.
A separate "map out hard links" utility is better for those.

First, and most importantly, the file size itself acts as a weak hash
that is maintained by the OS/FS. You do not need to hash or compare
files with different sizes, since a different size already tells you
the content is different. For many file collections the size itself is
a strong enough hash, which makes the running time very similar to that
of "find". Your files could all be identical sizes, in which case this
does not help, but this is a very cheap preprocessing step, often
barely more expensive than computing the set of files of interest. It
results in a set of clusters of possible duplicates which can then be
tested more precisely. In my peculiar file sets I got a speed-up of
10x..100x from this preprocessing step.

Second, if you have a fast IO system (e.g., your data all fits in RAM)
then time to strongly hash likely dominates the calculation, but that
also parallelizes easily. In my situations, I found parallel speed-up
to be better if you just hash over a full list of files in any
maybe-duplicated cluster rather than trying to only hash "cluster by
cluster", but I mostly was finding clusters small compared to my CPU
core count. Larger clusters, or more need for as-you-go progress
reporting, could alter the desirability of this aspect of hashing
parallelization. The speed-up of this step depends on how many CPU
cores you have and how fast hashing is relative to IO. For
well-cached-in-RAM file sets it can be on the order of a factor of 10.

Third, in theory, even strong crypto hashes have a small but non-zero
chance of a collision. So, you may need to deal with the possibility of
false positives anyway. If you have a way to deal with a small false
positive rate then you may not need to strongly hash the WHOLE file. If
any well-defined subrange of files (first or last 64 KiB, say) is
effective at distinguishing files already known to be the same size,
then limiting hashing to that range saves a factor of
average-file-size/limit-size (in both IO and hashing time; so, a
similar factor whichever is your throughput bottleneck). There is some
interaction between hash-curtailing and parallelization in that very
small limits can make the bottleneck become non-hashing costs like
copying data/task switching, but for file sets with large files these
are mostly independent optimizations. There can also be interaction
between spinning-rust disk seeking and parallelization.

Putting these all together, you can get a prototype program (in Python,
not zsh) that looks like this (should work on any Python version).
--
from sys import argv, stderr as err, stdout as out, exc_info
from os import environ, listdir, stat, lstat ; E = environ.get
from hashlib import *  # md5, sha1, sha256, sha512, blake2b, etc.

#This is the command-interface; Could be fancier
roots  = argv[1:] if len(argv) > 1 else [ "." ]
DIGEST = eval(E("DIGEST", "sha1")) #Can be any algo exported by hashlib
MIN    = int(E("MIN" , "1"))   #Exclude File Smaller Than This (Bytes)
MAX    = int(E("MAX" , "-1"))  #Limit IO/Digest work to first MAX Bytes
WASTED = E("WASTED", "")       #Set to non-Empty to get a wasted report
from multiprocessing import Pool, cpu_count #$PAR controls parallelism
from stat import S_ISDIR, S_ISREG

def findR(ino2st_paths, root, keep, follow):
    prefix = root + "/"                     #Recursive body of find()
    for entry in listdir(root):
        if entry == "." or entry == "..": continue
        path = prefix + entry
        try:
            st = stat(path) if follow else lstat(path)
        except OSError:
            err.write("findR: %s: %s\n" % (root, exc_info()[1][-1]))
            continue
        if S_ISDIR(st.st_mode):             #Dir ==> Recurse
            findR(ino2st_paths, path, keep, follow)
        if keep(st.st_mode):
            try   : ino2st_paths[st.st_ino].append( path )
            except: ino2st_paths[st.st_ino] = [ st, path ]

def find(roots, keep=S_ISREG, follow=False):
    ino2st_paths = {}
    for root in roots:
        findR(ino2st_paths, root, keep, follow)
    return ino2st_paths

def Digest(path):
    try:
        return path, DIGEST(open(path, "rb").read(MAX)).digest()
    except:
        err.write("Cannot digest " + path + "\n")
        return path, ""

## THE MAIN EVENT ##
path2size  = {}    #path -> size to avoid stat() calls
size2paths = {}    #size -> paths of *first* hardlinks
for sp in find(roots).values(): #find() gives { ino, [stat,path1,..] }
    if sp[0].st_size < MIN: continue
    for p in sp[1:]:
        path2size[p] = sp[0].st_size
    try   : size2paths[sp[0].st_size].append(sp[1]) #only append 1st..
    except: size2paths[sp[0].st_size] = [ sp[1] ]   #..hardlink; findR

size_coll = [ ]                   #Build worklist of files w/sz collides.
for paths in size2paths.values(): #Could Digest() here, but get better
    if len(paths) > 1:            #..parallel scale-up to do worklist all
        size_coll.extend(paths)   #..at once if most colls are "few-way".

digest = {}                       #Build path -> digest
if MAX == 0:
    for path in size_coll:
        digest[path] = path2size[path]
else:   #Hashing can be slow; Do most work in a proc pool
    pool = Pool(int(E("PAR", cpu_count())))
    for (path, d) in pool.imap_unordered(Digest, size_coll):
        digest[path] = d + bytes(path2size[path])

paths = {}                        #Invert digest[path] to get collisions
for path in size_coll:
    try   : paths[ digest[path] ].append(path) #Duplicate digest detected
    except: paths[ digest[path] ] = [ path ]   #First occurrence of digest

ans = []; sz = 0
for paths in paths.values():      #Finally, print sorted answer. Also
    if len(paths) > 1:            #..sort rows internally by mtime.
        ans.append("\t".join(sorted(paths,
                                    key=lambda p: stat(p).st_mtime)))
        sz += (len(paths) - 1) * (path2size[paths[0]])
ans.sort()                        #sort alphabetically by first hardlink
out.write("\n".join(ans)); out.write("\n")
if WASTED:
    out.write("Wasted Space From Duplicates: %d bytes\n" % sz)
--

To re-connect topically with Zsh, you can study the impact of the file
size "weak hash prefix" idea by saving the above program as "dups" and
then doing:

    diff =(dups) =(MAX=0 dups)
or
    diff =(dups) =(MAX=65536 dups)

on a file collection, using Zsh's process substitution to study the
effects of these ideas. Alternatively, some ambitious Python hater/Zsh
lover could try to re-write the above with Zsh associative arrays,
xargs -P, temp files and the like. I'm not sure I would recommend that.
A C re-write would likely do better at minimizing hash-curtailing and
parallelizing interactions, or at detecting when disk seeks cause
thrashing and you want to disable parallel IO.

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: find duplicate files
  2019-04-07 11:16 ` Charles Blake
@ 2019-04-07 21:32   ` Bart Schaefer
  2019-04-08 11:17     ` Charles Blake
  0 siblings, 1 reply; 22+ messages in thread

From: Bart Schaefer @ 2019-04-07 21:32 UTC (permalink / raw)
To: Charles Blake; +Cc: Zsh Users

On Sun, Apr 7, 2019 at 4:19 AM Charles Blake <charlechaud@gmail.com> wrote:
>
> Zeroeth, you may want to be careful of zero length files which are all
> identical, but also take up little to no space beyond their pathname.

This is just **/*(.l+0) in zsh (the dot is to ignore directories).

> Another zeroth order concern is file identity. I-node/file numbers
> are better here than path names.

You can get all these numbers reasonably fast (unless some sort of
networked storage is involved) with

    zmodload zsh/stat
    zstat -A stats **/*(.l+0)

Of course if you're talking really huge numbers of files you might not
want to collect this all at once.

> First, and most importantly, the file size itself acts as a weak hash

That's also snarfled up by zstat -A. It doesn't really take any longer
to get the file size from stat than it does the inode, so unless
you're NOT going to consider linked files as duplicates you might as
well just compare sizes. (It would be faster to get inodes from
readdir, but there's no shell-level readdir.) More on this later.

> Second, if you have a fast IO system (eg., your data all fits in RAM)
> then time to strongly hash likely dominates the calculation, but that
> also parallelizes easily.

You're probably not going to beat built-in lightweight threading and
shared memory with shell process-level "threading", so if you have a
large number of files that are the same size but with different
contents then something like python or C might be the way to go.

> Third, in theory, even strong crypto hashes have a small but non-zero
> chance of a collision. So, you may need to deal with the possibility
> of false positives anyway.

This should be vanishingly small if the sizes are the same? False
positives can be checked by using "cmp -s", which just compares the
files byte-by-byte until it finds a difference, but this is a pretty
hefty penalty on the true positives.

Anyway, here's zsh code; I haven't dealt with the files having strange
characters in the names that might prevent them being easily used as
hash keys, that's left as an exercise for somebody who has files with
strange names:

zmodload zsh/stat
zstat -nA stats **/*(.l+0)
# Every stat struct has 15 elements, so we pick out every 15th
names=( ${(e):-'${stats['{1..${#stats}..15}']}'} ) # name is element 1
sizes=( ${(e):-'${stats['{9..${#stats}..15}']}'} ) # size is element 9
# Zip the two arrays to make a mapping
typeset -A clusters sizemap=( ${names:^sizes} )
# Compute clusters of same-sized files
for i in {1..$#sizes}
do
    same=( ${(k)sizemap[(R)$sizes[i]]} )
    (( $#same > 0 )) || continue
    # Delete entries we've seen so we don't find them again
    unset 'sizemap['${^same}']'
    (( $#same > 1 )) && clusters[$sizes[i]]=${(@qq)same}
done
# Calculate checksums by cluster and report duplicates
typeset -A sums
for f in ${(v)clusters}
do
    # Inner loop could be put in a function and outer loop replaced by
    # zargs -P
    for sum blocks file in $( eval cksum $f )
    do
        if (( ${+sums[$sum.$blocks]} ))
        then
            print Duplicate: ${sums[$sum.$blocks]} $file
        else
            sums[$sum.$blocks]=$file
        fi
    done
done

I find that a LOT more understandable than the python code.

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: find duplicate files
  2019-04-07 21:32 ` Bart Schaefer
@ 2019-04-08 11:17 ` Charles Blake
  2019-04-08 17:14   ` Bart Schaefer
  0 siblings, 1 reply; 22+ messages in thread

From: Charles Blake @ 2019-04-08 11:17 UTC (permalink / raw)
To: Bart Schaefer; +Cc: Zsh Users

[-- Attachment #1: Type: text/plain, Size: 4784 bytes --]

>I find that a LOT more understandable than the python code.

Understandability is, of course, somewhat subjective (e.g. some might
say every 15th field is unclear relative to a named label), but the
python also does do more and would be quite a bit simpler without any
parallelism, and the Zsh code more complex with a set-up for it.  So
it's a somewhat apples & oranges/strawman comparison.  No offense
intended.  I don't dislike your solution or a simpler Python one.
Simplicity is good.  As elaborated below, everything sort of "just
depends".  Some hash functions are really fast.. some IO systems really
slow, etc.

To really do the parallel IO/hashing right (be nice to competition both
from self & other programs on the system) you probably want to start
with a parallelism of 1 CPU and dynamically adjust based on CPU
utilization (say, after each file, via getrusage in some parallel
thread coordinator).  Such would be more complex still, but as we both
mentioned will be more efficient in a lower-level language.

>This should be vanishingly small if the sizes are the same?

Of course, if you have a hash with no collision attack or no reason to
suspect a filesystem collision attack, then yes, a vanishingly small
probability.  I did not mean to overemphasize the risk so much as use
it to introduce the "IO/hash truncated" optimization idea, trading off
some false positive risk against performance.  I personally used the
report for rare-case manual intervention, not automated response.  So
cmp at the end makes total sense.  I don't think there is any
disagreement here.
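[The "parallelizes easily" point can be sketched in a few lines of
Python — a minimal illustration, not the thread's actual program; the
names file_sha256 and hash_all are invented here.  Per-file hashing is
independent, so files can be fanned out across workers:]

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def file_sha256(path):
    """Strong-hash one file in 1 MiB chunks so memory stays bounded."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return path, h.hexdigest()

def hash_all(paths, workers=4):
    """Hash many files concurrently; each file is an independent task,
    which is what makes this step embarrassingly parallel."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(file_sha256, paths))
```

[Threads suffice when IO dominates; for CPU-bound hashing a process
pool, or the lower-level language mentioned above, would be the next
step.]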
Another motivation on the truncated IO/hash optimization that may be
underappreciated is that if your report is empty then you are done.
False negatives don't happen.  That might well be the common case after
some situation is cleaned up and you want a "quick check" that there
are no duplicates, or if you otherwise "suspect but do not know" there
are no duplicates for some file set.  If that quick check fails then
you can do a slower, more precise job.  With large files or slow IO
this can help a lot.  Call it an 'is it possible there's a duplicate'
operation if you will.

Partial hashes can also be a jumping-off point to detection of
"corrupt" or near-duplicate files.  The many ways corruption can happen
generate a list of questions/approaches which is off topic.  One way it
can happen is non-size-changing scribbling over/non-population of just
a few bytes, which does relate to partial hashing, though obviously
it's not perfectly tuned for such an application.

>I haven't dealt with the files having strange characters in the names
>that might prevent them being easily used as hash keys, that's left as
>an exercise

Part of my purpose in writing the Python was indeed being able to deal
with weird path names, which I also generally try hard to avoid, but
sometimes it's beyond one's control/authority.  I almost added a user
parameter for the hard TAB and NEWLINE chars used for output
delimiters, but tried to keep it simple for exposition.  With utf8
these days... truly general filenames are decidedly a PITA.

>unless you're NOT going to consider linked files as duplicates you
>might as well just compare sizes.  (It would be faster to get inodes

What may have been underappreciated is that handling hard-link identity
also lets you skip recomputing hashes over a hard-link cluster, which
must be the same.  (Or re-doing the IO, if the time between hitting
that i-node is long enough to lose buffer caching.)  One thinkable,
possibly even automatic, resolution of duplicates is conversion into
hard links.
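[The "quick check" above — no false negatives, possible false positives
— can be sketched as follows; a minimal Python illustration with an
invented name, maybe_duplicates, hashing only the first few KiB of each
file:]

```python
import hashlib
from collections import defaultdict

def maybe_duplicates(paths, prefix_bytes=4096):
    """Cheap 'is a duplicate possible?' pass: hash only the first
    prefix_bytes of each file.  Files whose prefix hashes differ cannot
    be duplicates (no false negatives); files that collide here are
    only *candidates* and still need a full hash or cmp."""
    candidates = defaultdict(list)
    for p in paths:
        with open(p, "rb") as f:
            digest = hashlib.sha256(f.read(prefix_bytes)).hexdigest()
        candidates[digest].append(p)
    return [ps for ps in candidates.values() if len(ps) > 1]
```

[An empty return means the file set is certainly duplicate-free, at a
fraction of the IO of full hashing.]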
If there was a lot of duplication resolved this way then this would
save time on a "second pass" over such file hierarchies.  I'm just
defending that admittedly likely minor optimization (which just wasn't
so hard to add in Python).

Most claims in this whole space depend upon application context in some
way.  Some are hard, like files being updated during the duplicate
scan.  Yeah.. you can look at time stamps, and maybe re-compute things,
but ick.  If things are changing, who knows what version the user wants
assessed?  Duplicates are probably rare.  Large trees of heavily
hard-linked files are probably a much smaller optimization than using
file size as a weak hash to organize the calculation.  You can probably
find a hash function faster than your IO.  Or maybe you have files on
some quad m.2 NVMe raid setup delivering 13 gigabytes/sec and it is
harder to match that with a crypto hash function.  Almost everything
you say needs a "probably/maybe" qualifier.  I don't think you
disagree.  I'm just elaborating a little for passers-by.

P.S.: Two of my side-comments got auto-wrapped at 76 text columns.
Should be an easy fix for anyone cutting & pasting that python code.
Sorry!
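[The hard-link optimization being defended here can be sketched in a
few lines — a minimal Python illustration, not code from this thread;
the name unique_inodes is invented.  Paths sharing (st_dev, st_ino)
refer to the same bytes on disk, so only one representative per cluster
ever needs to be read and hashed:]

```python
import os

def unique_inodes(paths):
    """Collapse hard-link clusters.  Returns (representative, cluster)
    pairs: hash only the representative, report the whole cluster."""
    seen = {}
    for p in paths:
        st = os.stat(p)
        seen.setdefault((st.st_dev, st.st_ino), []).append(p)
    return [(ps[0], ps) for ps in seen.values()]
```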
* Re: find duplicate files
  2019-04-08 11:17 ` Charles Blake
@ 2019-04-08 17:14 ` Bart Schaefer
  0 siblings, 0 replies; 22+ messages in thread

From: Bart Schaefer @ 2019-04-08 17:14 UTC (permalink / raw)
To: Charles Blake; +Cc: Zsh Users

On Mon, Apr 8, 2019 at 4:18 AM Charles Blake <charlechaud@gmail.com> wrote:
>
> >I find that a LOT more understandable than the python code.
>
> Understandability is, of course, somewhat subjective (e.g. some might say
> every 15th field is unclear relative to a named label)

Yes, lack of multi-dimensional data structures is a limitation on the
shell implementation.  I could have done it this way:

    names=( **/*(.l+0) )
    zstat -tA stats $names
    sizes=( ${(M)stats:#size *} )

I chose the other way so the name and size would be directly connected
in the stats array rather than rely on implicit ordering (to one of
your later points, bad things happen with the above if a file is
removed between generating the list of names and collecting the file
stats).

> >unless you're NOT going to consider linked files as duplicates you
> >might as well just compare sizes.  (It would be faster to get inodes
>
> It may have been underappreciated is that handling hard-link identity also
> lets you skip recomputing hashes over a hard link cluster

Yes, this could be used to reduce the number of names passed to "cksum"
or the equivalent.

> Almost everything you say needs a "probably/maybe"
> qualifier.  I don't think you disagree.  I'm just elaborating a little
> for passers by.

Absolutely.  The flip side of this is that shells and utilities are
generally optimized for the average case, not for the extremes.
end of thread, other threads:[~2019-04-09  9:30 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-06  5:40 find duplicate files Emanuel Berg
2019-04-06 13:02 ` Paul Hoffman
2019-04-06 14:44   ` Emanuel Berg
2019-04-06 19:11     ` zv
2019-04-06 19:42       ` Emanuel Berg
2019-04-08 14:37   ` Paul Hoffman
2019-04-08 14:58     ` Ray Andrews
2019-04-08 15:14       ` Volodymyr Khomchak
2019-04-08 15:24       ` Peter Stephenson
2019-04-08 15:32         ` Andrew J. Rech
2019-04-08 15:47       ` Oliver Kiddle
2019-04-08 16:29         ` Ray Andrews
2019-04-08 16:45           ` Bart Schaefer
2019-04-08 21:30             ` Emanuel Berg
2019-04-09  1:08             ` Jason L Tibbitts III
2019-04-09  1:28               ` Ray Andrews
2019-04-09  9:28                 ` Charles Blake
2019-04-08 21:26       ` Emanuel Berg
2019-04-07 11:16 ` Charles Blake
2019-04-07 21:32   ` Bart Schaefer
2019-04-08 11:17     ` Charles Blake
2019-04-08 17:14       ` Bart Schaefer
Code repositories for project(s) associated with this public inbox:
https://git.vuxu.org/mirror/zsh/