From: Paul Hoffman
To: zsh-users@zsh.org
Subject: Re: find duplicate files
Date: Sat, 6 Apr 2019 09:02:42 -0400

On Sat, Apr 06, 2019 at 07:40:59AM +0200, Emanuel Berg wrote:
> Is this any good? Can it be done lineary?
>
> TIA
>
> #! /bin/zsh
>
> find-duplicates () {
>     local -a files
>     [[ $# = 0 ]] && files=("${(@f)$(ls)}") || files=($@)
>
>     local dups=0
>
>     # files
>     local a
>     local b
>
>     for a in $files; do
>         for b in $files; do
>             if [[ $a != $b ]]; then
>                 diff $a $b > /dev/null
>                 if [[ $? = 0 ]]; then
>                     echo $a and $b are the same
>                     dups=1
>                 fi
>             fi
>         done
>     done
>     [[ $dups = 0 ]] && echo "no duplicates"
> }
> alias dups=find-duplicates

Your function runs diff on every ordered pair of files (so each pair
is compared twice) and keeps comparing even after it finds duplicates,
so its runtime is quadratic: O(N^2) in the worst case, where N is the
sum of file sizes.

Here's one that calculates MD5 checksums and compares those, and so is
O(N) + O(M log M), i.e., proportional to the sum of file sizes plus the
cost of sorting the checksums of the M files.

#!/bin/zsh
find-duplicates () {
    (( $# > 0 )) || set -- *(.N)
    local dups=0
    # Keep only the checksum column, so that identical files produce
    # identical lines; uniq -d then prints one line per repeated checksum.
    md5sum -- $@ | cut -d' ' -f1 | sort | uniq -d | wc -l | read dups
    (( dups == 0 )) && echo "no duplicates"
}

A better solution would use an associative array (local -A NAME),
would *not* sort, and would stop as soon as it found a duplicate, but
I'll leave that as an exercise for the reader.  :-)

Paul.

-- 
Paul Hoffman
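
The associative-array approach suggested above might look something
like the following. This is only a sketch, not the solution from the
thread: the function name first_duplicate is made up for illustration,
it assumes md5sum is installed, and it treats a matching MD5 checksum
as "same file" without a byte-for-byte check. It is written to run
under zsh and should also work in bash 4 or later.

```shell
#!/bin/zsh
# Hypothetical helper (name not from the thread): report the first two
# files whose MD5 checksums match, then stop.
# Returns 0 if a duplicate was found, 1 otherwise.
first_duplicate () {
    local -A seen       # checksum -> first file seen with that checksum
    local f sum rest
    for f in "$@"; do
        [[ -f $f ]] || continue     # skip anything that is not a regular file
        # Hash via stdin so the file name never appears in md5sum's output.
        read -r sum rest < <(md5sum < "$f") || continue
        if [[ -n ${seen[$sum]} ]]; then
            echo "${seen[$sum]} and $f have the same checksum"
            return 0                # stop at the first duplicate
        fi
        seen[$sum]=$f
    done
    echo "no duplicates"
    return 1
}
```

In zsh it could be called as "first_duplicate *(.N)" to check the
regular files in the current directory. There is no sort, and each
file is hashed at most once, so the work is proportional to the total
size of the files read before the first match. If a checksum collision
is a concern, the reported pair can be confirmed with cmp.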