From: Bart Schaefer
Date: Sun, 7 Apr 2019 14:32:38 -0700
Subject: Re: find duplicate files
To: Charles Blake
Cc: Zsh Users

On Sun, Apr 7, 2019 at 4:19 AM Charles Blake wrote:
>
> Zeroeth, you may want to be careful of zero length files which are all
> identical, but also take up little to no space beyond their pathname.

This is just **/*(.L+0) in zsh (the dot is to ignore directories).

> Another zeroth order concern is file identity.  I-node/file numbers
> are better here than path names.

You can get all these numbers reasonably fast (unless some sort of
networked storage is involved) with

    zmodload zsh/stat
    zstat -A stats **/*(.L+0)

Of course if you're talking really huge numbers of files you might not
want to collect this all at once.

> First, and most importantly, the file size itself acts as a weak hash

That's also snarfled up by zstat -A.  It doesn't really take any longer
to get the file size from stat than it does the inode, so unless you're
NOT going to consider linked files as duplicates you might as well just
compare sizes.  (It would be faster to get inodes from readdir, but
there's no shell-level readdir.)  More on this later.

> Second, if you have a fast IO system (eg., your data all fits in RAM)
> then time to strongly hash likely dominates the calculation, but that
> also parallelizes easily.

You're probably not going to beat built-in lightweight threading and
shared memory with shell process-level "threading", so if you have a
large number of files that are the same size but with different contents
then something like python or C might be the way to go.

> Third, in theory, even strong crypto hashes have a small but non-zero
> chance of a collision.  So, you may need to deal with the possibility
> of false positives anyway.

This should be vanishingly small if the sizes are the same?
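For what shell process-level parallelism is worth, the hashing step
does fan out easily.  A sketch in portable shell rather than zsh
(assumes GNU or BSD xargs for -P; parallel_cksum is a made-up helper
name, and 16/4 are arbitrary batch and worker counts):

    # Checksum all non-empty regular files under $1, with several
    # cksum processes running at once.  -print0/-0 keep odd
    # filenames intact; -size +0c skips zero-length files.
    parallel_cksum() {
      find "$1" -type f -size +0c -print0 | xargs -0 -n 16 -P 4 cksum
    }

Each output line is "crc octets filename", so identical files show
identical first fields.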
False positives can be checked by using "cmp -s" which just compares
the files byte-by-byte until it finds a difference, but this is a
pretty hefty penalty on the true positives.

Anyway, here's zsh code; I haven't dealt with the files having strange
characters in the names that might prevent them being easily used as
hash keys, that's left as an exercise for somebody who has files with
strange names:

zmodload zsh/stat
zstat -nA stats **/*(.L+0)
# Every stat struct has 15 elements, so we pick out every 15th
names=( ${(e):-'${stats['{1..${#stats}..15}']}'} )   # name is element 1
sizes=( ${(e):-'${stats['{9..${#stats}..15}']}'} )   # size is element 9
# Zip the two arrays to make a mapping
typeset -A clusters sizemap=( ${names:^sizes} )
# Compute clusters of same-sized files
for i in {1..$#sizes}
do
  same=( ${(k)sizemap[(R)$sizes[i]]} )
  (( $#same > 0 )) || continue
  # Delete entries we've seen so we don't find them again
  unset 'sizemap['${^same}']'
  (( $#same > 1 )) && clusters[$sizes[i]]=${(@qq)same}
done
# Calculate checksums by cluster and report duplicates
typeset -A sums
for f in ${(v)clusters}
do
  # Inner loop could be put in a function and
  # outer loop replaced by zargs -P
  for sum blocks file in $( eval cksum $f )
  do
    if (( ${+sums[$sum.$blocks]} ))
    then
      print Duplicate: ${sums[$sum.$blocks]} $file
    else
      sums[$sum.$blocks]=$file
    fi
  done
done

I find that a LOT more understandable than the python code.
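And if you do want to rule out the collision case, the "cmp -s" check
is easy to bolt onto the report step.  A sketch in portable shell
(verify_dup is a made-up helper name):

    # Confirm a checksum match with a byte-for-byte comparison.
    # cmp -s is silent and exits 0 only when the files are identical,
    # so a hash collision can never produce a false "Duplicate" line.
    verify_dup() {
      if cmp -s -- "$1" "$2"; then
        printf 'Duplicate: %s %s\n' "$1" "$2"
      else
        printf 'collision (false positive): %s %s\n' "$1" "$2"
      fi
    }

In the loop above, that would replace the plain "print Duplicate: ..."
branch, at the cost of re-reading each pair of true duplicates.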