From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (qmail 14653 invoked by alias); 28 Mar 2014 20:26:18 -0000
Mailing-List: contact zsh-users-help@zsh.org; run by ezmlm
Precedence: bulk
X-No-Archive: yes
List-Id: Zsh Users List
List-Post:
List-Help:
X-Seq: 18686
Received: (qmail 24511 invoked from network); 28 Mar 2014 20:26:12 -0000
X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on f.primenet.com.au
X-Spam-Level:
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00,RCVD_IN_DNSWL_NONE autolearn=ham version=3.3.2
X-Originating-IP: [86.6.157.246]
X-Spam: 0
X-Authority: v=2.1 cv=Bp0TwOn5 c=1 sm=1 tr=0 a=BvYiZ/UW0Fmn8Wufq9dPrg==:117 a=BvYiZ/UW0Fmn8Wufq9dPrg==:17 a=NLZqzBF-AAAA:8 a=7s3Jj7Ix0b0A:10 a=uObrxnre4hsA:10 a=kj9zAlcOel0A:10 a=-GJzvSM7Un3BdIQz0vAA:9 a=_QqPrWYvwRP9CHxy:21 a=0dEoEIMme29EPAUL:21 a=CjuIK1q_8ugA:10 a=_dQi-Dcv4p4A:10
Date: Fri, 28 Mar 2014 20:20:37 +0000
From: Peter Stephenson
To: Zsh-Users List
Subject: Re: Compare two (or more) filenames and return what is common between them
Message-ID: <20140328202037.779c3436@pws-pc.ntlworld.com>
In-Reply-To: <20140318202309.4d830a8b@pws-pc.ntlworld.com>
References: <20140318202309.4d830a8b@pws-pc.ntlworld.com>
X-Mailer: Claws Mail 3.8.0 (GTK+ 2.24.7; x86_64-redhat-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

On Tue, 18 Mar 2014 20:23:09 +0000
Peter Stephenson wrote:
> The upshot is that for the input
>
> "One Two Nineteen"
> "One Two Three"
> "One Two Buckle My Shoe"
> "One Two Buckle My Belt"
> "One Three Four"
> "Two Three Sixteen"
> "Two Three Seventeen"
> "Three Forty Five"
>
> it prints
>
> Extracting common prefixes 'One Two Buckle My'...
> 'One Two Buckle My Shoe' goes in directory 'One Two Buckle My'
> 'One Two Buckle My Belt' goes in directory 'One Two Buckle My'
> Extracting common prefixes 'One Two', 'Two Three'...
> 'One Two Nineteen' goes in directory 'One Two'
> 'One Two Three' goes in directory 'One Two'
> 'Two Three Sixteen' goes in directory 'Two Three'
> 'Two Three Seventeen' goes in directory 'Two Three'
> Unmatched files:
> 'One Three Four'
> 'Three Forty Five'

Let me try to make this more adaptable by annotating it.

##start
# Sanitise the options in use for this function.
emulate -L zsh
# We'll need extendedglob for matching.
setopt extendedglob

local -a words match mbegin mend split restwords

# Here's what we're going to apply the algorithm to.
# You'd probably get these from a "*" or something similar.
words=(
  "One Two Nineteen"
  "One Two Three"
  "One Two Buckle My Shoe"
  "One Two Buckle My Belt"
  "One Three Four"
  "Two Three Sixteen"
  "Two Three Seventeen"
  "Three Forty Five"
)

# We'll use two associative arrays for storing results.
# $groups holds the initial prefixes we're going to group;
# they're stored in the keys of the hash for easy access.  We don't
# really need the value so we'll just stick 1 there so it has a
# non-zero length for testing.  $foundgroups has the same keys but we'll
# only stick something there if there are at least two matching names,
# i.e. it's a real group.  We could do this other ways, e.g. by
# sticking a count in $groups, but this way it was dead easy
# to get all the groups out later without looping over the array again.
typeset -A groups foundgroups

# We're going to use spaces in the file name to divide it into words
# so we match whole words only.  maxwords counts the max number of
# words in a file name --- in the example that's 5 in the case of
# "One Two Buckle My Shoe".  (I'm being very confusing here because
# I also use "word" to describe a complete element of the array
# "words", but that's what you get with buggy software from the net.)
integer maxwords

local word initial pat make

# First we'll count the space-delimited words in each input word.
for word in $words; do
  # We're not interested in anything from the first "." on.
  # Remove the longest trailing string beginning with a ".".
  initial=${word%%.*}
  # Split to space-separated words.
  split=(${=initial})
  # ${#split} is the number of such words in the, er, word.
  if (( ${#split} > maxwords )); then
    # So this is the largest number of space-separated words we've found.
    maxwords=${#split}
  fi
done

# Helper function to take an input word and extract its first "maxwords"
# space-separated words.
words_getinitial() {
  # Complete name passed in as argument.
  local word=$1

  # As before, remove anything that looks like a suffix.
  initial=${word%%.*}
  if (( maxwords > 1 )); then
    # This is a rather verbose way of matching space-separated words.
    # [[:blank:]] matches any character that is a space or tab, and
    # [^[:blank:]] any character that is not.  Appending ## says we want
    # any number of such characters that's at least one.  The (#b) says
    # parentheses are live, so I could refer to them later, but I'm not
    # using that feature so I should take it out :-).  (#cN) says match
    # the preceding group exactly N times.  We match it maxwords-1 times
    # because we're repeating a word followed by blanks; together with
    # the final word we get maxwords words.
    pat="(#b)(([^[:blank:]]##[[:blank:]]##)(#c$((maxwords-1)))([^[:blank:]]##))"
  else
    # Simple case where we're just matching one word.
    pat="(#b)([^[:blank:]]##)"
  fi
  # ${initial##${~pat}} says "remove the longest string matching $pat from
  # the start of $initial".  The ~ is so the pattern characters in $pat are
  # live, rather than just ordinary string characters.  By putting in the
  # (M) we get the matched string, rather than the original string with
  # the matched bit removed.  We match against $initial, not $word, so
  # that the suffix we stripped above stays stripped.
  initial=${(M)initial##${~pat}}
  # So at this point, $initial contains the initial $maxwords space-separated
  # words from the string passed in, ignoring suffixes.
}

# functions -T words_getinitial

# We're going to start by looking for the longest possible matches,
# then gradually decrease maxwords until it reaches 0 or we run out
# of words.
while (( maxwords && ${#words} )); do
  restwords=()
  groups=()
  foundgroups=()

  # For all test words (file names)
  for word in $words; do
    words_getinitial $word
    # $initial is the first $maxwords words from $word
    [[ -z $initial ]] && continue
    # Got something...
    if [[ -n $groups[$initial] ]]; then
      # ... for the second time, so record there's something to group.
      foundgroups[$initial]=1
    else
      # ... for the first time, remember it.
      groups[$initial]=1
    fi
  done

  if (( ${#foundgroups} )); then
    # Found some groups.  The group names to use are the keys of
    # $foundgroups.  Say what these are just for info.
    # The expression is just a smarmy way of joining the keys together
    # with a comma and a space.
    print "Extracting common prefixes '${(kj.', '.)foundgroups}'..."

    # Now see which words fit any of these groups.
    for word in $words; do
      words_getinitial $word
      # As before, $initial contains the initial $maxwords words from $word.
      if [[ -z $initial ]]; then
        # Nothing here, so stick this one in the remainder to handle later.
        restwords+=($word)
      elif [[ -n $foundgroups[$initial] ]]; then
        # Yes, this is one of the words to group.
        # In real life we'd stick it in the directory, making
        # sure the directory existed.  For now just report.
        print "'$word' goes in directory '$initial'"
      else
        # No, so stick this in the remainder list.
        restwords+=($word)
      fi
    done
    # Next time, we only need to consider the words we didn't
    # handle this time, so assign these back to the original array.
    words=($restwords)
  fi

  # Decrease the number of space-separated words to look for next time.
  (( maxwords-- ))
done

# If there are some words we didn't group, tell the user what these are.
if (( ${#words} )); then
  print "Unmatched files:"
  print "'${(pj.'\n'.)words}'"
fi
##end

-- 
Peter Stephenson
Web page now at http://homepage.ntlworld.com/p.w.stephenson/
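
The script only reports where each file would go; its comment says that
in real life we'd stick the file in the directory, making sure the
directory existed. What follows is a minimal sketch of that step, not
part of the original script. The helper name move_into_group and its
argument order are assumptions made for illustration.

```shell
# Hypothetical helper (not in the original script): move one file into
# the directory named after its group, creating the directory if needed.
move_into_group() {
  local file=$1 dir=$2
  # -p creates the directory and is silent if it already exists.
  mkdir -p -- "$dir" || return 1
  # "--" protects against names that begin with a dash.
  mv -- "$file" "$dir/"
}
```

To use it, the print "'$word' goes in directory '$initial'" line would
become move_into_group "$word" "$initial". One case this sketch does not
handle: if a file's complete name is identical to a group name, mkdir
fails because a non-directory of that name already exists, so such files
would need renaming or skipping first.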