* Compare two (or more) filenames and return what is common between them
From: TJ Luoma @ 2014-03-18  7:05 UTC
To: Zsh-Users List

What I am trying to do:

Given a folder/directory full of files (and, possibly, some existing
folders/directories), I want to create folders which will group files
with similar file names, but which will leave folders alone.

For example, here’s a list of files from a directory:

@TUAW Adding Low Power Bluetooth to your older Mac.md
all-tuaw-posts-with-titles.txt
Fluid_1.8.zip
FSB- Yellow King.md
fsb-followup.md
iCXItGrjqrw --- ATP_Ending_Theme_Song_A_Day_1546 18_-_640x360 18.mp4
iCXItGrjqrw --- ATP_Ending_Theme_Song_A_Day_1546 18_-_640x360 18.mp4.description.txt
Narrative Lectionary 2013-2014 Readings for Year 4 (Gospel of John).pdf
Narrative Lectionary Summer 2014.pdf
qq
Set a Mac's Hostname in Terminal.txt
rss-audio-template-index.xml
rsync-skip-compress.txt
Tumblr Private Posts.txt

Afterwards, I would like all of the above files to have been sorted into
these folders:

@TUAW Adding Low Power Bluetooth to your older Mac
all-tuaw-posts-with-titles
Fluid_1.8
FSB- Yellow King
fsb-followup
iCXItGrjqrw --- ATP_Ending_Theme_Song_A_Day_1546 18_-_640x360 18
Narrative Lectionary
qq
Set a Mac's Hostname in Terminal
rss-audio-template-index
rsync-skip-compress
Tumblr Private Posts

The only really tricky part here is for a few of the files which share
part or all of their filename:

iCXItGrjqrw --- ATP_Ending_Theme_Song_A_Day_1546 18_-_640x360 18.mp4
iCXItGrjqrw --- ATP_Ending_Theme_Song_A_Day_1546 18_-_640x360 18.mp4.description.txt

(which should both go into a folder
"iCXItGrjqrw --- ATP_Ending_Theme_Song_A_Day_1546 18_-_640x360 18")
(Let’s call this "Case #1")

and

Narrative Lectionary 2013-2014 Readings for Year 4 (Gospel of John).pdf
Narrative Lectionary Summer 2014.pdf

(which should both go into a folder "Narrative Lectionary")
(Let’s call this "Case #2")

Also notice that "rss-audio-template-index.xml" and
"rsync-skip-compress.txt" should _not_ go into a folder called "rs"
(Let’s call this "Case #3")

Case #1 seems like it should be pretty easy, because all I would have to
do is take off the extension(s) and both of the files have the same
"root" so I guess I could match that somehow, but I’m not exactly sure
how since one file has ".mp4.description.txt" and one file has ".mp4"

Case #2 - I am not sure how to efficiently match those two… I guess I
could start comparing letters of each filename and then stop when they
don’t match, but I’m not even sure how to do that. (And I just realized
that I would not want a trailing space in the folder name either so it
would have to be smart enough to deal with that somehow too so I end up
with a folder named "Narrative Lectionary" not "Narrative Lectionary "!)

For "Case #3" I guess I need to set some sort of minimum number of
characters to be matched. Maybe… 5? I don’t really know how to deal with
that case very well.

Has anyone already invented this? If not, can anyone suggest how I might
go about doing this? I’ve been trying to come up with something and I’m
just at a complete loss to know where to start, and I get the strong
suspicion that there might be a zsh feature that would help that I just
don’t know about.

Thanks for any help you can offer.

TjL
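
The letter-by-letter comparison described for Case #2, together with the
trailing-space and minimum-length concerns, could be sketched roughly as
below. This is only an illustration under stated assumptions: the
function name common_prefix and its optional third argument are made up
here, the default minimum of 5 is simply the guess from the message, and
extensions are removed by cutting at the first "." in each name. The code
sticks to expansions that both zsh and bash accept.

```shell
# common_prefix NAME1 NAME2 [MINLEN]
# Hypothetical helper: prints the leading text shared by two file names,
# or nothing at all if the shared part is shorter than MINLEN (default 5).
common_prefix() {
  # Cut each name at its first "." so that extensions are ignored.
  local a="${1%%.*}" b="${2%%.*}" minlen="${3:-5}"
  local i=0 prefix

  # Walk forward while both names have the same character at position i.
  while [ "$i" -lt "${#a}" ] && [ "${a:$i:1}" = "${b:$i:1}" ]; do
    i=$((i + 1))
  done
  prefix="${a:0:$i}"

  # Drop trailing spaces so "Narrative Lectionary " loses its last blank.
  while [ "${prefix% }" != "$prefix" ]; do
    prefix="${prefix% }"
  done

  # Case #3: refuse to report very short matches such as "rs".
  if [ "${#prefix}" -ge "$minlen" ]; then
    printf '%s\n' "$prefix"
  fi
}

# Case #2 prints: Narrative Lectionary
common_prefix "Narrative Lectionary 2013-2014 Readings for Year 4 (Gospel of John).pdf" \
              "Narrative Lectionary Summer 2014.pdf"

# Case #3 prints nothing, because "rs" is shorter than 5 characters.
common_prefix "rss-audio-template-index.xml" "rsync-skip-compress.txt"
```

Comparing only two names at a time leaves open how to pick the pairs;
sorting the names first and comparing neighbours is one possibility, but
that choice is not something the message settles.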
* Re: Compare two (or more) filenames and return what is common between them
From: Peter Stephenson @ 2014-03-18 20:23 UTC
To: Zsh-Users List

On Tue, 18 Mar 2014 03:05:27 -0400
TJ Luoma <luomat@gmail.com> wrote:
> What I am trying to do:
>
> Given a folder/directory full of files (and, possibly, some existing
> folders/directories), I want to create folders which will group files
> with similar file names, but which will leave folders alone.

I'm still not quite sure after reading your description what it is you
want, but below is a function for you to play with.  It deals with array
entries rather than files, but fixing that part should be
straightforward.  Somewhere you'll have a '*(.)' pattern to select all
the regular files in a directory, somewhere else a mkdir or possibly
mkdir -p, and somewhere else a mv.

The upshot is that for the input

"One Two Nineteen"
"One Two Three"
"One Two Buckle My Shoe"
"One Two Buckle My Belt"
"One Three Four"
"Two Three Sixteen"
"Two Three Seventeen"
"Three Forty Five"

it prints

Extracting common prefixes 'One Two Buckle My'...
'One Two Buckle My Shoe' goes in directory 'One Two Buckle My'
'One Two Buckle My Belt' goes in directory 'One Two Buckle My'
Extracting common prefixes 'One Two', 'Two Three'...
'One Two Nineteen' goes in directory 'One Two'
'One Two Three' goes in directory 'One Two'
'Two Three Sixteen' goes in directory 'Two Three'
'Two Three Seventeen' goes in directory 'Two Three'
Unmatched files:
'One Three Four'
'Three Forty Five'

which may or may not be what you want.

I handled suffixes by stripping off everything from the earliest "." to
the end before looking for common prefixes.

I have to admit I was within an ace of switching to Ruby for this.

##start
emulate -L zsh
setopt extendedglob

local -a words match mbegin mend split restwords

words=(
    "One Two Nineteen"
    "One Two Three"
    "One Two Buckle My Shoe"
    "One Two Buckle My Belt"
    "One Three Four"
    "Two Three Sixteen"
    "Two Three Seventeen"
    "Three Forty Five"
)

typeset -A groups foundgroups
integer maxwords

local word initial pat make

for word in $words; do
    initial=${word%%.*}
    split=(${=initial})
    if (( ${#split} > maxwords )); then
        maxwords=${#split}
    fi
done

words_getinitial() {
    local word=$1
    initial=${word%%.*}
    if (( maxwords > 1 )); then
        pat="(#b)(([^[:blank:]]##[[:blank:]]##)(#c$((maxwords-1)))([^[:blank:]]##))"
    else
        pat="(#b)([^[:blank:]]##)"
    fi
    # Match against the suffix-stripped name; matching against $word
    # here would bring the suffix back.
    initial=${(M)initial##${~pat}}
}
# functions -T words_getinitial

while (( maxwords && ${#words} )); do
    restwords=()
    groups=()
    foundgroups=()
    for word in $words; do
        words_getinitial $word
        [[ -z $initial ]] && continue
        if [[ -n $groups[$initial] ]]; then
            foundgroups[$initial]=1
        else
            groups[$initial]=1
        fi
    done

    if (( ${#foundgroups} )); then
        print "Extracting common prefixes '${(kj.', '.)foundgroups}'..."
        for word in $words; do
            words_getinitial $word
            if [[ -z $initial ]]; then
                restwords+=($word)
            elif [[ -n $foundgroups[$initial] ]]; then
                print "'$word' goes in directory '$initial'"
            else
                restwords+=($word)
            fi
        done
        # Carry only the unhandled names over to the next pass.
        words=($restwords)
    fi

    (( maxwords-- ))
done

if (( ${#words} )); then
    print "Unmatched files:"
    print "'${(pj.'\n'.)words}'"
fi
##end

-- 
Peter Stephenson <p.w.stephenson@ntlworld.com>
Web page now at http://homepage.ntlworld.com/p.w.stephenson/

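
A note on the suffix handling, with a small sketch. Cutting at the
earliest "." with ${word%%.*} turns "Fluid_1.8.zip" into "Fluid_1",
whereas the original request asked for a folder named "Fluid_1.8";
cutting at the last "." with ${word%.*} instead leaves
".mp4.description" on the second video file. One possible middle ground,
shown purely as an assumption and not as anything proposed in the thread,
is to peel off only extensions from a known list. The function name
strip_known_suffixes and the list of extensions are invented for this
example, and the code is written to run in bash as well as zsh.

```shell
# The two built-in choices, for comparison:
name="Fluid_1.8.zip"
printf '%s\n' "${name%%.*}"   # cut at the earliest dot: Fluid_1
printf '%s\n' "${name%.*}"    # cut at the last dot:     Fluid_1.8

# strip_known_suffixes NAME
# Hypothetical helper: repeatedly removes a trailing ".ext" as long as
# ext is one of the extensions listed below (an assumed list).
strip_known_suffixes() {
  local name="$1" ext changed=1
  while [ "$changed" -eq 1 ]; do
    changed=0
    for ext in md txt zip mp4 pdf xml description; do
      case "$name" in
        *".$ext")
          name=${name%".$ext"}
          changed=1
          ;;
      esac
    done
  done
  printf '%s\n' "$name"
}

strip_known_suffixes "Fluid_1.8.zip"
strip_known_suffixes "iCXItGrjqrw --- ATP_Ending_Theme_Song_A_Day_1546 18_-_640x360 18.mp4.description.txt"
```

The obvious cost is that the list has to be maintained by hand, and a
name that legitimately ends in one of the listed words after a dot would
be shortened too.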
* Re: Compare two (or more) filenames and return what is common between them
From: Bart Schaefer @ 2014-03-18 21:42 UTC
To: Zsh-Users List

On Mar 18,  8:23pm, Peter Stephenson wrote:
} Subject: Re: Compare two (or more) filenames and return what is common bet
}
} On Tue, 18 Mar 2014 03:05:27 -0400
} TJ Luoma <luomat@gmail.com> wrote:
} > What I am trying to do:
} >
} > Given a folder/directory full of files (and, possibly, some existing
} > folders/directories), I want to create folders which will group files
} > with similar file names, but which will leave folders alone.
}
} I'm still not quite sure after reading your description what it is you
} want

I suspect he wants something like this:

http://en.wikipedia.org/wiki/Approximate_string_matching

Zsh has such a function internally for "spelling" correction, but it is
actually based on correcting for typographical mistakes on a QWERTY
keyboard more than on traditional substring similarity.

You could also try something with the (#a) glob qualifier (approximate
matching) but it's VERY expensive for strings as long as some of your
example file names and is entirely impossible to interrupt once it is
started ("kill -9" territory).

* Re: Compare two (or more) filenames and return what is common between them
From: TJ Luoma @ 2014-03-21 19:02 UTC
To: Zsh-Users List

On Tue, Mar 18, 2014 at 5:42 PM, Bart Schaefer <schaefer@brasslantern.com> wrote:
> On Mar 18,  8:23pm, Peter Stephenson wrote:
> } Subject: Re: Compare two (or more) filenames and return what is common bet
> }
> } On Tue, 18 Mar 2014 03:05:27 -0400
> } TJ Luoma <luomat@gmail.com> wrote:
> } > What I am trying to do:
> } >
> } > Given a folder/directory full of files (and, possibly, some existing
> } > folders/directories), I want to create folders which will group files
> } > with similar file names, but which will leave folders alone.
> }
> } I'm still not quite sure after reading your description what it is you
> } want
>
> I suspect he wants something like this:
>
> http://en.wikipedia.org/wiki/Approximate_string_matching
>
> Zsh has such a function internally for "spelling" correction, but it is
> actually based on correcting for typographical mistakes on a QWERTY
> keyboard more than on traditional substring similarity.
>
> You could also try something with the (#a) glob qualifier (approximate
> matching) but it's VERY expensive for strings as long as some of your
> example file names and is entirely impossible to interrupt once it is
> started ("kill -9" territory).

I just realized that there’s a (seemingly) simpler example for what I’d
like to be able to do in a shell-script instead of in an interactive
shell. It wouldn’t have all of the features that I first described, but
I think it would be enough, and perhaps simpler to implement.

When I have a bunch of files in a folder and need to match them, I start
typing the letters and then press {tab} for completion, and it shows me
the files that have similar "roots".

For example, right now I just did:

% ls 2[tab]

and it expanded to

% ls 2014-03-16.

even though the matching files are named

2014-03-16.Sermon.aiff
2014-03-16.worship.aiff

So here are the steps for a simpler version of what I’d like to do
(which, I realize, might not be possible to do in a shell script):

1. Find all of the files (not directories) in a given directory (that
directory would probably be "$@" in most cases)

2. Type the first ~4 letters of each filename

3. Emulate {tab} for completion matching

4. Create a directory based on the output from step #3

5. Move whatever files matched step #3 into the folder created in step 4

Is that more clear and/or possible?

Thanks!

TjL

* Re: Compare two (or more) filenames and return what is common between them
From: Peter Stephenson @ 2014-03-21 19:39 UTC
To: Zsh-Users List

On Fri, 21 Mar 2014 15:02:52 -0400
TJ Luoma <luomat@gmail.com> wrote:
> So here are the steps for a simpler version of what I’d like to do
> (which, I realize, might not be possible to do in a shell script):
>
> 1. Find all of the files (not directories) in a given directory (that
> directory would probably be "$@" in most cases)
>
> 2. Type the first ~4 letters of each filename
>
> 3. Emulate {tab} for completion matching
>
> 4. Create a directory based on the output from step #3
>
> 5. Move whatever files matched step #3 into the folder created in step 4
>
> Is that more clear and/or possible?
>
> Thanks!
>
> TjL

No, that's much harder!  Hooking into completion is difficult.

However, it sounds like you're saying you'd be happy with collecting
files that happen to have any prefix in common, so you're not actually
worried about completion, just about common prefixes.  That's probably
simpler than what I implemented, where I required the common prefixes to
be space delimited.  It's different in that you're now saying you'd like
to group by the shortest common string, whereas I've looked for the
longest common string first.

pws

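
One way to read "collect files that happen to share a prefix", without
touching the completion system at all, is to treat the first few
characters of each name as a grouping key. The sketch below is only an
assumption about what that might look like: the function name
group_by_leading_chars is invented here, the default width of 4 comes
from the "first ~4 letters" step in the quoted list, and names containing
newlines are not supported because the names are read one per line. It
avoids zsh-only syntax so that it also runs in bash.

```shell
# group_by_leading_chars [N]
# Hypothetical helper: reads file names on stdin, one per line, and
# prints "KEY<TAB>NAME" for every name whose first N characters (default
# 4) are shared with at least one other name. Names that share their key
# with nobody are left out.
group_by_leading_chars() {
  local n="${1:-4}" name key prev_key="" prev_name="" printed=0
  # Byte-order sorting keeps names with the same leading text adjacent.
  LC_ALL=C sort | while IFS= read -r name; do
    [ -n "$name" ] || continue
    key="${name:0:$n}"
    if [ "$key" = "$prev_key" ]; then
      # Second or later member of a group: emit the first member once.
      if [ "$printed" -eq 0 ]; then
        printf '%s\t%s\n' "$prev_key" "$prev_name"
        printed=1
      fi
      printf '%s\t%s\n' "$key" "$name"
    else
      printed=0
    fi
    prev_key="$key"
    prev_name="$name"
  done
}

# The two 2014 files share the key "2014" and are printed; the rss/rsync
# pair has the keys "rss-" and "rsyn", so neither of them is printed.
printf '%s\n' \
  "2014-03-16.Sermon.aiff" \
  "rss-audio-template-index.xml" \
  "2014-03-16.worship.aiff" \
  "rsync-skip-compress.txt" |
  group_by_leading_chars 4
```

The key is only a way of deciding which files belong together. To get a
directory name like the "2014-03-16." that completion displayed, the
members of each group would still need their full common prefix computed
in a second step.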
* Re: Compare two (or more) filenames and return what is common between them
From: Peter Stephenson @ 2014-03-28 20:20 UTC
To: Zsh-Users List

On Tue, 18 Mar 2014 20:23:09 +0000
Peter Stephenson <p.w.stephenson@ntlworld.com> wrote:
> The upshot is that for the input
>
> "One Two Nineteen"
> "One Two Three"
> "One Two Buckle My Shoe"
> "One Two Buckle My Belt"
> "One Three Four"
> "Two Three Sixteen"
> "Two Three Seventeen"
> "Three Forty Five"
>
> it prints
>
> Extracting common prefixes 'One Two Buckle My'...
> 'One Two Buckle My Shoe' goes in directory 'One Two Buckle My'
> 'One Two Buckle My Belt' goes in directory 'One Two Buckle My'
> Extracting common prefixes 'One Two', 'Two Three'...
> 'One Two Nineteen' goes in directory 'One Two'
> 'One Two Three' goes in directory 'One Two'
> 'Two Three Sixteen' goes in directory 'Two Three'
> 'Two Three Seventeen' goes in directory 'Two Three'
> Unmatched files:
> 'One Three Four'
> 'Three Forty Five'

Let me try to make this more adaptable by annotating it.

##start

# Sanitise the options in use for this function.
emulate -L zsh
# We'll need extendedglob for matching.
setopt extendedglob

local -a words match mbegin mend split restwords

# Here's what we're going to apply the algorithm to.
# You'd probably get these from a "*" or something similar.
words=(
    "One Two Nineteen"
    "One Two Three"
    "One Two Buckle My Shoe"
    "One Two Buckle My Belt"
    "One Three Four"
    "Two Three Sixteen"
    "Two Three Seventeen"
    "Three Forty Five"
)

# We'll use two associative arrays for storing results.
# $groups holds the initial prefixes we're going to group;
# they're stored in the keys of the hash for easy access.  We don't
# really need the value so we'll just stick 1 there so it has a
# non-zero length for testing.  $foundgroups has the same keys but we'll
# only stick something there if there are at least two matching names,
# i.e. it's a real group.  We could do this other ways e.g. by
# sticking a count in $groups, but this way it was dead easy
# to get all the groups out later without looping over the array again.
typeset -A groups foundgroups

# We're going to use spaces in the file name to divide it into words
# so we match whole words only.  maxwords counts the max number of
# words in a file name --- in the example that's 5 in the case of
# "One Two Buckle My Shoe".  (I'm being very confusing here because
# I also use "word" to describe a complete element of the array
# "words", but that's what you get with buggy software from the net.)
integer maxwords

local word initial pat make

# First we'll count the space-delimited words in each input word.
for word in $words; do
    # We're not interested in anything from the first "." on.
    # Remove the longest trailing string beginning with a ".".
    initial=${word%%.*}
    # Split to space-separated words.
    split=(${=initial})
    # ${#split} is the number of such words in the, er, word.
    if (( ${#split} > maxwords )); then
        # So this is the largest number of space-separated words we've found.
        maxwords=${#split}
    fi
done

# Helper function to take an input word and extract its first "maxwords"
# space-separated words.
words_getinitial() {
    # Complete name passed in as argument.
    local word=$1
    # As before, remove anything that looks like a suffix.
    initial=${word%%.*}
    if (( maxwords > 1 )); then
        # This is a rather verbose way of matching space-separated words.
        # [[:blank:]] and [^[:blank:]] match any character that is or is
        # not a space or tab, respectively.  Appending ## says we want any
        # number of such characters that's at least one.  The (#b) says
        # parentheses are live, so I could refer to them later, but I'm not
        # using that feature so I should take it out :-).  (#c<num>) says
        # match the previous expression (what's in parentheses) <num> times.
        # We match it maxwords-1 times because we're repeating a word
        # followed by blanks; together with the final word we get maxwords
        # words.
        pat="(#b)(([^[:blank:]]##[[:blank:]]##)(#c$((maxwords-1)))([^[:blank:]]##))"
    else
        # Simple case where we're just matching one word.
        pat="(#b)([^[:blank:]]##)"
    fi
    # ${initial##${~pat}} says "remove the longest string matching $pat
    # from the start of $initial".  The ~ is so the pattern characters in
    # $pat are live, rather than just ordinary string characters.  By
    # putting in the (M) we get the matched string, rather than the
    # original string with the matched bit removed.  We match against
    # $initial rather than $word so the suffix stays stripped.
    initial=${(M)initial##${~pat}}
    # So at this point, $initial contains the initial $maxwords space-separated
    # words from the string passed in, ignoring suffixes.
}
# functions -T words_getinitial

# We're going to start by looking for the longest possible matches,
# then gradually decrease maxwords until it reaches 0 or we run out
# of words.
while (( maxwords && ${#words} )); do
    restwords=()
    groups=()
    foundgroups=()
    # For all test words (file names)
    for word in $words; do
        words_getinitial $word
        # $initial is the first $maxwords words from $word
        [[ -z $initial ]] && continue
        # Got something...
        if [[ -n $groups[$initial] ]]; then
            # ... for the second time, so record there's something to group.
            foundgroups[$initial]=1
        else
            # ... for the first time, remember it.
            groups[$initial]=1
        fi
    done

    if (( ${#foundgroups} )); then
        # Found some groups.  The group names to use are the keys of
        # $foundgroups.  Say what these are just for info.
        # The expression is just a smarmy way of joining the keys together
        # with a comma and a space.
        print "Extracting common prefixes '${(kj.', '.)foundgroups}'..."
        # Now see which words fit any of these groups.
        for word in $words; do
            words_getinitial $word
            # As before, $initial contains the initial $maxwords words from $word.
            if [[ -z $initial ]]; then
                # Nothing here, so stick this one in the remainder to handle later.
                restwords+=($word)
            elif [[ -n $foundgroups[$initial] ]]; then
                # Yes, this is one of the words to group.
                # In real life we'd stick it in the directory, making
                # sure the directory existed.  For now just report.
                print "'$word' goes in directory '$initial'"
            else
                # No, so stick this in the remainder list.
                restwords+=($word)
            fi
        done
        # Next time, we only need to consider the words we didn't
        # handle this time, so assign these back to the original array.
        words=($restwords)
    fi

    # Decrease the number of space-separated words to look for next time
    (( maxwords-- ))
done

# If there are some words we didn't group, tell the user what these are.
if (( ${#words} )); then
    print "Unmatched files:"
    print "'${(pj.'\n'.)words}'"
fi

##end

-- 
Peter Stephenson <p.w.stephenson@ntlworld.com>
Web page now at http://homepage.ntlworld.com/p.w.stephenson/
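
The script only reports where each name would go. The "in real life" step
mentioned in its comments, making sure the directory exists and then
moving the file, might look something like the sketch below. The function
name file_into_group is an assumption made for this example, and a plain
[ -f ] test stands in for zsh's '*(.)' selection so the code also runs in
bash; unlike '(.)', that test also accepts symbolic links to regular
files.

```shell
# file_into_group FILE GROUP
# Hypothetical helper: moves FILE into a directory called GROUP that sits
# next to it, creating the directory first if necessary.  Anything that
# is not a regular file (an existing folder, for example) is left alone.
file_into_group() {
  local file="$1" group="$2" target
  [ -f "$file" ] || return 0
  target="$(dirname -- "$file")/$group"
  # mkdir -p is a no-op if the directory already exists; the move is
  # skipped if the directory could not be created.
  mkdir -p -- "$target" && mv -- "$file" "$target/"
}

# Possible use from the reporting branch of the script:
#   file_into_group "$word" "$initial"
```

This fails, rather than overwriting anything, when a regular file with
the same name as the group already exists, because mkdir cannot create
the directory. A lone file such as "qq" that is meant to end up in a
folder also called "qq" would need to be renamed or moved aside first.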