Compare two (or more) filenames and return what is common between them

zsh-users

* Compare two (or more) filenames and return what is common between them
@ 2014-03-18  7:05 TJ Luoma
  2014-03-18 20:23 ` Peter Stephenson
  0 siblings, 1 reply; 6+ messages in thread
From: TJ Luoma @ 2014-03-18  7:05 UTC (permalink / raw)
  To: Zsh-Users List

What I am trying to do:

Given a folder/directory full of files (and, possibly, some existing
folders/directories), I want to create folders which will group files
with similar file names, but which will leave folders alone.

For example, here’s a list of files from a directory:

@TUAW Adding Low Power Bluetooth to your older Mac.md
all-tuaw-posts-with-titles.txt
Fluid_1.8.zip
FSB- Yellow King.md
fsb-followup.md
iCXItGrjqrw --- ATP_Ending_Theme_Song_A_Day_1546 18_-_640x360 18.mp4
iCXItGrjqrw --- ATP_Ending_Theme_Song_A_Day_1546 18_-_640x360 18.mp4.description.txt
Narrative Lectionary 2013-2014 Readings for Year 4 (Gospel of John).pdf
Narrative Lectionary Summer 2014.pdf
qq Set a Mac's Hostname in Terminal.txt
rss-audio-template-index.xml
rsync-skip-compress.txt
Tumblr Private Posts.txt

Afterwards, I would like all of the above files to have been sorted
into these folders:

@TUAW Adding Low Power Bluetooth to your older Mac
all-tuaw-posts-with-titles
Fluid_1.8
FSB- Yellow King
fsb-followup
iCXItGrjqrw --- ATP_Ending_Theme_Song_A_Day_1546 18_-_640x360 18
Narrative Lectionary
qq Set a Mac's Hostname in Terminal
rss-audio-template-index
rsync-skip-compress
Tumblr Private Posts

The only really tricky part here is for a few of the files which share
part or all of their filename:

iCXItGrjqrw --- ATP_Ending_Theme_Song_A_Day_1546 18_-_640x360 18.mp4
iCXItGrjqrw --- ATP_Ending_Theme_Song_A_Day_1546 18_-_640x360 18.mp4.description.txt

(which should both go into a folder “iCXItGrjqrw ---
ATP_Ending_Theme_Song_A_Day_1546 18_-_640x360 18”)

(Let’s call this “Case #1”)

and

Narrative Lectionary 2013-2014 Readings for Year 4 (Gospel of John).pdf
Narrative Lectionary Summer 2014.pdf

(which should both go into a folder “Narrative Lectionary”)

(Let’s call this “Case #2”)

Also notice that “rss-audio-template-index.xml” and
“rsync-skip-compress.txt” should _not_ go into a folder called “rs”

(Let’s call this “Case #3”)

Case #1 seems like it should be pretty easy, because all I would have
to do is take off the extension(s) and both of the files have the same
“root” so I guess I could match that somehow, but I’m not exactly sure
how since one file has “.mp4.description.txt” and one file has “.mp4”

Case #2 - I am not sure how to efficiently match those two… I guess I
could start comparing letters of each filename and then stop when they
don’t match, but I’m not even sure how to do that. (And I just
realized that I would not want a trailing space in the folder name
either so it would have to be smart enough to deal with that somehow
too so I end up with a folder named “Narrative Lectionary” not
“Narrative Lectionary ”!)

For “Case #3” I guess I need to set some sort of minimum number of
characters to be matched. Maybe… 5? I don’t really know how to deal
with that case very well.
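
The three cases above can be sketched in one small function. This is only
an illustration, not something from the thread's authors: the name
common_root, the default minimum of 5 characters, and the choice to cut
each name at its first "." are all assumptions. It is meant for zsh but
sticks to constructs that bash also accepts.

```shell
# common_root A B [MIN]: print the shared leading part of two file names.
# Prints nothing and returns 1 when that part is shorter than MIN
# characters (hypothetical default: 5).
common_root() {
  local a b min i root
  a=${1%%.*}      # Case #1: drop everything from the first "." onwards
  b=${2%%.*}
  min=${3:-5}
  i=0
  # Case #2: walk forward while both names agree at offset i.
  while [ "$i" -lt "${#a}" ] && [ "${a:$i:1}" = "${b:$i:1}" ]; do
    i=$((i + 1))
  done
  root=${a:0:$i}
  # Trim trailing blanks so the folder name does not end in a space.
  while :; do
    case $root in
      *[[:space:]]) root=${root%?} ;;
      *) break ;;
    esac
  done
  # Case #3: refuse roots that are too short to be meaningful.
  if [ "${#root}" -lt "$min" ]; then
    return 1
  fi
  printf '%s\n' "$root"
}

common_root 'Narrative Lectionary Summer 2014.pdf' \
  'Narrative Lectionary 2013-2014 Readings for Year 4 (Gospel of John).pdf'
```

The call at the end prints "Narrative Lectionary". The comparison is
case-sensitive, so "FSB- Yellow King.md" and "fsb-followup.md" share no
root under this sketch.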

Has anyone already invented this?

If not, can anyone suggest how I might go about doing this? I’ve been
trying to come up with something and I’m just at a complete loss to
know where to start, and I get the strong suspicion that there might
be a zsh feature that would help that I just don’t know about.

Thanks for any help you can offer.

TjL


* Re: Compare two (or more) filenames and return what is common between them
  2014-03-18  7:05 Compare two (or more) filenames and return what is common between them TJ Luoma
@ 2014-03-18 20:23 ` Peter Stephenson
  2014-03-18 21:42   ` Bart Schaefer
  2014-03-28 20:20   ` Peter Stephenson
  0 siblings, 2 replies; 6+ messages in thread
From: Peter Stephenson @ 2014-03-18 20:23 UTC (permalink / raw)
  To: Zsh-Users List

On Tue, 18 Mar 2014 03:05:27 -0400
TJ Luoma <luomat@gmail.com> wrote:
> What I am trying to do:
> 
> Given a folder/directory full of files (and, possibly, some existing
> folders/directories), I want to create folders which will group files
> with similar files names, but which will leave folders alone.

I'm still not quite sure after reading your description what it is you
want, but below is a function for you to play with.  It deals with array
entries rather than files, but fixing that part should be
straightforward.  Somewhere you'll have a '*(.)' pattern to select all
the regular files in a directory, somewhere else a mkdir or possibly mkdir -p,
and somewhere else a mv.

The upshot is that for the input

  "One Two Nineteen"
  "One Two Three"
  "One Two Buckle My Shoe"
  "One Two Buckle My Belt"
  "One Three Four"
  "Two Three Sixteen"
  "Two Three Seventeen"
  "Three Forty Five"

it prints

  Extracting common prefixes 'One Two Buckle My'...
  'One Two Buckle My Shoe' goes in directory 'One Two Buckle My'
  'One Two Buckle My Belt' goes in directory 'One Two Buckle My'
  Extracting common prefixes 'One Two', 'Two Three'...
  'One Two Nineteen' goes in directory 'One Two'
  'One Two Three' goes in directory 'One Two'
  'Two Three Sixteen' goes in directory 'Two Three'
  'Two Three Seventeen' goes in directory 'Two Three'
  Unmatched files:
  'One Three Four'
  'Three Forty Five'

which may or may not be what you want.  I handled suffixes by stripping
off everything from the earliest "." to the end before looking for common
prefixes.
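
To make the suffix rule concrete, here is a small sketch comparing it with
the alternative of removing only the last extension. The function names
and the second file name are made up for illustration; the two parameter
expansions are standard and behave the same in zsh and bash.

```shell
# strip_first_dot NAME: drop everything from the first "." (longest match).
strip_first_dot() { printf '%s\n' "${1%%.*}"; }

# strip_last_dot NAME: drop only the final ".ext" (shortest match).
strip_last_dot() { printf '%s\n' "${1%.*}"; }

# 'video.mp4.description.txt' is a hypothetical stand-in for the long
# name in the original post.
for name in 'Fluid_1.8.zip' 'video.mp4.description.txt'; do
  printf '%s | %s | %s\n' "$name" \
    "$(strip_first_dot "$name")" "$(strip_last_dot "$name")"
done
```

On Fluid_1.8.zip the first form gives Fluid_1 rather than the Fluid_1.8
folder asked for in the original post, while the second form leaves
.mp4.description on the other name, so neither rule alone covers every
file in the list.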

I have to admit I was within an ace of switching to Ruby for this.


##start
emulate -L zsh
setopt extendedglob

local -a words match mbegin mend split restwords

words=(
	"One Two Nineteen"
	"One Two Three"
	"One Two Buckle My Shoe"
	"One Two Buckle My Belt"
	"One Three Four"
	"Two Three Sixteen"
	"Two Three Seventeen"
	"Three Forty Five"
)

typeset -A groups foundgroups
integer maxwords
local word initial pat make

for word in $words; do
  initial=${word%%.*}
  split=(${=initial})
  if (( ${#split} > maxwords )); then
    maxwords=${#split}
  fi
done

words_getinitial() {
  local word=$1
  initial=${word%%.*}
  if (( maxwords > 1 )); then
    pat="(#b)(([^[:blank:]]##[[:blank:]]##)(#c$((maxwords-1)))([^[:blank:]]##))"
  else
    pat="(#b)([^[:blank:]]##)"
  fi
  initial=${(M)initial##${~pat}}
}
# functions -T words_getinitial

while (( maxwords && ${#words} )); do
  restwords=()
  groups=()
  foundgroups=()
  for word in $words; do
    words_getinitial $word
    [[ -z $initial ]] && continue
    if [[ -n $groups[$initial] ]]; then
      foundgroups[$initial]=1
    else
      groups[$initial]=1
    fi
  done
  if (( ${#foundgroups} )); then
    print "Extracting common prefixes '${(kj.', '.)foundgroups}'..."
    for word in $words; do
      words_getinitial $word
      if [[ -z $initial ]]; then
	restwords+=($word)
      elif [[ -n $foundgroups[$initial] ]]; then
	print "'$word' goes in directory '$initial'"
      else
	restwords+=($word)
      fi
    done
    words=($restwords)
  fi
  (( maxwords-- ))
done

if (( ${#words} )); then
  print "Unmatched files:"
  print "'${(pj.'\n'.)words}'"
fi
##end
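
The script above only reports where each name would go. A minimal sketch
of the mkdir and mv step it leaves out might look like the following. The
helper name move_into_dir is an assumption, not part of the script, and
the code avoids zsh-only syntax so it also runs under bash.

```shell
# move_into_dir DIR FILE...: create DIR if needed and move each regular
# file into it.  Anything that is not a regular file is left alone.
move_into_dir() {
  local dir f
  dir=$1
  shift
  mkdir -p -- "$dir" || return 1
  for f in "$@"; do
    # Roughly the filter '*(.)' applies in zsh, except that -f also
    # accepts symlinks that point at regular files.
    [ -f "$f" ] || continue
    mv -- "$f" "$dir"/
  done
}
```

In the reporting loop, the print line for a grouped word could then be
followed by a call such as: move_into_dir "$initial" "$word"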


-- 
Peter Stephenson <p.w.stephenson@ntlworld.com>
Web page now at http://homepage.ntlworld.com/p.w.stephenson/



* Re: Compare two (or more) filenames and return what is common between them
  2014-03-18 20:23 ` Peter Stephenson
@ 2014-03-18 21:42   ` Bart Schaefer
  2014-03-21 19:02     ` TJ Luoma
  2014-03-28 20:20   ` Peter Stephenson
  1 sibling, 1 reply; 6+ messages in thread
From: Bart Schaefer @ 2014-03-18 21:42 UTC (permalink / raw)
  To: Zsh-Users List

On Mar 18,  8:23pm, Peter Stephenson wrote:
} Subject: Re: Compare two (or more) filenames and return what is common bet
}
} On Tue, 18 Mar 2014 03:05:27 -0400
} TJ Luoma <luomat@gmail.com> wrote:
} > What I am trying to do:
} > 
} > Given a folder/directory full of files (and, possibly, some existing
} > folders/directories), I want to create folders which will group files
} > with similar files names, but which will leave folders alone.
} 
} I'm still not quite sure after reading your description what it is you
} want

I suspect he wants something like this:

http://en.wikipedia.org/wiki/Approximate_string_matching

Zsh has such a function internally for "spelling" correction, but it is
actually based on correcting for typographical mistakes on a QWERTY
keyboard more than on traditional substring similarity.

You could also try something with the (#a) globbing flag (approximate
matching) but it's VERY expensive for strings as long as some of your
example file names and is entirely impossible to interrupt once it is
started ("kill -9" territory).


* Re: Compare two (or more) filenames and return what is common between them
  2014-03-18 21:42   ` Bart Schaefer
@ 2014-03-21 19:02     ` TJ Luoma
  2014-03-21 19:39       ` Peter Stephenson
  0 siblings, 1 reply; 6+ messages in thread
From: TJ Luoma @ 2014-03-21 19:02 UTC (permalink / raw)
  To: Zsh-Users List

On Tue, Mar 18, 2014 at 5:42 PM, Bart Schaefer
<schaefer@brasslantern.com> wrote:
> On Mar 18,  8:23pm, Peter Stephenson wrote:
> } Subject: Re: Compare two (or more) filenames and return what is common bet
> }
> } On Tue, 18 Mar 2014 03:05:27 -0400
> } TJ Luoma <luomat@gmail.com> wrote:
> } > What I am trying to do:
> } >
> } > Given a folder/directory full of files (and, possibly, some existing
> } > folders/directories), I want to create folders which will group files
> } > with similar files names, but which will leave folders alone.
> }
> } I'm still not quite sure after reading your description what it is you
> } want
>
> I suspect he wants something like this:
>
> http://en.wikipedia.org/wiki/Approximate_string_matching
>
> Zsh has such a function internally for "spelling" correction, but it is
> actually based on correcting for typographical mistakes on a QWERTY
> keyboard more than on traditional substring similarity.
>
> You could also try something with the (#a) globbing flag (approximate
> matching) but it's VERY expensive for strings as long as some of your
> example file names and is entirely impossible to interrupt once it is
> started ("kill -9" territory).

I just realized that there’s a (seemingly) simpler example for what
I’d like to be able to do in a shell-script instead of in an
interactive shell. It wouldn’t have all of the features that I first
described, but I think it would be enough, and perhaps simpler to
implement.

When I have a bunch of files in a folder and need to match them, I
start typing the letters and then press {tab} for completion, and it
shows me the files that have similar “roots”

For example, right now I just did:

% ls 2[tab]

and it expanded to

% ls 2014-03-16.

even though the matching files are named

2014-03-16.Sermon.aiff
2014-03-16.worship.aiff

So here are the steps for a simpler version of what I’d like to do (which, I
realize, might not be possible to do in a shell script):

1. Find all of the files (not directories) in a given directory (that
directory would probably be “$@” in most cases)

2. Type the first ~4 letters of each filename

3. Emulate {tab} for completion matching

4. Create a directory based on the output from step #3

5. Move whatever files matched step #3 into the folder created in step 4

Is that more clear and/or possible?
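
Steps 2 to 4 can be approximated without touching the completion system.
The sketch below is an assumption-laden illustration: the name plan_groups
is made up, the key length (4) is passed in as a parameter, and the
function only prints a plan instead of creating or moving anything. It is
written for zsh using only constructs bash also accepts.

```shell
# plan_groups KEYLEN NAME...: names sharing their first KEYLEN characters
# form a group.  For every name in a group of two or more, print
# "NAME -> PREFIX", where PREFIX is the longest string that all members
# of the group start with.
plan_groups() {
  local keylen name other prefix count
  keylen=$1
  shift
  for name in "$@"; do
    prefix=$name
    count=0
    for other in "$@"; do
      # Skip names that do not share the first KEYLEN characters.
      [ "${other:0:$keylen}" = "${name:0:$keylen}" ] || continue
      count=$((count + 1))
      # Shorten the candidate one character at a time until OTHER
      # starts with it.
      while :; do
        case $other in
          "$prefix"*) break ;;
        esac
        prefix=${prefix%?}
      done
    done
    # count includes NAME itself, so 2 means at least one partner.
    if [ "$count" -ge 2 ]; then
      printf '%s -> %s\n' "$name" "$prefix"
    fi
  done
}

plan_groups 4 '2014-03-16.Sermon.aiff' '2014-03-16.worship.aiff' \
  'rss-audio-template-index.xml' 'rsync-skip-compress.txt'
```

For these four names the plan lists only the two .aiff files, both mapped
to "2014-03-16." (the same string that tab completion produced above,
trailing dot included). The two "rs" files differ within their first four
characters, so they are left ungrouped.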

Thanks!

TjL



* Re: Compare two (or more) filenames and return what is common between them
  2014-03-21 19:02     ` TJ Luoma
@ 2014-03-21 19:39       ` Peter Stephenson
  0 siblings, 0 replies; 6+ messages in thread
From: Peter Stephenson @ 2014-03-21 19:39 UTC (permalink / raw)
  To: Zsh-Users List

On Fri, 21 Mar 2014 15:02:52 -0400
TJ Luoma <luomat@gmail.com> wrote:
> So here are the steps simpler version of what I’d like to do (which, I
> realize, might not be possible to do in a shell script):
> 
> 1. Find all of the files (not directories) in a given directory (that
> directory would probably be “$@“ in most cases)
> 
> 2. Type the first ~4 letters of each filename
> 
> 3. Emulate {tab} for completion matching
> 
> 4. Create a directory based on the output from step #3
> 
> 5. Move whatever files matched step #3 into the folder created in step 4
> 
> Is that more clear and/or possible?

No, that's much harder!  Hooking into completion is difficult.

However, it sounds like you're saying you'd be happy with collecting
files that happen to have any prefix in common, so you're not actually
worried about completion, just about common prefixes.  That's probably
simpler than what I implemented, where I required the common prefixes to
be space delimited.  It's different in that you're now saying you'd like
to group by the shortest common string, whereas I've looked for the
longest common string first.

pws


* Re: Compare two (or more) filenames and return what is common between them
  2014-03-18 20:23 ` Peter Stephenson
  2014-03-18 21:42   ` Bart Schaefer
@ 2014-03-28 20:20   ` Peter Stephenson
  1 sibling, 0 replies; 6+ messages in thread
From: Peter Stephenson @ 2014-03-28 20:20 UTC (permalink / raw)
  To: Zsh-Users List

On Tue, 18 Mar 2014 20:23:09 +0000
Peter Stephenson <p.w.stephenson@ntlworld.com> wrote:
> The upshot is that for the input
> 
>   "One Two Nineteen"
>   "One Two Three"
>   "One Two Buckle My Shoe"
>   "One Two Buckle My Belt"
>   "One Three Four"
>   "Two Three Sixteen"
>   "Two Three Seventeen"
>   "Three Forty Five"
> 
> it prints
> 
>   Extracting common prefixes 'One Two Buckle My'...
>   'One Two Buckle My Shoe' goes in directory 'One Two Buckle My'
>   'One Two Buckle My Belt' goes in directory 'One Two Buckle My'
>   Extracting common prefixes 'One Two', 'Two Three'...
>   'One Two Nineteen' goes in directory 'One Two'
>   'One Two Three' goes in directory 'One Two'
>   'Two Three Sixteen' goes in directory 'Two Three'
>   'Two Three Seventeen' goes in directory 'Two Three'
>   Unmatched files:
>   'One Three Four'
>   'Three Forty Five'

Let me try to make this more adaptable by annotating it.

##start
# Sanitise the options in use for this function.
emulate -L zsh
# We'll need extendedglob for matching.
setopt extendedglob

local -a words match mbegin mend split restwords

# Here's what we're going to apply the algorithm to.
# You'd probably get these from a "*" or something similar.
words=(
	"One Two Nineteen"
	"One Two Three"
	"One Two Buckle My Shoe"
	"One Two Buckle My Belt"
	"One Three Four"
	"Two Three Sixteen"
	"Two Three Seventeen"
	"Three Forty Five"
)

# We'll use two associative arrays for storing results.
# $groups holds the initial prefixes we're going to group;
# they're stored in the keys of the hash for easy access.  We don't
# really need the value so we'll just stick 1 there so it has a
# non-zero length for testing.  $foundgroups has the same keys but we'll
# only stick something there if there are at least two matching names,
# i.e. it's a real group.  We could do this other ways e.g. by
# sticking a count in $groups, but this way it was dead easy
# to get all the groups out later without looping over the array again.
typeset -A groups foundgroups
# We're going to use spaces in the file name to divide it into words
# so we match whole words only.  maxwords counts the max number of
# words in a file name --- in the example that's 5 in the case of
# "One Two Buckle My Shoe".  (I'm being very confusing here because
# I also use "word" to describe a complete element of the array
# "words", but that's what you get with buggy software from the net.)
integer maxwords
local word initial pat make

# First we'll count the space-delimited words in each input word.
for word in $words; do
  # We're not interested in anything from the first "." on.
  # Remove the longest trailing string beginning with a ".".
  initial=${word%%.*}
  # Split to space-separate words.
  split=(${=initial})
  # ${#split} is the number of such words in the, er, word.
  if (( ${#split} > maxwords )); then
    # So this is the largest number of space-separated words we've found.
    maxwords=${#split}
  fi
done

# Helper function to take an input word and extract its first $maxwords
# space-separated words.
words_getinitial() {
  # Complete name passed in as argument.
  local word=$1
  # As before, remove anything that looks like a suffix.
  initial=${word%%.*}
  if (( maxwords > 1 )); then
    # This is a rather verbose way of matching space-separated words.
    # [[:blank:]] and [^[:blank:]] match any character that is, or is not,
    # a space or tab, respectively.  Appending ## says we want any number
    # of such characters that's at least one.  The (#b) says parentheses
    # are live, so I could refer to them later, but I'm not using that
    # feature so I should take it out :-).  (#c<num>) says match the
    # previous expression (what's in parentheses) <num> times.  We match
    # it maxwords-1 times because we're repeating a word followed by
    # blanks; together with the final word we get maxwords words.
    pat="(#b)(([^[:blank:]]##[[:blank:]]##)(#c$((maxwords-1)))([^[:blank:]]##))"
  else
    # Simple case where we're just matching one word.
    pat="(#b)([^[:blank:]]##)"
  fi
  # ${initial##${~pat}} says "remove the longest string matching $pat from
  # the head of $initial".  The ~ is so the pattern characters in $pat are
  # live, rather than just ordinary string characters.  By putting in the
  # (M) we get the matched string, rather than the original string with
  # the matched bit removed.  We match against $initial, not $word, so
  # the suffix stripped above stays stripped.
  initial=${(M)initial##${~pat}}
  # So at this point, $initial contains the initial $maxwords space-separated
  # words from the string passed in, ignoring suffixes.
}
# functions -T words_getinitial

# We're going to start by looking for the longest possible matches,
# then gradually decrease maxwords until it reaches 0 or we run out
# of words.
while (( maxwords && ${#words} )); do
  restwords=()
  groups=()
  foundgroups=()
  # For all test words (file names)
  for word in $words; do
    words_getinitial $word
    # $initial is the first $maxwords words from $word
    [[ -z $initial ]] && continue
    # Got something...
    if [[ -n $groups[$initial] ]]; then
      # ... for the second time, so record there's something to group.
      foundgroups[$initial]=1
    else
      # ... for the first time, remember it.
      groups[$initial]=1
    fi
  done
  if (( ${#foundgroups} )); then
    # Found some groups.  The group names to use are the keys of
    # $foundgroups.  Say what these are just for info.
    # The expression is just a smarmy way of joining the keys together
    # with a comma and a space.
    print "Extracting common prefixes '${(kj.', '.)foundgroups}'..."
    # Now see which words fit any of these groups.
    for word in $words; do
      words_getinitial $word
      # As before, $initial contains the initial $maxwords words from $word.
      if [[ -z $initial ]]; then
        # Nothing here, so stick this one in the remainder to handle later.
	restwords+=($word)
      elif [[ -n $foundgroups[$initial] ]]; then
        # Yes, this is one of the words to group.
	# In real life we'd stick it in the directory, making
	# sure the directory existed.  For now just report.
	print "'$word' goes in directory '$initial'"
      else
        # No, so stick this in the remainder list.
	restwords+=($word)
      fi
    done
    # Next time, we only need to consider the words we didn't
    # handle this time, so assign these back to the original array.
    words=($restwords)
  fi
  # Decrease the number of space-separated words to look for next time
  (( maxwords-- ))
done

# If there are some words we didn't group, tell the user what these are.
if (( ${#words} )); then
  print "Unmatched files:"
  print "'${(pj.'\n'.)words}'"
fi
##end


-- 
Peter Stephenson <p.w.stephenson@ntlworld.com>
Web page now at http://homepage.ntlworld.com/p.w.stephenson/


