Thank you for the very useful comments. However, I disagree with you about the RE language. I agree that RE experts don't need it, but when I was hiring and gave some software to a new hire (whether an experienced programmer or a recent college grad), simply handing over huge REs was daunting to that person. I wrote that stuff that way to help remind me and anyone who might use the python program. I don't claim success. It does help me.

When you say '{1}' is redundant: I think I did that to avoid any possibility of conflicts with the next string that is concatenated to the *Y_* (e.g. '*' or '+' or '{4,7}'). I am embarrassed I did not communicate that in the code. I had to think about it for a couple of hours before I recalled the "why". I will fix that. (It would be difficult to discuss this RE if I had to write "(19\d\d|20[01]\d|202" + "[0-" + lastYearRE + "]" + ")" rather than just *Y_*.)

My initial thought on naming was that I wanted each definition to appear in exactly one place in the software. Python and the BTL folks told me to never use a constant in code. Always name it. Hence, I gave it a name. Each name might be used in multiple places. They might be imported.

You are correct, the expression is unbalanced. I tried to remove the text2bytes(lastYearRE*)* call so the expression in this email was all text. I failed to remove the trailing *)* when I removed the call to text2bytes(). My hasty transcriptions might have produced similar errors in my email.

Recall, my focus was on any file of any size. I'm on Windows 10 and an M1 MacBook. Python works on both. I don't have a Linux machine or enough desktop space to host one. I'm also mildly fed up with virtual machines.

Friedl taught me one thing: most RE implementations are different. I'm trying to write a program that I could give to anyone and that could reliably find a date (an RE) in any file.
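A minimal sketch of the '{1}' question above, assuming Python's re module. LAST_YEAR_DIGIT is a hypothetical stand-in for lastYearRE, and Y_ONCE is a name invented here for the spelling with the explicit count. It compares the bare parenthesized group with the '{1}' version and then probes what the local interpreter does when a second quantifier is concatenated after '{1}'; no outcome is assumed for that last step, since implementations differ.

```python
import re

# Assumption: a stand-in for lastYearRE, the final digit of the latest
# year to accept within the 2020s ("3" means years up to 2023).
LAST_YEAR_DIGIT = "3"

# Defined in exactly one place. The outer parentheses already make the
# alternation a single unit, so a later quantifier applies to all of it.
Y_ = r"(19\d\d|20[01]\d|202[0-" + LAST_YEAR_DIGIT + r"])"
Y_ONCE = Y_ + "{1}"  # the spelling with the explicit count

samples = ["1899", "1900", "1999", "2019", "2023", "2024", "20230"]
for text in samples:
    plain = re.fullmatch(Y_, text) is not None
    counted = re.fullmatch(Y_ONCE, text) is not None
    print(f"{text:>6}  plain={plain}  with{{1}}={counted}")

# A quantifier concatenated after the bare group repeats the whole group.
print(re.fullmatch(Y_ + "{2}", "19992023") is not None)

# Probe how this interpreter treats a quantifier stacked on "{1}".
try:
    re.compile(Y_ONCE + "*")
    stacked_ok = True
except re.error as exc:
    stacked_ok = False
    print("stacked quantifier rejected:", exc)
```

If the two columns agree for every sample, the parentheses alone are doing the protecting, and the name *Y_* can stay defined in one place either way.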
YYYY, MM, DD, HR, MI, SE, TH are words my user could use on the command line or in an options dialog. LAT and LON might also be possibilities. CST, EST, MST, PST, ... also. A 500 gigabyte archive or directory/folder of pictures and movies would be a great test target.

I very much appreciate your comments. If this discussion is boring to others, I would be happy to take it to email.

I like your program. My experience with RE, grep, python, and sed suggests that anything but GNU grep and sed might not work due to the different implementations. I've been out of the Unix software business for 30 years after starting work at BTL in the 1970s and working on Version 6. I didn't know "printf" was now built into bash! That was a surprise. It's an incremental improvement, but doesn't compare with f-strings in python. *The interactive interpreter for python should have a "bash" mode?!*

Does grep use a memory-mapped file for its search, thereby avoiding all buffering boundaries? That, too, would be new information to me. The additional complexity of dealing with buffering is more than annoying.

Do you have any thoughts on how to verify a program that uses REs? I had given it no thought until now. My first thought for dates would be to write a separate module that simply searches through the file looking for 4 digits in a row without using REs, recording the offset plus 1 character before and 16 characters after in a python list of (offset, str) tuples, *ddddList*, and using *ddddList* as a proxy for the entire file. I could then aim my REs at *ddddList*. *[A list of tuples in python is wonderful!!]*

It seems to me '*' and '+' and {x,y} are the performance hogs in REs. My REs avoid them. One pass, I think, should suffice. What do you think? I haven't "archived" my 350 GB of pictures and movies, but one pass over all files therein ought to suffice, right? Two different programs that use different algorithms should be pretty good proof of correctness, wouldn't you think?
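The *ddddList* idea above can be sketched roughly as follows. Everything here is an assumption made for illustration: the function names are invented, DATE_RE is a simplified stand-in rather than my real RE, and the byte-at-a-time loop is far too slow for hundreds of gigabytes. It only shows the shape of the cross-check: one pass that finds digit runs without REs, and a comparison against a plain RE scan of the whole buffer.

```python
import re

# Assumption: a simplified date RE over bytes, not the real one.
DATE_RE = re.compile(
    rb"(?<![0-9A-Za-z])"
    rb"(19\d\d|20[01]\d|202[0-3])"
    rb"(-(0[1-9]|1[012])-(0[1-9]|[12]\d|3[01]))?"
    rb"(?![0-9A-Za-z])"
)

BEFORE, AFTER = 1, 16  # context kept around each 4-digit window


def build_dddd_list(data):
    """Return [(offset, context)] for every offset where 4 ASCII digits start.

    No regular expressions are used here; that is the point of the check.
    """
    found = []
    run = 0  # length of the digit run ending at the current byte
    for i, byte in enumerate(data):
        if 48 <= byte <= 57:  # ASCII "0".."9"
            run += 1
            if run >= 4:
                start = i - 3
                ctx = data[max(0, start - BEFORE): start + 4 + AFTER]
                found.append((start, ctx))
        else:
            run = 0
    return found


def via_proxy(data):
    """Aim the RE at each recorded context instead of at the whole file."""
    hits = set()
    for start, ctx in build_dddd_list(data):
        pos = min(start, BEFORE)  # where the 4 digits begin inside ctx
        m = DATE_RE.match(ctx, pos)
        if m:
            hits.add((start, m.group(0)))
    return hits


def via_full_scan(data):
    """Reference result: the same RE run over the entire buffer."""
    return {(m.start(), m.group(0)) for m in DATE_RE.finditer(data)}


if __name__ == "__main__":
    sample = b"x 2001-04-01,1999_12_31 1944.03.01,1914! 20230 a2020 2023"
    print(sorted(via_proxy(sample)))
    print(via_proxy(sample) == via_full_scan(sample))
```

A disagreement between the two sets would point at either the RE or the digit scanner, which is the kind of evidence two independent algorithms can give.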
My REs have no stars or pluses. If there is a mismatch before a match, give up and move on.

On my Windows 10 machine, I have Cygwin. Microsoft says my CPU doesn't have a TPM and the specific Intel Core i7 on my system is not supported, so Windows 11 is not happening. Microsoft is DOS personified. (An unkind editorial remark about the low quality of software coming from Microsoft.)

Anyway, I thank you again for your patience with me and your observations. I value your views and the other views I've seen here on coff@tuhs.org. I welcome all input to my education and will share all I have done so far with anyone who wants to collaborate, test, or is just curious.

GOAL: run a python program from an at-cost thumb drive that:

- reaps all media files from a user-specified directory/folder tree and
- Adds files to the thumb drive.

*Adds files* means

- Original file system is untouched
- Adds only unique files (hash codes are unique)
- Creates on the thumb drive a relative directory wherein the original file was found
- Prepends a "YYYY-MM-DD-" string to the filename if one can be found (EXIF is a great shortcut).
- Copies srcroot/relative_path/oldfilename to thumbdrive/relative_path/YYYY-MM-DD-oldfilename or thumbdrive/relative_path/0000-oldfilename.
- Can also incrementally add new files by just scanning anywhere in any other computer file system or any other computer.
- Must work on Mac, Windows, and Linux

What I have is a working prototype. It works on Mac and Windows. It doesn't do the date thing very well, and there are other shortcomings. I have delivered exactly one Christmas present to my favorite person in the world: a 400 GB SSD drive with all our pictures and media we have ever taken.

The next things are to *add* more media, *re-unique-ify* (check) what is already present on the SSD drive, and *improve the proper choice of "YYYY-MM-DD-" prefix* to filenames.

I am retired and this is fun. I'm too old to want to get rich.
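For anyone curious, the *Adds files* step described in the GOAL could look roughly like the sketch below. This is not the prototype itself. The names file_digest, add_files and find_date are invented for illustration, SHA-256 is an assumed choice of hash, and the media-type filter is left out to keep it short.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Hash a file in chunks so its size does not matter."""
    h = hashlib.sha256()  # assumption: any strong hash would do
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def add_files(src_root, dest_root, seen, find_date=None):
    """Copy files whose digest is not in `seen` from src_root to dest_root.

    find_date is a hypothetical callback returning "YYYY-MM-DD" or None.
    The source tree is only read, never modified.
    """
    copied = []
    for src in sorted(src_root.rglob("*")):
        if not src.is_file():
            continue
        digest = file_digest(src)
        if digest in seen:
            continue  # same content already added
        seen.add(digest)
        prefix = (find_date(src) if find_date else None) or "0000"
        rel_dir = src.parent.relative_to(src_root)
        dest = dest_root / rel_dir / f"{prefix}-{src.name}"
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)  # copy2 also keeps timestamps where it can
        copied.append(dest)
    return copied
```

Calling add_files again with the same `seen` set copies nothing new, which is the incremental behaviour wanted; in practice the set would have to be saved on the thumb drive between runs.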
Ed Bradford
Pflugerville, TX
egbegb2@gmail.com

On Tue, Mar 7, 2023 at 5:40 AM Ralph Corderoy wrote:

> Hi Ed,
>
> > I have made an attempt to make my RE stuff readable and supportable.
>
> Readable to you, which is fine because you're the prime future reader.
> But it's less readable than the regexp to those that know and read them
> because of the indirection introduced by the variables. You've created
> your own little language of CAPITALS rather than the lingua franca of
> regexps. :-)
>
> > Machine language was unreadable and then along came assembly language.
> > Assembly language was unreadable, then came higher level languages.
>
> Each time the original language was readable because practitioners had
> to read and write it. When its replacement came along, the old skill
> was no longer learnt and the language became ‘unreadable’.
>
> > So far, I can do that for this RE program that works for small files,
> > large files, binary files and text files for exactly one pattern:
> >     YYYY[-MM-DD]
> > I constructed this RE with code like this:
> >     # ymdt is YYYY-MM-DD RE in text.
> >     # looking only for 1900s and 2000s years and no later than today.
> >     _YYYY = "(19\d\d|20[01]\d|202" + "[0-" + lastYearRE) + "]" + "){1}"
>
> ‘{1}’ is redundant.
>
> >     # months
> >     _MM = "(0[1-9]|1[012])"
> >     # days
> >     _DD = "(0[1-9]|[12]\d|3[01])"
> >     ymdt = _YYYY + '[' + _INTERNALSEP +
> >         _MM +
> >         _INTERNALSEP +
> >         ']'{0,1)
>
> I think we're missing something as the ‘'['’ is starting a character
> class which is odd for wrapping the month and the ‘{0,1)’ doesn't have
> matching brackets and is outside the string.
>
> BTW, ‘{0,1}’ is more readable to those who know regexps as ‘?’.
>
> > For the whole file, RE I used
> >     ymdthf = _FRSEP + ymdt + _BASEP
> > where FRSEP is front separator which includes
> > a bunch of possible separators, excluding numbers and letters, or-ed
> > with the up arrow "beginning of line" RE mark.
>
> It sounds like you're wanting a word boundary; something provided by
> regexps. In Python, it's ‘\b’.
>
>     >>> re.search(r'\bfoo\b', 'endfoo foostart foo ends'),
>     (<re.Match object; span=(16, 19), match='foo'>,)
>
> Are you aware of the /x modifier to a regexp which ignores internal
> whitespace, including linefeeds? This allows a large regexp to be split
> over lines. There's a comment syntax too. See
> https://docs.python.org/3/library/re.html#re.X
>
> GNU grep isn't too shabby at looking through binary files. I can't use
> /x with grep so in a bash script, I'd do it manually. \< and \> match
> the start and end of a word, a bit like Python's \b.
>
>     re='
>         .?\<
>         (19[0-9][0-9]|20[01][0-9]|202[0-3])
>         (
>             ([-:._])
>             (0[1-9]|1[0-2])
>             \3
>             (0[1-9]|[12][0-9]|3[01])
>         )?
>         \>.?
>     '
>     re=${re//$'\n'/}
>     re=${re// /}
>
>     printf '%s\n' 2001-04-01,1999_12_31 1944.03.01,1914! 2000-01.01 >big-binary-file
>     LC_ALL=C grep -Eboa "$re" big-binary-file | sed -n l
>
> which gives
>
>     0:2001-04-01,$
>     11:1999_12_31$
>     22:1944.03.01,$
>     33:1914!$
>     39:2000-$
>
> showing:
>
> - the byte offset within the file of each match,
> - along with the any before and after byte if it's not a \n and not
>   already matched, just to show the word-boundary at work,
> - with any non-printables escaped into octal by sed.
>
> > I thought I was on the COFF mailing list.
>
> I'm sending this to just the list.
>
> > I received this email by direct mail to from Larry.
>
> Perhaps your account on the list is configured to not send you an email
> if it sees your address in the header's fields.
>
> --
> Cheers, Ralph.
>

--
Advice is judged by results, not by intentions.
  Cicero
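As a footnote to the quoted grep pattern: a rough Python counterpart, assuming a bytes pattern compiled with re.VERBOSE (the /x style mentioned in the quoted message). The names DATE and find_dates are invented here. Python's \b stands in for grep's \< and \>, the surrounding `.?` context bytes are dropped, and the back-reference becomes \2 because the outer group is non-capturing in this version.

```python
import re

# Assumption: a Python rendering of the quoted grep pattern, not a
# drop-in replacement for it.
DATE = re.compile(
    rb"""
    \b
    (19[0-9][0-9] | 20[01][0-9] | 202[0-3])   # year
    (?:
        ([-:._])                              # separator, group 2
        (0[1-9] | 1[0-2])                     # month
        \2                                    # same separator again
        (0[1-9] | [12][0-9] | 3[01])          # day
    )?
    \b
    """,
    re.VERBOSE,
)


def find_dates(data):
    """Return (byte offset, matched bytes) for each date-like match."""
    return [(m.start(), m.group(0)) for m in DATE.finditer(data)]


if __name__ == "__main__":
    sample = b"2001-04-01,1999_12_31\n1944.03.01,1914!\n2000-01.01\n"
    for offset, text in find_dates(sample):
        print(f"{offset}:{text.decode('ascii')}")
```

For that sample, the offsets come out as 0, 11, 22, 33 and 39, the same as in the quoted grep run, with only the context bytes missing.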