caml-list - the Caml user's mailing list
* [Caml-list] ANN: a small library for shell/AWK/Perl-like scripting
@ 2020-08-26  7:44 Oleg
From: Oleg @ 2020-08-26  7:44 UTC
  To: caml-list; +Cc: murthy.chet

Some time ago Chet Murthy asked about writing shell-like scripts in
OCaml. Prompted by it, I also want to report on my experience and
announce a small library that made it pleasant to do
shell/AWK/Perl-like scripting in OCaml.

The library is available at
and consists of two small ML files, and The
latter collects general-purpose string operations, more convenient
than those in Stdlib.String. The rest of that web directory contains
half a dozen sample scripts with comments.

Here is the first example: a typical AWK script, but written in OCaml:

    #!/bin/env -S ocaml

    #load "myawk.cma"
    open Myawk open Strings
    let hash = string_of_int <|> Hashtbl.hash
    (* Sanitize the files originally used by and
       The files are made of space-separated fields; the first field is the
       key. It is sensitive; but because it is a key it can't be replaced with
       meaningless garbage. We obfuscate it beyond recognition. The third field 
       is obfuscated as well. The second and fourth can be left as they are,
       and the fifth, if present, is replaced with XXX

       The script is a proper filter: reads from stdin, writes to stdout
    *)

    for_each_line @@ map words @@ function (f1::f2::f3::f4::rest) ->
      print [hash f1; f2; hash f3; f4; if rest = [] then "" else "XXX"]
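
For readers without the library at hand, the per-line transformation can
be tried on its own. What follows is only a sketch under assumptions:
`<|>` is defined here as plain function composition, `words` is a
hypothetical stand-in that splits on spaces, and `sanitize` is a name
introduced for this illustration. None of it is taken from the library
source.

```ocaml
(* Assumed definition: right-to-left function composition. *)
let ( <|> ) f g = fun x -> f (g x)

(* Obfuscate a string by replacing it with its hash, printed as a number.
   The annotation keeps the type from being left weakly polymorphic. *)
let hash : string -> string = string_of_int <|> Hashtbl.hash

(* Hypothetical stand-in for the library's [words]: split on spaces,
   dropping the empty fields produced by repeated spaces. *)
let words (s : string) : string list =
  List.filter (fun w -> w <> "") (String.split_on_char ' ' s)

(* Fields 1 and 3 are hashed, 2 and 4 are kept, and the fifth output field
   is "XXX" if anything follows the fourth field, "" otherwise.
   Lines with fewer than four fields are returned unchanged; that is a
   choice made for this sketch only. *)
let sanitize (fields : string list) : string list =
  match fields with
  | f1 :: f2 :: f3 :: f4 :: rest ->
      [hash f1; f2; hash f3; f4; (if rest = [] then "" else "XXX")]
  | short -> short

let () =
  print_endline
    (String.concat " " (sanitize (words "alice 10 secret ok extra")))
```

The same key always maps to the same obfuscated value, so the sanitized
file can still be joined on its first field.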

Here <|> is a function composition. I wish it were in Stdlib. The real
example, used in real life, was performing a database join

   SELECT T2.* from Table1 as T1, Table2 as T2 where T1.f1 = T2.f1

where Table1 and Table2 are text files with space-separated column
values. Table1 is supposed to be fed to stdin:

let () =
  for_each_line @@ map words @@ 
  map_option (function (x::_) -> Some x | _ -> None) @@
  (ignore <|> shell "grep %s table1.txt")
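
To make the shape of such pipelines concrete, here is a minimal sketch
of how combinators of this kind could be written. These are hypothetical
re-implementations inferred from how the script uses them, not the
library's actual code; `for_each_line_in`, `for_each_of` and
`first_fields` are names made up for this note.

```ocaml
(* Feed every line of a channel to the consumer [k]. *)
let for_each_line_in (ic : in_channel) (k : string -> unit) : unit =
  try
    while true do k (input_line ic) done
  with End_of_file -> ()

(* The same driver over an in-memory list, convenient for experiments. *)
let for_each_of (lines : string list) (k : string -> unit) : unit =
  List.iter k lines

(* [map f k] transforms each item with [f] before handing it to [k]. *)
let map (f : 'a -> 'b) (k : 'b -> unit) : 'a -> unit =
  fun x -> k (f x)

(* [map_option f k] forwards only the items for which [f] returns [Some]. *)
let map_option (f : 'a -> 'b option) (k : 'b -> unit) : 'a -> unit =
  fun x -> match f x with Some y -> k y | None -> ()

(* Hypothetical stand-in for [words]: split on spaces, drop empty fields. *)
let words (s : string) : string list =
  List.filter (fun w -> w <> "") (String.split_on_char ' ' s)

(* Collect the first field of every non-empty line, in input order. *)
let first_fields (lines : string list) : string list =
  let acc = ref [] in
  (for_each_of lines @@ map words @@
   map_option (function x :: _ -> Some x | [] -> None) @@
   fun x -> acc := x :: !acc);
  List.rev !acc
```

Each stage takes the rest of the pipeline as its last argument, which is
why the stages chain with `@@` and no intermediate lists are built.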

It is a typical rough-and-dirty script. Alas, it was too rough: I was
so excited that it typechecked and worked the first time, that I didn't
look carefully at the output and overlooked what I was looking for
(resulting in an unneeded hassle and apology). I should have queried exactly
for what I wanted:
   SELECT T1.f1, T1.f4 FROM Table1 as T1, Table2 as T2
   WHERE T1.f1 = T2.f1 AND T1.f3 <> "3"

which is actually easy to write in myawk (probably not so in AWK though)

 let () =
   for_each_line ~fname:"table2.txt" @@ map words @@ 
   map_option (function (w::_) -> Some w | _ -> None) @@
   fun w -> 
     for_each_line ~fname:"table1.txt" @@  map words @@
     map_option (function 
      (x::f2::f3::f4::_) when x = w && f3 <> "3" -> Some [x;f4] | _ -> None) @@
     print

This is the classical nested loop join. Chet Murthy might be pleased to see
the extensive use of the continuation-passing style. I was
apprehensive at first, but it turned out not to be a hassle.
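
The nested loop join can also be sketched without the library, over
in-memory tables. This is an illustration only: `row`,
`nested_loop_join` and the `~excluded` parameter are hypothetical names,
and the filter follows the SQL query (T1.f3 differs from the excluded
value).

```ocaml
(* A table is modeled as a list of rows, and a row as a list of fields. *)
type row = string list

(* For each key in the first column of [t2], scan all of [t1] and pass
   [f1; f4] of every matching row whose third field differs from
   [excluded] to the continuation [k]. *)
let nested_loop_join ~(excluded : string) (t1 : row list) (t2 : row list)
    (k : row -> unit) : unit =
  List.iter
    (function
      | w :: _ ->
          List.iter
            (function
              | x :: _ :: f3 :: f4 :: _ when x = w && f3 <> excluded ->
                  k [x; f4]
              | _ -> ())
            t1
      | [] -> ())
    t2

let () =
  let t1 = [["a"; "x"; "1"; "A4"]; ["a"; "x"; "3"; "skipped"]] in
  let t2 = [["a"]] in
  nested_loop_join ~excluded:"3" t1 t2
    (fun r -> print_endline (String.concat " " r))
```

The inner table is rescanned for every outer key, so the work grows with
the product of the two table sizes. That is acceptable for small text
files like the ones in question.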

The library has a few other examples, including case-branching and
rewriting a real AWK script from the OCaml distribution.

Finally, let's compare with shell scripts. The example below doesn't
show off the library, but it does show the benefits of OCaml
for scripting. The original shell script is a sample GIT commit hook,
quoted in the comments:

        From GIT's sample hooks: 

          # Called by "git commit" with one argument, the name of the file
          # that has the commit message.  The hook should exit with non-zero
          # status after issuing an appropriate message if it wants to stop the
          # commit.  The hook is allowed to edit the commit message file.

          # This example catches duplicate Signed-off-by lines.

        test "" = "$(grep '^Signed-off-by: ' "$1" |
                 sort | uniq -c | sed -e '/^[ 	]*1[ 	]/d')" || {
                echo >&2 Duplicate Signed-off-by lines.
                exit 1
        }

        module H = Hashtbl

        let commit_msg = Sys.argv.(1)
        let ht = H.create 5
        let () =
          for_each_line ~fname:commit_msg @@ fun l ->
          if is_prefix "Signed-off-by: " l <> None then begin
            if H.find_opt ht l <> None then begin
              prerr_endline "Duplicate Signed-off-by lines.";
              exit 1
            end else
              H.add ht l ()
          end

Although the OCaml script seems to have more characters, one doesn't
need to type them all. Scripts like that are meant to be entered in an
editor; even ancient editors have completion facilities.

Looking at the original shell script brings despair, and drives me
right towards Unix Haters. Not only is the script algorithmically
ugly: if a duplicate Signed-off-by line occurs near the beginning, we can
report it right away and stop. We don't need to read the rest of the
commit message, filter it, sort it, precisely count all duplicates and
filter again. Not only does the script gratuitously waste system
resources (read: the laptop battery) by launching many processes and
allocating communication buffers. Mainly, the script isn't good at its
primary purpose: it isn't easy to write and read. Pipeline composition
of small stream processors is generally a good thing -- but not when each
stream processor is written in its own idiosyncratic
language. Incidentally, I have doubts about the script: I think that
the quotes around $1 are meant to be embedded; but why are they not
escaped then? Probably it is some edge case of bash, out of several

In contrast, the OCaml script does exactly what is required, with no extra
work. Everything is written in only one language.
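
The early-exit idea can be checked in isolation with a small, testable
sketch that uses only Stdlib. It is not the hook itself: `has_prefix`
and `first_duplicate_signoff` are hypothetical names, and the input is a
list of lines rather than the commit message file.

```ocaml
(* Hypothetical helper: does [s] start with [prefix]? *)
let has_prefix (prefix : string) (s : string) : bool =
  let n = String.length prefix in
  String.length s >= n && String.sub s 0 n = prefix

(* Return the first Signed-off-by line that repeats an earlier one, if any.
   [List.find_opt] stops at the first hit, so later lines are not examined. *)
let first_duplicate_signoff (lines : string list) : string option =
  let seen = Hashtbl.create 5 in
  List.find_opt
    (fun l ->
      has_prefix "Signed-off-by: " l
      && (Hashtbl.mem seen l || (Hashtbl.add seen l (); false)))
    lines

let () =
  match first_duplicate_signoff
          ["Fix typo"; "Signed-off-by: A"; "Signed-off-by: A"] with
  | Some _ -> prerr_endline "Duplicate Signed-off-by lines."
  | None -> ()
```

Each line is looked up once in the hash table, and no sorting or
counting is needed.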


* Re: [Caml-list] ANN: a small library for shell/AWK/Perl-like scripting
@ 2020-08-26 18:54 ` orbifx 🦊
From: orbifx 🦊 @ 2020-08-26 18:54 UTC
  To: Oleg, caml-list, murthy.chet

Thanks for sharing this Oleg. I'd like to try this one day :)
