caml-list - the Caml user's mailing list
From: Oleg <oleg@okmij.org>
To: caml-list@inria.fr
Cc: murthy.chet@gmail.com
Subject: [Caml-list] ANN: a small library for shell/AWK/Perl-like scripting
Date: Wed, 26 Aug 2020 16:44:23 +0900
Message-ID: <20200826074423.GA2109@Melchior.localnet>


Some time ago Chet Murthy asked about writing shell-like scripts in
OCaml. Prompted by it, I also want to report on my experience and
announce a small library that made it pleasant to do
shell/AWK/Perl-like scripting in OCaml.

The library is available at
        http://okmij.org/ftp/ML/myawk/0README.dr
and consists of two small ML files, myawk.ml and strings.ml. The
latter collects general-purpose string operations, more convenient
than those in Stdlib.String. The rest of that web directory contains
half a dozen sample scripts with comments.

Here is the first example: a typical AWK script, but written in OCaml:

    #!/bin/env -S ocaml

    #load "myawk.cma"
    open Myawk open Strings
    let hash = string_of_int <|> Hashtbl.hash
    ;;
    (* Sanitize the files originally used by join1.ml and join2.ml
       The files are made of space-separated fields; the first field is the
       key. It is sensitive; but because it is a key it can't be replaced with
       meaningless garbage. We obfuscate it beyond recognition. The third field 
       is obfuscated as well. The second and fourth can be left as they are,
       and the fifth, if present, is replaced with XXX

       The script is a proper filter: reads from stdin, writes to stdout
     *)

    for_each_line @@ map words @@ function (f1::f2::f3::f4::rest) ->
      print [hash f1; f2; hash f3; f4; if rest = [] then "" else "XXX"]
    ;;
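For readers without the library at hand, the per-line transformation of the script above can be approximated with Stdlib alone. This is only a sketch: the definitions of `<|>` and `words` below are assumed stand-ins inferred from how the script uses them, not the code in myawk.ml, and `sanitize` is a hypothetical name. Unlike the script, whose single-pattern `function` would raise Match_failure on a line with fewer than four fields, the sketch returns None for such lines.

```ocaml
(* Assumed definition: right-to-left composition, (f <|> g) x = f (g x). *)
let ( <|> ) f g = fun x -> f (g x)

(* Same shape as in the script; the annotation avoids a weak type variable. *)
let hash : string -> string = string_of_int <|> Hashtbl.hash

(* Hypothetical stand-in for the library's [words]:
   split on spaces and drop the empty fields. *)
let words (line : string) : string list =
  String.split_on_char ' ' line |> List.filter (fun w -> w <> "")

(* Obfuscate fields 1 and 3, keep fields 2 and 4, and replace whatever
   follows with XXX. Lines with fewer than four fields give None. *)
let sanitize (line : string) : string list option =
  match words line with
  | f1 :: f2 :: f3 :: f4 :: rest ->
      Some [hash f1; f2; hash f3; f4; if rest = [] then "" else "XXX"]
  | _ -> None
```

Keeping the transformation a pure function from a line to an optional list of fields makes it easy to try out in the toplevel before wiring it to stdin.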

Here <|> is a function composition. I wish it were in Stdlib. The real
example, used in real life, was performing a database join

   SELECT T2.* from Table1 as T1, Table2 as T2 where T1.f1 = T2.f1

where Table1 and Table2 are text files with space-separated column
values. Table1 is supposed to be fed to stdin:

let () =
  for_each_line @@ map words @@ 
  map_option (function (x::_) -> Some x | _ -> None) @@
  (ignore <|> shell "grep %s table2.txt")

It is a typical rough-and-dirty script. Alas, it was too rough: I was
so excited that it typechecked and worked the first time, that I didn't
look carefully at the output and overlooked what I was looking for
(resulting in an unneeded hassle and apology). I should have queried exactly
for what I wanted:
   SELECT T1.f1, T1.f4 FROM Table1 as T1, Table2 as T2
   WHERE T1.f1 = T2.f1 AND T1.f3 <> "3"

which is actually easy to write in myawk (probably not so in AWK though)

 let () =
   for_each_line ~fname:"table2.txt" @@ map words @@ 
   map_option (function (w::_) -> Some w | _ -> None) @@
   fun w -> 
     for_each_line ~fname:"table1.txt" @@  map words @@
     map_option (function 
      (x::f2::f3::f4::_) when x = w && f3 <> "3" -> Some [x;f4] | _ -> None) @@ 
     print

This is the classical nested loop join. Chet Murthy might be pleased to see
the extensive use of the continuation-passing style. I was
apprehensive at first, but it turned out not to be a hassle.
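The continuation-passing structure of the nested loop join can be tried out on in-memory lists. The sketch below is a guess at the shape of the combinators, inferred from how the scripts use them: `for_each`, `map`, `map_option` and `words` are hypothetical stand-ins, not the definitions in myawk.ml, and `join` is a made-up name. It collects the selected rows in a list instead of printing them.

```ocaml
(* Hypothetical stand-ins: each combinator takes the continuation [k]
   that consumes its result. *)
let for_each (xs : 'a list) (k : 'a -> unit) : unit = List.iter k xs

let map (f : 'a -> 'b) (k : 'b -> unit) : 'a -> unit = fun x -> k (f x)

let map_option (f : 'a -> 'b option) (k : 'b -> unit) : 'a -> unit =
  fun x -> match f x with Some y -> k y | None -> ()

let words (line : string) : string list =
  String.split_on_char ' ' line |> List.filter (fun w -> w <> "")

(* SELECT T1.f1, T1.f4 FROM Table1 as T1, Table2 as T2
   WHERE T1.f1 = T2.f1 AND T1.f3 <> "3"
   as a nested loop: for every key in table2, scan all of table1. *)
let join (table1 : string list) (table2 : string list) : string list list =
  let out = ref [] in
  (for_each table2 @@ map words @@
   map_option (function w :: _ -> Some w | [] -> None) @@
   fun w ->
     for_each table1 @@ map words @@
     map_option (function
       | x :: _ :: f3 :: f4 :: _ when x = w && f3 <> "3" -> Some [x; f4]
       | _ -> None) @@
     fun row -> out := row :: !out);
  List.rev !out
```

For example, `join ["a 1 2 x"; "b 1 3 y"] ["a foo"; "b bar"]` evaluates to `[["a"; "x"]]`: the row with key b is dropped because its third field is "3".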

The library has a few other examples, including case-branching and
rewriting a real AWK script from the OCaml distribution.

Finally, let's compare with shell scripts. The example below doesn't
show off the library, but it does show the benefits of OCaml
for scripting. The original shell script is a sample GIT commit hook,
quoted in the comments:

        (*
        From GIT's sample hooks: 
          ANY-GIT-REPO/.git/hooks/commit-msg.sample

          # Called by "git commit" with one argument, the name of the file
          # that has the commit message.  The hook should exit with non-zero
          # status after issuing an appropriate message if it wants to stop the
          # commit.  The hook is allowed to edit the commit message file.

          # This example catches duplicate Signed-off-by lines.

        test "" = "$(grep '^Signed-off-by: ' "$1" |
                 sort | uniq -c | sed -e '/^[ 	]*1[ 	]/d')" || {
                echo >&2 Duplicate Signed-off-by lines.
                exit 1
        }

        *)
        module H = Hashtbl

        let commit_msg = Sys.argv.(1)
        let ht = H.create 5
        let () =
          for_each_line ~fname:commit_msg @@ fun l ->
          if is_prefix "Signed-off-by: " l <> None then begin
            if H.find_opt ht l <> None then begin
              prerr_endline "Duplicate Signed-off-by lines.";
              exit 1
            end else
              H.add ht l ()
          end

Although the OCaml script seems to have more characters, one doesn't
need to type them all. Scripts like that are meant to be entered in an
editor; even ancient editors have completion facilities.

Looking at the original shell script brings despair, and drives me
right towards Unix Haters. Not only is the script algorithmically
ugly: if a duplicate signed-off line occurs near the beginning, we can
report it right away and stop. We don't need to read the rest of the
commit message, filter it, sort it, precisely count all duplicates and
filter again. Not only does the script gratuitously waste system
resources (read: the laptop battery) by launching many processes and
allocating communication buffers. Mainly, the script isn't good at its
primary purpose: it isn't easy to write and read. Pipeline composition
of small stream processors is generally a good thing -- but not when each
stream processor is written in its own idiosyncratic
language. Incidentally, I have doubts about the script: I think that
quotes around $1 are meant to be embedded; but why are they not
escaped then? Probably it is some edge case of bash, out of several
thousands.

In contrast, the OCaml script does exactly what is required, with no extra
work. Everything is written in only one language.
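The stop-at-the-first-duplicate behaviour can be checked in isolation with Stdlib alone. This is a minimal sketch, not the hook itself: it works on a list of lines rather than a file, `has_prefix` is a hand-written stand-in for the library's `is_prefix`, and the names `Duplicate` and `first_duplicate_signoff` are hypothetical.

```ocaml
(* Stand-in for the library's prefix test, written against Stdlib only. *)
let has_prefix ~(prefix : string) (s : string) : bool =
  String.length s >= String.length prefix
  && String.sub s 0 (String.length prefix) = prefix

(* Raised to leave the scan as soon as a repeated line is seen. *)
exception Duplicate of string

(* Return the first Signed-off-by line that occurs a second time, if any.
   The lines after that second occurrence are never examined. *)
let first_duplicate_signoff (lines : string list) : string option =
  let seen = Hashtbl.create 5 in
  try
    List.iter (fun l ->
      if has_prefix ~prefix:"Signed-off-by: " l then begin
        if Hashtbl.mem seen l then raise (Duplicate l);
        Hashtbl.add seen l ()
      end) lines;
    None
  with Duplicate l -> Some l
```

A hook built on this would print the message to stderr and exit with status 1 when the result is `Some _`, as the script above does.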

