The Unix Heritage Society mailing list
 help / color / mirror / Atom feed
* [TUHS] Trying to date "A Supplemental Document For Awk"
@ 2023-06-28  6:26 Aharon Robbins
  2023-06-28  6:45 ` [TUHS] " arnold
                   ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Aharon Robbins @ 2023-06-28  6:26 UTC (permalink / raw)
  To: tuhs

[-- Attachment #1: Type: text/plain, Size: 558 bytes --]

Hi All.

Attached is "A Supplemental Document For Awk". This circulated on USENET
in the 80s.  My copy is dated January 18, 1989, but I'm sure it's
older than that.  One clue is the reference to the 4.2 BSD manual,
and 4.3 came out already in 1986 or so.

Does anyone else have a copy of this with perhaps an older date?

As far as I can tell from a short search, the author is no
longer living.  If someone knows better and can provide contact
info for him, that'd be great.

In the meantime, Warren, do you want to add it to the archives?

Thanks!

Arnold

[-- Attachment #2: awkdoc --]
[-- Type: text/plain, Size: 19193 bytes --]

.RP
.TL
.B
A Supplemental Document For AWK
.sp
.R
- or -
.sp
.I
Things Al, Pete, And Brian Didn't Mention Much
.R
.AU
John W. Pierce
.AI
Department of Chemistry
University of California, San Diego
La Jolla, California  92093
jwp%chem@sdcsvax.ucsd.edu
.AB
As
.B awk
and its documentation are distributed with
.I
4.2 BSD UNIX*
.R
there are a number of bugs, undocumented features,
and features that are touched on so briefly in the
documentation that the casual user may
not realize their full significance.  While this document
applies primarily to the \fI4.2 BSD\fR version of \fIUNIX\fR,
it is known that the \fI4.3 BSD\fR version does not have
all of the bugs fixed, and that it does not have updated
documentation.  The situation with respect to the versions
of \fBawk\fR disitributed with other versions \fIUNIX\fR and
similar systems is unknown to the author.
.FS
*UNIX is a trademark of AT&T
.FE
.AE
.LP
In this document references to "the user manual" mean
.I
Awk - A Pattern Scanning and Processing Language (Second Edition)
.R
by Aho, Kernighan, and Weinberger.  References to "awk(1)" mean
the entry for
.B awk
in the
.I
UNIX Programmer's Manual, 4th Berkeley Distribution.
.R
References to "the documentation" mean both of those.
.LP
In most examples, the outermost set of braces ('{ }') have been
ommitted.  They would, of course, be necessary in real scripts.
.NH
Known Bugs
.LP
There are three main bugs known to me.  They involve:
.IP
Assignment to input fields.
.IP
Piping output to a program from within an \fBawk\fR script.
.IP
Using '*' in \fIprintf\fR field width and precision specifications.
.NH 2
Assignment to Input Fields
.LP
[This problem is partially fixed in \fI4.3BSD\fR;
see the last paragraph of this section regarding the unfixed portion.]
.LP
The user manual states that input fields may be objects of assignment
statements.  Given the input line
.DS
field_one field_two field_three
.DE
the script
.DS
$2 = "new_field_2"
print $0
.DE
should print
.DS
field_one new_field_2 field_three
.DE
.LP
This does not work; it will print
.DS
field_one field_two field_three
.DE
That is, the script will behave as if the
assignment to $2 had not been made.  However,
explicitly referencing an "assigned to" field
.I does
recognize that the assignment has been made.
If the script
.DS
$2 = "new_field_2"
print $1, $2, $3
.DE
is given the same input it will [properly] print
.DS
field_one new_field_2 field_three
.DE
Therefore, you can
get around this bug with, e.g.,
.DS
$2 = "new_field_2"
output = $1                       # Concatenate output fields
for(i = 2; i <= NF; ++i)          # into a single output line
	output = output OFS $i    # with OFS between fields
print output
.DE
.LP
In \fI4.3BSD\fR, this bug has been fixed to the extent that
the failing example above works correctly.  However, a script like
.DS
$2 = "new_field_2"
var = $0
print var
.DE
still gives incorrect output.  This problem can be bypassed by using
.DS
\fIvar\fR = sprintf("%s", $0)
.DE
instead of "\fIvar\fR = $0"; \fIvar\fR will have the correct value.
.NH 2
Piping Output to a Program
.LP
[This problem appears to have been fixed in \fI4.3BSD\fR,
but that has not been exhaustively tested.]
.LP
The user manual states that
.I print
and
.I printf
statements may write to a program using, e.g.,
.DS
print | "\fIcommand\fR"
.DE
This would pipe the output into \fIcommand\fR, and it
does work.  However, you should be aware that this causes
.B awk
to spawn a child process (\fIcommand\fR), and that it
.I
does not
.R
wait for the child to exit before it exits itself.  In the case of a
"slow" command like
.B sort,
.B awk
may exit before
.I command
has finished.
.LP
This can cause problems in, for example, a shell script that
depends on everything done by
.B awk
being finished before the next shell command is executed.
Consider the shell script
.DS
awk -f awk_script input_file
mv sorted_output somewhere_else
.DE
and the
.B awk
script
.DS
print output_line | "sort -o sorted_output"
.DE
If
.I input_file
is large
.B awk
will exit long before
.B sort
is finished.  That means that the
.B mv
command will be executed before
.B sort
is finished, and the result is unlikely to be what you wanted.
Other than fixing the source, there is no way to avoid this
problem except to handle such pipes outside of the awk script, e.g.
.DS
awk -f awk_file input_file | sort -o sorted_output
mv sorted_output somewhere_else
.DE
which is not wholly satisfactory.
.LP
See
.I
Sketchily Documented Features
.R
below for other considerations in redirecting
output from within an
.B awk
script.
.NH 2
Printf Field Width and Precision Specification With '*'
.LP
The document says that the \fIprintf\fR function provided is
identical to the \fIprintf\fR provided by the \fIC\fR language
\fBstdio\fR package.  This is not true for the case of using '*' to
specify a field width or precision.  The command
.DS
printf("%*.s", len, string)
.DE
will cause a core dump.  Given \fBawk\fR's age, it is likely
that its \fIprintf\fR was written well before the use of '*'
for specifying field width and precision appeared in the \fBstdio\fR
library's \fIprintf\fR.  Another possibility is that it wasn't
implemented because it isn't really needed to achieve the same effect.
.LP
To accomplish this effect, you can utilize the fact that \fBawk\fR
concatenates variables before it does any other processing on them.
For example, assume a script has two variables \fIwid\fR and
\fIprec\fR which control the width and precision used for printing
another variable \fIval\fI:
.DS
[code to set "wid", "prec", and "val"]

printf("%" wid "." prec "d\en", val)
.DE
If, for example, \fIwid\fR is 8 and \fIprec\fR is 3, then /fBawk\fR
will concatenate everything to the left of the comma in
the \fIprintf\fR statement, and the statement will really be
.DS
printf(%8.3d\en, val)
.DE
These could, of course, been assigned to some variable \fIfmt\fR before
being used:
.DS
fmt = "%" wid "." prec "d"

printf(fmt "\en", val)
.DE
Note, however, that the newline ("\en") in the second form \fIcannot\fR
be included in the assignment to \fIfmt\fR.
.bp
.NH
Undocumented Features
.LP
There are several undocumented features:
.IP
Variable values may be established on the command line.
.IP
A
.B getline
function exists that reads the next input line and starts processing it
immediately.
.IP
Regular expressions accept octal representations of characters.
.IP
A
.B -d
flag argument produces debugging output if
.B awk
was compiled with "DEBUG" defined.
.IP
Scripts may be "compiled" and run later (providing the installer
did what is necessary to make this work).
.NH 2
Defining Variables On The Command Line
.LP
To pass variable values into a script at run time, you may use
.IP
.I variable=value
.LP
(as many as you like) between any "\fB-f \fIscriptname\fR" or
.I program
and the names of any files to be processed.  For example,
.DS
awk -f awkscript today=\e"`date`\e" infile
.DE
would establish for
.I awkscript
a variable named
.B today
that had as its value the output of the
.B date
command.
.LP
There are a number of caveats:
.IP
Such assignments may appear only between
.B -f
.I awkscript
(or \fIprogram\fR or [see below] \fB-R\fIawk.out\fR)
and the name of any
input file (or '-').
.IP
Each
.I variable=value
combination must be a single argument (i.e. there must not be spaces
around the '=' sign);
.I value
may be either a numeric value or a string.  If it is a string,
it must be enclosed in
double quotes at the time \fBawk\fR reads the argument.  That means
that the double quotes enclosing \fIvalue\fR on the command line
must be protected from the shell as in the example above or it will
remove them.
.IP
.I Variable
is not available for use within the script until after the first record
has been read and parsed, but it is available as soon as
that has occurred so that it may be used before any other
processing begins.  It does not exist at the time the
.B BEGIN
block is executed, and if there was no input it will not exist in the
.B END
block (if any).
.NH 2
Getline Function
.LP
.B Getline
immediately reads the next input line (which is parsed into \fI$1\fR,
\fI$2\fR, etc) and starts processing it at the location of the call
(as opposed to
.B next
which immediately reads the next input line but starts processing
from the start of the script).
.LP
.B Getline
facilitates performing some types of tasks such as
processing files with multiline records and merging
information from several files.  To use the latter as an example,
consider a case where two files, whose lines do not share
a common format, must be processed together.  Shell and \fBawk\fR
scripts to do this might look something like
.sp
In the shell script
.DS
( echo DATA1; cat datafile1; echo ENDdata1 \e
  echo DATA2; cat datafile2; echo ENDdata2 \e
) | \e
    awk -f awkscript - > awk_output_file
.DE
In the
.B awk
script
.DS
/^DATA1/  {       # Next input line starts datafile1
          while (getline && $1 !~ /^ENDdata1$/)
                 {
                 [processing for \fIdata1\fR lines]
                 }
          }
.sp 1
/^DATA2/  {       # Next input line starts datafile2
          while (getline && $1 !~ /^ENDdata2$/)
                 {
                 [processing for \fIdata2\fR lines]
                 }
          }
.DE
There are, of course, other ways of accomplishing this particular task
(primarily using \fBsed\fR to preprocess the information),
but they are generally more difficult to write and more
subject to logic errors.  Many cases arising in practice
are significantly more difficult, if not impossible, to handle
without \fBgetline\fR.
.NH 2
Regular Expressions
.LP
The sequence "\fI\eddd\fR" (where 'd' is a digit)
may be used to include explicit octal
values in regular expressions.  This is often useful if "nonprinting"
characters have been used as "markers" in a file.  It has not been
tested for ASCII values outside the range 01 through 0127.
.NH 2
Debugging output
.LP
[This is unlikely to be of interest to the casual user.]
.sp
If \fBawk\fR was compiled with "DEBUG" defined, then giving it a
.B -d
flag argument will cause it to produce debugging output when it is run.
This is sometimes useful in finding obscure problems in scripts, though
it is primarily intended for tracking down problems with \fBawk\fR itself.
.NH 2
Script "Compilation"
.LP
[It is likely that this does not work at most sites.  If it does not, the
following will probably not be of interest to the casual user.]
.sp
The command
.DS
awk -S -f script.awk
.DE
produces a file named
.B awk.out.
This is a core image of
.B awk
after parsing the file
.I script.awk.
The command
.DS
awk -Rawk.out datafile
.DE
causes
.B awk.out
to be applied to \fIdatafile\fR (or the standard input if no
input file is given).  This avoids having to reparse large
scripts each time they are used.  Unfortunately, the way this
is implemented requires some special action on the part of the
person installing \fBawk\fR.
.LP
As \fBawk\fR is delivered with \fI4.2 BSD\fR (and \fI4.3 BSD\fR),
.I awk.out
is created by the \fBawk -S ...\fR process by calling
.B sbrk()
with '0', writing out the returned value, then
writing out the core image from location 0 to
the returned address.  The \fBawk -R...\fR process
reads the first word of
.I awk.out
to get the length of the image, calls
.B brk()
with that length, and
then reads the image into itself starting at location 0.
For this to work, \fBawk\fR must have been loaded with its
text segment writeable.  Unfortunately,
the \fIBSD\fR default for \fBld\fR is to load with the text
read-only and shareable.  Thus, the installer must remember to take
special action (e.g. "cc -N ..."
[equivalently "ld -N ..."] for \fI4BSD\fR) if these
flags are to work.
.LP
[Personally, I don't think it is
a very good idea to give \fBawk\fR the opportunity
to write on its text segment; I changed it so that
only the data segment is overwritten.]
.LP
Also, due to what appears to be a lapse in logic, the first
non-flag argument following \fB-R\fIawk.out\fR is discarded.
[Disliking that behavior, the I changed it so that the \fB-R\fR flag
is treated like the \fB-f\fR flag:  no flag arguments may follow it.]
.bp
.NH
Sketchily Documented Features
.LP
.NH 2
Exit
.LP
The user manual says that using the
.B exit
function causes the script to behave as if end-of-input has been reached.
Not menitoned explicitly is the fact that this will cause the
.B END
block to be executed if it exists.
Also, two things are ommitted:
.IP
\fBexit(\fIexpr\fB)\fR causes the script's exit status to be
set to the value of \fIexpr\fR.
.IP
If
.B exit
is called within the
.B END
block, the script exits immediately.
.NH 2
Mathematical Functions
.LP
The following builtin functions exist and are mentioned in
.I awk(1)
but not in the user manual.
.IP \fBint(\fIx\fB)\fR 10
\fIx\fR trunctated to an integer.
.IP \fBsqrt(\fIx\fB)\fR 10
the square root of \fIx\fR for \fIx\fR >= 0, otherwise zero.
.IP \fBexp(\fIx\fB)\fR 10
\fBe\fR-to-the-\fIx\fR for -88 <= \fIx\fR <= 88, zero
for \fIx\fR < -88, and dumps core for \fIx\fR > 88.
.IP \fBlog(\fIx\fB)\fR 10
the natural log of \fIx\fR.
.NH 2
OFMT Variable
.LP
The variable
.B OFMT
may be set to, e.g. "%.2f", and purely numerical output will be
bound by that restriction in
.B print
statements.  The default value is "%.6g".  Again, this is mentioned in
.I awk(1)
but not in the user manual.
.NH 2
Array Elements
.LP
The user manual states that "Array elements ... spring into existence by
being mentioned."  This is literally true;
.I any
reference to an array element causes it to exist.
("I was thought about, therefore I am.")
Take, for example,
.DS
if(array[$1] == "blah")
	{
	[process blah lines]
	}
.DE
If there is not an existing element of
.B array
whose subscript is the same as the contents of the
current line's first field,
.I
one is created
.R
and its value (null, of course) is then compared
with "blah".  This can be a bit
disconcerting, particularly when later processing is using
.DS
for (i in \fBarray\fR)
        {
        [do something with result of processing
	"blah" lines]
        }
.DE
to walk the array and expects all the elements to be non-null.
Succinct practical examples are difficult to construct, but
when this happens in a 500 line
script it can be difficult to determine what has gone wrong.
.NH 2
FS and Input Fields
.LP
By default any number of spaces or tabs can separate fields (i.e.
there are no null input fields) and trailing spaces and tabs
are ignored.  However, if
.B FS
is explicitly set to any character other than a space
(e.g., a tab: \fBFS = "\et"\fR), then a field is defined
by each such character and trailing field separator characters are
not ignored.  For example, if '>' represents a tab then
.DS
one>>three>>five>
.DE
defines six fields, with fields two, four, and six being empty.
.LP
If
.B FS
is explicitly set to a space (\fBFS\fR = "\ "), then
the default behavior obtains (this may be a bug); that
is, both spaces
and tabs are taken as field separators, there can be no
null input fields, and trailing spaces and tabs are ignored.
.NH 2
RS and Input Records
.LP
If
.B RS
is explicitly set to the null string (\fBRS\fR = ""), then the input
record separator becomes a blank line, and the newlines at the end
of input lines is a field separator.  This facilitates
handling multiline records.
.NH 2
"Fall Through"
.LP
This is mentioned in the user manual, but it is important
enough that it is worth pointing out here, also.
.LP
In the script
.DS
/\fIpattern_1\fR/  {
             [do something]
             }
.sp
/\fIpattern_2\fR/  {
             [do something]
             }
.DE
all input lines will be compared with both 
.I pattern_1
and
.I pattern_2
unless the
.B next
function is used before the closing '}' in the
.I pattern_1
portion.
.NH 2
Output Redirection
.LP
Once a file (or pipe) is opened by
.B awk
it is not closed until
.B awk
exits.  This can occassionally cause problems.  For example,
it means that a script that sorts its input lines into
output files named by the contents of their first fields
(similar to an example in the user manual)
.DS
{ print $0 > $1 }
.DE
is going to fail if the number of different first fields exceeds
about 10.
This problem
.I cannot
be avoided by using something like
.DS
{
command = "cat >> " $1
print $0 | command
}
.DE
as the value of the variable
.B command
is different for each different value of
.I $1
and is therefore treated as a different output "file".
.LP
[I have not been able to create a truly satisfactory
fix for this that doesn't involve having \fBawk\fR treat output
redirection to pipes differently from output to files; I
would greatly appreciate hearing of one.]
.NH 2
Field and Variable Types, Values, and Comparisons
.LP
The following is a synopsis of notes included with \fBawk\fR's
source code.
.NH 3
Types
.LP
Variables and fields can be strings or numbers or both.
.NH 4
Variable Types
.LP
When a variable is set by the assignment
.DS
\fIvar\fR = \fIexpr\fR
.DE
its type is set to the type of
.I expr
(this includes +=, ++, etc). An arithmetic
expression is of type
.I number,
a concatenation is of type
.I string,
etc.
If the assignment is a simple copy, e.g.
.DS
\fIvar1\fR = \fIvar2\fR
.DE
then the type of
.I var1
becomes that of
.I var2.
.LP
Type is determined by context; rarely, but always very inconveniently,
this context-determined type is incorrect.  As mentioned in
.I awk(1)
the type of an expression can be coerced to that desired.  E.g.
.DS
{
\fIexpr1\fR + 0
.sp 1
\fIexpr2\fR ""    # Concatenate with a null string
}
.DE
coerces
.I expr1
to numeric type and
.I expr2
to string type.
.NH 4
Field Types
.LP
As with variables, the type of a field is determined by
context when possible, e.g.
.RS
.IP $1++ 8
clearly implies that \fI$1\fR is to be numeric, and
.IP $1\ =\ $1\ ","\ $2 16
implies that $1 and $2 are both to be strings.
.RE
.LP
Coercion is done as needed.
In contexts where types cannot be reliably determined, e.g.,
.DS
if($1 == $2) ...
.DE
the type of each field is determined on input by inspection.  All fields are
strings; in addition, each field that contains only a number
is also considered numeric.  Thus, the test
.DS
if($1 == $2) ...
.DE
will succeed on the inputs
.DS
0       0.0
100     1e2
+100    100
1e-3    1e-3
.DE
and fail on the inputs
.DS
(null)      0
(null)      0.0
2E-518      6E-427
.DE
"only a number" in this case means matching the regular expression
.DS
^[+-]?[0-9]*\e.?[0-9]+(e[+-]?[0-9]+)?$
.DE
.NH 3
Values
.LP
Uninitialized variables have the numeric value 0 and the string value "".
Therefore, if \fIx\fR is uninitialized,
.DS
if(x) ...
if (x == "0") ...
.DE
are false, and
.DS
if(!x) ...
if(x == 0) ...
if(x == "") ...
.DE
are true.
.LP
Fields which are explicitly null have the string value "", and are not numeric.
Non-existent fields (i.e., fields past \fBNF\fR) are also treated this way.
.NH 3
Types of Comparisons
.LP
If both operands are numeric, the comparison is made
numerically.  Otherwise, operands are coerced to type
string if necessary, and the comparison is made on strings.
.NH 3
Array Elements
.LP
Array elements created by
.B split
are treated in the same way as fields.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [TUHS] Re: Trying to date "A Supplemental Document For Awk"
  2023-06-28  6:26 [TUHS] Trying to date "A Supplemental Document For Awk" Aharon Robbins
@ 2023-06-28  6:45 ` arnold
  2023-06-28 17:48 ` Adam Sampson
  2023-06-29  0:26 ` Jeremy C. Reed
  2 siblings, 0 replies; 20+ messages in thread
From: arnold @ 2023-06-28  6:45 UTC (permalink / raw)
  To: tuhs, arnold

Hmmm, skimming the file for the first time in a long time, I see that
he references 4.3 BSD as well.  Clearly, this document evolved over
time. I would still be interested in earlier versions if anyone has.

Thanks,

Arnold

Aharon Robbins <arnold@skeeve.com> wrote:

> Hi All.
>
> Attached is "A Supplemental Document For Awk". This circulated on USENET
> in the 80s.  My copy is dated January 18, 1989, but I'm sure it's
> older than that.  One clue is the reference to the 4.2 BSD manual,
> and 4.3 came out already in 1986 or so.
>
> Does anyone else have a copy of this with perhaps an older date?
>
> As far as I can tell from a short search, the author is no
> longer living.  If someone knows better and can provide contact
> info for him, that'd be great.
>
> In the meantime, Warren, do you want to add it to the archives?
>
> Thanks!
>
> Arnold

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [TUHS] Re: Trying to date "A Supplemental Document For Awk"
  2023-06-28  6:26 [TUHS] Trying to date "A Supplemental Document For Awk" Aharon Robbins
  2023-06-28  6:45 ` [TUHS] " arnold
@ 2023-06-28 17:48 ` Adam Sampson
  2023-06-28 18:03   ` KenUnix
  2023-06-29  0:26 ` Jeremy C. Reed
  2 siblings, 1 reply; 20+ messages in thread
From: Adam Sampson @ 2023-06-28 17:48 UTC (permalink / raw)
  To: tuhs

[-- Attachment #1: Type: text/plain, Size: 1169 bytes --]

On Wed, Jun 28, 2023 at 09:26:02AM +0300, Aharon Robbins wrote:
> Attached is "A Supplemental Document For Awk". This circulated on
> USENET in the 80s.  My copy is dated January 18, 1989, but I'm sure
> it's older than that.

In the utzoo Usenet archive, there are two versions of this document and
a few mentions of it...

John Pierce posted to comp.unix.questions on 1989-04-02, saying he'd
written it "four or five years ago".

Stu Heiss, in comp.unix.questions on 1989-03-06, said it was "posted to
net.sources 18 Jun 86 with message-id 238@sdchema.sdchem.uucp".
Unfortunately this isn't in the utzoo archive or the net.sources.mbox
in archive.org's Usenet Historical Collection.

A copy identical to yours was posted by Jim Harkins to
comp.unix.questions on 1990-03-29.

There's a later version, fixing a typo and some formatting and adding a
mention of \f and \b in printf, which was posted by Brian Kantor to
comp.doc on 1987-10-11 -- I've attached this. The same file (with two
.bps commented out) was reposted in comp.unix.questions on 1989-11-16 by
Francois-Michel Lang.

Thanks,

-- 
Adam Sampson <ats@offog.org>                         <http://offog.org/>

[-- Attachment #2: supplemental-19871011-brian --]
[-- Type: text/plain, Size: 20115 bytes --]

Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!seismo!esosun!ucsdhub!sdcsvax!brian
From: brian@sdcsvax.UCSD.EDU (Brian Kantor)
Newsgroups: comp.doc
Subject: AWK supplementary document - troff with 'ms' macros
Message-ID: <4070@sdcsvax.UCSD.EDU>
Date: Sun, 11-Oct-87 02:40:02 EDT
Article-I.D.: sdcsvax.4070
Posted: Sun Oct 11 02:40:02 1987
Date-Received: Mon, 12-Oct-87 21:20:14 EDT
Sender: root@sdcsvax.UCSD.EDU
Organization: UCSD wombat breeding society
Lines: 745
Approved: brian@cyberpunk.ucsd.edu

.RP
.TL
.B
A Supplemental Document For AWK
.sp
.R
- or -
.sp
.I
Things Al, Pete, And Brian Didn't Mention Much
.R
.AU
John W. Pierce
.AI
Department of Chemistry
University of California, San Diego
La Jolla, California  92093
jwp%chem@sdcsvax.ucsd.edu
.AB
As
.B awk
and its documentation are distributed with
.I
4.2 BSD UNIX*
.R
there are a number of bugs, undocumented features,
and features that are touched on so briefly in the
documentation that the casual user may
not realize their full significance.  While this document
applies primarily to the \fI4.2 BSD\fR version of \fIUNIX\fR,
it is known that the \fI4.3 BSD\fR version does not have
all of the bugs fixed, and that it does not have updated
documentation.  The situation with respect to the versions
of \fBawk\fR distributed with other versions \fIUNIX\fR and
similar systems is unknown to the author.
.FS
*UNIX is a trademark of AT&T
.FE
.AE
.LP
In this document references to "the user manual" mean
.I
Awk - A Pattern Scanning and Processing Language (Second Edition)
.R
by Aho, Kernighan, and Weinberger.  References to "awk(1)" mean
the entry for
.B awk
in the
.I
UNIX Programmer's Manual, 4th Berkeley Distribution.
.R
References to "the documentation" mean both of those.
.LP
In most examples, the outermost set of braces ('{ }') have been
ommitted.  They would, of course, be necessary in real scripts.
.NH
Known Bugs
.LP
There are three main bugs known to me.  They involve:
.IP
Assignment to input fields.
.IP
Piping output to a program from within an \fBawk\fR script.
.IP
Using '*' in \fIprintf\fR field width and precision specifications
does not work, nor do '\\f' and '\\b' print formfeed and backspace
respectively.
.NH 2
Assignment to Input Fields
.LP
[This problem is partially fixed in \fI4.3BSD\fR;
see the last paragraph of this section regarding the unfixed portion.]
.LP
The user manual states that input fields may be objects of assignment
statements.  Given the input line
.DS
field_one field_two field_three
.DE
the script
.DS
$2 = "new_field_2"
print $0
.DE
should print
.DS
field_one new_field_2 field_three
.DE
.LP
This does not work; it will print
.DS
field_one field_two field_three
.DE
That is, the script will behave as if the
assignment to $2 had not been made.  However,
explicitly referencing an "assigned to" field
.I does
recognize that the assignment has been made.
If the script
.DS
$2 = "new_field_2"
print $1, $2, $3
.DE
is given the same input it will [properly] print
.DS
field_one new_field_2 field_three
.DE
Therefore, you can
get around this bug with, e.g.,
.DS
$2 = "new_field_2"
output = $1                       # Concatenate output fields
for(i = 2; i <= NF; ++i)          # into a single output line
	output = output OFS $i    # with OFS between fields
print output
.DE
.LP
In \fI4.3BSD\fR, this bug has been fixed to the extent that
the failing example above works correctly.  However, a script like
.DS
$2 = "new_field_2"
var = $0
print var
.DE
still gives incorrect output.  This problem can be bypassed by using
.DS
\fIvar\fR = sprintf("%s", $0)
.DE
instead of "\fIvar\fR = $0"; \fIvar\fR will have the correct value.
.NH 2
Piping Output to a Program
.LP
[This problem appears to have been fixed in \fI4.3BSD\fR,
but that has not been exhaustively tested.]
.LP
The user manual states that
.I print
and
.I printf
statements may write to a program using, e.g.,
.DS
print | "\fIcommand\fR"
.DE
This would pipe the output into \fIcommand\fR, and it
does work.  However, you should be aware that this causes
.B awk
to spawn a child process (\fIcommand\fR), and that it
.I
does not
.R
wait for the child to exit before it exits itself.  In the case of a
"slow" command like
.B sort,
.B awk
may exit before
.I command
has finished.
.LP
This can cause problems in, for example, a shell script that
depends on everything done by
.B awk
being finished before the next shell command is executed.
Consider the shell script
.DS
awk -f awk_script input_file
mv sorted_output somewhere_else
.DE
and the
.B awk
script
.DS
print output_line | "sort -o sorted_output"
.DE
If
.I input_file
is large
.B awk
will exit long before
.B sort
is finished.  That means that the
.B mv
command will be executed before
.B sort
is finished, and the result is unlikely to be what you wanted.
Other than fixing the source, there is no way to avoid this
problem except to handle such pipes outside of the awk script, e.g.
.DS
awk -f awk_file input_file | sort -o sorted_output
mv sorted_output somewhere_else
.DE
which is not wholly satisfactory.
.LP
See
.I
Sketchily Documented Features
.R
below for other considerations in redirecting
output from within an
.B awk
script.
.NH 2
Printf and '*', '\\f', and '\\b'
.LP
The document says that the \fIprintf\fR function provided is
identical to the \fIprintf\fR provided by the \fIC\fR language
\fBstdio\fR package.  This is incorrect:  '*' cannot be used to
specify a field width or precision, and '\\f' and '\\b' cannot
be used to print formfeeds and backspaces.
.LP
The command
.DS
printf("%*.s", len, string)
.DE
will cause a core dump.  Given \fBawk\fR's age, it is likely
that its \fIprintf\fR was written well before the use of '*'
for specifying field width and precision appeared in the \fBstdio\fR
library's \fIprintf\fR.  Another possibility is that it wasn't
implemented because it isn't really needed to achieve the same effect.
.LP
To accomplish this effect, you can utilize the fact that \fBawk\fR
concatenates variables before it does any other processing on them.
For example, assume a script has two variables \fIwid\fR and
\fIprec\fR which control the width and precision used for printing
another variable \fIval\fI:
.DS
[code to set "wid", "prec", and "val"]

printf("%" wid "." prec "d\en", val)
.DE
If, for example, \fIwid\fR is 8 and \fIprec\fR is 3, then /fBawk\fR
will concatenate everything to the left of the comma in
the \fIprintf\fR statement, and the statement will really be
.DS
printf(%8.3d\en, val)
.DE
These could, of course, been assigned to some variable \fIfmt\fR before
being used:
.DS
fmt = "%" wid "." prec "d"

printf(fmt "\en", val)
.DE
Note, however, that the newline ("\en") in the second form \fIcannot\fR
be included in the assignment to \fIfmt\fR.
.LP
To allow use of '\\f' and '\\b', \fBawk\fR's \fIlex\fR script must
be changed.  This is trivial to do (it is done at the point
where '\\n' and '\\t' are processed), but requires having source
code.  [I have fixed this and have not seen any unwanted effects.]
.bp
.NH
Undocumented Features
.LP
There are several undocumented features:
.IP
Variable values may be established on the command line.
.IP
A
.B getline
function exists that reads the next input line and starts processing it
immediately.
.IP
Regular expressions accept octal representations of characters.
.IP
A
.B -d
flag argument produces debugging output if
.B awk
was compiled with "DEBUG" defined.
.IP
Scripts may be "compiled" and run later (providing the installer
did what is necessary to make this work).
.NH 2
Defining Variables On The Command Line
.LP
To pass variable values into a script at run time, you may use
.IP
.I variable=value
.LP
(as many as you like) between any "\fB-f \fIscriptname\fR" or
.I program
and the names of any files to be processed.  For example,
.DS
awk -f awkscript today=\e"`date`\e" infile
.DE
would establish for
.I awkscript
a variable named
.B today
that had as its value the output of the
.B date
command.
.LP
There are a number of caveats:
.IP
Such assignments may appear only between
.B -f
.I awkscript
(or \fIprogram\fR or [see below] \fB-R\fIawk.out\fR)
and the name of any
input file (or '-').
.IP
Each
.I variable=value
combination must be a single argument (i.e. there must not be spaces
around the '=' sign);
.I value
may be either a numeric value or a string.  If it is a string,
it must be enclosed in
double quotes at the time \fBawk\fR reads the argument.  That means
that the double quotes enclosing \fIvalue\fR on the command line
must be protected from the shell as in the example above or it will
remove them.
.IP
.I Variable
is not available for use within the script until after the first record
has been read and parsed, but it is available as soon as
that has occurred so that it may be used before any other
processing begins.  It does not exist at the time the
.B BEGIN
block is executed, and if there was no input it will not exist in the
.B END
block (if any).
.NH 2
Getline Function
.LP
.B Getline
immediately reads the next input line (which is parsed into \fI$1\fR,
\fI$2\fR, etc) and starts processing it at the location of the call
(as opposed to
.B next
which immediately reads the next input line but starts processing
from the start of the script).
.LP
.B Getline
facilitates performing some types of tasks such as
processing files with multiline records and merging
information from several files.  To use the latter as an example,
consider a case where two files, whose lines do not share
a common format, must be processed together.  Shell and \fBawk\fR
scripts to do this might look something like
.sp
In the shell script
.DS
( echo DATA1; cat datafile1; echo ENDdata1 \e
  echo DATA2; cat datafile2; echo ENDdata2 \e
) | \e
    awk -f awkscript - > awk_output_file
.DE
In the
.B awk
script
.DS
/^DATA1/  {       # Next input line starts datafile1
          while (getline && $1 !~ /^ENDdata1$/)
                 {
                 [processing for \fIdata1\fR lines]
                 }
          }
.sp 1
/^DATA2/  {       # Next input line starts datafile2
          while (getline && $1 !~ /^ENDdata2$/)
                 {
                 [processing for \fIdata2\fR lines]
                 }
          }
.DE
There are, of course, other ways of accomplishing this particular task
(primarily using \fBsed\fR to preprocess the information),
but they are generally more difficult to write and more
subject to logic errors.  Many cases arising in practice
are significantly more difficult, if not impossible, to handle
without \fBgetline\fR.
.NH 2
Regular Expressions
.LP
The sequence "\fI\eddd\fR" (where 'd' is a digit)
may be used to include explicit octal
values in regular expressions.  This is often useful if "nonprinting"
characters have been used as "markers" in a file.  It has not been
tested for ASCII values outside the range 01 through 0127.
.NH 2
Debugging output
.LP
[This is unlikely to be of interest to the casual user.]
.sp
If \fBawk\fR was compiled with "DEBUG" defined, then giving it a
.B -d
flag argument will cause it to produce debugging output when it is run.
This is sometimes useful in finding obscure problems in scripts, though
it is primarily intended for tracking down problems with \fBawk\fR itself.
.NH 2
Script "Compilation"
.LP
[It is likely that this does not work at most sites.  If it does not, the
following will probably not be of interest to the casual user.]
.sp
The command
.DS
awk -S -f script.awk
.DE
produces a file named
.B awk.out.
This is a core image of
.B awk
after parsing the file
.I script.awk.
The command
.DS
awk -Rawk.out datafile
.DE
causes
.B awk.out
to be applied to \fIdatafile\fR (or the standard input if no
input file is given).  This avoids having to reparse large
scripts each time they are used.  Unfortunately, the way this
is implemented requires some special action on the part of the
person installing \fBawk\fR.
.LP
As \fBawk\fR is delivered with \fI4.2 BSD\fR (and \fI4.3 BSD\fR),
.I awk.out
is created by the \fBawk -S ...\fR process by calling
.B sbrk()
with '0', writing out the returned value, then
writing out the core image from location 0 to
the returned address.  The \fBawk -R...\fR process
reads the first word of
.I awk.out
to get the length of the image, calls
.B brk()
with that length, and
then reads the image into itself starting at location 0.
For this to work, \fBawk\fR must have been loaded with its
text segment writeable.  Unfortunately,
the \fIBSD\fR default for \fBld\fR is to load with the text
read-only and shareable.  Thus, the installer must remember to take
special action (e.g. "cc -N ..."
[equivalently "ld -N ..."] for \fI4BSD\fR) if these
flags are to work.
.LP
[Personally, I don't think it is
a very good idea to give \fBawk\fR the opportunity
to write on its text segment; I changed it so that
only the data segment is overwritten.]
.LP
Also, due to what appears to be a lapse in logic, the first
non-flag argument following \fB-R\fIawk.out\fR is discarded.
[Disliking that behavior, the I changed it so that the \fB-R\fR flag
is treated like the \fB-f\fR flag:  no flag arguments may follow it.]
.bp
.NH
Sketchily Documented Features
.LP
.NH 2
Exit
.LP
The user manual says that using the
.B exit
function causes the script to behave as if end-of-input has been reached.
Not menitoned explicitly is the fact that this will cause the
.B END
block to be executed if it exists.
Also, two things are ommitted:
.IP
\fBexit(\fIexpr\fB)\fR causes the script's exit status to be
set to the value of \fIexpr\fR.
.IP
If
.B exit
is called within the
.B END
block, the script exits immediately.
.NH 2
Mathematical Functions
.LP
The following builtin functions exist and are mentioned in
.I awk(1)
but not in the user manual.
.IP \fBint(\fIx\fB)\fR 10
\fIx\fR trunctated to an integer.
.IP \fBsqrt(\fIx\fB)\fR 10
the square root of \fIx\fR for \fIx\fR >= 0, otherwise zero.
.IP \fBexp(\fIx\fB)\fR 10
\fBe\fR-to-the-\fIx\fR for -88 <= \fIx\fR <= 88, zero
for \fIx\fR < -88, and dumps core for \fIx\fR > 88.
.IP \fBlog(\fIx\fB)\fR 10
the natural log of \fIx\fR.
.NH 2
OFMT Variable
.LP
The variable
.B OFMT
may be set to, e.g. "%.2f", and purely numerical output will be
bound by that restriction in
.B print
statements.  The default value is "%.6g".  Again, this is mentioned in
.I awk(1)
but not in the user manual.
.NH 2
Array Elements
.LP
The user manual states that "Array elements ... spring into existence by
being mentioned."  This is literally true;
.I any
reference to an array element causes it to exist.
("I was thought about, therefore I am.")
Take, for example,
.DS
if(array[$1] == "blah")
	{
	[process blah lines]
	}
.DE
If there is not an existing element of
.B array
whose subscript is the same as the contents of the
current line's first field,
.I
one is created
.R
and its value (null, of course) is then compared
with "blah".  This can be a bit
disconcerting, particularly when later processing is using
.DS
for (i in \fBarray\fR)
        {
        [do something with result of processing
	"blah" lines]
        }
.DE
to walk the array and expects all the elements to be non-null.
Succinct practical examples are difficult to construct, but
when this happens in a 500 line
script it can be difficult to determine what has gone wrong.
.NH 2
FS and Input Fields
.LP
By default any number of spaces or tabs can separate fields (i.e.
there are no null input fields) and trailing spaces and tabs
are ignored.  However, if
.B FS
is explicitly set to any character other than a space
(e.g., a tab: \fBFS = "\et"\fR), then a field is defined
by each such character and trailing field separator characters are
not ignored.  For example, if '>' represents a tab then
.DS
one>>three>>five>
.DE
defines six fields, with fields two, four, and six being empty.
.LP
If
.B FS
is explicitly set to a space (\fBFS\fR = "\ "), then
the default behavior obtains (this may be a bug); that
is, both spaces
and tabs are taken as field separators, there can be no
null input fields, and trailing spaces and tabs are ignored.
.NH 2
RS and Input Records
.LP
If
.B RS
is explicitly set to the null string (\fBRS\fR = ""), then the input
record separator becomes a blank line, and the newlines at the end
of input lines is a field separator.  This facilitates
handling multiline records.
.NH 2
"Fall Through"
.LP
This is mentioned in the user manual, but it is important
enough that it is worth pointing out here, also.
.LP
In the script
.DS
/\fIpattern_1\fR/  {
             [do something]
             }
.sp
/\fIpattern_2\fR/  {
             [do something]
             }
.DE
all input lines will be compared with both 
.I pattern_1
and
.I pattern_2
unless the
.B next
function is used before the closing '}' in the
.I pattern_1
portion.
.NH 2
Output Redirection
.LP
Once a file (or pipe) is opened by
.B awk
it is not closed until
.B awk
exits.  This can occassionally cause problems.  For example,
it means that a script that sorts its input lines into
output files named by the contents of their first fields
(similar to an example in the user manual)
.DS
{ print $0 > $1 }
.DE
is going to fail if the number of different first fields exceeds
about 10.
This problem
.I cannot
be avoided by using something like
.DS
{
command = "cat >> " $1
print $0 | command
}
.DE
as the value of the variable
.B command
is different for each different value of
.I $1
and is therefore treated as a different output "file".
.LP
[I have not been able to create a truly satisfactory
fix for this that doesn't involve having \fBawk\fR treat output
redirection to pipes differently from output to files; I
would greatly appreciate hearing of one.]
.NH 2
Field and Variable Types, Values, and Comparisons
.LP
The following is a synopsis of notes included with \fBawk\fR's
source code.
.NH 3
Types
.LP
Variables and fields can be strings or numbers or both.
.NH 4
Variable Types
.LP
When a variable is set by the assignment
.DS
\fIvar\fR = \fIexpr\fR
.DE
its type is set to the type of
.I expr
(this includes +=, ++, etc). An arithmetic
expression is of type
.I number,
a concatenation is of type
.I string,
etc.
If the assignment is a simple copy, e.g.
.DS
\fIvar1\fR = \fIvar2\fR
.DE
then the type of
.I var1
becomes that of
.I var2.
.LP
Type is determined by context; rarely, but always very inconveniently,
this context-determined type is incorrect.  As mentioned in
.I awk(1)
the type of an expression can be coerced to that desired.  E.g.
.DS
{
\fIexpr1\fR + 0
.sp 1
\fIexpr2\fR ""    # Concatenate with a null string
}
.DE
coerces
.I expr1
to numeric type and
.I expr2
to string type.
.NH 4
Field Types
.LP
As with variables, the type of a field is determined by
context when possible, e.g.
.RS
.IP $1++ 8
clearly implies that \fI$1\fR is to be numeric, and
.IP $1\ =\ $1\ ","\ $2 16
implies that $1 and $2 are both to be strings.
.RE
.LP
Coercion is done as needed.
In contexts where types cannot be reliably determined, e.g.,
.DS
if($1 == $2) ...
.DE
the type of each field is determined on input by inspection.  All fields are
strings; in addition, each field that contains only a number
is also considered numeric.  Thus, the test
.DS
if($1 == $2) ...
.DE
will succeed on the inputs
.DS
0       0.0
100     1e2
+100    100
1e-3    1e-3
.DE
and fail on the inputs
.DS
(null)      0
(null)      0.0
2E-518      6E-427
.DE
"only a number" in this case means matching the regular expression
.DS
^[+-]?[0-9]*\e.?[0-9]+(e[+-]?[0-9]+)?$
.DE
.NH 3
Values
.LP
Uninitialized variables have the numeric value 0 and the string value "".
Therefore, if \fIx\fR is uninitialized,
.DS
if(x) ...
if (x == "0") ...
.DE
are false, and
.DS
if(!x) ...
if(x == 0) ...
if(x == "") ...
.DE
are true.
.LP
Fields which are explicitly null have the string value "", and are not numeric.
Non-existent fields (i.e., fields past \fBNF\fR) are also treated this way.
.NH 3
Types of Comparisons
.LP
If both operands are numeric, the comparison is made
numerically.  Otherwise, operands are coerced to type
string if necessary, and the comparison is made on strings.
.NH 3
Array Elements
.LP
Array elements created by
.B split
are treated in the same way as fields.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [TUHS] Re: Trying to date "A Supplemental Document For Awk"
  2023-06-28 17:48 ` Adam Sampson
@ 2023-06-28 18:03   ` KenUnix
  2023-06-28 18:38     ` Clem Cole
  2023-06-29  1:04     ` Bakul Shah
  0 siblings, 2 replies; 20+ messages in thread
From: KenUnix @ 2023-06-28 18:03 UTC (permalink / raw)
  To: Adam Sampson; +Cc: tuhs

[-- Attachment #1: Type: text/plain, Size: 1538 bytes --]

Guys,

It's been too long. What would I use to compile this man page source?

I do remember some option switches are required. Yes?

Thanks


On Wed, Jun 28, 2023 at 1:49 PM Adam Sampson <ats@offog.org> wrote:

> On Wed, Jun 28, 2023 at 09:26:02AM +0300, Aharon Robbins wrote:
> > Attached is "A Supplemental Document For Awk". This circulated on
> > USENET in the 80s.  My copy is dated January 18, 1989, but I'm sure
> > it's older than that.
>
> In the utzoo Usenet archive, there are two versions of this document and
> a few mentions of it...
>
> John Pierce posted to comp.unix.questions on 1989-04-02, saying he'd
> written it "four or five years ago".
>
> Stu Heiss, in comp.unix.questions on 1989-03-06, said it was "posted to
> net.sources 18 Jun 86 with message-id 238@sdchema.sdchem.uucp".
> Unfortunately this isn't in the utzoo archive or the net.sources.mbox
> in archive.org's Usenet Historical Collection.
>
> A copy identical to yours was posted by Jim Harkins to
> comp.unix.questions on 1990-03-29.
>
> There's a later version, fixing a typo and some formatting and adding a
> mention of \f and \b in printf, which was posted by Brian Kantor to
> comp.doc on 1987-10-11 -- I've attached this. The same file (with two
> .bps commented out) was reposted in comp.unix.questions on 1989-11-16 by
> Francois-Michel Lang.
>
> Thanks,
>
> --
> Adam Sampson <ats@offog.org>                         <http://offog.org/>
>


-- 
End of line
JOB TERMINATED -->> Okey Dokey, OK Boss

[-- Attachment #2: Type: text/html, Size: 2425 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [TUHS] Re: Trying to date "A Supplemental Document For Awk"
  2023-06-28 18:03   ` KenUnix
@ 2023-06-28 18:38     ` Clem Cole
  2023-06-28 23:47       ` Greg 'groggy' Lehey
  2023-06-29  1:04     ` Bakul Shah
  1 sibling, 1 reply; 20+ messages in thread
From: Clem Cole @ 2023-06-28 18:38 UTC (permalink / raw)
  To: KenUnix; +Cc: tuhs

[-- Attachment #1: Type: text/plain, Size: 2607 bytes --]

Download the file and make sure you save it in "UNIX" format, not DOS (
*i.e.* newline delimited not the nasty <CR><LF> cruft) -- (if you are not
sure how to do that running the dos2unix(1) command will assure it's was
not unix format when you are done).

% file awkdoc
awkdoc: troff or preprocessor input text, ASCII text
% man groff

We'll leave it to you to figure out which switches for troff/groff and
macro package (hint: try the head(1) command to peak at the first few lines
--  there are three likely choices, but it's pretty obvious since the same
one as most V7 documents).

FWIW: If you got a copy of Kernighan and Pike's - "The Unix Programming
Environment"  [ISBN 0-13-937699-2] which is available at most retailers.
You can read Chapter 9 for this question. Although, given so many of the
questions you seem to like to ask here, please consider doing all the
exercises in the entire book.
ᐧ

On Wed, Jun 28, 2023 at 2:04 PM KenUnix <ken.unix.guy@gmail.com> wrote:

> Guys,
>
> It's been too long. What would I use to compile this man page source?
>
> I do remember some option switches are required. Yes?
>
> Thanks
>
>
> On Wed, Jun 28, 2023 at 1:49 PM Adam Sampson <ats@offog.org> wrote:
>
>> On Wed, Jun 28, 2023 at 09:26:02AM +0300, Aharon Robbins wrote:
>> > Attached is "A Supplemental Document For Awk". This circulated on
>> > USENET in the 80s.  My copy is dated January 18, 1989, but I'm sure
>> > it's older than that.
>>
>> In the utzoo Usenet archive, there are two versions of this document and
>> a few mentions of it...
>>
>> John Pierce posted to comp.unix.questions on 1989-04-02, saying he'd
>> written it "four or five years ago".
>>
>> Stu Heiss, in comp.unix.questions on 1989-03-06, said it was "posted to
>> net.sources 18 Jun 86 with message-id 238@sdchema.sdchem.uucp".
>> Unfortunately this isn't in the utzoo archive or the net.sources.mbox
>> in archive.org's Usenet Historical Collection.
>>
>> A copy identical to yours was posted by Jim Harkins to
>> comp.unix.questions on 1990-03-29.
>>
>> There's a later version, fixing a typo and some formatting and adding a
>> mention of \f and \b in printf, which was posted by Brian Kantor to
>> comp.doc on 1987-10-11 -- I've attached this. The same file (with two
>> .bps commented out) was reposted in comp.unix.questions on 1989-11-16 by
>> Francois-Michel Lang.
>>
>> Thanks,
>>
>> --
>> Adam Sampson <ats@offog.org>                         <http://offog.org/>
>>
>
>
> --
> End of line
> JOB TERMINATED -->> Okey Dokey, OK Boss
>
>
>

[-- Attachment #2: Type: text/html, Size: 5062 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [TUHS] Re: Trying to date "A Supplemental Document For Awk"
  2023-06-28 18:38     ` Clem Cole
@ 2023-06-28 23:47       ` Greg 'groggy' Lehey
  2023-06-29  1:59         ` Stuff Received
  2023-06-29 13:34         ` G. Branden Robinson
  0 siblings, 2 replies; 20+ messages in thread
From: Greg 'groggy' Lehey @ 2023-06-28 23:47 UTC (permalink / raw)
  To: Clem Cole, KenUnix; +Cc: tuhs

[-- Attachment #1: Type: text/plain, Size: 1059 bytes --]

On Wednesday, 28 June 2023 at 14:38:40 -0400, Clem Cole wrote:
> On Wed, Jun 28, 2023 at 2:04 PM KenUnix <ken.unix.guy@gmail.com> wrote:
>
>> It's been too long. What would I use to compile this man page source?
>>
>> I do remember some option switches are required. Yes?
>
> Download the file and make sure you save it in "UNIX" format, not DOS (
> *i.e.* newline delimited not the nasty <CR><LF> cruft) -- (if you are not
> sure how to do that running the dos2unix(1) command will assure it's was
> not unix format when you are done).
>
> % file awkdoc
> awkdoc: troff or preprocessor input text, ASCII text
> % man groff

There's also grog (groff guess) that may help.  It's not very clever,
but it recognizes a number of formats:

 $ grog ls.1
 groff -mdoc ls.1

Greg
--
Sent from my desktop computer.
Finger grog@lemis.com for PGP public key.
See complete headers for address and phone numbers.
This message is digitally signed.  If your Microsoft mail program
reports problems, please read http://lemis.com/broken-MUA.php

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 163 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [TUHS] Re: Trying to date "A Supplemental Document For Awk"
  2023-06-28  6:26 [TUHS] Trying to date "A Supplemental Document For Awk" Aharon Robbins
  2023-06-28  6:45 ` [TUHS] " arnold
  2023-06-28 17:48 ` Adam Sampson
@ 2023-06-29  0:26 ` Jeremy C. Reed
  2 siblings, 0 replies; 20+ messages in thread
From: Jeremy C. Reed @ 2023-06-29  0:26 UTC (permalink / raw)
  To: Aharon Robbins; +Cc: tuhs


I found a copy from 1986 in 
usenix89/Lang/Awk_doc/:STUFF 
(the file is called :STUFF)
from a tar usenix878889.tar.gz
I didn't check but I assume it is one here
https://www.tuhs.org/Archive/Applications/Shoppa_Tapes/

 Path: plus5!wuphys!wucs!we53!ltuxa!cuae2!ihnp4!mhuxn!mhuxr!ulysses!ucbvax!sdcsvax!sdchem!jwp
 From: jwp@sdchem.UUCP (John Pierce)
 Newsgroups: net.sources
 Subject: Awk document
 Message-ID: <238@sdchema.sdchem.UUCP>
 Date: 18 Jun 86 20:04:32 GMT
 Reply-To: jwp@sdchem.UUCP (John Pierce)
 Organization: Chemistry Dept, UC San Diego
 Lines: 743
 Posted: Wed Jun 18 15:04:32 1986


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [TUHS] Re: Trying to date "A Supplemental Document For Awk"
  2023-06-28 18:03   ` KenUnix
  2023-06-28 18:38     ` Clem Cole
@ 2023-06-29  1:04     ` Bakul Shah
  1 sibling, 0 replies; 20+ messages in thread
From: Bakul Shah @ 2023-06-29  1:04 UTC (permalink / raw)
  To: KenUnix; +Cc: TUHS

The presence of .AB, .AU etc says you need

nroff -ms

But why even bother unless you plan to become an awkspert?

> On Jun 28, 2023, at 11:03 AM, KenUnix <ken.unix.guy@gmail.com> wrote:
> 
> Guys,
> 
> It's been too long. What would I use to compile this man page source?
> 
> I do remember some option switches are required. Yes?
> 
> Thanks
> 
> 
> On Wed, Jun 28, 2023 at 1:49 PM Adam Sampson <ats@offog.org> wrote:
> On Wed, Jun 28, 2023 at 09:26:02AM +0300, Aharon Robbins wrote:
>> Attached is "A Supplemental Document For Awk". This circulated on
>> USENET in the 80s.  My copy is dated January 18, 1989, but I'm sure
>> it's older than that.
> 
> In the utzoo Usenet archive, there are two versions of this document and
> a few mentions of it...
> 
> John Pierce posted to comp.unix.questions on 1989-04-02, saying he'd
> written it "four or five years ago".
> 
> Stu Heiss, in comp.unix.questions on 1989-03-06, said it was "posted to
> net.sources 18 Jun 86 with message-id 238@sdchema.sdchem.uucp".
> Unfortunately this isn't in the utzoo archive or the net.sources.mbox
> in archive.org's Usenet Historical Collection.
> 
> A copy identical to yours was posted by Jim Harkins to
> comp.unix.questions on 1990-03-29.
> 
> There's a later version, fixing a typo and some formatting and adding a
> mention of \f and \b in printf, which was posted by Brian Kantor to
> comp.doc on 1987-10-11 -- I've attached this. The same file (with two
> .bps commented out) was reposted in comp.unix.questions on 1989-11-16 by
> Francois-Michel Lang.
> 
> Thanks,
> 
> -- 
> Adam Sampson <ats@offog.org>                         <http://offog.org/>
> 
> 
> -- 
> End of line
> JOB TERMINATED -->> Okey Dokey, OK Boss



^ permalink raw reply	[flat|nested] 20+ messages in thread

* [TUHS] Re: Trying to date "A Supplemental Document For Awk"
  2023-06-28 23:47       ` Greg 'groggy' Lehey
@ 2023-06-29  1:59         ` Stuff Received
  2023-06-29  6:27           ` segaloco via TUHS
  2023-06-29 13:45           ` G. Branden Robinson
  2023-06-29 13:34         ` G. Branden Robinson
  1 sibling, 2 replies; 20+ messages in thread
From: Stuff Received @ 2023-06-29  1:59 UTC (permalink / raw)
  To: tuhs

On 2023-06-28 19:47, Greg 'groggy' Lehey wrote:
> On Wednesday, 28 June 2023 at 14:38:40 -0400, Clem Cole wrote:
[...]
> 
> There's also grog (groff guess) that may help.  It's not very clever,
> but it recognizes a number of formats:
> 
>   $ grog ls.1
>   groff -mdoc ls.1

Thank you -- I never knew of its existence.

But what did people use before grog and why was the compilation line 
never placed in a comment in the file?

N.

> 
> Greg

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [TUHS] Re: Trying to date "A Supplemental Document For Awk"
  2023-06-29  1:59         ` Stuff Received
@ 2023-06-29  6:27           ` segaloco via TUHS
  2023-06-29  6:41             ` Andrew Hume
                               ` (2 more replies)
  2023-06-29 13:45           ` G. Branden Robinson
  1 sibling, 3 replies; 20+ messages in thread
From: segaloco via TUHS @ 2023-06-29  6:27 UTC (permalink / raw)
  To: The Eunuchs Hysterical Society

> But what did people use before grog and why was the compilation line
> never placed in a comment in the file?

The primary macro packages I see come up between Bell and UCB are man, ms, mm, and me.  Man of course finds use in the manual pages (although there are different representations of manpages in nroff over time.)  From what I've seen (someone who was there can surely correct me) it seems that ms macros were more commonly used on the research side of things while the mm macros proliferated more in the supported side.  Finally the me macros were a BSD component.  Given these separations, the origin of or relative vicinity from which a paper originates provides much context as to which macros may be present.

To a finer point, the papers published with V7 are ms macros papers while the new additions in PWB lineages are mm macros, while some papers that crop up in BSD likely use me (although I haven't gotten too far into BSD with doc research yet.)  Papers from UNIX consumers such as universities are likely in ms or me most of the time.  On the flip side, mm was the macro package touted with Documenter's Workbench, so many commercial operations using System V for documentation would've produced documents in mm.  I'd be curious whether the earlier "Phototypesetter" package included ms or mm (or both.)  I don't think I've seen a "papers" set with both the Lesk ms document and the Smith and Mashey mm one, so couldn't say how common both in the same Bell offering were.  Additionally, my research hasn't touched on any officially sanctioned use of mm in BSD, so that's an area ripe for some more study.

As for other breadcrumbs, Bell mm macros papers do often include a comment at the top indicating to print with nroff -mm or mm(1).  I don't recall seeing similar in research papers, but haven't necessarily gone looking.  In any case, the paper sets with UNIX itself typically had scripts included with the necessary command-lines, as many papers additionally needed some eqn and/or tbl processing.  I imagine any other such formally distributed document sources would likewise include scripts in lieu of commentary, but it depends.

- Matt G.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [TUHS] Re: Trying to date "A Supplemental Document For Awk"
  2023-06-29  6:27           ` segaloco via TUHS
@ 2023-06-29  6:41             ` Andrew Hume
  2023-06-29  6:45               ` Noel Hunt
  2023-06-29  6:44             ` Noel Hunt
  2023-06-29 14:02             ` G. Branden Robinson
  2 siblings, 1 reply; 20+ messages in thread
From: Andrew Hume @ 2023-06-29  6:41 UTC (permalink / raw)
  To: segaloco; +Cc: The Eunuchs Hysterical Society

over time, folks in research tended to use make (or its descendants) to generate paper outputs.
altho i do recall a tool similar to grog that correctly orchestrated the ideal/pic/eqn/tbl/troff pipeline
needed to generate the output. the order was important.

as for macros, for several years we tended to use the pm macros (akin to the ms macros)
because they drove chris van wyck and kernighan’s page balancing backend, which was necessary
to produce print ready copy for journals etc.

> On Jun 28, 2023, at 11:27 PM, segaloco via TUHS <tuhs@tuhs.org> wrote:
> 
>> But what did people use before grog and why was the compilation line
>> never placed in a comment in the file?


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [TUHS] Re: Trying to date "A Supplemental Document For Awk"
  2023-06-29  6:27           ` segaloco via TUHS
  2023-06-29  6:41             ` Andrew Hume
@ 2023-06-29  6:44             ` Noel Hunt
  2023-06-29 14:02             ` G. Branden Robinson
  2 siblings, 0 replies; 20+ messages in thread
From: Noel Hunt @ 2023-06-29  6:44 UTC (permalink / raw)
  To: segaloco; +Cc: The Eunuchs Hysterical Society

And let us not forget the wonderful 'mv' macros, for typesetting over-head
projection slides.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [TUHS] Re: Trying to date "A Supplemental Document For Awk"
  2023-06-29  6:41             ` Andrew Hume
@ 2023-06-29  6:45               ` Noel Hunt
  2023-06-29  6:48                 ` Andrew Hume
  0 siblings, 1 reply; 20+ messages in thread
From: Noel Hunt @ 2023-06-29  6:45 UTC (permalink / raw)
  To: Andrew Hume; +Cc: segaloco, The Eunuchs Hysterical Society

> altho i do recall a tool similar to grog that correctly orchestrated the ideal/pic/eqn/tbl/troff pipeline

Perhaps you are referring to 'doctype'?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [TUHS] Re: Trying to date "A Supplemental Document For Awk"
  2023-06-29  6:45               ` Noel Hunt
@ 2023-06-29  6:48                 ` Andrew Hume
  2023-06-29  6:50                   ` arnold
  0 siblings, 1 reply; 20+ messages in thread
From: Andrew Hume @ 2023-06-29  6:48 UTC (permalink / raw)
  To: Noel Hunt; +Cc: The Eunuchs Hysterical Society

its possible; i simply can’t remember 40 years ago.

> On Jun 28, 2023, at 11:45 PM, Noel Hunt <noel.hunt@gmail.com> wrote:
> 
>> altho i do recall a tool similar to grog that correctly orchestrated the ideal/pic/eqn/tbl/troff pipeline
> 
> Perhaps you are referring to 'doctype'?


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [TUHS] Re: Trying to date "A Supplemental Document For Awk"
  2023-06-29  6:48                 ` Andrew Hume
@ 2023-06-29  6:50                   ` arnold
  0 siblings, 0 replies; 20+ messages in thread
From: arnold @ 2023-06-29  6:50 UTC (permalink / raw)
  To: noel.hunt, andrew; +Cc: tuhs

It is doctype. It's still alive (as an rc/grep/awk) script in Plan 9
and descendants.

Andrew Hume <andrew@humeweb.com> wrote:

> its possible; i simply can’t remember 40 years ago.
>
> > On Jun 28, 2023, at 11:45 PM, Noel Hunt <noel.hunt@gmail.com> wrote:
> > 
> >> altho i do recall a tool similar to grog that correctly orchestrated the ideal/pic/eqn/tbl/troff pipeline
> > 
> > Perhaps you are referring to 'doctype'?
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [TUHS] Re: Trying to date "A Supplemental Document For Awk"
  2023-06-28 23:47       ` Greg 'groggy' Lehey
  2023-06-29  1:59         ` Stuff Received
@ 2023-06-29 13:34         ` G. Branden Robinson
  2023-06-29 13:47           ` Rich Salz
  1 sibling, 1 reply; 20+ messages in thread
From: G. Branden Robinson @ 2023-06-29 13:34 UTC (permalink / raw)
  To: Greg 'groggy' Lehey; +Cc: tuhs


[-- Attachment #1.1: Type: text/plain, Size: 686 bytes --]

At 2023-06-29T09:47:50+1000, Greg 'groggy' Lehey wrote:
> There's also grog (groff guess) that may help.  It's not very clever,
> but it recognizes a number of formats:
> 
>  $ grog ls.1
>  groff -mdoc ls.1

I won't claim that grog is more clever now, but as of groff 1.23.0 it
is[1] avowedly less buggy.  It is also 52% of its former size (by `wc
-l`), has 14 bug fixes since groff 1.22.4 (with only a wish list item
remaining), sports an automated test suite, and the tool itself can now
be conveniently passed around as a single file--so I'm attaching it.

Regards,
Branden

[1] Will be.  We're up to release candidate 4 now.

    https://alpha.gnu.org/gnu/groff/

[-- Attachment #1.2: grog --]
[-- Type: text/plain, Size: 19221 bytes --]

#!/usr/bin/perl
# grog - guess options for groff command
# Inspired by doctype script in Kernighan & Pike, Unix Programming
# Environment, pp 306-8.

# Copyright (C) 1993-2021 Free Software Foundation, Inc.
# Written by James Clark.
# Rewritten in Perl by Bernd Warken <groff-bernd.warken-72@web.de>.
# Hacked up by G. Branden Robinson, 2021.

# This file is part of 'grog', which is part of 'groff'.

# 'groff' is free software; you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 2 of the License, or
# (at your option) any later version.

# 'groff' is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
# General Public License for more details.

# You should have received a copy of the GNU General Public License
# along with this program.  If not, see
# <http://www.gnu.org/licenses/gpl-2.0.html>.

use warnings;
use strict;

use File::Spec;

my $groff_version = 'DEVELOPMENT';

my @command = ();		# the constructed groff command
my @requested_package = ();	# arguments to '-m' grog options
my @inferred_preprocessor = ();	# preprocessors the document uses
my @inferred_main_package = ();	# full-service package(s) detected
my $main_package;		# full-service package we go with
my $do_run = 0;			# run generated 'groff' command
my $use_compatibility_mode = 0;	# is -C being passed to groff?

my %preprocessor_for_macro = (
  'EQ', 'eqn',
  'G1', 'grap',
  'GS', 'grn',
  'PS', 'pic',
  '[',  'refer',
  #'so', 'soelim', # Can't be inferred this way; see grog man page.
  'TS', 'tbl',
  'cstart',   'chem',
  'lilypond', 'glilypond',
  'Perl',     'gperl',
  'pinyin',   'gpinyin',
);

my $program_name = $0;
{
  my ($v, $d, $f) = File::Spec->splitpath($program_name);
  $program_name = $f;
}

my %user_macro;
my %score = ();

my @input_file;

# .TH is both a man(7) macro and often used with tbl(1).  We expect to
# find .TH in ms(7) documents only between .TS and .TE calls, and in
# man(7) documents only as the first macro call.
my $have_seen_first_macro_call = 0;
# man(7) and ms(7) use many of the same macro names; do extra checking.
my $man_score = 0;
my $ms_score = 0;

my $had_inference_problem = 0;
my $had_processing_problem = 0;
my $have_any_valid_arguments = 0;


sub fail {
  my $text = shift;
  print STDERR "$program_name: error: $text\n";
  $had_processing_problem = 1;
}


sub warn {
  my $text = shift;
  print STDERR "$program_name: warning: $text\n";
}


sub process_arguments {
  my $no_more_options = 0;
  my $delayed_option = '';
  my $was_minus = 0;
  my $optarg = 0;
  my $pdf_with_ligatures = 0;

  foreach my $arg (@ARGV) {
    if ( $optarg ) {
      push @command, $arg;
      $optarg = 0;
      next;
    }

    if ($no_more_options) {
      push @input_file, $arg;
      next;
    }

    if ($delayed_option) {
      if ($delayed_option eq '-m') {
	push @requested_package, $arg;
	$arg = '';
      } else {
	push @command, $delayed_option;
      }

      push @command, $arg if $arg;
      $delayed_option = '';
      next;
    }

    unless ( $arg =~ /^-/ ) { # file name, no opt, no optarg
      push @input_file, $arg;
      next;
    }

    # now $arg starts with '-'

    if ($arg eq '-') {
      unless ($was_minus) {
	push @input_file, $arg;
	$was_minus = 1;
      }
      next;
    }

    if ($arg eq '--') {
      $no_more_options = 1;
      next;
    }

    # Handle options that cause an early exit.
    &version() if ($arg eq '-v' || $arg eq '--version');
    &usage(0) if ($arg eq '-h' || $arg eq '--help');

    if ($arg =~ '^--.') {
      if ($arg =~ '^--(run|with-ligatures)$') {
	$do_run = 1             if ($arg eq '--run');
	$pdf_with_ligatures = 1 if ($arg eq '--with-ligatures');
      } else {
        &fail("unrecognized grog option '$arg'; ignored");
	&usage(1);
      }
      next;
    }

    # Handle groff options that take an argument.

    # Handle the option argument being separated by whitespace.
    if ($arg =~ /^-[dfFIKLmMnoPrTwW]$/) {
      $delayed_option = $arg;
      next;
    }

    # Handle '-m' option without subsequent whitespace.
    if ($arg =~ /^-m/) {
      my $package = $arg;
      $package =~ s/-m//;
      push @requested_package, $package;
      next;
    }

    # Treat anything else as (possibly clustered) groff options that
    # take no arguments.

    # Our do_line() needs to know if it should do compatibility parsing.
    $use_compatibility_mode = 1 if ($arg =~ /C/);

    push @command, $arg;
  }

  if ($pdf_with_ligatures) {
    push @command, '-P-y';
    push @command, '-PU';
  }

  @input_file = ('-') unless (@input_file);
} # process_arguments()


sub process_input {
  foreach my $file (@input_file) {
    unless ( open(FILE, $file eq "-" ? $file : "< $file") ) {
      &fail("cannot open '$file': $!");
      next;
    }

    $have_any_valid_arguments = 1;

    while (my $line = <FILE>) {
      chomp $line;
      &do_line($line);
    }

    close(FILE);
  } # end foreach
} # process_input()


# Push item onto inferred full-service list only if not already present.
sub push_main_package {
  my $pkg = shift;
  if (!grep(/^$pkg/, @inferred_main_package)) {
    push @inferred_main_package, $pkg;
  }
} # push_main_package()


sub do_line {
  my $command;			# request or macro name
  my $args;			# request or macro arguments

  my $line = shift;

  # Check for a Perl Pod::Man comment.
  #
  # An alternative to this kludge is noted below: if a "standard" macro
  # is redefined, we could delete it from the relevant lists and
  # hashes.
  if ($line =~ /\\\" Automatically generated by Pod::Man/) {
    $man_score += 100;
  }

  # Strip comments.
  $line =~ s/\\".*//;
  $line =~ s/\\#.*// unless $use_compatibility_mode;

  return unless ($line =~ /^[.']/);	# Ignore text lines.

  # Perform preprocessor checks; they scan their inputs using a rump
  # interpretation of roff(7) syntax that requires the default control
  # character and no space between it and the macro name.  In AT&T
  # compatibility mode, no space (or newline!) is required after the
  # macro name, either.  We mimic the preprocessors themselves; eqn(1),
  # for instance, does not recognize '.EN' if '.EQ' has not been seen.
  my $boundary = '\\b';
  $boundary = '' if ($use_compatibility_mode);

  if ($line =~ /^\.(\S\S)$boundary/ || $line =~ /^\.(\[)/) {
    my $macro = $1;
    # groff identifiers can have extremely weird characters in them.
    # The ones we care about are conventionally named, but me(7)
    # documents can call macros like '+c', so quote carefully.
    if (grep(/^\Q$macro\E$/, keys %preprocessor_for_macro)) {
      my $preproc = $preprocessor_for_macro{$macro};
      if (!grep(/$preproc/, @inferred_preprocessor)) {
	push @inferred_preprocessor, $preproc;
      }
    }
  }

  # Normalize control lines; convert no-break control character to the
  # regular one and remove unnecessary whitespace.
  $line =~ s/^['.]\s*/./;
  $line =~ s/\s+$//;

  return if ($line =~ /^\.$/);		# Ignore empty request.
  return if ($line =~ /^\.\\?\.$/);	# Ignore macro definition ends.

  # Split control line into a request or macro call and its arguments.

  # Handle single-letter macro names.
  if ($line =~ /^\.(\S)(\s+(.*))?$/) {
    $command = $1;
    $args = $2;
  # Handle two-letter macro/request names in compatibility mode.
  } elsif ($use_compatibility_mode) {
    $line =~ /^\.(\S\S)\s*(.*)$/;
    $command = $1;
    $args = $2;
  # Handle multi-letter macro/request names in groff mode.
  } else {
    $line =~ /^\.(\S+)(\s+(.*))?$/;
    $command = $1;
    $args = $3;
  }

  $command = '' unless ($command);
  $args = '' unless ($args);

  ######################################################################
  # user-defined macros

  # If the line calls a user-defined macro, skip it.
  return if (exists $user_macro{$command});

  # These are all requests supported by groff 1.23.0.
  my @request = ('ab', 'ad', 'af', 'aln', 'als', 'am', 'am1', 'ami',
		 'ami1', 'as', 'as1', 'asciify', 'backtrace', 'bd',
		 'blm', 'box', 'boxa', 'bp', 'br', 'brp', 'break', 'c2',
		 'cc', 'ce', 'cf', 'cflags', 'ch', 'char', 'chop',
		 'class', 'close', 'color', 'composite', 'continue',
		 'cp', 'cs', 'cu', 'da', 'de', 'de1', 'defcolor', 'dei',
		 'dei1', 'device', 'devicem', 'di', 'do', 'ds', 'ds1',
		 'dt', 'ec', 'ecr', 'ecs', 'el', 'em', 'eo', 'ev',
		 'evc', 'ex', 'fam', 'fc', 'fchar', 'fcolor', 'fi',
		 'fp', 'fschar', 'fspecial', 'ft', 'ftr', 'fzoom',
		 'gcolor', 'hc', 'hcode', 'hla', 'hlm', 'hpf', 'hpfa',
		 'hpfcode', 'hw', 'hy', 'hym', 'hys', 'ie', 'if', 'ig',
		 'in', 'it', 'itc', 'kern', 'lc', 'length', 'linetabs',
		 'lf', 'lg', 'll', 'lsm', 'ls', 'lt', 'mc', 'mk', 'mso',
		 'msoquiet', 'na', 'ne', 'nf', 'nh', 'nm', 'nn', 'nop',
		 'nr', 'nroff', 'ns', 'nx', 'open', 'opena', 'os',
		 'output', 'pc', 'pev', 'pi', 'pl', 'pm', 'pn', 'pnr',
		 'po', 'ps', 'psbb', 'pso', 'ptr', 'pvs', 'rchar', 'rd',
		 'return', 'rfschar', 'rj', 'rm', 'rn', 'rnn', 'rr',
		 'rs', 'rt', 'schar', 'shc', 'shift', 'sizes', 'so',
		 'soquiet', 'sp', 'special', 'spreadwarn', 'ss',
		 'stringdown', 'stringup', 'sty', 'substring', 'sv',
		 'sy', 'ta', 'tc', 'ti', 'tkf', 'tl', 'tm', 'tm1',
		 'tmc', 'tr', 'trf', 'trin', 'trnt', 'troff', 'uf',
		 'ul', 'unformat', 'vpt', 'vs', 'warn', 'warnscale',
		 'wh', 'while', 'write', 'writec', 'writem');

  # Add user-defined macro names to %user_macro.
  #
  # Macros can also be defined with .dei{,1}, ami{,1}, but supporting
  # that would be a heavy lift for the benefit of users that probably
  # don't require grog's help.  --GBR
  if ($command =~ /^(de|am)1?$/) {
    my $name = $args;
    # Strip off any end macro.
    $name =~ s/\s+.*$//;
    # Handle special cases of macros starting with '[' or ']'.
    if ($name =~ /^[][]/) {
      delete $preprocessor_for_macro{'['};
    }
    # XXX: If the macro name shadows a standard macro name, maybe we
    # should delete the latter from our lists and hashes.  This might
    # depend on whether the document is trying to remain compatible
    # with an existing interface, or simply colliding with names they
    # don't care about (consider a raw roff document that defines 'PP').
    # --GBR
    $user_macro{$name} = 0 unless (exists $user_macro{$name});
    return;
  }

  # XXX: Handle .rm as well?

  # Ignore all other requests.  Again, macro names can contain Perl
  # regex metacharacters, so be careful.
  return if (grep(/^\Q$command\E$/, @request));
  # What remains must be a macro name.
  my $macro = $command;

  $have_seen_first_macro_call = 1;
  $score{$macro}++;


  ######################################################################
  # macro package (tmac)
  ######################################################################

  # man and ms share too many macro names for the following approach to
  # be fruitful for many documents; see &infer_man_or_ms_package.
  #
  # We can put one thumb on the scale, however.
  if ((!$have_seen_first_macro_call) && ($macro eq 'TH')) {
    # TH as the first call in a document screams man(7).
    $man_score += 100;
  }

  ##########
  # mdoc
  if ($macro =~ /^Dd$/) {
    &push_main_package('doc');
    return;
  }

  ##########
  # old mdoc
  if ($macro =~ /^(Tp|Dp|De|Cx|Cl)$/) {
    &push_main_package('doc-old');
    return;
  }

  ##########
  # me

  if ($macro =~ /^(
		   [ilnp]p|
		   n[12]|
		   sh
		  )$/x) {
    &push_main_package('e');
    return;
  }


  #############
  # mm and mmse

  if ($macro =~ /^(
		   H|
		   MULB|
		   LO|
		   LT|
		   NCOL|
		   PH|
		   SA
		  )$/x) {
    if ($macro =~ /^LO$/) {
      if ( $args =~ /^(DNAMN|MDAT|BIL|KOMP|DBET|BET|SIDOR)/ ) {
	&push_main_package('mse');
	return;
      }
    } elsif ($macro =~ /^LT$/) {
      if ( $args =~ /^(SVV|SVH)/ ) {
	&push_main_package('mse');
	return;
      }
    }
    &push_main_package('m');
    return;
  }

  ##########
  # mom

  if ($macro =~ /^(
		   ALD|
		   AUTHOR|
		   CHAPTER_TITLE|
		   CHAPTER|
		   COLLATE|
		   DOCHEADER|
		   DOCTITLE|
		   DOCTYPE|
		   DOC_COVER|
		   FAMILY|
		   FAM|
		   FT|
		   LEFT|
		   LL|
		   LS|
		   NEWPAGE|
		   NO_TOC_ENTRY|
		   PAGENUMBER|
		   PAGE|
		   PAGINATION|
		   PAPER|
		   PRINTSTYLE|
		   PT_SIZE|
		   START|
		   TITLE|
		   TOC_AFTER_HERE
		   TOC|
		   T_MARGIN|
		  )$/x) {
    &push_main_package('om');
    return;
  }
} # do_line()

my @preprocessor = ();


sub infer_preprocessors {
  my %option_for_preprocessor =  (
    'eqn', '-e',
    'grap', '-G',
    'grn', '-g',
    'pic', '-p',
    'refer', '-R',
    #'soelim', '-s', # Can't be inferred this way; see grog man page.
    'tbl', '-t',
    'chem', '-j'
  );

  # Use a temporary list we can sort later.  We want the options to show
  # up in a stable order for testing purposes instead of the order their
  # macros turn up in the input.  groff doesn't care about the order.
  my @opt = ();

  foreach my $preproc (@inferred_preprocessor) {
    my $preproc_option = $option_for_preprocessor{$preproc};

    if ($preproc_option) {
      push @opt, $preproc_option;
    } else {
      push @preprocessor, $preproc;
    }
  }
  push @command, sort @opt;
} # infer_preprocessors()


# Return true (1) if either the man or ms package is inferred.
sub infer_man_or_ms_package {
  my @macro_ms = ('RP', 'TL', 'AU', 'AI', 'DA', 'ND', 'AB', 'AE',
		  'QP', 'QS', 'QE', 'XP',
		  'NH',
		  'R',
		  'CW',
		  'BX', 'UL', 'LG', 'NL',
		  'KS', 'KF', 'KE', 'B1', 'B2',
		  'DS', 'DE', 'LD', 'ID', 'BD', 'CD', 'RD',
		  'FS', 'FE',
		  'OH', 'OF', 'EH', 'EF', 'P1',
		  'TA', '1C', '2C', 'MC',
		  'XS', 'XE', 'XA', 'TC', 'PX',
		  'IX', 'SG');

  my @macro_man = ('BR', 'IB', 'IR', 'RB', 'RI', 'P', 'TH', 'TP', 'SS',
		   'HP', 'PD',
		   'AT', 'UC',
		   'SB',
		   'EE', 'EX',
		   'OP',
		   'MT', 'ME', 'SY', 'YS', 'TQ', 'UR', 'UE');

  my @macro_man_or_ms = ('B', 'I', 'BI',
			 'DT',
			 'RS', 'RE',
			 'SH',
			 'SM',
			 'IP', 'LP', 'PP');

  for my $key (@macro_man_or_ms, @macro_man, @macro_ms) {
    $score{$key} = 0 unless exists $score{$key};
  }

  # Compute a score for each package by counting occurrences of their
  # characteristic macros.
  foreach my $key (@macro_man_or_ms) {
    $man_score += $score{$key};
    $ms_score += $score{$key};
  }

  foreach my $key (@macro_man) {
    $man_score += $score{$key};
  }

  foreach my $key (@macro_ms) {
    $ms_score += $score{$key};
  }

  if (!$ms_score && !$man_score) {
    # The input may be a "raw" roff document; this is not a problem,
    # but it does mean no package was inferred.
    return 0;
  } elsif ($ms_score == $man_score) {
    # If there was no TH call, it's not a (valid) man(7) document.
    if (!$score{'TH'}) {
      &push_main_package('s');
    } else {
      &warn("document ambiguous; disambiguate with -man or -ms option");
      $had_inference_problem = 1;
    }
    return 0;
  } elsif ($ms_score > $man_score) {
    &push_main_package('s');
  } else {
    &push_main_package('an');
  }

  return 1;
} # infer_man_or_ms_package()


sub construct_command {
  my @main_package = ('an', 'doc', 'doc-old', 'e', 'm', 'om', 's');
  my $file_args_included;	# file args now only at 1st preproc
  unshift @command, 'groff';
  if (@preprocessor) {
    my @progs;
    $progs[0] = shift @preprocessor;
    push(@progs, @input_file);
    for (@preprocessor) {
      push @progs, '|';
      push @progs, $_;
    }
    push @progs, '|';
    unshift @command, @progs;
    $file_args_included = 1;
  } else {
    $file_args_included = 0;
  }

  foreach (@command) {
    next unless /\s/;
    # when one argument has several words, use accents
    $_ = "'" . $_ . "'";
  }

  my $have_ambiguous_main_package = 0;
  my $inferred_main_package_count = scalar @inferred_main_package;

  # Did we infer multiple full-service packages?
  if ($inferred_main_package_count > 1) {
    $have_ambiguous_main_package = 1;
    # For each one the user explicitly requested...
    for my $pkg (@requested_package) {
      # ...did it resolve the ambiguity for us?
      if (grep(/$pkg/, @inferred_main_package)) {
	@inferred_main_package = ($pkg);
	$have_ambiguous_main_package = 0;
	last;
      }
    }
  } elsif ($inferred_main_package_count == 1) {
    $main_package = shift @inferred_main_package;
  }

  if ($have_ambiguous_main_package) {
    # TODO: Alphabetical is probably not the best ordering here.  We
    # should tally up scores on a per-package basis generally, not just
    # for an and s.
    for my $pkg (@main_package) {
      if (grep(/$pkg/, @inferred_main_package)) {
	$main_package = $pkg;
	&warn("document ambiguous (choosing '$main_package'"
	      . " from '@inferred_main_package'); disambiguate with -m"
	      . " option");
	$had_inference_problem = 1;
	last;
      }
    }
  }

  # If a full-service package was explicitly requested, warn if the
  # inference differs from the request.  This also ensures that all -m
  # arguments are placed in the same order that the user gave them;
  # caveat dictator.
  my @auxiliary_package_argument = ();
  for my $pkg (@requested_package) {
    my $is_auxiliary_package = 1;
    if (grep(/$pkg/, @main_package)) {
      $is_auxiliary_package = 0;
      if ($pkg ne $main_package) {
	&warn("overriding inferred package '$main_package'"
	      . " with requested package '$pkg'");
	$main_package = $pkg;
      }
    }
    if ($is_auxiliary_package) {
      push @auxiliary_package_argument, "-m" . $pkg;
    }
  }

  push @command, '-m' . $main_package if ($main_package);
  push @command, @auxiliary_package_argument;
  push @command, @input_file unless ($file_args_included);

  #########
  # execute the 'groff' command here with option '--run'
  if ( $do_run ) { # with --run
    print STDERR "@command\n";
    my $cmd = join ' ', @command;
    system($cmd);
  } else {
    print "@command\n";
  }
} # construct_command()


sub usage {
  my $stream = *STDOUT;
  my $had_error = shift;
  $stream = *STDERR if $had_error;
  my $grog = $program_name;
  print $stream "usage: $grog [--ligatures] [--run]" .
    " [groff-option ...] [--] [file ...]\n" .
    "usage: $grog {-v | --version}\n" .
    "usage: $grog {-h | --help}\n";
  unless ($had_error) {
    print $stream "\n" .
"Read each roff(7) input FILE and attempt to infer an appropriate\n" .
"groff(1) command to format it.  See the grog(1) manual page.\n";
  }
  exit $had_error;
}


sub version {
  print "GNU $program_name (groff) $groff_version\n";
  exit 0;
} # version()


# initialize

my $in_unbuilt_source_tree = 0;
{
  my $at = '@';
  $in_unbuilt_source_tree = 1 if ('1.23.0.rc4.391-325a' eq "${at}VERSION${at}");
}

$groff_version = '1.23.0.rc4.391-325a' unless ($in_unbuilt_source_tree);

&process_arguments();
&process_input();

if ($have_any_valid_arguments) {
  &infer_preprocessors();
  &infer_man_or_ms_package() if (scalar @inferred_main_package != 1);
  &construct_command();
}

exit 2 if ($had_processing_problem);
exit 1 if ($had_inference_problem);
exit 0;

# Local Variables:
# fill-column: 72
# mode: CPerl
# End:
# vim: set cindent noexpandtab shiftwidth=2 softtabstop=2 textwidth=72:

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [TUHS] Re: Trying to date "A Supplemental Document For Awk"
  2023-06-29  1:59         ` Stuff Received
  2023-06-29  6:27           ` segaloco via TUHS
@ 2023-06-29 13:45           ` G. Branden Robinson
  1 sibling, 0 replies; 20+ messages in thread
From: G. Branden Robinson @ 2023-06-29 13:45 UTC (permalink / raw)
  To: Stuff Received; +Cc: tuhs

[-- Attachment #1: Type: text/plain, Size: 545 bytes --]

At 2023-06-28T21:59:24-0400, Stuff Received wrote:
> and why was the compilation line never placed in a comment in the
> file?

Having done some work with historical *roff documents, my conjecture is
that the single source of truth was usually to be found in a Makefile.
Unfortunately, *roff documents have not reliably been distributed along
with the scripts directing control of their compilation and
installation.  If you insist upon that, you start sounding like one of
those street-corner preaching copyleft people... ;-)

Regards,
Branden

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [TUHS] Re: Trying to date "A Supplemental Document For Awk"
  2023-06-29 13:34         ` G. Branden Robinson
@ 2023-06-29 13:47           ` Rich Salz
  2023-06-29 19:03             ` Steffen Nurpmeso
  0 siblings, 1 reply; 20+ messages in thread
From: Rich Salz @ 2023-06-29 13:47 UTC (permalink / raw)
  To: G. Branden Robinson; +Cc: tuhs

[-- Attachment #1: Type: text/plain, Size: 75 bytes --]

A perl script to inuit likely roff options as definitely a neat Unix hack.

[-- Attachment #2: Type: text/html, Size: 100 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [TUHS] Re: Trying to date "A Supplemental Document For Awk"
  2023-06-29  6:27           ` segaloco via TUHS
  2023-06-29  6:41             ` Andrew Hume
  2023-06-29  6:44             ` Noel Hunt
@ 2023-06-29 14:02             ` G. Branden Robinson
  2 siblings, 0 replies; 20+ messages in thread
From: G. Branden Robinson @ 2023-06-29 14:02 UTC (permalink / raw)
  To: segaloco; +Cc: The Eunuchs Hysterical Society

[-- Attachment #1: Type: text/plain, Size: 3759 bytes --]

At 2023-06-29T06:27:44+0000, segaloco via TUHS wrote:
> Man of course finds use in the manual pages (although there are
> different representations of manpages in nroff over time.)

Setting aside the well known bifurcation between man(7) and mdoc(7),
which manage to stay out of each other's way in the macro name space,
I'm not aware of any comparative survey of different man(7)
implementations.  Ultrix at some point--I have no insight into the
chronology of it--had a large set of extensions that remains quietly
documented and supported by groff to this day, albeit off in a corner
where it seems to receive little attention.  (Just as well, in my
opinion, as not all of its innovations are worthy of embrace.)

As far as other vendor extensions and developments go, I have collected
all of the information known to me into the groff_man(7) page in the
any-minute-now groff 1.23.0 release.  Here are the relevant sections.
(There are two because concept and implementation are distinguishable.)

  History
    M. Douglas McIlroy designed, implemented, and documented the AT&T
    man macros for Unix Version 7 (1979) and employed them to edit the
    first volume of its Programmer's Manual, a compilation of all man
    pages supplied by the system.  That man supported the macros listed
    in this page not described as extensions, except .P and the
    deprecated .AT and .UC.  The only strings defined were R and S; no
    registers were documented.

    .UC appeared in 3BSD (1980).  Unix System III (1980) introduced .P
    and exposed the registers IN and LL, which had been internal to
    Seventh Edition Unix man.  PWB/UNIX 2.0 (1980) added the Tm string.
    4BSD (1980) added lq and rq strings.  SunOS 2.0 (1985) recognized C,
    D, P, and X registers.  4.3BSD (1986) added .AT and .P.  Ninth
    Edition Research Unix (1986) introduced .EX and .EE.  SunOS 4.0
    (1988) added .SB.

    The foregoing features were what James Clark implemented in early
    versions of groff.  Later, groff 1.20 (2009) originated .SY/.YS,
    .TQ, .MT/.ME, and .UR/.UE.  Plan 9 from User Space's troff
    introduced .MR in 2020.

Authors
    The initial GNU implementation of the man macro package was written
    by James Clark.  Later, Werner Lemberg supplied the S, LT, and cR
    registers, the last a 4.3BSD-Reno mdoc(7) feature.  Larry Kollar
    added the FT, HY, and SN registers; the HF string; and the PT and BT
    macros.  G. Branden Robinson implemented the AD and MF strings; CS,
    CT, and U registers; and the MR macro.  Except for .SB, the
    extension macros were written by Lemberg, Eric S. Raymond, and
    Robinson.

    This document was originally written for the Debian GNU/Linux system
    by Susan G. Kleinmann.  It was corrected and updated by Lemberg and
    Robinson.  The extension macros were documented by Raymond and
    Robinson.

I welcome any further insights people can offer.  This man page isn't
the best place to document extensions that withered on the vine (like
Eighth/Ninth Edition Research Unix's addition of multi-column macros for
man(7)), but I wouldn't mind collecting such things into some sort of
auxiliary article.

While the mandoc(1)/mdocml project's "History of UNIX Manpages"[1] is an
invaluable resource, it doesn't really do what's written on the tin, and
serves more as a history of (some) *roff _formatters_--not of the man(7)
language.  I assume that this stance is in part due to the unease
bordering on antipathy that mandoc(1) proponents have for the man(7)
macro package.  In their view, everybody should be writing mdoc(7).
Unfortunately this lacuna has left useful historical information about
the man(7) package uncollected.

Regards,
Branden

[1] https://manpages.bsd.lv/history.html

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [TUHS] Re: Trying to date "A Supplemental Document For Awk"
  2023-06-29 13:47           ` Rich Salz
@ 2023-06-29 19:03             ` Steffen Nurpmeso
  0 siblings, 0 replies; 20+ messages in thread
From: Steffen Nurpmeso @ 2023-06-29 19:03 UTC (permalink / raw)
  To: tuhs

Rich Salz wrote in
 <CAFH29toi4aFfGY7g+SndAPz6ndjk8j+LKZOGfnd2GQGnrNXKhw@mail.gmail.com>:
 |A perl script to inuit likely roff options as definitely a neat Unix hack.

The "problem" is that the "shebang" line used for UNIX man'uals on
at least a few ("newer" <> post Y2K) systems has never been
extended in plain *roff terms, for general macro things.  Ie that

   For example, newer man(1)s read the first line of the manual and
   check for a syntax <^'\" >followed by concat of [egprtv]+ (and in
   fact  *join in* $MANROFFSEQ environment [egprtv]+)
                while getopts 'egprtv' preproc_arg; do
                        case "${preproc_arg}" in
                        e)      pipeline="$pipeline | $EQN" ;;
                        g)      GRAP  ;; # Ignore for compatibility.
                        p)      pipeline="$pipeline | $PIC" ;;
                        r)      pipeline="$pipeline | $REFER" ;;
                        t)      pipeline="$pipeline | $TBL" ;;
                        v)      pipeline="$pipeline | $VGRIND" ;;
                        *)      usage ;;
                        esac

Of course, most roff's do not have that "super process" that groff
actually is, for one, so you have to formulate pipelines anyway.
And then roff is dead for the young.  Generally speaking.

It is only a pity in my opinion because the most widely used
implementation (GNU roff) actually does "magic" already and
anyway, namely in its preconv(1), which does

       preconv tries to find the input encoding with the following algorithm.
       ...
       2.     Otherwise, check whether the input starts with a Byte Order Mark
              (BOM, see below).  If found, use it.

       3.     Otherwise, check whether there is a known coding tag (see below)
              in either the first or second input line.  If found, use it.
       ...
       5.     If everything fails[.]


And 3. is then

  [.]supports the coding tag convention (with some restrictions)
  as used by GNU Emacs and XEmacs[.]
  ...
  .\" -*- mode: troff; coding: latin-2 -*-

But possibly the future brings not only integrative and truthful
western white men, but also a roff which "can".  The former
i doubt, the latter i can still hope for.

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2023-06-29 19:03 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-28  6:26 [TUHS] Trying to date "A Supplemental Document For Awk" Aharon Robbins
2023-06-28  6:45 ` [TUHS] " arnold
2023-06-28 17:48 ` Adam Sampson
2023-06-28 18:03   ` KenUnix
2023-06-28 18:38     ` Clem Cole
2023-06-28 23:47       ` Greg 'groggy' Lehey
2023-06-29  1:59         ` Stuff Received
2023-06-29  6:27           ` segaloco via TUHS
2023-06-29  6:41             ` Andrew Hume
2023-06-29  6:45               ` Noel Hunt
2023-06-29  6:48                 ` Andrew Hume
2023-06-29  6:50                   ` arnold
2023-06-29  6:44             ` Noel Hunt
2023-06-29 14:02             ` G. Branden Robinson
2023-06-29 13:45           ` G. Branden Robinson
2023-06-29 13:34         ` G. Branden Robinson
2023-06-29 13:47           ` Rich Salz
2023-06-29 19:03             ` Steffen Nurpmeso
2023-06-29  1:04     ` Bakul Shah
2023-06-29  0:26 ` Jeremy C. Reed

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).