[TUHS] The 'usage: ...' message. (Was: On Bloat...)

The Unix Heritage Society mailing list
 help / color / mirror / Atom feed

* [TUHS] The 'usage: ...' message. (Was: On Bloat...)
@ 2024-05-19 23:08 Douglas McIlroy
  2024-05-20  0:58 ` [TUHS] " Rob Pike
  0 siblings, 1 reply; 20+ messages in thread
From: Douglas McIlroy @ 2024-05-19 23:08 UTC (permalink / raw)
  To: TUHS main list

[-- Attachment #1: Type: text/plain, Size: 2004 bytes --]

>> Another non-descriptive style of error message that I admired was that
>> of Berkeley Pascal's syntax diagnostics. When the LR parser could not
>> proceed, it reported where, and automatically provided a sample token
>> that would allow the parsing to progress. I found this uniform
>> convention to be at least as informative as distinct hand-crafted
>> messages, which almost by definition can't foresee every contingency.
>> Alas, this elegant scheme seems not to have inspired imitators.

> The hazard with this approach is that the suggested syntactic correction
> might simply lead the user farther into the weeds

I don't think there's enough experience to justify this claim. Before I
experienced the Berkeley compiler, I would have thought such bad outcomes
were inevitable in any language. Although the compilers' suggestions often
bore little or no relationship to the real correction,  I always found them
informative. In particular, the utterly consistent style assured there was
never an issue of ambiguity or of technical jargon.

The compiler taught me Pascal in an evening. I had scanned the Pascal
Report a couple of years before but had never written a Pascal program.
With no manual at hand, I looked at one program to find out what
mumbo-jumbo had to come first and how to print integers, then wrote the
rest by trial and error. Within a couple of hours  I had a working program
good enough to pass muster in an ACM journal.

An example arose that one might think would lead "into the weeds". The
parser balked before 'or' in a compound Boolean expression like  'a=b and
c=d or x=y'. It couldn't suggest a right paren because no left paren had
been seen. Whatever suggestion it did make (perhaps 'then') was enough to
lead me to insert a remote left paren and teach me that parens are required
around Boolean-valued subexpressions. (I will agree that this lesson might
be less clear to a programming novice, but so might be many conventional
diagnostics, e.g. "no effect".)

Doug

[-- Attachment #2: Type: text/html, Size: 3392 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [TUHS] Re: The 'usage: ...' message. (Was: On Bloat...)
  2024-05-19 23:08 [TUHS] The 'usage: ...' message. (Was: On Bloat...) Douglas McIlroy
@ 2024-05-20  0:58 ` Rob Pike
  2024-05-20  3:19   ` arnold
                     ` (3 more replies)
  0 siblings, 4 replies; 20+ messages in thread
From: Rob Pike @ 2024-05-20  0:58 UTC (permalink / raw)
  To: Douglas McIlroy; +Cc: TUHS main list

[-- Attachment #1: Type: text/plain, Size: 561 bytes --]

The Cornell PL/I compiler, PL/C, ran on the IBM 360 so of course used batch
input. It tried automatically to keep things running after a parsing error
by inserting some token - semicolon, parenthesis, whatever seemed best -
and continuing to parse, in order to maximize the amount of input that
could be parsed before giving up. At least, that's what I took the
motivation to be. It rarely succeeded in fixing the actual problem, despite
PL/I being plastered with semicolons, but it did tend to ferret out more
errors per run. I found the tactic helpful.

-rob

[-- Attachment #2: Type: text/html, Size: 873 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [TUHS] Re: The 'usage: ...' message. (Was: On Bloat...)
  2024-05-20  0:58 ` [TUHS] " Rob Pike
@ 2024-05-20  3:19   ` arnold
  2024-05-20  3:43     ` Warner Losh
  2024-05-20  9:20     ` [TUHS] A fuzzy awk. (Was: The 'usage: ...' message.) Ralph Corderoy
  2024-05-20  3:54   ` [TUHS] Re: The 'usage: ...' message. (Was: On Bloat...) Bakul Shah via TUHS
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 20+ messages in thread
From: arnold @ 2024-05-20  3:19 UTC (permalink / raw)
  To: robpike, douglas.mcilroy; +Cc: tuhs

Rob Pike <robpike@gmail.com> wrote:

> The Cornell PL/I compiler, PL/C, ran on the IBM 360 so of course used batch
> input. It tried automatically to keep things running after a parsing error
> by inserting some token - semicolon, parenthesis, whatever seemed best -
> and continuing to parse, in order to maximize the amount of input that
> could be parsed before giving up. At least, that's what I took the
> motivation to be. It rarely succeeded in fixing the actual problem, despite
> PL/I being plastered with semicolons, but it did tend to ferret out more
> errors per run. I found the tactic helpful.
>
> -rob

Gawk used to do this, until people started fuzzing it, causing cascading
errors and eventually core dumps. Now the first syntax error is fatal.
It got to the point where I added this text to the manual:

	In recent years, people have been running "fuzzers" to generate
	invalid awk programs in order to find and report (so-called)
	bugs in gawk.

	In general, such reports are not of much practical use. The
	programs they create are not realistic and the bugs found are
	generally from some kind of memory corruption that is fatal
	anyway.

	So, if you want to run a fuzzer against gawk and report the
	results, you may do so, but be aware that such reports don’t
	carry the same weight as reports of real bugs do.

(Yeah, I've just changed the subject, feel free to stay on topic. :-)

Arnold

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [TUHS] Re: The 'usage: ...' message. (Was: On Bloat...)
  2024-05-20  3:19   ` arnold
@ 2024-05-20  3:43     ` Warner Losh
  2024-05-20  4:46       ` arnold
  2024-05-20  9:20     ` [TUHS] A fuzzy awk. (Was: The 'usage: ...' message.) Ralph Corderoy
  1 sibling, 1 reply; 20+ messages in thread
From: Warner Losh @ 2024-05-20  3:43 UTC (permalink / raw)
  To: Arnold Robbins; +Cc: Douglas McIlroy, The Eunuchs Hysterical Society

[-- Attachment #1: Type: text/plain, Size: 1697 bytes --]

On Sun, May 19, 2024, 9:19 PM <arnold@skeeve.com> wrote:

> Rob Pike <robpike@gmail.com> wrote:
>
> > The Cornell PL/I compiler, PL/C, ran on the IBM 360 so of course used
> batch
> > input. It tried automatically to keep things running after a parsing
> error
> > by inserting some token - semicolon, parenthesis, whatever seemed best -
> > and continuing to parse, in order to maximize the amount of input that
> > could be parsed before giving up. At least, that's what I took the
> > motivation to be. It rarely succeeded in fixing the actual problem,
> despite
> > PL/I being plastered with semicolons, but it did tend to ferret out more
> > errors per run. I found the tactic helpful.
> >
> > -rob
>
> Gawk used to do this, until people started fuzzing it, causing cascading
> errors and eventually core dumps. Now the first syntax error is fatal.
> It got to the point where I added this text to the manual:
>
>         In recent years, people have been running "fuzzers" to generate
>         invalid awk programs in order to find and report (so-called)
>         bugs in gawk.
>
>         In general, such reports are not of much practical use. The
>         programs they create are not realistic and the bugs found are
>         generally from some kind of memory corruption that is fatal
>         anyway.
>
>         So, if you want to run a fuzzer against gawk and report the
>         results, you may do so, but be aware that such reports don’t
>         carry the same weight as reports of real bugs do.
>
> (Yeah, I've just changed the subject, feel free to stay on topic. :-)
>


Awk bailing out near line 1.

Warner

> Arnold
>

[-- Attachment #2: Type: text/html, Size: 2473 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [TUHS] Re: The 'usage: ...' message. (Was: On Bloat...)
  2024-05-20  0:58 ` [TUHS] " Rob Pike
  2024-05-20  3:19   ` arnold
@ 2024-05-20  3:54   ` Bakul Shah via TUHS
  2024-05-20 14:23   ` Clem Cole
  2024-05-20 17:40   ` Stuff Received
  3 siblings, 0 replies; 20+ messages in thread
From: Bakul Shah via TUHS @ 2024-05-20  3:54 UTC (permalink / raw)
  To: The Unix Heritage Society mailing list

[-- Attachment #1: Type: text/plain, Size: 798 bytes --]

I remember helping newbie students at USC who were very confused that even though they made the changes "PL/C USES", their program didn't work! 

> On May 19, 2024, at 5:58 PM, Rob Pike <robpike@gmail.com> wrote:
> 
> The Cornell PL/I compiler, PL/C, ran on the IBM 360 so of course used batch input. It tried automatically to keep things running after a parsing error by inserting some token - semicolon, parenthesis, whatever seemed best - and continuing to parse, in order to maximize the amount of input that could be parsed before giving up. At least, that's what I took the motivation to be. It rarely succeeded in fixing the actual problem, despite PL/I being plastered with semicolons, but it did tend to ferret out more errors per run. I found the tactic helpful.
> 
> -rob
> 


[-- Attachment #2: Type: text/html, Size: 1443 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [TUHS] Re: The 'usage: ...' message. (Was: On Bloat...)
  2024-05-20  3:43     ` Warner Losh
@ 2024-05-20  4:46       ` arnold
  0 siblings, 0 replies; 20+ messages in thread
From: arnold @ 2024-05-20  4:46 UTC (permalink / raw)
  To: imp, arnold; +Cc: tuhs, douglas.mcilroy

Warner Losh <imp@bsdimp.com> wrote:

> > (Yeah, I've just changed the subject, feel free to stay on topic. :-)
>
> Awk bailing out near line 1.

	$ gawk --nostalgia
	awk: bailing out near line 1
	Aborted (core dumped)

A very long time Easter Egg... :-)

Arnold

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [TUHS] A fuzzy awk.  (Was: The 'usage: ...' message.)
  2024-05-20  3:19   ` arnold
  2024-05-20  3:43     ` Warner Losh
@ 2024-05-20  9:20     ` Ralph Corderoy
  2024-05-20 11:58       ` [TUHS] " arnold
  2024-05-20 13:10       ` Chet Ramey
  1 sibling, 2 replies; 20+ messages in thread
From: Ralph Corderoy @ 2024-05-20  9:20 UTC (permalink / raw)
  To: tuhs

Hi Arnold,

> > in order to maximize the amount of input that could be parsed before
> > giving up.
>
> Gawk used to do this, until people started fuzzing it, causing
> cascading errors and eventually core dumps.  Now the first syntax
> error is fatal.

This is the first time I've heard of making life difficult for fuzzers
so I'm curious...

I'm assuming you agree the eventual core dump was a bug somewhere to be
fixed, and probably was.  Stopping on the first error lessens the
‘attack surface’ for the fuzzer.  Do you think there remains a bug which
would bite a user which the fuzzer might have found more easily before
the shrunken surface?

-- 
Cheers, Ralph.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [TUHS] Re: A fuzzy awk.  (Was: The 'usage: ...' message.)
  2024-05-20  9:20     ` [TUHS] A fuzzy awk. (Was: The 'usage: ...' message.) Ralph Corderoy
@ 2024-05-20 11:58       ` arnold
  2024-05-20 13:10       ` Chet Ramey
  1 sibling, 0 replies; 20+ messages in thread
From: arnold @ 2024-05-20 11:58 UTC (permalink / raw)
  To: tuhs, ralph

Ralph Corderoy <ralph@inputplus.co.uk> wrote:

> This is the first time I've heard of making life difficult for fuzzers
> so I'm curious...

I was making life easier for me. :-)

> I'm assuming you agree the eventual core dump was a bug somewhere to be
> fixed, and probably was.

Not really. Hugely syntactically invalid programs can end up causing
memory corruption as necessary data structures don't get built correctly
(or at all); since they're invalid, subsequent bits of gawk that expect
valid data structures end up not working.  These are "bugs" that can't
happen when using the tool correctly.

> Stopping on the first error lessens the ‘attack surface’ for the
> fuzzer.  Do you think there remains a bug which would bite a user which
> the fuzzer might have found more easily before the shrunken surface?

No.

I don't have any examples handy, but you can look back through the
bug-gawk archives for some examples of these reports.  The number
of true bugs that fuzzers have caught (if any!) could be counted
on one hand.

Sometimes they like to claim that the "bugs" they find could cause
denial of service attacks. That's also specious, gawk isn't used for
long-running server kinds of programs.

The joys of being a Free Software Maintainer.

Arnold

P.S. I don't claim that gawk is bug-free.  But I do think that there
are qualitatively different kinds of bugs, and bug reports.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [TUHS] Re: A fuzzy awk. (Was: The 'usage: ...' message.)
  2024-05-20  9:20     ` [TUHS] A fuzzy awk. (Was: The 'usage: ...' message.) Ralph Corderoy
  2024-05-20 11:58       ` [TUHS] " arnold
@ 2024-05-20 13:10       ` Chet Ramey
  2024-05-20 13:30         ` [TUHS] Re: A fuzzy awk Ralph Corderoy
  1 sibling, 1 reply; 20+ messages in thread
From: Chet Ramey @ 2024-05-20 13:10 UTC (permalink / raw)
  To: Ralph Corderoy, tuhs


[-- Attachment #1.1: Type: text/plain, Size: 1487 bytes --]

On 5/20/24 5:20 AM, Ralph Corderoy wrote:
> Hi Arnold,
> 
>>> in order to maximize the amount of input that could be parsed before
>>> giving up.
>>
>> Gawk used to do this, until people started fuzzing it, causing
>> cascading errors and eventually core dumps.  Now the first syntax
>> error is fatal.
> 
> This is the first time I've heard of making life difficult for fuzzers
> so I'm curious...

It's not making life difficult for them -- they can still fuzz all they
want. Chances are better they'll find a genuine bug if you stop right away.


> I'm assuming you agree the eventual core dump was a bug somewhere to be
> fixed, and probably was.  > Stopping on the first error lessens the
> ‘attack surface’ for the fuzzer.  Do you think there remains a bug which
> would bite a user which the fuzzer might have found more easily before
> the shrunken surface?

Chances are small. (People fuzz bash all the time, and that is my
experience.)

Look at it this way. Free Software maintainers have limited resources. Is
it better to spend time on bugs that will affect a larger percentage of
the user population, instead of those that require artificial circumstances
that won't be encountered by normal usage? Those get pushed down on the
priority list.

Chet
-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
		 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    chet@case.edu    http://tiswww.cwru.edu/~chet/


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 203 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [TUHS] Re: A fuzzy awk.
  2024-05-20 13:10       ` Chet Ramey
@ 2024-05-20 13:30         ` Ralph Corderoy
  2024-05-20 13:48           ` Chet Ramey
  0 siblings, 1 reply; 20+ messages in thread
From: Ralph Corderoy @ 2024-05-20 13:30 UTC (permalink / raw)
  To: tuhs

Hi Chet,

> Is it better to spend time on bugs that will affect a larger
> percentage of the user population, instead of those that require
> artificial circumstances that won't be encountered by normal usage?
> Those get pushed down on the priority list.

You're talking about pushing unlikely, fuzzed bugs down the prioritised
list, but we're discussing those bugs not getting onto the list for
consideration.

Lack of resources also applies to triaging bugs and I agree a fuzzed bug
which hands over a 42 KiB of dense, gibberish awk will probably not get
volunteer attention.  But then fuzzers can seek a smaller test case,
similar to Andreas Zeller's delta debugging.

I'm in no way criticising Arnold who, like you, has spent many years
voluntarily enhancing a program many of us use every day.  But it's
interesting to shine some light on this corner to better understand
what's happening.

-- 
Cheers, Ralph.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [TUHS] Re: A fuzzy awk.
  2024-05-20 13:30         ` [TUHS] Re: A fuzzy awk Ralph Corderoy
@ 2024-05-20 13:48           ` Chet Ramey
  0 siblings, 0 replies; 20+ messages in thread
From: Chet Ramey @ 2024-05-20 13:48 UTC (permalink / raw)
  To: Ralph Corderoy, tuhs

[-- Attachment #1.1: Type: text/plain, Size: 1341 bytes --]

On 5/20/24 9:30 AM, Ralph Corderoy wrote:
> Hi Chet,
> 
>> Is it better to spend time on bugs that will affect a larger
>> percentage of the user population, instead of those that require
>> artificial circumstances that won't be encountered by normal usage?
>> Those get pushed down on the priority list.
> 
> You're talking about pushing unlikely, fuzzed bugs down the prioritised
> list, but we're discussing those bugs not getting onto the list for
> consideration.

I think the question is whether they were bugs in gawk at all, or the
result of gawk trying to be helpful by guessing at the script's intent
and trying to go on. Arnold's reaction to that, which had these negative
effects most often as the result of fuzzing attempts, was to exit on the
first syntax error.

Would those `bugs' have manifested themselves if gawk hadn't tried to do
this? Are they bugs at all? Guessing at intent is bound to be wrong some
of the time, and cause errors of its own.

I'm saying that fuzzing does occasionally find obscure bugs -- bugs that
would never be encountered in normal usage -- and those should be fixed.
Eventually.

-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
		 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    chet@case.edu    http://tiswww.cwru.edu/~chet/

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 203 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [TUHS] Re: The 'usage: ...' message. (Was: On Bloat...)
  2024-05-20  0:58 ` [TUHS] " Rob Pike
  2024-05-20  3:19   ` arnold
  2024-05-20  3:54   ` [TUHS] Re: The 'usage: ...' message. (Was: On Bloat...) Bakul Shah via TUHS
@ 2024-05-20 14:23   ` Clem Cole
  2024-05-20 17:30     ` Greg A. Woods
  2024-05-20 20:10     ` John Levine
  2024-05-20 17:40   ` Stuff Received
  3 siblings, 2 replies; 20+ messages in thread
From: Clem Cole @ 2024-05-20 14:23 UTC (permalink / raw)
  To: Rob Pike; +Cc: Douglas McIlroy, TUHS main list

[-- Attachment #1: Type: text/plain, Size: 3354 bytes --]

I was going to keep silent on this one until I realized I disagree with
both Doug and Rob here (always a little dangerous). But because of personal
experience, I have a pretty strong opinion is not really a win.  Note that I
cribbed this email response from an answer I wrote on Quora to the
question:

*When you are programming and commit a minor error, such as forgetting a
semicolon, the compiler throws an error and makes you fix it for yourself.
Why doesn’t it just fix it by itself and notify you of the fix instead?*


FWIW: The first version of the idea that I now about was DWIM - *Do What I
Mean* feature from BBN’s LISP (that eventually made it into InterLISP). As
the Wikipedia page describes DWIM became known as "Damn Warren's Infernal
Machine" [more details in the DWIM section of the jargon file].  As Doug
points out, the original Pascal implementation for Unix, pix(1), also
supported this idea of fixing your code for you, and as Rob points out, UCB’s
pix(1) took the idea of trying to keep going and make the compile work from
the earlier Cornell PL/C compiler for the IBM 360[1], which to quote
Wikipedia:

“The PL/C compiler had the unusual capability of never failing to compile
> any program, through the use of extensive automatic correction of many
> syntax errors and by converting any remaining syntax errors to output
> statements.”


The problem is that people can be lazy, and instead of using " DWIM" as a
tool to speed up their development and fix their own errors, they just
ignore the errors. In fact, when we were teaching the “Intro to CS” course
at UCB in the early 1980s; we actually had students turn in programs that
had syntax errors in them because pix(1) had corrected their code -- instead
of the student fixing his/her code before handing the program into the TA
(and then they would complain when they got “marked down” on the assignment
— sigh).

IMO: All in all, the experiment failed because many (??most??) people
really don’t work that way. Putting a feature like this in an IDE or even
an editor like emacs might be reasonable since the sources would be
modified, but it means you need to like using an IDE. I also ask --> what
happens when the computer’s (IDE) guess is different from the programmer's
real intent, and since it was ‘fixed’ behind the curtain, you don’t notice
it?

Some other people have suggested that DWIM isn’t a little like spelling
‘auto-correct’ or tools like ‘Grammarly.’ The truth is, I have a love/hate
relationship with auto-correct, particularly on my mobile devices. I'm
dyslexic, so tools like this can be helpful to me sometimes, but I spend a
great deal of my time fighting these types of tools because they are so
often wrong, particularly with a small screen/keyboard, that it is just
“not fun.”

This brings me back to my experience. IMO, auto-correct for programming is
like DWIM all over again, and the cure causes more problems than it solves.

Clem

[1] I should add that after Cornell’s PL/C compiler was introduced, IBM
eventually added a similar feature to its own PL/1, although it was not
nearly as extensive as the Cornell solution. I’m sure you can find people
who liked it, but in both cases, I personally never found it that useful.

>
ᐧ

[-- Attachment #2: Type: text/html, Size: 5841 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [TUHS] Re: The 'usage: ...' message. (Was: On Bloat...)
  2024-05-20 14:23   ` Clem Cole
@ 2024-05-20 17:30     ` Greg A. Woods
  2024-05-20 20:10     ` John Levine
  1 sibling, 0 replies; 20+ messages in thread
From: Greg A. Woods @ 2024-05-20 17:30 UTC (permalink / raw)
  To: The Unix Heritage Society mailing list

[-- Attachment #1: Type: text/plain, Size: 724 bytes --]

At Mon, 20 May 2024 10:23:56 -0400, Clem Cole <clemc@ccc.com> wrote:
Subject: [TUHS] Re: The 'usage: ...' message. (Was: On Bloat...)
>
> This brings me back to my experience. IMO, auto-correct for programming is
> like DWIM all over again, and the cure causes more problems than it solves.

We're deep down that rabbit hole this time with LLM/GPT systems
generating large swathes of code that I believe all too often gets into
production without any human programmer fully vetting its fitness for
purpose, or perhaps even understanding it.

--
					Greg A. Woods <gwoods@acm.org>

Kelowna, BC     +1 250 762-7675           RoboHack <woods@robohack.ca>
Planix, Inc. <woods@planix.com>     Avoncote Farms <woods@avoncote.ca>

[-- Attachment #2: OpenPGP Digital Signature --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [TUHS] Re: The 'usage: ...' message. (Was: On Bloat...)
  2024-05-20  0:58 ` [TUHS] " Rob Pike
                     ` (2 preceding siblings ...)
  2024-05-20 14:23   ` Clem Cole
@ 2024-05-20 17:40   ` Stuff Received
  3 siblings, 0 replies; 20+ messages in thread
From: Stuff Received @ 2024-05-20 17:40 UTC (permalink / raw)
  To: tuhs

On 2024-05-19 20:58, Rob Pike wrote:
> The Cornell PL/I compiler, PL/C, ran on the IBM 360 so of course used 
> batch input. It tried automatically to keep things running after a 
> parsing error by inserting some token - semicolon, parenthesis, whatever 
> seemed best - and continuing to parse, in order to maximize the amount 
> of input that could be parsed before giving up. At least, that's what I 
> took the motivation to be. It rarely succeeded in fixing the actual 
> problem, despite PL/I being plastered with semicolons, but it did tend 
> to ferret out more errors per run. I found the tactic helpful.
> 
> -rob
> 

Possibly way off topic but Toronto allowed anyone to run PL/C decks for 
free, which I often did.  One day, they decided to allow all of the card 
to be read as text and my card numbers generated all sorts of errors. 
(At least easily fixed by a visit the card punch.)

S.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [TUHS] Re: The 'usage: ...' message. (Was: On Bloat...)
  2024-05-20 14:23   ` Clem Cole
  2024-05-20 17:30     ` Greg A. Woods
@ 2024-05-20 20:10     ` John Levine
  2024-05-21  1:14       ` John Cowan
  1 sibling, 1 reply; 20+ messages in thread
From: John Levine @ 2024-05-20 20:10 UTC (permalink / raw)
  To: tuhs

It appears that Clem Cole <clemc@ccc.com> said:
>“The PL/C compiler had the unusual capability of never failing to compile
>> any program, through the use of extensive automatic correction of many
>> syntax errors and by converting any remaining syntax errors to output
>> statements.”
>
>The problem is that people can be lazy, and instead of using " DWIM" as a
>tool to speed up their development and fix their own errors, they just
>ignore the errors. ...

PL/C was a long time ago in the early 1970s. People used it on batch
systems whre you handed in your cards at the window, waited a while,
and later got your printout back. Or at advanced places, you could
run the cards through the reader yourself, then wait until the batch
ran.

In that environment, the benefit from possibly guessing an error
correction right meant fewer trips to the card reader. In my youth I
did a fair amount of programming that way in WATFOR/WATFIV and Algol W
where we really tried to get the programs right since we wanted to
finish up and go home.

When I was using interactive systems where you could fix one bug and
try again, over and over, it seemed like cheating.

R's,
John

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [TUHS] Re: The 'usage: ...' message. (Was: On Bloat...)
  2024-05-20 20:10     ` John Levine
@ 2024-05-21  1:14       ` John Cowan
  0 siblings, 0 replies; 20+ messages in thread
From: John Cowan @ 2024-05-21  1:14 UTC (permalink / raw)
  To: John Levine; +Cc: tuhs

[-- Attachment #1: Type: text/plain, Size: 1465 bytes --]

On Mon, May 20, 2024 at 4:11 PM John Levine <johnl@taugh.com> wrote:

It appears that Clem Cole <clemc@ccc.com> said:
> >“The PL/C compiler had the unusual capability of never failing to compile
> >> any program, through the use of extensive automatic correction of many
> >> syntax errors and by converting any remaining syntax errors to output
> >> statements.”
> PL/C was a long time ago in the early 1970s. People used it on batch
> systems whre you handed in your cards at the window, waited a while,
> and later got your printout back. Or at advanced places, you could
> run the cards through the reader yourself, then wait until the batch
> ran.


PL/C was a 3rd-generation autocorrection programming language.  CORC was
the 1962 version and CUPL was the 1966 version (same date as DWIM), neither
of them based on PL/I.  There is an implementation of both at <
http://www.catb.org/~esr/cupl/>.

The Wikipedia DWIM article also points to Magit, the Emacs git client.

>
> In that environment, the benefit from possibly guessing an error
> correction right meant fewer trips to the card reader. In my youth I
> did a fair amount of programming that way in WATFOR/WATFIV and Algol W
> where we really tried to get the programs right since we wanted to
> finish up and go home.
>
> When I was using interactive systems where you could fix one bug and
> try again, over and over, it seemed like cheating.
>
> R's,
> John
>

[-- Attachment #2: Type: text/html, Size: 2636 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [TUHS] Re: A fuzzy awk. (Was: The 'usage: ...' message.)
  2024-05-20 13:06 [TUHS] A fuzzy awk. (Was: The 'usage: ...' message.) Douglas McIlroy
  2024-05-20 13:14 ` [TUHS] " arnold
  2024-05-20 13:25 ` Chet Ramey
@ 2024-05-20 16:06 ` Paul Winalski
  2 siblings, 0 replies; 20+ messages in thread
From: Paul Winalski @ 2024-05-20 16:06 UTC (permalink / raw)
  To: TUHS main list

[-- Attachment #1: Type: text/plain, Size: 2182 bytes --]

On Mon, May 20, 2024 at 9:17 AM Douglas McIlroy <
douglas.mcilroy@dartmouth.edu> wrote:

> I'm surprised by nonchalance about bad inputs evoking bad program
> behavior. That attitude may have been excusable 50 years ago. By now,
> though, we have seen so much malicious exploitation of open avenues of
> "undefined behavior" that we can no longer ignore bugs that "can't happen
> when using the tool correctly". Mature software should not brook incorrect
> usage.
>
> Accepting bad inputs can also lead to security issues.  The data breaches
from SQL-based attacks are a modern case in point.

IMO, as a programmer you owe it to your users to do your best to detect bad
input and to handle it in a graceful fashion.  Nothing is more frustrating
to a user than to have a program blow up in their face with a seg fault, or
even worse, simply exit silently.

As the DEC compiler team's expert on object files, I was called on to add
object file support to a compiler back end originally targeted to VMS
only.  I inherited support of the object file generator for Unix COFF and
later wrote the support for Microsoft PECOFF and ELF.  When our group was
bought by Intel I did the object file support for Apple OS X MACH-O in the
Intel compiler back end.

I found that the folks who write linkers are particularly lazy about error
checking and error handling.  They assume that the compiler always
generates clean object files.  That's OK I suppose if the compiler and
linker people are in the same organization.  If the linker falls over you
can just go down the hall and have the linker developer debug the issue and
tell you where you went wrong.  But that doesn't work when they work for
different companies and the compiler person doesn't have access to the
linker sources.  I ran into a lot of cases where my buggy object file
caused the linker to seg fault or, even worse, simply exit without an error
message.

I ended up writing a very thorough formatted dumper for each object file
format that did very thorough checking for proper syntax and as many
semantic errors (e.g., symbol table index number out of range) as I could.

-Paul W.

[-- Attachment #2: Type: text/html, Size: 2610 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [TUHS] Re: A fuzzy awk. (Was: The 'usage: ...' message.)
  2024-05-20 13:14 ` [TUHS] " arnold
@ 2024-05-20 14:00   ` G. Branden Robinson
  0 siblings, 0 replies; 20+ messages in thread
From: G. Branden Robinson @ 2024-05-20 14:00 UTC (permalink / raw)
  To: tuhs; +Cc: douglas.mcilroy, groff

[-- Attachment #1: Type: text/plain, Size: 6974 bytes --]

Hi folks,

At 2024-05-20T07:14:07-0600, arnold@skeeve.com wrote:
> Douglas McIlroy <douglas.mcilroy@dartmouth.edu> wrote:
> > I'm surprised by nonchalance about bad inputs evoking bad program
> > behavior.  That attitude may have been excusable 50 years ago. By
> > now, though, we have seen so much malicious exploitation of open
> > avenues of "undefined behavior" that we can no longer ignore bugs
> > that "can't happen when using the tool correctly". Mature software
> > should not brook incorrect usage.
> 
> It's not nonchalance, not at all!
> 
> The current behavior is to die on the first syntax error, instead of
> trying to be "helpful" by continuing to try to parse the program in
> the hope of reporting other errors.
[...]
> The crashes came because errors cascaded.  I don't see a reason to
> spend valuable, *personal* time on adding defenses *where they aren't
> needed*.
> 
> A steel door on your bedroom closet does no good if your front door is
> made of balsa wood. My change was to stop the badness at the front
> door.
> 
> > I commend attention to the LangSec movement, which advocates for
> > rigorously enforced separation between legal and illegal inputs.
> 
> Illegal input, in gawk, as far as I know, should always cause a syntax
> error report and an immediate exit.
> 
> If it doesn't, that is a bug, and I'll be happy to try to fix it.
> 
> I hope that clarifies things.

For grins, and for a data point from elsewhere in GNU-land, GNU troff is
pretty robust to this sort of thing.  Much as I might like to boast of
having improved it in this area, it appears to have already come with
iron long johns courtesy of James Clark and/or Werner Lemberg.  I threw
troff its own ELF executable as a crude fuzz test some years ago, and I
don't recall needing to fix anything except unhelpfully vague diagnostic
messages (a phenomenon I am predisposed to observe anyway).

I did notice today that in one case we were spewing back out unprintable
characters (newlines, character codes > 127) _in_ one (but only one) of
the diagnostic messages, and while that's ugly, it's not an obvious
exploitation vector to me.

Nevertheless I decided to fix it and it will be in my next push.

So here's the mess you get when feeding GNU troff to itself.  No GNU
troff since before 1.22.3 core dumps on this sort of unprepossessing
input.

$ ./build/test-groff -Ww -z /usr/bin/troff 2>&1 | sed 's/:[0-9]\+:/:/' | sort | uniq -c
     17 troff:/usr/bin/troff: error: a backspace character is not allowed in an escape sequence parameter
     10 troff:/usr/bin/troff: error: a space character is not allowed in an escape sequence parameter
      1 troff:/usr/bin/troff: error: a space is not allowed as a starting delimiter
      1 troff:/usr/bin/troff: error: a special character is not allowed in an identifier
      1 troff:/usr/bin/troff: error: character '-' is not allowed as a starting delimiter
      1 troff:/usr/bin/troff: error: invalid argument ')' to output suppression escape sequence
      1 troff:/usr/bin/troff: error: invalid argument 'c' to output suppression escape sequence
      1 troff:/usr/bin/troff: error: invalid argument 'l' to output suppression escape sequence
      1 troff:/usr/bin/troff: error: invalid argument 'm' to output suppression escape sequence
      1 troff:/usr/bin/troff: error: invalid positional argument number ','
      3 troff:/usr/bin/troff: error: invalid positional argument number '<'
      3 troff:/usr/bin/troff: error: invalid positional argument number 'D'
      1 troff:/usr/bin/troff: error: invalid positional argument number 'E'
     10 troff:/usr/bin/troff: error: invalid positional argument number 'H'
      1 troff:/usr/bin/troff: error: invalid positional argument number 'Hi'
      1 troff:/usr/bin/troff: error: invalid positional argument number 'I'
      1 troff:/usr/bin/troff: error: invalid positional argument number 'I9'
      1 troff:/usr/bin/troff: error: invalid positional argument number 'L'
      1 troff:/usr/bin/troff: error: invalid positional argument number 'LD'
      2 troff:/usr/bin/troff: error: invalid positional argument number 'LL'
      5 troff:/usr/bin/troff: error: invalid positional argument number 'LT'
      1 troff:/usr/bin/troff: error: invalid positional argument number 'M'
      4 troff:/usr/bin/troff: error: invalid positional argument number 'P'
      5 troff:/usr/bin/troff: error: invalid positional argument number 'X'
      1 troff:/usr/bin/troff: error: invalid positional argument number 'dH'
      1 troff:/usr/bin/troff: error: invalid positional argument number 'h'
      1 troff:/usr/bin/troff: error: invalid positional argument number 'l'
      1 troff:/usr/bin/troff: error: invalid positional argument number 'p'
      1 troff:/usr/bin/troff: error: invalid positional argument number 'x'
      3 troff:/usr/bin/troff: error: invalid positional argument number '|'
     35 troff:/usr/bin/troff: error: invalid positional argument number (unprintable)
      3 troff:/usr/bin/troff: error: unterminated transparent embedding escape sequence

The second to last (and most frequent) message in the list above is the
"new" one.  Here's the diff.

diff --git a/src/roff/troff/input.cpp b/src/roff/troff/input.cpp
index 8d828a01e..596ecf6f9 100644
--- a/src/roff/troff/input.cpp
+++ b/src/roff/troff/input.cpp
@@ -4556,10 +4556,21 @@ static void interpolate_arg(symbol nm)
   }
   else {
     const char *p;
-    for (p = s; *p && csdigit(*p); p++)
-      ;
-    if (*p)
-      copy_mode_error("invalid positional argument number '%1'", s);
+    bool is_valid = true;
+    bool is_printable = true;
+    for (p = s; *p != 0 /* nullptr */; p++) {
+      if (!csdigit(*p))
+       is_valid = false;
+      if (!csprint(*p))
+       is_printable = false;
+    }
+    if (!is_valid) {
+      const char msg[] = "invalid positional argument number";
+      if (is_printable)
+       copy_mode_error("%1 '%2'", msg, s);
+      else
+       copy_mode_error("%1 (unprintable)", msg);
+    }
     else
       input_stack::push(input_stack::get_arg(atoi(s)));
   }

GNU troff may have started out with an easier task in this area than an
AWK or a shell had; its syntax is not block-structured in the same way,
so parser state recovery is easier, and it's _inherently_ a filter.

The only fruitful fuzz attack on groff I can recall was upon indexed
bibliographic database files, which are a binary format.  This went
unresolved for several years[1] but I fixed it for groff 1.23.0.

https://bugs.debian.org/716109

Regards,
Branden

[1] I think I understand the low triage priority.  Few groff users use
    the refer(1) preprocessor, and of those who do, even fewer find
    modern systems so poorly performant at text scanning that they
    desire the services of indxbib(1) to speed lookup of bibliographic
    entries.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [TUHS] Re: A fuzzy awk. (Was: The 'usage: ...' message.)
  2024-05-20 13:06 [TUHS] A fuzzy awk. (Was: The 'usage: ...' message.) Douglas McIlroy
  2024-05-20 13:14 ` [TUHS] " arnold
@ 2024-05-20 13:25 ` Chet Ramey
  2024-05-20 16:06 ` Paul Winalski
  2 siblings, 0 replies; 20+ messages in thread
From: Chet Ramey @ 2024-05-20 13:25 UTC (permalink / raw)
  To: Douglas McIlroy, TUHS main list


[-- Attachment #1.1: Type: text/plain, Size: 465 bytes --]

On 5/20/24 9:06 AM, Douglas McIlroy wrote:
> I'm surprised by nonchalance about bad inputs evoking bad program behavior.

I think the claim is that it's better to stop immediately with an error
on invalid input rather than guess at the user's intent and try to go on.


-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
		 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    chet@case.edu    http://tiswww.cwru.edu/~chet/


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 203 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [TUHS] Re: A fuzzy awk. (Was: The 'usage: ...' message.)
  2024-05-20 13:06 [TUHS] A fuzzy awk. (Was: The 'usage: ...' message.) Douglas McIlroy
@ 2024-05-20 13:14 ` arnold
  2024-05-20 14:00   ` G. Branden Robinson
  2024-05-20 13:25 ` Chet Ramey
  2024-05-20 16:06 ` Paul Winalski
  2 siblings, 1 reply; 20+ messages in thread
From: arnold @ 2024-05-20 13:14 UTC (permalink / raw)
  To: tuhs, douglas.mcilroy

Perhaps I should not respond to this immediately. But:

Douglas McIlroy <douglas.mcilroy@dartmouth.edu> wrote:

> I'm surprised by nonchalance about bad inputs evoking bad program behavior.
> That attitude may have been excusable 50 years ago. By now, though, we have
> seen so much malicious exploitation of open avenues of "undefined behavior"
> that we can no longer ignore bugs that "can't happen when using the tool
> correctly". Mature software should not brook incorrect usage.

It's not nonchalance, not at all!

The current behavior is to die on the first syntax error, instead of
trying to be "helpful" by continuing to try to parse the program in the
hope of reporting other errors.

> "Bailing out near line 1" is a sign of defensive precautions. Crashes and
> unjustified output betray their absence.

The crashes came because errors cascaded.  I don't see a reason to spend
valuable, *personal* time on adding defenses *where they aren't needed*.

A steel door on your bedroom closet does no good if your front door
is made of balsa wood. My change was to stop the badness at the
front door.

> I commend attention to the LangSec movement, which advocates for rigorously
> enforced separation between legal and illegal inputs.

Illegal input, in gawk, as far as I know, should always cause a syntax
error report and an immediate exit.

If it doesn't, that is a bug, and I'll be happy to try to fix it.

I hope that clarifies things.

Arnold

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2024-05-21  1:15 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-05-19 23:08 [TUHS] The 'usage: ...' message. (Was: On Bloat...) Douglas McIlroy
2024-05-20  0:58 ` [TUHS] " Rob Pike
2024-05-20  3:19   ` arnold
2024-05-20  3:43     ` Warner Losh
2024-05-20  4:46       ` arnold
2024-05-20  9:20     ` [TUHS] A fuzzy awk. (Was: The 'usage: ...' message.) Ralph Corderoy
2024-05-20 11:58       ` [TUHS] " arnold
2024-05-20 13:10       ` Chet Ramey
2024-05-20 13:30         ` [TUHS] Re: A fuzzy awk Ralph Corderoy
2024-05-20 13:48           ` Chet Ramey
2024-05-20  3:54   ` [TUHS] Re: The 'usage: ...' message. (Was: On Bloat...) Bakul Shah via TUHS
2024-05-20 14:23   ` Clem Cole
2024-05-20 17:30     ` Greg A. Woods
2024-05-20 20:10     ` John Levine
2024-05-21  1:14       ` John Cowan
2024-05-20 17:40   ` Stuff Received
2024-05-20 13:06 [TUHS] A fuzzy awk. (Was: The 'usage: ...' message.) Douglas McIlroy
2024-05-20 13:14 ` [TUHS] " arnold
2024-05-20 14:00   ` G. Branden Robinson
2024-05-20 13:25 ` Chet Ramey
2024-05-20 16:06 ` Paul Winalski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).