The Unix Heritage Society mailing list
 help / color / mirror / Atom feed
* [TUHS] A fuzzy awk. (Was: The 'usage: ...' message.)
@ 2024-05-20 13:06 Douglas McIlroy
  2024-05-20 13:14 ` [TUHS] " arnold
                   ` (3 more replies)
  0 siblings, 4 replies; 47+ messages in thread
From: Douglas McIlroy @ 2024-05-20 13:06 UTC (permalink / raw)
  To: TUHS main list

[-- Attachment #1: Type: text/plain, Size: 616 bytes --]

I'm surprised by nonchalance about bad inputs evoking bad program behavior.
That attitude may have been excusable 50 years ago. By now, though, we have
seen so much malicious exploitation of open avenues of "undefined behavior"
that we can no longer ignore bugs that "can't happen when using the tool
correctly". Mature software should not brook incorrect usage.

"Bailing out near line 1" is a sign of defensive precautions. Crashes and
unjustified output betray their absence.

I commend attention to the LangSec movement, which advocates for rigorously
enforced separation between legal and illegal inputs.

Doug

[-- Attachment #2: Type: text/html, Size: 752 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread
* [TUHS] Re: A fuzzy awk
@ 2024-05-23 13:49 Douglas McIlroy
  2024-05-23 20:52 ` Rob Pike
  0 siblings, 1 reply; 47+ messages in thread
From: Douglas McIlroy @ 2024-05-23 13:49 UTC (permalink / raw)
  To: TUHS main list

[-- Attachment #1: Type: text/plain, Size: 971 bytes --]

> Doug McIlroy was generating random regular expressions

Actually not. I exhaustively (within limits) tested an RE recognizer
without knowingly generating any RE either mechanically or by hand.

The trick: From recursive equations (easily derived from the grammar of
REs), I counted how many REs exist up to various limits on token counts,
Then I generated all strings that satisfied those limits, turned the
recognizer loose on them and counted how many it accepted. Any disagreement
of counts revealed the existence (but not any symptom) of bugs.

Unlike most diagnostic techniques, this scheme produces a certificate of
(very high odds on) correctness over a representative subdomain. The scheme
also agnostically checks behavior on bad inputs as well as good.  It does
not, however, provide a stress test of a recognizer's capacity limits. And
its exponential nature limits its applicability to rather small domains.
(REs have only 5 distinct kinds of token.)

Doug

[-- Attachment #2: Type: text/html, Size: 1698 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread
* [TUHS] Re: A fuzzy awk.
@ 2024-05-20 14:09 Serissa
  2024-05-21  1:56 ` Rob Pike
  0 siblings, 1 reply; 47+ messages in thread
From: Serissa @ 2024-05-20 14:09 UTC (permalink / raw)
  To: TUHS main list

[-- Attachment #1: Type: text/plain, Size: 1571 bytes --]

Well this is obviously a hot button topic.  AFAIK I was nearby when fuzz-testing for software was invented. I was the main advocate for hiring Andy Payne into the Digital Cambridge Research Lab.  One of his little projects was a thing that generated random but correct C programs and fed them to different compilers or compilers with different switches to see if they crashed or generated incorrect results.  Overnight, his tester filed 300 or so bug reports against the Digital C compiler.  This was met with substantial pushback, but it was a mostly an issue that many of the reports traced to the same underlying bugs.

Bill McKeemon expanded the technique and published "Differential Testing of Software" https://www.cs.swarthmore.edu/~bylvisa1/cs97/f13/Papers/DifferentialTestingForSoftware.pdf

Andy had encountered the underlying idea while working as an intern on the Alpha processor development team.  Among many other testers, they used an architectural tester called REX to generate more or less random sequences of instructions, which were then run through different simulation chains (functional, RTL, cycle-accurate) to see if they did the same thing.  Finding user-accessible bugs in hardware seems like a good thing.

The point of generating correct programs (mentioned under the term LangSec here) goes a long way to avoid irritating the maintainers.  Making the test cases short is also maintainer-friendly.  The test generator is also in a position to annotate the source with exactly what it is supposed to do, which is also helpful.

-L



[-- Attachment #2: Type: text/html, Size: 1946 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread
* [TUHS] The 'usage: ...' message. (Was: On Bloat...)
@ 2024-05-19 23:08 Douglas McIlroy
  2024-05-20  0:58 ` [TUHS] " Rob Pike
  0 siblings, 1 reply; 47+ messages in thread
From: Douglas McIlroy @ 2024-05-19 23:08 UTC (permalink / raw)
  To: TUHS main list

[-- Attachment #1: Type: text/plain, Size: 2004 bytes --]

>> Another non-descriptive style of error message that I admired was that
>> of Berkeley Pascal's syntax diagnostics. When the LR parser could not
>> proceed, it reported where, and automatically provided a sample token
>> that would allow the parsing to progress. I found this uniform
>> convention to be at least as informative as distinct hand-crafted
>> messages, which almost by definition can't foresee every contingency.
>> Alas, this elegant scheme seems not to have inspired imitators.

> The hazard with this approach is that the suggested syntactic correction
> might simply lead the user farther into the weeds

I don't think there's enough experience to justify this claim. Before I
experienced the Berkeley compiler, I would have thought such bad outcomes
were inevitable in any language. Although the compilers' suggestions often
bore little or no relationship to the real correction,  I always found them
informative. In particular, the utterly consistent style assured there was
never an issue of ambiguity or of technical jargon.

The compiler taught me Pascal in an evening. I had scanned the Pascal
Report a couple of years before but had never written a Pascal program.
With no manual at hand, I looked at one program to find out what
mumbo-jumbo had to come first and how to print integers, then wrote the
rest by trial and error. Within a couple of hours  I had a working program
good enough to pass muster in an ACM journal.

An example arose that one might think would lead "into the weeds". The
parser balked before 'or' in a compound Boolean expression like  'a=b and
c=d or x=y'. It couldn't suggest a right paren because no left paren had
been seen. Whatever suggestion it did make (perhaps 'then') was enough to
lead me to insert a remote left paren and teach me that parens are required
around Boolean-valued subexpressions. (I will agree that this lesson might
be less clear to a programming novice, but so might be many conventional
diagnostics, e.g. "no effect".)

Doug

[-- Attachment #2: Type: text/html, Size: 3392 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2024-05-25 17:37 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-05-20 13:06 [TUHS] A fuzzy awk. (Was: The 'usage: ...' message.) Douglas McIlroy
2024-05-20 13:14 ` [TUHS] " arnold
2024-05-20 14:00   ` G. Branden Robinson
2024-05-20 13:25 ` Chet Ramey
2024-05-20 13:41   ` [TUHS] Re: A fuzzy awk Ralph Corderoy
2024-05-20 14:26     ` Chet Ramey
2024-05-22 13:44     ` arnold
2024-05-20 13:54 ` Ralph Corderoy
2024-05-20 15:39   ` [TUHS] OT: LangSec (Re: A fuzzy awk.) Åke Nordin
2024-05-20 16:09     ` [TUHS] " Ben Kallus
2024-05-20 20:02       ` John Levine
2024-05-20 20:11         ` Larry McVoy
2024-05-20 21:00           ` Ben Kallus
2024-05-20 21:03             ` John R Levine
2024-05-20 21:14             ` Larry McVoy
2024-05-20 21:46               ` Ben Kallus
2024-05-20 21:57                 ` Larry McVoy
2024-05-20 16:06 ` [TUHS] Re: A fuzzy awk. (Was: The 'usage: ...' message.) Paul Winalski
  -- strict thread matches above, loose matches on Subject: below --
2024-05-23 13:49 [TUHS] Re: A fuzzy awk Douglas McIlroy
2024-05-23 20:52 ` Rob Pike
2024-05-24  5:41   ` andrew
2024-05-24  7:17   ` Ralph Corderoy
2024-05-24  7:41     ` Rob Pike
2024-05-24 11:56     ` Dan Halbert
2024-05-25  0:17   ` Bakul Shah via TUHS
2024-05-25  0:57     ` G. Branden Robinson
2024-05-25 13:56     ` David Arnold
2024-05-25 17:18     ` Paul Winalski
2024-05-25 17:36       ` Tom Perrine
2024-05-20 14:09 Serissa
2024-05-21  1:56 ` Rob Pike
2024-05-21  2:47   ` Larry McVoy
2024-05-21  2:54     ` Lawrence Stewart
2024-05-21  3:36       ` Rob Pike
2024-05-21 11:59         ` Peter Weinberger (温博格) via TUHS
2024-05-21  3:53   ` George Michaelson
2024-05-21 16:59   ` Paul Winalski
2024-05-21 17:56     ` segaloco via TUHS
2024-05-21 18:12     ` Luther Johnson
2024-05-22 15:37       ` Paul Winalski
2024-05-22 18:49         ` Larry McVoy
2024-05-22 20:17           ` Larry McVoy
2024-05-22  3:26     ` Dave Horsfall
2024-05-22  5:08       ` Alexis
2024-05-22 13:12         ` Warner Losh
2024-05-19 23:08 [TUHS] The 'usage: ...' message. (Was: On Bloat...) Douglas McIlroy
2024-05-20  0:58 ` [TUHS] " Rob Pike
2024-05-20  3:19   ` arnold
2024-05-20  9:20     ` [TUHS] A fuzzy awk. (Was: The 'usage: ...' message.) Ralph Corderoy
2024-05-20 13:10       ` [TUHS] " Chet Ramey
2024-05-20 13:30         ` [TUHS] Re: A fuzzy awk Ralph Corderoy
2024-05-20 13:48           ` Chet Ramey

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).