The Unix Heritage Society mailing list
 help / color / mirror / Atom feed
From: Larry McVoy <lm@mcvoy.com>
To: Paul Winalski <paul.winalski@gmail.com>
Cc: Luther Johnson <luther.johnson@makerlisp.com>, tuhs@tuhs.org
Subject: [TUHS] Re: A fuzzy awk.
Date: Wed, 22 May 2024 11:49:04 -0700	[thread overview]
Message-ID: <20240522184904.GK25728@mcvoy.com> (raw)
In-Reply-To: <CABH=_VThuZ+HGkmdbHAzMQt3Ubh4xEvQZZcnrsGxv1uwRfeRLw@mail.gmail.com>

On Wed, May 22, 2024 at 11:37:39AM -0400, Paul Winalski wrote:
> On Tue, May 21, 2024 at 2:12???PM Luther Johnson <luther.johnson@makerlisp.com>
> wrote:
> 
> > I like this anecdote because it points out the difference between being
> > able to handle and process bizarre conditions, as if they were something
> > that should work, which is maybe not that helpful, vs. detecting them and
> > doing something reasonable, like failiing with a "limit exceeded" message
> >
> That is in fact precisely how the DEC compiler handled the 100 nested
> parentheses condition.
> 
> > . A silent, insidious failure down the line because a limit was exceeded
> > is never good.
> >
> Amen!  One should always do bounds checking when dealing with fixed-size
> aggregate data structures.  One compiler that I worked on got a bug report
> of bad code being generated.  The problem was an illegal optimization that
> never should have triggered but did due to a corrupted data table.  Finding
> the culprit of the corruption took hours.  It finally turned out to be due
> to overflow of an adjacent data table in use elsewhere in the compiler.
> The routine to add another entry to that table didn't check for table
> overflow.

We invented a data structure that gets around this problem nicely.  It's
an array of pointers that starts at [1] instead of [0].  The [0]
entry encodes 2 things:

In the upper bits, the log(2) the size of the array.  So all arrays
have at least [0] and [1].  So 2 pointers is the smallest array and
that was important to us, we wanted it to scale up and scale down.

In the lower bits, we record the number of used entries in the array.
We assumed 32 bit pointers and with those we got ~134 million entries
as our maximum number of entries.

Usage is like

char **space = allocLines(4);	// start with space for 4 entries

space = addLine(space, "I am [1]");
space = addLine(space, "I am [2]");
space = addLine(space, "I am [3]");
space = addLine(space, "I am [4]");	// realloc's to 8 entries

freelines(space, 0);	// second arg is typically 0 or free()

It works GREAT.  We used it all over BitKeeper, for stuff as small as
commit comments to arrays of data structures.  It scales down, scales
up.  Helper functions:

/*
 * liblines - interfaces for autoexpanding data structures
 *
 * s= allocLines(n)
 *      pre allocate space for slightly less than N entries.
 * s = addLine(s, line)
 *      add line to s, allocating as needed.
 *      line must be a pointer to preallocated space.
 * freeLines(s, freep)
 *      free the lines array; if freep is set, call that on each entry.
 *      if freep is 0, do not free each entry.
 * buf = popLine(s)
 *      return the most recently added line (not an alloced copy of it)
 * reverseLines(s)
 *      reverse the order of the lines in the array
 * sortLines(space, compar)
 *      sort the lines using the compar function if set, else string_sort()
 * removeLine(s, which, freep)
 *      look for all lines which match "which" and remove them from the array
 *      returns number of matches found
 * removeLineN(s, i, freep)
 *      remove the 'i'th line.
 * lines = splitLine(buf, delim, lines)
 *      split buf on any/all chars in delim and put the tokens in lines.
 * buf = joinLines(":", s)
 *      return one string which is all the strings glued together with ":"
 *      does not free s, caller must free s.
 * buf = findLine(lines, needle);
 *      Return the index the line in lines that matches needle
 */

It's all open source, apache licensed, but you'd have to tease it out of
the bitkeeper source tree.  Wouldn't be that hard and it would be useful.

  reply	other threads:[~2024-05-22 18:49 UTC|newest]

Thread overview: 33+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-05-20 14:09 Serissa
2024-05-21  1:56 ` Rob Pike
2024-05-21  2:47   ` Larry McVoy
2024-05-21  2:54     ` Lawrence Stewart
2024-05-21  3:36       ` Rob Pike
2024-05-21 11:59         ` Peter Weinberger (温博格) via TUHS
2024-05-21  3:53   ` George Michaelson
2024-05-21 16:59   ` Paul Winalski
2024-05-21 17:56     ` segaloco via TUHS
2024-05-21 18:12     ` Luther Johnson
2024-05-22 15:37       ` Paul Winalski
2024-05-22 18:49         ` Larry McVoy [this message]
2024-05-22 20:17           ` Larry McVoy
2024-05-22  3:26     ` Dave Horsfall
2024-05-22  5:08       ` Alexis
2024-05-22 13:12         ` Warner Losh
  -- strict thread matches above, loose matches on Subject: below --
2024-05-23 13:49 Douglas McIlroy
2024-05-23 20:52 ` Rob Pike
2024-05-24  5:41   ` andrew
2024-05-24  7:17   ` Ralph Corderoy
2024-05-24  7:41     ` Rob Pike
2024-05-24 11:56     ` Dan Halbert
2024-05-25  0:17   ` Bakul Shah via TUHS
2024-05-25  0:57     ` G. Branden Robinson
2024-05-25 13:56     ` David Arnold
2024-05-25 17:18     ` Paul Winalski
2024-05-25 17:36       ` Tom Perrine
2024-05-20 13:06 [TUHS] A fuzzy awk. (Was: The 'usage: ...' message.) Douglas McIlroy
2024-05-20 13:25 ` [TUHS] " Chet Ramey
2024-05-20 13:41   ` [TUHS] Re: A fuzzy awk Ralph Corderoy
2024-05-20 14:26     ` Chet Ramey
2024-05-22 13:44     ` arnold
2024-05-20 13:54 ` Ralph Corderoy
2024-05-19 23:08 [TUHS] The 'usage: ...' message. (Was: On Bloat...) Douglas McIlroy
2024-05-20  0:58 ` [TUHS] " Rob Pike
2024-05-20  3:19   ` arnold
2024-05-20  9:20     ` [TUHS] A fuzzy awk. (Was: The 'usage: ...' message.) Ralph Corderoy
2024-05-20 13:10       ` [TUHS] " Chet Ramey
2024-05-20 13:30         ` [TUHS] Re: A fuzzy awk Ralph Corderoy
2024-05-20 13:48           ` Chet Ramey

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240522184904.GK25728@mcvoy.com \
    --to=lm@mcvoy.com \
    --cc=luther.johnson@makerlisp.com \
    --cc=paul.winalski@gmail.com \
    --cc=tuhs@tuhs.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).