The Unix Heritage Society mailing list
 help / color / mirror / Atom feed
From: "John P. Linderman" <jpl.jpl@gmail.com>
To: Douglas McIlroy <douglas.mcilroy@dartmouth.edu>
Cc: Alejandro Colomar <alx.manpages@gmail.com>,
	TUHS main list <tuhs@tuhs.org>
Subject: [TUHS] Re: origin of null-terminated strings
Date: Fri, 16 Dec 2022 12:24:11 -0500	[thread overview]
Message-ID: <CAC0cEp8ScnZoWE4LAOdVrBij=+wTmNebo4UPSLZMg1rmHnc9LA@mail.gmail.com> (raw)
In-Reply-To: <CAKH6PiUOhGNH9bGonY9y1o=ChGByiGGVeOOHG3Hubj2qRb2Yjw@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 4313 bytes --]

Suppose you have two strings of 8-bit bytes, and you'd like
to compare them lexicographically (left to right, byte by byte).
An oracle tells you the length of the strings, so maybe you have

3 ram
4 ramp

You can just do an strncmp on the two strings, using the minimum
of the two lengths (3 in the example). If they differ (they didn't),
you are done. If the strings are of the same length (they aren't),
they are equal. Otherwise, the shorter (a prefix of the longer)
compares low. Ho-hum.

Suppose each comparand is a sequence of such strings, and you
want to break ties on initial components of such sequences
using subsequent components (if any). But you have to combine
them as a single string and the oracle only tells you the total length.
You can't just concatenate them together, or

(3 ram, 4 part) => 7 rampart
(4 ramp, 3 art) => 7 rampart
(7 rampart) => 7 rampart

and they all look equal, but they're not supposed to be.
The problem is that some components are proper prefixes
of the corresponding component. We can sneak past the end
of one component and compare bytes from different components,
something that cannot be allowed. A collection of components
is said to have "the prefix property" if no component is
a proper prefix of any other component. If there is some
byte that cannot occur in some component, we can tack that
byte onto the component, and the components will then have
the prefix property. Newline terminated lines have the
prefix property, but we cannot just concatenate such
components and do an strncmp, because newlines compare low
to most bytes, but high to some legitimate bytes, like tabs,
so they don't break ties correctly.

Null terminators to the rescue! These induce the prefix property,
and compare low to everything else. Using blanks to stand in for
hard-to-parse null bytes, we get

(4 ram , 5 part ) => 9 ram part
(5 ramp , 4 art ) => 9 ramp art
(8 rampart )      => 8 rampart

The strncmp on the results establishes the desired order.
The null terminator on the final component is optional.
The oracular length determines how much of the component
is significant. But you have to be consistent. The final
component must always, or never, have the "optional" terminator.

Null-terminated strings (which cannot, themselves, contain a null)
are not the only components having the prefix property.
Fixed length strings cannot, by definition, include proper prefixes,
and they are free to contain any byte. UTF-8 code-points
have the prefix property (I suspect this was no accident),
so the null-terminated concatenation of non-null UTF-8
code-points have the prefix property and break ties appropriately
(assuming that the code-points themselves establish the
correct order for what is being compared).

I doubt that this explains the use of null terminated strings,
but for those of use who spend too much time thinking about sorting,
they sure work well.

On Thu, Dec 15, 2022 at 10:04 PM Douglas McIlroy <
douglas.mcilroy@dartmouth.edu> wrote:

> I think this cited quote from
> https://www.joelonsoftware.com/2001/12/11/ is urban legend.
>
>     Why do C strings [have a terminating NUl]? It’s because the PDP-7
> microprocessor, on which UNIX and the C programming language were
> invented, had an ASCIZ string type. ASCIZ meant “ASCII with a Z (zero)
> at the end.”
>
> This assertion seems unlikely since neither C nor the library string
> functions existed on the PDP-7. In fact the "terminating character" of
> a string in the PDP-7 language B was the pair '*e'. A string was a
> sequence of words, packed two characters per word. For odd-length
> strings half of the final one-character word was effectively
> NUL-padded as described below.
>
> One might trace null termination to the original (1965) proposal for
> ASCII,  https://dl.acm.org/doi/10.1145/363831.363839. There the only
> role specifically suggested for NUL is to "serve to accomplish time
> fill or media fill." With character-addressable hardware (not the
> PDP-7), it is only a small step from using NUL as terminal padding to
> the convention of null termination in all cases.
>
> Ken would probably know for sure whether there's any  truth in the
> attribution to ASCIZ.
>
> Doug
>

[-- Attachment #2: Type: text/html, Size: 5367 bytes --]

  parent reply	other threads:[~2022-12-16 17:25 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-12-16  3:02 [TUHS] " Douglas McIlroy
2022-12-16  3:14 ` [TUHS] " Ken Thompson
2022-12-16  9:13   ` Dr Iain Maoileoin
2022-12-16 13:42     ` Dan Halbert
2022-12-16 16:10       ` Dan Cross
2022-12-16 16:22         ` Tom Lyon
2022-12-16 16:29         ` Jon Steinhart
2022-12-16 20:12     ` Dave Horsfall
2022-12-16 21:02       ` Warner Losh
2022-12-16 21:13         ` Clem Cole
2022-12-16 21:49           ` Clem Cole
2022-12-17  0:26             ` Phil Budne
2022-12-16 21:18         ` Luther Johnson
2022-12-16 21:20         ` Dan Halbert
2022-12-16  3:17 ` Steve Nickolas
2022-12-16 17:24 ` John P. Linderman [this message]
     [not found] ` <6009124d-750d-365e-a424-ec7bb25922b9@gmail.com>
2022-12-16 22:30   ` [TUHS] Terms for string, and similar character constructs (was: origin of null-terminated strings) Alejandro Colomar
2022-12-16 22:51     ` [TUHS] " Dave Horsfall
2022-12-16 22:26 [TUHS] Re: origin of null-terminated strings Douglas McIlroy
2022-12-17  2:03 ` James Frew
2022-12-17  3:42 ` steve jenkin
2022-12-17 17:11 ` Clem Cole
2022-12-17 18:15   ` Tom Lyon
2022-12-17 18:43     ` Clem Cole
2022-12-17 18:46       ` Clem Cole
2022-12-17 19:26     ` Tom Perrine
2022-12-19  4:26     ` Adam Thornton
2022-12-16 23:11 Noel Chiappa

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAC0cEp8ScnZoWE4LAOdVrBij=+wTmNebo4UPSLZMg1rmHnc9LA@mail.gmail.com' \
    --to=jpl.jpl@gmail.com \
    --cc=alx.manpages@gmail.com \
    --cc=douglas.mcilroy@dartmouth.edu \
    --cc=tuhs@tuhs.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).