From mboxrd@z Thu Jan 1 00:00:00 1970 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on inbox.vuxu.org X-Spam-Level: X-Spam-Status: No, score=-0.8 required=5.0 tests=DKIM_ADSP_CUSTOM_MED, DKIM_INVALID,DKIM_SIGNED,FREEMAIL_FROM,HTML_MESSAGE,MAILING_LIST_MULTI autolearn=ham autolearn_force=no version=3.4.4 Received: (qmail 7036 invoked from network); 16 Dec 2022 17:25:35 -0000 Received: from minnie.tuhs.org (50.116.15.146) by inbox.vuxu.org with ESMTPUTF8; 16 Dec 2022 17:25:35 -0000 Received: from minnie.tuhs.org (localhost [IPv6:::1]) by minnie.tuhs.org (Postfix) with ESMTP id 593AF42418; Sat, 17 Dec 2022 03:25:28 +1000 (AEST) Received: from mail-yw1-f172.google.com (mail-yw1-f172.google.com [209.85.128.172]) by minnie.tuhs.org (Postfix) with ESMTPS id C548442414 for ; Sat, 17 Dec 2022 03:25:23 +1000 (AEST) Received: by mail-yw1-f172.google.com with SMTP id 00721157ae682-381662c78a9so42289047b3.7 for ; Fri, 16 Dec 2022 09:25:23 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=bygSsCxq16vb2vzuptizWPHVaFC4ekNPOBkunEBrN+M=; b=Nuo7I77ruDWQTnFD6VfuRPwG5ruSQXJ8O3ZNz4Fgdb/EUP3eP/D0XIRABn0VIx5I66 uSzN+qYKpX/PFmcx9Qf9sXYg17UeILI1VpX1GNX7R2zUGy7Wvc+SZBtHkdHLzWZmoiOo JyzZzAQetBa5HnKmqABtiJukI3TwEDEy0n0csJbX2XtEYUc/Zi3Lc2bIE8Ghv8Yt6SJL /O8Is+7hd1plGPracZcBf5Zc5luqUg/EPwNueCoX2snT1vuEq2c7jcWcjhu0Heb6ZWol VxNqQohiozLpKgtCN2DXU7RlN+5oSzVLBuw8VCaUKXQ+PYwvYOvobDJKW7SD7d45s5DQ udHw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=bygSsCxq16vb2vzuptizWPHVaFC4ekNPOBkunEBrN+M=; b=q7AaDP6LBji7vVqMb+WQTkSQZ2uR/lJjOpmnWG0nXj5nLDl2IKl17ugZooseDaohQE 69xEbVhbWJwGC4WMUOvUzDkOmgubhubf4YrmfRvakE9ZqOWFDVFl84MW8pobIq2T4DXk bGhWUVD+1ZLWs/B1Xqgaro2X4x+OE71PnxtahV/U2aPI0VnR1t/5Fu12t5nHqhKJMmZw L3Vz0cKA8h4sgNIRcnSWQmupcQgoDIDEhB5wuHgNnKt8QyQkEwq3RmvIwytpWdYXWMO3 bDZeMDwOjzQrJNxznBJ90TXU5Va7f+Ke+rh+eOED1Qk0jl2sB+l/mDPogQSTWTf7Kimi M5/A== X-Gm-Message-State: ANoB5pnJpQo302kFDMGEBBGszJQB888o+lyyBxO/HcZqLvQC+dwgAzDY n6EszSfQh9EYAt3+7wm29zhKrtQ25uM0FMwqSaNxy9/8gbI= X-Google-Smtp-Source: AA0mqf5N/7T/0KH5Cat9KXKkP58xCLCojGHJ+ESkrEyRN5REPryNxD7O6Ionej0dYdO7WVuyDJAp1wY/eXEftvhrIvY= X-Received: by 2002:a81:7409:0:b0:3c8:cd0e:87d6 with SMTP id p9-20020a817409000000b003c8cd0e87d6mr2339489ywc.272.1671211462775; Fri, 16 Dec 2022 09:24:22 -0800 (PST) MIME-Version: 1.0 References: In-Reply-To: From: "John P. Linderman" Date: Fri, 16 Dec 2022 12:24:11 -0500 Message-ID: To: Douglas McIlroy Content-Type: multipart/alternative; boundary="00000000000095569605eff53d28" Message-ID-Hash: NAYZAO7CYKF7AEAH7QNM4J4WZHMYCFXC X-Message-ID-Hash: NAYZAO7CYKF7AEAH7QNM4J4WZHMYCFXC X-MailFrom: jpl.jpl@gmail.com X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; header-match-tuhs.tuhs.org-0; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: Alejandro Colomar , TUHS main list X-Mailman-Version: 3.3.6b1 Precedence: list Subject: [TUHS] Re: origin of null-terminated strings List-Id: The Unix Heritage Society mailing list Archived-At: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: --00000000000095569605eff53d28 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Suppose you have two strings of 8-bit bytes, and you'd like to compare them lexicographically (left to right, byte by byte). An oracle tells you the length of the strings, so maybe you have 3 ram 4 ramp You can just do an strncmp on the two strings, using the minimum of the two lengths (3 in the example). If they differ (they didn't), you are done. If the strings are of the same length (they aren't), they are equal. Otherwise, the shorter (a prefix of the longer) compares low. Ho-hum. Suppose each comparand is a sequence of such strings, and you want to break ties on initial components of such sequences using subsequent components (if any). But you have to combine them as a single string and the oracle only tells you the total length. You can't just concatenate them together, or (3 ram, 4 part) =3D> 7 rampart (4 ramp, 3 art) =3D> 7 rampart (7 rampart) =3D> 7 rampart and they all look equal, but they're not supposed to be. The problem is that some components are proper prefixes of the corresponding component. We can sneak past the end of one component and compare bytes from different components, something that cannot be allowed. A collection of components is said to have "the prefix property" if no component is a proper prefix of any other component. If there is some byte that cannot occur in some component, we can tack that byte onto the component, and the components will then have the prefix property. Newline terminated lines have the prefix property, but we cannot just concatenate such components and do an strncmp, because newlines compare low to most bytes, but high to some legitimate bytes, like tabs, so they don't break ties correctly. Null terminators to the rescue! These induce the prefix property, and compare low to everything else. Using blanks to stand in for hard-to-parse null bytes, we get (4 ram , 5 part ) =3D> 9 ram part (5 ramp , 4 art ) =3D> 9 ramp art (8 rampart ) =3D> 8 rampart The strncmp on the results establishes the desired order. The null terminator on the final component is optional. The oracular length determines how much of the component is significant. But you have to be consistent. The final component must always, or never, have the "optional" terminator. Null-terminated strings (which cannot, themselves, contain a null) are not the only components having the prefix property. Fixed length strings cannot, by definition, include proper prefixes, and they are free to contain any byte. UTF-8 code-points have the prefix property (I suspect this was no accident), so the null-terminated concatenation of non-null UTF-8 code-points have the prefix property and break ties appropriately (assuming that the code-points themselves establish the correct order for what is being compared). I doubt that this explains the use of null terminated strings, but for those of use who spend too much time thinking about sorting, they sure work well. On Thu, Dec 15, 2022 at 10:04 PM Douglas McIlroy < douglas.mcilroy@dartmouth.edu> wrote: > I think this cited quote from > https://www.joelonsoftware.com/2001/12/11/ is urban legend. > > Why do C strings [have a terminating NUl]? It=E2=80=99s because the P= DP-7 > microprocessor, on which UNIX and the C programming language were > invented, had an ASCIZ string type. ASCIZ meant =E2=80=9CASCII with a Z (= zero) > at the end.=E2=80=9D > > This assertion seems unlikely since neither C nor the library string > functions existed on the PDP-7. In fact the "terminating character" of > a string in the PDP-7 language B was the pair '*e'. A string was a > sequence of words, packed two characters per word. For odd-length > strings half of the final one-character word was effectively > NUL-padded as described below. > > One might trace null termination to the original (1965) proposal for > ASCII, https://dl.acm.org/doi/10.1145/363831.363839. There the only > role specifically suggested for NUL is to "serve to accomplish time > fill or media fill." With character-addressable hardware (not the > PDP-7), it is only a small step from using NUL as terminal padding to > the convention of null termination in all cases. > > Ken would probably know for sure whether there's any truth in the > attribution to ASCIZ. > > Doug > --00000000000095569605eff53d28 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Suppose you have two strings of 8-bit bytes, and you'd li= ke
to compare them lexicographically (left to right, byte by byte).
A= n oracle tells you the length of the strings, so maybe you have

3 ra= m
4 ramp

You can just do an strncmp on the two strings, using the= minimum
of the two lengths (3 in the example). If they differ (they did= n't),
you are done. If the strings are of the same length (they aren= 't),
they are equal. Otherwise, the shorter (a prefix of the longer)=
compares low. Ho-hum.

Suppose each comparand is a sequence of su= ch strings, and you
want to break ties on initial components of such seq= uences
using subsequent components (if any). But you have to combine
= them as a single string and the oracle only tells you the total length.
= You can't just concatenate them together, or

(3 ram, 4 part) =3D> 7 rampart
= (4 ramp, 3 art) =3D> 7 rampart
(7 rampart) =3D> 7 rampart

a= nd they all look equal, but they're not supposed to be.
The problem = is that some components are proper prefixes
of the corresponding compone= nt. We can sneak past the end
of one component and compare bytes from di= fferent components,
something that cannot be allowed. A collection of co= mponents
is said to have "the prefix property" if no component= is
a proper prefix of any other component. If there is some
byte tha= t cannot occur in some component, we can tack that
byte onto the compone= nt, and the components will then have
the prefix property. Newline termi= nated lines have the
prefix property, but we cannot just concatenate suc= h
components and do an strncmp, because newlines compare low
to most = bytes, but high to some legitimate bytes, like tabs,
so they don't b= reak ties correctly.

= Null terminators to the rescue! These induce the prefix property,
and co= mpare low to everything else. Using blanks to stand in for
hard-to-parse= null bytes, we get

(4 ram , 5 part ) =3D> 9 ram part
(5 ramp = , 4 art ) =3D> 9 ramp art
(8 rampart ) =C2=A0 =C2=A0 =C2=A0=3D> 8 = rampart

The strncmp on the results establishes the desired order.The null terminator on the final component is optional.
The oracular le= ngth determines how much of the component
is significant. But you have t= o be consistent. The final
component must always, or never, have the &qu= ot;optional" terminator.

Null-terminated strings (which cannot, themselves, c= ontain a null)
are not the only components having the prefix property.Fixed length strings cannot, by definition, include proper prefixes,
a= nd they are free to contain any byte. UTF-8 code-points
have the prefix = property (I suspect this was no accident),
so the null-terminated concat= enation of non-null UTF-8
code-points have the prefix property and break= ties appropriately
(assuming that the code-points themselves establish = the
correct order for what is being compared).

I doubt that this = explains the use of null terminated strings,
but for those of use who sp= end too much time thinking about sorting,
they sure work well.
=

= On Thu, Dec 15, 2022 at 10:04 PM Douglas McIlroy <douglas.mcilroy@dartmouth.edu> wrote:
=
I think this cited = quote from
https://www.joelonsoftware.com/2001/12/11/ is urban le= gend.

=C2=A0 =C2=A0 Why do C strings [have a terminating NUl]? It=E2=80=99s becau= se the PDP-7
microprocessor, on which UNIX and the C programming language were
invented, had an ASCIZ string type. ASCIZ meant =E2=80=9CASCII with a Z (ze= ro)
at the end.=E2=80=9D

This assertion seems unlikely since neither C nor the library string
functions existed on the PDP-7. In fact the "terminating character&quo= t; of
a string in the PDP-7 language B was the pair '*e'. A string was a<= br> sequence of words, packed two characters per word. For odd-length
strings half of the final one-character word was effectively
NUL-padded as described below.

One might trace null termination to the original (1965) proposal for
ASCII,=C2=A0 https://dl.acm.org/doi/10.1145/363831.363= 839. There the only
role specifically suggested for NUL is to "serve to accomplish time fill or media fill." With character-addressable hardware (not the
PDP-7), it is only a small step from using NUL as terminal padding to
the convention of null termination in all cases.

Ken would probably know for sure whether there's any=C2=A0 truth in the=
attribution to ASCIZ.

Doug
--00000000000095569605eff53d28--