Suppose you have two strings of 8-bit bytes, and you'd like to compare them lexicographically (left to right, byte by byte). An oracle tells you the length of the strings, so maybe you have 3 ram 4 ramp You can just do an strncmp on the two strings, using the minimum of the two lengths (3 in the example). If they differ (they didn't), you are done. If the strings are of the same length (they aren't), they are equal. Otherwise, the shorter (a prefix of the longer) compares low. Ho-hum. Suppose each comparand is a sequence of such strings, and you want to break ties on initial components of such sequences using subsequent components (if any). But you have to combine them as a single string and the oracle only tells you the total length. You can't just concatenate them together, or (3 ram, 4 part) => 7 rampart (4 ramp, 3 art) => 7 rampart (7 rampart) => 7 rampart and they all look equal, but they're not supposed to be. The problem is that some components are proper prefixes of the corresponding component. We can sneak past the end of one component and compare bytes from different components, something that cannot be allowed. A collection of components is said to have "the prefix property" if no component is a proper prefix of any other component. If there is some byte that cannot occur in some component, we can tack that byte onto the component, and the components will then have the prefix property. Newline terminated lines have the prefix property, but we cannot just concatenate such components and do an strncmp, because newlines compare low to most bytes, but high to some legitimate bytes, like tabs, so they don't break ties correctly. Null terminators to the rescue! These induce the prefix property, and compare low to everything else. Using blanks to stand in for hard-to-parse null bytes, we get (4 ram , 5 part ) => 9 ram part (5 ramp , 4 art ) => 9 ramp art (8 rampart ) => 8 rampart The strncmp on the results establishes the desired order. The null terminator on the final component is optional. The oracular length determines how much of the component is significant. But you have to be consistent. The final component must always, or never, have the "optional" terminator. Null-terminated strings (which cannot, themselves, contain a null) are not the only components having the prefix property. Fixed length strings cannot, by definition, include proper prefixes, and they are free to contain any byte. UTF-8 code-points have the prefix property (I suspect this was no accident), so the null-terminated concatenation of non-null UTF-8 code-points have the prefix property and break ties appropriately (assuming that the code-points themselves establish the correct order for what is being compared). I doubt that this explains the use of null terminated strings, but for those of use who spend too much time thinking about sorting, they sure work well. On Thu, Dec 15, 2022 at 10:04 PM Douglas McIlroy < douglas.mcilroy@dartmouth.edu> wrote: > I think this cited quote from > https://www.joelonsoftware.com/2001/12/11/ is urban legend. > > Why do C strings [have a terminating NUl]? It’s because the PDP-7 > microprocessor, on which UNIX and the C programming language were > invented, had an ASCIZ string type. ASCIZ meant “ASCII with a Z (zero) > at the end.” > > This assertion seems unlikely since neither C nor the library string > functions existed on the PDP-7. In fact the "terminating character" of > a string in the PDP-7 language B was the pair '*e'. A string was a > sequence of words, packed two characters per word. For odd-length > strings half of the final one-character word was effectively > NUL-padded as described below. > > One might trace null termination to the original (1965) proposal for > ASCII, https://dl.acm.org/doi/10.1145/363831.363839. There the only > role specifically suggested for NUL is to "serve to accomplish time > fill or media fill." With character-addressable hardware (not the > PDP-7), it is only a small step from using NUL as terminal padding to > the convention of null termination in all cases. > > Ken would probably know for sure whether there's any truth in the > attribution to ASCIZ. > > Doug >