From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID:
To: 9fans@cse.psu.edu
Subject: Re: [9fans] spaces, separators, and utf-8
From: Geoff Collyer
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8bit
Date: Sat, 1 Jun 2002 16:01:16 -0700
Topicbox-Message-UUID: a460030c-eaca-11e9-9e20-41e7f4b1d025

Michael may only be arguing for admitting the space character in file names, but, having witnessed various wheels of reincarnation in the past, I believe that others will go farther once space is admitted.

Yes, tab is a control character, but a tab sometimes appears as only a single space or two and so some people will argue that tab should be admitted too, since it's just another form of whitespace and visually similar to space. And once tab is admitted, some people will wonder why other whitespace should be excluded, and so will lobby for return, newline, form feed and vertical tab.

About this time, somebody will assert that any utf-8 string should be admitted as a file name. Others *may* be able to argue successfully to exclude NUL characters in general and slashes from individual components. And there's the tricky business of the '#' namespace.

Then, seeking ever greater generality, somebody will suggest that any sequence of bytes should be acceptable as a file name. Again, there will be debate about slashes and NULs.

And now we're back to the situation on Unix, where names were indeed fairly unrestrained, though variants experimented with restrictions. Berkeley at one time forbade characters with the high bit set in file names.

Let's try a few exercises to see what the brave new world looks like. I created a file called

    michael's mother's recipes

on Mac OS X. To refer to this file by name from rc, let's see what we'd have to type:

    ; cd /n/imac/tmp/zoo
    ; ls 'michael''s mother''s recipes'

Not impossible, but not something I'd want to type often.

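The quoting rule visible in that transcript (wrap the name in single quotes and double each embedded single quote) is mechanical enough to sketch in a few lines. The Python below is only an illustration under that assumption: rc_quote is a made-up helper name, not part of rc or of any Plan 9 library, and it always quotes rather than deciding which names actually need it.

```python
def rc_quote(name: str) -> str:
    """Quote a string in the style shown above.

    rc_quote is a hypothetical helper for illustration only: it wraps
    the name in single quotes and doubles every embedded single quote.
    """
    return "'" + name.replace("'", "''") + "'"


if __name__ == "__main__":
    # Prints the quoted form of the example file name.
    print(rc_quote("michael's mother's recipes"))
```

A program can apply this rule without complaint; the objection above is about having to type the result by hand.
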
Next I created a file with a similar name, but with spaces replaced by newlines:

    : imac; ls -v1
    michael's
    mother's
    recipes
    michael's mother's recipes

Plain ls prints this:

    : imac; ls -1
    michael's?mother's?recipes
    michael's mother's recipes

I can't manipulate this latest file via u9fs currently:

    ; ls
    ls: .: bad character in file name: 'michael''s mother''s recipes'

du and find on Unix naïvely print the names, which tends to confuse programs that want to process them. This led to ``find -print0'' and a corresponding xargs option to cope with one common case, but there hasn't been any general solution, particularly where the file names are just one column of a program's output.

    : imac; du -a
    0       ./michael's
    mother's
    recipes
    0       ./michael's mother's recipes
    0       .

I suppose one could universally adopt Mike Lesk's solution of using BEL (control-G) or some character in the private-use space as a column delimiter.

I am indeed working on UTF-8 issues (among others) in OS X. The most recent version of Terminal I've tried does better at displaying UTF-8 than 10.1.4's, but there's still some odd interaction with locale files. Unfortunately, OS X has to deal with UTF-8 as just one of several supported encodings, though I believe it's the most common, and we have to support locale files. If we could get agreement on UTF-8 as the standard encoding, with tcs-like transliterations at the edges, and get ANSI, ISO and IEEE to drop the whole idea of locales from their standards, things would eventually get better (as we phased out support for the deprecated locale notion).

[If it isn't obvious why locales don't work, it's for pretty much the same reasons that you want a single large alphabet and encoding (Unicode and UTF-8) rather than a bunch of local encodings (e.g., Big Five). A professor of Japanese studies in Greece, writing in Greek about Japanese, should be able to freely intermix those characters.

Locales pretend to describe a geographic area and its culture, language, and other conventions. But people move and take some of those things with them. So what locale are newly-arrived Koreans living in California in? They aren't in Korea's time zone but they may not yet speak the primary language(s) of California. Locales don't fit multiculturalism (programs need to be prepared to synthesize them on the fly, but then a big catalogue of them isn't very useful), and proliferate if you try to honestly describe the situations of people away from their places of origin. I end up mixing British and American conventions when configuring my machines, since an English Canadian locale doesn't seem to be widely recognised.]

Has anybody figured out how (or if) to cope with Unicode 3? They've broken their promise to stick to 16 bits; UTF-8 can cope with that, if we crank up UTFmax. Is switching to 32-bit runes only a minor performance hit?
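
On the UTFmax point, here is a quick way to see the byte counts involved. This is a sketch that assumes the UTF-8 definition in common use today (code points up to U+10FFFF, at most four bytes each) and leans on Python's built-in encoder; utf8_len is a hypothetical helper name, and the sketch says nothing about the performance cost of wider runes.

```python
def utf8_len(codepoint: int) -> int:
    """Return how many bytes UTF-8 needs for a single code point.

    utf8_len is a hypothetical helper for illustration. It relies on
    Python's own encoder and does not accept surrogate code points
    (U+D800 through U+DFFF), which UTF-8 does not encode.
    """
    return len(chr(codepoint).encode("utf-8"))


if __name__ == "__main__":
    # The boundaries where the encoded length changes.
    for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
        print(f"U+{cp:04X}: {utf8_len(cp)} byte(s)")
```

Everything that fits in 16 bits encodes in at most three bytes, while code points from U+10000 upward need four, so the buffer limit has to grow by one byte per rune to admit them.
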