9fans - fans of the OS Plan 9 from Bell Labs
From: Geoff Collyer <geoff@collyer.net>
To: 9fans@cse.psu.edu
Subject: Re: [9fans] spaces, separators, and utf-8
Date: Sat,  1 Jun 2002 16:01:16 -0700
Message-ID: <a1be82d961ea80f8143d35f570a84ab0@collyer.net>

Michael may only be arguing for admitting the space character in file
names, but I believe that others will go farther once space is
admitted, having witnessed various wheels of reincarnation in the
past.  Yes, tab is a control character, but a tab sometimes appears as
only a single space or two and so some people will argue that tab
should be admitted too, since it's just another form of whitespace and
visually similar to space.  And once tab is admitted, some people will
wonder why other whitespace should be excluded, and so will lobby for
return, newline, form feed and vertical tab.  About this time,
somebody will assert that any utf-8 string should be admitted as a
file name.  Others *may* be able to argue successfully to exclude NUL
characters in general and slashes from individual components.  And
there's the tricky business of the '#' namespace.  Then, seeking ever
greater generality, somebody will suggest that any sequence of bytes
should be acceptable as a file name.  Again, there will be debate
about slashes and NULs.  And now we're back to the situation on Unix,
where names were indeed fairly unrestrained, though variants
experimented with restrictions.  Berkeley at one time forbade
characters with the high bit set in file names.

Let's try a few exercises to see what the brave new world looks like.
I created a file called

	michael's mother's recipes

on Mac OS X. To refer to this file by name from rc, let's see what
we'd have to type:

	; cd /n/imac/tmp/zoo
	; ls
	'michael''s mother''s recipes'

Not impossible, but not something I'd want to type often.  Next I
created a file with a similar name, but with spaces replaced by
newlines:

	: imac; ls -v1
	michael's
	mother's
	recipes
	michael's mother's recipes

Plain ls prints this:

	: imac; ls -1
	michael's?mother's?recipes
	michael's mother's recipes

I can't manipulate this latest file via u9fs currently:

	; ls
	ls: .: bad character in file name: 'michael''s
	mother''s
	recipes'
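
The quoting in these listings follows rc's rule for quoted words:
wrap the word in single quotes and write each embedded single quote
twice.  A minimal sketch in Python (the helper name rc_quote is made
up for illustration and is not part of rc or u9fs):

```python
def rc_quote(name: str) -> str:
    """Quote a file name as an rc word.

    rc treats everything between single quotes literally; a single
    quote inside the word is written as two single quotes.
    """
    return "'" + name.replace("'", "''") + "'"


if __name__ == "__main__":
    # The two names from the listings: one with spaces, one with newlines.
    print(rc_quote("michael's mother's recipes"))
    print(rc_quote("michael's\nmother's\nrecipes"))
```

Newlines need no escaping inside rc quotes, which is why the second
name spans three lines in the u9fs error message.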

du and find on Unix naïvely print the names, which tends to confuse
programs that want to process the names, thus leading to ``find
-print0'' and a corresponding xargs option to cope with one common
case, but there hasn't been any general solution, particularly where
the file names are just one column of a program's output.

	: imac; du -a
	0	./michael's
	mother's
	recipes
	0	./michael's mother's recipes
	0	.
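
To make the confusion concrete, here is a small sketch, assuming a
consumer that reads a list of names from a pipe.  The names are the
ones from the du listing above; split_records is a hypothetical
helper.  Splitting on newline miscounts the entries, while a NUL
terminator (the convention behind find -print0) recovers them, since
NUL cannot occur in a Unix file name.

```python
# The three entries du reported, one of which contains newlines.
names = [
    "./michael's\nmother's\nrecipes",
    "./michael's mother's recipes",
    ".",
]


def split_records(stream: str, terminator: str) -> list:
    """Split a stream of terminator-ended records back into a list."""
    # The stream ends with a terminator, so the final piece is empty.
    return stream.split(terminator)[:-1]


# What a producer would emit with each choice of terminator.
newline_stream = "".join(n + "\n" for n in names)
nul_stream = "".join(n + "\0" for n in names)

if __name__ == "__main__":
    print(len(split_records(newline_stream, "\n")))  # 5 pieces for 3 files
    print(len(split_records(nul_stream, "\0")))      # 3 pieces for 3 files
```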

I suppose one could universally adopt Mike Lesk's solution of using
BEL (control-G, \a) or some character in the private-use space as a
column delimiter.
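
A BEL-delimited row might look like the following sketch.  The
function names are invented here, and refusing fields that contain
BEL is this sketch's assumption rather than anything Lesk's scheme is
known to specify.

```python
BEL = "\a"  # control-G, used here as the column delimiter


def format_row(fields):
    """Join the fields of one row with BEL between columns."""
    for field in fields:
        if BEL in field:
            # Assumption of this sketch: the delimiter may not occur in data.
            raise ValueError("field contains the BEL delimiter")
    return BEL.join(fields)


def parse_row(row):
    """Split one BEL-delimited row back into its fields."""
    return row.split(BEL)


if __name__ == "__main__":
    # A du-style row: size, then a name with embedded newlines.
    row = format_row(["0", "./michael's\nmother's\nrecipes"])
    print(parse_row(row))
```

This only settles column boundaries within one row; rows themselves
would still need a terminator that cannot appear in a name.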


I am indeed working on UTF-8 issues (among others) in OS X. The most
recent version of Terminal I've tried does better at displaying UTF-8
than 10.1.4's but there's still some odd interaction with locale
files.  Unfortunately, OS X has to deal with UTF-8 as just one of
several supported encodings, though I believe it's the most common,
and we have to support locale files.  If we could get agreement on
UTF-8 as the standard encoding, with tcs-like transliterations at the
edges, and get ANSI, ISO and IEEE to drop the whole idea of locales
from their standards, things would eventually get better (as we phased
out support for the deprecated locale notion).

[If it isn't obvious why locales don't work, it's for pretty much the
same reasons that you want a single large alphabet and encoding
(Unicode and UTF-8) rather than a bunch of local encodings (e.g., Big
Five).  A professor of Japanese studies in Greece, writing in Greek
about Japanese, should be able to freely intermix Greek and Japanese
characters.  Locales pretend to describe a geographic area and its
culture, language, and other conventions.  But people move and take
some of those things with them.  So which locale applies to
newly-arrived Koreans living in California?  They aren't in Korea's
time zone but they may not yet speak the primary language(s) of
California.  Locales
don't fit multiculturalism (programs need to be prepared to synthesize
them on the fly, but then a big catalogue of them isn't very useful),
and proliferate if you try to honestly describe the situations of
people away from their places of origin.  I end up mixing British and
American conventions when configuring my machines, since an English
Canadian locale doesn't seem to be widely recognised.]

Has anybody figured out how (or if) to cope with Unicode 3?  They've
broken their promise to stick to 16 bits, which UTF-8 can cope with,
if we crank up UTFmax.  Is switching to 32-bit runes only a minor
performance hit?
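
For a sense of what cranking up UTFmax means, this sketch prints the
UTF-8 length of a few code points using Python's built-in encoder.
The sample code points are chosen here for illustration.  Everything
up to U+FFFF fits in three bytes and code points beyond 16 bits need
four, so a UTFmax sized for 16-bit runes would have to grow from 3
to 4.

```python
def utf8_len(code_point: int) -> int:
    """Return the number of bytes UTF-8 uses for one code point."""
    return len(chr(code_point).encode("utf-8"))


if __name__ == "__main__":
    # ASCII, Greek, a CJK ideograph, the last 16-bit code point,
    # the first code point past 16 bits, and the top of the range.
    for cp in (0x41, 0x3B1, 0x65E5, 0xFFFF, 0x10000, 0x10FFFF):
        print(f"U+{cp:04X}: {utf8_len(cp)} bytes")
```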


