9fans - fans of the OS Plan 9 from Bell Labs
* Re: [9fans] spaces, separators, and utf-8
@ 2002-06-01 23:01 Geoff Collyer
  2002-06-05 10:03 ` Douglas A. Gwyn
  0 siblings, 1 reply; 5+ messages in thread
From: Geoff Collyer @ 2002-06-01 23:01 UTC (permalink / raw)
  To: 9fans

Michael may only be arguing for admitting the space character in file
names, but I believe that others will go farther once space is
admitted, having witnessed various wheels of reincarnation in the
past.  Yes, tab is a control character, but a tab sometimes appears as
only a single space or two and so some people will argue that tab
should be admitted too, since it's just another form of whitespace and
visually similar to space.  And once tab is admitted, some people will
wonder why other whitespace should be excluded, and so will lobby for
return, newline, form feed and vertical tab.  About this time,
somebody will assert that any utf-8 string should be admitted as a
file name.  Others *may* be able to argue successfully to exclude NUL
characters in general and slashes from individual components.  And
there's the tricky business of the '#' namespace.  Then, seeking ever
greater generality, somebody will suggest that any sequence of bytes
should be acceptable as a file name.  Again, there will be debate
about slashes and NULs.  And now we're back to the situation on Unix,
where names were indeed fairly unrestrained, though variants
experimented with restrictions.  Berkeley at one time forbade
characters with the high bit set in file names.

Let's try a few exercises to see what the brave new world looks like.
I created a file called

	michael's mother's recipes

on Mac OS X. To refer to this file by name from rc, let's see what
we'd have to type:

	; cd /n/imac/tmp/zoo
	; ls
	'michael''s mother''s recipes'

Not impossible, but not something I'd want to type often.  Next I
created a file with a similar name, but with spaces replaced by
newlines:

	: imac; ls -v1
	michael's
	mother's
	recipes
	michael's mother's recipes

Plain ls prints this:

	: imac; ls -1
	michael's?mother's?recipes
	michael's mother's recipes

I can't manipulate this latest file via u9fs currently:

	; ls
	ls: .: bad character in file name: 'michael''s
	mother''s
	recipes'

du and find on Unix naïvely print the names, which tends to confuse
programs that want to process the names, thus leading to ``find
-print0'' and a corresponding xargs option to cope with one common
case, but there hasn't been any general solution, particularly where
the file names are just one column of a program's output.

	: imac; du -a
	0	./michael's
	mother's
	recipes
	0	./michael's mother's recipes
	0	.

I suppose one could universally adopt Mike Lesk's solution of using
BEL (control-G, \a) or some character in the private-use space as a
column delimiter.
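
A minimal Python sketch of the delimiter problem, not part of the original message. The listing is made up to mirror the two files above. It compares newline-terminated output with NUL-terminated output (in the style of find -print0) and with du-like records that use BEL between columns.

```python
# Hypothetical listing mirroring the two files above: one name contains
# newlines, the other contains spaces.
names = ["michael's\nmother's\nrecipes", "michael's mother's recipes"]

# Newline-terminated output, as plain find or du would print it.
newline_stream = "".join(name + "\n" for name in names)
newline_split = newline_stream.splitlines()

# NUL-terminated output, in the style of find -print0.
nul_stream = "".join(name + "\0" for name in names)
nul_split = nul_stream.split("\0")[:-1]  # drop the empty piece after the last NUL

# du-like records: BEL (\a) between columns, NUL after each record.
records = "".join("0\a./" + name + "\0" for name in names)
parsed = [tuple(rec.split("\a", 1)) for rec in records.split("\0")[:-1]]

print(len(newline_split), "entries when splitting on newlines")
print(len(nul_split), "entries when splitting on NUL")
print(parsed)
```

Splitting on newlines reports more entries than there are files, while the NUL and BEL variants recover both names intact. This only works as long as neither delimiter is itself admitted in file names.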


I am indeed working on UTF-8 issues (among others) in OS X. The most
recent version of Terminal I've tried does better at displaying UTF-8
than 10.1.4's but there's still some odd interaction with locale
files.  Unfortunately, OS X has to deal with UTF-8 as just one of
several supported encodings, though I believe it's the most common,
and we have to support locale files.  If we could get agreement on
UTF-8 as the standard encoding, with tcs-like transliterations at the
edges, and get ANSI, ISO and IEEE to drop the whole idea of locales
from their standards, things would eventually get better (as we phased
out support for the deprecated locale notion).

[If it isn't obvious why locales don't work, it's for pretty much the
same reasons that you want a single large alphabet and encoding
(Unicode and UTF-8) rather than a bunch of local encodings (e.g., Big
Five).  A professor of Japanese studies in Greece, writing in Greek
about Japanese should be able to freely intermix those characters.
locales pretend to describe a geographic area and its culture,
language, and other conventions.  But people move and take some of
those things with them.  So what locale are newly-arrived Koreans
living in California in?  They aren't in Korea's time zone but they
may not yet speak the primary language(s) of California.  Locales
don't fit multiculturalism (programs need to be prepared to synthesize
them on the fly, but then a big catalogue of them isn't very useful),
and proliferate if you try to honestly describe the situations of
people away from their places of origin.  I end up mixing British and
American conventions when configuring my machines, since an English
Canadian locale doesn't seem to be widely recognised.]

Has anybody figured out how (or if) to cope with Unicode 3?  They've
broken their promise to stick to 16 bits, which UTF-8 can cope with,
if we crank up UTFmax.  Is switching to 32-bit runes only a minor
performance hit?
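
As a rough illustration of why UTFmax has to grow, here is a Python sketch of how many bytes UTF-8 needs per code point. It assumes today's four-byte limit from RFC 3629, which postdates this thread. The function name utf8_len is hypothetical.

```python
def utf8_len(codepoint):
    """Number of bytes UTF-8 needs for one code point (RFC 3629 ranges)."""
    if codepoint < 0x80:
        return 1
    if codepoint < 0x800:
        return 2
    if codepoint < 0x10000:
        return 3
    if codepoint <= 0x10FFFF:
        return 4
    raise ValueError("beyond the Unicode code space")

# Cross-check against Python's own encoder, and note which code points
# no longer fit in a 16-bit rune.
for cp in (0x41, 0x3B1, 0xFFFD, 0x10000, 0x10FFFF):
    assert utf8_len(cp) == len(chr(cp).encode("utf-8"))
    print(hex(cp), utf8_len(cp), "needs more than 16 bits" if cp > 0xFFFF else "fits in 16 bits")
```

A 16-bit rune never needs more than three bytes, so anything above U+FFFF calls for both a fourth byte in the encoding and a wider rune type.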




* Re: [9fans] spaces, separators, and utf-8
  2002-06-01 23:01 [9fans] spaces, separators, and utf-8 Geoff Collyer
@ 2002-06-05 10:03 ` Douglas A. Gwyn
  0 siblings, 0 replies; 5+ messages in thread
From: Douglas A. Gwyn @ 2002-06-05 10:03 UTC (permalink / raw)
  To: 9fans

Geoff Collyer wrote:
> Michael may only be arguing for admitting the space character in file
> names, but I believe that others will go farther once space is
> admitted, having witnessed various wheels of reincarnation in the
> past. ...

I don't think it's so much a matter of admitting certain characters
as it is of disallowing certain characters.  There should be good
reason for disallowal.

> Has anybody figured out how (or if) to cope with Unicode 3?  They've
> broken their promise to stick to 16 bits, which UTF-8 can cope with,
> if we crank up UTFmax.  Is switching to 32-bit runes only a minor
> performance hit?

"I told you so."



* Re: [9fans] spaces, separators, and utf-8
  2002-06-01 17:54 ` [9fans] spaces, separators, and utf-8 Michael Baldwin
  2002-06-01 18:21   ` Scott Schwartz
@ 2002-06-01 22:00   ` Dan Cross
  1 sibling, 0 replies; 5+ messages in thread
From: Dan Cross @ 2002-06-01 22:00 UTC (permalink / raw)
  To: 9fans

> ["even Michael Baldwin" -- are you saying i now have a reputation as a
> communist?!]

Yes.

	- Dan C.

:-)



* Re: [9fans] spaces, separators, and utf-8
  2002-06-01 17:54 ` [9fans] spaces, separators, and utf-8 Michael Baldwin
@ 2002-06-01 18:21   ` Scott Schwartz
  2002-06-01 22:00   ` Dan Cross
  1 sibling, 0 replies; 5+ messages in thread
From: Scott Schwartz @ 2002-06-01 18:21 UTC (permalink / raw)
  To: 9fans

| ...  and 9P even works when accessing something
| that doesn't use / because the protocol itself doesn't use / in Twalk.
| so one can even get to those ugly \ systems from plan 9 (until they do
| something stupid like put / in a path element).

If open took an array of filenames, you wouldn't need a delimiter in the
kernel.  On the other hand, if there were system calls that gave more
direct access to 9p, you could do it that way.  An old thread also argued
for that on the grounds of making stacked file servers easier to build.
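
The idea of passing a path as a list of elements, with no delimiter character at all, can be sketched in Python. The layout below follows my reading of the 9P2000 manual (a two-byte little-endian count before each string, and a two-byte element count before the list). It covers only the name list of a walk request, not a complete message, and the function names are hypothetical.

```python
import struct

def pack_string(s):
    """9P2000-style string: 2-byte little-endian byte count, then UTF-8 bytes."""
    data = s.encode("utf-8")
    return struct.pack("<H", len(data)) + data

def pack_wnames(elements):
    """Element count followed by the elements; no separator character is used."""
    return struct.pack("<H", len(elements)) + b"".join(pack_string(e) for e in elements)

def unpack_wnames(buf):
    """Inverse of pack_wnames: read the count, then each length-prefixed name."""
    (count,) = struct.unpack_from("<H", buf, 0)
    offset, out = 2, []
    for _ in range(count):
        (n,) = struct.unpack_from("<H", buf, offset)
        offset += 2
        out.append(buf[offset:offset + n].decode("utf-8"))
        offset += n
    return out

# Spaces and backslashes inside an element need no quoting on the wire.
path = ["usr", "michael's mother's recipes", "a\\b"]
wire = pack_wnames(path)
assert unpack_wnames(wire) == path
print(wire)
```

Because every element carries its own length, the encoding itself does not care which characters appear inside a name. Any restriction has to come from the servers and programs at either end.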

| don't feel like quoting.  and NUL, well let's not get started.  does NUL
| work *anywhere*?

It is said to work in Tcl, where they use a (technically illegal)
UTF-8 sequence that translates to 0x0000.
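
A Python sketch of that trick, assuming the sequence meant here is the two-byte overlong form C0 80. The helper names are hypothetical and this is not Tcl's actual code.

```python
def encode_modified(text):
    """Encode text as UTF-8, but write U+0000 as the overlong pair C0 80."""
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")

def decode_modified(data):
    """Inverse of encode_modified; C0 80 never occurs in well-formed UTF-8."""
    return data.replace(b"\xc0\x80", b"\x00").decode("utf-8")

sample = "a\x00b"
wire = encode_modified(sample)
assert b"\x00" not in wire            # safe inside a NUL-terminated C string
assert decode_modified(wire) == sample

try:
    wire.decode("utf-8")              # a strict decoder rejects the overlong form
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print("strict UTF-8 decoder accepts it:", strict_ok)
```

The encoded bytes contain no zero byte, so they pass through C string functions, but a conforming UTF-8 decoder refuses them. That is the sense in which the sequence is technically illegal.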




* [9fans] spaces, separators, and utf-8
  2002-06-01 14:59 [9fans] lures Lucio De Re
@ 2002-06-01 17:54 ` Michael Baldwin
  2002-06-01 18:21   ` Scott Schwartz
  2002-06-01 22:00   ` Dan Cross
  0 siblings, 2 replies; 5+ messages in thread
From: Michael Baldwin @ 2002-06-01 17:54 UTC (permalink / raw)
  To: 9fans

On Saturday, June 1, 2002, at 10:59, lucio@proxima.alt.za wrote:

> The clincher is that the space is useful both as a separator of
> command line arguments and as a joiner of filename "words".

command line arguments can be trivially quoted.  if the issue is
possible shell misinterpretation, then we've got a whole slew of allowed
chars that are "problems": * ? ; < > [ ] { } ` ' " $ & ^ # = \ |.  they
are all allowed now, and if i type them at a shell, i've got to quote
them.  one shell (mash) used : as a special char, and there were and are
files with colons in them in the plan 9 distribution.  so why does space
get all the scorn?
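
For reference, rc's quoting rule is small enough to sketch in a few lines of Python: wrap the word in single quotes and double any single quote inside it. The helper name rc_quote is hypothetical.

```python
def rc_quote(word):
    """Quote a word for rc: wrap in single quotes, doubling any inside."""
    return "'" + word.replace("'", "''") + "'"

for word in ("My Great Novel", "michael's mother's recipes", "a*b?c"):
    print(rc_quote(word))
```

This sketch quotes every word unconditionally. A real tool would probably quote only when the word contains a character the shell treats specially.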

> Seeing as even Michael Baldwin does not suggest using spaces as path
> separators (why not?)

["even Michael Baldwin" -- are you saying i now have a reputation as a
communist?!]

we already have a separator (/), no need for another one.  or are you
asking why not space instead?  when i call a file "My Great Novel" it
doesn't seem natural to think of space as a path separator.  i used
Primos eons ago, and they used ">", which is perhaps arguably slightly
more intuitive than "/" (or ":" or "\").  i can imagine wanting to put a
date in a filename as in 2002/06/01 or use / in other ways in names,
but > is harder to imagine.

but in the end, you must have a separator, so you just pick a char and
say it's the separator.  it is /.  and you cannot use / otherwise in a
path.  fine, that's life.  and 9P even works when accessing something
that doesn't use / because the protocol itself doesn't use / in Twalk.
so one can even get to those ugly \ systems from plan 9 (until they do
something stupid like put / in a path element).  but space as a path
separator?  yikes, no.

but speaking to digy's point, i'm glad that control chars are
disallowed.  i think it is useful to have a char or two that you know
are outside the possible charset for filenames.  i'm thinking of \t and
\n, which can easily be used in text programs to delimit paths if they
don't feel like quoting.  and NUL, well let's not get started.  does NUL
work *anywhere*?  can't use it in C strings, can't use it in acme or rio
or sam, can't use it in old 9P.  curiously enough, 9P2000 can actually
transport it.  but just say no to NUL.
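
A small Python sketch of the kind of rule argued for here. The policy is hypothetical and is not Plan 9's actual check: it admits space but rejects control characters (including NUL, tab and newline) and the separator /.

```python
def forbidden(name):
    """Return the characters in name that this hypothetical policy rejects."""
    return sorted({c for c in name
                   if c == "/" or ord(c) < 0x20 or ord(c) == 0x7F})

candidates = ["My Great Novel", "michael's\nmother's\nrecipes", "2002/06/01", "a\tb"]
for name in candidates:
    print(repr(name), "->", forbidden(name) or "ok")

# Names that pass can be joined with tab and split again without quoting.
accepted = [n for n in candidates if not forbidden(n)]
line = "\t".join(accepted)
assert line.split("\t") == accepted
```

With such a rule in force, a text tool can use tab or newline between path names and never needs a quoting convention.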

> The rationale being that long filenames, GUIs and Internationalisation
> are all the _new_ rage and may as well be lumped into a single
> paradigm change.

hmm, i thought that internationalization by using utf-8 everywhere
(including in pathnames) was pioneered by plan 9 itself.  and it is a
good idea.  mac os x uses utf-8 paths and does ok with utf-8 in terminal
windows and mail; what about the other unixoid systems?  but are there
any systems that handle utf-8 as cleanly as plan 9 yet?  i don't know of
any.  the mac has problems (do ls in a Terminal window, or use TextEdit
or [gasp] vi or emacs), and if there is a convenient input method, i
haven't found it.  now it would be a great thing if this attribute of
plan 9 (utf-8 everywhere and it just works, decent C language support,
and a goodly-sized unicode font) were put into the commercial OS's out
there.  hey geoff, can you pull that off at apple?  certainly you
wouldn't be opposed to *that* crusade?!



