From mboxrd@z Thu Jan 1 00:00:00 1970
From: Dave
Subject: Re: [9fans] blanks in file names
In-reply-to: <200207110200.WAA26141@math.psu.edu>
To: 9fans@cse.psu.edu
Message-id: <200207110614.g6B6ExM18574@dave2.dave.tj>
MIME-version: 1.0
Content-type: text/plain; charset=iso-8859-1
Content-transfer-encoding: 8BIT
Date: Thu, 11 Jul 2002 02:14:59 -0400
Topicbox-Message-UUID: c91eb990-eaca-11e9-9e20-41e7f4b1d025

Reply inline:
 - Dave

Dan Cross wrote:
>
> > > I don't think it would be simpler; I think it would be more
> > > complicated.  You're replacing a simple, textual representation
> > > of an object with a binary representation; you have to have some
> > > way to do canonicalization in the common case, but even that path
> > > is fraught with danger.
> >
> > Manipulating text with all sorts of dynamic buffers is
> > substantially more complicated than simply replacing a node in a
> > linked list.  The canonicalization is all being done by the kernel,
> > or a library.
>
> How could this possibly be in the kernel?  After all, you're talking
> about changing the interface to open a file; I pass a file name via
> some mechanism to a user level application that wants to call open
> on it.  What's it supposed to do?  Does the shell now pass a linked
> list as an argument to main somehow?  How does the system know that
> it's a file?  Do we have to replace the argument vector with some
> more complex representation that encapsulates type information
> (e.g., this argument is a file, this next one is a string, etc.)?
> Does the shell change to represent file names as lists?  Does the
> user suffer the indignation of having to specify a list of path
> components to represent a file?  Or do we provide a canonicalization
> library for shell arguments, in which case you have the exact same
> problem as supporting spaces now, since most programs are going to
> expect to get file name arguments in the canonical representation?
> If you do that, who calls it?  The shell or the library?

That's an interesting point I didn't quite consider ... we'll have to
change the exec interface a lot more than I suspected at first glance.
I wasn't planning that until much later, because it'll require very
fundamental changes to the shell.  (/me hates proposing incremental
changes, because they invariably depend on other fundamental changes
in order for people to see their utility.)

> I for one am going to be *very* unhappy if I have to type:
>
> 	cat ('' 'usr' 'cross' 'file')
>
> Instead of:
>
> 	cat /usr/cross/file
>
> Or do you make every program that wants to open a file call a
> function to canonicalize a filename into the internal format before
> it calls open?

My image of a shell is a user interface.  It should translate all
output from programs into a format that's easy for a human to
understand, and it should translate data entered by the user from an
easy-for-a-human-to-input format into the machine format.  If you want
to print /"usr"/"cross"/"file", you should be able to type something
like "cat /usr/cross/file" and have the shell translate that into the
collection of lists (/usr/bin/cat and /usr/cross/file) required by the
underlying system.  The shell should also translate the output of the
ls command, for instance, so that it prints filenames in an
easy-for-humans-to-understand format.  The ls command itself, though,
should only print filenames in an easy-for-machines-to-understand
format.  Basically, the shell is the bidirectional translator between
computer-speak and human-speak.  That's its raison d'être.
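To make that concrete, here's a rough sketch of the translation I have
in mind (plain C; the Pathelem type and parsepath() are names I'm
inventing here for illustration, not anything that exists today):

	#include <stdlib.h>
	#include <string.h>

	typedef struct Pathelem Pathelem;
	struct Pathelem {
		char *name;	/* one component; may hold any byte */
		Pathelem *next;
	};

	/* Split a typed path like "/usr/cross/file" into a component list. */
	Pathelem*
	parsepath(const char *s)
	{
		Pathelem *head = NULL, **tail = &head, *e;
		const char *p;
		size_t n;

		for (;;) {
			while (*s == '/')	/* '/' is only a separator when typed */
				s++;
			if (*s == '\0')
				break;
			if ((p = strchr(s, '/')) == NULL)
				p = s + strlen(s);
			n = p - s;
			if ((e = malloc(sizeof *e)) == NULL)
				break;	/* out of memory; return what we have */
			if ((e->name = malloc(n + 1)) == NULL) {
				free(e);
				break;
			}
			memcpy(e->name, s, n);
			e->name[n] = '\0';
			e->next = NULL;
			*tail = e;
			tail = &e->next;
			s = p;
		}
		return head;
	}

A real shell would also need some quoting syntax so you could type a
component that itself contains a '/' or a space, but the point is that
only the shell needs it; the component list the kernel sees is
unambiguous without any escaping at all.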
Getting the kernel away from plain text doesn't mean getting the shell
away from plain text.  The shell can choose to support any method(s)
it wants for representing filenames in an
"easy-for-machine-to-understand" format, since it'll be converting the
filenames into linked lists for the kernel.  Utilities like find or ls
or whatever output filenames in a format that your shell can read.
(I envision an rc file supplied by the shell to let other programs
know what formats it supports.)

> > > But they change an already well-established interface.  Have you
> > > thought through the implications of this, in all their macabre
> > > glory?  What you propose--changing the most basic interface for
> > > opening a file in a system where everything looks more or less
> > > like a file--has huge implications.  And all this just to support
> > > a strange edge-case, which is adequately solved by substitutions
> > > in the filename.  Sure, it's not perfect in some weird
> > > pathological case, but how often is this going to come up in
> > > practice?  Remember: Optimize for the common case.
> >
> > Optimization for the common case is good, but creating a system
> > where the uncommon case will cause major mayhem at the system level
> > is evidence of a very unclean approach.  (When you consider the
> > reasoning behind the problem (namely, that spaces and slashes in
> > filenames kill our ability to separate nodes easily), it makes
> > perfect sense that our solution isn't very clean.  The only clean
> > solution is to restore the ancient UNIX ideal of being able to
> > easily separate nodes.  In other words, either kill spaces
> > altogether and damn interoperability, or promote spaces to full
> > citizenship.)
>
> But Plan 9 can handle this.
>
> One of the beautiful things about Plan 9 is that it provides a
> solution that's workable with little effort.  The various
> substitution file systems provide a workable solution without
> introducing any additional complexity.  If you want a total--100%
> complete--solution, then a `urlifyfs' can be written that uses URL
> escaping as a canonical representation, or something similar.  The
> system interface doesn't have to be changed, though.  *That* is the
> mark of a clean system design.

The only way to have the urlifyfs concept provide a 100% complete
solution is to use it as the default filesystem for your own stuff.
The reason?  Imagine downloading a file "blah%apos;" from an FTP
server; later, you download a file "blah'" from the same server, which
your urlifyfs faithfully translates into "blah%apos;" without
realizing that it's destroying a different file.  Guess what?  You've
just clobbered your original.  Now, if you're going to use urlifyfs
for your own stuff on your Plan 9 system, you're going to have to deal
with the same shell-interaction issues that my system has to deal
with.  The only difference is that my system doesn't break if somebody
forgets to use urlifyfs on a new filesystem, because my system moves
the text representation of filenames over to the shell, where it
belongs, rather than dumping that burden on a filesystem translation
hack.
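To see that clobbering concretely, here's a toy version of the
translation (urlifyfs is hypothetical, so this urlify() is invented
just to match the example above):

	#include <stdio.h>

	/* Naive substitution: each ' becomes "%apos;"; nothing else changes. */
	static void
	urlify(const char *in, char *out, size_t n)
	{
		size_t i = 0;

		for (; *in != '\0' && i + 7 < n; in++) {
			if (*in == '\'')
				i += snprintf(out + i, n - i, "%%apos;");
			else
				out[i++] = *in;
		}
		out[i] = '\0';
	}

	int
	main(void)
	{
		char a[64], b[64];

		urlify("blah%apos;", a, sizeof a);	/* the first download */
		urlify("blah'", b, sizeof b);		/* the later download */
		printf("%s\n%s\n", a, b);		/* both print blah%apos; */
		return 0;
	}

The mapping isn't 1-1, so two distinct remote names land on one local
name, and the second write silently destroys the first.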
> The Unix `ideal' was eliminated because it was overly complex,
> without a commensurate gain in functionality.  Besides, the inode
> system didn't really fit in well with the idea of 9p.
>
> > > > There's plenty of experience with other systems working on
> > > > linked lists (including a huge amount of kernel code in my
> > > > Linux box that I'm typing from, ATM).  Most of the problems
> > > > with linked lists have been pretty well documented by now.
> > >
> > > It's the huge amount of kernel code that Plan 9 is trying to
> > > avoid.
> >
> > String manipulation is more complex than linked list manipulation.
>
> No, it's really not.  Consider passing a linked list as an argument
> to a function you're calling, versus passing an argument vector of
> strings.  How do you do that?  Do you muck with all the C startup
> code to make sure you get the linking and so on right in such a way
> that the list is in a contiguous memory block, so it doesn't get
> stomped by the image read by exec?  Do you pass each node in the
> list to main as a separate string in the argument vector?  If so,
> how do you tell when one file name ends and another begins?  Do we
> introduce some convention for delineating the beginning and end of a
> filename in a list representation, effectively creating a protocol
> that every program has to follow to take a filename as an argument?
> Surely the former option is significantly easier....

This is only true of our current exec family, which has been carried
over essentially unchanged from UNIX.  It's based on strings, not on
lists.  IMHO, arguments should be objects.  Those objects can be
filenames, options with or without arguments of their own,
subcommands, just plain strings, etc.  (There's a sketch of what I
mean a little further down.)  This makes arguments a lot more
representative of what they actually are, and it eliminates the need
for complex argument-handling libraries.  Obviously, this whole change
can be totally transparent to the user, because his shell is doing the
necessary translations back and forth.  However, you get an extremely
powerful system as the payoff: a system that makes it rather easy to
reimplement all our current syscalls as tiny library functions,
possibly in an emulation library.

> Consider a possible canonicalization routine that might be used in
> a substitution FS:
>
> 	char *
> 	canonical(char *str)
> 	{
> 		char *p, *s, *t;
>
> 		if (str == nil || (p = malloc(2 * strlen(str) + 1)) == nil) {
> 			return(nil);
> 		}
> 		for (s = str, t = p; *s != '\0'; s++, t++) {
> 			if (isspace(*s)) {
> 				*t++ = '+';	/* Or whatever. */
> 				*t = '2';
> 				continue;
> 			}
> 			*t = *s;
> 		}
> 		*t = '\0';	/* terminate before the strlen() below */
> 		if ((s = realloc(p, strlen(p) + 1)) == nil) {
> 			free(p);
> 		}
>
> 		return(s);
> 	}
>
> That's pretty straightforward; just inserting into a linked list
> would be just as hard.  Doing so in a contiguous memory block would
> be, I think, harder (you'd have to step over the list, keep a count
> of how much memory you needed, then allocate the list, copy each
> node, and set the links.  That's a pain).

strlen() is an expensive operation.  realloc() sucks in a
multithreaded environment.  To top it all off, that algorithm doesn't
take into account the expansion which is ABSOLUTELY NECESSARY in order
to achieve 100% coverage: it rewrites spaces as "+2" but passes a
literal '+' through untouched, so "a b" and "a+2b" both come out as
"a+2b".  (If you're not going to achieve a 1-1 mapping, it's silly to
even bother with this.)  Also, I'd like to mention again that I'm not
asking the kernel to allocate memory.  The userland program provides a
block of memory, and the kernel manipulates that block, returning an
error if the block is too small.
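Done that way, a 1-1 version isn't much longer; the trick is that the
escape character itself has to be escaped.  A sketch, with an encoding
I'm making up on the spot (' ' becomes "+s", '+' becomes "++"):

	#include <stddef.h>

	/*
	 * 1-1 canonicalization into a caller-supplied buffer.
	 * Escaping the escape character is what makes the mapping
	 * invertible.  Returns the encoded length, or -1 if the
	 * buffer is too small.
	 */
	long
	canonical(const char *str, char *buf, size_t nbuf)
	{
		size_t i = 0;

		for (; *str != '\0'; str++) {
			if (*str == ' ' || *str == '+') {
				if (i + 2 >= nbuf)
					return -1;
				buf[i++] = '+';
				buf[i++] = (*str == ' ') ? 's' : '+';
			} else {
				if (i + 1 >= nbuf)
					return -1;
				buf[i++] = *str;
			}
		}
		buf[i] = '\0';
		return (long)i;
	}

One pass, no strlen(), no realloc(), and no collisions: "a b" encodes
to "a+sb", while a literal "a+sb" encodes to "a++sb".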
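And here's the sketch of argument objects I promised above (every name
is invented; this is a sketch of the idea, not a spec).  A new exec
would hand main a list of these instead of a char *argv[], with
Pathelem being a filename-component node like the one sketched
earlier:

	typedef enum { AFILE, AOPT, ASTRING } Argtype;

	typedef struct Arg Arg;
	struct Arg {
		Argtype type;
		union {
			Pathelem *path;		/* AFILE: a filename, as a component list */
			struct {
				char *name;	/* AOPT: the option itself... */
				Arg *arg;	/* ...and its argument, if any */
			} opt;
			char *str;		/* ASTRING: just plain text */
		} u;
		Arg *next;			/* next argument in the vector */
	};

The shell builds this list from whatever the user types, so cat still
looks like "cat /usr/cross/file" at the keyboard.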
> > > Being forced to conform to a lot of external interfaces *will*
> > > kill the system.
> >
> > I don't dispute that point, but the interface I propose is most
> > unlike any other interface currently known to man (it's not trying
> > to conform to any external interface).  I'm simply pointing out
> > that failing to provide at least a 1-1 mapping with capabilities
> > that are already widely used in external systems that must
> > interoperate with ours *will* kill us.
>
> Well, if you *really* want 100% 1 to 1 mappings, use the URL
> encoding others have mentioned, or something similar.  As it is, it
> seems that this mostly works; about 80% of what's needed is there.

URL encoding _will_ work if it's implemented right (except for the
uncleanliness I mentioned above, and some more problems I mention
below).  However, URL encoding makes the resulting system just as
ugly as the one I'm proposing from the user's perspective, but much,
much uglier from a system perspective.

> > > Besides, the point Nemo was trying to make umpteen posts ago was
> > > that, yes, you can roll back changes using the dump filesystem,
> > > which gives you temporal mobility.  He is right.
> >
> > You can do a lot of things if you're prepared to get involved in
> > the functions that your OS should be doing automatically.  Try
> > running an FTP mirror of a busy site that way, though, and you'll
> > quickly discover why automation is a good thing.  The worst part
> > about our system is that the "solution" you eventually find for an
> > FTP mirror will be useless for an HTTP proxy.  When "solutions"
> > need to be modified for each individual application, you know that
> > the system isn't clean.
>
> Yesterday is a wonderful tool, and can be scripted to do whatever
> you want.  E.g., copying all files that changed on June 14th back to
> the cache isn't very difficult.

Yesterday can't be used to update the relative references in all the
README files in the FTP archives to the urlified versions.

> I don't see what running a big FTP mirror has to do with it.  netlib
> is a big FTP site; it runs on Plan 9.  Maybe it's not a mirror, but
> so what?

Since it's not a mirror, it doesn't have to contend with all the
spaceful filenames you find in the non-Plan 9 world.

> I also don't see how you can't leverage whatever you did for FTP
> with HTTP.  The substitution-style FS gives you a *lot* of
> flexibility in this area.

What you did for FTP was scan the README files for references.  What
you do for HTTP is update all the href and src attributes in the HTML
files (and hope none of the Java programs have embedded URLs that you
can't change at all), so you don't get broken links everywhere.
...unless you want to implement the transformation/detransformation
code in the FTP and HTTP servers as well, in which case your system
becomes one step worse than mine, because you have
transformation/detransformation code in two places on your system :-(

> 	- Dan C.
>