From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <393a7f63e7c4973e661988f1009f5d7a@vitanuova.com>
To: 9fans@cse.psu.edu
Subject: Re: [9fans] ls, rc question -- proposed change to rc/glob.c 
From: rog@vitanuova.com
In-Reply-To: <240892daccaec2131635fbd4917479de@hamnavoe.com>
MIME-Version: 1.0
Content-Type: multipart/mixed;
	boundary="upas-ykttiwvdkmfsuzrinyxxkxscna"
Date: Sun, 21 Mar 2004 20:47:46 +0000
Topicbox-Message-UUID: 3bffdadc-eacd-11e9-9e20-41e7f4b1d025

This is a multi-part message in MIME format.
--upas-ykttiwvdkmfsuzrinyxxkxscna
Content-Disposition: inline
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit

for programs that require read-write access to a namespace,
requirements are different, as charles points out.

however, most tools do not write to the namespace.

for me, one of plan 9's fundamental strengths is that while a tool is
often written to access a predefined portion of the namespace, the
actual namespace that it sees can be arranged at will for the tool in
question.

thus if a program accesses files in directory /d, i think it's
reasonable to expect that if i bind another directory, say /b, before
/d, i can expect the program to treat /d exactly as if i had really
replaced the files therein with files from /b.

the program shouldn't have to *care* whether it's dealing with a union
directory or not - that should be solely the concern of whoever's
arranging the namespace.

i think the difference between:

	bind /dev/null /bin/ls

and

	mkdir /tmp/x
	> /tmp/x/ls
	bind /dev/null /tmp/x/ls
	bind -b /tmp/x /bin

should be invisible, particularly to shell scripts, which have no
concept of sequentially reading a directory, and which are often short
and trivial, but for which the above invariants are still important.

russ:
> if a file is covered up, it should be completely hidden.
> it turns out that actually implementing this is quite hard,

i.e.  the visibility of mutiple names in a directory is an
implementation issue.  it's something that's easily solved at user
level (no problems with arbitrary buffering there, and programs can
eliminate duplicates in their own way if they like as, for example,
du(1), mk(1), cron(1) listen(1) do).

i think it should be made as easy as possible to write programs that
behave in a namespace-agnostic way.  i think that if a version of
dirreadall was available that eliminated duplicates, most (all?)
programs would wish to use it.

i think it's particularly silly that

	wc */*.[ch]

should give me erroneous results in a union directory,
and that

	echo /bin/ape/*

should be different from

	echo /bin/ape*/*

when there's only one accessible directory starting "ape" in /bin, and
the shell's sorting the globbed output anyway.

i think eliminating duplicates in two core places would break nothing,
but make parts of the system work more smoothly together.

it's one less thing to think about.

geoff points out many programs that use dirread.  a random inspection
turns up many programs that already do their own elimination of
duplicates (in different ways), and others that behave badly in the
face of duplicate directory entries (e.g.  diff).

this seems to me to point towards a systematic problem, which should
be solved centrally.

towards this end, i've attached a stable sort routine, along the same
lines as qsort(2), which can optionally eliminate duplicates.  i've
tested it to a degree, but not greatly; i haven't done any performance
analysis of it, but i doubt that's a limiting factor in this case.

i envisage ls(1) being changed to use this routine (with an option to
show/suppress duplicate entries), and perhaps a version of dirreadall
that sorts the directory entries alphabetically and eliminates
duplicates, to make it easy to replace it in existing programs.


PS it's interesting to note that ls(1) does make some sort of an
effort to do a stable sort:

	if(i == 0)
		i = (a<b? -1 : 1);

but it's simple minded (in fact, i'm a little bit surprised it doesn't
kill qsort).  a sequence field in the NDir structure could fix it.

--upas-ykttiwvdkmfsuzrinyxxkxscna
Content-Disposition: attachment; filename=mergesort.c
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit

#include <u.h>
#include <libc.h>
	
int
mergesort1(int uniq, char *a, int r, int width, int (*compare)(void*, void*), char *b)
{
	int m, n0, n1, i, j, k, e, c;
	if(r <= width)
		return r;
	m = ((r/width-1)/2 + 1) * width;
	n0 = mergesort1(uniq, a, m, width, compare, b);
	n1 = mergesort1(uniq, a+m, r-m, width, compare, b);
	memcpy(b, a, n0);
	memcpy(b+n0, a+m, n1);
	e = n0+n1;
	i = 0;
	j = n0;
	k = 0;
	for(; i < n0 && j < e; k += width){
		c = (*compare)(b+i, b+j);
		if(c > 0){
			memcpy(a+k, b+j, width);
			j += width;
		}else{
			memcpy(a+k, b+i, width);
			i += width;
			if(c == 0 && uniq)
				j += width;
		}
	}
	if(i < n0){
		memcpy(a+k, b+i, n0-i);
		k += n0-i;
	}else if(j < e){
		memcpy(a+k, b+j, e-j);
		k += e-j;
	}
	return k;
}

int
mergesort(int uniq, void *a, long n, long width, int (*compare)(void*, void*))
{
	void *b;
	n *= width;
	b = malloc(n);
	if(b == nil)
		return -1;
	n = mergesort1(uniq, a, n, width, compare, b);
	free(b);
	return n / width;
}
--upas-ykttiwvdkmfsuzrinyxxkxscna--