[9fans] awk, not utf aware...

9fans - fans of the OS Plan 9 from Bell Labs

* [9fans] awk, not utf aware...
@ 2008-02-26 12:18 Gorka Guardiola
  2008-02-26 13:16 ` Martin Neubauer
  2008-02-26 20:24 ` erik quanstrom
  0 siblings, 2 replies; 21+ messages in thread
From: Gorka Guardiola @ 2008-02-26 12:18 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

I think this has come up before, but I didn't find a reply.
If I do in awk something like:

split($0, c, "");

c should be an array of Runes internally, UTF externally, but apparently,
it is not. Is it just broken? Is there a replacement? Is it just the
builtins, or is the whole awk broken?

Example, freqpair

------
#!/bin/awk -f

{
	n = split($0, c , "");
	for(i=1; i<n; i++){
		pair=c[i] c[i+1]
		f[pair]++;
	}
}
END{
	for(h in f)
		printf("%d %s\n", f[h], h);
}

------

% echo abcd|freqpair
1 ab
1 cd
1 bc
% echo aícd|freqpair
1 cd
1 �c
1 í
1 a�

where the � is a Peter face...
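
A minimal sketch (not part of the original post) of why the pairs come out
broken: í is U+00ED, which UTF-8 encodes as the two bytes 0xC3 0xAD, so a
byte-wise split sees five elements in aícd rather than four, and two of the
resulting pairs cut the í in half. The helper names are made up for
illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: number of bytes in s, which is what a
 * byte-wise split(s, c, "") ends up counting. */
size_t
count_bytes(const char *s)
{
	assert(s != NULL);
	return strlen(s);
}

/* Hypothetical helper: number of UTF-8 code points in s.
 * Continuation bytes have the form 10xxxxxx, so every byte that
 * is not a continuation byte starts a new code point.
 * Assumes s is valid UTF-8. */
size_t
count_runes(const char *s)
{
	size_t n = 0;

	assert(s != NULL);
	for (; *s != '\0'; s++)
		if (((unsigned char)*s & 0xc0) != 0x80)
			n++;
	return n;
}
```

For the string "a\xc3\xad" "cd" (aícd), count_bytes gives 5 and count_runes
gives 4.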

Thanks.

-- 
- curiosity sKilled the cat


* Re: [9fans] awk, not utf aware...
  2008-02-26 12:18 [9fans] awk, not utf aware Gorka Guardiola
@ 2008-02-26 13:16 ` Martin Neubauer
  2008-02-26 14:54   ` Gorka Guardiola
  2008-02-26 20:24 ` erik quanstrom
  1 sibling, 1 reply; 21+ messages in thread
From: Martin Neubauer @ 2008-02-26 13:16 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Awk is one of the few programs in the distribution that is maintained
externally (by Brian Kernighan) and is pulled in via ape and pcc (it might
actually be the only one - I didn't bother to check). A quick glimpse at
lex.c suggests that awk scans input one char at a time. In hindsight I'm a
bit surprised that I haven't been bitten by this, but I probably didn't split
within multibyte sequences. It's probably not too hard to change awk to read
runes, at the price of creating ``the other one true awk.''

	Martin


* Re: [9fans] awk, not utf aware...
  2008-02-26 13:16 ` Martin Neubauer
@ 2008-02-26 14:54   ` Gorka Guardiola
  0 siblings, 0 replies; 21+ messages in thread
From: Gorka Guardiola @ 2008-02-26 14:54 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Tue, Feb 26, 2008 at 2:16 PM, Martin Neubauer <m.ne@gmx.net> wrote:
> Awk is one of the few programs in the distribution that is maintained
>  externally (by Brian Kernighan) and is pulled in via ape and pcc (it might
>  actually be the only one - I didn't bother to check). A quick glimpse at
>  lex.c suggests that awk scans input one char at a time. In hindsight I'm a
>  bit surprised that I haven't been bitten by this, but I probably didn't split
>  within multibyte sequences. It's probably not too hard to change awk to read
>  runes, at the price of creating ``the other one true awk.''
>

I don't know if it is that easy. I'll leave it on my todo list for the future :-).
Anyway, the BUGS section should say it does not know about UTF.
I'll send a patch.


-- 
- curiosity sKilled the cat



* Re: [9fans] awk, not utf aware...
  2008-02-26 12:18 [9fans] awk, not utf aware Gorka Guardiola
  2008-02-26 13:16 ` Martin Neubauer
@ 2008-02-26 20:24 ` erik quanstrom
  2008-02-26 21:08   ` geoff
  2008-02-27  7:36   ` Gorka Guardiola
  1 sibling, 2 replies; 21+ messages in thread
From: erik quanstrom @ 2008-02-26 20:24 UTC (permalink / raw)
  To: 9fans

> I think this has come up before, but I didn't find a reply.
> If I do in awk something like:
> 
> split($0, c, "");
> 
> c should be an array of Runes internally, UTF externally, but apparently,
> it is not. Is it just broken? Is there a replacement? Is it just the
> builtins, or is the whole awk broken?

i think the comments about this problem are missing the point
a bit.  utf8 should be transparent to awk unless the situation demands
that awk needs to know the length of a character.  it's not necessary
to keep strings as Rune*s internally to work with utf8.  splitting on
"" is a special case where awk does need to know the length of
a character.  e.g. this script should work fine

	; cat /tmp/smile
	#!/bin/awk -f
	{
		n = split($0, c, "☺");
		for(i = 1; i <= n; i++)
			print c[i]
	}
	; echo fu☺bar|/tmp/smile
	fu
	bar

but splitting on "" won't.  i attached a patch that fixes this problem
as an illustration.  i'm not using utflen because pcc won't see it.
it's an ugly patch.

i don't think i know what a proper fix for awk would be.  i wouldn't
think there are many cases like this, but i haven't spent much time
with awk internals.

- erik

------

9diff run.c
/n/sources/plan9//sys/src/cmd/awk/run.c:1191,1196 - run.c:1191,1219
  	return(False);
  }
  
+ static int
+ utf8len(char *s)
+ {
+ 	int c, n, i;
+ 
+ 	c = *(unsigned char*)s++;
+ 	if ((c&0xe0) == 0xc0)
+ 		n = 2;
+ 	else if ((c&0xf0) == 0xe0)
+ 		n = 3;
+ 	else if ((c&0xf8) == 0xf0)
+ 		n = 4;
+ 	else
+ 		return 1; 	//-1;
+ 	i = n-1;
+ 	if(strlen(s) < i)
+ 		return 1;		// -1;
+ 	for(; i-- && (c = *(unsigned char*)s++);)
+ 		if(0x80 != (c&0xc0))
+ 			return 1;	//-1;
+ 	return n;
+ }
+ 
  Cell *split(Node **a, int nnn)	/* split(a[0], a[1], a[2]); a[3] is type */
  {
  	Cell *x = 0, *y, *ap;
/n/sources/plan9//sys/src/cmd/awk/run.c:1279,1290 - run.c:1302,1316
  				s++;
  		}
  	} else if (sep == 0) {	/* new: split(s, a, "") => 1 char/elem */
- 		for (n = 0; *s != 0; s++) {
- 			char buf[2];
+ 		int i, len;
+ 		char buf[5];
+ 		for (n = 0; *s != 0; s += len) {
  			n++;
  			sprintf(num, "%d", n);
- 			buf[0] = *s;
- 			buf[1] = 0;
+ 			len = utf8len(s);
+ 			for(i = 0; i < len; i++)
+ 				buf[i] = s[i];
+ 			buf[len] = 0;
  			if (isdigit(buf[0]))
  				setsymtab(num, buf, atof(buf), STR|NUM, (Array *) ap->sval);
  			else



* Re: [9fans] awk, not utf aware...
  2008-02-26 20:24 ` erik quanstrom
@ 2008-02-26 21:08   ` geoff
  2008-02-26 21:21     ` Pietro Gagliardi
  2008-02-26 21:34     ` erik quanstrom
  2008-02-27  7:36   ` Gorka Guardiola
  1 sibling, 2 replies; 21+ messages in thread
From: geoff @ 2008-02-26 21:08 UTC (permalink / raw)
  To: 9fans

Plan 9 awk is an APE program, so it uses the unpronounceable ANSI
mbtowc/wctomb functions to deal with UTF.  Thus it uses mblen rather
than utflen or utf8len.
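
A rough sketch of what stepping through a string with mblen could look like.
This is not the code in awk; the helper name is hypothetical, and outside
Plan 9 the result depends on the locale selected by the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: count characters with ANSI mblen().
 * On most non-Plan 9 systems the caller must first select a UTF-8
 * locale, e.g. with setlocale(LC_CTYPE, ""); in the default "C"
 * locale non-ASCII bytes are typically not decoded as UTF-8.
 * Bytes that mblen rejects are counted as one character each,
 * mirroring the "return 1" fallback in the patch above. */
size_t
count_chars_mblen(const char *s)
{
	size_t n = 0, left;
	int len;

	assert(s != NULL);
	left = strlen(s);
	(void)mblen(NULL, 0);	/* reset any shift state */
	while (left > 0) {
		len = mblen(s, left);
		if (len <= 0)
			len = 1;	/* invalid sequence: step one byte */
		s += len;
		left -= (size_t)len;
		n++;
	}
	return n;
}
```

The same loop shape would fit the split(s, a, "") case: copy len bytes into
the element buffer instead of just counting.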



* Re: [9fans] awk, not utf aware...
  2008-02-26 21:08   ` geoff
@ 2008-02-26 21:21     ` Pietro Gagliardi
  2008-02-26 21:24       ` erik quanstrom
                         ` (2 more replies)
  2008-02-26 21:34     ` erik quanstrom
  1 sibling, 3 replies; 21+ messages in thread
From: Pietro Gagliardi @ 2008-02-26 21:21 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

And it's wonderful that the C standard defines a character literal as
so:

	char-literal:
		' characters '
	characters:
		character
		characters character

(or something like that)

Question, then: why do we need wchar_t/Rune?

On Feb 26, 2008, at 4:08 PM, geoff@plan9.bell-labs.com wrote:

> Plan 9 awk is an APE program, so it uses the unpronounceable ANSI
> mbtowc/wctomb functions to deal with UTF.  Thus it uses mblen rather
> than utflen or utf8len.
>



* Re: [9fans] awk, not utf aware...
  2008-02-26 21:21     ` Pietro Gagliardi
@ 2008-02-26 21:24       ` erik quanstrom
  2008-02-26 21:32       ` Steven Vormwald
  2008-02-27  2:38       ` Joel C. Salomon
  2 siblings, 0 replies; 21+ messages in thread
From: erik quanstrom @ 2008-02-26 21:24 UTC (permalink / raw)
  To: 9fans

> And it's wonderful that the C standard defines a character literal as
> so:
>
> 	char-literal:
> 		' characters '
> 	characters:
> 		character
> 		characters character
>
> (or something like that)
>
> Question, then: why do we need wchar_t/Rune?
>

because we have more than 255 characters.

- erik



* Re: [9fans] awk, not utf aware...
  2008-02-26 21:21     ` Pietro Gagliardi
  2008-02-26 21:24       ` erik quanstrom
@ 2008-02-26 21:32       ` Steven Vormwald
  2008-02-26 21:40         ` Pietro Gagliardi
  2008-02-27  2:38       ` Joel C. Salomon
  2 siblings, 1 reply; 21+ messages in thread
From: Steven Vormwald @ 2008-02-26 21:32 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Tue, 2008-02-26 at 16:21 -0500, Pietro Gagliardi wrote:
> And it's wonderful that the C standard defines a character literal as
> so:
>
> 	char-literal:
> 		' characters '
> 	characters:
> 		character
> 		characters character
>
> (or something like that)
>
> Question, then: why do we need wchar_t/Rune?

The definitions are (<> used to indicate non-terminals in the
grammar...):

(6.4.4.4) character-constant:
	' <c-char-sequence> '
	L' <c-char-sequence> '

(6.4.4.4) c-char-sequence:
	<c-char>
	<c-char-sequence> <c-char>

(6.4.4.4) c-char:
	any member of the source character set except the single-quote ',
backslash \, or new-line character

	<escape-sequence>

Steven Vormwald
sdvormwa@mtu.edu




* Re: [9fans] awk, not utf aware...
  2008-02-26 21:08   ` geoff
  2008-02-26 21:21     ` Pietro Gagliardi
@ 2008-02-26 21:34     ` erik quanstrom
  1 sibling, 0 replies; 21+ messages in thread
From: erik quanstrom @ 2008-02-26 21:34 UTC (permalink / raw)
  To: 9fans

thanks for catching that.

my brain's not on today.  generally i avoid the mb functions because they
rely on locale.  of course this doesn't apply on plan 9 and so there's no reason
for utf8len.

it looks like mblen is used elsewhere; perhaps this would now be a worthwhile
patch.

- erik

> Plan 9 awk is an APE program, so it uses the unpronounceable ANSI
> mbtowc/wctomb functions to deal with UTF.  Thus it uses mblen rather
> than utflen or utf8len.
>


* Re: [9fans] awk, not utf aware...
  2008-02-26 21:32       ` Steven Vormwald
@ 2008-02-26 21:40         ` Pietro Gagliardi
  2008-02-26 21:42           ` Pietro Gagliardi
  2008-02-26 23:59           ` Steven Vormwald
  0 siblings, 2 replies; 21+ messages in thread
From: Pietro Gagliardi @ 2008-02-26 21:40 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Yes. I'm too lazy to pick up my copy of the standard.


* Re: [9fans] awk, not utf aware...
  2008-02-26 21:40         ` Pietro Gagliardi
@ 2008-02-26 21:42           ` Pietro Gagliardi
  2008-02-26 23:59           ` Steven Vormwald
  1 sibling, 0 replies; 21+ messages in thread
From: Pietro Gagliardi @ 2008-02-26 21:42 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

(which I have sitting next to me)


* Re: [9fans] awk, not utf aware...
  2008-02-26 21:40         ` Pietro Gagliardi
  2008-02-26 21:42           ` Pietro Gagliardi
@ 2008-02-26 23:59           ` Steven Vormwald
  1 sibling, 0 replies; 21+ messages in thread
From: Steven Vormwald @ 2008-02-26 23:59 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Tue, 2008-02-26 at 16:40 -0500, Pietro Gagliardi wrote:
> Yes. I'm too lazy to pick up my copy of the standard.

I just happened to be reading through Annex A (the grammar) at the time,
so I thought I'd send it out.

Steven Vormwald
sdvormwa@mtu.edu



* Re: [9fans] awk, not utf aware...
  2008-02-26 21:21     ` Pietro Gagliardi
  2008-02-26 21:24       ` erik quanstrom
  2008-02-26 21:32       ` Steven Vormwald
@ 2008-02-27  2:38       ` Joel C. Salomon
  2008-02-29 17:00         ` Douglas A. Gwyn
  2 siblings, 1 reply; 21+ messages in thread
From: Joel C. Salomon @ 2008-02-27  2:38 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Tue, Feb 26, 2008 at 4:21 PM, Pietro Gagliardi <pietro10@mac.com> wrote:
> And it's wonderful that the C standard defines a character literal as
>  so:

But it leaves the meaning of a literal like 'abcd' up to the compiler.
 I did something very perverse -- but 'legal' -- in the compiler I
started writing for class...

Also recall that sizeof('c') == sizeof(int).  I suspect, though, that
literals like 'abcd' are left from the B (word-addressable, not
byte-addressable) days.
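
The sizeof claim is easy to check. A small sketch (function names are
hypothetical; this holds for C, not C++, where 'c' has type char):

```c
#include <assert.h>
#include <stddef.h>

/* In C, a character constant such as 'c' has type int. */
size_t
char_constant_size(void)
{
	return sizeof('c');
}

/* A char object, by contrast, always has size 1. */
size_t
char_object_size(void)
{
	char c = 'c';

	assert(c == 'c');
	return sizeof c;
}
```

On a C compiler char_constant_size() equals sizeof(int) and
char_object_size() is 1.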

A quick check of /sys/src/cmd/cc/lex.c shows that kenc disallows such horrors.

--Joel


* Re: [9fans] awk, not utf aware...
  2008-02-26 20:24 ` erik quanstrom
  2008-02-26 21:08   ` geoff
@ 2008-02-27  7:36   ` Gorka Guardiola
  2008-02-27 15:54     ` Sape Mullender
  1 sibling, 1 reply; 21+ messages in thread
From: Gorka Guardiola @ 2008-02-27  7:36 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Tue, Feb 26, 2008 at 9:24 PM, erik quanstrom <quanstro@quanstro.net> wrote:
>
>  i think the comments about this problem are missing the point
>  a bit.  utf8 should be transparent to awk unless the situation demands

No. It is not transparent at all. It is semi-translucent, because someone did it
partway, and because of that I have been bitten hard by this in different
situations (I am not complaining, just saying that this may not be the right
approach to take in the future).

What someone did is make it so that
/a.j/
matches
a☺j
Someone fixed the regexp part of awk, so it already understands this,
which originally made me think (falsely) that it all works and conned me
into the bug.

There is split, and there are other functions;
for example:

toupper("aí")
gives
Aí

My guess is that there are many more little (or not so little) corners where
it doesn't work.
We can go on and on looking for crevices and sweeping the bugs further
under the rug, so that they are not evident and catch everyone completely
unaware; we can leave awk as it is now; or we can really fix the problem.
The first approach doesn't work. I am going to take the second until I have
time to take the third, which means using runes or at least revising all the
code so that it is uniformly aware of the existence of non-ASCII characters.
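
To make the toupper case concrete, here is a minimal sketch of an upper-casing
routine that knows about a small slice of non-ASCII. It is not how awk does
it or should do it; the function name is hypothetical and it only covers
ASCII plus U+00E0..U+00FE.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: upper-case s in place, handling ASCII and the
 * two-byte UTF-8 sequences for U+00E0..U+00FE (except U+00F7, the
 * division sign).  Those are encoded as 0xC3 0xA0..0xBE, and their
 * upper-case forms U+00C0..U+00DE as 0xC3 0x80..0x9E, so subtracting
 * 0x20 from the second byte does the mapping.  Everything else is
 * left untouched. */
void
utf8_toupper_latin1(char *s)
{
	unsigned char *p;

	assert(s != NULL);
	for (p = (unsigned char *)s; *p != '\0'; p++) {
		if (*p < 0x80) {
			*p = (unsigned char)toupper(*p);
		} else if (*p == 0xc3 && p[1] >= 0xa0 && p[1] <= 0xbe && p[1] != 0xb7) {
			p[1] -= 0x20;
			p++;	/* skip the byte just converted */
		}
	}
}
```

Applied to "a\xc3\xad" (aí) this yields "A\xc3\x8d" (AÍ), where a byte-wise
toupper leaves the í alone.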
-- 
- curiosity sKilled the cat


* Re: [9fans] awk, not utf aware...
  2008-02-27  7:36   ` Gorka Guardiola
@ 2008-02-27 15:54     ` Sape Mullender
  2008-02-27 20:01       ` Uriel
  2008-02-28 15:10       ` [9fans] awk, not utf aware erik quanstrom
  0 siblings, 2 replies; 21+ messages in thread
From: Sape Mullender @ 2008-02-27 15:54 UTC (permalink / raw)
  To: 9fans

> There is split and other functions,
> for example:
> 
> toupper("aí")
> gives
> Aí
> 
> My guess is that there are many more little (or not) corners where it
> doesn't work.

Yes, and then there is locale: does [a-z] include ĳ when you run it
in Holland (it should)?  Does it include á, è, ô in France (it should)?
Does it include ø, å in Norway (it should not)?  And what happens when
you evaluate "è" < "o" (it depends)?
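
As a sketch of the "it depends": a locale-unaware comparison looks only at
bytes. The helper name is hypothetical; strcmp itself is standard C and
compares bytes as unsigned char.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: compare two UTF-8 strings the way a
 * locale-unaware tool does, byte by byte as unsigned values.
 * Returns <0, 0 or >0 like strcmp. */
int
bytewise_cmp(const char *a, const char *b)
{
	assert(a != NULL && b != NULL);
	return strcmp(a, b);
}
```

Since è is encoded as 0xC3 0xA8 and o is 0x6F, bytewise_cmp("\xc3\xa8", "o")
is positive: è sorts after o, and in fact after z. A locale-aware comparison
such as strcoll under a French locale would be expected to place it near e
instead.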

Fixing awk is much harder than anyone thinks.  I had a chat about it with
Brian Kernighan and he says he's been thinking about fixing awk for a
long time, but that it really is a hard problem.

	Sape


* Re: [9fans] awk, not utf aware...
  2008-02-27 15:54     ` Sape Mullender
@ 2008-02-27 20:01       ` Uriel
  2008-02-28 19:06         ` [9fans] localization, unicode, regexps (was: awk, not utf aware...) Tristan Plumb
  2008-02-28 15:10       ` [9fans] awk, not utf aware erik quanstrom
  1 sibling, 1 reply; 21+ messages in thread
From: Uriel @ 2008-02-27 20:01 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

None of those issues are specific to AWK, they apply just as well to
sed(1) or any program dealing with regexps. I think the plan9 tools
demonstrate that it is not so hard to find a 'good enough' solution;
and the lunix locale debacle demonstrates that if you want to get it
'right' you will end up with a nightmare.

The problem with awk is that it is not a native plan9 app, and its
simian nature shows in too many places. For example system() and | are
badly broken:

%  echo |awk '{print |"echo $KSH_VERSION"}'
@(#)PD KSH v5.2.14 99/07/13.2

Boyd made a native port of awk that fixed most (all?) of these issues;
it can be found somewhere in his contrib dir, but I don't think it is
production-ready.

uriel


* Re: [9fans] awk, not utf aware...
@ 2008-02-28 15:10       ` erik quanstrom
  2008-03-03 23:48         ` Jack Johnson
  0 siblings, 1 reply; 21+ messages in thread
From: erik quanstrom @ 2008-02-28 15:10 UTC (permalink / raw)
  To: 9fans

i had to dig this off 9fans.net/archive.  htmlfmt does some very bad things
with non-ascii characters.  i hope i put them back correctly.

> Yes, and then there is locale: does [a-z] include ĳ when you run it
> in Holland (it should)?  Does it include á, è, ô in France (it should)?
> Does it include ø, å in Norway (it should not)?  And what happens when
> you evaluate "è" < "o" (it depends)?
> 
> Fixing awk is much harder than anyone thinks.  I had a chat about it with
> Brian Kernighan and he says he's been thinking about fixing awk for a
> long time, but that it really is a hard problem.

how does a program know where it's being run?  ☺ how do you write a
program that processes byte streams from a dutch user and from a
norwegian?  how does one deal with a multi-language file?

i see some problems with localized regexps.  like pre-utf character
sets, it's impossible to tell from a byte stream what the character
set is.  two users can run the same program and get different results.
(how do you test in an environment like this?) and, of course, you
can't switch locale within a file making multi-language files
difficult.

perhaps it would be more effective to break down the concept
a bit.  instead of a general locale hammer, why not expose some
operations that could go into a locale?  for example, have a
base-character folding switch that allows regexps to fold codepoints
into base codepoints so that íïìîi -> i.  this information is in the
unicode tables.  perhaps the language-dependent character mapping
should be specified explicitly. &c.
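
a minimal sketch of such a folding step, assuming a hand-written table that
covers only the i variants above.  a real version would be generated from
the unicode decomposition data; the names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fold table: precomposed code point -> ASCII base. */
static const struct {
	unsigned int rune;
	char base;
} foldtab[] = {
	{ 0x00ec, 'i' },	/* i with grave */
	{ 0x00ed, 'i' },	/* i with acute */
	{ 0x00ee, 'i' },	/* i with circumflex */
	{ 0x00ef, 'i' },	/* i with diaeresis */
};

/* Fold src into dst; dst needs room for strlen(src)+1 bytes.
 * Only two-byte UTF-8 sequences are looked up in the table;
 * everything else is copied unchanged. */
void
fold_base(char *dst, const char *src)
{
	const unsigned char *s = (const unsigned char *)src;
	unsigned char *d = (unsigned char *)dst;
	size_t i, n = sizeof foldtab / sizeof foldtab[0];

	assert(dst != NULL && src != NULL);
	while (*s != '\0') {
		if ((s[0] & 0xe0) == 0xc0 && (s[1] & 0xc0) == 0x80) {
			/* decode the two-byte sequence to a code point */
			unsigned int r = ((s[0] & 0x1fu) << 6) | (s[1] & 0x3fu);

			for (i = 0; i < n; i++)
				if (foldtab[i].rune == r)
					break;
			if (i < n) {
				*d++ = (unsigned char)foldtab[i].base;
				s += 2;
				continue;
			}
		}
		*d++ = *s++;
	}
	*d = '\0';
}
```

a matcher could fold both the pattern and a copy of the text this way and
still report the original, unfolded line.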

- erik


* [9fans] localization, unicode, regexps (was: awk, not utf aware...)
  2008-02-27 20:01       ` Uriel
@ 2008-02-28 19:06         ` Tristan Plumb
  0 siblings, 0 replies; 21+ messages in thread
From: Tristan Plumb @ 2008-02-28 19:06 UTC (permalink / raw)
  To: 9fans

> erik | Sape * uriel

I have been pondering character sets rather a lot recently (mostly wishful
thinking, by my estimation), so this conversation set me thinking more...

> how does one deal with a multi-language file?
By not dealing in languages? Unicode (however flawed) solves multi-script
files, why mire ourselves in mutable (scripts are plenty) language rules?

> for example, have a base-character folding switch that allows regexps
> to fold codepoints into base codepoints so that íïìîi -> i.
I would favor decomposing codepoints (í→í, ï→ï, ì→ì, î→î) with the switch
to ignore combining characters, that has the disadvantage of lengthening,
by a byte or rune a time, your text, but does allow you to match accents.
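
A minimal sketch of the "ignore combining characters" switch, assuming the
text has already been decomposed and considering only the Combining
Diacritical Marks block U+0300..U+036F. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: copy src to dst, dropping code points in
 * U+0300..U+036F, which UTF-8 encodes as 0xCC 0x80..0xBF and
 * 0xCD 0x80..0xAF.  Precomposed letters such as U+00ED pass
 * through unchanged.  dst needs room for strlen(src)+1 bytes. */
void
strip_combining(char *dst, const char *src)
{
	const unsigned char *s = (const unsigned char *)src;
	unsigned char *d = (unsigned char *)dst;

	assert(dst != NULL && src != NULL);
	while (*s != '\0') {
		if ((s[0] == 0xcc && s[1] >= 0x80 && s[1] <= 0xbf) ||
		    (s[0] == 0xcd && s[1] >= 0x80 && s[1] <= 0xaf)) {
			s += 2;	/* skip the combining mark */
			continue;
		}
		*d++ = *s++;
	}
	*d = '\0';
}
```

With the switch off, the marks stay in the text and a pattern can still
match on the accent itself.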

| Yes, and then there is locale: does [a-z] include ĳ when you run it
| in Holland (it should)?  Does it include á, è, ô in France (it should)?
| Does it include ø, å in Norway (it should not)?  And what happens when
| you evaluate "è"< "o" (it depends)?
Does spanish [a-c] match the c in ch (depends on when and where you ask)?
More Unicode-centric, does 'a' match (the first byte of) 'à' (U0061+0300)
(or all three bytes, or not at all)?

I would write [a-z] in a regexp on two occasions: a letter of the latin
alphabet (better served by something like [[:latin:]] (so I needn't add a
bunch of other things ([þðæœø]))) or the bytes [61, 7a]. As any sort of a
public project is stuck with Unicode (not advocating the hysteria before,
just wishing Unicode left some of it behind), regexps reflecting Unicode,
not the user's language, makes sense to me. Unicode is at least codified.

* I think the plan9 tools demonstrate that it is not so hard to find a
* 'good enough' solution; and the lunix locale debacle demonstrates that
* if you want to get it 'right' you will end up with a nightmare.
Yet some things that are good enough (I'll pick on Unicode) for one idea,
lumping character sets together does a fine job to write multiple scripts
in the same file, spawns nightmares, ǭ = ǭ = ǭ = ǭ = ǭ, good enough being
ill-thought-out. Yet mayhap you mean well-compromised (that seems right).

To those who were at IWP9 this year: Cast your mind back to a question of
plan9 people with vested interest for RtL rendering and the like. I should
have stood up then and cried out, I! Imagine either I did so or I do now.

If anyone has interest in playing on this at a character set level, tell?

enjoy,
tristan

-- 
All original matter is hereby placed immediately under the public domain.


* Re: [9fans] awk, not utf aware...
  2008-02-27  2:38       ` Joel C. Salomon
@ 2008-02-29 17:00         ` Douglas A. Gwyn
  0 siblings, 0 replies; 21+ messages in thread
From: Douglas A. Gwyn @ 2008-02-29 17:00 UTC (permalink / raw)
  To: 9fans

"Joel C. Salomon" wrote:
> Also recall that sizeof('c') == sizeof(int).  I suspect, though, that
> literals like 'abcd' are left from the B (word-addressable, not
> byte-addressable) days.

Yes, in C ordinary character constants have always had type int.
Multi-character constants were used in the first C version of "troff",
for one example, so the language permits them even though their use
has nonportable aspects.


* Re: [9fans] awk, not utf aware...
  2008-02-28 15:10       ` [9fans] awk, not utf aware erik quanstrom
@ 2008-03-03 23:48         ` Jack Johnson
  2008-03-04  0:13           ` erik quanstrom
  0 siblings, 1 reply; 21+ messages in thread
From: Jack Johnson @ 2008-03-03 23:48 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Thu, Feb 28, 2008 at 6:10 AM, erik quanstrom <quanstro@quanstro.net> wrote:
>  perhaps it would be more effective to break down the concept
>  a bit.  instead of a general locale hammer, why not expose some
>  operations that could go into a locale?  for example, have a
>  base-character folding switch that allows regexps to fold codepoints
>  into base codepoints so that íïìîi -> i.  this information is in the
>  unicode tables.  perhaps the language-dependent character mapping
>  should be specified explicitly. &c.

Loosely-related tangent:

http://www.mail-archive.com/rsync@lists.samba.org/msg20395.html

> On the LINUX machines running utf-8 the ä is coded as $C3A4 which is
> in utf-8 equal to the character E4. The ä occupies in that way 2 bytes.
>
> I was very astonished, when I copied a mac-filename, pasted into a
> texteditor and looked at the file:
>
> In the mac-filename the letter ä is coded as: $61CC88, which in utf-8
> means the letter "a" followed by a $0308. (Combining diacritical marks)
> So the Mac combines the letter a with the two points above it instead
> using the E4 letter
> Now the things are clear: The filenames are different, in spite of
> looking equally.

So, if folding codepoints is a reasonable tactic, how many
representations do you need to fold?  How many binary representations
are needed to fold íïìîi -> i?

-Jack



* Re: [9fans] awk, not utf aware...
  2008-03-03 23:48         ` Jack Johnson
@ 2008-03-04  0:13           ` erik quanstrom
  0 siblings, 0 replies; 21+ messages in thread
From: erik quanstrom @ 2008-03-04  0:13 UTC (permalink / raw)
  To: 9fans

> > On the LINUX machines running utf-8 the ä is coded as $C3A4 which is
> > in utf-8 equal to the character E4. The ä occupies in that way 2 bytes.
> >
> > I was very astonished, when I copied a mac-filename, pasted into a
> > texteditor and looked at the file:
> >
> > In the mac-filename the letter ä is coded as: $61CC88, which in utf-8
> > means the letter "a" followed by a $0308. (Combining diacritical marks)
> > So the Mac combines the letter a with the two points above it instead
> > using the E4 letter
> > Now the things are clear: The filenames are different, in spite of
> > looking equally.
> 
> So, if folding codepoints is a reasonable tactic, how many
> representations do you need to fold?  How many binary representations
> are needed to fold íïìîi -> i?

i didn't make my point very well.  in this case i was suggesting a -f flag
for grep that would map codepoints to their base codepoints.  the match
result would be the original text --- in the manner of the -i flag.

separately, however ...

utf combining characters are a really unfortunate choice, imho.  there
is no limit to the number of combining codepoints one can add to
a base codepoint.  you can, for example build a single letter like this
	U+0061 U+0302 ... U+0302
i don't think it's possible to build legible glyphs from bitmaps using
combining diacriticals.

therefore, i would argue for reducing letters made up of base+combiners
to a precombined codepoint whenever possible.  it would be helpful
if tcs did this.  unfortunately some transliterations of russian into the roman
alphabet use characters with no precombined form in unicode.
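
a minimal sketch of that reduction, assuming a tiny hand-written table and
at most one combining mark per base letter.  this is not what tcs does;
the names are hypothetical, and a real table would come from the unicode
composition data.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical table: ASCII base + combining mark -> precombined
 * code point (all results lie in U+0080..U+07FF, two bytes). */
static const struct {
	char base;
	unsigned int mark;
	unsigned int comp;
} comptab[] = {
	{ 'a', 0x0308, 0x00e4 },	/* a + diaeresis */
	{ 'i', 0x0301, 0x00ed },	/* i + acute */
};

/* Compose src into dst; dst needs room for strlen(src)+1 bytes.
 * Pairs not in the table are copied unchanged. */
void
compose_pairs(char *dst, const char *src)
{
	const unsigned char *s = (const unsigned char *)src;
	unsigned char *d = (unsigned char *)dst;
	size_t i, n = sizeof comptab / sizeof comptab[0];

	assert(dst != NULL && src != NULL);
	while (*s != '\0') {
		/* ASCII byte followed by a two-byte sequence? */
		if (s[0] < 0x80 && (s[1] & 0xe0) == 0xc0 && (s[2] & 0xc0) == 0x80) {
			unsigned int m = ((s[1] & 0x1fu) << 6) | (s[2] & 0x3fu);

			for (i = 0; i < n; i++)
				if ((unsigned char)comptab[i].base == s[0] && comptab[i].mark == m)
					break;
			if (i < n) {
				/* re-encode the precombined code point */
				*d++ = (unsigned char)(0xc0 | (comptab[i].comp >> 6));
				*d++ = (unsigned char)(0x80 | (comptab[i].comp & 0x3f));
				s += 3;
				continue;
			}
		}
		*d++ = *s++;
	}
	*d = '\0';
}
```

with this, the mac-style bytes 61 CC 88 come out as C3 A4, the same bytes the
linux filename used.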

rob probably has a more informed opinion on this than i.

- erik

