9fans - fans of the OS Plan 9 from Bell Labs
* Re: [9fans] awk, not utf aware...
@ 2008-02-27  9:57 erik quanstrom
  0 siblings, 0 replies; 24+ messages in thread
From: erik quanstrom @ 2008-02-27  9:57 UTC
  To: paurea, 9fans

> There is split and other functions,
> for example:
> 
> toupper("aí")
> gives
> Aí
> 
> My guess is that there are many more little (or not) corners where it
> doesn't work.
> We can go on and on looking for crevices and hiding the bugs further
> under the rug so that they are not evident and find everyone completely
> unaware, leave awk as it is now, or really fix the problem. The first
> approach doesn't work. I am going to take the second till I have time
> to take the third, which means use runes or at least revise all the
> code so that it is uniformly aware of the existence of non-ascii
> characters.

i don't understand this approach.  you propose redoing a fundamental
part of awk.   yet at the end you won't have solved the bug that's bothering
you.

ignoring the fact that awk is an ape program and doesn't use runes, the
problem with toupper is independent of the internal representation
of strings. as far as i can tell, ape doesn't even have towupper and towlower.

so if you provide those functions, fixing toupper and tolower could be
a 5 minute fix.  and you know you won't have broken anything else.
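
[a minimal sketch of the point above, in python rather than awk or c, so
the names below are illustrative and not anything from ape or awk.  the
first function case-maps byte by byte, the way a plain toupper() over a
utf-8 buffer would; the second maps whole characters, roughly what
towupper() would provide.]

```python
def toupper_bytes(s: str) -> str:
    """Uppercase only the ASCII bytes of the UTF-8 encoding of s."""
    raw = s.encode("utf-8")
    # bytes.upper() touches ASCII letters only; multibyte sequences pass through.
    return raw.upper().decode("utf-8")


def toupper_chars(s: str) -> str:
    """Uppercase s one character at a time using Unicode case mapping."""
    return s.upper()


if __name__ == "__main__":
    print(toupper_bytes("aí"))  # the accented letter is left alone
    print(toupper_chars("aí"))  # both letters are mapped
```

[the string representation is the same in both functions; only the case
mapping differs, which is why the fix can stay local to toupper and
tolower.]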

/sys/doc/utf.ps is worth a read.  it's not too hard to think of situations
that depend on character boundaries or operate on non-ascii characters.
generally there are few.  for example, rc only bothers with character
boundaries in matching. perhaps you could build a utf testsuite for awk.
make sure to use non-latin1 languages, too.
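
[one possible shape for such a testsuite, sketched in python.  the case
list and the expected() helper are hypothetical; the idea is only to
tabulate character-level answers that a utf-aware awk's length, toupper
and substr could be compared against.  python's upper() applies full
unicode case mapping, which may differ from a given towupper in corner
cases.]

```python
# Hypothetical test inputs: Latin-1 range, Greek, Cyrillic, and CJK.
CASES = ["aí", "αβγ", "привет", "日本語"]


def expected(s: str) -> dict:
    """Character-level answers a UTF-aware awk should agree with."""
    return {
        "length": len(s),                  # characters, not bytes
        "nbytes": len(s.encode("utf-8")),  # what a byte-oriented awk would count
        "toupper": s.upper(),
        "substr_1_1": s[:1],               # corresponds to awk's substr(s, 1, 1)
    }


if __name__ == "__main__":
    for s in CASES:
        print(s, expected(s))
```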

- erik


* Re: [9fans] awk, not utf aware...
@ 2008-02-28 18:54 Aharon Robbins
  2008-02-28 21:48 ` Uriel
  0 siblings, 1 reply; 24+ messages in thread
From: Aharon Robbins @ 2008-02-28 18:54 UTC
  To: 9fans

> Date: Wed, 27 Feb 2008 21:01:33 +0100
> From: Uriel <uriel99@gmail.com>
> Subject: Re: [9fans] awk, not utf aware...
> To: Fans of the OS Plan 9 from Bell Labs <9fans@cse.psu.edu>
>
> None of those issues are specific to AWK, they apply just as well to
> sed(1) or any program dealing with regexps. I think the plan9 tools
> demonstrate that it is not so hard to find a 'good enough' solution;
> and the lunix locale debacle demonstrates that if you want to get it
> 'right' you will end up with a nightmare.

Plan 9 had the luxury of starting over with Unicode from the ground
up. Many of the C mb* interfaces predate Unicode, as do many of the
character encodings in use in different parts of the world. Unix vendors
(and standards bodies) have the very real problems of trying to make
their software work, and continue to work for the foreseeable future,
in different countries, encodings, etc.

I am not saying that the POSIX locale stuff is wonderful, elegant,
clean, etc.  It has real problems, and for the most recent gawk
release, gawk no longer uses the locale's decimal point for numeric
output by default.

But one has to give the standards groups and Unix vendors credit for
trying to grapple with a real problem instead of side stepping it and
then crowing about it.

> The problem with awk is that it is not a native plan9 app, and its
> simian nature shows in too many places. For example system() and | are
> badly broken:
>
> %  echo |awk '{print |"echo $KSH_VERSION"}'
> @(#)PD KSH v5.2.14 99/07/13.2

Why is this broken?  If the shell that awk is running is PDKSH, or
KSH_VERSION exists in the environment, this is to be expected.

For awk specifically, off the top of my head, the functions that have to
be character-set aware are: index, substr, length, tolower, toupper, and
match.  Gawk has been multibyte aware for several years, although there
were some bugs initially.  And someone recently pointed out another one:

	str = sprintf("%.5s", otherstr)

has to work in terms of characters, not bytes, which I overlooked
and still have to fix.
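
[A minimal sketch of the difference, in Python rather than awk; the
function names are illustrative only and this is not gawk's code.
Cutting the UTF-8 bytes at position 5 can land in the middle of a
character, while a precision counted in characters cannot.]

```python
def truncate_bytes(s: str, n: int) -> str:
    """Keep the first n bytes of the UTF-8 encoding, as a byte-oriented %.Ns would."""
    cut = s.encode("utf-8")[:n]
    # A sequence chopped in half decodes to U+FFFD (the replacement character).
    return cut.decode("utf-8", errors="replace")


def truncate_chars(s: str, n: int) -> str:
    """Keep the first n characters; Python's %-formatting counts characters."""
    return "%.*s" % (n, s)


if __name__ == "__main__":
    otherstr = "áéíóúx"
    print(truncate_bytes(otherstr, 5))
    print(truncate_chars(otherstr, 5))
```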

> Boyd made a native port of awk that fixed most (all?) of these issues,
> it can be found somewhere in his contrib dir but I don't think it is
> production-ready.

I remember talking to him about this some, since for a long while the Plan
9 awk was one that was forked from BWK's circa 1993 and needed updating.

> On Wed, Feb 27, 2008 at 4:54 PM, Sape Mullender
> <sape@plan9.bell-labs.com> wrote:
> > > There is split and other functions,
> >  > for example:
> >  >
> >  > toupper("aí")
> >  > gives
> >  > Aí
> >  >
> >  > My guess is that there are many more little (or not) corners where it
> >  > doesn't work.
> >
> >  Yes, and then there is locale: does [a-z] include ĳ when you run it
> >  in Holland (it should)?  Does it include á, è, ô in France (it should)?
> >  Does it include ø, å in Norway (it should not)?  And what happens when
> >  you evaluate "è" < "o" (it depends)?
> >
> >  Fixing awk is much harder than anyone thinks.  I had a chat about it with
> >  Brian Kernighan and he says he's been thinking about fixing awk for a
> >  long time, but that it really is a hard problem.

Indeed.  I bit the bullet; Brian hasn't been willing to suffer the complaints,
and I don't blame him. :-)  You can see some of his travails by looking
at the CHANGES file in his distribution, available from his Bell Labs
and Princeton web pages.

As far as I know, gawk and the Solaris /usr/xpg4/bin/awk are the only
awks that are multibyte aware.  The Solaris version is derived from the MKS
one (see the code from opensolaris.org) with multibyte fixes. I can supply
simple patches to make it compile on Linux if anyone wants.  This version
doesn't handle some dark corners, but has the advantage of being
very small.

Arnold


* Re: [9fans] awk, not utf aware...
@ 2008-02-28 15:10 erik quanstrom
  2008-03-03 23:48 ` Jack Johnson
  0 siblings, 1 reply; 24+ messages in thread
From: erik quanstrom @ 2008-02-28 15:10 UTC
  To: 9fans

i had to dig this off 9fans.net/archive.  htmlfmt does some very bad things
with non-ascii characters.  i hope i put them back correctly.

> Yes, and then there is locale: does [a-z] include ij when you run it
> in Holland (it should)?  Does it include á, è, ô in France (it should)?
> Does it include ø, å in Norway (it should not)?  And what happens when
> you evaluate "è"< "o" (it depends)?
> 
> Fixing awk is much harder than anyone thinks.  I had a chat about it with
> Brian Kernighan and he says he's been thinking about fixing awk for a
> long time, but that it really is a hard problem.

how does a program know where it's being run?  ☺ how do you write a
program that processes byte streams from a dutch user and from a
norwegian?  how does one deal with a multi-language file?

i see some problems with localized regexps.  like pre-utf character
sets, it's impossible to tell from a byte stream what the character
set is.  two users can run the same program and get different results.
(how do you test in an environment like this?) and, of course, you
can't switch locale within a file making multi-language files
difficult.

perhaps it would be more effective to break down the concept
a bit.  instead of a general locale hammer, why not expose some
operations that could go into a locale?  for example, have a
base-character folding switch that allows regexps to fold codepoints
into base codepoints so that íïìîi -> i.  this information is in the
unicode tables.  perhaps the language-dependent character mapping
should be specified explicitly. &c.
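
[a rough sketch of base-character folding, in python; fold_base is a
made-up name, and canonical decomposition (NFD) followed by dropping
combining marks is one assumed way to get at the unicode table data,
not a proposal for how awk or regexp should do it.]

```python
import unicodedata


def fold_base(s: str) -> str:
    """Fold each character to its base by dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


if __name__ == "__main__":
    print(fold_base("íïìîi"))
    # Codepoint order puts "è" after "o"; the folded form sorts before it.
    print("è" < "o", fold_base("è") < "o")
    # å has a canonical decomposition and folds to a; ø has none and stays put.
    print(fold_base("å"), fold_base("ø"))
```

[note that å folds to a while ø is left alone, so the tables by
themselves don't settle the norwegian question; that part would still
need the explicit language-dependent mapping.]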

- erik


* [9fans] awk, not utf aware...
@ 2008-02-26 12:18 Gorka Guardiola
  2008-02-26 13:16 ` Martin Neubauer
  2008-02-26 20:24 ` erik quanstrom
  0 siblings, 2 replies; 24+ messages in thread
From: Gorka Guardiola @ 2008-02-26 12:18 UTC
  To: Fans of the OS Plan 9 from Bell Labs

I think this has come up before, but I didn't find a reply.
If I do in awk something like:

split($0, c, "");

c should be an array of Runes internally, UTF externally, but apparently
it is not. Is it just broken? Is there a replacement? Is it just the
builtins, or is the whole awk broken?

Example, freqpair

------
#!/bin/awk -f

{
	n = split($0, c , "");
	for(i=1; i<n; i++){
		pair=c[i] c[i+1]
		f[pair]++;
	}
}
END{
	for(h in f)
		printf("%d %s\n", f[h], h);
}

------

% echo abcd|freqpair
1 ab
1 cd
1 bc
% echo aícd|freqpair
1 cd
1 �c
1 í
1 a�


where the ? is a Peter face...
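
[For comparison, a minimal Python sketch of the same pair count; the
pairs() helper is illustrative and not part of the awk script. Run over
the characters it gives three pairs; run over the UTF-8 bytes it gives
four, two of which split the í in half, which matches the output above.]

```python
from collections import Counter


def pairs(seq):
    """Count adjacent pairs in seq; works on str (characters) or bytes."""
    return Counter(seq[i:i + 2] for i in range(len(seq) - 1))


if __name__ == "__main__":
    line = "aícd"
    print(pairs(line))                  # pairs of characters
    print(pairs(line.encode("utf-8")))  # pairs of bytes
```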

Thanks.

-- 
- curiosity sKilled the cat

