9fans - fans of the OS Plan 9 from Bell Labs
From: Martin Neubauer <m.ne@gmx.net>
To: Fans of the OS Plan 9 from Bell Labs <9fans@cse.psu.edu>
Subject: Re: [9fans] awk, not utf aware...
Date: Tue, 26 Feb 2008 14:16:13 +0100
Message-ID: <20080226131613.GA811@shodan.homeunix.net>
In-Reply-To: <599f06db0802260418m1c2732fdt1487051c59152e27@mail.gmail.com>

Awk is one of the few programs in the distribution that is maintained
externally (by Brian Kernighan) and is pulled in via ape and pcc (it might
actually be the only one; I didn't bother to check). A quick glance at
lex.c suggests that awk scans input one char at a time. In hindsight I'm a
bit surprised that I haven't been bitten by this, but I probably never split
within multibyte sequences. It's probably not too hard to change awk to read
runes, at the price of creating ``the other one true awk.''
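
To make the char-versus-rune difference concrete, here is a minimal sketch
in portable C. It is not taken from awk's source, and utf8_len and
split_runes are hypothetical names made up for illustration. On Plan 9 one
would presumably use chartorune(2) instead of hand-decoding. The sketch
only looks at the lead byte to decide how far to advance; it does not
validate continuation bytes.

```c
#include <stddef.h>

/* Hypothetical helper (not from awk): number of bytes in the UTF-8
 * sequence that starts with lead byte c. Invalid lead bytes count as
 * 1 so that malformed input still advances. */
size_t utf8_len(unsigned char c)
{
	if (c < 0x80)
		return 1;
	if ((c & 0xE0) == 0xC0)
		return 2;
	if ((c & 0xF0) == 0xE0)
		return 3;
	if ((c & 0xF8) == 0xF0)
		return 4;
	return 1;
}

/* Split s into whole characters instead of single bytes.
 * Each character is copied as a NUL-terminated string into out[n]
 * (at most 4 bytes plus the terminator). Returns the number of
 * characters stored, never more than max. */
size_t split_runes(const char *s, char out[][5], size_t max)
{
	size_t n = 0;

	while (*s != '\0' && n < max) {
		size_t len = utf8_len((unsigned char)*s);
		size_t i;

		/* Stop early if the string ends inside a sequence. */
		for (i = 0; i < len && s[i] != '\0'; i++)
			out[n][i] = s[i];
		out[n][i] = '\0';
		s += i;
		n++;
	}
	return n;
}
```

With the UTF-8 input "aícd" (5 bytes, since í is encoded as 0xC3 0xAD),
split_runes yields 4 elements, whereas a byte-at-a-time split yields 5 and
cuts the í in half.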

	Martin

* Gorka Guardiola (paurea@gmail.com) wrote:
> I think this has come up before, but I didn't find a reply.
> If I do in awk something like:
> 
> split($0, c, "");
> 
> c should be an array of Runes internally, UTF externally, but apparently
> it is not. Is it just broken? Is there a replacement? Is it just the
> builtins, or is the whole of awk broken?
> 
> Example, freqpair
> 
> ------
> #!/bin/awk -f
> 
> {
> 	n = split($0, c , "");
> 	for(i=1; i<n; i++){
> 		pair=c[i] c[i+1]
> 		f[pair]++;
> 	}
> }
> END{
> 	for(h in f)
> 		printf("%d %s\n", f[h], h);
> }
> 
> ------
> 
> % echo abcd|freqpair
> 1 ab
> 1 cd
> 1 bc
> % echo aícd|freqpair
> 1 cd
> 1 �c
> 1 í
> 1 a�
> 
> 
> where the � is a Peter face...
> 
> Thanks.
> 
> -- 
> - curiosity sKilled the cat


