From mboxrd@z Thu Jan 1 00:00:00 1970
Message-Id: <4B61B461020000CC0001D4FA@wlgw07.wlu.ca>
Date: Thu, 28 Jan 2010 15:59:29 -0500
From: "Karljurgen Feuerherm"
To: <9fans@9fans.net>
References: <4B61A280020000CC0001D4A1@wlgw07.wlu.ca> <982374feab1ff1d8ea2f176256d16934@plan9.bell-labs.com>
In-Reply-To: <982374feab1ff1d8ea2f176256d16934@plan9.bell-labs.com>
Subject: Re: [9fans] Lex, Yacc, Unicode Plane 1
Topicbox-Message-UUID: c9cdb49e-ead5-11e9-9d60-3106f5b1d025

Thanks, Geoff, and Erik.

However... (with my 5 minute intro to Runes courtesy of Hello World doc...) we're still talking BMP, right?

(I programmed in B back in the day... i.e. 1980-ish and due to a career shift have been out of things for a while, so forgive my potential obtuseness as I gradually reintegrate...!)

This reminds me of what I read here: http://www.w3.org/2005/03/23-lex-U

K

Karljürgen G. Feuerherm, PhD
Department of Archaeology and Classical Studies
Wilfrid Laurier University
75 University Avenue West
Waterloo, Ontario N2L 3C5
Tel. (519) 884-1970 x3193
Fax (519) 883-0991 (ATTN Arch. & Classics)

>>> <geoff@plan9.bell-labs.com> 28/01/2010 3:46:27 pm >>>
I've extended old code using lex to accept utf by massaging the input
stream, before lex sees it, to parse utf and encode non-ascii Runes
into '\33' (escape) followed by 4 hex digits. A simple lex rule then
decodes for the benefit of yacc.

This encodes:

/*
 * lex can't cope with character sets wider than 8 bits, so convert
 * s to runes and encode non-ascii runes as <esc><hex><hex><hex><hex>.
 * result is malloced.
 */
char *
utf2lex(char *s)
{
	int nb, bytes;
	Rune r;
	char *news, *p, *ds;

	/* pass 1: count bytes needed by the converted string; watch for UTF */
	for (p = s, nb = 0; *p != '\0'; p += bytes, nb++) {
		bytes = chartorune(&r, p);
		if (bytes > 1)
			nb += 4;
	}
	news = malloc(nb+1);
	if (news != 0) {
		/* pass 2: convert s into new string */
		news[nb] = '\0';
		for (p = s, ds = news; *p != '\0'; p += bytes) {
			bytes = chartorune(&r, p);
			if (bytes == 1)
				*ds++ = r;
			else
				ds += sprint(ds, "\33%.4ux", (int)r);
		}
	}
	return news;
}
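
For readers without the Plan 9 C library at hand, the encoding step can be sketched in portable C. This is only an illustration, not the code above: decode_utf8 and encode_for_lex are hypothetical names, decode_utf8 is a simplified stand-in for chartorune (it does not reject overlong forms), and refusing code points above U+FFFF is an assumption made here so that every escape stays exactly 4 hex digits.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for chartorune: decode one UTF-8 sequence
 * from s into *cp and return the number of bytes consumed.
 * Malformed input yields U+FFFD and consumes one byte.
 */
static int
decode_utf8(unsigned long *cp, const char *s)
{
	const unsigned char *u = (const unsigned char *)s;
	unsigned long c;
	int i, n;

	if (u[0] < 0x80) {
		*cp = u[0];
		return 1;
	}
	if ((u[0] & 0xE0) == 0xC0) { n = 2; c = u[0] & 0x1F; }
	else if ((u[0] & 0xF0) == 0xE0) { n = 3; c = u[0] & 0x0F; }
	else if ((u[0] & 0xF8) == 0xF0) { n = 4; c = u[0] & 0x07; }
	else { *cp = 0xFFFD; return 1; }
	for (i = 1; i < n; i++) {
		/* a NUL or any non-continuation byte ends the sequence early */
		if ((u[i] & 0xC0) != 0x80) { *cp = 0xFFFD; return 1; }
		c = (c << 6) | (u[i] & 0x3F);
	}
	*cp = c;
	return n;
}

/*
 * ASCII passes through; any other code point becomes ESC followed by
 * exactly 4 hex digits.  Returns a malloced string, or NULL if
 * allocation fails or a code point lies above U+FFFF (assumption:
 * such input is refused rather than truncated).
 */
char *
encode_for_lex(const char *s)
{
	const char *p;
	char *out, *d;
	unsigned long cp;
	size_t nb = 0;
	int n;

	/* pass 1: count output bytes */
	for (p = s; *p != '\0'; p += n) {
		n = decode_utf8(&cp, p);
		if (cp > 0xFFFF)
			return NULL;
		nb += (cp < 0x80) ? 1 : 5;
	}
	out = malloc(nb + 1);
	if (out == NULL)
		return NULL;
	/* pass 2: write the converted string */
	for (p = s, d = out; *p != '\0'; p += n) {
		n = decode_utf8(&cp, p);
		if (cp < 0x80)
			*d++ = (char)cp;
		else
			d += sprintf(d, "\033%04lx", cp);
	}
	*d = '\0';
	return out;
}
```

With this sketch, the string "aéz" (bytes 61 c3 a9 7a) becomes "a", ESC, "00e9", "z", while a Plane 1 character such as U+12000 (bytes f0 92 80 80) makes encode_for_lex return NULL.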

and this lex code decodes:

%{
char *lex2rune(Rune *rp, char *s);
char *estrdup(char *);

static Rune inrune;
%}
E	\33
%%
{E}....	{
		yylval.charp = estrdup(lex2rune(&inrune, yytext+1));
		return inrune;
	}
%%
char *
lex2rune(Rune *rp, char *s)
{
	static char utf[UTFmax+1];

	*rp = strtoul(s, 0, 16);
	utf[runetochar(utf, rp)] = '\0';
	return utf;
}
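
The decode step can likewise be sketched in portable C, which also makes the range of the escape explicit. Again this is an illustration under stated assumptions, not the code above: decode_escape is a hypothetical name, it checks that all four characters really are hex digits (the lex pattern above accepts any four characters), and it does not reject surrogate values.

```c
#include <ctype.h>

/*
 * Hypothetical counterpart to the decode step: parse exactly 4 hex
 * digits at hex4 and write the code point as NUL-terminated UTF-8
 * into out, which must hold at least 4 bytes.
 * Returns the code point, or -1 if hex4 is not 4 hex digits.
 */
long
decode_escape(const char *hex4, char *out)
{
	unsigned long cp = 0;
	int i, c;

	for (i = 0; i < 4; i++) {
		c = (unsigned char)hex4[i];
		if (!isxdigit(c))	/* also stops at an early NUL */
			return -1;
		cp = cp * 16 + (isdigit(c) ? c - '0' : tolower(c) - 'a' + 10);
	}
	/* 4 hex digits can never exceed U+FFFF, so 3 bytes always suffice */
	if (cp < 0x80) {
		out[0] = (char)cp;
		out[1] = '\0';
	} else if (cp < 0x800) {
		out[0] = (char)(0xC0 | (cp >> 6));
		out[1] = (char)(0x80 | (cp & 0x3F));
		out[2] = '\0';
	} else {
		out[0] = (char)(0xE0 | (cp >> 12));
		out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (char)(0x80 | (cp & 0x3F));
		out[3] = '\0';
	}
	return (long)cp;
}
```

Four hex digits top out at ffff, so an escape of this width covers U+0000 through U+FFFF, which is the BMP.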
