9fans - fans of the OS Plan 9 from Bell Labs
From: "Karljurgen Feuerherm" <kfeuerherm@wlu.ca>
To: <9fans@9fans.net>
Subject: Re: [9fans] Lex, Yacc, Unicode Plane 1
Date: Thu, 28 Jan 2010 15:59:29 -0500
Message-ID: <4B61B461020000CC0001D4FA@wlgw07.wlu.ca>
In-Reply-To: <982374feab1ff1d8ea2f176256d16934@plan9.bell-labs.com>


Thanks, Geoff and Erik.

However (going by my 5-minute intro to Runes, courtesy of the Hello World
doc), we're still talking BMP only, right?
 
(I programmed in B back in the day, i.e. around 1980, and owing to a career
shift have been out of things for a while, so forgive any obtuseness on my
part as I gradually reintegrate!)
 
This reminds me of what I read here: http://www.w3.org/2005/03/23-lex-U

 
K
 
Karljürgen G. Feuerherm, PhD
Department of Archaeology and Classical Studies
Wilfrid Laurier University
75 University Avenue West
Waterloo, Ontario N2L 3C5
Tel. (519) 884-1970 x3193
Fax (519) 883-0991 (ATTN Arch. & Classics)

>>> <geoff@plan9.bell-labs.com> 28/01/2010 3:46:27 pm >>>
I've extended old code using lex to accept utf by massaging the input
stream, before lex sees it, to parse utf and encode non-ascii Runes
into '\33' (escape) followed by 4 hex digits. A simple lex rule then
decodes for the benefit of yacc.

This encodes:

/*
 * lex can't cope with character sets wider than 8 bits, so convert
 * s to runes and encode non-ascii runes as <esc><hex><hex><hex><hex>.
 * result is malloced.
 */
char *
utf2lex(char *s)
{
	int nb, bytes;
	Rune r;
	char *news, *p, *ds;

	/* pass 1: count bytes needed by the converted string; watch for UTF */
	for (p = s, nb = 0; *p != '\0'; p += bytes, nb++) {
		bytes = chartorune(&r, p);
		if (bytes > 1)
			nb += 4;
	}
	news = malloc(nb+1);
	if (news != 0) {
		/* pass 2: convert s into new string */
		news[nb] = '\0';
		for (p = s, ds = news; *p != '\0'; p += bytes) {
			bytes = chartorune(&r, p);
			if (bytes == 1)
				*ds++ = r;
			else
				ds += sprint(ds, "\33%.4ux", (int)r);
		}
	}
	return news;
}
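
On the BMP question: the encoder above emits exactly four hex digits per escape, and four hex digits can represent nothing above U+FFFF, so a Plane 1 code point would not fit as written. Below is a minimal sketch of one way the escape could be widened to six hex digits. It is portable C rather than Plan 9 C, the names utf8decode and utf2lex6 are hypothetical (they are not from the code above), the decoder does not reject overlong or surrogate forms, and whether a given lex is happy with the longer escapes is an assumption, not something tested here.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not from the original post): decode one UTF-8
 * sequence at s into *cp and return the number of bytes consumed.
 * A bad lead or continuation byte yields U+FFFD and consumes one byte.
 */
static int
utf8decode(unsigned long *cp, const char *s)
{
	const unsigned char *u = (const unsigned char *)s;
	int i, n;

	if (u[0] < 0x80) {
		*cp = u[0];
		return 1;
	}
	if ((u[0] & 0xE0) == 0xC0) {
		n = 2;
		*cp = u[0] & 0x1F;
	} else if ((u[0] & 0xF0) == 0xE0) {
		n = 3;
		*cp = u[0] & 0x0F;
	} else if ((u[0] & 0xF8) == 0xF0) {
		n = 4;
		*cp = u[0] & 0x07;
	} else {
		*cp = 0xFFFD;
		return 1;
	}
	for (i = 1; i < n; i++) {
		if ((u[i] & 0xC0) != 0x80) {	/* also catches a premature '\0' */
			*cp = 0xFFFD;
			return 1;
		}
		*cp = (*cp << 6) | (u[i] & 0x3F);
	}
	return n;
}

/*
 * Like utf2lex above, but each non-ASCII code point becomes
 * <esc> followed by six hex digits.  Result is malloced; NULL on failure.
 */
char *
utf2lex6(const char *s)
{
	const char *p;
	char *news, *ds;
	unsigned long cp;
	size_t nb;
	int bytes;

	/* pass 1: 1 byte per ASCII character, 1+6 per escaped code point */
	for (p = s, nb = 0; *p != '\0'; p += bytes) {
		bytes = utf8decode(&cp, p);
		nb += cp < 0x80 ? 1 : 7;
	}
	news = malloc(nb + 1);
	if (news == NULL)
		return NULL;

	/* pass 2: convert s into the new string */
	for (p = s, ds = news; *p != '\0'; p += bytes) {
		bytes = utf8decode(&cp, p);
		if (cp < 0x80)
			*ds++ = (char)cp;
		else
			ds += sprintf(ds, "\33%.6lx", cp);
	}
	*ds = '\0';
	return news;
}
```

For instance, the UTF-8 bytes F0 92 80 80 (U+12000, in Plane 1) should come out as the escape character followed by 012000.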

This lex code decodes:

%{
char *lex2rune(Rune *rp, char *s);
char *estrdup(char *);

static Rune inrune;
%}
E	\33
%%
{E}....	{
		yylval.charp = estrdup(lex2rune(&inrune, yytext+1));
		return inrune;
	}
%%
char *
lex2rune(Rune *rp, char *s)
{
	static char utf[UTFmax+1];

	*rp = strtoul(s, 0, 16);
	utf[runetochar(utf, rp)] = '\0';
	return utf;
}
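
If the escape were widened to six hex digits to reach Plane 1, the decoding side would presumably need a pattern with six dots after the escape instead of four, plus a helper that turns the six digits back into UTF-8. A rough sketch of such a helper in portable C follows. The names utf8encode and lex2utf6 are hypothetical and not part of the code above, and the sketch does no validation of surrogates or out-of-range values.

```c
#include <stdlib.h>
#include <string.h>

enum {
	Ndigits = 6,	/* hex digits after the escape (assumed width) */
	Utf8max = 4	/* longest UTF-8 sequence for U+0000..U+10FFFF */
};

/*
 * Hypothetical helper: encode cp as UTF-8 into buf, which must hold
 * at least Utf8max bytes; returns the number of bytes written.
 */
static int
utf8encode(char *buf, unsigned long cp)
{
	if (cp < 0x80) {
		buf[0] = (char)cp;
		return 1;
	}
	if (cp < 0x800) {
		buf[0] = (char)(0xC0 | (cp >> 6));
		buf[1] = (char)(0x80 | (cp & 0x3F));
		return 2;
	}
	if (cp < 0x10000) {
		buf[0] = (char)(0xE0 | (cp >> 12));
		buf[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
		buf[2] = (char)(0x80 | (cp & 0x3F));
		return 3;
	}
	buf[0] = (char)(0xF0 | ((cp >> 18) & 0x07));
	buf[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
	buf[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
	buf[3] = (char)(0x80 | (cp & 0x3F));
	return 4;
}

/*
 * s points just past the escape character, at the hex digits.
 * Stores the code point in *cpp and returns its UTF-8 form in a
 * static buffer, which the next call overwrites.
 */
char *
lex2utf6(unsigned long *cpp, const char *s)
{
	static char utf[Utf8max + 1];
	char hex[Ndigits + 1];

	/* copy at most Ndigits characters so strtoul cannot run on */
	strncpy(hex, s, Ndigits);
	hex[Ndigits] = '\0';
	*cpp = strtoul(hex, NULL, 16);
	utf[utf8encode(utf, *cpp)] = '\0';
	return utf;
}
```

As in lex2rune, the caller has to copy the result (estrdup in the rule above) before the next call reuses the static buffer.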
