9fans - fans of the OS Plan 9 from Bell Labs
From: geoff@plan9.bell-labs.com
To: 9fans@9fans.net
Subject: Re: [9fans] Lex, Yacc, Unicode Plane 1
Date: Thu, 28 Jan 2010 15:46:27 -0500
Message-ID: <982374feab1ff1d8ea2f176256d16934@plan9.bell-labs.com>
In-Reply-To: <4B61A280020000CC0001D4A1@wlgw07.wlu.ca>

I've extended old lex-based code to accept UTF by massaging the input
stream before lex sees it: parse the UTF and encode each non-ASCII Rune
as '\33' (escape) followed by 4 hex digits.  A simple lex rule then
decodes it for the benefit of yacc.

This encodes:

/*
 * lex can't cope with character sets wider than 8 bits, so convert
 * s to runes and encode non-ascii runes as <esc><hex><hex><hex><hex>.
 * 4 hex digits suffice only while a Rune fits in 16 bits.
 * result is malloced; returns 0 if malloc fails.
 */
char *
utf2lex(char *s)
{
	int nb, bytes;
	Rune r;
	char *news, *p, *ds;

	/* pass 1: count bytes needed by the converted string; watch for UTF */
	for (p = s, nb = 0; *p != '\0'; p += bytes, nb++) {
		bytes = chartorune(&r, p);
		if (bytes > 1)
			nb += 4;
	}
	news = malloc(nb+1);
	if (news != 0) {
		/* pass 2: convert s into new string */
		news[nb] = '\0';
		for (p = s, ds = news; *p != '\0'; p += bytes) {
			bytes = chartorune(&r, p);
			if (bytes == 1)
				*ds++ = r;
			else
				ds += sprint(ds, "\33%.4ux", (int)r);
		}
	}
	return news;
}
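
For anyone wanting to try the encoding step outside Plan 9, here is a rough portable sketch of the same two-pass idea. The names decode_utf8 and utf8_to_lex are made up for this illustration, and decode_utf8 is only a simplified stand-in for chartorune, not its real implementation. Like the code above, it assumes 4 hex digits, so it covers the BMP only. One deliberate difference, which is my choice and not the original's: it tests the decoded value rather than the byte count, so a malformed byte is escaped as fffd.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in for Plan 9's chartorune: decode one UTF-8
 * sequence of up to 3 bytes (BMP only) from s into *r and return the
 * number of bytes consumed.  Malformed input consumes 1 byte and
 * yields 0xFFFD.  Overlong forms are not rejected in this sketch.
 */
static int
decode_utf8(unsigned *r, const char *s)
{
	const unsigned char *u = (const unsigned char *)s;

	if (u[0] < 0x80) {
		*r = u[0];
		return 1;
	}
	if ((u[0] & 0xE0) == 0xC0 && (u[1] & 0xC0) == 0x80) {
		*r = (u[0] & 0x1Fu) << 6 | (u[1] & 0x3Fu);
		return 2;
	}
	if ((u[0] & 0xF0) == 0xE0 && (u[1] & 0xC0) == 0x80 &&
	    (u[2] & 0xC0) == 0x80) {
		*r = (u[0] & 0x0Fu) << 12 | (u[1] & 0x3Fu) << 6 | (u[2] & 0x3Fu);
		return 3;
	}
	*r = 0xFFFD;
	return 1;
}

/*
 * Portable analogue of utf2lex (assumed name): ASCII is copied through,
 * anything else becomes <esc> plus 4 hex digits.  The result is
 * malloced; NULL is returned if malloc fails.
 */
char *
utf8_to_lex(const char *s)
{
	size_t nb = 0;
	unsigned r;
	int n;
	const char *p;
	char *news, *ds;

	assert(s != NULL);
	/* pass 1: 1 byte per ASCII char, 5 (esc + 4 hex) for anything else */
	for (p = s; *p != '\0'; p += n) {
		n = decode_utf8(&r, p);
		nb += r < 0x80 ? 1 : 5;
	}
	news = malloc(nb + 1);
	if (news == NULL)
		return NULL;
	/* pass 2: copy ASCII, escape the rest */
	for (p = s, ds = news; *p != '\0'; p += n) {
		n = decode_utf8(&r, p);
		if (r < 0x80)
			*ds++ = (char)r;
		else
			ds += sprintf(ds, "\033%04x", r);
	}
	*ds = '\0';
	return news;
}
```

With this sketch, the three-character string a, U+00E9, b (bytes 61 c3 a9 62) comes out as a, escape, 00e9, b: seven bytes that an 8-bit lex can scan.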

and this lex code decodes:

%{
char *lex2rune(Rune *rp, char *s);
char *estrdup(char *);

static Rune inrune;
%}
E	\33
%%
{E}....			{
			yylval.charp = estrdup(lex2rune(&inrune, yytext+1));
			return inrune;
			}
%%
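/*
 * decode the hex digits at s into *rp and return the rune as utf.
 * the result lives in a static buffer, overwritten by the next call,
 * which is why the rule above estrdups it.
 */
char *
lex2rune(Rune *rp, char *s)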
{
	static char utf[UTFmax+1];

	*rp = strtoul(s, 0, 16);
	utf[runetochar(utf, rp)] = '\0';
	return utf;
}
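
The decoding direction can be sketched the same way in portable C. Again the names (encode_utf8, lex_to_utf8) are invented for illustration, and encode_utf8 is just a simplified stand-in for runetochar that handles code points up to 0xFFFF. Unlike lex2rune, which relies on yytext ending right after the digits, this version copies exactly 4 digits first, so it can be pointed into the middle of a longer string.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in for Plan 9's runetochar: write code point r
 * (assumed to be at most 0xFFFF) to buf as UTF-8 and return the byte
 * count.  buf must hold at least 3 bytes.
 */
static int
encode_utf8(char *buf, unsigned r)
{
	if (r < 0x80) {
		buf[0] = (char)r;
		return 1;
	}
	if (r < 0x800) {
		buf[0] = (char)(0xC0 | r >> 6);
		buf[1] = (char)(0x80 | (r & 0x3F));
		return 2;
	}
	buf[0] = (char)(0xE0 | r >> 12);
	buf[1] = (char)(0x80 | (r >> 6 & 0x3F));
	buf[2] = (char)(0x80 | (r & 0x3F));
	return 3;
}

/*
 * Portable analogue of lex2rune (assumed name): hex4 points at the
 * 4 hex digits that follow the escape byte.  Stores the code point in
 * *rp and returns a static NUL-terminated UTF-8 buffer that the next
 * call overwrites.
 */
const char *
lex_to_utf8(unsigned *rp, const char *hex4)
{
	static char utf[4];
	char digits[5];

	assert(rp != NULL && hex4 != NULL);
	/* take at most 4 digits, even if more text follows */
	strncpy(digits, hex4, 4);
	digits[4] = '\0';
	*rp = (unsigned)strtoul(digits, NULL, 16);
	utf[encode_utf8(utf, *rp)] = '\0';
	return utf;
}
```

Because the buffer is static, a caller that needs to keep the result should copy it before the next call, as the lex rule does with estrdup.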




