Message-ID: <224a39ebca9aaa0370eb804cd59e6aac@plan9.jp>
To: 9fans@cse.psu.edu
Subject: Composition of regexps (Was re: [9fans] regular expressions in plan9 different from the ones in unix?)
From: Joel Salomon
Date: Fri, 23 Feb 2007 01:27:56 -0500

On 2/22/07, Russ Cox wrote:
> The Plan 9 regexp library matches the old Unix egrep command.
> Any regexp you'd try under Plan 9 should work with new egreps,
> though not vice versa -- new egreps tend to have newfangled
> additions like [:upper:] and \w and {4,6} for repetition.

This came up as I was implementing my C lexer for the compilers class
I'm taking.  How hard would it be to allow access to regcomp(2)'s
internals, so I could build up a regexp part by part a la lex?

For example, to recognize C99 hexadecimal floating-point constants, I
wrote a second program that builds up the regexp piece by piece using
smprint(2), then compiles the whole thing:

	char *decdig = "([0-9])", *hexdig = "([0-9A-Fa-f])",
		*sign = "([+\\-])", *dot = "(\\.)",
		*dseq, *dexp, *dfrac, *decflt,
		*hseq, *bexp, *hfrac, *hexflt;

	dseq = smprint("(%s+)", decdig);
	dexp = smprint("([Ee]%s?%s)", sign, dseq);
	dfrac = smprint("((%s?%s%s)|(%s%s))", dseq, dot, dseq, dseq, dot);
	decflt = smprint("(%s%s?)|(%s%s)", dfrac, dexp, dseq, dexp);
	regcomp(decflt);	// make sure it compiles
	print("decfloat: %s\n", decflt);

	hseq = smprint("(%s+)", hexdig);
	bexp = smprint("([Pp]%s?%s)", sign, dseq);
	hfrac = smprint("((%s?%s%s)|(%s%s))", hseq, dot, hseq, hseq, dot);
	hexflt = smprint("0[Xx](%s|%s)%s", hfrac, hseq, bexp);
	regcomp(hexflt);	// make sure it compiles
	print("hexfloat: %s\n", hexflt);

I know that regcomp builds up the Reprog by combining subprograms
with catenation and alternation &c., but I’d be loath to try tinkering
there directly without a much better understanding of the algorithm.
I’ve glanced through the documents at swtch.com/????? and the regcomp
source code, but just haven’t had the time for an in-depth study.

Would such a project be a worthwhile use of time?  (Might it develop
into the asteroid to kill the dinosaur waiting for it?)

--Joel
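
The piece-by-piece construction in the message can be tried outside Plan 9 as well. What follows is only a minimal sketch under stated assumptions: it uses POSIX <regex.h> (extended syntax) rather than Plan 9's regcomp(2), and xsmprint, hexfltpattern and ishexflt are hypothetical names invented here, with xsmprint standing in for smprint(2). It composes the hexadecimal pattern from the same named pieces, anchors it, and tests whole strings against it.

```c
#include <regex.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for Plan 9's smprint(2): format into malloc'd
 * memory.  Aborts on failure so callers never see NULL. */
static char *
xsmprint(const char *fmt, ...)
{
	va_list ap;
	int n;
	char *s;

	va_start(ap, fmt);
	n = vsnprintf(NULL, 0, fmt, ap);	/* measure only */
	va_end(ap);
	if(n < 0 || (s = malloc((size_t)n + 1)) == NULL)
		abort();
	va_start(ap, fmt);
	vsnprintf(s, (size_t)n + 1, fmt, ap);
	va_end(ap);
	return s;
}

/* Compose the hex-float pattern from named pieces, as in the message,
 * but anchored with ^ and $ so it must match the whole string. */
static char *
hexfltpattern(void)
{
	const char *hexdig = "[0-9A-Fa-f]", *sign = "[+-]", *dot = "\\.";
	char *dseq, *hseq, *bexp, *hfrac, *pat;

	dseq = xsmprint("([0-9]+)");
	hseq = xsmprint("(%s+)", hexdig);
	bexp = xsmprint("([Pp]%s?%s)", sign, dseq);	/* exponent digits are decimal */
	hfrac = xsmprint("((%s?%s%s)|(%s%s))", hseq, dot, hseq, hseq, dot);
	pat = xsmprint("^0[Xx](%s|%s)%s$", hfrac, hseq, bexp);
	free(dseq);
	free(hseq);
	free(bexp);
	free(hfrac);
	return pat;
}

/* Return 1 if s is matched in full by the composed pattern, else 0. */
static int
ishexflt(const char *s)
{
	regex_t re;
	char *pat;
	int ok;

	pat = hexfltpattern();
	if(regcomp(&re, pat, REG_EXTENDED|REG_NOSUB) != 0){
		free(pat);
		return 0;
	}
	ok = regexec(&re, s, 0, NULL, 0) == 0;
	regfree(&re);
	free(pat);
	return ok;
}
```

Calling ishexflt("0x1.8p3") should report a match, while ishexflt("0x1.8") should not, since the pattern requires the binary exponent. Like the fragment in the message, the sketch ignores the optional f/F/l/L suffix, and it still composes strings rather than compiled subprograms, so it does not address the question of exposing regcomp's internals.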