Date: Mon, 28 May 2012 19:23:35 +0200
From: Ingo Schwarze
To: tech@mdocml.bsd.lv
Subject: make recursive parsing of roff(7) escapes actually work
Message-ID: <20120528172335.GB26820@iris.usta.de>

Here I'm deleting 80 lines of code, and rendering is actually improved.

Note that parsing out pairs of parentheses is really pointless:
Closing parentheses inside nested escape sequences do not close
out previously opened parentheses; and when any escape sequence ends,
it ends no matter how many parentheses may still be open.

The distinction between numeric and non-numeric escape sequences
is harmful because groff syntactically accepts nested escapes even
inside non-numeric escapes, and we don't worry about semantics anyway.

Both aspects together simplify the numescape function to the point
that we can just drop it completely, and even the code outside
the dropped function becomes shorter.


----- Forwarded message from Ingo Schwarze -----

From: Ingo Schwarze
Sender: owner-source-changes@openbsd.org
Date: Mon, 28 May 2012 11:08:48 -0600 (MDT)
To: source-changes@cvs.openbsd.org
Subject: CVS: cvs.openbsd.org: src

CVSROOT:	/cvs
Module name:	src
Changes by:	schwarze@cvs.openbsd.org	2012/05/28 11:08:48

Modified files:
	usr.bin/mandoc : mandoc.c
	regress/usr.bin/mandoc/roff/esc: Makefile
Added files:
	regress/usr.bin/mandoc/roff/esc: h.in h.out_ascii

Log message:
Make recursive parsing of roff(7) escapes actually work in the general case,
in particular when the inner escapes are preceded or followed by other terms.
While doing so, remove lots of bogus code that was trying to make pointless
distinctions between numeric and non-numeric escape sequences, while both
actually share the same syntax and we ignore the semantics anyway.

This prevents some of the strings defined in the pod2man(1) preamble
from producing garbage output, in particular in scandinavian words.
Of course, proper rendering of scandinavian national characters
cannot be expected even with these fixes.

----- End forwarded message -----


Index: usr.bin/mandoc/mandoc.c
===================================================================
RCS file: /cvs/src/usr.bin/mandoc/mandoc.c,v
retrieving revision 1.32
diff -u -p -r1.32 mandoc.c
--- usr.bin/mandoc/mandoc.c	28 May 2012 13:00:51 -0000	1.32
+++ usr.bin/mandoc/mandoc.c	28 May 2012 16:37:53 -0000
@@ -33,71 +33,13 @@
 
 static int a2time(time_t *, const char *, const char *);
 static char *time2a(time_t);
-static int numescape(const char *);
 
-/*
- * Pass over recursive numerical expressions. This context of this
- * function is important: it's only called within character-terminating
- * escapes (e.g., \s[xxxyyy]), so all we need to do is handle initial
- * recursion: we don't care about what's in these blocks.
- * This returns the number of characters skipped or -1 if an error
- * occurs (the caller should bail).
- */
-static int
-numescape(const char *start)
-{
-	int i;
-	size_t sz;
-	const char *cp;
-
-	i = 0;
-
-	/* The expression consists of a subexpression. */
-
-	if ('\\' == start[i]) {
-		cp = &start[++i];
-		/*
-		 * Read past the end of the subexpression.
-		 * Bail immediately on errors.
-		 */
-		if (ESCAPE_ERROR == mandoc_escape(&cp, NULL, NULL))
-			return(-1);
-		return(i + cp - &start[i]);
-	}
-
-	if ('(' != start[i++])
-		return(0);
-
-	/*
-	 * A parenthesised subexpression. Read until the closing
-	 * parenthesis, making sure to handle any nested subexpressions
-	 * that might ruin our parse.
-	 */
-
-	while (')' != start[i]) {
-		sz = strcspn(&start[i], ")\\");
-		i += (int)sz;
-
-		if ('\0' == start[i])
-			return(-1);
-		else if ('\\' != start[i])
-			continue;
-
-		cp = &start[++i];
-		if (ESCAPE_ERROR == mandoc_escape(&cp, NULL, NULL))
-			return(-1);
-		i += cp - &start[i];
-	}
-
-	/* Read past the terminating ')'. */
-	return(++i);
-}
 
 enum mandoc_esc
 mandoc_escape(const char **end, const char **start, int *sz)
 {
-	char c, term, numeric;
-	int i, lim, ssz, rlim;
+	char c, term;
+	int i, rlim;
 	const char *cp, *rstart;
 	enum mandoc_esc gly;
 
@@ -105,9 +47,9 @@ mandoc_escape(const char **end, const ch
 	rstart = cp;
 	if (start)
 		*start = rstart;
-	i = lim = 0;
+	i = rlim = 0;
 	gly = ESCAPE_ERROR;
-	term = numeric = '\0';
+	term = '\0';
 
 	switch ((c = cp[i++])) {
 	/*
@@ -117,7 +59,7 @@ mandoc_escape(const char **end, const ch
 	 */
 	case ('('):
 		gly = ESCAPE_SPECIAL;
-		lim = 2;
+		rlim = 2;
 		break;
 	case ('['):
 		gly = ESCAPE_SPECIAL;
@@ -179,13 +121,13 @@ mandoc_escape(const char **end, const ch
 
 		switch (cp[i++]) {
 		case ('('):
-			lim = 2;
+			rlim = 2;
 			break;
 		case ('['):
 			term = ']';
 			break;
 		default:
-			lim = 1;
+			rlim = 1;
 			i--;
 			break;
 		}
@@ -240,7 +182,7 @@ mandoc_escape(const char **end, const ch
 		gly = ESCAPE_IGNORE;
 		if ('\'' != cp[i++])
 			return(ESCAPE_ERROR);
-		term = numeric = '\'';
+		term = '\'';
 		break;
 
 	/*
@@ -280,16 +222,16 @@ mandoc_escape(const char **end, const ch
 
 		switch (cp[i++]) {
 		case ('('):
-			lim = 2;
+			rlim = 2;
 			break;
 		case ('['):
-			term = numeric = ']';
+			term = ']';
 			break;
 		case ('\''):
-			term = numeric = '\'';
+			term = '\'';
 			break;
 		default:
-			lim = 1;
+			rlim = 1;
 			i--;
 			break;
 		}
@@ -306,70 +248,47 @@ mandoc_escape(const char **end, const ch
 	 */
 	default:
 		gly = ESCAPE_SPECIAL;
-		lim = 1;
+		rlim = 1;
 		i--;
 		break;
 	}
 
 	assert(ESCAPE_ERROR != gly);
 
-	rstart = &cp[i];
+	*end = rstart = &cp[i];
 	if (start)
 		*start = rstart;
 
 	/*
-	 * If a terminating block has been specified, we need to
-	 * handle the case of recursion, which could have their
-	 * own terminating blocks that mess up our parse. This, by the
-	 * way, means that the "start" and "size" values will be
-	 * effectively meaningless.
-	 */
-
-	ssz = 0;
-	if (numeric && -1 == (ssz = numescape(&cp[i])))
-		return(ESCAPE_ERROR);
-
-	i += ssz;
-	rlim = -1;
-
-	/*
-	 * We have a character terminator. Try to read up to that
-	 * character. If we can't (i.e., we hit the nil), then return
-	 * an error; if we can, calculate our length, read past the
-	 * terminating character, and exit.
+	 * Read up to the terminating character,
+	 * paying attention to nested escapes.
 	 */
 
 	if ('\0' != term) {
-		*end = strchr(&cp[i], term);
-		if ('\0' == *end)
+		while (**end != term) {
+			switch (**end) {
+			case ('\0'):
+				return(ESCAPE_ERROR);
+			case ('\\'):
+				(*end)++;
+				if (ESCAPE_ERROR ==
+				    mandoc_escape(end, NULL, NULL))
+					return(ESCAPE_ERROR);
+				break;
+			default:
+				(*end)++;
+				break;
+			}
+		}
+		rlim = (*end)++ - rstart;
+	} else {
+		assert(rlim > 0);
+		if ((size_t)rlim > strlen(rstart))
 			return(ESCAPE_ERROR);
-
-		rlim = *end - &cp[i];
-		if (sz)
-			*sz = rlim;
-		(*end)++;
-		goto out;
+		*end += rlim;
 	}
-
-	assert(lim > 0);
-
-	/*
-	 * We have a numeric limit. If the string is shorter than that,
-	 * stop and return an error. Else adjust our endpoint, length,
-	 * and return the current glyph.
-	 */
-
-	if ((size_t)lim > strlen(&cp[i]))
-		return(ESCAPE_ERROR);
-
-	rlim = lim;
 	if (sz)
 		*sz = rlim;
-
-	*end = &cp[i] + lim;
-
-out:
-	assert(rlim >= 0 && rstart);
 
 	/* Run post-processors. */
 
Index: regress/usr.bin/mandoc/roff/esc/Makefile
===================================================================
RCS file: /cvs/src/regress/usr.bin/mandoc/roff/esc/Makefile,v
retrieving revision 1.1
diff -u -p -r1.1 Makefile
--- regress/usr.bin/mandoc/roff/esc/Makefile	28 May 2012 13:00:51 -0000	1.1
+++ regress/usr.bin/mandoc/roff/esc/Makefile	28 May 2012 16:37:53 -0000
@@ -1,6 +1,6 @@
 # $OpenBSD: Makefile,v 1.1 2012/05/28 13:00:51 schwarze Exp $
 
-REGRESS_TARGETS=z
+REGRESS_TARGETS=h z
 
 # Postprocessing to remove "character backspace" sequences
 # unless they are foolowed by the same character again.
Index: regress/usr.bin/mandoc/roff/esc/h.in
===================================================================
RCS file: regress/usr.bin/mandoc/roff/esc/h.in
diff -N regress/usr.bin/mandoc/roff/esc/h.in
--- /dev/null	1 Jan 1970 00:00:00 -0000
+++ regress/usr.bin/mandoc/roff/esc/h.in	28 May 2012 16:37:53 -0000
@@ -0,0 +1,16 @@
+.Dd May 28, 2012
+.Dt ESC-H 1
+.Os OpenBSD
+.Sh NAME
+.Nm esc-h
+.Nd the roff escape h sequence: horizontal movement
+.Sh DESCRIPTION
+simple: >\h'0'<
+.br
+escape only: >\h'\w'\&''<
+.br
+escape at the end: >\h'0+\w'\&''<
+.br
+escape at the beginning: >\h'\w'\&'+0'<
+.br
+escape in the middle: >\h'0+\w'\&'+0'<
Index: regress/usr.bin/mandoc/roff/esc/h.out_ascii
===================================================================
RCS file: regress/usr.bin/mandoc/roff/esc/h.out_ascii
diff -N regress/usr.bin/mandoc/roff/esc/h.out_ascii
--- /dev/null	1 Jan 1970 00:00:00 -0000
+++ regress/usr.bin/mandoc/roff/esc/h.out_ascii	28 May 2012 16:37:53 -0000
@@ -0,0 +1,13 @@
+ESC-H(1)                   OpenBSD Reference Manual                   ESC-H(1)
+
+NNAAMMEE
+     eesscc--hh - the roff escape h sequence: horizontal movement
+
+DDEESSCCRRIIPPTTIIOONN
+     simple: ><
+     escape only: ><
+     escape at the end: ><
+     escape at the beginning: ><
+     escape in the middle: ><
+
+OpenBSD                          May 28, 2012                          OpenBSD
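
For readers who want to try the terminator loop outside of mandoc, here
is a minimal standalone sketch in C.  It is not the mandoc code:
skip_escape() and escape_len() are hypothetical names, and the sketch
assumes only two escape forms, namely \h and \w with a quote-delimited
argument, and everything else treated as a one-character escape such
as \&.  Like mandoc_escape(), it expects the pointer to sit just past
the backslash.  The point it illustrates is that a quote belonging to a
nested escape is consumed by the recursive call, so it can never be
mistaken for the outer terminator, which is what a plain strchr() for
the terminator would do.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical, simplified stand-in for mandoc_escape().
 * On entry, *end points just past a backslash.  On success, *end is
 * advanced past the whole escape and 0 is returned; -1 is returned
 * if the input ends before the escape is complete.
 * Assumption of this sketch: only \h'...' and \w'...' take an
 * argument; any other character forms a one-character escape.
 */
static int
skip_escape(const char **end)
{
	char	 c;

	assert(end != NULL && *end != NULL);

	if (**end == '\0')
		return -1;
	c = *(*end)++;

	if (c != 'h' && c != 'w')
		return 0;		/* one-character escape, e.g. \& */
	if (**end != '\'')
		return -1;		/* opening quote is required */
	(*end)++;

	/* Read up to the closing quote, recursing into nested escapes. */
	while (**end != '\'') {
		switch (**end) {
		case '\0':
			return -1;	/* unterminated argument */
		case '\\':
			(*end)++;
			if (skip_escape(end) == -1)
				return -1;
			break;
		default:
			(*end)++;
			break;
		}
	}
	(*end)++;			/* step over the closing quote */
	return 0;
}

/* Number of bytes the escape starting at s occupies, or -1 on error. */
static long
escape_len(const char *s)
{
	const char	*p = s;

	if (skip_escape(&p) == -1)
		return -1;
	return (long)(p - s);
}
```

Fed the five inputs from h.in (minus the leading backslash), the
sketch consumes everything up to, but not including, the final '<' in
each case, for example for the C string "h'0+\\w'\\&'+0'<".  An input
whose outer quote is never closed yields -1.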