From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from mailout.scc.kit.edu (mailout.scc.kit.edu [129.13.185.202])
	by krisdoz.my.domain (8.14.5/8.14.5) with ESMTP id q4SHNd11031573
	for <tech@mdocml.bsd.lv>; Mon, 28 May 2012 13:23:40 -0400 (EDT)
Received: from hekate.usta.de (asta-nat.asta.uni-karlsruhe.de [172.22.63.82])
	by scc-mailout-02.scc.kit.edu with esmtp (Exim 4.72 #1)
	id 1SZ3fM-0006Fw-Op; Mon, 28 May 2012 19:23:36 +0200
Received: from donnerwolke.usta.de ([172.24.96.3])
	by hekate.usta.de with esmtp (Exim 4.77)
	(envelope-from <schwarze@usta.de>)
	id 1SZ3fM-00021z-OA
	for tech@mdocml.bsd.lv; Mon, 28 May 2012 19:23:36 +0200
Received: from iris.usta.de ([172.24.96.5] helo=usta.de)
	by donnerwolke.usta.de with esmtp (Exim 4.72)
	(envelope-from <schwarze@usta.de>)
	id 1SZ3fM-0007ao-M8
	for tech@mdocml.bsd.lv; Mon, 28 May 2012 19:23:36 +0200
Received: from schwarze by usta.de with local (Exim 4.77)
	(envelope-from <schwarze@usta.de>)
	id 1SZ3fM-0000OY-Ds
	for tech@mdocml.bsd.lv; Mon, 28 May 2012 19:23:36 +0200
Date: Mon, 28 May 2012 19:23:35 +0200
From: Ingo Schwarze <schwarze@usta.de>
To: tech@mdocml.bsd.lv
Subject: make recursive parsing of roff(7) escapes actually work
Message-ID: <20120528172335.GB26820@iris.usta.de>
X-Mailinglist: mdocml-tech
Reply-To: tech@mdocml.bsd.lv
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.5.21 (2010-09-15)
Content-Transfer-Encoding: 8bit
X-MIME-Autoconverted: from quoted-printable to 8bit by krisdoz.my.domain id q4SHNd11031573

Here I'm deleting 80 lines of code,
and rendering actually improves.

Note that parsing out pairs of parentheses is really pointless:
Closing parentheses inside nested escape sequences do not close
previously opened parentheses; and when any escape sequence ends,
it ends no matter how many parentheses may still be open.

The distinction between numeric and non-numeric escape sequences
is harmful because groff syntactically accepts nested escapes even
inside non-numeric escapes, and we don't worry about semantics
anyway.

Both aspects together simplify the numescape function to the point
that we can just drop it completely, and even the code outside the
dropped function becomes shorter.
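
To illustrate the idea outside the real parser, here is a rough
sketch, not the committed code.  The name skip_escape() is
hypothetical, and it only knows a handful of escape forms
(\(xx, \[name], \h'...', \w'...', and single-character escapes),
which is an assumption made to keep the example short.  Like
mandoc_escape(), it expects the pointer to sit just past the
backslash and advances it past the whole sequence, recursing
whenever it meets a nested backslash and never counting
parentheses.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical, heavily simplified stand-in for mandoc_escape().
 * On entry, *end points just past the backslash; on success, it
 * points just past the whole escape sequence.
 * Returns 0 on success, -1 on a malformed or truncated escape.
 */
int
skip_escape(const char **end)
{
	const char	*cp = *end;
	char		 term = '\0';
	size_t		 len = 0;

	switch (*cp++) {
	case '(':		/* \(xx: two-character name */
		len = 2;
		break;
	case '[':		/* \[name]: terminated by a bracket */
		term = ']';
		break;
	case 'h':		/* \h'...' and \w'...': quoted argument */
	case 'w':
		if (*cp++ != '\'')
			return -1;
		term = '\'';
		break;
	default:		/* \x: the character itself is the name */
		cp--;
		len = 1;
		break;
	}

	/* Fixed-length form: just check that enough input remains. */
	if (term == '\0') {
		if (strlen(cp) < len)
			return -1;
		*end = cp + len;
		return 0;
	}

	/*
	 * Terminated form: scan up to the terminator, letting each
	 * nested escape consume its own terminator.  Parentheses
	 * are ordinary characters here.
	 */
	while (*cp != term) {
		switch (*cp) {
		case '\0':
			return -1;
		case '\\':
			cp++;
			if (skip_escape(&cp) == -1)
				return -1;
			break;
		default:
			cp++;
			break;
		}
	}
	*end = cp + 1;		/* step over the terminator */
	return 0;
}
```

Given the C string "h'0+\\w'\\&'+0'<", that is the part of
\h'0+\w'\&'+0'< after the first backslash, the sketch returns 0
and leaves the pointer at "<".  The same holds for
"h'(1+\\w')'u)'<", where the closing parenthesis inside the
nested \w does not affect the outer sequence.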

----- Forwarded message from Ingo Schwarze <schwarze@cvs.openbsd.org> -----

From: Ingo Schwarze <schwarze@cvs.openbsd.org>
Sender: owner-source-changes@openbsd.org
Date: Mon, 28 May 2012 11:08:48 -0600 (MDT)
To: source-changes@cvs.openbsd.org
Subject: CVS: cvs.openbsd.org: src

CVSROOT:	/cvs
Module name:	src
Changes by:	schwarze@cvs.openbsd.org	2012/05/28 11:08:48

Modified files:
	usr.bin/mandoc : mandoc.c 
	regress/usr.bin/mandoc/roff/esc: Makefile 
Added files:
	regress/usr.bin/mandoc/roff/esc: h.in h.out_ascii 

Log message:
Make recursive parsing of roff(7) escapes actually work in the general case,
in particular when the inner escapes are preceded or followed by other terms.
While doing so, remove lots of bogus code that was trying to make pointless
distinctions between numeric and non-numeric escape sequences, while both
actually share the same syntax and we ignore the semantics anyway.

This prevents some of the strings defined in the pod2man(1) preamble
from producing garbage output, in particular in Scandinavian words.
Of course, proper rendering of Scandinavian national characters
cannot be expected even with these fixes.

----- End forwarded message -----

Index: usr.bin/mandoc/mandoc.c
===================================================================
RCS file: /cvs/src/usr.bin/mandoc/mandoc.c,v
retrieving revision 1.32
diff -u -p -r1.32 mandoc.c
--- usr.bin/mandoc/mandoc.c	28 May 2012 13:00:51 -0000	1.32
+++ usr.bin/mandoc/mandoc.c	28 May 2012 16:37:53 -0000
@@ -33,71 +33,13 @@
 
 static	int	 a2time(time_t *, const char *, const char *);
 static	char	*time2a(time_t);
-static	int	 numescape(const char *);
 
-/*
- * Pass over recursive numerical expressions.  This context of this
- * function is important: it's only called within character-terminating
- * escapes (e.g., \s[xxxyyy]), so all we need to do is handle initial
- * recursion: we don't care about what's in these blocks. 
- * This returns the number of characters skipped or -1 if an error
- * occurs (the caller should bail).
- */
-static int
-numescape(const char *start)
-{
-	int		 i;
-	size_t		 sz;
-	const char	*cp;
-
-	i = 0;
-
-	/* The expression consists of a subexpression. */
-
-	if ('\\' == start[i]) {
-		cp = &start[++i];
-		/*
-		 * Read past the end of the subexpression.
-		 * Bail immediately on errors.
-		 */
-		if (ESCAPE_ERROR == mandoc_escape(&cp, NULL, NULL))
-			return(-1);
-		return(i + cp - &start[i]);
-	} 
-
-	if ('(' != start[i++])
-		return(0);
-
-	/*
-	 * A parenthesised subexpression.  Read until the closing
-	 * parenthesis, making sure to handle any nested subexpressions
-	 * that might ruin our parse.
-	 */
-
-	while (')' != start[i]) {
-		sz = strcspn(&start[i], ")\\");
-		i += (int)sz;
-
-		if ('\0' == start[i])
-			return(-1);
-		else if ('\\' != start[i])
-			continue;
-
-		cp = &start[++i];
-		if (ESCAPE_ERROR == mandoc_escape(&cp, NULL, NULL))
-			return(-1);
-		i += cp - &start[i];
-	}
-
-	/* Read past the terminating ')'. */
-	return(++i);
-}
 
 enum mandoc_esc
 mandoc_escape(const char **end, const char **start, int *sz)
 {
-	char		 c, term, numeric;
-	int		 i, lim, ssz, rlim;
+	char		 c, term;
+	int		 i, rlim;
 	const char	*cp, *rstart;
 	enum mandoc_esc	 gly; 
 
@@ -105,9 +47,9 @@ mandoc_escape(const char **end, const ch
 	rstart = cp;
 	if (start)
 		*start = rstart;
-	i = lim = 0;
+	i = rlim = 0;
 	gly = ESCAPE_ERROR;
-	term = numeric = '\0';
+	term = '\0';
 
 	switch ((c = cp[i++])) {
 	/*
@@ -117,7 +59,7 @@ mandoc_escape(const char **end, const ch
 	 */
 	case ('('):
 		gly = ESCAPE_SPECIAL;
-		lim = 2;
+		rlim = 2;
 		break;
 	case ('['):
 		gly = ESCAPE_SPECIAL;
@@ -179,13 +121,13 @@ mandoc_escape(const char **end, const ch
 
 		switch (cp[i++]) {
 		case ('('):
-			lim = 2;
+			rlim = 2;
 			break;
 		case ('['):
 			term = ']';
 			break;
 		default:
-			lim = 1;
+			rlim = 1;
 			i--;
 			break;
 		}
@@ -240,7 +182,7 @@ mandoc_escape(const char **end, const ch
 			gly = ESCAPE_IGNORE;
 		if ('\'' != cp[i++])
 			return(ESCAPE_ERROR);
-		term = numeric = '\'';
+		term = '\'';
 		break;
 
 	/*
@@ -280,16 +222,16 @@ mandoc_escape(const char **end, const ch
 
 		switch (cp[i++]) {
 		case ('('):
-			lim = 2;
+			rlim = 2;
 			break;
 		case ('['):
-			term = numeric = ']';
+			term = ']';
 			break;
 		case ('\''):
-			term = numeric = '\'';
+			term = '\'';
 			break;
 		default:
-			lim = 1;
+			rlim = 1;
 			i--;
 			break;
 		}
@@ -306,70 +248,47 @@ mandoc_escape(const char **end, const ch
 	 */
 	default:
 		gly = ESCAPE_SPECIAL;
-		lim = 1;
+		rlim = 1;
 		i--;
 		break;
 	}
 
 	assert(ESCAPE_ERROR != gly);
 
-	rstart = &cp[i];
+	*end = rstart = &cp[i];
 	if (start)
 		*start = rstart;
 
 	/*
-	 * If a terminating block has been specified, we need to
-	 * handle the case of recursion, which could have their
-	 * own terminating blocks that mess up our parse.  This, by the
-	 * way, means that the "start" and "size" values will be
-	 * effectively meaningless.
-	 */
-
-	ssz = 0;
-	if (numeric && -1 == (ssz = numescape(&cp[i])))
-		return(ESCAPE_ERROR);
-
-	i += ssz;
-	rlim = -1;
-
-	/*
-	 * We have a character terminator.  Try to read up to that
-	 * character.  If we can't (i.e., we hit the nil), then return
-	 * an error; if we can, calculate our length, read past the
-	 * terminating character, and exit.
+	 * Read up to the terminating character,
+	 * paying attention to nested escapes.
 	 */
 
 	if ('\0' != term) {
-		*end = strchr(&cp[i], term);
-		if ('\0' == *end)
+		while (**end != term) {
+			switch (**end) {
+			case ('\0'):
+				return(ESCAPE_ERROR);
+			case ('\\'):
+				(*end)++;
+				if (ESCAPE_ERROR ==
+				    mandoc_escape(end, NULL, NULL))
+					return(ESCAPE_ERROR);
+				break;
+			default:
+				(*end)++;
+				break;
+			}
+		}
+		rlim = (*end)++ - rstart;
+	} else {
+		assert(rlim > 0);
+		if ((size_t)rlim > strlen(rstart))
 			return(ESCAPE_ERROR);
-
-		rlim = *end - &cp[i];
-		if (sz)
-			*sz = rlim;
-		(*end)++;
-		goto out;
+		*end += rlim;
 	}
-
-	assert(lim > 0);
-
-	/*
-	 * We have a numeric limit.  If the string is shorter than that,
-	 * stop and return an error.  Else adjust our endpoint, length,
-	 * and return the current glyph.
-	 */
-
-	if ((size_t)lim > strlen(&cp[i]))
-		return(ESCAPE_ERROR);
-
-	rlim = lim;
 	if (sz)
 		*sz = rlim;
-
-	*end = &cp[i] + lim;
-
-out:
-	assert(rlim >= 0 && rstart);
 
 	/* Run post-processors. */
 
Index: regress/usr.bin/mandoc/roff/esc/Makefile
===================================================================
RCS file: /cvs/src/regress/usr.bin/mandoc/roff/esc/Makefile,v
retrieving revision 1.1
diff -u -p -r1.1 Makefile
--- regress/usr.bin/mandoc/roff/esc/Makefile	28 May 2012 13:00:51 -0000	1.1
+++ regress/usr.bin/mandoc/roff/esc/Makefile	28 May 2012 16:37:53 -0000
@@ -1,6 +1,6 @@
 # $OpenBSD: Makefile,v 1.1 2012/05/28 13:00:51 schwarze Exp $
 
-REGRESS_TARGETS=z
+REGRESS_TARGETS=h z
 
 # Postprocessing to remove "character backspace" sequences
 # unless they are foolowed by the same character again.
Index: regress/usr.bin/mandoc/roff/esc/h.in
===================================================================
RCS file: regress/usr.bin/mandoc/roff/esc/h.in
diff -N regress/usr.bin/mandoc/roff/esc/h.in
--- /dev/null	1 Jan 1970 00:00:00 -0000
+++ regress/usr.bin/mandoc/roff/esc/h.in	28 May 2012 16:37:53 -0000
@@ -0,0 +1,16 @@
+.Dd May 28, 2012
+.Dt ESC-H 1
+.Os OpenBSD
+.Sh NAME
+.Nm esc-h
+.Nd the roff escape h sequence: horizontal movement
+.Sh DESCRIPTION
+simple: >\h'0'<
+.br
+escape only: >\h'\w'\&''<
+.br
+escape at the end: >\h'0+\w'\&''<
+.br
+escape at the beginning: >\h'\w'\&'+0'<
+.br
+escape in the middle: >\h'0+\w'\&'+0'<
Index: regress/usr.bin/mandoc/roff/esc/h.out_ascii
===================================================================
RCS file: regress/usr.bin/mandoc/roff/esc/h.out_ascii
diff -N regress/usr.bin/mandoc/roff/esc/h.out_ascii
--- /dev/null	1 Jan 1970 00:00:00 -0000
+++ regress/usr.bin/mandoc/roff/esc/h.out_ascii	28 May 2012 16:37:53 -0000
@@ -0,0 +1,13 @@
+ESC-H(1)                   OpenBSD Reference Manual                   ESC-H(1)
+
+NNAAMMEE
+     eesscc--hh - the roff escape h sequence: horizontal movement
+
+DDEESSCCRRIIPPTTIIOONN
+     simple: ><
+     escape only: ><
+     escape at the end: ><
+     escape at the beginning: ><
+     escape in the middle: ><
+
+OpenBSD                          May 28, 2012                          OpenBSD
