From: Ingo Schwarze <schwarze@usta.de>
To: tech@mdocml.bsd.lv
Subject: make recursive parsing of roff(7) escapes actually work
Date: Mon, 28 May 2012 19:23:35 +0200
Message-ID: <20120528172335.GB26820@iris.usta.de>

Here I'm deleting 80 lines of code,
and rendering is actually improved.

Note that parsing out pairs of parentheses is really pointless:
Closing parentheses inside nested escape sequences do not close
previously opened parentheses; and when any escape sequence ends,
it ends no matter how many parentheses may still be open.

The distinction between numeric and non-numeric escape sequences
is harmful because groff syntactically accepts nested escapes even
inside non-numeric escapes, and we don't worry about semantics
anyway.

Both aspects together simplify the numescape function to the point
that we can just drop it completely, and even the code outside the
dropped function becomes shorter.
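
To illustrate the idea in isolation, here is a minimal standalone sketch.
It is not the mandoc code: skip_escape() is a hypothetical helper, and it
assumes a toy syntax in which only \h and \w take a quoted argument and
every other escape (such as \&) is a single character.  The point is that
a plain search for the first closing quote would stop inside the nested
\w, while recursing on each backslash finds the real end.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical toy model, not the mandoc parser.
 * On entry, s points just past the backslash of an escape.
 * Returns a pointer just past the whole escape, or NULL if the
 * input ends before the escape is complete.
 */
static const char *
skip_escape(const char *s)
{
	char	 name;

	assert(s != NULL);
	if (*s == '\0')
		return NULL;

	name = *s++;
	if (name != 'h' && name != 'w')
		return s;		/* single-character escape, e.g. \& */

	if (*s++ != '\'')
		return NULL;		/* toy syntax requires a quoted argument */

	/* Scan up to the closing quote, recursing into nested escapes. */
	while (*s != '\'') {
		if (*s == '\0')
			return NULL;	/* unterminated argument */
		if (*s == '\\') {
			if ((s = skip_escape(s + 1)) == NULL)
				return NULL;
		} else
			s++;
	}
	return s + 1;			/* step past the closing quote */
}
```

Called on the text following the backslash of >\h'0+\w'\&'+0'<, this
should return a pointer to the final "<", regardless of whether the
nested escape sits at the beginning, in the middle, or at the end of
the argument.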

----- Forwarded message from Ingo Schwarze <schwarze@cvs.openbsd.org> -----

From: Ingo Schwarze <schwarze@cvs.openbsd.org>
Sender: owner-source-changes@openbsd.org
Date: Mon, 28 May 2012 11:08:48 -0600 (MDT)
To: source-changes@cvs.openbsd.org
Subject: CVS: cvs.openbsd.org: src

CVSROOT:	/cvs
Module name:	src
Changes by:	schwarze@cvs.openbsd.org	2012/05/28 11:08:48

Modified files:
	usr.bin/mandoc : mandoc.c 
	regress/usr.bin/mandoc/roff/esc: Makefile 
Added files:
	regress/usr.bin/mandoc/roff/esc: h.in h.out_ascii 

Log message:
Make recursive parsing of roff(7) escapes actually work in the general case,
in particular when the inner escapes are preceded or followed by other terms.
While doing so, remove lots of bogus code that was trying to make pointless
distinctions between numeric and non-numeric escape sequences, while both
actually share the same syntax and we ignore the semantics anyway.

This prevents some of the strings defined in the pod2man(1) preamble
from producing garbage output, in particular in scandinavian words.
Of course, proper rendering of scandinavian national characters
cannot be expected even with these fixes.

----- End forwarded message -----

Index: usr.bin/mandoc/mandoc.c
===================================================================
RCS file: /cvs/src/usr.bin/mandoc/mandoc.c,v
retrieving revision 1.32
diff -u -p -r1.32 mandoc.c
--- usr.bin/mandoc/mandoc.c	28 May 2012 13:00:51 -0000	1.32
+++ usr.bin/mandoc/mandoc.c	28 May 2012 16:37:53 -0000
@@ -33,71 +33,13 @@
 
 static	int	 a2time(time_t *, const char *, const char *);
 static	char	*time2a(time_t);
-static	int	 numescape(const char *);
 
-/*
- * Pass over recursive numerical expressions.  This context of this
- * function is important: it's only called within character-terminating
- * escapes (e.g., \s[xxxyyy]), so all we need to do is handle initial
- * recursion: we don't care about what's in these blocks. 
- * This returns the number of characters skipped or -1 if an error
- * occurs (the caller should bail).
- */
-static int
-numescape(const char *start)
-{
-	int		 i;
-	size_t		 sz;
-	const char	*cp;
-
-	i = 0;
-
-	/* The expression consists of a subexpression. */
-
-	if ('\\' == start[i]) {
-		cp = &start[++i];
-		/*
-		 * Read past the end of the subexpression.
-		 * Bail immediately on errors.
-		 */
-		if (ESCAPE_ERROR == mandoc_escape(&cp, NULL, NULL))
-			return(-1);
-		return(i + cp - &start[i]);
-	} 
-
-	if ('(' != start[i++])
-		return(0);
-
-	/*
-	 * A parenthesised subexpression.  Read until the closing
-	 * parenthesis, making sure to handle any nested subexpressions
-	 * that might ruin our parse.
-	 */
-
-	while (')' != start[i]) {
-		sz = strcspn(&start[i], ")\\");
-		i += (int)sz;
-
-		if ('\0' == start[i])
-			return(-1);
-		else if ('\\' != start[i])
-			continue;
-
-		cp = &start[++i];
-		if (ESCAPE_ERROR == mandoc_escape(&cp, NULL, NULL))
-			return(-1);
-		i += cp - &start[i];
-	}
-
-	/* Read past the terminating ')'. */
-	return(++i);
-}
 
 enum mandoc_esc
 mandoc_escape(const char **end, const char **start, int *sz)
 {
-	char		 c, term, numeric;
-	int		 i, lim, ssz, rlim;
+	char		 c, term;
+	int		 i, rlim;
 	const char	*cp, *rstart;
 	enum mandoc_esc	 gly; 
 
@@ -105,9 +47,9 @@ mandoc_escape(const char **end, const ch
 	rstart = cp;
 	if (start)
 		*start = rstart;
-	i = lim = 0;
+	i = rlim = 0;
 	gly = ESCAPE_ERROR;
-	term = numeric = '\0';
+	term = '\0';
 
 	switch ((c = cp[i++])) {
 	/*
@@ -117,7 +59,7 @@ mandoc_escape(const char **end, const ch
 	 */
 	case ('('):
 		gly = ESCAPE_SPECIAL;
-		lim = 2;
+		rlim = 2;
 		break;
 	case ('['):
 		gly = ESCAPE_SPECIAL;
@@ -179,13 +121,13 @@ mandoc_escape(const char **end, const ch
 
 		switch (cp[i++]) {
 		case ('('):
-			lim = 2;
+			rlim = 2;
 			break;
 		case ('['):
 			term = ']';
 			break;
 		default:
-			lim = 1;
+			rlim = 1;
 			i--;
 			break;
 		}
@@ -240,7 +182,7 @@ mandoc_escape(const char **end, const ch
 			gly = ESCAPE_IGNORE;
 		if ('\'' != cp[i++])
 			return(ESCAPE_ERROR);
-		term = numeric = '\'';
+		term = '\'';
 		break;
 
 	/*
@@ -280,16 +222,16 @@ mandoc_escape(const char **end, const ch
 
 		switch (cp[i++]) {
 		case ('('):
-			lim = 2;
+			rlim = 2;
 			break;
 		case ('['):
-			term = numeric = ']';
+			term = ']';
 			break;
 		case ('\''):
-			term = numeric = '\'';
+			term = '\'';
 			break;
 		default:
-			lim = 1;
+			rlim = 1;
 			i--;
 			break;
 		}
@@ -306,70 +248,47 @@ mandoc_escape(const char **end, const ch
 	 */
 	default:
 		gly = ESCAPE_SPECIAL;
-		lim = 1;
+		rlim = 1;
 		i--;
 		break;
 	}
 
 	assert(ESCAPE_ERROR != gly);
 
-	rstart = &cp[i];
+	*end = rstart = &cp[i];
 	if (start)
 		*start = rstart;
 
 	/*
-	 * If a terminating block has been specified, we need to
-	 * handle the case of recursion, which could have their
-	 * own terminating blocks that mess up our parse.  This, by the
-	 * way, means that the "start" and "size" values will be
-	 * effectively meaningless.
-	 */
-
-	ssz = 0;
-	if (numeric && -1 == (ssz = numescape(&cp[i])))
-		return(ESCAPE_ERROR);
-
-	i += ssz;
-	rlim = -1;
-
-	/*
-	 * We have a character terminator.  Try to read up to that
-	 * character.  If we can't (i.e., we hit the nil), then return
-	 * an error; if we can, calculate our length, read past the
-	 * terminating character, and exit.
+	 * Read up to the terminating character,
+	 * paying attention to nested escapes.
 	 */
 
 	if ('\0' != term) {
-		*end = strchr(&cp[i], term);
-		if ('\0' == *end)
+		while (**end != term) {
+			switch (**end) {
+			case ('\0'):
+				return(ESCAPE_ERROR);
+			case ('\\'):
+				(*end)++;
+				if (ESCAPE_ERROR ==
+				    mandoc_escape(end, NULL, NULL))
+					return(ESCAPE_ERROR);
+				break;
+			default:
+				(*end)++;
+				break;
+			}
+		}
+		rlim = (*end)++ - rstart;
+	} else {
+		assert(rlim > 0);
+		if ((size_t)rlim > strlen(rstart))
 			return(ESCAPE_ERROR);
-
-		rlim = *end - &cp[i];
-		if (sz)
-			*sz = rlim;
-		(*end)++;
-		goto out;
+		*end += rlim;
 	}
-
-	assert(lim > 0);
-
-	/*
-	 * We have a numeric limit.  If the string is shorter than that,
-	 * stop and return an error.  Else adjust our endpoint, length,
-	 * and return the current glyph.
-	 */
-
-	if ((size_t)lim > strlen(&cp[i]))
-		return(ESCAPE_ERROR);
-
-	rlim = lim;
 	if (sz)
 		*sz = rlim;
-
-	*end = &cp[i] + lim;
-
-out:
-	assert(rlim >= 0 && rstart);
 
 	/* Run post-processors. */
 
Index: regress/usr.bin/mandoc/roff/esc/Makefile
===================================================================
RCS file: /cvs/src/regress/usr.bin/mandoc/roff/esc/Makefile,v
retrieving revision 1.1
diff -u -p -r1.1 Makefile
--- regress/usr.bin/mandoc/roff/esc/Makefile	28 May 2012 13:00:51 -0000	1.1
+++ regress/usr.bin/mandoc/roff/esc/Makefile	28 May 2012 16:37:53 -0000
@@ -1,6 +1,6 @@
 # $OpenBSD: Makefile,v 1.1 2012/05/28 13:00:51 schwarze Exp $
 
-REGRESS_TARGETS=z
+REGRESS_TARGETS=h z
 
 # Postprocessing to remove "character backspace" sequences
 # unless they are foolowed by the same character again.
Index: regress/usr.bin/mandoc/roff/esc/h.in
===================================================================
RCS file: regress/usr.bin/mandoc/roff/esc/h.in
diff -N regress/usr.bin/mandoc/roff/esc/h.in
--- /dev/null	1 Jan 1970 00:00:00 -0000
+++ regress/usr.bin/mandoc/roff/esc/h.in	28 May 2012 16:37:53 -0000
@@ -0,0 +1,16 @@
+.Dd May 28, 2012
+.Dt ESC-H 1
+.Os OpenBSD
+.Sh NAME
+.Nm esc-h
+.Nd the roff escape h sequence: horizontal movement
+.Sh DESCRIPTION
+simple: >\h'0'<
+.br
+escape only: >\h'\w'\&''<
+.br
+escape at the end: >\h'0+\w'\&''<
+.br
+escape at the beginning: >\h'\w'\&'+0'<
+.br
+escape in the middle: >\h'0+\w'\&'+0'<
Index: regress/usr.bin/mandoc/roff/esc/h.out_ascii
===================================================================
RCS file: regress/usr.bin/mandoc/roff/esc/h.out_ascii
diff -N regress/usr.bin/mandoc/roff/esc/h.out_ascii
--- /dev/null	1 Jan 1970 00:00:00 -0000
+++ regress/usr.bin/mandoc/roff/esc/h.out_ascii	28 May 2012 16:37:53 -0000
@@ -0,0 +1,13 @@
+ESC-H(1)                   OpenBSD Reference Manual                   ESC-H(1)
+
+N\bNA\bAM\bME\bE
+     e\bes\bsc\bc-\b-h\bh - the roff escape h sequence: horizontal movement
+
+D\bDE\bES\bSC\bCR\bRI\bIP\bPT\bTI\bIO\bON\bN
+     simple: ><
+     escape only: ><
+     escape at the end: ><
+     escape at the beginning: ><
+     escape in the middle: ><
+
+OpenBSD                          May 28, 2012                          OpenBSD

--
 To unsubscribe send an email to tech+unsubscribe@mdocml.bsd.lv
