From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from scc-mailout.scc.kit.edu (scc-mailout.scc.kit.edu [129.13.185.202])
	by krisdoz.my.domain (8.14.5/8.14.5) with ESMTP id q040sW2S001360
	for <tech@mdocml.bsd.lv>; Tue, 3 Jan 2012 19:54:32 -0500 (EST)
Received: from hekate.usta.de (asta-nat.asta.uni-karlsruhe.de [172.22.63.82])
	by scc-mailout-02.scc.kit.edu with esmtp (Exim 4.72 #1)
	id 1RiF7f-0007tU-Ei; Wed, 04 Jan 2012 01:54:31 +0100
Received: from donnerwolke.usta.de ([172.24.96.3])
	by hekate.usta.de with esmtp (Exim 4.72)
	(envelope-from <schwarze@usta.de>)
	id 1RiF7f-00001p-F8
	for tech@mdocml.bsd.lv; Wed, 04 Jan 2012 01:54:31 +0100
Received: from iris.usta.de ([172.24.96.5] helo=usta.de)
	by donnerwolke.usta.de with esmtp (Exim 4.72)
	(envelope-from <schwarze@usta.de>)
	id 1RiF7f-0005sY-Dv
	for tech@mdocml.bsd.lv; Wed, 04 Jan 2012 01:54:31 +0100
Received: from schwarze by usta.de with local (Exim 4.72)
	(envelope-from <schwarze@usta.de>)
	id 1RiF7f-0008Vd-3K
	for tech@mdocml.bsd.lv; Wed, 04 Jan 2012 01:54:31 +0100
Date: Wed, 4 Jan 2012 01:54:30 +0100
From: Ingo Schwarze <schwarze@usta.de>
To: tech@mdocml.bsd.lv
Subject: Re: Can of worms: \h"..."
Message-ID: <20120104005430.GF2607@iris.usta.de>
References: <4F02F264.2070407@bsd.lv>
X-Mailinglist: mdocml-tech
Reply-To: tech@mdocml.bsd.lv
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <4F02F264.2070407@bsd.lv>
User-Agent: Mutt/1.5.21 (2010-09-15)

Hi Kristaps,

Just a very quick answer: it's getting late already and I can't
study this in due detail right now.

Kristaps Dzonsons wrote on Tue, Jan 03, 2012 at 01:19:48PM +0100:

> On the verge of checking in a quick fix for the \h"..." TODO, it
> occurred to me that we either don't want to accommodate pod2man
> badness OR something more subtle's at work.  \h"..." is specifically
> disallowed by groff(1).  So I searched in the groff source.  Behold!
> 
> In groff.c's input.cpp, we see several escapes (h, H, N, S, v, x)
> directly condition their enclosing markers on the first character
> (see get_delim_number()) while others do so indirectly.  These set
> the end marker on the first character given that it satisfies the
> token::delimiter() method (or whatever is C++'s name for an object
> function).
> 
> The delimiter() function (also in input.cpp) allows any character
> but a certain ASCII subset and whitespace.  groff(7) mentions the
> apostrophe, but it can be much, much more.
> 
> Question is: do we want this behaviour?  I'd say we do,

If I understand correctly, I tend to say:
Yes, we should accept the same delimiter characters as groff does.

> but as it's somewhat intrusive, I want some consensus before
> committing.  Either way, I do NOT suggest that we outwardly
> document this.

Indeed, documenting the apostrophe as a delimiter is enough;
everything else does not seem particularly sane.

> Note that this also fixes the situation where some non-\N escapes
> were being assigned the NUMERIC identifier, which is only used for
> \N.  I also removed the check for \N numbers, as this is done again
> later.

I haven't run it yet, but I suspect that part is wrong.
The point is: Sure, we have found an explicit delimiting character.
But any other letter will terminate the escape sequence as well, see

  http://www.openbsd.org/cgi-bin/cvsweb/src/regress/usr.bin/mandoc/char/N/

Both the mdoc(7) input and groff(1) output are checked in.
See in particular the "mixed content" on line 18 of basic.in
and line 13 of basic.out_ascii.

Whatever you check in, please don't break that test.  :-)
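
To illustrate what that test relies on, here is a rough standalone
sketch modelled on the case ('N') block that the patch removes.  The
function name numbered_end() and its signature are made up for this
mail, and it folds the two failure cases (no delimiter, digit as
delimiter) into a single NULL return, which the real code does not do.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of mandoc.c.
 * cp points just past "\N".  On success, *start and *sz describe the
 * run of digits after the opening delimiter, and the return value
 * points past whatever single character ended that run.
 * Returns NULL if the delimiter is missing or is itself a digit.
 */
static const char *
numbered_end(const char *cp, const char **start, size_t *sz)
{
	const char *end;

	if ('\0' == *cp || isdigit((unsigned char)*cp))
		return NULL;

	/* Skip the opening delimiter, then collect digits only. */
	*start = end = cp + 1;
	while (isdigit((unsigned char)*end))
		end++;
	*sz = (size_t)(end - *start);

	/* Any non-digit ends the escape and is swallowed with it. */
	if ('\0' != *end)
		end++;
	return end;
}
```

With the input '65x' this yields the digits 65 and leaves the closing
apostrophe behind as ordinary text, because the x already ended the
escape.  A version that only scans for the matching delimiter would
behave differently on such input.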

> Thoughts?

The longish switch(numeric) could probably be replaced by something like

  strchr("0123456789+-/*%<>=&:().", numeric)
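
Spelled out a bit more, as an untested sketch: the helper name
bad_delim() is invented here, and rejecting NUL is my assumption, not
something the patch does.  The explicit NUL test is needed because
strchr() also matches the terminating NUL of the set string.

```c
#include <string.h>

/*
 * Hypothetical helper, not part of mandoc.c.
 * Returns nonzero if c must not open a \h-style argument.
 * The character set is the one listed in the switch of the patch.
 */
static int
bad_delim(char c)
{
	/* strchr(set, '\0') is non-NULL, so handle NUL up front. */
	if ('\0' == c)
		return 1;
	return NULL != strchr("0123456789+-/*%<>=&:().", c);
}
```

The isspace() check from the patch would still have to follow
separately, since whitespace is not part of that set.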

Yours,
  Ingo


> Index: mandoc.c
> ===================================================================
> RCS file: /usr/vhosts/mdocml.bsd.lv/cvs/mdocml/mandoc.c,v
> retrieving revision 1.62
> diff -u -p -r1.62 mandoc.c
> --- mandoc.c	3 Dec 2011 16:08:51 -0000	1.62
> +++ mandoc.c	3 Jan 2012 12:18:51 -0000
> @@ -209,9 +209,15 @@ mandoc_escape(const char **end, const ch
>  		break;
>  
>  	/*
> -	 * These escapes are of the form \X'N', where 'X' is the trigger
> -	 * and 'N' resolves to a numerical expression.
> +	 * These escapes accept most characters as enclosure marks
> +	 * (except for those listed in the switch).
> +	 * The enclosed materials are numbers, so run them through the
> +	 * numerical subexpression calculator after we process.
>  	 */
> +	case ('N'):
> +		/* Special case: numerical representation of char. */
> +		gly = ESCAPE_NUMBERED;
> +		/* FALLTHROUGH */
>  	case ('B'):
>  		/* FALLTHROUGH */
>  	case ('h'):
> @@ -221,7 +227,6 @@ mandoc_escape(const char **end, const ch
>  	case ('L'):
>  		/* FALLTHROUGH */
>  	case ('l'):
> -		gly = ESCAPE_NUMBERED;
>  		/* FALLTHROUGH */
>  	case ('S'):
>  		/* FALLTHROUGH */
> @@ -230,32 +235,62 @@ mandoc_escape(const char **end, const ch
>  	case ('w'):
>  		/* FALLTHROUGH */
>  	case ('x'):
> -		if (ESCAPE_ERROR == gly)
> +		if (ESCAPE_NUMBERED != gly)
>  			gly = ESCAPE_IGNORE;
> -		if ('\'' != cp[i++])
> +		numeric = term = cp[i++];
> +		switch (numeric) {
> +		case('0'):
> +			/* FALLTHROUGH */
> +		case('1'):
> +			/* FALLTHROUGH */
> +		case('2'):
> +			/* FALLTHROUGH */
> +		case('3'):
> +			/* FALLTHROUGH */
> +		case('4'):
> +			/* FALLTHROUGH */
> +		case('5'):
> +			/* FALLTHROUGH */
> +		case('6'):
> +			/* FALLTHROUGH */
> +		case('7'):
> +			/* FALLTHROUGH */
> +		case('8'):
> +			/* FALLTHROUGH */
> +		case('9'):
> +			/* FALLTHROUGH */
> +		case('+'):
> +			/* FALLTHROUGH */
> +		case('-'):
> +			/* FALLTHROUGH */
> +		case('/'):
> +			/* FALLTHROUGH */
> +		case('*'):
> +			/* FALLTHROUGH */
> +		case('%'):
> +			/* FALLTHROUGH */
> +		case('<'):
> +			/* FALLTHROUGH */
> +		case('>'):
> +			/* FALLTHROUGH */
> +		case('='):
> +			/* FALLTHROUGH */
> +		case('&'):
> +			/* FALLTHROUGH */
> +		case(':'):
> +			/* FALLTHROUGH */
> +		case('('):
> +			/* FALLTHROUGH */
> +		case(')'):
> +			/* FALLTHROUGH */
> +		case('.'):
>  			return(ESCAPE_ERROR);
> -		term = numeric = '\'';
> -		break;
> -
> -	/*
> -	 * Special handling for the numbered character escape.
> -	 * XXX Do any other escapes need similar handling?
> -	 */
> -	case ('N'):
> -		if ('\0' == cp[i])
> +		default:
> +			break;
> +		}
> +		if (isspace((unsigned char)numeric))
>  			return(ESCAPE_ERROR);
> -		*end = &cp[++i];
> -		if (isdigit((unsigned char)cp[i-1]))
> -			return(ESCAPE_IGNORE);
> -		while (isdigit((unsigned char)**end))
> -			(*end)++;
> -		if (start)
> -			*start = &cp[i];
> -		if (sz)
> -			*sz = *end - &cp[i];
> -		if ('\0' != **end)
> -			(*end)++;
> -		return(ESCAPE_NUMBERED);
> +		break;
>  
>  	/* 
>  	 * Sizes get a special category of their own.