tech@mandoc.bsd.lv
* roff_getstr() and input characters
@ 2010-07-06 23:08 Kristaps Dzonsons
       [not found] ` <20100706231643.GC32413@bramka.kerhand.co.uk>
  0 siblings, 1 reply; 2+ messages in thread
From: Kristaps Dzonsons @ 2010-07-06 23:08 UTC
  To: tech, Jason McIntyre

[-- Attachment #1: Type: text/plain, Size: 1165 bytes --]

Hi,

(Jason, the bits I'd like you to weigh in on are a few paragraphs down.)

Enclosed is a patch pushing the roff_getstr functionality directly into 
libmdoc.  It works by testing against roff_getstr() in-band and splicing 
together a new buffer if necessary.
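
To illustrate the splicing step described above, here is a minimal 
sketch, not the patch's code.  The helper name splice() and its 
parameters are hypothetical, and it uses memcpy() rather than the 
strlcat() calls in the patch so that it builds on systems without the 
BSD string functions.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of mandoc): return a newly allocated
 * copy of src in which the esclen bytes starting at offset off are
 * replaced by repl.  Returns NULL on bad arguments or if allocation
 * fails.  The caller frees the result.
 */
static char *
splice(const char *src, size_t off, size_t esclen, const char *repl)
{
	size_t	 srcsz = strlen(src);
	size_t	 replsz = strlen(repl);
	size_t	 tailsz;
	char	*cp;

	/* The replaced span must lie entirely inside src. */
	if (off > srcsz || esclen > srcsz - off)
		return NULL;

	tailsz = srcsz - off - esclen;
	cp = malloc(off + replsz + tailsz + 1);
	if (cp == NULL)
		return NULL;

	memcpy(cp, src, off);			/* text before the escape */
	memcpy(cp + off, repl, replsz);		/* the defined string */
	/* Text after the escape, including the terminating NUL. */
	memcpy(cp + off + replsz, src + off + esclen, tailsz + 1);
	return cp;
}
```

For example, splice("see \\*(Nm here", 4, 5, "mandoc") yields a new 
buffer holding "see mandoc here"; scanning can then resume at offset 4 
in the new buffer so the substituted text is checked as well.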

I thought about putting the entire mandoc_special() check in libroff, 
but don't want to cause yet another scan over the line buffer. 
check_text() needs to warn against '\t' and '\b' anyway.  This is an 
open question I'll answer later when I start looking at performance.

The reason I want to air it with you (I know it works: I've tested it 
across all manuals) is that it also removes the check for isprint(), 
using strcspn() instead.  As you can see, the reject set covers only 
'\b', which we must prohibit or else we break output encoding; '\t' for 
non-literals (warning); and '\\' for the specials check.
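
As a rough sketch of how a strcspn()-based scan behaves with that 
reject set (the function name count_stops() is hypothetical and the 
code is an illustration, not taken from the patch): every byte other 
than tab, backspace and backslash is skipped in one call, so Latin-1 
bytes pass through untouched.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: count the bytes in s at which a scan using
 * the reject set "\t\b\\" would stop for closer inspection.
 */
static size_t
count_stops(const char *s)
{
	size_t	 n = 0;

	for (;;) {
		/* Skip the longest run free of tab, backspace, backslash. */
		s += strcspn(s, "\t\b\\");
		if (*s == '\0')
			break;
		n++;	/* A real checker would classify *s here. */
		s++;
	}
	return n;
}
```

With this sketch, a string such as "caf\xe9 au lait" produces no stops 
at all, while "a\tb\\c\bd" produces three.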

I argue for lifting the ASCII constraint because (1) there's nothing in 
mdoc/groff/etc. that disallows non-ASCII (e.g., Latin-1) characters and 
(2) it makes the code much cleaner.

Thoughts?

Kristaps

PS, the patch doesn't mandate '\b': I just caught that now and will fix 
it later.

[-- Attachment #2: patch.txt --]
[-- Type: text/plain, Size: 7005 bytes --]

? DONTDELETE.c
? config.h
? config.log
? foo.1
? foo.1.html
? mandoc
? mandoc.core
? mdoc.7.pdf
? patch.txt
? ssh.1.html
? user.8
? regress/mandoc.core
? regress/output
Index: libmandoc.h
===================================================================
RCS file: /usr/vhosts/mdocml.bsd.lv/cvs/mdocml/libmandoc.h,v
retrieving revision 1.8
diff -u -r1.8 libmandoc.h
--- libmandoc.h	19 Jun 2010 20:46:27 -0000	1.8
+++ libmandoc.h	6 Jul 2010 23:06:02 -0000
@@ -19,7 +19,7 @@
 
 __BEGIN_DECLS
 
-int		 mandoc_special(char *);
+int		 mandoc_special(char *, char **, size_t *);
 void		*mandoc_calloc(size_t, size_t);
 char		*mandoc_strdup(const char *);
 void		*mandoc_malloc(size_t);
Index: man_validate.c
===================================================================
RCS file: /usr/vhosts/mdocml.bsd.lv/cvs/mdocml/man_validate.c,v
retrieving revision 1.45
diff -u -r1.45 man_validate.c
--- man_validate.c	28 Jun 2010 14:39:17 -0000	1.45
+++ man_validate.c	6 Jul 2010 23:06:02 -0000
@@ -204,14 +204,15 @@
 static int
 check_text(CHKARGS) 
 {
-	char		*p;
+	char		*p, *spec;
+	size_t		 specsz;
 	int		 pos, c;
 
 	assert(n->string);
 
 	for (p = n->string, pos = n->pos + 1; *p; p++, pos++) {
 		if ('\\' == *p) {
-			c = mandoc_special(p);
+			c = mandoc_special(p, &spec, &specsz);
 			if (c) {
 				p += c - 1;
 				pos += c - 1;
Index: mandoc.c
===================================================================
RCS file: /usr/vhosts/mdocml.bsd.lv/cvs/mdocml/mandoc.c,v
retrieving revision 1.21
diff -u -r1.21 mandoc.c
--- mandoc.c	6 Jul 2010 22:04:31 -0000	1.21
+++ mandoc.c	6 Jul 2010 23:06:02 -0000
@@ -1,4 +1,4 @@
-/*	$Id: mandoc.c,v 1.21 2010/07/06 22:04:31 kristaps Exp $ */
+/*	$Id: libmandoc.c,v 1.1 2010/07/05 20:00:55 kristaps Exp $ */
 /*
  * Copyright (c) 2008, 2009 Kristaps Dzonsons <kristaps@bsd.lv>
  *
@@ -52,7 +52,7 @@
 
 
 int
-mandoc_special(char *p)
+mandoc_special(char *p, char **v, size_t *vsz)
 {
 	int		 terminator;	/* Terminator for \s. */
 	int		 lim;		/* Limit for N in \s. */
@@ -60,6 +60,8 @@
 	char		*sv;
 	
 	sv = p;
+	*v = NULL;
+	*vsz = 0;
 
 	if ('\\' != *p++)
 		return(spec_norm(sv, 0));
@@ -181,8 +183,12 @@
 	case ('*'):
 		if ('\0' == *++p || isspace((u_char)*p))
 			return(spec_norm(sv, 0));
+		*v = p + 1;
 		switch (*p) {
 		case ('('):
+			*vsz = 2;
+			if ('\0' == *++p || isspace((u_char)*p))
+				return(spec_norm(sv, 0));
 			if ('\0' == *++p || isspace((u_char)*p))
 				return(spec_norm(sv, 0));
 			return(spec_norm(sv, 4));
@@ -190,10 +196,12 @@
 			for (c = 3, p++; *p && ']' != *p; p++, c++)
 				if (isspace((u_char)*p))
 					break;
+			*vsz = (size_t)c - 3;
 			return(spec_norm(sv, *p == ']' ? c : 0));
 		default:
 			break;
 		}
+		*vsz = 1;
 		return(spec_norm(sv, 3));
 	case ('('):
 		if ('\0' == *++p || isspace((u_char)*p))
Index: mdoc_validate.c
===================================================================
RCS file: /usr/vhosts/mdocml.bsd.lv/cvs/mdocml/mdoc_validate.c,v
retrieving revision 1.109
diff -u -r1.109 mdoc_validate.c
--- mdoc_validate.c	4 Jul 2010 21:59:30 -0000	1.109
+++ mdoc_validate.c	6 Jul 2010 23:06:02 -0000
@@ -47,7 +47,7 @@
 
 static	int	 check_parent(PRE_ARGS, enum mdoct, enum mdoc_type);
 static	int	 check_stdarg(PRE_ARGS);
-static	int	 check_text(struct mdoc *, int, int, char *);
+static	int	 check_text(struct mdoc *, int, int, char **);
 static	int	 check_argv(struct mdoc *, 
 			struct mdoc_node *, struct mdoc_argv *);
 static	int	 check_args(struct mdoc *, struct mdoc_node *);
@@ -275,13 +275,11 @@
 {
 	v_pre		*p;
 	int		 line, pos;
-	char		*tp;
 
 	if (MDOC_TEXT == n->type) {
-		tp = n->string;
 		line = n->line;
 		pos = n->pos;
-		return(check_text(mdoc, line, pos, tp));
+		return(check_text(mdoc, line, pos, &n->string));
 	}
 
 	if ( ! check_args(mdoc, n))
@@ -439,7 +437,7 @@
 	int		 i;
 
 	for (i = 0; i < (int)v->sz; i++)
-		if ( ! check_text(m, v->line, v->pos, v->value[i]))
+		if ( ! check_text(m, v->line, v->pos, &v->value[i]))
 			return(0);
 
 	if (MDOC_Std == v->arg) {
@@ -454,43 +452,95 @@
 
 
 static int
-check_text(struct mdoc *mdoc, int line, int pos, char *p)
+check_text(struct mdoc *m, int ln, int pos, char **pp)
 {
 	int		 c;
+	size_t		 sz, specsz, cpsz;
+	char		*p, *spec, *cp;
+	const char	*res;
+
+	for (p = *pp; *p; p++, pos++) {
+		sz = strcspn(p, "\t\b\\");
+
+		p += (int)sz;
+
+		if ('\0' == *p)
+			break;
+
+		pos += (int)sz;
+
+		/*
+		 * Filter backspace (not allowed, as it will screw up
+		 * our output formatting) and tabs, which are only
+		 * suggested in literal contexts.  Also halt at escapes
+		 * so we can check that they're acceptable.
+		 */
+
+		switch (*p) {
+		case ('\t'):
+			if (MDOC_LITERAL & m->flags)
+				continue;
+			/* FALLTHROUGH */
+		case ('\b'):
+			if (mdoc_pmsg(m, ln, pos, MANDOCERR_BADCHAR))
+				continue;
+			return(0);
+		default:
+			break;
+		}
+
+		/* Check the special character. */
 
-	/* 
-	 * FIXME: we absolutely cannot let \b get through or it will
-	 * destroy some assumptions in terms of format.
-	 */
-
-	for ( ; *p; p++, pos++) {
-		if ('\t' == *p) {
-			if ( ! (MDOC_LITERAL & mdoc->flags))
-				if ( ! mdoc_pmsg(mdoc, line, pos, MANDOCERR_BADCHAR))
-					return(0);
-		} else if ( ! isprint((u_char)*p) && ASCII_HYPH != *p)
-			if ( ! mdoc_pmsg(mdoc, line, pos, MANDOCERR_BADCHAR))
+		c = mandoc_special(p, &spec, &specsz);
+
+		if (0 == c) {
+			c = mdoc_pmsg(m, ln, pos, MANDOCERR_BADESCAPE);
+			if ( ! (MDOC_IGN_ESCAPE & m->pflags) && ! c)
 				return(0);
+			continue;
+		}
 
-		if ('\\' != *p)
+		if (NULL == spec) {
+			p += c - 1;
+			pos += c - 1;
 			continue;
+		}
+
+		/* Reserved word.  Was it defined using `ds'? */
 
-		c = mandoc_special(p);
-		if (c) {
+		if (NULL == (res = roff_getstrn(spec, specsz))) {
+			c = mdoc_pmsg(m, ln, pos, MANDOCERR_BADESCAPE);
+			if ( ! (MDOC_IGN_ESCAPE & m->pflags) && ! c)
+				return(0);
 			p += c - 1;
 			pos += c - 1;
 			continue;
 		}
 
-		c = mdoc_pmsg(mdoc, line, pos, MANDOCERR_BADESCAPE);
-		if ( ! (MDOC_IGN_ESCAPE & mdoc->pflags) && ! c)
-			return(c);
+		/* Replace the roff-defined string with our own. */
+
+		cpsz = strlen(res) + strlen(*pp) + 1;
+		cp = mandoc_malloc(cpsz);
+		*cp = '\0';
+
+		/* Force only p - *pp + '\0' chars. */
+		strlcat(cp, *pp, (size_t)(p - *pp + 1));
+		strlcat(cp, res, cpsz);
+		strlcat(cp, p + c + 1, cpsz);
+
+		cpsz = (size_t)(p - *pp);
+
+		free(*pp);
+		*pp = cp;
+
+		/* Remember to readjust our position. */
+
+		p = *pp + (int)cpsz - 1;
+		pos = (int)cpsz - 1;
 	}
 
 	return(1);
 }
-
-
 
 
 static int
Index: term.c
===================================================================
RCS file: /usr/vhosts/mdocml.bsd.lv/cvs/mdocml/term.c,v
retrieving revision 1.159
diff -u -r1.159 term.c
--- term.c	4 Jul 2010 22:04:04 -0000	1.159
+++ term.c	6 Jul 2010 23:06:02 -0000
@@ -379,11 +379,6 @@
 	size_t		 sz;
 
 	rhs = chars_a2res(p->symtab, word, len, &sz);
-	if (NULL == rhs) {
-		rhs = roff_getstrn(word, len);
-		if (rhs)
-			sz = strlen(rhs);
-	}
 	if (rhs)
 		encode(p, rhs, sz);
 }


* Re: roff_getstr() and input characters
       [not found] ` <20100706231643.GC32413@bramka.kerhand.co.uk>
@ 2010-07-07  8:49   ` Kristaps Dzonsons
  0 siblings, 0 replies; 2+ messages in thread
From: Kristaps Dzonsons @ 2010-07-07  8:49 UTC
  To: Jason McIntyre; +Cc: tech

>> The reason I want to air it with you (I know it works: I've tested it 
>> across all manuals) is that it also removes the check for isprint(), 
>> using strcspn() instead.  As you can see, the reject set covers only 
>> '\b', which we must prohibit or else we break output encoding; '\t' for 
>> non-literals (warning); and '\\' for the specials check.
>>
>> I argue for lifting the ASCII-constraint because (1) there's nothing in 
>> mdoc/groff/etc that disallows non-ASCII (e.g., Latin-1) characters and 
>> (2) it makes the code much cleaner.
>>
>> Thoughts?
>>
> 
> i don't really know what you mean, to be honest. you'll have to dumb
> down your question a bit, i'm afraid...

Jason, right now, mandoc spits out a warning for any character that is 
not printable ASCII.

This patch lifts this restriction, instead warning only about tabs and 
the "backspace" character.
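
A simplified side-by-side of the two policies, as a sketch only: the 
function names are hypothetical, the old check's ASCII_HYPH exception 
is left out, and escape handling (the '\\' case) is ignored.  The 
"literal" argument stands in for the MDOC_LITERAL flag.

```c
#include <ctype.h>

/*
 * Old policy (simplified; ASCII_HYPH exception omitted): warn on a
 * tab outside literal context and on anything isprint() rejects.
 */
static int
old_badchar(unsigned char c, int literal)
{
	if (c == '\t')
		return !literal;
	return !isprint(c);
}

/*
 * New policy: warn only on backspace, and on a tab outside literal
 * context.  All other bytes, including non-ASCII ones, are accepted.
 */
static int
new_badchar(unsigned char c, int literal)
{
	if (c == '\b')
		return 1;
	if (c == '\t')
		return !literal;
	return 0;
}
```

In the default "C" locale, a Latin-1 byte such as 0xE9 is flagged by 
old_badchar() but accepted by new_badchar(); both agree on tabs, and 
both flag backspace.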

We'd spoken about this before, but seeing it in action, I'm no longer 
sure.  The killer points are that -Tps will throw away all non-ASCII 
characters as it can't calculate their glyph widths, and -Thtml 
stipulates UTF-8 encoding, so anything but UTF-8 input will be gobbledygook.

In effect, once one uses a non-ASCII encoding, the rendered output will 
be irregular across output modes and, more importantly, user environment 
(terminals, etc.).  This is, in my opinion, a Bad Thing (tm).

Thoughts?

Kristaps
