discuss@mandoc.bsd.lv
* Discarding non-ASCII input
@ 2012-02-05  0:45 Joerg Sonnenberger
  2012-02-05 10:17 ` Ingo Schwarze
  0 siblings, 1 reply; 5+ messages in thread
From: Joerg Sonnenberger @ 2012-02-05  0:45 UTC (permalink / raw)
  To: discuss

Hi all,
at the moment we are discarding any non-ASCII characters. This turns a
bunch of syntactically valid documents into complete garbage, e.g. by
removing the arguments for .SH macros. I think it is more reasonable to
replace them with "safe" garbage, as iconv does on most platforms.

What do you think of the attached patch?

Joerg
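
To make the difference between dropping and replacing concrete, here is
a minimal sketch in C. It is not the mandoc code: sanitize_line() and
is_clean() are hypothetical names, and the check only approximates the
isascii()/isgraph()/isblank() test that read.c applies per byte.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of mandoc: accept printable ASCII
 * plus space and tab, reject everything else.
 */
static int
is_clean(unsigned char c)
{
	return c < 0x80 && (isgraph(c) || c == ' ' || c == '\t');
}

/*
 * Copy len bytes from src to dst, which must have room for len + 1
 * bytes.  Rejected bytes are dropped when replace is 0 and turned
 * into '?' otherwise.  Returns the length of the result.
 */
static size_t
sanitize_line(const char *src, size_t len, char *dst, int replace)
{
	size_t i, pos = 0;

	assert(src != NULL && dst != NULL);
	for (i = 0; i < len; i++) {
		unsigned char c = (unsigned char)src[i];

		if (is_clean(c))
			dst[pos++] = (char)c;
		else if (replace)
			dst[pos++] = '?';
	}
	dst[pos] = '\0';
	return pos;
}
```

For a line consisting of ".SH " followed by three non-ASCII bytes,
dropping leaves ".SH " with no argument at all, while replacing yields
".SH ???", which still has one argument.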

[-- Attachment #2: read.c.diff --]
[-- Type: text/x-diff, Size: 1073 bytes --]

Index: read.c
===================================================================
RCS file: /home/joerg/cvsroot/mdocml/read.c,v
retrieving revision 1.26
diff -u -p -r1.26 read.c
--- read.c	7 Nov 2011 01:24:40 -0000	1.26
+++ read.c	5 Feb 2012 00:31:33 -0000
@@ -325,9 +325,9 @@ mparse_buf_r(struct mparse *curp, struct
 			 * Warn about bogus characters.  If you're using
 			 * non-ASCII encoding, you're screwing your
 			 * readers.  Since I'd rather this not happen,
-			 * I'll be helpful and drop these characters so
-			 * we don't display gibberish.  Note to manual
-			 * writers: use special characters.
+			 * I'll be helpful and replace these characters
+			 * with "?", so we don't display gibberish.
+			 * Note to manual writers: use special characters.
 			 */
 
 			c = (unsigned char) blk.buf[i];
@@ -337,6 +337,9 @@ mparse_buf_r(struct mparse *curp, struct
 				mandoc_msg(MANDOCERR_BADCHAR, curp,
 						curp->line, pos, "ignoring byte");
 				i++;
+				if (pos >= (int)ln.sz)
+					resize_buf(&ln, 256);
+				ln.buf[pos++] = '?';
 				continue;
 			}
 


* Re: Discarding non-ASCII input
  2012-02-05  0:45 Discarding non-ASCII input Joerg Sonnenberger
@ 2012-02-05 10:17 ` Ingo Schwarze
  2012-02-05 11:58   ` Joerg Sonnenberger
  0 siblings, 1 reply; 5+ messages in thread
From: Ingo Schwarze @ 2012-02-05 10:17 UTC (permalink / raw)
  To: discuss

Hi Joerg,

Joerg Sonnenberger wrote on Sun, Feb 05, 2012 at 01:45:42AM +0100:

> at the moment we are discarding any non-ASCII characters. This turns a
> bunch of syntactically valid documents into complete garbage, e.g. by
> removing the arguments for .SH macros. I think it is more reasonable to
> replace them with "safe" garbage, as iconv does on most platforms.
> 
> What do you think of the attached patch?

I think I like the idea.

You might also wish to replace the "ignoring byte" by NULL
when changing this.

Thanks,
  Ingo



--
 To unsubscribe send an email to discuss+unsubscribe@mdocml.bsd.lv


* Re: Discarding non-ASCII input
  2012-02-05 10:17 ` Ingo Schwarze
@ 2012-02-05 11:58   ` Joerg Sonnenberger
  2012-02-05 12:06     ` Ingo Schwarze
  0 siblings, 1 reply; 5+ messages in thread
From: Joerg Sonnenberger @ 2012-02-05 11:58 UTC (permalink / raw)
  To: discuss

On Sun, Feb 05, 2012 at 11:17:44AM +0100, Ingo Schwarze wrote:
> You might also wish to replace the "ignoring byte" by NULL
> when changing this.

If you mean \0, that wouldn't help much, since it still doesn't
count as an argument. The idea is to translate something that is most
likely an argument into something that still counts as one.

Joerg
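
The argument-counting point can be sketched as follows. This is a
simplified illustration, not mandoc's parser: count_args() is a
hypothetical function that only splits on blanks and tabs, whereas real
roff parsing also handles quoting and escape sequences.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: count the blank-separated words that follow
 * the first token (control character plus macro name) on a line.
 */
static size_t
count_args(const char *line)
{
	const char *p = line;
	size_t n = 0;

	assert(line != NULL);

	/* Skip the macro itself, e.g. ".SH". */
	while (*p != '\0' && *p != ' ' && *p != '\t')
		p++;

	for (;;) {
		/* Skip the blanks in front of the next word. */
		while (*p == ' ' || *p == '\t')
			p++;
		if (*p == '\0')
			break;
		n++;
		/* Skip the word itself. */
		while (*p != '\0' && *p != ' ' && *p != '\t')
			p++;
	}
	return n;
}
```

Under these assumptions, ".SH ???" has one argument, while ".SH "
has none. A line whose bad bytes were turned into \0 also has none,
because the C string simply ends at the first \0.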

* Re: Discarding non-ASCII input
  2012-02-05 11:58   ` Joerg Sonnenberger
@ 2012-02-05 12:06     ` Ingo Schwarze
  2012-02-05 14:35       ` Joerg Sonnenberger
  0 siblings, 1 reply; 5+ messages in thread
From: Ingo Schwarze @ 2012-02-05 12:06 UTC (permalink / raw)
  To: discuss

Hi Joerg,

Joerg Sonnenberger wrote on Sun, Feb 05, 2012 at 12:58:03PM +0100:
> On Sun, Feb 05, 2012 at 11:17:44AM +0100, Ingo Schwarze wrote:

>> You might also wish to replace the "ignoring byte" by NULL
>> when changing this.

> If you mean \0, that wouldn't help much, since it still doesn't
> count as an argument. The idea is to translate something that is most
> likely an argument into something that still counts as one.

Sorry for being too terse and unclear; I just meant the error
message a few lines above your change, as in the following UNTESTED
patch, to be added to what you already have.

I'm fine with the '?' you propose and don't want to change that
to '\0'.

Thanks,
  Ingo


Index: read.c
===================================================================
RCS file: /cvs/src/usr.bin/mandoc/read.c,v
retrieving revision 1.5
diff -u -p -r1.5 read.c
--- read.c	5 Nov 2011 16:02:18 -0000	1.5
+++ read.c	5 Feb 2012 12:02:04 -0000
@@ -325,7 +325,7 @@ mparse_buf_r(struct mparse *curp, struct
 			if ( ! (isascii(c) && 
 					(isgraph(c) || isblank(c)))) {
 				mandoc_msg(MANDOCERR_BADCHAR, curp,
-						curp->line, pos, "ignoring byte");
+						curp->line, pos, NULL);
 				i++;
 				continue;
 			}

* Re: Discarding non-ASCII input
  2012-02-05 12:06     ` Ingo Schwarze
@ 2012-02-05 14:35       ` Joerg Sonnenberger
  0 siblings, 0 replies; 5+ messages in thread
From: Joerg Sonnenberger @ 2012-02-05 14:35 UTC (permalink / raw)
  To: discuss

On Sun, Feb 05, 2012 at 01:06:10PM +0100, Ingo Schwarze wrote:
> Hi Joerg,
> 
> Joerg Sonnenberger wrote on Sun, Feb 05, 2012 at 12:58:03PM +0100:
> > On Sun, Feb 05, 2012 at 11:17:44AM +0100, Ingo Schwarze wrote:
> 
> >> You might also wish to replace the "ignoring byte" by NULL
> >> when changing this.
> 
> > If you mean \0, that wouldn't help much, since it still doesn't
> > count as an argument. The idea is to translate something that is most
> > likely an argument into something that still counts as one.
> 
> Sorry for being too terse and unclear; I just meant the error
> message a few lines above your change, as in the following UNTESTED
> patch, to be added to what you already have.

Ah, that makes more sense. Fine with me, the message is verbose enough.

Joerg
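
The idea of an optional detail string, where NULL means "the standard
text for this error is enough", can be sketched like this. The function
format_diag() and the message texts are hypothetical placeholders; this
is not the signature or the output format of mandoc_msg().

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical formatter: write "line:col: summary" into buf and
 * append ": detail" only when a detail string is supplied.
 * Returns what snprintf() returns.
 */
static int
format_diag(char *buf, size_t sz, int line, int col,
    const char *summary, const char *detail)
{
	assert(buf != NULL && summary != NULL);

	if (detail != NULL)
		return snprintf(buf, sz, "%d:%d: %s: %s",
		    line, col, summary, detail);
	return snprintf(buf, sz, "%d:%d: %s", line, col, summary);
}
```

With a NULL detail the caller gets only the summary, which avoids a
stale note such as "ignoring byte" once the byte is replaced rather
than ignored.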
