tech@mandoc.bsd.lv
* Initial Unicode/UTF-8 patch.
@ 2011-05-13 14:47 Kristaps Dzonsons
  2011-05-13 15:13 ` Joerg Sonnenberger
  0 siblings, 1 reply; 6+ messages in thread
From: Kristaps Dzonsons @ 2011-05-13 14:47 UTC (permalink / raw)
  To: tech

[-- Attachment #1: Type: text/plain, Size: 1149 bytes --]

Hi,

This patch adds initial Unicode character support to mandoc (see the
attached screenshots).  It does not yet implement a -Tutf8 argument or
anything like it; this is entirely the backend.

Features:

  * Uses the \U'N' escape for Unicode, where N is a hexadecimal value.
    I don't know if this is standard.
    http://lists.gnu.org/archive/html/groff/2000-04/msg00037.html

  * Uses wchar.h to convert Unicode to UTF-8 and check columns etc.

  * Filters non-printable ASCII out of \U'' just like \N''.
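
The escape handling listed above can be sketched in isolation.  This is
a minimal illustration, not the code from the attached patch: the helper
name unicode_escape_value() is hypothetical, and the range check against
0x10FFFF and the isxdigit() scan are extra assumptions the patch does
not make.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of mandoc): parse the hex payload of a
 * \U'N' escape into a code point.  Returns -1 if the payload is empty,
 * too long, contains a non-hex character, is above 0x10FFFF, or names
 * a non-printable ASCII character.
 */
static long
unicode_escape_value(const char *seq, size_t len)
{
	char	 num[32];
	size_t	 i;
	long	 cp;

	if (len == 0 || len > sizeof(num) - 1)
		return -1;

	/* Reject signs, whitespace and "0x" prefixes that strtol accepts. */
	for (i = 0; i < len; i++)
		if (!isxdigit((unsigned char)seq[i]))
			return -1;

	memcpy(num, seq, len);
	num[len] = '\0';

	/* Overflow yields LONG_MAX, which the range check rejects. */
	cp = strtol(num, NULL, 16);
	if (cp < 0 || cp > 0x10FFFF)
		return -1;

	/* Same filter as for \N'': no control characters. */
	if (cp < 0x80 && !isprint((unsigned char)cp))
		return -1;

	return cp;
}
```

A caller would pass the escape payload and its length, then skip the
escape entirely when the result is negative.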

This patch is NOT complete.  It's a start and a proof of concept.  It
doesn't, for example, handle text decoration for UTF-8 output.

The correct solution for -Tascii, of course, is to have the termp buffer 
be a wchar_t array (or int, or whatever) instead of char.  This removes 
the penalty of converting to and from a UTF-8 string and makes us 
"natively" support Unicode (eat it, groff!).  This also makes the 
text-decoration easy and will simplify the logic in this patch.
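
To illustrate why a wide-character buffer simplifies things: with one
array element per character, column counting becomes a plain loop with
no byte-sequence scanning.  The sketch below is an assumption about how
that could look, not mandoc code.  The names width_fn, ascii_width() and
wbuf_columns() are hypothetical, and ascii_width() is only a stand-in
for a real width function.

```c
#include <stddef.h>
#include <wchar.h>

/* Width callback: columns occupied by one character, <= 0 if none. */
typedef int (*width_fn)(wchar_t);

/* Hypothetical stand-in for wcwidth(): printable ASCII is one column. */
static int
ascii_width(wchar_t c)
{
	return (c >= 0x20 && c < 0x7f) ? 1 : 0;
}

/*
 * Sum the display columns of a wide-character buffer.  Every element
 * is one whole character, so no lead/continuation byte scan is needed.
 */
static size_t
wbuf_columns(const wchar_t *buf, size_t len, width_fn width)
{
	size_t	 i, cols;
	int	 w;

	cols = 0;
	for (i = 0; i < len; i++) {
		w = width(buf[i]);
		if (w > 0)
			cols += (size_t)w;
	}
	return cols;
}
```

On a POSIX system, wcwidth() from <wchar.h> could be passed as the
callback in place of ascii_width(), after a call to setlocale().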

I was also surprised to find that -Thtml doesn't do "real" length
checking (see term_strlen()), which will need to be implemented as well.
I'll probably just abstract term_strlen() into out.c or whatever.

Please comment,

Kristaps

[-- Attachment #2: patch.utf8.txt --]
[-- Type: text/plain, Size: 6921 bytes --]

Index: html.c
===================================================================
RCS file: /usr/vhosts/mdocml.bsd.lv/cvs/mdocml/html.c,v
retrieving revision 1.137
diff -u -r1.137 html.c
--- html.c	30 Apr 2011 22:24:31 -0000	1.137
+++ html.c	13 May 2011 14:47:09 -0000
@@ -299,9 +299,10 @@
 print_encode(struct html *h, const char *p, int norecurse)
 {
 	size_t		 sz;
-	int		 len, nospace;
+	int		 len, nospace, wc;
 	const char	*seq;
 	enum mandoc_esc	 esc;
+	char		 num[32];
 	static const char rejs[6] = { '\\', '<', '>', '&', ASCII_HYPH, '\0' };
 
 	nospace = 0;
@@ -337,6 +338,24 @@
 			break;
 
 		switch (esc) {
+		case (ESCAPE_UNICODE):
+			/*
+			 * Unicode escape (hex value).
+			 * Put it into a fixed-size buffer then try
+			 * converting it with strtol into a proper
+			 * Unicode value (disallow bogus ASCII).
+			 * Finally, print the number as a decimal
+			 * numeric character reference.
+			 */
+			if (len > (int)sizeof(num) - 1)
+				break;
+			memcpy(num, seq, len);
+			num[len] = '\0';
+			wc = strtol(num, NULL, 16);
+			if (wc < 0x80 && ! isprint(wc))
+				break;
+			printf("&#%d;", wc);
+			break;
 		case (ESCAPE_NUMBERED):
 			print_num(h, seq, len);
 			break;
Index: main.c
===================================================================
RCS file: /usr/vhosts/mdocml.bsd.lv/cvs/mdocml/main.c,v
retrieving revision 1.161
diff -u -r1.161 main.c
--- main.c	31 Mar 2011 10:53:43 -0000	1.161
+++ main.c	13 May 2011 14:47:09 -0000
@@ -20,6 +20,7 @@
 #endif
 
 #include <assert.h>
+#include <locale.h>
 #include <stdio.h>
 #include <stdint.h>
 #include <stdlib.h>
@@ -82,6 +83,8 @@
 	struct curparse	 curp;
 	enum mparset	 type;
 	enum mandoclevel rc;
+
+	setlocale(LC_ALL, "");
 
 	progname = strrchr(argv[0], '/');
 	if (progname == NULL)
Index: mandoc.c
===================================================================
RCS file: /usr/vhosts/mdocml.bsd.lv/cvs/mdocml/mandoc.c,v
retrieving revision 1.49
diff -u -r1.49 mandoc.c
--- mandoc.c	30 Apr 2011 10:18:24 -0000	1.49
+++ mandoc.c	13 May 2011 14:47:09 -0000
@@ -223,6 +223,10 @@
 		/* FALLTHROUGH */
 	case ('S'):
 		/* FALLTHROUGH */
+	case ('U'):
+		if (ESCAPE_ERROR == gly)
+			gly = ESCAPE_UNICODE;
+		/* FALLTHROUGH */
 	case ('v'):
 		/* FALLTHROUGH */
 	case ('w'):
Index: mandoc.h
===================================================================
RCS file: /usr/vhosts/mdocml.bsd.lv/cvs/mdocml/mandoc.h,v
retrieving revision 1.74
diff -u -r1.74 mandoc.h
--- mandoc.h	30 Apr 2011 22:24:31 -0000	1.74
+++ mandoc.h	13 May 2011 14:47:09 -0000
@@ -299,6 +299,7 @@
 	ESCAPE_FONTROMAN, /* roman font mode */
 	ESCAPE_FONTPREV, /* previous font mode */
 	ESCAPE_NUMBERED, /* a numbered glyph */
+	ESCAPE_UNICODE,
 	ESCAPE_NOSPACE /* suppress space if the last on a line */
 };
 
Index: term.c
===================================================================
RCS file: /usr/vhosts/mdocml.bsd.lv/cvs/mdocml/term.c,v
retrieving revision 1.186
diff -u -r1.186 term.c
--- term.c	30 Apr 2011 22:24:31 -0000	1.186
+++ term.c	13 May 2011 14:47:09 -0000
@@ -27,6 +27,7 @@
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
+#include <wchar.h>
 
 #include "mandoc.h"
 #include "out.h"
@@ -39,6 +40,8 @@
 static	void		  adjbuf(struct termp *p, size_t);
 static	void		  encode(struct termp *, const char *, size_t);
 
+#define	UTF8B(x)	  ((x) & 128 && ((x) & 64))
+#define	UTF8C(x)	  ((x) & 128 && ! ((x) & 64))
 
 void
 term_free(struct termp *p)
@@ -128,10 +131,12 @@
 	size_t		 vend;	/* end of word visual position on output */
 	size_t		 bp;    /* visual right border position */
 	size_t		 dv;    /* temporary for visual pos calculations */
+	size_t		 k;
 	int		 j;     /* temporary loop index for p->buf */
 	int		 jhy;	/* last hyph before overflow w/r/t j */
 	size_t		 maxvis; /* output position of visible boundary */
 	size_t		 mmax; /* used in calculating bp */
+	wchar_t		 mb;	/* temporary mbyte used for unicode */
 
 	/*
 	 * First, establish the maximum columns of "visible" content.
@@ -191,7 +196,30 @@
 					ASCII_HYPH == p->buf[j])
 				jhy = j;
 
-			vend += (*p->width)(p, p->buf[j]);
+			/*
+			 * If we're a regular character, check width.
+			 * If a UTF-8 character, scan through to the end
+			 * of the UTF-8 byte-stream, convert the stream
+			 * to an int, then use wcwidth() to get its
+			 * output column width.
+			 */
+
+			if ( ! UTF8B(p->buf[j])) {
+				vend += (*p->width)(p, p->buf[j]);
+				continue;
+			}
+
+			j++;
+			k = 1;
+
+			while (j < (int)p->col && UTF8C(p->buf[j])) {
+				j++; 
+				k++;
+			}
+
+			if (mbtowc(&mb, &p->buf[j - k], k) > 0)
+				vend += term_len(p, wcwidth(mb));
+			j--;
 		}
 
 		/*
@@ -247,9 +275,30 @@
 				vbl = 0;
 			}
 
+			/*
+			 * If we're a hyphen, convert.
+			 * If we're a regular character, check width.
+			 * If a UTF-8 character, output each character
+			 * til the end of the single UTF-8 wchar, then
+			 * calculate its width using wcwidth().
+			 */
+
 			if (ASCII_HYPH == p->buf[i]) {
 				(*p->letter)(p, '-');
 				p->viscol += (*p->width)(p, '-');
+			} else if (UTF8B(p->buf[i])) {
+				(*p->letter)(p, p->buf[i++]);
+				k = 1;
+
+				while (i < (int)p->col && UTF8C(p->buf[i])) {
+					(*p->letter)(p, p->buf[i]);
+					k++;
+					i++;
+				}
+
+				if (mbtowc(&mb, &p->buf[i - k], k) > 0)
+					p->viscol += term_len(p, wcwidth(mb));
+				i--;
 			} else {
 				(*p->letter)(p, p->buf[i]);
 				p->viscol += (*p->width)(p, p->buf[i]);
@@ -455,8 +504,10 @@
 term_word(struct termp *p, const char *word)
 {
 	const char	*seq;
-	int		 sz;
+	int		 sz, wc;
 	size_t		 ssz;
+	char		 num[32],
+			 utf8[MB_CUR_MAX];
 	enum mandoc_esc	 esc;
 
 	if ( ! (TERMP_NOSPACE & p->flags)) {
@@ -491,6 +542,27 @@
 			break;
 
 		switch (esc) {
+		case (ESCAPE_UNICODE):
+			/*
+			 * Unicode escape (hex value).
+			 * Put it into a fixed-size buffer then try
+			 * converting it with strtol into a proper
+			 * Unicode value (disallow bogus ASCII).
+			 * Finally, use wctomb() to convert the number
+			 * to a UTF-8 byte-string.
+			 */
+			if (sz > (int)sizeof(num) - 1)
+				break;
+			memcpy(num, seq, sz);
+			num[sz] = '\0';
+			wc = strtol(num, NULL, 16);
+			if (wc < 0x80 && ! isprint(wc))
+				break;
+			sz = wctomb(utf8, (wchar_t)wc);
+			if (sz < 1)
+				break;
+			encode(p, utf8, sz);
+			break;
 		case (ESCAPE_NUMBERED):
 			numbered(p, seq, sz);
 			break;
@@ -601,6 +673,7 @@
 term_strlen(const struct termp *p, const char *cp)
 {
 	size_t		 sz, rsz, i;
+	char		 buf[32];
 	int		 ssz;
 	enum mandoc_esc	 esc;
 	const char	*seq, *rhs;
@@ -621,6 +694,16 @@
 				return(sz);
 
 			switch (esc) {
+			case (ESCAPE_UNICODE):
+				rhs = NULL;
+				if (ssz > (int)sizeof(buf) - 1)
+					break;
+				memcpy(buf, seq, ssz);
+				buf[ssz] = '\0';
+				ssz = wcwidth(strtol(buf, NULL, 16));
+				if (ssz > 0)
+					sz += term_len(p, ssz);
+				break;
 			case (ESCAPE_PREDEF):
 				rhs = mchars_res2str
 					(p->symtab, seq, ssz, &rsz);

[-- Attachment #3: screen.png --]
[-- Type: image/png, Size: 29257 bytes --]

[-- Attachment #4: screen2.png --]
[-- Type: image/png, Size: 94439 bytes --]


* Re: Initial Unicode/UTF-8 patch.
  2011-05-13 14:47 Initial Unicode/UTF-8 patch Kristaps Dzonsons
@ 2011-05-13 15:13 ` Joerg Sonnenberger
  2011-05-13 15:19   ` Kristaps Dzonsons
  0 siblings, 1 reply; 6+ messages in thread
From: Joerg Sonnenberger @ 2011-05-13 15:13 UTC (permalink / raw)
  To: tech

On Fri, May 13, 2011 at 10:47:38AM -0400, Kristaps Dzonsons wrote:
>  * Uses wchar.h to convert Unicode to UTF-8 and check columns etc.

This part is plainly wrong. Do not assume that wchar_t == Unicode Code
Point, that's broken. Convert from UTF-8 to the locale's character set
(see nl_langinfo and CODESET) using iconv.

Joerg
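
The CODESET lookup mentioned above can be sketched as follows.  This is
an illustration only: locale_is_utf8() is a hypothetical helper, and the
exact spelling of the codeset name ("UTF-8" here) is an assumption that
may differ between systems.

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/*
 * Hypothetical helper: report whether the current locale's character
 * set is UTF-8.  The caller is expected to have run setlocale() first;
 * without it the program stays in the "C" locale.  Some systems may
 * spell the name differently (for example "utf8"), which this simple
 * comparison does not cover.
 */
static int
locale_is_utf8(void)
{
	const char	*cs;

	cs = nl_langinfo(CODESET);
	return cs != NULL && strcmp(cs, "UTF-8") == 0;
}
```

The same codeset string is what would be handed to iconv_open() as the
target encoding when converting to the locale's character set.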
--
 To unsubscribe send an email to tech+unsubscribe@mdocml.bsd.lv


* Re: Initial Unicode/UTF-8 patch.
  2011-05-13 15:13 ` Joerg Sonnenberger
@ 2011-05-13 15:19   ` Kristaps Dzonsons
  2011-05-13 15:23     ` Joerg Sonnenberger
  0 siblings, 1 reply; 6+ messages in thread
From: Kristaps Dzonsons @ 2011-05-13 15:19 UTC (permalink / raw)
  To: tech

On 13/05/2011 11:13, Joerg Sonnenberger wrote:
> On Fri, May 13, 2011 at 10:47:38AM -0400, Kristaps Dzonsons wrote:
>>   * Uses wchar.h to convert Unicode to UTF-8 and check columns etc.
>
> This part is plainly wrong. Do not assume that wchar_t == Unicode Code
> Point, that's broken. Convert from UTF-8 to the locale's character set
> (see nl_langinfo and CODESET) using iconv.

Joerg,

Is there a non-iconv way to translate between a Unicode codepoint and 
the internal representation of wchar_t?

(I'd assumed __STDC_ISO_10646__ in this proof of concept.)
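
For reference, the C99 macro that expresses this assumption is spelled
__STDC_ISO_10646__.  When an implementation defines it, wchar_t values
are ISO/IEC 10646 code points.  A minimal compile-time check, with the
hypothetical helper name wchar_is_ucs(), might look like this:

```c
#include <wchar.h>

/*
 * Hypothetical helper: returns 1 if the implementation promises that
 * wchar_t holds ISO/IEC 10646 (Unicode) code points, 0 if it makes no
 * such promise.  A return of 0 does not prove the opposite; it only
 * means the assumption cannot be relied upon.
 */
static int
wchar_is_ucs(void)
{
#ifdef __STDC_ISO_10646__
	return 1;
#else
	return 0;
#endif
}
```

Code that casts a parsed code point directly to wchar_t is only
portable to platforms where this returns 1.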

* Re: Initial Unicode/UTF-8 patch.
  2011-05-13 15:19   ` Kristaps Dzonsons
@ 2011-05-13 15:23     ` Joerg Sonnenberger
  2011-05-13 15:43       ` Kristaps Dzonsons
  0 siblings, 1 reply; 6+ messages in thread
From: Joerg Sonnenberger @ 2011-05-13 15:23 UTC (permalink / raw)
  To: tech

On Fri, May 13, 2011 at 11:19:54AM -0400, Kristaps Dzonsons wrote:
> On 13/05/2011 11:13, Joerg Sonnenberger wrote:
> >On Fri, May 13, 2011 at 10:47:38AM -0400, Kristaps Dzonsons wrote:
> >>  * Uses wchar.h to convert Unicode to UTF-8 and check columns etc.
> >
> >This part is plainly wrong. Do not assume that wchar_t == Unicode Code
> >Point, that's broken. Convert from UTF-8 to the locale's character set
> >(see nl_langinfo and CODESET) using iconv.
> 
> Joerg,
> 
> Is there a non-iconv way to translate between a Unicode codepoint
> and the internal representation of wchar_t?
> 
> (I'd assumed __STDC_ISO_10646__ in this proof of concept.)

No portable one.

Joerg

* Re: Initial Unicode/UTF-8 patch.
  2011-05-13 15:23     ` Joerg Sonnenberger
@ 2011-05-13 15:43       ` Kristaps Dzonsons
  2011-05-13 16:03         ` Joerg Sonnenberger
  0 siblings, 1 reply; 6+ messages in thread
From: Kristaps Dzonsons @ 2011-05-13 15:43 UTC (permalink / raw)
  To: tech

>>>>   * Uses wchar.h to convert Unicode to UTF-8 and check columns etc.
>>>
>>> This part is plainly wrong. Do not assume that wchar_t == Unicode Code
>>> Point, that's broken. Convert from UTF-8 to the locale's character set
>>> (see nl_langinfo and CODESET) using iconv.
>>
>> Joerg,
>>
>> Is there a non-iconv way to translate between a Unicode codepoint
>> and the internal representation of wchar_t?
>>
>> (I'd assumed __STDC_ISO_10646__ in this proof of concept.)
>
> No portable one.

...then from a Unicode codepoint to UTF-8?  iconv only does string 
arrays.  I can always do the bit-twiddling myself, but I'd rather an 
official library handle it.
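
For comparison, the bit-twiddling alternative is short.  The sketch
below follows the standard UTF-8 encoding rules.  The function name
utf8_encode() is hypothetical and nothing here is taken from the patch.

```c
#include <stddef.h>

/*
 * Hypothetical helper: encode one Unicode code point as UTF-8.
 * Writes up to four bytes into out and returns the byte count, or 0
 * for surrogates (U+D800..U+DFFF) and values above U+10FFFF.
 */
static size_t
utf8_encode(unsigned long cp, unsigned char out[4])
{
	if (cp < 0x80) {
		out[0] = (unsigned char)cp;
		return 1;
	}
	if (cp < 0x800) {
		out[0] = (unsigned char)(0xC0 | (cp >> 6));
		out[1] = (unsigned char)(0x80 | (cp & 0x3F));
		return 2;
	}
	if (cp >= 0xD800 && cp <= 0xDFFF)
		return 0;
	if (cp < 0x10000) {
		out[0] = (unsigned char)(0xE0 | (cp >> 12));
		out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (unsigned char)(0x80 | (cp & 0x3F));
		return 3;
	}
	if (cp <= 0x10FFFF) {
		out[0] = (unsigned char)(0xF0 | (cp >> 18));
		out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
		out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[3] = (unsigned char)(0x80 | (cp & 0x3F));
		return 4;
	}
	return 0;
}
```

This avoids any dependency on the locale or on wchar_t, at the cost of
always producing UTF-8 regardless of the terminal's character set.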

* Re: Initial Unicode/UTF-8 patch.
  2011-05-13 15:43       ` Kristaps Dzonsons
@ 2011-05-13 16:03         ` Joerg Sonnenberger
  0 siblings, 0 replies; 6+ messages in thread
From: Joerg Sonnenberger @ 2011-05-13 16:03 UTC (permalink / raw)
  To: tech

On Fri, May 13, 2011 at 11:43:33AM -0400, Kristaps Dzonsons wrote:
> ...then from a Unicode codepoint to UTF-8?  iconv only does string
> arrays.  I can always do the bit-twiddling myself, but I'd rather an
> official library handle it.

Try UCS-4LE / UCS-4BE or UCS-2LE / UCS-2BE depending on what exactly you
use.

Joerg
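
A minimal sketch of the UCS-4 suggestion above, assuming the local
iconv implementation recognises the encoding names "UCS-4LE" and
"UTF-8".  The helper name codepoint_to_utf8() is hypothetical.  A real
caller would probably keep the conversion descriptor open instead of
reopening it for each character, and would pass the locale's codeset
rather than a fixed "UTF-8" target.

```c
#include <iconv.h>
#include <stddef.h>

/*
 * Hypothetical helper: convert a single code point to UTF-8 with
 * iconv(3).  Returns the number of bytes written to out, or 0 on
 * failure (unknown encoding name, invalid code point, short buffer).
 */
static size_t
codepoint_to_utf8(unsigned long cp, char *out, size_t outsz)
{
	iconv_t		 cd;
	unsigned char	 in[4];
	char		*inp, *outp;
	size_t		 inleft, outleft, rc;

	/* Serialise as little-endian UCS-4, whatever the host order. */
	in[0] = (unsigned char)(cp & 0xFF);
	in[1] = (unsigned char)((cp >> 8) & 0xFF);
	in[2] = (unsigned char)((cp >> 16) & 0xFF);
	in[3] = (unsigned char)((cp >> 24) & 0xFF);

	cd = iconv_open("UTF-8", "UCS-4LE");
	if (cd == (iconv_t)-1)
		return 0;

	inp = (char *)in;
	inleft = sizeof(in);
	outp = out;
	outleft = outsz;

	rc = iconv(cd, &inp, &inleft, &outp, &outleft);
	iconv_close(cd);

	if (rc == (size_t)-1 || inleft != 0)
		return 0;
	return outsz - outleft;
}
```

Building the input one byte at a time is what lets a single code point
be fed to an interface designed for strings.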
