* Initial Unicode/UTF-8 patch.
From: Kristaps Dzonsons @ 2011-05-13 14:47 UTC
To: tech
[-- Attachment #1: Type: text/plain, Size: 1149 bytes --]
Hi,
This patch adds initial Unicode character support to mandoc. See the
attached screenshots. It does not implement the -Tutf8 argument or
anything else user-facing; this is entirely the backend.
Features:
* Uses the \U'N' escape for Unicode. I don't know if this is standard.
http://lists.gnu.org/archive/html/groff/2000-04/msg00037.html
* Uses wchar.h to convert Unicode to UTF-8 and check columns etc.
* Filters non-printable ASCII values from \U'' just as for \N''.
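A minimal sketch of the escape handling described in the list above, for readers who do not want to dig through the diff. The helper name parse_unicode_escape() is hypothetical and not part of the patch; the end-pointer, range, and surrogate checks are extra assumptions on top of what the patch does.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the patch: parse the hex payload of
 * a \U'N' escape (seq points at N, len is its length) into *cp.
 * Returns 0 on success, -1 if the value should be dropped.
 */
int
parse_unicode_escape(const char *seq, size_t len, long *cp)
{
	char	 num[32];
	char	*end;
	long	 v;

	assert(seq != NULL && cp != NULL);

	/* Copy into a NUL-terminated local buffer, as the patch does. */
	if (len == 0 || len > sizeof(num) - 1)
		return -1;
	memcpy(num, seq, len);
	num[len] = '\0';

	errno = 0;
	v = strtol(num, &end, 16);
	if (errno != 0 || *end != '\0')
		return -1;

	/* Assumption: also reject out-of-range values and surrogates. */
	if (v < 0 || v > 0x10FFFF || (v >= 0xD800 && v <= 0xDFFF))
		return -1;

	/* As in the patch: disallow non-printable ASCII. */
	if (v < 0x80 && !isprint((unsigned char)v))
		return -1;

	*cp = v;
	return 0;
}
```

With this sketch, a payload of "263A" yields the codepoint 0x263A, while "07" (BEL) is rejected.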
This patch is NOT complete. It's a start and a proof of concept. It
doesn't, for example, handle text decoration for UTF-8 characters.
The correct solution for -Tascii, of course, is to have the termp buffer
be a wchar_t array (or int, or whatever) instead of char. This removes
the penalty of converting to and from a UTF-8 string and makes us
"natively" support Unicode (eat it, groff!). This also makes the
text-decoration easy and will simplify the logic in this patch.
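For illustration only, here is the kind of UTF-8 to codepoint conversion that a wider buffer would make unnecessary in the output loop. This is a sketch under assumptions, not code from the patch: utf8_decode() is a hypothetical name, and it decodes by hand rather than through mbtowc() so that it does not depend on the locale.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: decode one UTF-8 sequence from s (n bytes
 * available) into *cp.  Returns the number of bytes consumed, or 0
 * if the sequence is truncated or malformed.
 */
size_t
utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
	/* Smallest codepoint that legitimately needs len bytes. */
	static const uint32_t	 min[5] = { 0, 0, 0x80, 0x800, 0x10000 };
	size_t			 i, len;
	uint32_t		 v;

	assert(s != NULL && cp != NULL);

	if (n == 0)
		return 0;
	if (s[0] < 0x80) {
		*cp = s[0];
		return 1;
	} else if ((s[0] & 0xE0) == 0xC0) {
		len = 2;
		v = s[0] & 0x1F;
	} else if ((s[0] & 0xF0) == 0xE0) {
		len = 3;
		v = s[0] & 0x0F;
	} else if ((s[0] & 0xF8) == 0xF0) {
		len = 4;
		v = s[0] & 0x07;
	} else
		return 0;	/* stray continuation or invalid lead byte */

	if (n < len)
		return 0;

	/* Each continuation byte is 10xxxxxx and carries six bits. */
	for (i = 1; i < len; i++) {
		if ((s[i] & 0xC0) != 0x80)
			return 0;
		v = (v << 6) | (s[i] & 0x3F);
	}

	/* Reject overlong forms, surrogates, and out-of-range values. */
	if (v < min[len] || v > 0x10FFFF || (v >= 0xD800 && v <= 0xDFFF))
		return 0;

	*cp = v;
	return len;
}
```

A buffer of decoded codepoints would let the width and decoration logic work on one element per character instead of re-scanning byte sequences.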
I was also surprised to find that -Thtml doesn't do "real" length
checking (see term_strlen()), which will need to be implemented as well.
I'll probably just abstract term_strlen() into out.c or whatever.
Please comment,
Kristaps
[-- Attachment #2: patch.utf8.txt --]
[-- Type: text/plain, Size: 6921 bytes --]
Index: html.c
===================================================================
RCS file: /usr/vhosts/mdocml.bsd.lv/cvs/mdocml/html.c,v
retrieving revision 1.137
diff -u -r1.137 html.c
--- html.c 30 Apr 2011 22:24:31 -0000 1.137
+++ html.c 13 May 2011 14:47:09 -0000
@@ -299,9 +299,10 @@
print_encode(struct html *h, const char *p, int norecurse)
{
size_t sz;
- int len, nospace;
+ int len, nospace, wc;
const char *seq;
enum mandoc_esc esc;
+ char num[32];
static const char rejs[6] = { '\\', '<', '>', '&', ASCII_HYPH, '\0' };
nospace = 0;
@@ -337,6 +338,24 @@
break;
switch (esc) {
+ case (ESCAPE_UNICODE):
+ /*
+ * Unicode escape (hex value).
+ * Copy it into a local buffer, then convert
+ * it with strtol into a Unicode codepoint
+ * (disallowing non-printable ASCII).
+ * Finally, print it as a decimal HTML
+ * numeric character reference.
+ */
+ if (len > (int)sizeof(num) - 1)
+ break;
+ memcpy(num, seq, len);
+ num[len] = '\0';
+ wc = strtol(num, NULL, 16);
+ if (wc < 0x80 && ! isprint(wc))
+ break;
+ printf("&#%d;", wc);
+ break;
case (ESCAPE_NUMBERED):
print_num(h, seq, len);
break;
Index: main.c
===================================================================
RCS file: /usr/vhosts/mdocml.bsd.lv/cvs/mdocml/main.c,v
retrieving revision 1.161
diff -u -r1.161 main.c
--- main.c 31 Mar 2011 10:53:43 -0000 1.161
+++ main.c 13 May 2011 14:47:09 -0000
@@ -20,6 +20,7 @@
#endif
#include <assert.h>
+#include <locale.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
@@ -82,6 +83,8 @@
struct curparse curp;
enum mparset type;
enum mandoclevel rc;
+
+ setlocale(LC_ALL, "");
progname = strrchr(argv[0], '/');
if (progname == NULL)
Index: mandoc.c
===================================================================
RCS file: /usr/vhosts/mdocml.bsd.lv/cvs/mdocml/mandoc.c,v
retrieving revision 1.49
diff -u -r1.49 mandoc.c
--- mandoc.c 30 Apr 2011 10:18:24 -0000 1.49
+++ mandoc.c 13 May 2011 14:47:09 -0000
@@ -223,6 +223,10 @@
/* FALLTHROUGH */
case ('S'):
/* FALLTHROUGH */
+ case ('U'):
+ if (ESCAPE_ERROR == gly)
+ gly = ESCAPE_UNICODE;
+ /* FALLTHROUGH */
case ('v'):
/* FALLTHROUGH */
case ('w'):
Index: mandoc.h
===================================================================
RCS file: /usr/vhosts/mdocml.bsd.lv/cvs/mdocml/mandoc.h,v
retrieving revision 1.74
diff -u -r1.74 mandoc.h
--- mandoc.h 30 Apr 2011 22:24:31 -0000 1.74
+++ mandoc.h 13 May 2011 14:47:09 -0000
@@ -299,6 +299,7 @@
ESCAPE_FONTROMAN, /* roman font mode */
ESCAPE_FONTPREV, /* previous font mode */
ESCAPE_NUMBERED, /* a numbered glyph */
+ ESCAPE_UNICODE, /* a Unicode codepoint */
ESCAPE_NOSPACE /* suppress space if the last on a line */
};
Index: term.c
===================================================================
RCS file: /usr/vhosts/mdocml.bsd.lv/cvs/mdocml/term.c,v
retrieving revision 1.186
diff -u -r1.186 term.c
--- term.c 30 Apr 2011 22:24:31 -0000 1.186
+++ term.c 13 May 2011 14:47:09 -0000
@@ -27,6 +27,7 @@
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
+#include <wchar.h>
#include "mandoc.h"
#include "out.h"
@@ -39,6 +40,8 @@
static void adjbuf(struct termp *p, size_t);
static void encode(struct termp *, const char *, size_t);
+#define UTF8B(x) ((x) & 128 && ((x) & 64)) /* 11xxxxxx: lead byte */
+#define UTF8C(x) ((x) & 128 && ! ((x) & 64)) /* 10xxxxxx: continuation */
void
term_free(struct termp *p)
@@ -128,10 +131,12 @@
size_t vend; /* end of word visual position on output */
size_t bp; /* visual right border position */
size_t dv; /* temporary for visual pos calculations */
+ size_t k;
int j; /* temporary loop index for p->buf */
int jhy; /* last hyph before overflow w/r/t j */
size_t maxvis; /* output position of visible boundary */
size_t mmax; /* used in calculating bp */
+ wchar_t mb; /* temporary mbyte used for unicode */
/*
* First, establish the maximum columns of "visible" content.
@@ -191,7 +196,30 @@
ASCII_HYPH == p->buf[j])
jhy = j;
- vend += (*p->width)(p, p->buf[j]);
+ /*
+ * If we're a regular character, check width.
+ * If a UTF-8 character, scan through to the end
+ * of the UTF-8 byte-stream, convert the stream
+ * to an int, then use wcwidth() to get its
+ * output column width.
+ */
+
+ if ( ! UTF8B(p->buf[j])) {
+ vend += (*p->width)(p, p->buf[j]);
+ continue;
+ }
+
+ j++;
+ k = 1;
+
+ while (j < (int)p->col && UTF8C(p->buf[j])) {
+ j++;
+ k++;
+ }
+
+ if (mbtowc(&mb, &p->buf[j - k], k) > 0)
+ vend += term_len(p, wcwidth(mb));
+ j--;
}
/*
@@ -247,9 +275,30 @@
vbl = 0;
}
+ /*
+ * If we're a hyphen, convert.
+ * If we're a regular character, check width.
+ * If a UTF-8 character, output each character
+ * til the end of the single UTF-8 wchar, then
+ * calculate its width using wcwidth().
+ */
+
if (ASCII_HYPH == p->buf[i]) {
(*p->letter)(p, '-');
p->viscol += (*p->width)(p, '-');
+ } else if (UTF8B(p->buf[i])) {
+ (*p->letter)(p, p->buf[i++]);
+ k = 1;
+
+ while (i < (int)p->col && UTF8C(p->buf[i])) {
+ (*p->letter)(p, p->buf[i]);
+ k++;
+ i++;
+ }
+
+ if (mbtowc(&mb, &p->buf[i - k], k) > 0)
+ p->viscol += term_len(p, wcwidth(mb));
+ i--;
} else {
(*p->letter)(p, p->buf[i]);
p->viscol += (*p->width)(p, p->buf[i]);
@@ -455,8 +504,10 @@
term_word(struct termp *p, const char *word)
{
const char *seq;
- int sz;
+ int sz, wc;
size_t ssz;
+ char num[32],
+ utf8[MB_CUR_MAX];
enum mandoc_esc esc;
if ( ! (TERMP_NOSPACE & p->flags)) {
@@ -491,6 +542,27 @@
break;
switch (esc) {
+ case (ESCAPE_UNICODE):
+ /*
+ * Unicode escape (hex value).
+ * Copy it into a local buffer, then convert
+ * it with strtol into a Unicode codepoint
+ * (disallowing non-printable ASCII).
+ * Finally, use wctomb() to convert the number
+ * to a multibyte string (UTF-8 in a UTF-8 locale).
+ */
+ if (sz > (int)sizeof(num) - 1)
+ break;
+ memcpy(num, seq, sz);
+ num[sz] = '\0';
+ wc = strtol(num, NULL, 16);
+ if (wc < 0x80 && ! isprint(wc))
+ break;
+ sz = wctomb(utf8, (wchar_t)wc);
+ if (sz < 1)
+ break;
+ encode(p, utf8, sz);
+ break;
case (ESCAPE_NUMBERED):
numbered(p, seq, sz);
break;
@@ -601,6 +673,7 @@
term_strlen(const struct termp *p, const char *cp)
{
size_t sz, rsz, i;
+ char buf[32];
int ssz;
enum mandoc_esc esc;
const char *seq, *rhs;
@@ -621,6 +694,16 @@
return(sz);
switch (esc) {
+ case (ESCAPE_UNICODE):
+ rhs = NULL;
+ if (ssz > (int)sizeof(buf) - 1)
+ break;
+ memcpy(buf, seq, ssz);
+ buf[ssz] = '\0';
+ ssz = wcwidth(strtol(buf, NULL, 16));
+ if (ssz > 0)
+ sz += term_len(p, ssz);
+ break;
case (ESCAPE_PREDEF):
rhs = mchars_res2str
(p->symtab, seq, ssz, &rsz);
[-- Attachment #3: screen.png --]
[-- Type: image/png, Size: 29257 bytes --]
[-- Attachment #4: screen2.png --]
[-- Type: image/png, Size: 94439 bytes --]
* Re: Initial Unicode/UTF-8 patch.
From: Joerg Sonnenberger @ 2011-05-13 15:13 UTC
To: tech
On Fri, May 13, 2011 at 10:47:38AM -0400, Kristaps Dzonsons wrote:
> * Uses wchar.h to convert Unicode to UTF-8 and check columns etc.
This part is plainly wrong. Do not assume that wchar_t == Unicode Code
Point, that's broken. Convert from UTF-8 to the locale's character set
(see nl_langinfo and CODESET) using iconv.
Joerg
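A rough sketch of the approach suggested above, using only the standard iconv_open(), iconv(), iconv_close(), and nl_langinfo() interfaces. The function names utf8_to_codeset() and utf8_to_locale() are hypothetical, and the cast on the input pointer assumes an iconv() prototype taking char ** (some older systems declare const char ** instead).

```c
#include <assert.h>
#include <iconv.h>
#include <langinfo.h>
#include <stddef.h>

/*
 * Hypothetical helper: convert insz bytes of UTF-8 at in to the
 * charset named by tocode, writing at most outsz - 1 bytes plus a
 * terminating NUL.  Returns bytes written, or (size_t)-1 on failure.
 */
size_t
utf8_to_codeset(const char *tocode, const char *in, size_t insz,
    char *out, size_t outsz)
{
	iconv_t	 cd;
	char	*ip = (char *)in;	/* iconv() does not modify the input */
	char	*op = out;
	size_t	 left;

	assert(tocode != NULL && in != NULL && out != NULL && outsz > 0);

	cd = iconv_open(tocode, "UTF-8");
	if (cd == (iconv_t)-1)
		return (size_t)-1;

	left = outsz - 1;
	/* Convert, then flush any shift state for stateful encodings. */
	if (iconv(cd, &ip, &insz, &op, &left) == (size_t)-1 ||
	    iconv(cd, NULL, NULL, &op, &left) == (size_t)-1) {
		iconv_close(cd);
		return (size_t)-1;
	}
	iconv_close(cd);

	*op = '\0';
	return (size_t)(op - out);
}

/* Convert to the current locale's charset (call setlocale() first). */
size_t
utf8_to_locale(const char *in, size_t insz, char *out, size_t outsz)
{
	return utf8_to_codeset(nl_langinfo(CODESET), in, insz, out, outsz);
}
```

Characters that the target charset cannot represent make iconv() fail, so a real caller would need a fallback such as substituting '?'.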
--
To unsubscribe send an email to tech+unsubscribe@mdocml.bsd.lv
* Re: Initial Unicode/UTF-8 patch.
From: Kristaps Dzonsons @ 2011-05-13 15:19 UTC
To: tech
On 13/05/2011 11:13, Joerg Sonnenberger wrote:
> On Fri, May 13, 2011 at 10:47:38AM -0400, Kristaps Dzonsons wrote:
>> * Uses wchar.h to convert Unicode to UTF-8 and check columns etc.
>
> This part is plainly wrong. Do not assume that wchar_t == Unicode Code
> Point, that's broken. Convert from UTF-8 to the locale's character set
> (see nl_langinfo and CODESET) using iconv.
Joerg,
Is there a non-iconv way to translate between a Unicode codepoint and
the internal representation of wchar_t?
(I'd assumed __STDC_ISO_10646__ in this proof of concept.)
* Re: Initial Unicode/UTF-8 patch.
From: Joerg Sonnenberger @ 2011-05-13 15:23 UTC
To: tech
On Fri, May 13, 2011 at 11:19:54AM -0400, Kristaps Dzonsons wrote:
> On 13/05/2011 11:13, Joerg Sonnenberger wrote:
> >On Fri, May 13, 2011 at 10:47:38AM -0400, Kristaps Dzonsons wrote:
> >> * Uses wchar.h to convert Unicode to UTF-8 and check columns etc.
> >
> >This part is plainly wrong. Do not assume that wchar_t == Unicode Code
> >Point, that's broken. Convert from UTF-8 to the locale's character set
> >(see nl_langinfo and CODESET) using iconv.
>
> Joerg,
>
> Is there a non-iconv way to translate between a Unicode codepoint
> and the internal representation of wchar_t?
>
> (I'd assumed __STDC_ISO_10646__ in this proof of concept.)
No portable one.
Joerg
* Re: Initial Unicode/UTF-8 patch.
From: Kristaps Dzonsons @ 2011-05-13 15:43 UTC
To: tech
>>>> * Uses wchar.h to convert Unicode to UTF-8 and check columns etc.
>>>
>>> This part is plainly wrong. Do not assume that wchar_t == Unicode Code
>>> Point, that's broken. Convert from UTF-8 to the locale's character set
>>> (see nl_langinfo and CODESET) using iconv.
>>
>> Joerg,
>>
>> Is there a non-iconv way to translate between a Unicode codepoint
>> and the internal representation of wchar_t?
>>
>> (I'd assumed __STDC_ISO_10646__ in this proof of concept.)
>
> No portable one.
...then from a Unicode codepoint to UTF-8? iconv only does string
arrays. I can always do the bit-twiddling myself, but I'd rather an
official library handle it.
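For reference, the bit-twiddling mentioned above is short. The following is a sketch of the standard UTF-8 encoding rules, not code from the patch; utf8_encode() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: encode codepoint cp as UTF-8 into out.
 * Returns the number of bytes written (1 to 4), or 0 if cp is a
 * surrogate or lies beyond U+10FFFF.
 */
size_t
utf8_encode(uint32_t cp, unsigned char out[4])
{
	assert(out != NULL);

	if (cp < 0x80) {
		/* 0xxxxxxx */
		out[0] = (unsigned char)cp;
		return 1;
	}
	if (cp < 0x800) {
		/* 110xxxxx 10xxxxxx */
		out[0] = (unsigned char)(0xC0 | (cp >> 6));
		out[1] = (unsigned char)(0x80 | (cp & 0x3F));
		return 2;
	}
	if (cp < 0x10000) {
		/* Surrogates are not valid scalar values. */
		if (cp >= 0xD800 && cp <= 0xDFFF)
			return 0;
		/* 1110xxxx 10xxxxxx 10xxxxxx */
		out[0] = (unsigned char)(0xE0 | (cp >> 12));
		out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (unsigned char)(0x80 | (cp & 0x3F));
		return 3;
	}
	if (cp <= 0x10FFFF) {
		/* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
		out[0] = (unsigned char)(0xF0 | (cp >> 18));
		out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
		out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[3] = (unsigned char)(0x80 | (cp & 0x3F));
		return 4;
	}
	return 0;
}
```

Unlike wctomb(), this does not depend on the locale or on how wchar_t represents characters.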
* Re: Initial Unicode/UTF-8 patch.
From: Joerg Sonnenberger @ 2011-05-13 16:03 UTC
On Fri, May 13, 2011 at 11:43:33AM -0400, Kristaps Dzonsons wrote:
> ...then from a Unicode codepoint to UTF-8? iconv only does string
> arrays. I can always do the bit-twiddling myself, but I'd rather an
> official library handle it.
Try UCS-4LE / UCS-4BE or UCS-2LE / UCS-2BE depending on what exactly you
use.
Joerg
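A sketch of what that could look like for a single codepoint, assuming the local iconv accepts the names "UCS-4LE" and "UTF-8". The function name codepoint_to_utf8() is hypothetical, and the cast assumes an iconv() prototype taking char **.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert one codepoint to UTF-8 by feeding
 * iconv a four-byte UCS-4LE "string".  Returns the number of bytes
 * written to out, or 0 on failure.
 */
size_t
codepoint_to_utf8(uint32_t cp, char *out, size_t outsz)
{
	unsigned char	 in[4];
	char		*ip = (char *)in;
	char		*op = out;
	size_t		 insz = sizeof(in);
	size_t		 left = outsz;
	iconv_t		 cd;

	assert(out != NULL);

	/* Serialize little-endian by hand so host byte order is moot. */
	in[0] = (unsigned char)(cp & 0xFF);
	in[1] = (unsigned char)((cp >> 8) & 0xFF);
	in[2] = (unsigned char)((cp >> 16) & 0xFF);
	in[3] = (unsigned char)((cp >> 24) & 0xFF);

	cd = iconv_open("UTF-8", "UCS-4LE");
	if (cd == (iconv_t)-1)
		return 0;

	if (iconv(cd, &ip, &insz, &op, &left) == (size_t)-1) {
		iconv_close(cd);
		return 0;
	}
	iconv_close(cd);

	return outsz - left;
}
```

Opening a conversion descriptor per character is wasteful; a real implementation would presumably open it once and reuse it.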