* Parsing errors, output regressions with new XML parser
@ 2019-03-30 0:19 Stephen Gregoratto
2019-04-02 13:16 ` Ingo Schwarze
` (4 more replies)
0 siblings, 5 replies; 6+ messages in thread
From: Stephen Gregoratto @ 2019-03-30 0:19 UTC (permalink / raw)
To: tech
Ingo,
I see you've been working hard on ripping out libexpat from
docbook2mdoc. While this should simplify development, I do have some
problems with the new parser:
- XML comments aren't ignored. This leads to documents like these[1]
being formatted as one loooong section under NAME.
- escaped XML chars aren't converted back into ASCII:
<programlisting>
xdg-email 'Jeremy White &lt;jwhite@example.com&gt;'
</programlisting>
EXAMPLES
xdg-email 'Jeremy White &lt;jwhite@example.com&gt;'
- There are regressions in how <author> and <citerefentry>
nodes are transformed. The example I pointed out previously:
<author>
<personname>
<firstname>Joe</firstname>
<surname>Bloggs</surname>
</personname>
<email>joe@foo.net</email>
</author>
Now converts to:
.Dd $Mdocdate$
.Dt UNKNOWN 1
.Os
.Sh AUTHORS
.Nm foo
is maintained by
.An \&Joe Bloggs ,
.Aq Mt joe@foo.net
\&.
Another regression is that closing delimiters are put on separate
lines. This leads to SEE ALSO sections like this[2] being formatted
like so:
.Sh \&SEE ALSO
.Xr man 7
,
.Xr mdoc 7
,
.Xr ms 7
,
.Xr me 7
,
.Xr mm 7
,
.Xr mwww 7
,
.Xr troff 1
\&.
I noticed in a previous email you've begun working on a regression
test suite of sorts. I could probably submit a couple examples of my
own so these errors don't crop up again.
- entities are not expanded. Some documents, like xmllint[3], will
declare an ENTITY in the DTD. A solution here would be to use a tool
like xmllint to expand the entities into their full versions like so:
xmllint --noent xmllint.xml | docbook2mdoc > xmllint.1
That should be it for the parser stuff for now. I've been playing around
with the new statistics program and I should release some data on that
soon. I've been working on a git repo in which projects that use DocBook
are added as submodules. For now, I "clean" the files with xmllint
(using options --loaddtd --noent --nocdata --nsclean --dropdtd --format)
and then run statistics over them.
Also, I noticed that cvsweb was down for most of yesterday. Scheduled
maintenance?
[1] https://gitlab.gnome.org/GNOME/gtk/blob/master/docs/reference/gtk/css-overview.xml#L20
[2] https://gitlab.com/esr/doclifter/blob/master/doclifter.xml#L988
[3] https://gitlab.gnome.org/GNOME/libxml2/raw/master/doc/xmllint.xml
--
Stephen Gregoratto
PGP: 3FC6 3D0E 2801 C348 1C44 2D34 A80C 0F8E 8BAB EC8B
--
To unsubscribe send an email to tech+unsubscribe@mandoc.bsd.lv
* Re: Parsing errors, output regressions with new XML parser
From: Ingo Schwarze @ 2019-04-02 13:16 UTC (permalink / raw)
To: Stephen Gregoratto; +Cc: tech
Hi Stephen,
Stephen Gregoratto wrote on Sat, Mar 30, 2019 at 11:19:19AM +1100:
> - XML comments aren't ignored.
Most comments were already ignored.
However...
> This leads to documents like these[1]
> being formatted as one loooong section under NAME.
> [1] https://gitlab.gnome.org/GNOME/gtk/blob/master/docs/reference/gtk/css-overview.xml#L20
... you have a point here. If a comment contained a greater-than sign,
that sign was mistaken for the end of the comment.
Fixed with the commit below.
Thanks for the report,
Ingo
Log Message:
-----------
skip XML comments even if they contain greater-than characters;
issue reported by Stephen Gregoratto <dev at sgregoratto dot me>
Modified Files:
--------------
docbook2mdoc:
parse.c
Revision Data
-------------
Index: parse.c
===================================================================
RCS file: /home/cvs/mdocml/docbook2mdoc/parse.c,v
retrieving revision 1.7
retrieving revision 1.8
diff -Lparse.c -Lparse.c -u -p -r1.7 -r1.8
--- parse.c
+++ parse.c
@@ -522,6 +522,7 @@ struct ptree *
parse_file(struct parse *p, int fd, const char *fname)
{
char b[4096];
+ char *cp;
ssize_t rsz; /* Return value from read(2). */
size_t rlen; /* Number of bytes in b[]. */
size_t poff; /* Parse offset in b[]. */
@@ -647,6 +648,29 @@ parse_file(struct parse *p, int fd, cons
if (advance(p, b, rlen, &pend, " >") &&
rsz > 0)
break;
+ if (pend > poff + 3 &&
+ strncmp(b + poff, "<!--", 4) == 0) {
+
+ /* Skip a comment. */
+
+ cp = strstr(b + pend - 2, "-->");
+ if (cp == NULL) {
+ if (rsz > 0) {
+ pend = rlen;
+ break;
+ }
+ cp = b + rlen;
+ } else
+ cp += 3;
+ while (b + pend < cp) {
+ if (b[++pend] == '\n') {
+ p->nline++;
+ p->ncol = 1;
+ } else
+ p->ncol++;
+ }
+ continue;
+ }
elem_end = 0;
if (b[pend] != '>')
in_tag = 1;
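
The idea behind this fix can be sketched outside the real parser: a comment ends at the first "-->", not at the first ">". The helper below is hypothetical and not part of docbook2mdoc. It assumes the whole comment sits in one NUL-terminated buffer, so unlike parse_file() it does not deal with read(2) refills or with line and column counting.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of docbook2mdoc.
 * If s starts with "<!--", return the number of bytes up to and
 * including the terminating "-->".  Return 0 if s does not start a
 * comment, and strlen(s) if the comment is never terminated.
 */
size_t
skip_comment(const char *s)
{
    const char *end;

    assert(s != NULL);
    if (strncmp(s, "<!--", 4) != 0)
        return 0;

    /* Search behind the opening delimiter so "<!-->" cannot match. */
    end = strstr(s + 4, "-->");
    if (end == NULL)
        return strlen(s);
    return (size_t)(end - s) + 3;
}
```

A caller would advance its parse offset by the returned length and continue with whatever follows the comment.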
* Re: Parsing errors, output regressions with new XML parser
From: Ingo Schwarze @ 2019-04-02 16:02 UTC (permalink / raw)
To: Stephen Gregoratto; +Cc: tech
Hi Stephen,
Stephen Gregoratto wrote on Sat, Mar 30, 2019 at 11:19:19AM +1100:
> - escaped XML chars aren't converted back into ASCII:
The commit below implements the basic functionality.
It can be polished in the future.
> Also, I noticed that cvsweb was down for most of yesterday. Scheduled
> maintenance?
No, slowcgi(8) crashed. I updated and restarted slowcgi(8), and it
should be back to normal operation now.
Yours,
Ingo
Log Message:
-----------
Translate XML character entity references to roff character escape sequences.
Missing feature reported by Stephen Gregoratto <dev at sgregoratto dot me>.
Remaining known issues:
* Whitespace handling isn't perfect yet.
* Numeric character references aren't handled yet.
* The list of entities is still very incomplete.
* When it grows longer, we may have to switch to binary search.
* Local entities declared in the DTD are not yet handled.
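
Of the remaining issues listed above, numeric character references are probably the most mechanical one. The following is only a sketch under assumptions: numeric_ref() is a hypothetical name, it receives the text between '&' and ';' (for example "#60" or "#x3C"), and it always emits a \[uXXXX] escape as described in mandoc_char(7). The commit itself notes that \[u...] escapes are unfriendly to -Tascii, so a real implementation might prefer to map ASCII code points to plain characters.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, not part of docbook2mdoc.
 * Convert the body of a numeric character reference to a roff
 * \[uXXXX] escape.  Return 0 on success, -1 on invalid input or
 * if buf is too small.
 */
int
numeric_ref(const char *name, char *buf, size_t bufsz)
{
    const char *digits;
    unsigned long cp;
    int base, n;

    assert(name != NULL && buf != NULL);
    if (*name++ != '#')
        return -1;
    if (*name == 'x') {
        base = 16;
        digits = "0123456789abcdefABCDEF";
        name++;
    } else {
        base = 10;
        digits = "0123456789";
    }

    /* strtoul(3) alone would accept signs, blanks and "0x". */
    if (*name == '\0' || name[strspn(name, digits)] != '\0')
        return -1;

    errno = 0;
    cp = strtoul(name, NULL, base);
    if (errno == ERANGE || cp == 0 || cp > 0x10FFFF)
        return -1;

    n = snprintf(buf, bufsz, "\\[u%04lX]", cp);
    return (n < 0 || (size_t)n >= bufsz) ? -1 : 0;
}
```

The sketch does not reject surrogate code points or other characters that XML forbids; that check is left out for brevity.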
Modified Files:
--------------
docbook2mdoc:
docbook2mdoc.c
macro.c
node.h
parse.c
Revision Data
-------------
Index: node.h
===================================================================
RCS file: /home/cvs/mdocml/docbook2mdoc/node.h,v
retrieving revision 1.6
retrieving revision 1.7
diff -Lnode.h -Lnode.h -u -p -r1.6 -r1.7
--- node.h
+++ node.h
@@ -54,6 +54,7 @@ enum nodeid {
NODE_EMPHASIS,
NODE_ENTRY,
NODE_ENVAR,
+ NODE_ESCAPE,
NODE_FIELDSYNOPSIS,
NODE_FILENAME,
NODE_FIRSTTERM,
Index: macro.c
===================================================================
RCS file: /home/cvs/mdocml/docbook2mdoc/macro.c,v
retrieving revision 1.2
retrieving revision 1.3
diff -Lmacro.c -Lmacro.c -u -p -r1.2 -r1.3
--- macro.c
+++ macro.c
@@ -161,9 +161,10 @@ macro_addnode(struct format *f, struct p
* that text, letting macro_addarg() decide about quoting.
*/
- if (pn->node == NODE_TEXT ||
+ if (pn->node == NODE_TEXT || pn->node == NODE_ESCAPE ||
((pn = TAILQ_FIRST(&pn->childq)) != NULL &&
- pn->node == NODE_TEXT && TAILQ_NEXT(pn, child) == NULL)) {
+ (pn->node == NODE_TEXT || pn->node == NODE_ESCAPE) &&
+ TAILQ_NEXT(pn, child) == NULL)) {
macro_addarg(f, pn->b, flags);
return;
}
@@ -239,7 +240,7 @@ print_textnode(struct format *f, struct
{
struct pnode *nc;
- if (n->node == NODE_TEXT)
+ if (n->node == NODE_TEXT || n->node == NODE_ESCAPE)
print_text(f, n->b, ARG_SPACE);
else
TAILQ_FOREACH(nc, &n->childq, child)
Index: docbook2mdoc.c
===================================================================
RCS file: /home/cvs/mdocml/docbook2mdoc/docbook2mdoc.c,v
retrieving revision 1.78
retrieving revision 1.79
diff -Ldocbook2mdoc.c -Ldocbook2mdoc.c -u -p -r1.78 -r1.79
--- docbook2mdoc.c
+++ docbook2mdoc.c
@@ -643,6 +643,13 @@ pnode_print(struct format *p, struct pno
case NODE_ENVAR:
macro_open(p, "Ev");
break;
+ case NODE_ESCAPE:
+ if (p->linestate == LINE_NEW)
+ p->linestate = LINE_TEXT;
+ else
+ putchar(' ');
+ fputs(pn->b, stdout);
+ break;
case NODE_FILENAME:
macro_open(p, "Pa");
break;
Index: parse.c
===================================================================
RCS file: /home/cvs/mdocml/docbook2mdoc/parse.c,v
retrieving revision 1.8
retrieving revision 1.9
diff -Lparse.c -Lparse.c -u -p -r1.8 -r1.9
--- parse.c
+++ parse.c
@@ -189,6 +189,62 @@ static const struct element elements[] =
{ NULL, NODE_IGNORE }
};
+struct entity {
+ const char *name;
+ const char *roff;
+};
+
+/*
+ * XML character entity references found in the wild.
+ * Those that don't have an exact mandoc_char(7) representation
+ * are approximated, and the desired codepoint is given as a comment.
+ * Encoding them as \\[u...] would leave -Tascii out in the cold.
+ */
+static const struct entity entities[] = {
+ { "alpha", "\\(*a" },
+ { "amp", "&" },
+ { "apos", "'" },
+ { "auml", "\\(:a" },
+ { "beta", "\\(*b" },
+ { "circ", "^" }, /* U+02C6 */
+ { "copy", "\\(co" },
+ { "dagger", "\\(dg" },
+ { "Delta", "\\(*D" },
+ { "eacute", "\\('e" },
+ { "emsp", "\\ " }, /* U+2003 */
+ { "gt", ">" },
+ { "hairsp", "\\^" },
+ { "kappa", "\\(*k" },
+ { "larr", "\\(<-" },
+ { "ldquo", "\\(lq" },
+ { "le", "\\(<=" },
+ { "lowbar", "_" },
+ { "lsqb", "[" },
+ { "lt", "<" },
+ { "mdash", "\\(em" },
+ { "minus", "\\-" },
+ { "ndash", "\\(en" },
+ { "nbsp", "\\ " },
+ { "num", "#" },
+ { "oslash", "\\(/o" },
+ { "ouml", "\\(:o" },
+ { "percnt", "%" },
+ { "quot", "\\(dq" },
+ { "rarr", "\\(->" },
+ { "rArr", "\\(rA" },
+ { "rdquo", "\\(rq" },
+ { "reg", "\\(rg" },
+ { "rho", "\\(*r" },
+ { "rsqb", "]" },
+ { "sigma", "\\(*s" },
+ { "shy", "\\&" }, /* U+00AD */
+ { "tau", "\\(*t" },
+ { "tilde", "\\[u02DC]" },
+ { "times", "\\[tmu]" },
+ { "uuml", "\\(:u" },
+ { NULL, NULL }
+};
+
static void
error_msg(struct parse *p, const char *fmt, ...)
{
@@ -275,6 +331,52 @@ pnode_trim(struct pnode *pn)
break;
}
+static void
+xml_entity(struct parse *p, const char *name)
+{
+ const struct entity *entity;
+ struct pnode *dat;
+
+ if (p->del > 0)
+ return;
+
+ if (p->cur == NULL) {
+ error_msg(p, "discarding entity before document: &%s;", name);
+ return;
+ }
+
+ /* Close out the text node, if there is one. */
+ if (p->cur->node == NODE_TEXT) {
+ pnode_trim(p->cur);
+ p->cur = p->cur->parent;
+ }
+
+ if (p->tree->flags & TREE_CLOSED && p->cur == p->tree->root)
+ warn_msg(p, "entity after end of document: &%s;", name);
+
+ for (entity = entities; entity->name != NULL; entity++)
+ if (strcmp(name, entity->name) == 0)
+ break;
+
+ if (entity->roff == NULL) {
+ error_msg(p, "unknown entity &%s;", name);
+ return;
+ }
+
+ /* Create, append, and close out an entity node. */
+ if ((dat = calloc(1, sizeof(*dat))) == NULL ||
+ (dat->b = dat->real = strdup(entity->roff)) == NULL) {
+ perror(NULL);
+ exit(1);
+ }
+ dat->node = NODE_ESCAPE;
+ dat->bsz = strlen(dat->b);
+ dat->parent = p->cur;
+ TAILQ_INIT(&dat->childq);
+ TAILQ_INIT(&dat->attrq);
+ TAILQ_INSERT_TAIL(&p->cur->childq, dat, child);
+}
+
/*
* Begin an element.
*/
@@ -573,14 +675,14 @@ parse_file(struct parse *p, int fd, cons
}
/*
- * The following three cases (in_arg, in_tag,
- * and starting a tag) all parse a word or
- * quoted string. If that extends beyond the
+ * The following four cases (in_arg, in_tag, and
+ * starting an entity or a tag) all parse a word
+ * or quoted string. If that extends beyond the
* read buffer and the last read(2) still got
* data, they all break out of the token loop
* to request more data from the read loop.
*
- * Also, they all detect self-closing tags,
+ * Also, three of them detect self-closing tags,
* those ending with "/>", setting the flag
* elem_end and calling xml_elem_end() at the
* very end, after handling the attribute value,
@@ -689,10 +791,21 @@ parse_file(struct parse *p, int fd, cons
if (elem_end)
xml_elem_end(p, b + poff);
- /* Process text up to the next tag. */
+ /* Process an entity. */
+
+ } else if (b[poff] == '&') {
+ if (advance(p, b, rlen, &pend, ";") &&
+ rsz > 0)
+ break;
+ b[pend] = '\0';
+ if (pend < rlen)
+ pend++;
+ xml_entity(p, b + poff + 1);
+
+ /* Process text up to the next tag or entity. */
} else {
- if (advance(p, b, rlen, &pend, "<") == 0)
+ if (advance(p, b, rlen, &pend, "<&") == 0)
p->ncol--;
xml_char(p, b + poff, pend - poff);
}
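
The log message mentions a possible switch to binary search once the entity table grows. A minimal sketch of that lookup with bsearch(3) follows. It copies only a few entries from the commit, and entity_lookup() and entity_cmp() are hypothetical names. One assumption matters: the table must be sorted in strcmp(3) order, where uppercase letters sort before lowercase ones. The table in the commit is not strictly sorted that way (for example, "ndash" precedes "nbsp"), so it would need reordering first.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct entity {
    const char *name;
    const char *roff;
};

/* Subset of the commit's table, sorted in strcmp(3) order. */
static const struct entity entities[] = {
    { "Delta", "\\(*D" },
    { "alpha", "\\(*a" },
    { "amp",   "&" },
    { "gt",    ">" },
    { "lt",    "<" },
    { "mdash", "\\(em" },
    { "rArr",  "\\(rA" },
    { "rarr",  "\\(->" },
};

/* bsearch(3) callback: compare the key string to an entry's name. */
static int
entity_cmp(const void *key, const void *elem)
{
    return strcmp((const char *)key,
        ((const struct entity *)elem)->name);
}

/* Return the roff replacement for name, or NULL if it is unknown. */
const char *
entity_lookup(const char *name)
{
    const struct entity *e;

    assert(name != NULL);
    e = bsearch(name, entities,
        sizeof(entities) / sizeof(entities[0]),
        sizeof(entities[0]), entity_cmp);
    return e == NULL ? NULL : e->roff;
}
```

With only a few dozen entries, the linear scan in xml_entity() is likely just as fast in practice; the sketch only shows what the switch would involve.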
* Re: Parsing errors, output regressions with new XML parser
From: Ingo Schwarze @ 2019-04-02 16:50 UTC (permalink / raw)
To: Stephen Gregoratto; +Cc: tech
Hi Stephen,
Stephen Gregoratto wrote on Sat, Mar 30, 2019 at 11:19:19AM +1100:
> - There are regressions in how <author> and <citerefentry>
> nodes are transformed. The example I pointed out previously:
>
> <author>
> <personname>
> <firstname>Joe</firstname>
> <surname>Bloggs</surname>
> </personname>
> <email>joe@foo.net</email>
> </author>
>
> Now converts to:
>
> .Dd $Mdocdate$
> .Dt UNKNOWN 1
> .Os
> .Sh AUTHORS
> .Nm foo
> is maintained by
> .An \&Joe Bloggs ,
> .Aq Mt joe@foo.net
> \&.
Fixing that was quite straightforward, see the commit below.
I also added a regression test based on your example to my regress/
directory, which is so far private. But at some point, i will almost
certainly want to publish that; i didn't do that yet because making
Makefiles portable is a pain. Are you comfortable with your
contributions being included under the ISC license (of course
including the correct Copyright notice with your name)?
$ cat regress/author/email.{xml,out}
<part>
<!-- Copyright (c) 2019 Stephen Gregoratto -->
<author>
<affiliation>Placeholder Inc.</affiliation>
<email>joe@foo.net</email>
<personname>
<firstname>Joe</firstname>
<surname>Bloggs</surname>
</personname>
</author>
</part>
.An \&Joe Bloggs Aq Mt joe@foo.net ,
Placeholder Inc.
Thanks,
Ingo
Log Message:
-----------
use the idiom ".An Name Aq Mt email" for author email addresses;
issue reported by Stephen Gregoratto <dev at sgregoratto dot me>
Modified Files:
--------------
docbook2mdoc:
docbook2mdoc.c
Revision Data
-------------
Index: docbook2mdoc.c
===================================================================
RCS file: /home/cvs/mdocml/docbook2mdoc/docbook2mdoc.c,v
retrieving revision 1.79
retrieving revision 1.80
diff -Ldocbook2mdoc.c -Ldocbook2mdoc.c -u -p -r1.79 -r1.80
--- docbook2mdoc.c
+++ docbook2mdoc.c
@@ -438,6 +438,16 @@ pnode_printauthor(struct format *f, stru
}
/*
+ * If there is an email address,
+ * print it on the same macro line.
+ */
+
+ if ((nc = pnode_findfirst(n, NODE_EMAIL)) != NULL) {
+ pnode_print(f, nc);
+ pnode_unlink(nc);
+ }
+
+ /*
* If there are still unprinted children, end the scope
* with a comma. Otherwise, leave the scope open in case
* a text node follows that starts with closing punctuation.
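
To illustrate the output idiom this commit aims for, here is a small standalone sketch. It is not how pnode_printauthor() works internally: format_author() is a hypothetical function that simply builds the ".An Name Aq Mt email" macro line from two strings, leaving out the trailing comma that the real code appends when further children remain.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not part of docbook2mdoc.
 * Build an mdoc(7) author line, keeping the email address on the
 * same macro line as the name.  Return 0 on success, -1 if buf is
 * too small.
 */
int
format_author(char *buf, size_t bufsz, const char *name,
    const char *email)
{
    int n;

    assert(buf != NULL && name != NULL);
    if (email != NULL)
        n = snprintf(buf, bufsz, ".An \\&%s Aq Mt %s", name, email);
    else
        n = snprintf(buf, bufsz, ".An \\&%s", name);
    return (n < 0 || (size_t)n >= bufsz) ? -1 : 0;
}
```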
* Re: Parsing errors, output regressions with new XML parser
From: Ingo Schwarze @ 2019-04-02 17:20 UTC (permalink / raw)
To: Stephen Gregoratto; +Cc: tech
Hi Stephen,
Stephen Gregoratto wrote on Sat, Mar 30, 2019 at 11:19:19AM +1100:
> Another regression is that closing delimiters are put on separate
> lines. This leads to SEE ALSO sections like this[2] being formatted
> [2] https://gitlab.com/esr/doclifter/blob/master/doclifter.xml#L988
> like so:
>
> .Sh \&SEE ALSO
> .Xr man 7
> ,
> .Xr mdoc 7
> ,
> .Xr ms 7
> ,
> .Xr me 7
> ,
> .Xr mm 7
> ,
> .Xr mwww 7
> ,
> .Xr troff 1
> \&.
That was very easy to fix, see the commit below.
Looks like i broke that in the big rev. 1.68.
> I noticed in a previous email you've begun working on a regression
> test suite of sorts. I could probably submit a couple examples of my
> own so these errors don't crop up again.
Sure, that would be welcome!
Maybe i should make the suite publicly available soon,
even though it is not portable yet.
Yours,
Ingo
Log Message:
-----------
handle trailing delimiters after <citerefentry>/.Xr;
bug reported by Stephen Gregoratto <dev at sgregoratto dot me>
Modified Files:
--------------
docbook2mdoc:
docbook2mdoc.c
Revision Data
-------------
Index: docbook2mdoc.c
===================================================================
RCS file: /home/cvs/mdocml/docbook2mdoc/docbook2mdoc.c,v
retrieving revision 1.80
retrieving revision 1.81
diff -Ldocbook2mdoc.c -Ldocbook2mdoc.c -u -p -r1.80 -r1.81
--- docbook2mdoc.c
+++ docbook2mdoc.c
@@ -172,7 +172,6 @@ pnode_printciterefentry(struct format *p
macro_addarg(p, "1", ARG_SPACE);
else
macro_addnode(p, manvol, ARG_SPACE | ARG_SINGLE);
- macro_close(p);
pnode_unlinksub(pn);
}
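
The fix works by no longer calling macro_close(), so the .Xr line stays open and a following text node that starts with closing punctuation can land on the same line. The sketch below shows the intended result in isolation. format_xr() is a hypothetical function, not the docbook2mdoc implementation, and the delimiter set is the closing punctuation listed in mdoc(7).

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not part of docbook2mdoc.
 * Build an .Xr macro line.  If the text that follows starts with a
 * single closing delimiter, append it to the macro line.
 * Return the number of bytes of next that were consumed (0 or 1),
 * or -1 if buf is too small.
 */
int
format_xr(char *buf, size_t bufsz, const char *name, const char *sec,
    const char *next)
{
    int n, used;

    assert(buf != NULL && name != NULL && sec != NULL);
    used = next != NULL && next[0] != '\0' &&
        strchr(".,:;)]?!", next[0]) != NULL &&
        (next[1] == '\0' || isspace((unsigned char)next[1]));

    if (used)
        n = snprintf(buf, bufsz, ".Xr %s %s %c", name, sec, next[0]);
    else
        n = snprintf(buf, bufsz, ".Xr %s %s", name, sec);
    if (n < 0 || (size_t)n >= bufsz)
        return -1;
    return used;
}
```

For the SEE ALSO list quoted above, this would yield lines such as ".Xr man 7 ," and ".Xr troff 1 ." instead of putting each delimiter on a line of its own.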
* Re: Parsing errors, output regressions with new XML parser
From: Ingo Schwarze @ 2019-04-02 17:48 UTC (permalink / raw)
To: Stephen Gregoratto; +Cc: tech
Hi Stephen,
Stephen Gregoratto wrote on Sat, Mar 30, 2019 at 11:19:19AM +1100:
> - entities are not expanded. Some documents, like xmllint[3], will
> declare an ENTITY in the DTD.
My first idea was to convert
<!ENTITY alias "realtext">
&alias;
to
.ds alias "realtext"
\*[alias]
but it turns out that doesn't work because "realtext" can contain XML.
My second idea was to build a table of user-defined entities
in the parser and then parse "realtext" from xml_entity() on demand;
but that is also hard to implement in the current framework because
parse_file() conflates parsing with the physical read(2).
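
As a rough illustration of the first step of that second idea, the sketch below extracts the name and replacement text from a simple internal entity declaration. Everything here is an assumption: parse_entity_decl() is a hypothetical function, it only accepts the form <!ENTITY name "value"> with single or double quotes, and it rejects parameter entities and external (SYSTEM/PUBLIC) entities. The hard part described above, re-parsing the value as XML on demand, is not attempted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define XML_WS " \t\r\n"

/*
 * Hypothetical helper, not part of docbook2mdoc.
 * Parse <!ENTITY name "value"> into name and value.
 * Return 0 on success, -1 if the declaration is not of that form
 * or does not fit into the buffers.
 */
int
parse_entity_decl(const char *decl, char *name, size_t namesz,
    char *value, size_t valuesz)
{
    const char *p, *q;
    size_t len;
    char quote;

    assert(decl != NULL && name != NULL && value != NULL);
    if (strncmp(decl, "<!ENTITY", 8) != 0)
        return -1;
    p = decl + 8;
    if (strspn(p, XML_WS) == 0)
        return -1;
    p += strspn(p, XML_WS);

    /* The entity name runs up to the next whitespace. */
    len = strcspn(p, XML_WS);
    if (len == 0 || len >= namesz || p[len] == '\0')
        return -1;
    memcpy(name, p, len);
    name[len] = '\0';
    p += len + strspn(p + len, XML_WS);

    /* The value must be quoted; this excludes "% name" and SYSTEM. */
    quote = *p;
    if (quote != '"' && quote != '\'')
        return -1;
    p++;
    if ((q = strchr(p, quote)) == NULL)
        return -1;
    len = (size_t)(q - p);
    if (len >= valuesz)
        return -1;
    memcpy(value, p, len);
    value[len] = '\0';
    return 0;
}
```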
> A solution here would be to use a tool
> like xmllint to expand the entities into their full versions like so:
>
> xmllint --noent xmllint.xml | docbook2mdoc > xmllint.1
Right, i guess that is reasonable for now. At some point, we might
wish to revisit the topic, but i don't think user-defined entities
are the most pressing issue right now.
I hope i addressed all topics you brought up; otherwise, please don't
hesitate to remind me.
Thanks,
Ingo