* Parsing errors, output regressions with new XML parser
From: Stephen Gregoratto @ 2019-03-30 0:19 UTC
To: tech

Ingo, I see you've been working hard on ripping out libexpat from
docbook2mdoc. While this should simplify development, I do have some
problems with the new parser:

- XML comments aren't ignored. This leads to documents like these[1]
  being formatted as one loooong section under NAME.

- escaped XML chars aren't converted back into ASCII:

    <programlisting>
    xdg-email 'Jeremy White &lt;jwhite@example.com&gt;'
    </programlisting>

    EXAMPLES
         xdg-email 'Jeremy White &lt;jwhite@example.com&gt;'

- There are regressions in how <author> and <citerefentry>
  nodes are transformed. The example I pointed out previously:

    <author>
      <personname>
        <firstname>Joe</firstname>
        <surname>Bloggs</surname>
      </personname>
      <email>joe@foo.net</email>
    </author>

  Now converts to:

    .Dd $Mdocdate$
    .Dt UNKNOWN 1
    .Os
    .Sh AUTHORS
    .Nm foo
    is maintained by
    .An \&Joe Bloggs ,
    .Aq Mt joe@foo.net
    \&.

  Another regression is that closing delimiters are put on separate
  lines. This leads to SEE ALSO sections like this[2] being formatted
  like so:

    .Sh \&SEE ALSO
    .Xr man 7
    ,
    .Xr mdoc 7
    ,
    .Xr ms 7
    ,
    .Xr me 7
    ,
    .Xr mm 7
    ,
    .Xr mwww 7
    ,
    .Xr troff 1
    \&.

  I noticed in a previous email you've begun working on a regression
  test suite of sorts. I could probably submit a couple examples of my
  own so these errors don't crop up again.

- entities are not expanded. Some documents, like xmllint[3], will
  declare an ENTITY in the DTD. A solution here would be to use a tool
  like xmllint to expand the entities into their full versions like so:

    xmllint --noent xmllint.xml | docbook2mdoc > xmllint.1

That should be it for the parser stuff for now. I've been playing
around with the new statistics program and I should release some data
on that soon. I've been working on a git repo in which projects that
use DocBook are added as submodules. What I'm doing now is that I'll
"clean" the files with xmllint (using options --loaddtd --noent
--nocdata --nsclean --dropdtd --format) and then run statistics over
them.

Also, I noticed that cvsweb was down for most of yesterday. Scheduled
maintenance?

[1] https://gitlab.gnome.org/GNOME/gtk/blob/master/docs/reference/gtk/css-overview.xml#L20
[2] https://gitlab.com/esr/doclifter/blob/master/doclifter.xml#L988
[3] https://gitlab.gnome.org/GNOME/libxml2/raw/master/doc/xmllint.xml
-- 
Stephen Gregoratto
PGP: 3FC6 3D0E 2801 C348 1C44 2D34 A80C 0F8E 8BAB EC8B
--
 To unsubscribe send an email to tech+unsubscribe@mandoc.bsd.lv

* Re: Parsing errors, output regressions with new XML parser
From: Ingo Schwarze @ 2019-04-02 13:16 UTC
To: Stephen Gregoratto; +Cc: tech

Hi Stephen,

Stephen Gregoratto wrote on Sat, Mar 30, 2019 at 11:19:19AM +1100:

> - XML comments aren't ignored.

Most comments were already ignored. However...

> This leads to documents like these[1]
> being formatted as one loooong section under NAME.
> [1] https://gitlab.gnome.org/GNOME/gtk/blob/master/docs/reference/gtk/css-overview.xml#L20

... you have a point here. If a comment contains a greater-than sign,
that was mistaken as the end of the comment. Fixed with the commit
below.

Thanks for the report,
  Ingo


Log Message:
-----------
skip XML comments even if they contain greater-than characters;
issue reported by Stephen Gregoratto <dev at sgregoratto dot me>

Modified Files:
--------------
    docbook2mdoc:
        parse.c

Revision Data
-------------
Index: parse.c
===================================================================
RCS file: /home/cvs/mdocml/docbook2mdoc/parse.c,v
retrieving revision 1.7
retrieving revision 1.8
diff -Lparse.c -Lparse.c -u -p -r1.7 -r1.8
--- parse.c
+++ parse.c
@@ -522,6 +522,7 @@ struct ptree *
 parse_file(struct parse *p, int fd, const char *fname)
 {
 	char		 b[4096];
+	char		*cp;
 	ssize_t		 rsz;	/* Return value from read(2). */
 	size_t		 rlen;	/* Number of bytes in b[]. */
 	size_t		 poff;	/* Parse offset in b[]. */
@@ -647,6 +648,29 @@ parse_file(struct parse *p, int fd, cons
 				if (advance(p, b, rlen, &pend, " >") &&
 				    rsz > 0)
 					break;
+				if (pend > poff + 3 &&
+				    strncmp(b + poff, "<!--", 4) == 0) {
+
+					/* Skip a comment. */
+
+					cp = strstr(b + pend - 2, "-->");
+					if (cp == NULL) {
+						if (rsz > 0) {
+							pend = rlen;
+							break;
+						}
+						cp = b + rlen;
+					} else
+						cp += 3;
+					while (b + pend < cp) {
+						if (b[++pend] == '\n') {
+							p->nline++;
+							p->ncol = 1;
+						} else
+							p->ncol++;
+					}
+					continue;
+				}
 				elem_end = 0;
 				if (b[pend] != '>')
 					in_tag = 1;
--
 To unsubscribe send an email to source+unsubscribe@mandoc.bsd.lv
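
The idea behind the fix above can be shown in isolation. The following
is only a minimal sketch, not the code from parse.c: the function name
skip_comment() is hypothetical, and it assumes the whole comment is
already in one NUL-terminated buffer, whereas the real parser also has
to request more data from read(2) and keep line and column counters up
to date. The point it illustrates is that the end of a comment is found
by searching for the three-byte sequence "-->" and not for the first
greater-than sign.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, for illustration only.
 * If buf[start] begins an XML comment, return the offset of the
 * first byte after the closing "-->".  Return 0 if there is no
 * comment at that position or if it is not terminated in buf.
 * buf must be NUL-terminated and start must not exceed strlen(buf).
 */
size_t
skip_comment(const char *buf, size_t start)
{
	const char	*end;

	assert(buf != NULL);

	if (strncmp(buf + start, "<!--", 4) != 0)
		return 0;

	/* A '>' inside the comment is ignored; only "-->" ends it. */
	end = strstr(buf + start + 4, "-->");
	if (end == NULL)
		return 0;
	return (size_t)(end - buf) + 3;
}
```

For the input "<!-- a > b -->x" and start 0, this returns 14, the
offset of the "x", even though the comment contains a greater-than
sign.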

* Re: Parsing errors, output regressions with new XML parser
From: Ingo Schwarze @ 2019-04-02 16:02 UTC
To: Stephen Gregoratto; +Cc: tech

Hi Stephen,

Stephen Gregoratto wrote on Sat, Mar 30, 2019 at 11:19:19AM +1100:

> - escaped XML chars aren't converted back into ASCII:

The commit below implements the basic functionality.
It can be polished in the future.

> Also, I noticed that cvsweb was down for most of yesterday. Scheduled
> maintenance?

No, slowcgi(8) crashed. I updated and restarted slowcgi(8),
and it should be back to normal operation now.

Yours,
  Ingo


Log Message:
-----------
Translate XML character entity references to roff character escape
sequences. Missing feature reported by
Stephen Gregoratto <dev at sgregoratto dot me>.

Remaining known issues:
* Whitespace handling isn't perfect yet.
* Numeric character references aren't handled yet.
* The list of entities is still very incomplete.
* When it grows longer, we may have to switch to binary search.
* Local entities declared in the DTD are not yet handled.

Modified Files:
--------------
    docbook2mdoc:
        docbook2mdoc.c
        macro.c
        node.h
        parse.c

Revision Data
-------------
Index: node.h
===================================================================
RCS file: /home/cvs/mdocml/docbook2mdoc/node.h,v
retrieving revision 1.6
retrieving revision 1.7
diff -Lnode.h -Lnode.h -u -p -r1.6 -r1.7
--- node.h
+++ node.h
@@ -54,6 +54,7 @@ enum nodeid {
 	NODE_EMPHASIS,
 	NODE_ENTRY,
 	NODE_ENVAR,
+	NODE_ESCAPE,
 	NODE_FIELDSYNOPSIS,
 	NODE_FILENAME,
 	NODE_FIRSTTERM,
Index: macro.c
===================================================================
RCS file: /home/cvs/mdocml/docbook2mdoc/macro.c,v
retrieving revision 1.2
retrieving revision 1.3
diff -Lmacro.c -Lmacro.c -u -p -r1.2 -r1.3
--- macro.c
+++ macro.c
@@ -161,9 +161,10 @@ macro_addnode(struct format *f, struct p
 	 * that text, letting macro_addarg() decide about quoting.
 	 */
 
-	if (pn->node == NODE_TEXT ||
+	if (pn->node == NODE_TEXT || pn->node == NODE_ESCAPE ||
 	    ((pn = TAILQ_FIRST(&pn->childq)) != NULL &&
-	     pn->node == NODE_TEXT && TAILQ_NEXT(pn, child) == NULL)) {
+	     (pn->node == NODE_TEXT || pn->node == NODE_ESCAPE) &&
+	     TAILQ_NEXT(pn, child) == NULL)) {
 		macro_addarg(f, pn->b, flags);
 		return;
 	}
@@ -239,7 +240,7 @@ print_textnode(struct format *f, struct
 {
 	struct pnode	*nc;
 
-	if (n->node == NODE_TEXT)
+	if (n->node == NODE_TEXT || n->node == NODE_ESCAPE)
 		print_text(f, n->b, ARG_SPACE);
 	else
 		TAILQ_FOREACH(nc, &n->childq, child)
Index: docbook2mdoc.c
===================================================================
RCS file: /home/cvs/mdocml/docbook2mdoc/docbook2mdoc.c,v
retrieving revision 1.78
retrieving revision 1.79
diff -Ldocbook2mdoc.c -Ldocbook2mdoc.c -u -p -r1.78 -r1.79
--- docbook2mdoc.c
+++ docbook2mdoc.c
@@ -643,6 +643,13 @@ pnode_print(struct format *p, struct pno
 	case NODE_ENVAR:
 		macro_open(p, "Ev");
 		break;
+	case NODE_ESCAPE:
+		if (p->linestate == LINE_NEW)
+			p->linestate = LINE_TEXT;
+		else
+			putchar(' ');
+		fputs(pn->b, stdout);
+		break;
 	case NODE_FILENAME:
 		macro_open(p, "Pa");
 		break;
Index: parse.c
===================================================================
RCS file: /home/cvs/mdocml/docbook2mdoc/parse.c,v
retrieving revision 1.8
retrieving revision 1.9
diff -Lparse.c -Lparse.c -u -p -r1.8 -r1.9
--- parse.c
+++ parse.c
@@ -189,6 +189,62 @@ static const struct element elements[] =
 	{ NULL,			NODE_IGNORE }
 };
 
+struct entity {
+	const char	*name;
+	const char	*roff;
+};
+
+/*
+ * XML character entity references found in the wild.
+ * Those that don't have an exact mandoc_char(7) representation
+ * are approximated, and the desired codepoint is given as a comment.
+ * Encoding them as \\[u...] would leave -Tascii out in the cold.
+ */
+static const struct entity entities[] = {
+	{ "alpha",	"\\(*a" },
+	{ "amp",	"&" },
+	{ "apos",	"'" },
+	{ "auml",	"\\(:a" },
+	{ "beta",	"\\(*b" },
+	{ "circ",	"^" },      /* U+02C6 */
+	{ "copy",	"\\(co" },
+	{ "dagger",	"\\(dg" },
+	{ "Delta",	"\\(*D" },
+	{ "eacute",	"\\('e" },
+	{ "emsp",	"\\ " },    /* U+2003 */
+	{ "gt",		">" },
+	{ "hairsp",	"\\^" },
+	{ "kappa",	"\\(*k" },
+	{ "larr",	"\\(<-" },
+	{ "ldquo",	"\\(lq" },
+	{ "le",		"\\(<=" },
+	{ "lowbar",	"_" },
+	{ "lsqb",	"[" },
+	{ "lt",		"<" },
+	{ "mdash",	"\\(em" },
+	{ "minus",	"\\-" },
+	{ "ndash",	"\\(en" },
+	{ "nbsp",	"\\ " },
+	{ "num",	"#" },
+	{ "oslash",	"\\(/o" },
+	{ "ouml",	"\\(:o" },
+	{ "percnt",	"%" },
+	{ "quot",	"\\(dq" },
+	{ "rarr",	"\\(->" },
+	{ "rArr",	"\\(rA" },
+	{ "rdquo",	"\\(rq" },
+	{ "reg",	"\\(rg" },
+	{ "rho",	"\\(*r" },
+	{ "rsqb",	"]" },
+	{ "sigma",	"\\(*s" },
+	{ "shy",	"\\&" },    /* U+00AD */
+	{ "tau",	"\\(*t" },
+	{ "tilde",	"\\[u02DC]" },
+	{ "times",	"\\[tmu]" },
+	{ "uuml",	"\\(:u" },
+	{ NULL,		NULL }
+};
+
 static void
 error_msg(struct parse *p, const char *fmt, ...)
 {
@@ -275,6 +331,52 @@ pnode_trim(struct pnode *pn)
 			break;
 }
 
+static void
+xml_entity(struct parse *p, const char *name)
+{
+	const struct entity	*entity;
+	struct pnode		*dat;
+
+	if (p->del > 0)
+		return;
+
+	if (p->cur == NULL) {
+		error_msg(p, "discarding entity before document: &%s;", name);
+		return;
+	}
+
+	/* Close out the text node, if there is one. */
+	if (p->cur->node == NODE_TEXT) {
+		pnode_trim(p->cur);
+		p->cur = p->cur->parent;
+	}
+
+	if (p->tree->flags & TREE_CLOSED && p->cur == p->tree->root)
+		warn_msg(p, "entity after end of document: &%s;", name);
+
+	for (entity = entities; entity->name != NULL; entity++)
+		if (strcmp(name, entity->name) == 0)
+			break;
+
+	if (entity->roff == NULL) {
+		error_msg(p, "unknown entity &%s;", name);
+		return;
+	}
+
+	/* Create, append, and close out an entity node. */
+	if ((dat = calloc(1, sizeof(*dat))) == NULL ||
+	    (dat->b = dat->real = strdup(entity->roff)) == NULL) {
+		perror(NULL);
+		exit(1);
+	}
+	dat->node = NODE_ESCAPE;
+	dat->bsz = strlen(dat->b);
+	dat->parent = p->cur;
+	TAILQ_INIT(&dat->childq);
+	TAILQ_INIT(&dat->attrq);
+	TAILQ_INSERT_TAIL(&p->cur->childq, dat, child);
+}
+
 /*
  * Begin an element.
  */
@@ -573,14 +675,14 @@ parse_file(struct parse *p, int fd, cons
 			}
 
 			/*
-			 * The following three cases (in_arg, in_tag,
-			 * and starting a tag) all parse a word or
-			 * quoted string. If that extends beyond the
+			 * The following four cases (in_arg, in_tag, and
+			 * starting an entity or a tag) all parse a word
+			 * or quoted string. If that extends beyond the
 			 * read buffer and the last read(2) still got
 			 * data, they all break out of the token loop
 			 * to request more data from the read loop.
 			 *
-			 * Also, they all detect self-closing tags,
+			 * Also, three of them detect self-closing tags,
 			 * those ending with "/>", setting the flag
 			 * elem_end and calling xml_elem_end() at the
 			 * very end, after handling the attribute value,
@@ -689,10 +791,21 @@ parse_file(struct parse *p, int fd, cons
 				if (elem_end)
 					xml_elem_end(p, b + poff);
 
-			/* Process text up to the next tag. */
+			/* Process an entity. */
+
+			} else if (b[poff] == '&') {
+				if (advance(p, b, rlen, &pend, ";") &&
+				    rsz > 0)
+					break;
+				b[pend] = '\0';
+				if (pend < rlen)
+					pend++;
+				xml_entity(p, b + poff + 1);
+
+			/* Process text up to the next tag or entity. */
 
 			} else {
-				if (advance(p, b, rlen, &pend, "<") == 0)
+				if (advance(p, b, rlen, &pend, "<&") == 0)
 					p->ncol--;
 				xml_char(p, b + poff, pend - poff);
 			}
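
The commit message above lists a possible switch to binary search as a
remaining issue. The following is only a sketch of what that could
look like, not anything that was committed: entity_cmp() and
entity_lookup() are hypothetical names, and the table holds just the
five predefined XML entities with the roff strings taken from the
commit. It relies on the standard bsearch(3), which requires the table
to be sorted in strcmp(3) order. The committed table is not sorted that
way (for example, "ndash" precedes "nbsp"), so it would have to be
reordered first.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct entity {
	const char	*name;
	const char	*roff;
};

/* Illustrative subset; must stay sorted by name in strcmp(3) order. */
static const struct entity entities[] = {
	{ "amp",	"&" },
	{ "apos",	"'" },
	{ "gt",		">" },
	{ "lt",		"<" },
	{ "quot",	"\\(dq" },
};

/* Compare the search key (a name) with the name of a table entry. */
static int
entity_cmp(const void *key, const void *elem)
{
	return strcmp(key, ((const struct entity *)elem)->name);
}

/*
 * Hypothetical lookup function: return the roff representation
 * of the named entity, or NULL if the name is not in the table.
 */
const char *
entity_lookup(const char *name)
{
	const struct entity	*e;

	assert(name != NULL);

	e = bsearch(name, entities,
	    sizeof(entities) / sizeof(entities[0]),
	    sizeof(entities[0]), entity_cmp);
	return e == NULL ? NULL : e->roff;
}
```

With this table, entity_lookup("lt") yields "<" and an unknown name
such as "nosuch" yields NULL, which a caller like xml_entity() could
report as an unknown entity. For a table of a few dozen entries, the
linear search in the commit is likely fast enough, which is presumably
why the switch was deferred.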

* Re: Parsing errors, output regressions with new XML parser
From: Ingo Schwarze @ 2019-04-02 16:50 UTC
To: Stephen Gregoratto; +Cc: tech

Hi Stephen,

Stephen Gregoratto wrote on Sat, Mar 30, 2019 at 11:19:19AM +1100:

> - There are regressions in how <author> and <citerefentry>
>   nodes are transformed. The example I pointed out previously:
>
>     <author>
>       <personname>
>         <firstname>Joe</firstname>
>         <surname>Bloggs</surname>
>       </personname>
>       <email>joe@foo.net</email>
>     </author>
>
>   Now converts to:
>
>     .Dd $Mdocdate$
>     .Dt UNKNOWN 1
>     .Os
>     .Sh AUTHORS
>     .Nm foo
>     is maintained by
>     .An \&Joe Bloggs ,
>     .Aq Mt joe@foo.net
>     \&.

Fixing that was quite straightforward, see the commit below.

I also added a regression test based on your example to my regress/
directory, which is so far private. But at some point, i will almost
certainly want to publish that; i didn't do that yet because making
Makefiles portable is a pain.

Are you comfortable with your contributions being included under the
ISC license (of course including the correct Copyright notice with
your name)?

   $ cat regress/author/email.{xml,out}
  <part>
  <!-- Copyright (c) 2019 Stephen Gregoratto -->
  <author>
    <affiliation>Placeholder Inc.</affiliation>
    <email>joe@foo.net</email>
    <personname>
      <firstname>Joe</firstname>
      <surname>Bloggs</surname>
    </personname>
  </author>
  </part>
  .An \&Joe Bloggs Aq Mt joe@foo.net ,
  Placeholder Inc.

Thanks,
  Ingo


Log Message:
-----------
use the idiom ".An Name Aq Mt email" for author email addresses;
issue reported by Stephen Gregoratto <dev at sgregoratto dot me>

Modified Files:
--------------
    docbook2mdoc:
        docbook2mdoc.c

Revision Data
-------------
Index: docbook2mdoc.c
===================================================================
RCS file: /home/cvs/mdocml/docbook2mdoc/docbook2mdoc.c,v
retrieving revision 1.79
retrieving revision 1.80
diff -Ldocbook2mdoc.c -Ldocbook2mdoc.c -u -p -r1.79 -r1.80
--- docbook2mdoc.c
+++ docbook2mdoc.c
@@ -438,6 +438,16 @@ pnode_printauthor(struct format *f, stru
 	}
 
 	/*
+	 * If there is an email address,
+	 * print it on the same macro line.
+	 */
+
+	if ((nc = pnode_findfirst(n, NODE_EMAIL)) != NULL) {
+		pnode_print(f, nc);
+		pnode_unlink(nc);
+	}
+
+	/*
 	 * If there are still unprinted children, end the scope
 	 * with a comma. Otherwise, leave the scope open in case
 	 * a text node follows that starts with closing punctuation.

* Re: Parsing errors, output regressions with new XML parser
From: Ingo Schwarze @ 2019-04-02 17:20 UTC
To: Stephen Gregoratto; +Cc: tech

Hi Stephen,

Stephen Gregoratto wrote on Sat, Mar 30, 2019 at 11:19:19AM +1100:

> Another regression is that closing delimiters are put on separate
> lines. This leads to SEE ALSO sections like this[2] being formatted
> [2] https://gitlab.com/esr/doclifter/blob/master/doclifter.xml#L988
> like so:
>
>   .Sh \&SEE ALSO
>   .Xr man 7
>   ,
>   .Xr mdoc 7
>   ,
>   .Xr ms 7
>   ,
>   .Xr me 7
>   ,
>   .Xr mm 7
>   ,
>   .Xr mwww 7
>   ,
>   .Xr troff 1
>   \&.

That was very easy to fix, see the commit below.
Looks like i broke that in the big rev. 1.68.

> I noticed in a previous email you've begun working on a regression
> test suite of sorts. I could probably submit a couple examples of my
> own so these errors don't crop up again.

Sure, that would be welcome! Maybe i should make the suite publicly
available soon, even though it is not portable yet.

Yours,
  Ingo


Log Message:
-----------
handle trailing delimiters after <citerefentry>/.Xr;
bug reported by Stephen Gregoratto <dev at sgregoratto dot me>

Modified Files:
--------------
    docbook2mdoc:
        docbook2mdoc.c

Revision Data
-------------
Index: docbook2mdoc.c
===================================================================
RCS file: /home/cvs/mdocml/docbook2mdoc/docbook2mdoc.c,v
retrieving revision 1.80
retrieving revision 1.81
diff -Ldocbook2mdoc.c -Ldocbook2mdoc.c -u -p -r1.80 -r1.81
--- docbook2mdoc.c
+++ docbook2mdoc.c
@@ -172,7 +172,6 @@ pnode_printciterefentry(struct format *p
 		macro_addarg(p, "1", ARG_SPACE);
 	else
 		macro_addnode(p, manvol, ARG_SPACE | ARG_SINGLE);
-	macro_close(p);
 	pnode_unlinksub(pn);
 }
 

* Re: Parsing errors, output regressions with new XML parser
From: Ingo Schwarze @ 2019-04-02 17:48 UTC
To: Stephen Gregoratto; +Cc: tech

Hi Stephen,

Stephen Gregoratto wrote on Sat, Mar 30, 2019 at 11:19:19AM +1100:

> - entities are not expanded. Some documents, like xmllint[3], will
>   declare an ENTITY in the DTD.

My first idea was to convert

  <!ENTITY alias "realtext">
  &alias;

to

  .ds alias "realtext"
  \*[alias]

but it turns out that doesn't work because "realtext" can contain XML.

My second idea was to build a table of user-defined entities in the
parser and then parse "realtext" from xml_entity() on demand; but
that is also hard to implement in the current framework because
parse_file() conflates parsing with the physical read(2).

> A solution here would be to use a tool
> like xmllint to expand the entities into their full versions like so:
>
>   xmllint --noent xmllint.xml | docbook2mdoc > xmllint.1

Right, i guess that is reasonable for now. At some point, we might
wish to revisit the topic, but i don't think user-defined entities
are the most pressing issue right now.

I hope i addressed all topics you brought up; otherwise, please
don't hesitate to remind me.

Thanks,
  Ingo