Hi,

Consider the quadratic formula:

x={-b +- sqrt{b sup 2 - 4ac}} over 2a

Wikipedia suggests it should be rendered in MathML like so (leaving
out invisible operators):

<mrow>
  <mi>x</mi>
  <mo>=</mo>
  <mfrac>
    <mrow>
      <mo>−</mo>
      <mi>b</mi>
      <mo>±</mo>
      <msqrt>
        <msup>
          <mi>b</mi>
          <mn>2</mn>
        </msup>
        <mo>−</mo>
        <mn>4</mn>
        <mi>a</mi>
        <mi>c</mi>
      </msqrt>
    </mrow>
    <mrow>
      <mn>2</mn>
      <mi>a</mi>
    </mrow>
  </mfrac>
</mrow>

mandoc -Thtml renders it like so:

<mrow>
  <mi>x=</mi>
  <mfrac>
    <mrow>
      <mi>-b</mi>
      <mi>±</mi>
      <msqrt>
        <mrow>
          <msup>
            <mi>b</mi>
            <mi>2</mi>
          </msup>
          <mi>−</mi>
          <mi>4ac</mi>
        </mrow>
      </msqrt>
    </mrow>
    <mi>2a</mi>
  </mfrac>
</mrow>

A few things are noticeable here:

- mandoc only uses <mi>, not <mo> or <mn>.
- mandoc will transform a '-' into U+2212, but only when it's not
  directly adjacent to a digit.
- In Firefox, <mi> only seems to italicize single letters.

It looks like adjacent variables, numbers, and operators should be split:
- 'x=' should become <mi>x</mi><mo>=</mo>
- '-b' should become <mo>−</mo><mi>b</mi>
- '-4ac' should become <mo>−</mo><mn>4</mn><mi>a</mi><mi>c</mi>

The MathML standard says (MathML 3.0 2e # 3.2.33) that "sin" is
appropriately marked up with <mi>. So <mi>sin</mi> should be enough to
correctly render eqn's mathematical words. It seems that for
non-mathematical words to be rendered with italics by default, they
should be rendered with a <mi> per letter?

--
Anthony J. Bentley
--
To unsubscribe send an email to tech+unsubscribe@mdocml.bsd.lv
Hi Anthony,

Anthony J. Bentley wrote on Tue, Jun 20, 2017 at 02:04:29AM -0600:

> - mandoc only uses <mi>, not <mo> or <mn>.
> - mandoc will transform a '-' into U+2212, but only when it's not
>   directly adjacent to a digit.

These are still open.

> - In Firefox, <mi> only seems to italicize single letters.
>
> It looks like adjacent variables, numbers, and operators should be split:
> - 'x=' should become <mi>x</mi><mo>=</mo>
> - '-b' should become <mo>−</mo><mi>b</mi>
> - '-4ac' should become <mo>−</mo><mn>4</mn><mi>a</mi><mi>c</mi>
>
> The MathML standard says (MathML 3.0 2e # 3.2.33) that "sin" is
> appropriately marked up with <mi>. So <mi>sin</mi> should be enough to
> correctly render eqn's mathematical words. It seems that for
> non-mathematical words to be rendered with italics by default, they
> should be rendered with a <mi> per letter?

The following commit implements the parser side parts needed
to fix that.  Some formatter parts are still open.

Thanks for the analysis,
  Ingo


Log Message:
-----------
Outside explicit font context, give every letter its own box.
The formatters need this to correctly select fonts.
Missing feature reported by bentley@.

Modified Files:
--------------
    mdocml:
        eqn.c

Revision Data
-------------
Index: eqn.c
===================================================================
RCS file: /home/cvs/mdocml/mdocml/eqn.c,v
retrieving revision 1.65
retrieving revision 1.66
diff -Leqn.c -Leqn.c -u -p -r1.65 -r1.66
--- eqn.c
+++ eqn.c
@@ -20,6 +20,7 @@
 #include <sys/types.h>
 
 #include <assert.h>
+#include <ctype.h>
 #include <limits.h>
 #include <stdio.h>
 #include <stdlib.h>
@@ -718,8 +719,8 @@ static enum rofferr
 eqn_parse(struct eqn_node *ep, struct eqn_box *parent)
 {
 	char		 sym[64];
-	struct eqn_box	*cur;
-	const char	*start;
+	struct eqn_box	*cur, *fontp, *nbox;
+	const char	*cp, *cpn, *start;
 	char		*p;
 	size_t		 sz;
 	enum eqn_tok	 tok, subtok;
@@ -1092,21 +1093,51 @@ this_tok:
 		 */
 		while (parent->args == parent->expectargs)
 			parent = parent->parent;
-		if (tok == EQN_TOK_FUNC) {
-			for (cur = parent; cur != NULL; cur = cur->parent)
-				if (cur->font != EQNFONT_NONE)
-					break;
-			if (cur == NULL || cur->font != EQNFONT_ROMAN) {
-				parent = eqn_box_alloc(ep, parent);
-				parent->type = EQN_LISTONE;
-				parent->font = EQNFONT_ROMAN;
-				parent->expectargs = 1;
-			}
+		/*
+		 * Wrap well-known function names in a roman box,
+		 * unless they already are in roman context.
+		 */
+		for (fontp = parent; fontp != NULL; fontp = fontp->parent)
+			if (fontp->font != EQNFONT_NONE)
+				break;
+		if (tok == EQN_TOK_FUNC &&
+		    (fontp == NULL || fontp->font != EQNFONT_ROMAN)) {
+			parent = fontp = eqn_box_alloc(ep, parent);
+			parent->type = EQN_LISTONE;
+			parent->font = EQNFONT_ROMAN;
+			parent->expectargs = 1;
 		}
 		cur = eqn_box_alloc(ep, parent);
 		cur->type = EQN_TEXT;
 		cur->text = p;
-
+		/*
+		 * If not inside any explicit font context,
+		 * give every letter its own box.
+		 */
+		if (fontp == NULL && *p != '\0') {
+			cp = p;
+			for (;;) {
+				cpn = cp + 1;
+				if (*cp == '\\')
+					mandoc_escape(&cpn, NULL, NULL);
+				if (*cpn == '\0')
+					break;
+				if (isalpha((unsigned char)*cp) == 0 &&
+				    isalpha((unsigned char)*cpn) == 0) {
+					cp = cpn;
+					continue;
+				}
+				nbox = eqn_box_alloc(ep, parent);
+				nbox->type = EQN_TEXT;
+				nbox->text = mandoc_strdup(cpn);
+				p = mandoc_strndup(cur->text,
+				    cpn - cur->text);
+				free(cur->text);
+				cur->text = p;
+				cur = nbox;
+				cp = nbox->text;
+			}
+		}
 		/*
 		 * Post-process list status.
 		 */
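For readers who want to try the splitting rule from revision 1.66 outside the parser, here is a minimal standalone sketch. It is not mandoc code: the helper names first_piece_len() and count_pieces() are hypothetical, the sketch only measures where the boundaries fall instead of allocating boxes, and it ignores roff escape sequences, which the real parser skips with mandoc_escape().

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of mandoc: return the length of the
 * first piece of s under the rule "every letter gets its own piece,
 * adjacent non-letters stay together".
 * Assumption: s contains no roff escape sequences.
 */
size_t
first_piece_len(const char *s)
{
	size_t	 i;

	assert(s != NULL);
	if (s[0] == '\0')
		return 0;
	for (i = 1; s[i] != '\0'; i++) {
		/* There is a boundary wherever either neighbour is a letter. */
		if (isalpha((unsigned char)s[i - 1]) ||
		    isalpha((unsigned char)s[i]))
			break;
	}
	return i;
}

/*
 * Hypothetical helper: count the pieces that s would be split into.
 */
size_t
count_pieces(const char *s)
{
	size_t	 n;

	assert(s != NULL);
	for (n = 0; *s != '\0'; n++)
		s += first_piece_len(s);
	return n;
}
```

Under these assumptions, "x=" falls apart into "x" and "=", while "-4ac" becomes "-4", "a", and "c": the minus sign and the digit stay together because neither of them is a letter.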
Hi Anthony,

Anthony J. Bentley wrote on Tue, Jun 20, 2017 at 02:04:29AM -0600:

> Consider the quadratic formula:
>
> x={-b +- sqrt{b sup 2 - 4ac}} over 2a
>
> Wikipedia suggests it should be rendered in MathML like so (leaving
> out invisible operators):
>
> <mrow>
>   <mi>x</mi>
>   <mo>=</mo>
>   <mfrac>
>     <mrow>
>       <mo>−</mo>
>       <mi>b</mi>
>       <mo>±</mo>
>       <msqrt>
>         <msup>
>           <mi>b</mi>
>           <mn>2</mn>
>         </msup>
>         <mo>−</mo>
>         <mn>4</mn>
>         <mi>a</mi>
>         <mi>c</mi>
>       </msqrt>
>     </mrow>
>     <mrow>
>       <mn>2</mn>
>       <mi>a</mi>
>     </mrow>
>   </mfrac>
> </mrow>

After committing the patch appended below, mandoc now renders
as follows:

<mrow>
  <mi>x</mi>      <!-- new identifier/operator splitting -->
  <mo>=</mo>      <!-- new operator element -->
  <mfrac>
    <mrow>
      <mo>-</mo>  <!-- XXX still no U+2212 -->
      <mi>b</mi>
      <mo>±</mo>
      <msqrt>
        <mrow>    <!-- XXX no detection of needless rows yet -->
          <msup>
            <mi>b</mi>
            <mn>2</mn>  <!-- new number element -->
          </msup>
          <mi>−</mi>    <!-- XXX no non-ASCII operator detection -->
          <mn>4</mn>
          <mi fontstyle="italic">ac</mi>  <!-- SEE BELOW -->
        </mrow>
      </msqrt>
    </mrow>
    <mn>2</mn>    <!-- XXX oops, do we need a row here? -->
    <mi>a</mi>
  </mfrac>
</mrow>

The <mi fontstyle="italic">ac</mi> does not seem wrong.
If you write "ac", mandoc cannot be sure whether this is
a two-letter identifier (which is correctly marked up above)
or the product of two identifiers.

In this case, you should probably write "a c" (with a blank)
to make it clear that these are two identifiers, and then it
will render as <mi>a</mi><mi>c</mi>.

> - mandoc only uses <mi>, not <mo> or <mn>.

Fixed.

> - mandoc will transform a '-' into U+2212, but only when it's not
>   directly adjacent to a digit.

Open.

> - In Firefox, <mi> only seems to italicize single letters.

That is required by the MathML standard, see the description of <mi>.

> It looks like adjacent variables, numbers, and operators should be split:
> - 'x=' should become <mi>x</mi><mo>=</mo>

Done.

> - '-b' should become <mo>−</mo><mi>b</mi>

Done except U+2212.

> - '-4ac' should become <mo>−</mo><mn>4</mn><mi>a</mi><mi>c</mi>

I disagree: '4ac' is fine as it is, and '4a c' does become
what you ask for.

> The MathML standard says (MathML 3.0 2e # 3.2.33) that "sin" is
> appropriately marked up with <mi>. So <mi>sin</mi> should be enough to
> correctly render eqn's mathematical words. It seems that for
> non-mathematical words to be rendered with italics by default, they
> should be rendered with a <mi> per letter?

That would be possible, but it is not required, and it gives
strange results for multi-letter identifiers.

Yours,
  Ingo


Log Message:
-----------
Write text boxes as <mi>, <mn>, or <mo> as appropriate,
and write fontstyle or fontweight attributes where required.
Missing features reported by bentley@.

Modified Files:
--------------
    mdocml:
        eqn_html.c
        html.c
        html.h

Revision Data
-------------
Index: html.h
===================================================================
RCS file: /home/cvs/mdocml/mdocml/html.h,v
retrieving revision 1.85
retrieving revision 1.86
diff -Lhtml.h -Lhtml.h -u -p -r1.85 -r1.86
--- html.h
+++ html.h
@@ -51,6 +51,7 @@ enum htmltag {
 	TAG_MATH,
 	TAG_MROW,
 	TAG_MI,
+	TAG_MN,
 	TAG_MO,
 	TAG_MSUP,
 	TAG_MSUB,
Index: html.c
===================================================================
RCS file: /home/cvs/mdocml/mdocml/html.c,v
retrieving revision 1.214
retrieving revision 1.215
diff -Lhtml.c -Lhtml.c -u -p -r1.214 -r1.215
--- html.c
+++ html.c
@@ -87,6 +87,7 @@ static const struct htmldata htmltags[TA
 	{"math",	HTML_NLALL | HTML_INDENT},
 	{"mrow",	0},
 	{"mi",		0},
+	{"mn",		0},
 	{"mo",		0},
 	{"msup",	0},
 	{"msub",	0},
Index: eqn_html.c
===================================================================
RCS file: /home/cvs/mdocml/mdocml/eqn_html.c,v
retrieving revision 1.12
retrieving revision 1.13
diff -Leqn_html.c -Leqn_html.c -u -p -r1.12 -r1.13
--- eqn_html.c
+++ eqn_html.c
@@ -20,6 +20,7 @@
 #include <sys/types.h>
 
 #include <assert.h>
+#include <ctype.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
@@ -33,7 +34,10 @@ eqn_box(struct html *p, const struct eqn
 {
 	struct tag	*post, *row, *cell, *t;
 	const struct eqn_box *child, *parent;
+	const unsigned char *cp;
 	size_t		 i, j, rows;
+	enum htmltag	 tag;
+	enum eqn_fontt	 font;
 
 	if (NULL == bp)
 		return;
@@ -136,9 +140,51 @@ eqn_box(struct html *p, const struct eqn
 		print_otag(p, TAG_MTD, "");
 	}
 
-	if (NULL != bp->text) {
-		assert(NULL == post);
-		post = print_otag(p, TAG_MI, "");
+	if (bp->text != NULL) {
+		assert(post == NULL);
+		tag = TAG_MI;
+		cp = (unsigned char *)bp->text;
+		if (isdigit(cp[0]) || (cp[0] == '.' && isdigit(cp[1]))) {
+			tag = TAG_MN;
+			while (*++cp != '\0') {
+				if (*cp != '.' && !isdigit(*cp)) {
+					tag = TAG_MI;
+					break;
+				}
+			}
+		} else if (*cp != '\0' && isalpha(*cp) == 0) {
+			tag = TAG_MO;
+			while (*++cp != '\0') {
+				if (isalnum(*cp)) {
+					tag = TAG_MI;
+					break;
+				}
+			}
+		}
+		font = bp->font;
+		if (bp->text[0] != '\0' &&
+		    (((tag == TAG_MN || tag == TAG_MO) &&
+		      font == EQNFONT_ROMAN) ||
+		     (tag == TAG_MI && font == (bp->text[1] == '\0' ?
+		      EQNFONT_ITALIC : EQNFONT_ROMAN))))
+			font = EQNFONT_NONE;
+		switch (font) {
+		case EQNFONT_NONE:
+			post = print_otag(p, tag, "");
+			break;
+		case EQNFONT_ROMAN:
+			post = print_otag(p, tag, "?", "fontstyle", "normal");
+			break;
+		case EQNFONT_BOLD:
+		case EQNFONT_FAT:
+			post = print_otag(p, tag, "?", "fontweight", "bold");
+			break;
+		case EQNFONT_ITALIC:
+			post = print_otag(p, tag, "?", "fontstyle", "italic");
+			break;
+		default:
+			abort();
+		}
 		print_text(p, bp->text);
 	} else if (NULL == post) {
 		if (NULL != bp->left || NULL != bp->right)
Hi Ingo,
Thanks for the improvements!
Ingo Schwarze writes:
> The <mi fontstyle="italic">ac</mi> does not seem wrong.
> If you write "ac", mandoc cannot be sure whether this is
> a two-letter identifier (which is correctly marked up above)
> or the product of two identifiers.
>
> In this case, you should probably write "a c" (with a blank)
> to make it clear that these are two identifiers, and then it
> will render as <mi>a</mi><mi>c</mi>.
This decision seems fine to me.
--
Anthony J. Bentley
Hi Anthony,

Ingo Schwarze wrote on Fri, Jun 23, 2017 at 04:57:12AM +0200:
> Anthony J. Bentley wrote on Tue, Jun 20, 2017 at 02:04:29AM -0600:

>> Consider the quadratic formula:
>> x={-b +- sqrt{b sup 2 - 4ac}} over 2a

> After committing the patch appended below, mandoc now renders
> as follows:
>
> <mrow>
>   <mi>x</mi>      <!-- new identifier/operator splitting -->
>   <mo>=</mo>      <!-- new operator element -->
>   <mfrac>
>     <mrow>
[...]
>     </mrow>
>     <mn>2</mn>    <!-- XXX oops, do we need a row here? -->
>     <mi>a</mi>
>   </mfrac>
> </mrow>

I just fixed this with the following commit, to render as:

    </mrow>
    <mrow>
      <mn>2</mn>
      <mi>a</mi>
    </mrow>
  </mfrac>

Yours,
  Ingo


Log Message:
-----------
splitting a text box sometimes requires wrapping it in a list

Modified Files:
--------------
    mandoc:
        eqn.c

Revision Data
-------------
Index: eqn.c
===================================================================
RCS file: /home/cvs/mandoc/mandoc/eqn.c,v
retrieving revision 1.68
retrieving revision 1.69
diff -Leqn.c -Leqn.c -u -p -r1.68 -r1.69
--- eqn.c
+++ eqn.c
@@ -1139,7 +1139,25 @@ this_tok:
 					break;
 				if (ccln == ccl)
 					continue;
-				/* Boundary found, add a new box. */
+				/* Boundary found, split the text. */
+				if (parent->args == parent->expectargs) {
+					/* Remove the text from the tree. */
+					if (cur->prev == NULL)
+						parent->first = cur->next;
+					else
+						cur->prev->next = NULL;
+					parent->last = cur->prev;
+					parent->args--;
+					/* Set up a list instead. */
+					nbox = eqn_box_alloc(ep, parent);
+					nbox->type = EQN_LIST;
+					/* Insert the word into the list. */
+					nbox->first = nbox->last = cur;
+					cur->parent = nbox;
+					cur->prev = NULL;
+					parent = nbox;
+				}
+				/* Append a new text box. */
 				nbox = eqn_box_alloc(ep, parent);
 				nbox->type = EQN_TEXT;
 				nbox->text = mandoc_strdup(cpn);