From: Ingo Schwarze <schwarze@usta.de>
To: mario.blaettermann@gmail.com
Cc: discuss@mandoc.bsd.lv
Subject: Re: HTML output: section headers with diacritics not in table of contents
Date: Fri, 25 Mar 2022 13:27:00 +0100
Message-ID: <Yj21FDsRyhI4eOvN@asta-kit.de>
In-Reply-To: <CAHi0vA_pfSMOHCd+JuG0efwz=WYUDs2N4=Mgv+BzHcP62T-jAw@mail.gmail.com>

Hi Mario,

Mario Blättermann wrote on Thu, Mar 24, 2022 at 06:13:23PM +0100:

> recently I'm switched from GNU man-db to mandoc. It's really a big
> step ahead, especially regarding the creation of HTML pages, but it
> has its own peculiarities …
>
> For creating a HTML man page I use the following command:
>
> mandoc -T html -O toc ./manpage.1 > manpage.1.html

You should really use the -O style=... and -O man=... options
in addition to the options you are already using.
Without "style", CSS support is next to absent; no real style sheet
is linked to, and only a minimal style sheet is embedded with <style>,
so minimal that many features cannot work.
Without "man", you get no hyperlinks from .Xr macros.

> This works so far for English man pages. For man pages in other
> languages, I stumbled upon problems with creating toc entries.

Regarding -O toc, you should be aware that a number of senior OpenBSD
developers hated the feature so much that i disabled it completely
on man.openbsd.org. They argue the feature is superfluous, noisy,
and the whole idea of prefixing a TOC to a manual page is misguided.

Being disabled by default, the feature matures only very slowly,
and i have to depend on your feedback, and on other people using
mandoc in a similar way as you are using it, to slowly improve
the -O toc feature.

> For example, the "SYNOPSIS" is "ÜBERSICHT" in German,

What a terrible mistranslation...
You know, German happens to be my native language, and in general
English usage, "synopsis" can indeed mean "Zusammenfassung,
Kurzfassung, Uebersicht", but in a manual page, "SYNOPSIS" really
means "short syntax display", so an exact name for the section would
be "Kurzdarstellung der Syntax" or "Syntax-Kurzdarstellung" or some
similar wording; "Uebersicht" with no further qualification is badly
misleading.

Such mistranslations obviously happen not only for reserved words
like "SYNOPSIS", but also in the main text of manual pages. That's
why i hate translated manual pages so much. Reading German manual
pages, i usually find them pretty unintelligible.

Multiple times, i talked to Japanese software developers and even
though fluency in English is much less common in Japan than, say,
in Sweden or the Netherlands or even in Germany, they consistently
told me that very few Japanese people use Japanese manual pages
because even with a limited understanding of English, Japanese
programmers tend to find out quickly that they are much better
served by English than by Japanese manual pages, so they tend to
improve their English reading skills until they understand what
they need.

So while i'm not aggressively trying to *not* support translated
manual pages, i don't think translated manual pages are particularly
relevant either.

> and the "Ü" is displayed correctly,

Yes, mandoc aims to correctly handle LC_CTYPE=*.UTF-8.
If Unicode characters are displayed incorrectly in UTF-8 or HTML
output mode, i consider that a bug because non-ASCII characters are
also crucial for the understanding of a limited number of English
manual pages. Here is the canonical example:

  https://man.openbsd.org/lgamma.3

> but the header is not clickable because it
> doesn't have a toc entry. You can see this in the Archlinux online man
> pages [1]; as you might know, "Archmanweb" uses Mandoc.
> https://man.archlinux.org/man/diff.1.de

Yes, i'm aware that Arch uses Mandoc :-).
https://mandoc.bsd.lv/ links to https://man.archlinux.org/
below "Documentation and help", and so does
https://mandoc.bsd.lv/ports.html, saying that Arch has been using
mandoc for the web since 2021 Jan 14.

Handling non-ASCII tag content is non-trivial.

Similar to preconv(1), mandoc(1) needs to translate non-ASCII input
to roff(7) character escape sequences at input time. So when trying
to figure out whether the string "Uebersicht" should be tagged, we
have this situation:

  Breakpoint 2, post_SH (man=0x6e29d380000, n=0x6e29d351000)
      at /usr/src/usr.bin/mandoc/man_validate.c:312
  312     {
  (gdb) n
  316             nc = n->child;
  (gdb)
  317             switch (n->type) {
  (gdb)
  319                     tag = NULL;
  (gdb)
  320                     deroff(&tag, n);
  (gdb)
  321                     if (tag != NULL) {
  (gdb) p tag
  $2 = 0x6e29d35c520 "\\[u00DC]BERSICHT"

Now, how is the mandoc tagging module supposed to figure out that
\\[u00DC] is a letter? Using iswalpha(3) would be awkward at best
given that mandoc does not use wchar_t at all, and even more so
because resolving the \\[u00DC] sequence only happens at the output
stage, so a wchar_t or char32_t form of the character isn't really
available to the tagging module.

Consequently, the tagging module treats \\[u00DC] like any other
roff(7) escape sequence, aborting tag processing. Since tag
generation gets aborted for the first letter of this word, no tag
is generated at all, which results in no HTML id= attribute being
generated for the h1 class="Sh" element, and hence no underlining
and no clickability.

This points to a deeper problem. Section names in manual pages
serve a double function. On the one hand, they communicate to the
reader which content to expect in the section. That purpose is
partly defeated by translating a manual page because translations
tend to be inconsistent. Some manual pages translate "SYNOPSIS" to
"Syntax" in German; this particular one to "Uebersicht".
So the effect that the reader immediately recognizes the section
name is lost, even if (and in particular when) the reader is used
to reading German manual pages. (The same applies to the main text,
of course; translations of technical terms are usually inconsistent
and often awkward and misleading, making the text hard to understand
even for native speakers of the target language.)

On the other hand, section names serve a syntactic function,
modifying the interpretation of macros and other content in the
section in question. That syntactic function is *completely* lost
by translating the manual page. For example, mandoc implements
special handling for the following sections:

  SEC_SYNOPSIS (11 special features)
  SEC_NAME (7 special features)
  SEC_LIBRARY (5 special features)
  SEC_DESCRIPTION (2 special features)
  SEC_RETURN_VALUES (1 special feature)
  SEC_ERRORS (2 special features)
  SEC_SEE_ALSO (7 special features)
  SEC_AUTHORS (8 special features)

All that special processing is completely lost in a translated
manual page.

> The German keyboard produces the letter "Ü" as a single character
> named "LATIN CAPITAL LETTER U WITH DIAERESIS", but there's a kind of
> splitted "Ü" available: U U+0055 LATIN CAPITAL LETTER U ̈ U+0308
> COMBINING DIAERESIS. If I change it in the Groff source, toc creation
> works fine using this splitted one.

Well, not really. The two-character form of the umlaut is
"U\\[u0308]" instead of "\\[u00DC]", so the tag will be "U",
which isn't really useful either.

> Moreover, in the Vietnamese version of the same man page [2], even
> [2] https://man.archlinux.org/man/diff.1.vi
> more toc entries are missing. Obviously because multiple section
> headers start with "T", followed by diacritics, no toc entry is
> created for those. But interesting: TÓM TẮT doesn't have an entry, TÊN
> does have one. I can't imagine how Mandoc distinguishes between
> acceptable and unacceptable diacritics.

It doesn't.
If you have conflicting tag locations (in this case two definitions
of the term "T") and no other information which one is more
authoritative, the first one wins. But in less(1), you can still
navigate both by typing ":t T" to navigate to the first definition
of the term "T", then hit the key "T" (next tag) to navigate to the
next definition of the term "T" (a bit awkward we ended up with the
term "T" in this example, which clashes with the less(1) command
"T" :).

> The described behavior is the same with a pure Mandoc on my local
> system and with Archmanweb. However, the developers of Debiman
> obviously found a solution [3], maybe unconsciously …? In any case,
> [3] https://manpages.debian.org/bullseye/manpages-de/diff.1.de.html
> their Mandoc is wrapped in a Go-based environment. Besides some extra
> features Archmanweb doesn't have (for example, better detection of
> cross-references to other man pages if they are not formatted as
> such), the toc creation works, even for the Vietnamese version [4].
> [4] https://manpages.debian.org/unstable/manpages-vi/diff.1.vi.html.gz

The right-hand sidebar in manpages.debian.org is not coming from
mandoc. I suspect that is generated by some Go code.

> Any idea what is wrong? Well, first I thought the problem is on my
> machine, but Archmanweb shows the same behavior. As a workaround, I
> could produce a few more toc entries by replacing "Ü" with "Ü" and
> similar, but as long as I don't know what rules Mandoc applies
> internally, it's almost impossible to fix. To mention, as one of the
> maintainers of the manpages-l10n project [5], I have to maintain many
> [5] https://salsa.debian.org/manpages-l10n-team/manpages-l10n
> languages, not only my own one …

Wow, you are fluent in many languages? That's impressive.
:-)

> I consider the online collections just as important as the local
> versions, especially for linking to a specific man page section or
> subsection in email or web,

Yes, that is indeed helpful when answering user questions on
mailing lists. Fortunately, the links that really matter typically
work even with translated manual pages because the syntax elements
that users need to learn are usually *not* translated and their
English names usually contain ASCII characters only, for example

  https://man.archlinux.org/man/diff.1.vi#u
  https://man.archlinux.org/man/diff.1.vi#help

> and for searching in man pages which are not installed locally.
> Any help with solving this problem would be appreciated.

Right now, i'm not yet sure what to do about tagging of words that
involve non-ASCII characters. Maybe we can figure something out,
but how?

Hmmm... Maybe mandoc should treat any \\[uXXXX] sequence as a letter
for the purposes of tagging? The code needed for that will look
rather awkward though, and even when implemented perfectly, the tags
will be UTF-8 rather than ASCII-encoded. Would links like

  https://man.archlinux.org/man/diff.1.de#%C3%9CBERSICHT

really be all that useful?

What do people think?

Yours,
  Ingo
--
 To unsubscribe send an email to discuss+unsubscribe@mandoc.bsd.lv
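
For readers following the thread, the idea floated in the message
(treating a \[uXXXX] escape as a letter when building a tag) can be
illustrated with a small stand-alone C sketch. This is not mandoc
code and not a proposed patch: the names make_tag(), put_utf8() and
percent_encode() are hypothetical, the unicode_ok switch is only
there to contrast the two behaviours, and looking at the first word
only is a simplifying assumption.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Encode code point cp as UTF-8 into out; return the number of bytes. */
static size_t
put_utf8(unsigned long cp, char *out)
{
	if (cp < 0x80) {
		out[0] = (char)cp;
		return 1;
	}
	if (cp < 0x800) {
		out[0] = (char)(0xC0 | (cp >> 6));
		out[1] = (char)(0x80 | (cp & 0x3F));
		return 2;
	}
	if (cp < 0x10000) {
		out[0] = (char)(0xE0 | (cp >> 12));
		out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (char)(0x80 | (cp & 0x3F));
		return 3;
	}
	out[0] = (char)(0xF0 | (cp >> 18));
	out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
	out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
	out[3] = (char)(0x80 | (cp & 0x3F));
	return 4;
}

/*
 * Hypothetical, not mandoc code: copy the first word of s into tag.
 * A roff escape ends the tag, unless unicode_ok is set and the escape
 * has the form \[uXXXX], which is then stored as UTF-8.
 * Returns the tag length; 0 means that no tag would be generated.
 */
static size_t
make_tag(const char *s, int unicode_ok, char *tag, size_t tagsz)
{
	unsigned long	 cp;
	size_t		 len = 0;
	const char	*p;
	int		 c;

	while (*s != '\0' && *s != ' ' && len + 5 <= tagsz) {
		if (*s != '\\') {
			tag[len++] = *s++;	/* ordinary character */
			continue;
		}
		if (!unicode_ok || s[1] != '[' || s[2] != 'u')
			break;			/* any other escape ends the tag */
		cp = 0;
		for (p = s + 3; isxdigit((unsigned char)*p); p++) {
			c = tolower((unsigned char)*p);
			cp = cp * 16 +
			    (unsigned long)(isdigit(c) ? c - '0' : c - 'a' + 10);
			if (cp > 0x10FFFF)
				break;		/* beyond the Unicode range */
		}
		if (p == s + 3 || *p != ']' || cp > 0x10FFFF)
			break;			/* malformed escape ends the tag */
		len += put_utf8(cp, tag + len);
		s = p + 1;
	}
	tag[len] = '\0';
	return len;
}

/* Percent-encode everything except RFC 3986 unreserved characters. */
static void
percent_encode(const char *in, char *out, size_t outsz)
{
	size_t		 len = 0;
	unsigned char	 c;

	for (; *in != '\0' && len + 4 <= outsz; in++) {
		c = (unsigned char)*in;
		if (c < 0x80 && (isalnum(c) || strchr("-._~", c) != NULL))
			out[len++] = (char)c;
		else
			len += (size_t)snprintf(out + len, 4, "%%%02X",
			    (unsigned int)c);
	}
	out[len] = '\0';
}
```

With unicode_ok set to 0, make_tag() returns 0 for the C string
"\\[u00DC]BERSICHT" (no tag) and yields "U" for "U\\[u0308]BERSICHT",
which mirrors the two cases described in the message. With
unicode_ok set to 1, the first string becomes a UTF-8 tag, and
percent_encode() turns it into %C3%9CBERSICHT, the fragment form
shown in the example link.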