From: "Mario Blättermann" <mario.blaettermann@gmail.com>
To: discuss@mandoc.bsd.lv
Subject: Re: HTML output: section headers with diacritics not in table of contents
Date: Fri, 25 Mar 2022 17:07:13 +0100
Message-ID: <CAHi0vA-GTGOP=8pC1gFOn6=yERZycuT3Mm_iOYrN1DsYY7Rozg@mail.gmail.com>
In-Reply-To: <Yj21FDsRyhI4eOvN@asta-kit.de>

Hello Ingo,

On Fri, 25 Mar 2022 at 13:27, Ingo Schwarze <schwarze@usta.de> wrote:
>
> Hi Mario,
>
> Mario Blättermann wrote on Thu, Mar 24, 2022 at 06:13:23PM +0100:
>
> > recently I switched from GNU man-db to mandoc. It's really a big
> > step ahead, especially regarding the creation of HTML pages, but it
> > has its own peculiarities …
> >
> > For creating an HTML man page I use the following command:
> >
> > mandoc -T html -O toc ./manpage.1 > manpage.1.html
>
> You should really use the -O style=... and -O man=... options in
> addition to the options you are already using. Without "style",
> CSS support is next to absent; no real style sheet is linked to,
> and only a minimal style sheet is embedded with <style>, so minimal
> that many features cannot work.

As far as I understand, proper TOC creation depends on a CSS file?

> Without "man", you get no hyperlinks
> from .Xr macros.
>
It's not about hyperlinks; this works at least in Archmanweb, and on my
local machine I don't need such links.

> > This works so far for English man pages. For man pages in other
> > languages, I stumbled upon problems with creating toc entries.
>
> Regarding -O toc, you should be aware that a number of senior OpenBSD
> developers hated the feature so much that i disabled it completely
> on man.openbsd.org. They argue the feature is superfluous, noisy,
> and the whole idea of prefixing a TOC to a manual page is misguided.
>
> Being disabled by default, the feature matures only very slowly, and
> i have to depend on your feedback, and on other people using mandoc
> in a similar way as you are using it, to slowly improve the -O toc
> feature.
>
> > For example, the "SYNOPSIS" is "ÜBERSICHT" in German,
>
> What a terrible mistranslation...
>
> You know, German happens to be my native language, and in
> general English usage, "synopsis" can indeed mean "Zusammenfassung,
> Kurzfassung, Uebersicht", but in a manual page, "SYNOPSIS" really means
> "short syntax display", so an exact name for the section would be
> "Kurzdarstellung der Syntax" or "Syntax-Kurzdarstellung" or some
> similar wording; "Uebersicht" with no further qualification is badly
> misleading.
>
Thank you for your honest opinion. We have used this translation for
years, and no one has ever complained about it.

> Such mistranslations obviously not only happen for reserved words
> like "SYNOPSIS", but also in the main text of manual pages.
> That's why i hate translated manual pages so much. Reading German
> manual pages, i usually find them pretty unintelligible.
>
OK, if you hate German man pages anyway, why get upset...?

> Multiple times, i talked to Japanese software developers and even
> though fluency in English is much less common in Japan than, say, in
> Sweden or the Netherlands or even in Germany, they consistently told
> me that very few Japanese people use Japanese manual pages because
> even with a limited understanding of English, Japanese programmers
> tend to find out quickly that they are much better served by English
> than by Japanese manual pages, so they tend to improve their English
> reading skills until they understand what they need.
>
> So while i'm not aggressively trying to *not* support translated manual
> pages, i don't think translated manual pages are particularly relevant
> either.
>
OK, I understand. I don't expect any further efforts to get a better
TOC creation from your side.
Maybe I can discuss this with the Archmanweb developers.

Best Regards,
Mario

> > and the "Ü" is displayed correctly,
>
> Yes, mandoc aims to correctly handle LC_CTYPE=*.UTF-8. If Unicode
> characters are displayed incorrectly in UTF-8 or HTML output mode,
> i consider that a bug because non-ASCII characters are also crucial
> for the understanding of a limited number of English manual pages.
> Here is the canonical example:
>
>   https://man.openbsd.org/lgamma.3
>
> > but the header is not clickable because it
> > doesn't have a toc entry. You can see this in the Archlinux online man
> > pages [1]; as you might know, "Archmanweb" uses Mandoc.
> > https://man.archlinux.org/man/diff.1.de
>
> Yes, i'm aware that Arch uses Mandoc :-).
> https://mandoc.bsd.lv/ links to https://man.archlinux.org/
> below "Documentation and help",
> and so does https://mandoc.bsd.lv/ports.html,
> saying that Arch uses mandoc for the web since 2021 Jan 14.
>
>
> Handling non-ASCII tag content is non-trivial.
> Similar to preconv(1), mandoc(1) needs to translate non-ASCII
> input to roff(7) character escape sequences at input time.
> So when trying to figure out whether the string "Uebersicht"
> should be tagged, we have this situation:
>
>   Breakpoint 2, post_SH (man=0x6e29d380000, n=0x6e29d351000)
>       at /usr/src/usr.bin/mandoc/man_validate.c:312
>   312 {
>   (gdb) n
>   316 nc = n->child;
>   (gdb)
>   317 switch (n->type) {
>   (gdb)
>   319 tag = NULL;
>   (gdb)
>   320 deroff(&tag, n);
>   (gdb)
>   321 if (tag != NULL) {
>   (gdb) p tag
>   $2 = 0x6e29d35c520 "\\[u00DC]BERSICHT"
>
> Now, how is the mandoc tagging module supposed to figure out that
> \\[u00DC] is a letter? Using iswalpha(3) would be awkward at best
> given that mandoc does not use wchar_t at all, and even more so because
> resolving the \\[u00DC] sequence only happens at the output stage, so
> a wchar_t or char32_t form of the character isn't really available to
> the tagging module.
> Consequently, the tagging module treats \\[u00DC]
> like any other roff(7) escape sequence, aborting tag processing.
> Since tag generation gets aborted for the first letter of this word,
> no tag is generated at all, which results in no HTML id= attribute
> being generated for the h1 class="Sh" element, and hence no underlining
> and no clickability.
>
> This points to a deeper problem.
>
> Section names in manual pages serve a double function. On the one
> hand, they communicate to the reader which content to expect in the
> section. That purpose is partly defeated by translating a manual
> page because translations tend to be inconsistent. Some manual pages
> translate "SYNOPSIS" to "Syntax" in German; this particular one to
> "Uebersicht". So the effect that the reader immediately recognizes
> the section name is lost, even if (and in particular when) the reader
> is used to reading German manual pages. (The same applies to the
> main text, of course; translations of technical terms are usually
> inconsistent and often awkward and misleading, making the text hard
> to understand even for native speakers of the target language.)
>
> On the other hand, section names serve a syntactic function,
> modifying the interpretation of macros and other content in
> the section in question. That syntactic function is *completely*
> lost by translating the manual page. For example, mandoc
> implements special handling for the following sections:
>
>   SEC_SYNOPSIS (11 special features)
>   SEC_NAME (7 special features)
>   SEC_LIBRARY (5 special features)
>   SEC_DESCRIPTION (2 special features)
>   SEC_RETURN_VALUES (1 special feature)
>   SEC_ERRORS (2 special features)
>   SEC_SEE_ALSO (7 special features)
>   SEC_AUTHORS (8 special features)
>
> All that special processing is completely lost in a translated manual page.
>
> > The German keyboard produces the letter "Ü" as a single character
> > named "LATIN CAPITAL LETTER U WITH DIAERESIS", but there's a kind of
> > split "Ü" available: U U+0055 LATIN CAPITAL LETTER U ̈ U+0308
> > COMBINING DIAERESIS. If I change it in the Groff source, toc creation
> > works fine using this split one.
>
> Well, not really. The two-character form of the umlaut is "U\\[u0308]"
> instead of "\\[u00DC]", so the tag will be "U", which isn't really
> useful either.
>
> > Moreover, in the Vietnamese version of the same man page [2], even
> > [2] https://man.archlinux.org/man/diff.1.vi
> > more toc entries are missing. Obviously because multiple section
> > headers start with "T", followed by diacritics, no toc entry is
> > created for those. But interesting: TÓM TẮT doesn't have an entry, TÊN
> > does have one. I can't imagine how Mandoc distinguishes between
> > acceptable and unacceptable diacritics.
>
> It doesn't. If you have conflicting tag locations (in this case
> two definitions of the term "T") and no other information which one
> is more authoritative, the first one wins. But in less(1), you
> can still navigate both by typing ":t T" to navigate to the first
> definition of the term "T", then hit the key "T" (next tag) to
> navigate to the next definition of the term "T" (a bit awkward
> we ended up with the term "T" in this example, which clashes with
> the less(1) command "T" :).
>
> > The described behavior is the same with a pure Mandoc on my local
> > system and with Archmanweb. However, the developers of Debiman
> > obviously found a solution [3], maybe unconsciously …? In any case,
> > [3] https://manpages.debian.org/bullseye/manpages-de/diff.1.de.html
> > their Mandoc is wrapped in a Go-based environment.
> > Besides some extra
> > features Archmanweb doesn't have (for example, better detection of
> > cross-references to other man pages if they are not formatted as
> > such), the toc creation works, even for the Vietnamese version [4].
> > [4] https://manpages.debian.org/unstable/manpages-vi/diff.1.vi.html.gz
>
> The right-hand sidebar in manpages.debian.org is not coming from
> mandoc. I suspect that is generated by some Go code.
>
> > Any idea what is wrong? Well, first I thought the problem is on my
> > machine, but Archmanweb shows the same behavior. As a workaround, I
> > could produce a few more toc entries by replacing "Ü" with "Ü" and
> > similar, but as long as I don't know what rules Mandoc applies
> > internally, it's almost impossible to fix. To mention, as one of the
> > maintainers of the manpages-l10n project [5], I have to maintain many
> > [5] https://salsa.debian.org/manpages-l10n-team/manpages-l10n
> > languages, not only my own one …
>
> Wow, you are fluent in many languages? That's impressive. :-)
>
> > I consider the online collections just as important as the local
> > versions, especially for linking to a specific man page section or
> > subsection in email or web,
>
> Yes, that is indeed helpful when answering user questions on mailing lists.
> Fortunately, the links that really matter typically work even with
> translated manual pages because the syntax elements that users need
> to learn are usually *not* translated and their English names usually
> contain ASCII characters only, for example
>
>   https://man.archlinux.org/man/diff.1.vi#u
>   https://man.archlinux.org/man/diff.1.vi#help
>
> > and for searching in man pages which are not installed locally.
> > Any help with solving this problem would be appreciated.
>
> Right now, i'm not yet sure what to do about tagging of words that
> involve non-ASCII characters. Maybe we can figure something out,
> but how? Hmmm...
> Maybe mandoc should treat any \\[uXXXX] sequence as a letter for
> the purposes of tagging? The code needed for that will look rather
> awkward though, and even when implemented perfectly, the tags will
> be UTF-8 rather than ASCII-encoded. Would links like
>
>   https://man.archlinux.org/man/diff.1.de#%C3%9CBERSICHT
>
> really be all that useful? What do people think?
>
> Yours,
> Ingo
--
To unsubscribe send an email to discuss+unsubscribe@mandoc.bsd.lv