discuss@mandoc.bsd.lv
From: "Mario Blättermann" <mario.blaettermann@gmail.com>
To: discuss@mandoc.bsd.lv
Subject: Re: HTML output: section headers with diacritics not in table of contents
Date: Fri, 25 Mar 2022 17:07:13 +0100
Message-ID: <CAHi0vA-GTGOP=8pC1gFOn6=yERZycuT3Mm_iOYrN1DsYY7Rozg@mail.gmail.com>
In-Reply-To: <Yj21FDsRyhI4eOvN@asta-kit.de>

Hello Ingo,

On Fri, 25 Mar 2022 at 13:27, Ingo Schwarze <schwarze@usta.de> wrote:
>
> Hi Mario,
>
> Mario Blättermann wrote on Thu, Mar 24, 2022 at 06:13:23PM +0100:
>
> > I recently switched from GNU man-db to mandoc. It's really a big
> > step ahead, especially regarding the creation of HTML pages, but it
> > has its own peculiarities …
> >
> > For creating an HTML man page I use the following command:
> >
> > mandoc -T html -O toc ./manpage.1 > manpage.1.html
>
> You should really use the -O style=... and -O man=... options in
> addition to the options you are already using.  Without "style",
> CSS support is next to absent; no real style sheet is linked to,
> and only a minimal style sheet is embedded with <style>, so minimal
> that many features cannot work.

As far as I understand, proper TOC creation depends on a CSS file?

> Without "man", you get no hyperlinks
> from .Xr macros.
>
It's not about hyperlinks; those work, at least in Archmanweb, and on
my local machine I don't need such links.

> > This works so far for English man pages. For man pages in other
> > languages, I stumbled upon problems with creating toc entries.
>
> Regarding -O toc, you should be aware that a number of senior OpenBSD
> developers hated the feature so much that i disabled it completely
> on man.openbsd.org.  They argue the feature is superfluous, noisy,
> and the whole idea of prefixing a TOC to a manual page is misguided.
>
> Being disabled by default, the feature matures only very slowly, and
> i have to depend on your feedback, and on other people using mandoc
> in a similar way as you are using it, to slowly improve the -O toc
> feature.
>
> > For example, the "SYNOPSIS" is "ÜBERSICHT" in German,
>
> What a terrible mistranslation...
>
> You know, German happens to be my native language, and in
> general English usage, "synopsis" can indeed mean "Zusammenfassung,
> Kurzfassung, Uebersicht", but in a manual page, "SYNOPSIS" really means
> "short syntax display", so an exact name for the section would be
> "Kurzdarstellung der Syntax" or "Syntax-Kurzdarstellung" or some
> similar wording; "Uebersicht" with no further qualification is badly
> misleading.
>
Thank you for your honest opinion. We have been using this translation
for years, and no one has ever complained about it.

> Such mistranslations obviously not only happen for reserved words
> like "SYNOPSIS", but also in the main text of manual pages.
> That's why i hate translated manual pages so much.  Reading German
> manual pages, i usually find them pretty unintelligible.
>
OK, if you hate German man pages anyway, why get upset...?

> Multiple times, i talked to Japanese software developers and even
> though fluency in English is much less common in Japan than, say, in
> Sweden or the Netherlands or even in Germany, they consistently told
> me that very few Japanese people use Japanese manual pages because
> even with a limited understanding of English, Japanese programmers
> tend to find out quickly that they are much better served by English
> than by Japanese manual pages, so they tend to improve their English
> reading skills until they understand what they need.
>
> So while i'm not aggressively trying to *not* support translated manual
> pages, i don't think translated manual pages are particularly relevant
> either.
>
OK, I understand. I won't expect any further effort from your side to
improve TOC creation. Maybe I can discuss this with the
Archmanweb developers.

Best Regards,
Mario


> > and the "Ü" is displayed correctly,
>
> Yes, mandoc aims to correctly handle LC_CTYPE=*.UTF-8.  If Unicode
> characters are displayed incorrectly in UTF-8 or HTML output mode,
> i consider that a bug because non-ASCII characters are also crucial
> for the understanding of a limited number of English manual pages.
> Here is the canonical example:
>
>   https://man.openbsd.org/lgamma.3
>
> > but the header is not clickable because it
> > doesn't have a toc entry. You can see this in the Archlinux online man
> > pages [1]; as you might know, "Archmanweb" uses Mandoc.
> > [1] https://man.archlinux.org/man/diff.1.de
>
> Yes, i'm aware that Arch uses Mandoc :-).
> https://mandoc.bsd.lv/ links to https://man.archlinux.org/
> below "Documentation and help",
> and so does https://mandoc.bsd.lv/ports.html,
> saying that Arch uses mandoc for the web since 2021 Jan 14.
>
>
> Handling non-ASCII tag content is non-trivial.
> Similar to preconv(1), mandoc(1) needs to translate non-ASCII
> input to roff(7) character escape sequences at input time.
> So when trying to figure out whether the string "Uebersicht"
> should be tagged, we have this situation:
>
>   Breakpoint 2, post_SH (man=0x6e29d380000, n=0x6e29d351000)
>       at /usr/src/usr.bin/mandoc/man_validate.c:312
>   312     {
>   (gdb) n
>   316             nc = n->child;
>   (gdb)
>   317             switch (n->type) {
>   (gdb)
>   319                     tag = NULL;
>   (gdb)
>   320                     deroff(&tag, n);
>   (gdb)
>   321                     if (tag != NULL) {
>   (gdb) p tag
>   $2 = 0x6e29d35c520 "\\[u00DC]BERSICHT"
>
> Now, how is the mandoc tagging module supposed to figure out that
> \\[u00DC] is a letter?  Using iswalpha(3) would be awkward at best
> given that mandoc does not use wchar_t at all, and even more so because
> resolving the \\[u00DC] sequence only happens at the output stage, so
> a wchar_t or char32_t form of the character isn't really available to
> the tagging module.  Consequently, the tagging module treats \\[u00DC]
> like any other roff(7) escape sequence, aborting tag processing.
> Since tag generation gets aborted for the first letter of this word,
> no tag is generated at all, which results in no HTML id= attribute
> being generated for the h1 class="Sh" element, and hence no underlining
> and no clickability.
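
As a minimal sketch of the behavior described above (this is not
mandoc's actual code; the function name leading_tag() and its interface
are made up for illustration), the following C function keeps only the
part of a heading that precedes the first roff(7) escape sequence.  It
models nothing else the tagging module may do.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not taken from mandoc: copy the leading part of
 * a heading into "tag", stopping at the first backslash, i.e. at the
 * first roff(7) escape sequence.  The result is truncated to fit
 * into tagsz bytes including the terminating NUL.
 * Returns the tag length; 0 means that no tag would be generated.
 */
size_t
leading_tag(const char *heading, char *tag, size_t tagsz)
{
	size_t len;

	if (tagsz == 0)
		return 0;
	/* strcspn(3) counts the bytes before the first backslash. */
	len = strcspn(heading, "\\");
	if (len >= tagsz)
		len = tagsz - 1;
	memcpy(tag, heading, len);
	tag[len] = '\0';
	return len;
}
```

Under this simplified model, a heading that starts with an escape
sequence yields an empty tag, while a heading that starts with a plain
ASCII letter followed by an escape sequence yields just that one letter.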
>
> This points to a deeper problem.
>
> Section names in manual pages serve a double function.  On the one
> hand, they communicate to the reader which content to expect in the
> section.  That purpose is partly defeated by translating a manual
> page because translations tend to be inconsistent.  Some manual pages
> translate "SYNOPSIS" to "Syntax" in German; this particular one to
> "Uebersicht".  So the effect that the reader immediately recognizes
> the section name is lost, even if (and in particular when) the reader
> is used to reading German manual pages.  (The same applies to the
> main text, of course; translations of technical terms are usually
> inconsistent and often awkward and misleading, making the text hard
> to understand even for native speakers of the target language.)
>
> On the other hand, section names serve a syntactic function,
> modifying the interpretation of macros and other content in
> the section in question.  That syntactic function is *completely*
> lost by translating the manual page.  For example, mandoc
> implements special handling for the following sections:
>
>   SEC_SYNOPSIS      (11 special features)
>   SEC_NAME          (7 special features)
>   SEC_LIBRARY       (5 special features)
>   SEC_DESCRIPTION   (2 special features)
>   SEC_RETURN_VALUES (1 special feature)
>   SEC_ERRORS        (2 special features)
>   SEC_SEE_ALSO      (7 special features)
>   SEC_AUTHORS       (8 special features)
>
> All that special processing is completely lost in a translated manual page.
>
> > The German keyboard produces the letter "Ü" as a single character
> > named "LATIN CAPITAL LETTER U WITH DIAERESIS", but there's a kind of
> > split "Ü" available: U U+0055 LATIN CAPITAL LETTER U ̈ U+0308
> > COMBINING DIAERESIS. If I change it in the Groff source, toc creation
> > works fine using this split one.
>
> Well, not really.  The two-character form of the umlaut is "U\\[u0308]"
> instead of "\\[u00DC]", so the tag will be "U", which isn't really
> useful either.
>
> > Moreover, in the Vietnamese version of the same man page [2], even
> > more toc entries are missing. Obviously because multiple section
> > headers start with "T", followed by diacritics, no toc entry is
> > created for those. But interesting: TÓM TẮT doesn't have an entry, TÊN
> > does have one. I can't imagine how Mandoc distinguishes between
> > acceptable and unacceptable diacritics.
> > [2] https://man.archlinux.org/man/diff.1.vi
>
> It doesn't.  If you have conflicting tag locations (in this case
> two definitions of the term "T") and no other information which one
> is more authoritative, the first one wins.  But in less(1), you
> can still navigate both by typing ":t T" to navigate to the first
> definition of the term "T", then hit the key "T" (next tag) to
> navigate to the next definition of the term "T" (a bit awkward
> we ended up with the term "T" in this example, which clashes with
> the less(1) command "T" :).
>
> > The described behavior is the same with a pure Mandoc on my local
> > system and with Archmanweb. However, the developers of Debiman
> > obviously found a solution [3], maybe unconsciously …? In any case,
> > their Mandoc is wrapped in a Go-based environment. Besides some extra
> > features Archmanweb doesn't have (for example, better detection of
> > cross-references to other man pages if they are not formatted as
> > such), the toc creation works, even for the Vietnamese version [4].
> > [3] https://manpages.debian.org/bullseye/manpages-de/diff.1.de.html
> > [4] https://manpages.debian.org/unstable/manpages-vi/diff.1.vi.html.gz
>
> The right-hand sidebar in manpages.debian.org is not coming from
> mandoc.  I suspect that is generated by some Go code.
>
> > Any idea what is wrong? Well, first I thought the problem was on my
> > machine, but Archmanweb shows the same behavior. As a workaround, I
> > could produce a few more toc entries by replacing "Ü" with "Ü" and
> > similar, but as long as I don't know what rules Mandoc applies
> > internally, it's almost impossible to fix. I should mention that, as
> > one of the maintainers of the manpages-l10n project [5], I have to
> > maintain many languages, not only my own one …
> > [5] https://salsa.debian.org/manpages-l10n-team/manpages-l10n
>
> Wow, you are fluent in many languages?  That's impressive.  :-)
>
> > I consider the online collections just as important as the local
> > versions, especially for linking to a specific man page section or
> > subsection in email or web,
>
> Yes, that is indeed helpful when answering user questions on mailing lists.
> Fortunately, the links that really matter typically work even with
> translated manual pages because the syntax elements that users need
> to learn are usually *not* translated and their English names usually
> contain ASCII characters only, for example
>
>   https://man.archlinux.org/man/diff.1.vi#u
>   https://man.archlinux.org/man/diff.1.vi#help
>
> > and for searching in man pages which are not installed locally.
> > Any help with solving this problem would be appreciated.
>
> Right now, i'm not yet sure what to do about tagging of words that
> involve non-ASCII characters.  Maybe we can figure something out,
> but how?  Hmmm...
> Maybe mandoc should treat any \\[uXXXX] sequence as a letter for
> the purposes of tagging?  The code needed for that will look rather
> awkward though, and even when implemented perfectly, the tags will
> be UTF-8 rather than ASCII-encoded.  Would links like
>
>   https://man.archlinux.org/man/diff.1.de#%C3%9CBERSICHT
>
> really be all that useful?  What do people think?
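
To illustrate where a fragment like %C3%9C comes from, here is a
minimal sketch in C, assuming the code point has already been parsed
out of a \[uXXXX] escape.  The function pct_encode_cp() is hypothetical
and not part of mandoc; it encodes one code point as UTF-8 and
percent-encodes every resulting byte, so it is only meant for non-ASCII
code points up to U+FFFF and does not check for surrogates.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper, not taken from mandoc: encode the code point
 * "cp" as UTF-8 and write each byte to "buf" in percent-encoded form.
 * Returns the length of the resulting string, or 0 if cp is above
 * U+FFFF or if buf is too small.
 */
size_t
pct_encode_cp(unsigned int cp, char *buf, size_t bufsz)
{
	unsigned char b[3];
	size_t i, n, len;

	if (cp < 0x80) {
		/* One byte: 0xxxxxxx */
		b[0] = cp;
		n = 1;
	} else if (cp < 0x800) {
		/* Two bytes: 110xxxxx 10xxxxxx */
		b[0] = 0xC0 | (cp >> 6);
		b[1] = 0x80 | (cp & 0x3F);
		n = 2;
	} else if (cp < 0x10000) {
		/* Three bytes: 1110xxxx 10xxxxxx 10xxxxxx */
		b[0] = 0xE0 | (cp >> 12);
		b[1] = 0x80 | ((cp >> 6) & 0x3F);
		b[2] = 0x80 | (cp & 0x3F);
		n = 3;
	} else
		return 0;

	/* Each byte needs three characters, plus the terminating NUL. */
	if (bufsz < n * 3 + 1)
		return 0;
	for (i = 0, len = 0; i < n; i++)
		len += snprintf(buf + len, bufsz - len, "%%%02X",
		    (unsigned int)b[i]);
	return len;
}
```

Calling pct_encode_cp(0x00DC, buf, sizeof(buf)) fills buf with
"%C3%9C", the prefix of the fragment in the URL quoted above.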
>
> Yours,
>   Ingo

