From: "Mario Blättermann" <mario.blaettermann@gmail.com>
To: Ingo Schwarze <schwarze@usta.de>
Cc: discuss@mandoc.bsd.lv
Subject: Re: HTML output: section headers with diacritics not in table of contents
Date: Fri, 25 Mar 2022 17:57:21 +0100
Message-ID: <CAHi0vA8zpGgj71Y60XuT-m01vTy4E-b93s7WMh-PhGFo6tZiyA@mail.gmail.com>
In-Reply-To: <Yj21FDsRyhI4eOvN@asta-kit.de>

Hello Ingo,

sorry, I haven't seen the rest of your message.

On Fri, 25 Mar 2022 at 13:27, Ingo Schwarze <schwarze@usta.de> wrote:
> Hi Mario,
> Mario Blättermann wrote on Thu, Mar 24, 2022 at 06:13:23PM +0100:
> > I recently switched from GNU man-db to mandoc. It's really a big
> > step ahead, especially regarding the creation of HTML pages, but it
> > has its own peculiarities …
> >
> > To create an HTML man page I use the following command:
> >
> > mandoc -T html -O toc ./manpage.1 > manpage.1.html
> You should really use the -O style=... and -O man=... options in
> addition to the options you are already using.  Without "style",
> CSS support is next to absent; no real style sheet is linked to,
> and only a minimal style sheet is embedded with <style>, so minimal
> that many features cannot work.  Without "man", you get no hyperlinks
> from .Xr macros.
> > This works so far for English man pages. For man pages in other
> > languages, I stumbled upon problems with creating toc entries.
> Regarding -O toc, you should be aware that a number of senior OpenBSD
> developers hated the feature so much that i disabled it completely
> on man.openbsd.org.  They argue the feature is superfluous, noisy,
> and the whole idea of prefixing a TOC to a manual page is misguided.
For the record, the TOC does work at man.openbsd.org, for example:

> Being disabled by default, the feature matures only very slowly, and
> i have to depend on your feedback, and on other people using mandoc
> in a similar way as you are using it, to slowly improve the -O toc
> feature.
> > For example, the "SYNOPSIS" is "ÜBERSICHT" in German,
> What a terrible mistranslation...
> You know, German happens to be my native language, and in
> general English usage, "synopsis" can indeed mean "Zusammenfassung,
> Kurzfassung, Uebersicht", but in a manual page, "SYNOPSIS" really means
> "short syntax display", so an exact name for the section would be
> "Kurzdarstellung der Syntax" or "Syntax-Kurzdarstellung" or some
> similar wording; "Uebersicht" with no further qualification is badly
> misleading.
> Such mistranslations obviously not only happen for reserved words
> like "SYNOPSIS", but also in the main text of manual pages.
> That's why i hate translated manual pages so much.  Reading German
> manual pages, i usually find them pretty unintelligible.
> Multiple times, i talked to Japanese software developers and even
> though fluency in English is much less common in Japan than, say, in
> Sweden or the Netherlands or even in Germany, they consistently told
> me that very few Japanese people use Japanese manual pages because
> even with a limited understanding of English, Japanese programmers
> tend to find out quickly that they are much better served by English
> than by Japanese manual pages, so they tend to improve their English
> reading skills until they understand what they need.
> So while i'm not aggressively trying to *not* support translated manual
> pages, i don't think translated manual pages are particularly relevant
> either.
> > and the "Ü" is displayed correctly,
> Yes, mandoc aims to correctly handle LC_CTYPE=*.UTF-8.  If Unicode
> characters are displayed incorrectly in UTF-8 or HTML output mode,
> i consider that a bug because non-ASCII characters are also crucial
> for the understanding of a limited number of English manual pages.
> Here is the canonical example:
>   https://man.openbsd.org/lgamma.3
> > but the header is not clickable because it
> > doesn't have a toc entry. You can see this in the Archlinux online man
> > pages [1]; as you might know, "Archmanweb" uses Mandoc.
> > https://man.archlinux.org/man/diff.1.de
> Yes, i'm aware that Arch uses Mandoc :-).
> https://mandoc.bsd.lv/ links to https://man.archlinux.org/
> below "Documentation and help",
> and so does https://mandoc.bsd.lv/ports.html,
> saying that Arch uses mandoc for the web since 2021 Jan 14.
> Handling non-ASCII tag content is non-trivial.
> Similar to preconv(1), mandoc(1) needs to translate non-ASCII
> input to roff(7) character escape sequences at input time.
> So when trying to figure out whether the string "Uebersicht"
> should be tagged, we have this situation:
>   Breakpoint 2, post_SH (man=0x6e29d380000, n=0x6e29d351000)
>       at /usr/src/usr.bin/mandoc/man_validate.c:312
>   312     {
>   (gdb) n
>   316             nc = n->child;
>   (gdb)
>   317             switch (n->type) {
>   (gdb)
>   319                     tag = NULL;
>   (gdb)
>   320                     deroff(&tag, n);
>   (gdb)
>   321                     if (tag != NULL) {
>   (gdb) p tag
>   $2 = 0x6e29d35c520 "\\[u00DC]BERSICHT"
> Now, how is the mandoc tagging module supposed to figure out that
> \\[u00DC] is a letter?  Using iswalpha(3) would be awkward at best
> given that mandoc does not use wchar_t at all, and even more so because
> resolving the \\[u00DC] sequence only happens at the output stage, so
> a wchar_t or char32_t form of the character isn't really available to
> the tagging module.  Consequently, the tagging module treats \\[u00DC]
> like any other roff(7) escape sequence, aborting tag processing.
> Since tag generation gets aborted for the first letter of this word,
> no tag is generated at all, which results in no HTML id= attribute
> being generated for the h1 class="Sh" element, and hence no underlining
> and no clickability.
> This points to a deeper problem.
> Section names in manual pages serve a double function.  On the one
> hand, they communicate to the reader which content to expect in the
> section.  That purpose is partly defeated by translating a manual
> page because translations tend to be inconsistent.  Some manual pages
> translate "SYNOPSIS" to "Syntax" in German; this particular one to
> "Uebersicht".  So the effect that the reader immediately recognizes
> the section name is lost, even if (and in particular when) the reader
> is used to reading German manual pages.  (The same applies to the
> main text, of course; translations of technical terms are usually
> inconsistent and often awkward and misleading, making the text hard
> to understand even for native speakers of the target language.)
> On the other hand, section names serve a syntactic function,
> modifying the interpretation of macros and other content in
> the section in question.  That syntactic function is *completely*
> lost by translating the manual page.  For example, mandoc
> implements special handling for the following sections:
>   SEC_SYNOPSIS      (11 special features)
>   SEC_NAME          (7 special features)
>   SEC_LIBRARY       (5 special features)
>   SEC_DESCRIPTION   (2 special features)
>   SEC_RETURN_VALUES (1 special feature)
>   SEC_ERRORS        (2 special features)
>   SEC_SEE_ALSO      (7 special features)
>   SEC_AUTHORS       (8 special features)
> All that special processing is completely lost in a translated manual page.
> > The German keyboard produces the letter "Ü" as a single character
> > named "LATIN CAPITAL LETTER U WITH DIAERESIS", but there's a kind of
> > split "Ü" available: U+0055 LATIN CAPITAL LETTER U followed by U+0308
> > COMBINING DIAERESIS. If I change it in the Groff source, toc creation
> > works fine using this split one.
> Well, not really.  The two-character form of the umlaut is "U\\[u0308]"
> instead of "\\[u00DC]", so the tag will be "U", which isn't really
> useful either.
Yes, but as long as no other section header starts with »U«, it would
at least create one clickable TOC entry. But if another header also
starts with »U« followed by a non-ASCII letter, it wouldn't work anymore.
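
To make the behaviour discussed above concrete, here is a minimal sketch
in C. It is not mandoc's actual code: the function name sketch_tag() and
the rule "copy bytes until the first backslash" are assumptions made only
for illustration, based on Ingo's description that tag processing aborts
at a roff(7) escape sequence.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, NOT taken from mandoc: copy the leading bytes
 * of a deroff'ed heading into buf, stopping at the first backslash,
 * i.e. at the start of the first roff(7) escape sequence.
 * Returns the tag length; 0 stands for "no tag, so no id= attribute".
 */
size_t
sketch_tag(const char *heading, char *buf, size_t bufsz)
{
	size_t len;

	assert(heading != NULL && buf != NULL && bufsz > 0);

	/* Length of the prefix that contains no backslash. */
	len = strcspn(heading, "\\");
	if (len >= bufsz)
		len = bufsz - 1;	/* truncate to fit the buffer */
	memcpy(buf, heading, len);
	buf[len] = '\0';
	return len;
}
```

Under these assumptions, the precomposed form "\\[u00DC]BERSICHT" (as a C
string literal) gives an empty tag, while the decomposed form
"U\\[u0308]BERSICHT" gives the tag "U".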

> > Moreover, in the Vietnamese version of the same man page [2], even
> > [2] https://man.archlinux.org/man/diff.1.vi
> > more toc entries are missing. Obviously because multiple section
> > headers start with "T", followed by diacritics, no toc entry is
> > created for those. But interesting: TÓM TẮT doesn't have an entry, TÊN
> > does have one. I can't imagine how Mandoc distinguishes between
> > acceptable and unacceptable diacritics.
> It doesn't.  If you have conflicting tag locations (in this case
> two definitions of the term "T") and no other information which one
> is more authoritative, the first one wins.  But in less(1), you
> can still navigate both by typing ":t T" to navigate to the first
> definition of the term "T", then hit the key "T" (next tag) to
> navigate to the next definition of the term "T" (a bit awkward
> we ended up with the term "T" in this example, which clashes with
> the less(1) command "T" :).
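
The "first one wins" rule can be sketched in C as follows. This is only an
illustration under stated assumptions: the names sketch_table and
sketch_put() and the fixed-size array are hypothetical and do not come
from mandoc, whose real tag table may be organized quite differently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_MAXTAGS 16	/* arbitrary limit for this sketch */

struct sketch_entry {
	const char	*tag;	/* tag text, for example "T" */
	int		 line;	/* line of the first definition */
};

struct sketch_table {
	struct sketch_entry	 e[SKETCH_MAXTAGS];
	size_t			 n;	/* number of entries in use */
};

/*
 * Record a tag unless it is already defined.
 * Returns the line of the authoritative (first) definition,
 * or -1 if the table is full.
 */
int
sketch_put(struct sketch_table *t, const char *tag, int line)
{
	size_t i;

	assert(t != NULL && tag != NULL);

	/* An earlier definition of the same tag always wins. */
	for (i = 0; i < t->n; i++)
		if (strcmp(t->e[i].tag, tag) == 0)
			return t->e[i].line;

	if (t->n == SKETCH_MAXTAGS)
		return -1;
	t->e[t->n].tag = tag;
	t->e[t->n].line = line;
	t->n++;
	return line;
}
```

With two headings that both reduce to the tag "T", only the first call
stores an entry; the second call returns the line of the first one.
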
> > The described behavior is the same with a pure Mandoc on my local
> > system and with Archmanweb. However, the developers of Debiman
> > obviously found a solution [3], maybe unconsciously …? In any case,
> > [3] https://manpages.debian.org/bullseye/manpages-de/diff.1.de.html
> > their Mandoc is wrapped in a Go-based environment. Besides some extra
> > features Archmanweb doesn't have (for example, better detection of
> > cross-references to other man pages if they are not formatted as
> > such), the toc creation works, even for the Vietnamese version [4].
> > [4] https://manpages.debian.org/unstable/manpages-vi/diff.1.vi.html.gz
> The right-hand sidebar in manpages.debian.org is not coming from
> mandoc.  I suspect that is generated by some Go code.
Yes, a Debiman developer replied here yesterday that Debiman had this
feature earlier than Mandoc, so it's their own implementation.

> > Any idea what is wrong? Well, first I thought the problem is on my
> > machine, but Archmanweb shows the same behavior. As a workaround, I
> > could produce a few more toc entries by replacing the precomposed "Ü"
> > with the decomposed form (U followed by U+0308) and
> > similar, but as long as I don't know what rules Mandoc applies
> > internally, it's almost impossible to fix. To mention, as one of the
> > maintainers of the manpages-l10n project [5], I have to maintain many
> > [5] https://salsa.debian.org/manpages-l10n-team/manpages-l10n
> > languages, not only my own one …
> Wow, you are fluent in many languages?  That's impressive.  :-)
I didn't say I'm fluent in all these languages, but I maintain them
from a technical point of view. I have to deal with addendum files,
right-to-left languages, Cyrillic alphabets … which doesn't mean that I
speak Norwegian or Macedonian or Persian or whatever. As a maintainer, I
have to make sure that the translated man pages are generated from
.po files using Po4a, in a readable shape, that's all.

> > I consider the online collections just as important as the local
> > versions, especially for linking to a specific man page section or
> > subsection in email or web,
> Yes, that is indeed helpful when answering user questions on mailing lists.
> Fortunately, the links that really matter typically work even with
> translated manual pages because the syntax elements that users need
> to learn are usually *not* translated and their English names usually
> contain ASCII characters only, for example
>   https://man.archlinux.org/man/diff.1.vi#u
>   https://man.archlinux.org/man/diff.1.vi#help
Yes, I can still click on TOC entries within the sections that don't
contain any non-ASCII characters.

> > and for searching in man pages which are not installed locally.
> > Any help with solving this problem would be appreciated.
> Right now, i'm not yet sure what to do about tagging of words that
> involve non-ASCII characters.  Maybe we can figure something out,
> but how?  Hmmm...
> Maybe mandoc should treat any \\[uXXXX] sequence as a letter for
> the purposes of tagging?  The code needed for that will look rather
> awkward though, and even when implemented perfectly, the tags will
> be UTF-8 rather than ASCII-encoded.  Would links like
>   https://man.archlinux.org/man/diff.1.de#%C3%9CBERSICHT
> really be all that useful?  What do people think?
This link *would* be useful; at least, it will be accepted (although I
can't really test it due to the still missing TOC entry). Google Chrome
translates it into https://man.archlinux.org/man/diff.1.de#ÜBERSICHT
in the address bar, but when I copy the link, I get the original
link back. I don't know how other browsers would behave in this case.

Your idea of treating \\[uXXXX] sequences as letters seems to be a step
in the right direction. Unfortunately, I can't read the Debiman
developers' Go code well enough to figure out how to implement this in C.
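
As a starting point for discussion, here is a rough sketch in C of what
"treating Unicode escapes as letters" could look like. It is a sketch
under stated assumptions, not a proposed patch: sketch_url_tag(),
sketch_utf8() and hexval() are hypothetical names, the input is assumed
to be a deroff'ed heading in which non-ASCII characters appear as roff(7)
escapes of the form backslash, "[u", four to six hex digits, "]", and
everything that is not an ASCII letter or digit is simply percent-encoded.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int
hexval(int c)
{
	if (c >= '0' && c <= '9') return c - '0';
	if (c >= 'A' && c <= 'F') return c - 'A' + 10;
	if (c >= 'a' && c <= 'f') return c - 'a' + 10;
	return -1;
}

/* Encode code point cp as UTF-8; returns the number of bytes. */
static size_t
sketch_utf8(unsigned long cp, unsigned char out[4])
{
	if (cp < 0x80) {
		out[0] = (unsigned char)cp;
		return 1;
	}
	if (cp < 0x800) {
		out[0] = (unsigned char)(0xC0 | (cp >> 6));
		out[1] = (unsigned char)(0x80 | (cp & 0x3F));
		return 2;
	}
	if (cp < 0x10000) {
		out[0] = (unsigned char)(0xE0 | (cp >> 12));
		out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (unsigned char)(0x80 | (cp & 0x3F));
		return 3;
	}
	out[0] = (unsigned char)(0xF0 | (cp >> 18));
	out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
	out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
	out[3] = (unsigned char)(0x80 | (cp & 0x3F));
	return 4;
}

/*
 * Hypothetical: turn a heading into a percent-encoded URL fragment.
 * Returns 0 on success, -1 on any other escape, a malformed Unicode
 * escape, or when the output buffer is too small.
 */
int
sketch_url_tag(const char *in, char *out, size_t outsz)
{
	unsigned char u8[4];
	unsigned long cp;
	size_t i, n, used = 0;
	int d, ndigits;

	assert(in != NULL && out != NULL && outsz > 0);
	while (*in != '\0') {
		if (*in == '\\') {
			if (strncmp(in, "\\[u", 3) != 0)
				return -1;	/* not a Unicode escape */
			in += 3;
			cp = 0;
			ndigits = 0;
			while (ndigits < 6 &&
			    (d = hexval((unsigned char)*in)) >= 0) {
				cp = cp * 16 + (unsigned long)d;
				ndigits++;
				in++;
			}
			if (ndigits < 4 || *in != ']' || cp > 0x10FFFF ||
			    (cp >= 0xD800 && cp <= 0xDFFF))
				return -1;
			in++;			/* skip the ']' */
			n = sketch_utf8(cp, u8);
		} else {
			u8[0] = (unsigned char)*in++;
			n = 1;
		}
		for (i = 0; i < n; i++) {
			if (u8[i] < 0x80 && isalnum(u8[i])) {
				if (used + 2 > outsz)
					return -1;
				out[used++] = (char)u8[i];
			} else {
				if (used + 4 > outsz)
					return -1;
				snprintf(out + used, 4, "%%%02X",
				    (unsigned int)u8[i]);
				used += 3;
			}
		}
	}
	out[used] = '\0';
	return 0;
}
```

For the German heading, the C string "\\[u00DC]BERSICHT" comes out as
%C3%9CBERSICHT, which matches the fragment in the example link quoted
above. Whether the real tagging module could be extended this way is a
question for Ingo; the sketch only shows that the conversion itself is
small.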

Best Regards,