discuss@mandoc.bsd.lv
* HTML output: section headers with diacritics not in table of contents
@ 2022-03-24 17:13 Mario Blättermann
  2022-03-24 17:33 ` Michael Stapelberg
  2022-03-25 12:27 ` Ingo Schwarze
  0 siblings, 2 replies; 19+ messages in thread
From: Mario Blättermann @ 2022-03-24 17:13 UTC (permalink / raw)
  To: discuss

Hello,

I recently switched from GNU man-db to mandoc. It's really a big
step forward, especially regarding the creation of HTML pages, but it
has its own peculiarities …

To create an HTML man page, I use the following command:

mandoc -T html -O toc ./manpage.1 > manpage.1.html

This works so far for English man pages. For man pages in other
languages, I stumbled upon problems with creating toc entries. For
example, the "SYNOPSIS" is "ÜBERSICHT" in German, and the "Ü" is
displayed correctly, but the header is not clickable because it
doesn't have a toc entry. You can see this in the Archlinux online man
pages [1]; as you might know, "Archmanweb" uses Mandoc.

The German keyboard produces the letter "Ü" as a single character
named "LATIN CAPITAL LETTER U WITH DIAERESIS", but there is also a
decomposed form of "Ü": U+0055 LATIN CAPITAL LETTER U followed by
U+0308 COMBINING DIAERESIS. If I change it in the Groff source, toc
creation works fine with the decomposed form.
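
To make the difference between the two forms concrete, here is a minimal
Python sketch using the standard unicodedata module. It only illustrates
the Unicode side of the matter, not anything mandoc does internally; the
helper name codepoints is made up for this example.

```python
import unicodedata

# Precomposed form: one code point, as typed on a German keyboard.
precomposed = "\u00dc"   # LATIN CAPITAL LETTER U WITH DIAERESIS
# Decomposed form: base letter plus a combining mark.
decomposed = "U\u0308"   # U followed by COMBINING DIAERESIS


def codepoints(text):
    """List the code points of a string in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


# NFD splits the precomposed letter, NFC joins the pair again.
nfd = unicodedata.normalize("NFD", precomposed)
nfc = unicodedata.normalize("NFC", decomposed)

print(codepoints(precomposed))  # ['U+00DC']
print(codepoints(nfd))          # ['U+0055', 'U+0308']
print(nfc == precomposed)       # True
```

Both strings render the same way, but they are different byte sequences
in the source file, so a program that inspects the first character sees
a non-ASCII letter in one case and a plain "U" in the other.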

Moreover, in the Vietnamese version of the same man page [2], even
more toc entries are missing. Apparently, because multiple section
headers start with "T" followed by a letter with diacritics, no toc
entry is created for some of them. Interestingly, TÓM TẮT doesn't have
an entry, but TÊN does. I can't imagine how Mandoc distinguishes
between acceptable and unacceptable diacritics.

The described behavior is the same with a pure Mandoc on my local
system and with Archmanweb. However, the developers of Debiman
obviously found a solution [3], maybe unconsciously …? In any case,
their Mandoc is wrapped in a Go-based environment. Besides some extra
features Archmanweb doesn't have (for example, better detection of
cross-references to other man pages if they are not formatted as
such), the toc creation works, even for the Vietnamese version [4].

Any idea what is wrong? At first I thought the problem was on my
machine, but Archmanweb shows the same behavior. As a workaround, I
could produce a few more toc entries by replacing the precomposed "Ü"
with the decomposed form ("U" followed by U+0308) and similar, but as
long as I don't know what rules Mandoc applies internally, it's almost
impossible to fix. I should mention that, as one of the maintainers of
the manpages-l10n project [5], I have to maintain many languages, not
only my own …

I consider the online collections just as important as the local
versions, especially for linking to a specific man page section or
subsection in email or web, and for searching in man pages which are
not installed locally. Any help with solving this problem would be
appreciated.

[1] https://man.archlinux.org/man/diff.1.de
[2] https://man.archlinux.org/man/diff.1.vi
[3] https://manpages.debian.org/bullseye/manpages-de/diff.1.de.html
[4] https://manpages.debian.org/unstable/manpages-vi/diff.1.vi.html.gz
[5] https://salsa.debian.org/manpages-l10n-team/manpages-l10n

Best Regards,
Mario
--
 To unsubscribe send an email to discuss+unsubscribe@mandoc.bsd.lv



* Re: HTML output: section headers with diacritics not in table of contents
  2022-03-24 17:13 HTML output: section headers with diacritics not in table of contents Mario Blättermann
@ 2022-03-24 17:33 ` Michael Stapelberg
  2022-03-24 18:00   ` Mario Blättermann
  2022-03-25 12:27 ` Ingo Schwarze
  1 sibling, 1 reply; 19+ messages in thread
From: Michael Stapelberg @ 2022-03-24 17:33 UTC (permalink / raw)
  To: discuss

On Thu, 24 Mar 2022 at 18:13, Mario Blättermann <
mario.blaettermann@gmail.com> wrote:

> Hello,
>
> I recently switched from GNU man-db to mandoc. It's really a big
> step forward, especially regarding the creation of HTML pages, but it
> has its own peculiarities …
>
> To create an HTML man page, I use the following command:
>
> mandoc -T html -O toc ./manpage.1 > manpage.1.html
>
> This works so far for English man pages. For man pages in other
> languages, I stumbled upon problems with creating toc entries. For
> example, the "SYNOPSIS" is "ÜBERSICHT" in German, and the "Ü" is
> displayed correctly, but the header is not clickable because it
> doesn't have a toc entry. You can see this in the Archlinux online man
> pages [1]; as you might know, "Archmanweb" uses Mandoc.
>
> The German keyboard produces the letter "Ü" as a single character
> named "LATIN CAPITAL LETTER U WITH DIAERESIS", but there is also a
> decomposed form of "Ü": U+0055 LATIN CAPITAL LETTER U followed by
> U+0308 COMBINING DIAERESIS. If I change it in the Groff source, toc
> creation works fine with the decomposed form.
>
> Moreover, in the Vietnamese version of the same man page [2], even
> more toc entries are missing. Obviously because multiple section
> headers start with "T", followed by diacritics, no toc entry is
> created for those. But interesting: TÓM TẮT doesn't have an entry, TÊN
> does have one. I can't imagine how Mandoc distinguishes between
> acceptable and unacceptable diacritics.
>
> The described behavior is the same with a pure Mandoc on my local
> system and with Archmanweb. However, the developers of Debiman
> obviously found a solution [3], maybe unconsciously …? In any case,
> their Mandoc is wrapped in a Go-based environment. Besides some extra
> features Archmanweb doesn't have (for example, better detection of
> cross-references to other man pages if they are not formatted as
> such), the toc creation works, even for the Vietnamese version [4].
>

Hello, I’m the author of debiman :)
The reason why it uses a different TOC implementation is historical:
debiman introduced a TOC in 2017, whereas mandoc itself only gained -O toc
in 2018.

I’m glad to hear that our code is Unicode-clean in that regard.
Good Unicode/internationalization support was one of the project’s
goals, and is easy to accomplish in Go.


-- 
Best regards,
Michael


* Re: HTML output: section headers with diacritics not in table of contents
  2022-03-24 17:33 ` Michael Stapelberg
@ 2022-03-24 18:00   ` Mario Blättermann
  0 siblings, 0 replies; 19+ messages in thread
From: Mario Blättermann @ 2022-03-24 18:00 UTC (permalink / raw)
  To: discuss

Hello Michael,
thanks for your quick answer.

On Thu, 24 Mar 2022 at 18:34, Michael Stapelberg
<stapelberg@debian.org> wrote:
>
>
>
> On Thu, 24 Mar 2022 at 18:13, Mario Blättermann <mario.blaettermann@gmail.com> wrote:
>>
>> Hello,
>>
>> I recently switched from GNU man-db to mandoc. It's really a big
>> step forward, especially regarding the creation of HTML pages, but it
>> has its own peculiarities …
>>
>> To create an HTML man page, I use the following command:
>>
>> mandoc -T html -O toc ./manpage.1 > manpage.1.html
>>
>> This works so far for English man pages. For man pages in other
>> languages, I stumbled upon problems with creating toc entries. For
>> example, the "SYNOPSIS" is "ÜBERSICHT" in German, and the "Ü" is
>> displayed correctly, but the header is not clickable because it
>> doesn't have a toc entry. You can see this in the Archlinux online man
>> pages [1]; as you might know, "Archmanweb" uses Mandoc.
>>
>> The German keyboard produces the letter "Ü" as a single character
>> named "LATIN CAPITAL LETTER U WITH DIAERESIS", but there is also a
>> decomposed form of "Ü": U+0055 LATIN CAPITAL LETTER U followed by
>> U+0308 COMBINING DIAERESIS. If I change it in the Groff source, toc
>> creation works fine with the decomposed form.
>>
>> Moreover, in the Vietnamese version of the same man page [2], even
>> more toc entries are missing. Obviously because multiple section
>> headers start with "T", followed by diacritics, no toc entry is
>> created for those. But interesting: TÓM TẮT doesn't have an entry, TÊN
>> does have one. I can't imagine how Mandoc distinguishes between
>> acceptable and unacceptable diacritics.
>>
>> The described behavior is the same with a pure Mandoc on my local
>> system and with Archmanweb. However, the developers of Debiman
>> obviously found a solution [3], maybe unconsciously …? In any case,
>> their Mandoc is wrapped in a Go-based environment. Besides some extra
>> features Archmanweb doesn't have (for example, better detection of
>> cross-references to other man pages if they are not formatted as
>> such), the toc creation works, even for the Vietnamese version [4].
>
>
> Hello, I’m the author of debiman :)
> The reason why it uses a different TOC implementation is historical:
> debiman introduced a TOC in 2017, whereas mandoc itself only gained -O toc in 2018.
>
OK, Debiman uses its own TOC implementation, so this needs either to
be fixed in Mandoc itself, which would be the preferred solution, or
reimplemented in Python for Archmanweb. But the latter wouldn't solve
the problem for local users.

BTW, there are some more online man page collections using Mandoc, for
OpenBSD, NetBSD, and FreeBSD. But none of the BSDs seems to have
translated man pages, so I can't test the behavior there.

> I’m glad to hear that our code is unicode clean in that regard.
> Good unicode/internationalization was one of the project’s goals,
> and is easy to accomplish in Go.
>
Yes, of course. But I don't have any programming skills, so I hope
that a Mandoc developer can fix it.

Best Regards,
Mario



* Re: HTML output: section headers with diacritics not in table of contents
  2022-03-24 17:13 HTML output: section headers with diacritics not in table of contents Mario Blättermann
  2022-03-24 17:33 ` Michael Stapelberg
@ 2022-03-25 12:27 ` Ingo Schwarze
  2022-03-25 16:07   ` Mario Blättermann
                     ` (2 more replies)
  1 sibling, 3 replies; 19+ messages in thread
From: Ingo Schwarze @ 2022-03-25 12:27 UTC (permalink / raw)
  To: mario.blaettermann; +Cc: discuss

Hi Mario,

Mario Blättermann wrote on Thu, Mar 24, 2022 at 06:13:23PM +0100:

> I recently switched from GNU man-db to mandoc. It's really a big
> step forward, especially regarding the creation of HTML pages, but it
> has its own peculiarities …
> 
> To create an HTML man page, I use the following command:
> 
> mandoc -T html -O toc ./manpage.1 > manpage.1.html

You should really use the -O style=... and -O man=... options in
addition to the options you are already using.  Without "style",
CSS support is next to absent; no real style sheet is linked to,
and only a minimal style sheet is embedded with <style>, so minimal
that many features cannot work.  Without "man", you get no hyperlinks
from .Xr macros.
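
As a rough illustration, the combined command line could be assembled
like this in Python. The -T html, -O toc, -O style= and -O man= options
are the ones named above; the style sheet name and the link template
passed in below are placeholder values chosen for this sketch, and the
function name is hypothetical.

```python
def mandoc_html_argv(source, style_url, man_template):
    """Build an argument vector for HTML output with TOC, CSS and .Xr links."""
    return [
        "mandoc", "-T", "html",
        "-O", "toc",                    # table of contents, as before
        "-O", f"style={style_url}",     # link a real style sheet
        "-O", f"man={man_template}",    # template for .Xr hyperlinks
        source,
    ]


# Placeholder values; adjust them to the layout of the target web site.
argv = mandoc_html_argv("./manpage.1", "mandoc.css", "../man%S/%N.%S.html")
print(" ".join(argv))
```

The resulting list can be handed to subprocess.run() with stdout
redirected to manpage.1.html; in the template, %N stands for the manual
name and %S for the section.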

> This works so far for English man pages. For man pages in other
> languages, I stumbled upon problems with creating toc entries.

Regarding -O toc, you should be aware that a number of senior OpenBSD
developers hated the feature so much that i disabled it completely
on man.openbsd.org.  They argue the feature is superfluous, noisy,
and the whole idea of prefixing a TOC to a manual page is misguided.

Being disabled by default, the feature matures only very slowly, and
i have to depend on your feedback, and on other people using mandoc
in a similar way as you are using it, to slowly improve the -O toc
feature.

> For example, the "SYNOPSIS" is "ÜBERSICHT" in German,

What a terrible mistranslation...

You know, German happens to be my native language, and in
general English usage, "synopsis" can indeed mean "Zusammenfassung,
Kurzfassung, Uebersicht", but in a manual page, "SYNOPSIS" really means
"short syntax display", so an exact name for the section would be
"Kurzdarstellung der Syntax" or "Syntax-Kurzdarstellung" or some
similar wording; "Uebersicht" with no further qualification is badly
misleading.

Such mistranslations obviously happen not only for reserved words
like "SYNOPSIS", but also in the main text of manual pages.
That's why i hate translated manual pages so much.  Reading German
manual pages, i usually find them pretty unintelligible.

Multiple times, i talked to Japanese software developers and even
though fluency in English is much less common in Japan than, say, in
Sweden or the Netherlands or even in Germany, they consistently told
me that very few Japanese people use Japanese manual pages because
even with a limited understanding of English, Japanese programmers
tend to find out quickly that they are much better served by English
than by Japanese manual pages, so they tend to improve their English
reading skills until they understand what they need.

So while i'm not aggressively trying to *not* support translated manual
pages, i don't think translated manual pages are particularly relevant
either.

> and the "Ü" is displayed correctly,

Yes, mandoc aims to correctly handle LC_CTYPE=*.UTF-8.  If Unicode
characters are displayed incorrectly in UTF-8 or HTML output mode,
i consider that a bug because non-ASCII characters are also crucial
for the understanding of a limited number of English manual pages.
Here is the canonical example:

  https://man.openbsd.org/lgamma.3

> but the header is not clickable because it
> doesn't have a toc entry. You can see this in the Archlinux online man
> pages [1]; as you might know, "Archmanweb" uses Mandoc.
> https://man.archlinux.org/man/diff.1.de

Yes, i'm aware that Arch uses Mandoc :-).
https://mandoc.bsd.lv/ links to https://man.archlinux.org/
below "Documentation and help",
and so does https://mandoc.bsd.lv/ports.html,
saying that Arch uses mandoc for the web since 2021 Jan 14.


Handling non-ASCII tag content is non-trivial.
Similar to preconv(1), mandoc(1) needs to translate non-ASCII
input to roff(7) character escape sequences at input time.
So when trying to figure out whether the string "Uebersicht"
should be tagged, we have this situation:

  Breakpoint 2, post_SH (man=0x6e29d380000, n=0x6e29d351000)
      at /usr/src/usr.bin/mandoc/man_validate.c:312
  312     {
  (gdb) n
  316             nc = n->child;
  (gdb) 
  317             switch (n->type) {
  (gdb) 
  319                     tag = NULL;
  (gdb) 
  320                     deroff(&tag, n);
  (gdb) 
  321                     if (tag != NULL) {
  (gdb) p tag
  $2 = 0x6e29d35c520 "\\[u00DC]BERSICHT"

Now, how is the mandoc tagging module supposed to figure out that
\[u00DC] is a letter?  Using iswalpha(3) would be awkward at best
given that mandoc does not use wchar_t at all, and even more so because
resolving the \[u00DC] sequence only happens at the output stage, so
a wchar_t or char32_t form of the character isn't really available to
the tagging module.  Consequently, the tagging module treats \[u00DC]
like any other roff(7) escape sequence, aborting tag processing.
Since tag generation gets aborted for the first letter of this word,
no tag is generated at all, which results in no HTML id= attribute
being generated for the h1 class="Sh" element, and hence no underlining
and no clickability.
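
The effect described above can be mimicked with a small Python model.
This is a deliberately simplified sketch and not the mandoc C code: the
function names are invented, the input translation only covers the
\[uXXXX] form, and the "stop at the first escape" rule is an assumption
standing in for the real tagging logic.

```python
def to_roff_escapes(text):
    """Replace each non-ASCII character with a \\[uXXXX] escape sequence."""
    return "".join(
        ch if ord(ch) < 128 else f"\\[u{ord(ch):04X}]" for ch in text
    )


def naive_tag(escaped):
    """Return the text before the first escape, or None if nothing is left."""
    head = escaped.split("\\", 1)[0]
    return head or None


escaped = to_roff_escapes("\u00dcBERSICHT")   # precomposed U with diaeresis
print(escaped)                                 # \[u00DC]BERSICHT
print(naive_tag(escaped))                      # None: aborted at first letter
print(naive_tag(to_roff_escapes("SYNOPSIS")))  # SYNOPSIS
print(naive_tag(to_roff_escapes("U\u0308BERSICHT")))  # U (decomposed input)
```

In this model, a header that starts with an escape yields no tag at all,
whereas a header that starts with ASCII letters yields at least a
truncated tag.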

This points to a deeper problem.  

Section names in manual pages serve a double function.  On the one
hand, they communicate to the reader which content to expect in the
section.  That purpose is partly defeated by translating a manual
page because translations tend to be inconsistent.  Some manual pages
translate "SYNOPSIS" to "Syntax" in German; this particular one to
"Uebersicht".  So the effect that the reader immediately recognizes
the section name is lost, even if (and in particular when) the reader
is used to reading German manual pages.  (The same applies to the
main text, of course; translations of technical terms are usually
inconsistent and often awkward and misleading, making the text hard
to understand even for native speakers of the target language.)

On the other hand, section names serve a syntactic function,
modifying the interpretation of macros and other content in
the section in question.  That syntactic function is *completely*
lost by translating the manual page.  For example, mandoc
implements special handling for the following sections:

  SEC_SYNOPSIS      (11 special features)
  SEC_NAME          (7 special features)
  SEC_LIBRARY       (5 special features)
  SEC_DESCRIPTION   (2 special features)
  SEC_RETURN_VALUES (1 special feature)
  SEC_ERRORS        (2 special features)
  SEC_SEE_ALSO      (7 special features)
  SEC_AUTHORS       (8 special features)

All that special processing is completely lost in a translated manual page.
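
A toy lookup in Python shows why the special handling disappears. The
SEC_* names are taken from the list above; the mapping itself, the
fallback label SEC_CUSTOM as used here, and the function name are
assumptions made for this sketch rather than a copy of the mandoc
sources.

```python
# Section titles that get special treatment, keyed by their English name.
KNOWN_SECTIONS = {
    "NAME": "SEC_NAME",
    "LIBRARY": "SEC_LIBRARY",
    "SYNOPSIS": "SEC_SYNOPSIS",
    "DESCRIPTION": "SEC_DESCRIPTION",
    "RETURN VALUES": "SEC_RETURN_VALUES",
    "ERRORS": "SEC_ERRORS",
    "SEE ALSO": "SEC_SEE_ALSO",
    "AUTHORS": "SEC_AUTHORS",
}


def classify_section(title):
    """Map a section title to its special-case label, if it has one."""
    # Anything that is not a known English title is treated as generic.
    return KNOWN_SECTIONS.get(title, "SEC_CUSTOM")


print(classify_section("SYNOPSIS"))         # SEC_SYNOPSIS
print(classify_section("\u00dcBERSICHT"))   # SEC_CUSTOM
```

Because the lookup is keyed on the English title, a translated title
falls through to the generic case and none of the section-specific
features apply.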

> The German keyboard produces the letter "Ü" as a single character
> named "LATIN CAPITAL LETTER U WITH DIAERESIS", but there is also a
> decomposed form of "Ü": U+0055 LATIN CAPITAL LETTER U followed by
> U+0308 COMBINING DIAERESIS. If I change it in the Groff source, toc
> creation works fine with the decomposed form.

Well, not really.  The two-character form of the umlaut is "U\[u0308]"
instead of "\[u00DC]", so the tag will be "U", which isn't really
useful either.

> Moreover, in the Vietnamese version of the same man page [2], even
> [2] https://man.archlinux.org/man/diff.1.vi
> more toc entries are missing. Obviously because multiple section
> headers start with "T", followed by diacritics, no toc entry is
> created for those. But interesting: TÓM TẮT doesn't have an entry, TÊN
> does have one. I can't imagine how Mandoc distinguishes between
> acceptable and unacceptable diacritics.

It doesn't.  If you have conflicting tag locations (in this case
two definitions of the term "T") and no other information which one
is more authoritative, the first one wins.  But in less(1), you
can still navigate both by typing ":t T" to navigate to the first
definition of the term "T", then hit the key "T" (next tag) to
navigate to the next definition of the term "T" (a bit awkward
we ended up with the term "T" in this example, which clashes with
the less(1) command "T" :).
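
The "first one wins" behaviour can be sketched in Python as follows.
This is a simplified model under stated assumptions, not the mandoc
implementation: the helper names are invented, tags are cut at the first
escape sequence, and the header order (TÊN before TÓM TẮT) is assumed
from the usual NAME, SYNOPSIS ordering.

```python
def to_roff_escapes(text):
    """Replace each non-ASCII character with a \\[uXXXX] escape sequence."""
    return "".join(
        ch if ord(ch) < 128 else f"\\[u{ord(ch):04X}]" for ch in text
    )


def naive_tag(escaped):
    """Return the text before the first escape, or None if nothing is left."""
    head = escaped.split("\\", 1)[0]
    return head or None


def build_tag_table(headers):
    """Collect all headers per tag, in document order."""
    table = {}
    for header in headers:
        tag = naive_tag(to_roff_escapes(header))
        if tag is not None:
            table.setdefault(tag, []).append(header)
    return table


# TEN and TOM TAT with their diacritics, written as escapes for clarity.
headers = ["T\u00caN", "T\u00d3M T\u1eaeT"]
table = build_tag_table(headers)

# Both headers collapse to the tag "T"; a link to #T reaches the first.
print(list(table))      # ['T']
print(table["T"][0])    # the first header, TÊN
print(len(table["T"]))  # 2
```

A pager can still step through every entry stored under a tag, while an
HTML fragment identifier can only point at one of them.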

> The described behavior is the same with a pure Mandoc on my local
> system and with Archmanweb. However, the developers of Debiman
> obviously found a solution [3], maybe unconsciously …? In any case,
> [3] https://manpages.debian.org/bullseye/manpages-de/diff.1.de.html
> their Mandoc is wrapped in a Go-based environment. Besides some extra
> features Archmanweb doesn't have (for example, better detection of
> cross-references to other man pages if they are not formatted as
> such), the toc creation works, even for the Vietnamese version [4].
> [4] https://manpages.debian.org/unstable/manpages-vi/diff.1.vi.html.gz

The right-hand sidebar in manpages.debian.org is not coming from
mandoc.  I suspect that is generated by some Go code.

> Any idea what is wrong? Well, first I thought the problem is on my
> machine, but Archmanweb shows the same behavior. As a workaround, I
> could produce a few more toc entries by replacing "Ü" with "Ü" and
> similar, but as long as I don't know what rules Mandoc applies
> internally, it's almost impossible to fix. To mention, as one of the
> maintainers of the manpages-l10n project [5], I have to maintain many
> [5] https://salsa.debian.org/manpages-l10n-team/manpages-l10n
> languages, not only my own one …

Wow, you are fluent in many languages?  That's impressive.  :-)

> I consider the online collections just as important as the local
> versions, especially for linking to a specific man page section or
> subsection in email or web,

Yes, that is indeed helpful when answering user questions on mailing lists.
Fortunately, the links that really matter typically work even with
translated manual pages because the syntax elements that users need
to learn are usually *not* translated and their English names usually
contain ASCII characters only, for example

  https://man.archlinux.org/man/diff.1.vi#u
  https://man.archlinux.org/man/diff.1.vi#help

> and for searching in man pages which are not installed locally.
> Any help with solving this problem would be appreciated.

Right now, i'm not yet sure what to do about tagging of words that
involve non-ASCII characters.  Maybe we can figure something out,
but how?  Hmmm...
Maybe mandoc should treat any \[uXXXX] sequence as a letter for
the purposes of tagging?  The code needed for that will look rather
awkward though, and even when implemented perfectly, the tags will
be UTF-8 rather than ASCII-encoded.  Would links like

  https://man.archlinux.org/man/diff.1.de#%C3%9CBERSICHT

really be all that useful?  What do people think?
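
For reference, this is how such a link fragment comes about. The Python
sketch below resolves \[uXXXX] sequences to characters and then
percent-encodes the UTF-8 bytes with urllib.parse.quote(); it only
models the idea floated above, and the regular expression and function
name are assumptions made for this example.

```python
import re
from urllib.parse import quote

# Matches escapes of the form \[uXXXX] with four to six hex digits.
ESCAPE = re.compile(r"\\\[u([0-9A-Fa-f]{4,6})\]")


def resolve_unicode_escapes(text):
    """Turn each \\[uXXXX] escape into the character it names."""
    return ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


tag = resolve_unicode_escapes("\\[u00DC]BERSICHT")
fragment = quote(tag)  # percent-encodes the UTF-8 bytes of non-ASCII chars

print(tag)       # ÜBERSICHT
print(fragment)  # %C3%9CBERSICHT
```

Browsers usually display such fragments in decoded form, but in plain
text email they appear percent-encoded as shown.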

Yours,
  Ingo

* Re: HTML output: section headers with diacritics not in table of contents
  2022-03-25 12:27 ` Ingo Schwarze
@ 2022-03-25 16:07   ` Mario Blättermann
  2022-03-25 20:58     ` Jan Stary
  2022-03-26 12:34     ` Ingo Schwarze
  2022-03-25 16:21   ` Anthony J. Bentley
  2022-03-25 16:57   ` Mario Blättermann
  2 siblings, 2 replies; 19+ messages in thread
From: Mario Blättermann @ 2022-03-25 16:07 UTC (permalink / raw)
  To: discuss

Hello Ingo,

On Fri, 25 Mar 2022 at 13:27, Ingo Schwarze <schwarze@usta.de> wrote:
>
> Hi Mario,
>
> Mario Blättermann wrote on Thu, Mar 24, 2022 at 06:13:23PM +0100:
>
> > I recently switched from GNU man-db to mandoc. It's really a big
> > step forward, especially regarding the creation of HTML pages, but it
> > has its own peculiarities …
> >
> > To create an HTML man page, I use the following command:
> >
> > mandoc -T html -O toc ./manpage.1 > manpage.1.html
>
> You should really use the -O style=... and -O man=... options in
> addition to the options you are already using.  Without "style",
> CSS support is next to absent; no real style sheet is linked to,
> and only a minimal style sheet is embedded with <style>, so minimal
> that many features cannot work.

As far as I understand, proper TOC creation depends on a CSS file?

> Without "man", you get no hyperlinks
> from .Xr macros.
>
It's not about hyperlinks; these work at least in Archmanweb, and on
my local machine I don't need such links.

> > This works so far for English man pages. For man pages in other
> > languages, I stumbled upon problems with creating toc entries.
>
> Regarding -O toc, you should be aware that a number of senior OpenBSD
> developers hated the feature so much that i disabled it completely
> on man.openbsd.org.  They argue the feature is superfluous, noisy,
> and the whole idea of prefixing a TOC to a manual page is misguided.
>
> Being disabled by default, the feature matures only very slowly, and
> i have to depend on your feedback, and on other people using mandoc
> in a similar way as you are using it, to slowly improve the -O toc
> feature.
>
> > For example, the "SYNOPSIS" is "ÜBERSICHT" in German,
>
> What a terrible mistranslation...
>
> You know, German happens to be my native language, and in
> general English usage, "synopsis" can indeed mean "Zusammenfassung,
> Kurzfassung, Uebersicht", but in a manual page, "SYNOPSIS" really means
> "short syntax display", so an exact name for the section would be
> "Kurzdarstellung der Syntax" or "Syntax-Kurzdarstellung" or some
> similar wording; "Uebersicht" with no further qualification is badly
> misleading.
>
Thank you for your honest opinion. We have been using this translation
for years, and no one has ever complained about it.

> Such mistranslations obviously not only happen for reserved words
> like "SYNOPSIS", but also in the main text of manual pages.
> That's why i hate translated manual pages so much.  Reading German
> manual pages, i usually find them pretty unintelligible.
>
OK, if you hate German man pages anyway, why get upset...?

> Multiple times, i talked to Japanese software developers and even
> though fluency in English is much less common in Japan than, say, in
> Sweden or the Netherlands or even in Germany, they consistently told
> me that very few Japanese people use Japanese manual pages because
> even with a limited understanding of English, Japanese programmers
> tend to find out quickly that they are much better served by English
> than by Japanese manual pages, so they tend to improve their English
> reading skills until they understand what they need.
>
> So while i'm not aggressively trying to *not* support translated manual
> pages, i don't think translated manual pages are particularly relevant
> either.
>
OK, I understand. I don't expect any further effort from your side to
improve TOC creation. Maybe I can discuss this with the Archmanweb
developers.

Best Regards,
Mario


> > and the "Ü" is displayed correctly,
>
> Yes, mandoc aims to correctly handle LC_CTYPE=*.UTF-8.  If Unicode
> characters are displayed incorrectly in UTF-8 or HTML output mode,
> i consider that a bug because non-ASCII characters are also crucial
> for the understanding of a limited number of English manual pages.
> Here is the canonical example:
>
>   https://man.openbsd.org/lgamma.3
>
> > but the header is not clickable because it
> > doesn't have a toc entry. You can see this in the Archlinux online man
> > pages [1]; as you might know, "Archmanweb" uses Mandoc.
> > https://man.archlinux.org/man/diff.1.de
>
> Yes, i'm aware that Arch uses Mandoc :-).
> https://mandoc.bsd.lv/ links to https://man.archlinux.org/
> below "Documentation and help",
> and so does https://mandoc.bsd.lv/ports.html,
> saying that Arch uses mandoc for the web since 2021 Jan 14.
>
>
> Handling non-ASCII tag content is non-trivial.
> Similar to preconv(1), mandoc(1) needs to translate non-ASCII
> input to roff(7) character escape sequences at input time.
> So when trying to figure out whether the string "Uebersicht"
> should be tagged, we have this situation:
>
>   Breakpoint 2, post_SH (man=0x6e29d380000, n=0x6e29d351000)
>       at /usr/src/usr.bin/mandoc/man_validate.c:312
>   312     {
>   (gdb) n
>   316             nc = n->child;
>   (gdb)
>   317             switch (n->type) {
>   (gdb)
>   319                     tag = NULL;
>   (gdb)
>   320                     deroff(&tag, n);
>   (gdb)
>   321                     if (tag != NULL) {
>   (gdb) p tag
>   $2 = 0x6e29d35c520 "\\[u00DC]BERSICHT"
>
> Now, how is the mandoc tagging module supposed to figure out that
> \[u00DC] is a letter?  Using iswalpha(3) would be awkward at best
> given that mandoc does not use wchar_t at all, and even more so because
> resolving the \[u00DC] sequence only happens at the output stage, so
> a wchar_t or char32_t form of the character isn't really available to
> the tagging module.  Consequently, the tagging module treats \[u00DC]
> like any other roff(7) escape sequence, aborting tag processing.
> Since tag generation gets aborted for the first letter of this word,
> no tag is generated at all, which results in no HTML id= attribute
> being generated for the h1 class="Sh" element, and hence no underlining
> and no clickability.
>
> This points to a deeper problem.
>
> Section names in manual pages serve a double function.  On the one
> hand, they communicate to the reader which content to expect in the
> section.  That purpose is partly defeated by translating a manual
> page because translations tend to be inconsistent.  Some manual pages
> translate "SYNOPSIS" to "Syntax" in German; this particular one to
> "Uebersicht".  So the effect that the reader immediately recognizes
> the section name is lost, even if (and in particular when) the reader
> is used to reading German manual pages.  (The same applies to the
> main text, of course; translations of technical terms are usually
> inconsistent and often awkward and misleading, making the text hard
> to understand even for native speakers of the target language.)
>
> On the other hand, section names serve a syntactic function,
> modifying the interpretation of macros and other content in
> the section in question.  That syntactic function is *completely*
> lost by translating the manual page.  For example, mandoc
> implements special handling for the following sections:
>
>   SEC_SYNOPSIS      (11 special features)
>   SEC_NAME          (7 special features)
>   SEC_LIBRARY       (5 special features)
>   SEC_DESCRIPTION   (2 special features)
>   SEC_RETURN_VALUES (1 special feature)
>   SEC_ERRORS        (2 special features)
>   SEC_SEE_ALSO      (7 special features)
>   SEC_AUTHORS       (8 special features)
>
> All that special processing is completely lost in a translated manual page.
>
> > The German keyboard produces the letter "Ü" as a single character
> > named "LATIN CAPITAL LETTER U WITH DIAERESIS", but there is also a
> > decomposed form of "Ü": U+0055 LATIN CAPITAL LETTER U followed by
> > U+0308 COMBINING DIAERESIS. If I change it in the Groff source, toc
> > creation works fine with the decomposed form.
>
> Well, not really.  The two-character form of the umlaut is "U\[u0308]"
> instead of "\[u00DC]", so the tag will be "U", which isn't really
> useful either.
>
> > Moreover, in the Vietnamese version of the same man page [2], even
> > [2] https://man.archlinux.org/man/diff.1.vi
> > more toc entries are missing. Obviously because multiple section
> > headers start with "T", followed by diacritics, no toc entry is
> > created for those. But interesting: TÓM TẮT doesn't have an entry, TÊN
> > does have one. I can't imagine how Mandoc distinguishes between
> > acceptable and unacceptable diacritics.
>
> It doesn't.  If you have conflicting tag locations (in this case
> two definitions of the term "T") and no other information which one
> is more authoritative, the first one wins.  But in less(1), you
> can still navigate both by typing ":t T" to navigate to the first
> definition of the term "T", then hit the key "T" (next tag) to
> navigate to the next definition of the term "T" (a bit awkward
> we ended up with the term "T" in this example, which clashes with
> the less(1) command "T" :).
>
> > The described behavior is the same with a pure Mandoc on my local
> > system and with Archmanweb. However, the developers of Debiman
> > obviously found a solution [3], maybe unconsciously …? In any case,
> > [3] https://manpages.debian.org/bullseye/manpages-de/diff.1.de.html
> > their Mandoc is wrapped in a Go-based environment. Besides some extra
> > features Archmanweb doesn't have (for example, better detection of
> > cross-references to other man pages if they are not formatted as
> > such), the toc creation works, even for the Vietnamese version [4].
> > [4] https://manpages.debian.org/unstable/manpages-vi/diff.1.vi.html.gz
>
> The right-hand sidebar in manpages.debian.org is not coming from
> mandoc.  I suspect that is generated by some Go code.
>
> > Any idea what is wrong? Well, first I thought the problem is on my
> > machine, but Archmanweb shows the same behavior. As a workaround, I
> > could produce a few more toc entries by replacing "Ü" with "Ü" and
> > similar, but as long as I don't know what rules Mandoc applies
> > internally, it's almost impossible to fix. To mention, as one of the
> > maintainers of the manpages-l10n project [5], I have to maintain many
> > [5] https://salsa.debian.org/manpages-l10n-team/manpages-l10n
> > languages, not only my own one …
>
> Wow, you are fluent in many languages?  That's impressive.  :-)
>
> > I consider the online collections just as important as the local
> > versions, especially for linking to a specific man page section or
> > subsection in email or web,
>
> Yes, that is indeed helpful when answering user questions on mailing lists.
> Fortunately, the links that really matter typically work even with
> translated manual pages because the syntax elements that users need
> to learn are usually *not* translated and their English names usually
> contain ASCII characters only, for example
>
>   https://man.archlinux.org/man/diff.1.vi#u
>   https://man.archlinux.org/man/diff.1.vi#help
>
> > and for searching in man pages which are not installed locally.
> > Any help with solving this problem would be appreciated.
>
> Right now, i'm not yet sure what to do about tagging of words that
> involve non-ASCII characters.  Maybe we can figure something out,
> but how?  Hmmm...
> Maybe mandoc should treat any \\[uXXXX] sequence as a letter for
> the purposes of tagging?  The code needed for that will look rather
> awkward though, and even when implemented perfectly, the tags will
> be UTF-8 rather than ASCII-encoded.  Would links like
>
>   https://man.archlinux.org/man/diff.1.de#%C3%9CBERSICHT
>
> really be all that useful?  What do people think?
>
> Yours,
>   Ingo

* Re: HTML output: section headers with diacritics not in table of contents
  2022-03-25 12:27 ` Ingo Schwarze
  2022-03-25 16:07   ` Mario Blättermann
@ 2022-03-25 16:21   ` Anthony J. Bentley
  2022-03-25 21:15     ` Jan Stary
  2022-03-26 10:33     ` Ingo Schwarze
  2022-03-25 16:57   ` Mario Blättermann
  2 siblings, 2 replies; 19+ messages in thread
From: Anthony J. Bentley @ 2022-03-25 16:21 UTC (permalink / raw)
  To: discuss

Hi Ingo,

Ingo Schwarze writes:
> Maybe mandoc should treat any \\[uXXXX] sequence as a letter for
> the purposes of tagging?  The code needed for that will look rather
> awkward though, and even when implemented perfectly, the tags will
> be UTF-8 rather than ASCII-encoded.  Would links like
>
>   https://man.archlinux.org/man/diff.1.de#%C3%9CBERSICHT
>
> really be all that useful?  What do people think?

There would be no need for Mandoc to percent-encode UTF-8 here.
In HTML5, a URL fragment (that is, the portion after the '#') may
contain unescaped "URL code points," which are:

   "ASCII alphanumeric, U+0021 (!), U+0024 ($), U+0026 (&), U+0027 ('),
    U+0028 LEFT PARENTHESIS, U+0029 RIGHT PARENTHESIS, U+002A (*),
    U+002B (+), U+002C (,), U+002D (-), U+002E (.), U+002F (/), U+003A
    (:), U+003B (;), U+003D (=), U+003F (?), U+0040 (@), U+005F (_),
    U+007E (~), and code points in the range U+00A0 to U+10FFFD,
    inclusive, excluding surrogates and noncharacters."

-- 
Anthony J. Bentley

* Re: HTML output: section headers with diacritics not in table of contents
  2022-03-25 12:27 ` Ingo Schwarze
  2022-03-25 16:07   ` Mario Blättermann
  2022-03-25 16:21   ` Anthony J. Bentley
@ 2022-03-25 16:57   ` Mario Blättermann
  2022-03-25 20:36     ` Jan Stary
  2 siblings, 1 reply; 19+ messages in thread
From: Mario Blättermann @ 2022-03-25 16:57 UTC (permalink / raw)
  To: Ingo Schwarze; +Cc: discuss

Hello Ingo,

Sorry, I hadn't seen the rest of your message.

On Fri, 25 Mar 2022 at 13:27, Ingo Schwarze <schwarze@usta.de> wrote:
>
> Hi Mario,
>
> Mario Blättermann wrote on Thu, Mar 24, 2022 at 06:13:23PM +0100:
>
> > recently I'm switched from GNU man-db to mandoc. It's really a big
> > step ahead, especially regarding the creation of HTML pages, but it
> > has its own peculiarities …
> >
> > For creating a HTML man page I use the following command:
> >
> > mandoc -T html -O toc ./manpage.1 > manpage.1.html
>
> You should really use the -O style=... and -O man=... options in
> addition to the options you are already using.  Without "style",
> CSS support is next to absent; no real style sheet is linked to,
> and only a minimal style sheet is embedded with <style>, so minimal
> that many features cannot work.  Without "man", you get no hyperlinks
> from .Xr macros.
>
> > This works so far for English man pages. For man pages in other
> > languages, I stumbled upon problems with creating toc entries.
>
> Regarding -O toc, you should be aware that a number of senior OpenBSD
> developers hated the feature so much that i disabled it completely
> on man.openbsd.org.  They argue the feature is superfluous, noisy,
> and the whole idea of prefixing a TOC to a manual page is misguided.
>
To mention, the TOC works at man.openbsd.org, for example:
http://man.openbsd.org/mandoc#Syntax_tree_output

> Being disabled by default, the feature matures only very slowly, and
> i have to depend on your feedback, and on other people using mandoc
> in a similar way as you are using it, to slowly improve the -O toc
> feature.
>
> > For example, the "SYNOPSIS" is "ÜBERSICHT" in German,
>
> What a terrible mistranslation...
>
> You know, German happens to be my native language, and in
> general English usage, "synopsis" can indeed mean "Zusammenfassung,
> Kurzfassung, Uebersicht", but in a manual page, "SYNOPSIS" really means
> "short syntax display", so an exact name for the section would be
> "Kurzdarstellung der Syntax" or "Syntax-Kurzdarstellung" or some
> similar wording; "Uebersicht" with no further qualification is badly
> misleading.
>
> Such mistranslations obviously not only happen for reserved words
> like "SYNOPSIS", but also in the main text of manual pages.
> That's why i hate translated manual pages so much.  Reading German
> manual pages, i usually find them pretty unintelligible.
>
> Multiple times, i talked to Japanese software developers and even
> though fluency in English is much less common in Japan than, say, in
> Sweden or the Netherlands or even in Germany, they consistently told
> me that very few Japanese people use Japanese manual pages because
> even with a limited understanding of English, Japanese programmers
> tend to find out quickly that they are much better served by English
> than by Japanese manual pages, so they tend to improve their English
> reading skills until they understand what they need.
>
> So while i'm not aggressively trying to *not* support translated manual
> pages, i don't think translated manual pages are particularly relevant
> either.
>
> > and the "Ü" is displayed correctly,
>
> Yes, mandoc aims to correctly handle LC_CTYPE=*.UTF-8.  If Unicode
> characters are displayed incorrectly in UTF-8 or HTML output mode,
> i consider that a bug because non-ASCII characters are also crucial
> for the understanding of a limited number of English manual pages.
> Here is the canonical example:
>
>   https://man.openbsd.org/lgamma.3
>
> > but the header is not clickable because it
> > doesn't have a toc entry. You can see this in the Archlinux online man
> > pages [1]; as you might know, "Archmanweb" uses Mandoc.
> > https://man.archlinux.org/man/diff.1.de
>
> Yes, i'm aware that Arch uses Mandoc :-).
> https://mandoc.bsd.lv/ links to https://man.archlinux.org/
> below "Documentation and help",
> and so does https://mandoc.bsd.lv/ports.html,
> saying that Arch uses mandoc for the web since 2021 Jan 14.
>
>
> Handling non-ASCII tag content is non-trivial.
> Similar to preconv(1), mandoc(1) needs to translate non-ASCII
> input to roff(7) character escape sequences at input time.
> So when trying to figure whether the string "Uebersicht"
> should be tagged, we have this situation:
>
>   Breakpoint 2, post_SH (man=0x6e29d380000, n=0x6e29d351000)
>       at /usr/src/usr.bin/mandoc/man_validate.c:312
>   312     {
>   (gdb) n
>   316             nc = n->child;
>   (gdb)
>   317             switch (n->type) {
>   (gdb)
>   319                     tag = NULL;
>   (gdb)
>   320                     deroff(&tag, n);
>   (gdb)
>   321                     if (tag != NULL) {
>   (gdb) p tag
>   $2 = 0x6e29d35c520 "\\[u00DC]BERSICHT"
>
> Now, how is the mandoc tagging module supposed to figure out that
> \\[u00DC] is a letter?  Using iswalpha(3) would be awkward at best
> given that mandoc does not use wchar_t at all, and even more so because
> resolving the \\[u00DC] sequence only happens at the output stage, so
> a wchar_t or char32_t form of the character isn't really available to
> the tagging module.  Consequently, the tagging module treats \\[u00DC]
> like any other roff(7) escape sequence, aborting tag processing.
> Since tag generation gets aborted for the first letter of this word,
> no tag is generated at all, which results in no HTML id= attribute
> being generated for the h1 class="Sh" element, and hence no underlining
> and no clickability.
>
> This points to a deeper problem.
>
> Section names in manual pages serve a double function.  On the one
> hand, they communicate to the reader which content to expect in the
> section.  That purpose is partly defeated by translating a manual
> page because translations tend to be inconsistent.  Some manual pages
> translate "SYNOPSIS" to "Syntax" in German; this particular one to
> "Uebersicht".  So the effect that the reader immediately recognizes
> the section name is lost, even if (and in particular when) the reader
> is used to reading German manual pages.  (The same applies to the
> main text, of course; translations of technical terms are usually
> inconsistent and often awkward and misleading, making the text hard
> to understand even for native speakers of the target language.)
>
> On the other hand, section names serve a syntactic function,
> modifying the interpretation of macros and other content in
> the section in question.  That syntactic function is *completely*
> lost by translating the manual page.  For example, mandoc
> implements special handling for the following sections:
>
>   SEC_SYNOPSIS      (11 special features)
>   SEC_NAME          (7 special features)
>   SEC_LIBRARY       (5 special features)
>   SEC_DESCRIPTION   (2 special features)
>   SEC_RETURN_VALUES (1 special feature)
>   SEC_ERRORS        (2 special features)
>   SEC_SEE_ALSO      (7 special features)
>   SEC_AUTHORS       (8 special features)
>
> All that special processing is completely lost in a translated manual page.
>
> > The German keyboard produces the letter "Ü" as a single character
> > named "LATIN CAPITAL LETTER U WITH DIAERESIS", but there's a kind of
> > splitted "Ü" available: U U+0055 LATIN CAPITAL LETTER U ‎̈ U+0308
> > COMBINING DIAERESIS. If I change it in the Groff source, toc creation
> > works fine using this splitted one.
>
> Well, not really.  The two-character form of the umlaut is "U\\[u0308]"
> instead of "\\[u00DC]", so the tag will be "U", which isn't really
> useful either.
>
Yes, but as long as no other section header starts with »U«, it would
at least create one clickable TOC entry. But if another »U« exists
and the second letter is non-ASCII, it wouldn't work anymore.
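
To make the case above concrete, here is a minimal sketch in C,
assuming the tag is simply the leading run of characters up to the
first roff(7) escape sequence.  The function name sketch_tag() is
hypothetical and this is not mandoc's actual code; it only illustrates
why "\\[u00DC]BERSICHT" yields no tag while "U\\[u0308]BERSICHT"
yields "U".

```c
#include <stddef.h>

/*
 * Hypothetical helper, not mandoc's actual code: copy the leading
 * run of characters from a section title into buf, stopping at the
 * first roff(7) escape sequence (a backslash) or when buf is full.
 * Returns the length of the resulting tag; 0 means "no tag", and
 * hence no id= attribute and no clickable header.
 */
size_t
sketch_tag(const char *title, char *buf, size_t bufsz)
{
	size_t	 len = 0;

	if (bufsz == 0)
		return 0;
	while (title[len] != '\0' && title[len] != '\\' &&
	    len + 1 < bufsz) {
		buf[len] = title[len];
		len++;
	}
	buf[len] = '\0';
	return len;
}
```

With this sketch, both "T\\[u00CA]N" and "T\\[u00D3]M T\\[u1EAE]T"
reduce to the tag "T", which matches the conflict described for the
Vietnamese page.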

> > Moreover, in the Vietnamese version of the same man page [2], even
> > [2] https://man.archlinux.org/man/diff.1.vi
> > more toc entries are missing. Obviously because multiple section
> > headers start with "T", followed by diacritics, no toc entry is
> > created for those. But interesting: TÓM TẮT doesn't have an entry, TÊN
> > does have one. I can't imagine how Mandoc distinguishes between
> > acceptable and unacceptable diacritics.
>
> It doesn't.  If you have conflicting tag locations (in this case
> two definitions of the term "T") and no other information which one
> is more authoritative, the first one wins.  But in less(1), you
> can still navigate both by typing ":t T" to navigate to the first
> definition of the term "T", then hit the key "T" (next tag) to
> navigate to the next definition of the term "T" (a bit awkward
> we ended up with the term "T" in this example, which clashes with
> the less(1) command "T" :).
>
> > The described behavior is the same with a pure Mandoc on my local
> > system and with Archmanweb. However, the developers of Debiman
> > obviously found a solution [3], maybe unconsciously …? In any case,
> > [3] https://manpages.debian.org/bullseye/manpages-de/diff.1.de.html
> > their Mandoc is wrapped in a Go-based environment. Besides some extra
> > features Archmanweb doesn't have (for example, better detection of
> > cross-references to other man pages if they are not formatted as
> > such), the toc creation works, even for the Vietnamese version [4].
> > [4] https://manpages.debian.org/unstable/manpages-vi/diff.1.vi.html.gz
>
> The right-hand sidebar in manpages.debian.org is not coming from
> mandoc.  I suspect that is generated by some Go code.
>
Yes, yesterday a Debiman developer answered here that Debiman had this
feature earlier than Mandoc, so it's their own implementation.

> > Any idea what is wrong? Well, first I thought the problem is on my
> > machine, but Archmanweb shows the same behavior. As a workaround, I
> > could produce a few more toc entries by replacing "Ü" with "Ü" and
> > similar, but as long as I don't know what rules Mandoc applies
> > internally, it's almost impossible to fix. To mention, as one of the
> > maintainers of the manpages-l10n project [5], I have to maintain many
> > [5] https://salsa.debian.org/manpages-l10n-team/manpages-l10n
> > languages, not only my own one …
>
> Wow, you are fluent in many languages?  That's impressive.  :-)
>
I didn't say I'm fluent in all these languages, but I maintain them
from a technical point of view. I have to bother with addendum files,
right-to-left languages, Cyrillic alphabets … which doesn't mean that I
speak Norwegian or Macedonian or Persian, whatever. As a maintainer, I
have to make sure that the translated man pages will be generated from
.po files using Po4a, in a readable shape, that's all.

> > I consider the online collections just as important as the local
> > versions, especially for linking to a specific man page section or
> > subsection in email or web,
>
> Yes, that is indeed helpful when answering user questions on mailing lists.
> Fortunately, the links that really matter typically work even with
> translated manual pages because the syntax elements that users need
> to learn are usually *not* translated and their English names usually
> contain ASCII characters only, for example
>
>   https://man.archlinux.org/man/diff.1.vi#u
>   https://man.archlinux.org/man/diff.1.vi#help
>
Yes, I still can click on TOC entries within the sections, which don't
contain any non-ASCII.

> > and for searching in man pages which are not installed locally.
> > Any help with solving this problem would be appreciated.
>
> Right now, i'm not yet sure what to do about tagging of words that
> involve non-ASCII characters.  Maybe we can figure something out,
> but how?  Hmmm...
> Maybe mandoc should treat any \\[uXXXX] sequence as a letter for
> the purposes of tagging?  The code needed for that will look rather
> awkward though, and even when implemented perfectly, the tags will
> be UTF-8 rather than ASCII-encoded.  Would links like
>
>   https://man.archlinux.org/man/diff.1.de#%C3%9CBERSICHT
>
> really be all that useful?  What do people think?
>
This link *would* be useful; at least, it will be accepted (although I
can't really test it due to the still missing TOC entry). Google Chrome
translates it into https://man.archlinux.org/man/diff.1.de#ÜBERSICHT
in the address bar, but when I copy the link, then I get the original
link back. Don't know how other browsers would behave in this case.
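
For reference, the two forms of the link differ only in how the UTF-8
bytes of "Ü" are written.  A minimal sketch in C, assuming UTF-8 input;
the helper name sketch_pct_encode() is hypothetical and has nothing to
do with what mandoc or any browser actually runs:

```c
#include <stddef.h>

/*
 * Hypothetical helper for illustration only: percent-encode every
 * non-ASCII byte of a UTF-8 string.  ASCII bytes are copied unchanged;
 * a complete URL encoder would also need to escape some ASCII
 * characters.  Returns the output length, or 0 if out is too small
 * (or the input is empty).
 */
size_t
sketch_pct_encode(const char *in, char *out, size_t outsz)
{
	static const char	 hex[] = "0123456789ABCDEF";
	const unsigned char	*p;
	size_t			 o = 0;

	for (p = (const unsigned char *)in; *p != '\0'; p++) {
		if (*p < 0x80) {
			/* One byte plus room for the final NUL. */
			if (o + 1 >= outsz)
				return 0;
			out[o++] = (char)*p;
		} else {
			/* Three bytes plus room for the final NUL. */
			if (o + 3 >= outsz)
				return 0;
			out[o++] = '%';
			out[o++] = hex[*p >> 4];
			out[o++] = hex[*p & 0x0F];
		}
	}
	if (o >= outsz)
		return 0;
	out[o] = '\0';
	return o;
}
```

Feeding it the UTF-8 string "ÜBERSICHT" gives "%C3%9CBERSICHT", the
fragment used in the link above.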

Your idea with treating \\[uXXXX] sequences seems to be a step in the
right direction. I also can't read Go code from the Debiman developers
to figure out how to implement this in C.
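
Purely as an illustration of that idea, here is a rough sketch in C of
how a \\[uXXXX] sequence could be recognized and turned into UTF-8.
Both function names are hypothetical, and nothing here is taken from
mandoc or Debiman; it assumes 4 to 6 hexadecimal digits between "\\[u"
and "]".

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of mandoc: if s starts with an escape
 * of the form \[uXXXX] (4 to 6 hexadecimal digits), store the code
 * point in *cp and return the length of the escape; otherwise 0.
 */
size_t
sketch_parse_uesc(const char *s, unsigned long *cp)
{
	unsigned long	 v = 0;
	size_t		 i;
	int		 c;

	if (strncmp(s, "\\[u", 3) != 0)
		return 0;
	for (i = 3; isxdigit((unsigned char)s[i]); i++) {
		if (i - 3 >= 6)
			return 0;
		c = toupper((unsigned char)s[i]);
		v = v * 16 +
		    (unsigned long)(c <= '9' ? c - '0' : c - 'A' + 10);
	}
	/* Reject short escapes, a missing ']', and invalid values. */
	if (i - 3 < 4 || s[i] != ']' || v > 0x10FFFF ||
	    (v >= 0xD800 && v <= 0xDFFF))
		return 0;
	*cp = v;
	return i + 1;
}

/*
 * Encode a code point (assumed valid, at most 0x10FFFF) as UTF-8.
 * out must have room for 5 bytes; returns the number of bytes
 * written, not counting the terminating NUL.
 */
size_t
sketch_utf8_encode(unsigned long cp, char *out)
{
	size_t	 n;

	if (cp < 0x80) {
		out[0] = (char)cp;
		n = 1;
	} else if (cp < 0x800) {
		out[0] = (char)(0xC0 | (cp >> 6));
		out[1] = (char)(0x80 | (cp & 0x3F));
		n = 2;
	} else if (cp < 0x10000) {
		out[0] = (char)(0xE0 | (cp >> 12));
		out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (char)(0x80 | (cp & 0x3F));
		n = 3;
	} else {
		out[0] = (char)(0xF0 | (cp >> 18));
		out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
		out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
		out[3] = (char)(0x80 | (cp & 0x3F));
		n = 4;
	}
	out[n] = '\0';
	return n;
}
```

For the C string "\\[u00DC]BERSICHT", the first function reports an
escape of length 8 with code point 0xDC, and the second turns that
into the two bytes C3 9C.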

Best Regards,
Mario

* Re: HTML output: section headers with diacritics not in table of contents
  2022-03-25 16:57   ` Mario Blättermann
@ 2022-03-25 20:36     ` Jan Stary
  2022-03-25 20:59       ` Mario Blättermann
  0 siblings, 1 reply; 19+ messages in thread
From: Jan Stary @ 2022-03-25 20:36 UTC (permalink / raw)
  To: discuss

On Mar 25 17:57:21, mario.blaettermann@gmail.com wrote:
> > Regarding -O toc, you should be aware that a number of senior OpenBSD
> > developers hated the feature so much that i disabled it completely
> > on man.openbsd.org.  They argue the feature is superfluous, noisy,
> > and the whole idea of prefixing a TOC to a manual page is misguided.
> >
> To mention, the TOC works at man.openbsd.org, for example:
> http://man.openbsd.org/mandoc#Syntax_tree_output

I don't see a TOC on that page.


* Re: HTML output: section headers with diacritics not in table of contents
  2022-03-25 16:07   ` Mario Blättermann
@ 2022-03-25 20:58     ` Jan Stary
  2022-03-26 12:34     ` Ingo Schwarze
  1 sibling, 0 replies; 19+ messages in thread
From: Jan Stary @ 2022-03-25 20:58 UTC (permalink / raw)
  To: discuss

[OT, but anyway]

On Mar 25 17:07:13, mario.blaettermann@gmail.com wrote:
> > > For example, the "SYNOPSIS" is "ÜBERSICHT" in German,
> >
> > What a terrible mistranslation...
> >
> > You know, German happens to be my native language, and in
> > general English usage, "synopsis" can indeed mean "Zusammenfassung,
> > Kurzfassung, Uebersicht", but in a manual page, "SYNOPSIS" really means
> > "short syntax display", so an exact name for the section would be
> > "Kurzdarstellung der Syntax" or "Syntax-Kurzdarstellung" or some
> > similar wording; "Uebersicht" with no further qualification is badly
> > misleading.
> >
> Thank you for your honest opinion. We use this translation already
> for years, and no one ever complained about it.

Exactly: nobody cares about (mis)translated manpages.

> > Such mistranslations obviously not only happen for reserved words
> > like "SYNOPSIS", but also in the main text of manual pages.
> > That's why i hate translated manual pages so much.  Reading German
> > manual pages, i usually find them pretty unintelligible.
> >
> OK, if you hate German man pages anyway, why get upset...?

Without speaking for Ingo, I don't think it's specific to German
in any way. For a random example, see Debian's Czech version of ln(1):

https://salsa.debian.org/manpages-l10n-team/manpages-l10n/-/blob/master/po/cs/man1/ln.1.po

	msgid "ln - make links between files"
	msgstr "ln - vytváří odkazy na soubory"

"odkaz" is not a "link", and nobody ever says "odkaz", ever,
for either hardlink or symlink. These translations invariably
opt for some home-grown quasi-equivalents; I don't understand
why people bother with these at all.

	Jan

PS: To be honest, there is a difference between a Czech version of a manpage,
    and, say, a Hungarian translation. The Hungarian is irrelevant to me;
    the Czech one, being in my mother tongue, is laughable.

* Re: HTML output: section headers with diacritics not in table of contents
  2022-03-25 20:36     ` Jan Stary
@ 2022-03-25 20:59       ` Mario Blättermann
  2022-03-25 21:20         ` Jan Stary
  0 siblings, 1 reply; 19+ messages in thread
From: Mario Blättermann @ 2022-03-25 20:59 UTC (permalink / raw)
  To: discuss

Hello Jan,

On Fri, 25 Mar 2022 at 21:36, Jan Stary <hans@stare.cz> wrote:
>
> On Mar 25 17:57:21, mario.blaettermann@gmail.com wrote:
> > > Regarding -O toc, you should be aware that a number of senior OpenBSD
> > > developers hated the feature so much that i disabled it completely
> > > on man.openbsd.org.  They argue the feature is superfluous, noisy,
> > > and the whole idea of prefixing a TOC to a manual page is misguided.
> > >
> > To mention, the TOC works at man.openbsd.org, for example:
> > http://man.openbsd.org/mandoc#Syntax_tree_output
>
> I don't see a TOC on that page.
>
What you expected is probably a TOC like [1]. But the TOC generated
by Mandoc isn't a real table of contents in this case; it's
merely a »hidden« TOC, which makes it possible to click on section
headers and options to get hyperlinks to that position. If you can
open the link mentioned above, and »Syntax tree output« is at the top
of the page displayed in the browser, then this kind of TOC works.

[1] https://archlinux.org/pacman/makepkg.8.html

Best Regards,
Mario

* Re: HTML output: section headers with diacritics not in table of contents
  2022-03-25 16:21   ` Anthony J. Bentley
@ 2022-03-25 21:15     ` Jan Stary
  2022-03-26 10:33     ` Ingo Schwarze
  1 sibling, 0 replies; 19+ messages in thread
From: Jan Stary @ 2022-03-25 21:15 UTC (permalink / raw)
  To: discuss

On Mar 25 10:21:48, anthony@anjbe.name wrote:
> Hi Ingo,
> 
> Ingo Schwarze writes:
> > Maybe mandoc should treat any \\[uXXXX] sequence as a letter for
> > the purposes of tagging?  The code needed for that will look rather
> > awkward though, and even when implemented perfectly, the tags will
> > be UTF-8 rather than ASCII-encoded.  Would links like
> >
> >   https://man.archlinux.org/man/diff.1.de#%C3%9CBERSICHT
> >
> > really be all that useful?  What do people think?
> 
> There would be no need for Mandoc to percent-encode UTF-8 here.
> In HTML5, a URL fragment (that is, the portion after the '#') may
> contain unescaped "URL code points," which are:
> 
>    "ASCII alphanumeric, U+0021 (!), U+0024 ($), U+0026 (&), U+0027 ('),
>     U+0028 LEFT PARENTHESIS, U+0029 RIGHT PARENTHESIS, U+002A (*),
>     U+002B (+), U+002C (,), U+002D (-), U+002E (.), U+002F (/), U+003A
>     (:), U+003B (;), U+003D (=), U+003F (?), U+0040 (@), U+005F (_),
>     U+007E (~), and code points in the range U+00A0 to U+10FFFD,
>     inclusive, excluding surrogates and noncharacters."

Ah, right. I took the liberty of tweaking mandoc's html output
of the manpage below to have a #NÄMË instead of the current
<a href="#N">N&#x00C4;M&#x00CB;</a> and it works just fine.
http://stare.cz/.tmp/mt.html#NÄMË

Thanks for the lolz.

	Jan


.Dd Mar 25, 2022
.Dt MT 666
.Os
.Sh NÄMË
.Nm Mötley Crüe
.Nd hëävy mëtäl ümläüt
.Sh SŸNÖPSŸS
.Nm
.Sh DËSCRÏPTÏÖN
.Nm
röcks äs fück.

* Re: HTML output: section headers with diacritics not in table of contents
  2022-03-25 20:59       ` Mario Blättermann
@ 2022-03-25 21:20         ` Jan Stary
  2022-03-26  9:25           ` Ingo Schwarze
  0 siblings, 1 reply; 19+ messages in thread
From: Jan Stary @ 2022-03-25 21:20 UTC (permalink / raw)
  To: discuss

On Mar 25 21:59:26, mario.blaettermann@gmail.com wrote:
> Hello Jan,
> 
> > On Fri, 25 Mar 2022 at 21:36, Jan Stary <hans@stare.cz> wrote:
> >
> > On Mar 25 17:57:21, mario.blaettermann@gmail.com wrote:
> > > > Regarding -O toc, you should be aware that a number of senior OpenBSD
> > > > developers hated the feature so much that i disabled it completely
> > > > on man.openbsd.org.  They argue the feature is superfluous, noisy,
> > > > and the whole idea of prefixing a TOC to a manual page is misguided.
> > > >
> > > To mention, the TOC works at man.openbsd.org, for example:
> > > http://man.openbsd.org/mandoc#Syntax_tree_output
> >
> > I don't see a TOC on that page.
> >
> What you've expected is probably a TOC like [1].

Yes, because that's what mandoc calls a TOC:

  toc If an input file contains at least two non-standard sections,
      print a table of contents near the beginning of the output.

> But the TOC generated
> by Mandoc doesn't mean a real table of contents in this case, it's
> merely a »hidden« TOC, which enables to click on section headers and
> options to generate hyperlinks for that position. If you can open the
> link mentioned above, and »Syntax tree output« is on top of the page
> displayed in the browser, then this kind of TOC works.
> [1] https://archlinux.org/pacman/makepkg.8.html

I think these are 'tags' in the mandoc parlance.
http://man.openbsd.org/mandoc#tag

	Jan


* Re: HTML output: section headers with diacritics not in table of contents
  2022-03-25 21:20         ` Jan Stary
@ 2022-03-26  9:25           ` Ingo Schwarze
  0 siblings, 0 replies; 19+ messages in thread
From: Ingo Schwarze @ 2022-03-26  9:25 UTC (permalink / raw)
  To: mario.blaettermann; +Cc: discuss

Hi Mario,

Jan Stary wrote on Fri, Mar 25, 2022 at 10:20:30PM +0100:
> On Mar 25 21:59:26, mario.blaettermann@gmail.com wrote:
>> On Fri, 25 Mar 2022 at 21:36, Jan Stary <hans@stare.cz> wrote:
>>> On Mar 25 17:57:21, mario.blaettermann@gmail.com wrote:

>>>>> Regarding -O toc, you should be aware that a number of senior OpenBSD
>>>>> developers hated the feature so much that i disabled it completely
>>>>> on man.openbsd.org.  They argue the feature is superfluous, noisy,
>>>>> and the whole idea of prefixing a TOC to a manual page is misguided.

>>>> To mention, the TOC works at man.openbsd.org, for example:
>>>> http://man.openbsd.org/mandoc#Syntax_tree_output

>>> I don't see a TOC on that page.

>> What you've expected is probably a TOC like [1].

> Yes, because that's what mandoc calls a TOC:
> 
>   toc If an input file contains at least two non-standard sections,
>       print a table of contents near the beginning of the output.

Exactly.

To see what a "TABLE OF CONTENTS" (TOC) is, please look at

  https://man.bsd.lv/pf.conf.5

where i just enabled TOCs for demonstration purposes and compare
that page to

  https://man.openbsd.org/pf.conf.5

where TOCs are disabled.

The mandoc(1) page never gets a TOC because it does not contain two
custom sections (in fact, not even one).

>> But the TOC generated
>> by Mandoc doesn't mean a real table of contents in this case, it's
>> merely a »hidden« TOC, which enables to click on section headers and
>> options to generate hyperlinks for that position. If you can open the
>> link mentioned above, and »Syntax tree output« is on top of the page
>> displayed in the browser, then this kind of TOC works.
>> [1] https://archlinux.org/pacman/makepkg.8.html

> I think these are 'tags' in the mandoc parlance.
> http://man.openbsd.org/mandoc#tag

Exactly, and that mandoc(1) terminology is motivated by this
pre-existing terminology:

  https://man.openbsd.org/less.1#T
  https://man.openbsd.org/ctags.1

Yours,
  Ingo

* Re: HTML output: section headers with diacritics not in table of contents
  2022-03-25 16:21   ` Anthony J. Bentley
  2022-03-25 21:15     ` Jan Stary
@ 2022-03-26 10:33     ` Ingo Schwarze
  2022-03-26 17:55       ` Anthony J. Bentley
  1 sibling, 1 reply; 19+ messages in thread
From: Ingo Schwarze @ 2022-03-26 10:33 UTC (permalink / raw)
  To: anthony; +Cc: discuss

Hi Anthony,

Anthony J. Bentley wrote on Fri, Mar 25, 2022 at 10:21:48AM -0600:
> Ingo Schwarze writes:

>> Maybe mandoc should treat any \\[uXXXX] sequence as a letter for
>> the purposes of tagging?  The code needed for that will look rather
>> awkward though, and even when implemented perfectly, the tags will
>> be UTF-8 rather than ASCII-encoded.  Would links like
>>
>>   https://man.archlinux.org/man/diff.1.de#%C3%9CBERSICHT
>>
>> really be all that useful?  What do people think?

> There would be no need for Mandoc to percent-encode UTF-8 here.
> In HTML5, a URL fragment (that is, the portion after the '#') may
> contain unescaped "URL code points," which are:
> 
>    "ASCII alphanumeric, U+0021 (!), U+0024 ($), U+0026 (&), U+0027 ('),
>     U+0028 LEFT PARENTHESIS, U+0029 RIGHT PARENTHESIS, U+002A (*),
>     U+002B (+), U+002C (,), U+002D (-), U+002E (.), U+002F (/), U+003A
>     (:), U+003B (;), U+003D (=), U+003F (?), U+0040 (@), U+005F (_),
>     U+007E (~), and code points in the range U+00A0 to U+10FFFD,
>     inclusive, excluding surrogates and noncharacters."

Thanks, that sounds like a useful hint.

Excluding surrogates is easy, and

  http://www.unicode.org/faq/private_use.html#noncharacters

tells me what "noncharacters" are.  Since those 66 codepoints are
stable, it is feasible to exclude them, too, without needing any
Unicode library.
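
A minimal sketch in C of such a check, assuming the list of "URL code
points" quoted above and the usual definition of noncharacters
(U+FDD0 to U+FDEF plus the last two code points of each of the 17
planes).  All function names are hypothetical; this is not existing
mandoc code.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Surrogates occupy U+D800 to U+DFFF. */
bool
sketch_is_surrogate(uint32_t cp)
{
	return cp >= 0xD800 && cp <= 0xDFFF;
}

/*
 * Noncharacters: U+FDD0..U+FDEF (32 code points) plus U+nFFFE and
 * U+nFFFF for each of the 17 planes (34 code points), 66 in total.
 */
bool
sketch_is_noncharacter(uint32_t cp)
{
	if (cp >= 0xFDD0 && cp <= 0xFDEF)
		return true;
	return cp <= 0x10FFFF && (cp & 0xFFFE) == 0xFFFE;
}

/*
 * Hypothetical check for the "URL code points" quoted above:
 * ASCII alphanumerics, a fixed set of ASCII punctuation, and
 * U+00A0..U+10FFFD without surrogates and noncharacters.
 */
bool
sketch_is_url_code_point(uint32_t cp)
{
	if (cp < 0x80) {
		if ((cp >= '0' && cp <= '9') ||
		    (cp >= 'A' && cp <= 'Z') ||
		    (cp >= 'a' && cp <= 'z'))
			return true;
		/* strchr() would match the terminating NUL for cp == 0. */
		return cp != 0 &&
		    strchr("!$&'()*+,-./:;=?@_~", (int)cp) != NULL;
	}
	return cp >= 0xA0 && cp <= 0x10FFFD &&
	    !sketch_is_surrogate(cp) && !sketch_is_noncharacter(cp);
}
```

Under these assumptions, U+00DC is accepted, while the space
character, '#', U+D800 and U+FFFE are rejected.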

All the same, before starting work on an implementation, i would
also appreciate your opinion, Anthony (and possibly of similarly
prolific users and maintainers in other operating systems) whether
such functionality seems desirable to you, because i feel like
sitting on the fence myself.

Yours,
  Ingo

* Re: HTML output: section headers with diacritics not in table of contents
  2022-03-25 16:07   ` Mario Blättermann
  2022-03-25 20:58     ` Jan Stary
@ 2022-03-26 12:34     ` Ingo Schwarze
  2022-03-26 13:35       ` Mario Blättermann
  1 sibling, 1 reply; 19+ messages in thread
From: Ingo Schwarze @ 2022-03-26 12:34 UTC (permalink / raw)
  To: mario.blaettermann; +Cc: discuss

Hi Mario,

Mario Blättermann wrote on Fri, Mar 25, 2022 at 05:07:13PM +0100:
> On Fri, 25 Mar 2022 at 13:27, Ingo Schwarze <schwarze@usta.de> wrote:
>> Mario Blättermann wrote on Thu, Mar 24, 2022 at 06:13:23PM +0100:

>>> recently I'm switched from GNU man-db to mandoc. It's really a big
>>> step ahead, especially regarding the creation of HTML pages, but it
>>> has its own peculiarities …
>>>
>>> For creating a HTML man page I use the following command:
>>>
>>> mandoc -T html -O toc ./manpage.1 > manpage.1.html

>> You should really use the -O style=... and -O man=... options in
>> addition to the options you are already using.  Without "style",
>> CSS support is next to absent; no real style sheet is linked to,
>> and only a minimal style sheet is embedded with <style>, so minimal
>> that many features cannot work.

> As far as I understand, proper TOC creation depends on a CSS file?

No.  The TOC (in the sense Jan and i explained in earlier messages)
does not need CSS.

Then again, i guess what you meant here probably was "tagging depends
on a CSS file".  That statement would be mostly misleading but arguably
somewhat true to a lesser extent.  To understand what i mean, look at
this HTML code (make sure to disable HTML in your mail user agent if you
have it enabled):

  <h1 class="Sh" id="DESCRIPTION">
    <a class="permalink" href="#DESCRIPTION">DESCRIPTION</a>
  </h1>

Tagging involves four aspects:

 1. The id= attribute of the h1 element shown above.
    The value of that attribute, "DESCRIPTION", is called
    the "tag" in mandoc(1), less(1), and ctags(1) parlance,
    admittedly a bit unfortunately as "h1" is called a tag
    in HTML parlance.  The mandoc/less/ctags tag is generated
    if and only if the mdoc(7) or man(7) parser finds a section
    title that looks sufficiently alphabetic.  It's intentional
    that i use an imprecise wording here because what
    "sufficiently alphabetic" means is the technical detail
    that we are considering changing right now.  This tag
    and id=-attribute is generated even if you do not use a CSS
    file.

 2. The 'a class="permalink"' element.  As long as a tag was
    generated in no. 1 above, that element is also generated no
    matter whether you use a CSS file or not.

 3. Formatting of the h1 element depends on the stylesheet.
    The following CSS properties are absent when you fail to
    use a stylesheet:

	margin-top: 1.2em;
	margin-bottom: 0.6em;
	margin-left: -3.2em;

    Also, no tooltip is shown when you hover your mouse over
    the h1 element unless you use the CSS file.

    Arguably, none of this no. 3 is related to tagging.

 4. Formatting of the "a" element depends on the stylesheet.
    The following CSS properties are absent when you fail to
    use a stylesheet:

	color: inherit;
	font: inherit;
	text-decoration: inherit;
	border-bottom: thin dotted;

    Arguably, this no. 4 is related to tagging because these
    properties determine how the presence of the tag is
    indicated by the rendering.  Then again, without using
    the stylesheet, the presence of the tag is usually also
    indicated in whatever way is the default for the browser,
    for example by a big blue font with a solid underline.
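
Aspects 1 and 2 above can be sketched in C as follows.  This is only
an illustration, not mandoc's actual code: the function name
format_section_heading is hypothetical, the markup for the untagged
case is an assumption, and the sketch assumes that title and tag
need no HTML escaping.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Format a section heading into buf.
 * If tag is NULL (the title did not qualify for a tag), neither the
 * id= attribute nor the permalink element is written.  Otherwise
 * both are written, independently of any stylesheet.
 * Returns what snprintf(3) returns: the length the complete output
 * needs, which exceeds bufsz - 1 if the output was truncated.
 */
int
format_section_heading(char *buf, size_t bufsz,
    const char *title, const char *tag)
{
	assert(buf != NULL && title != NULL);

	if (tag == NULL)
		return snprintf(buf, bufsz,
		    "<h1 class=\"Sh\">%s</h1>\n", title);

	return snprintf(buf, bufsz,
	    "<h1 class=\"Sh\" id=\"%s\">\n"
	    "  <a class=\"permalink\" href=\"#%s\">%s</a>\n"
	    "</h1>\n", tag, tag, title);
}
```

Calling it with title "DESCRIPTION" and tag "DESCRIPTION" yields the
HTML shown earlier; calling it with a NULL tag yields a heading that
is displayed but cannot be linked to.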

>> Without "man", you get no hyperlinks from .Xr macros.

> It's not about hyperlinks, this works at least in Archmanweb, and on
> my local machine I don't need such links

I have no idea how Archmanweb might create hyperlinks for manual
page references unless you pass the -O man= option to mandoc.
Well, Archmanweb might perhaps tinker around with the generated
HTML code after the fact, using some crude heuristics.  I don't
know what Archmanweb does.

>> Such mistranslations obviously not only happen for reserved words
>> like "SYNOPSIS", but also in the main text of manual pages.
>> That's why i hate translated manual pages so much.  Reading German
>> manual pages, i usually find them pretty unintelligible.

> OK, if you hate German man pages anyway, why get upset...?

For several reasons.

First and foremost, i really care about good documentation, so bad
documentation bothers me.

Secondly, once in a while machines maintained by other people
(not my own machines, of course) show me German error messages
and/or German documentation even if i don't ask for that, and
having to do extra configuration work just to get an intelligible
user interface on some random machine feels annoying to me.

Finally, i am interested in questions of languages (formal and
living) in general, even though i'm not a specialist for language
theory (neither for formal nor for living languages).

My former teacher in theoretical physics, Prof. Dahmen (whom i greatly
respected in other matters) always made a point of strongly insisting that
a thesis ought to be written in German because developing professional and
technical terminology in all possible fields is crucial (in his opinion)
to keep a living language alive.  Even though i always found the idea
intriguing, i never managed to make up my mind whether that opinion is
true or false, or rather: to which degree it is reasonable.

But for technical terms in computer science, i fear German already is
a dead language (in Prof. Dahmen's sense) whether we like it or not.
Firmly established translations do exist for many technical terms in
computer science (for example input = Eingabe), even more technical terms
are firmly established as loanwords in German (for example hyperlink =
Hyperlink, patch = Patch), but huge numbers of technical terms do not
have a generally accepted and used translation to German.  In such cases,
people sometimes simply use the English word when talking in German (for
example diff = Diff), which may sometimes indicate that the establishment
of a new loanword is in progress.  In many cases, a translation does not
really exist.  A striking example from the example manual page you picked
is the English technical term "unified diff".  The (admittedly meager)
German Wikipedia page https://de.wikipedia.org/wiki/Diff works around
the gap in the language by using the somewhat lengthy wording "Das
sogenannte vereinheitlichte Format (unified diff)".  This solution feels
completely adequate to me: it is easy to understand by both professionals
and beginners, and the wording is also elegant from the perspective of
the language.

https://manpages.debian.org/bullseye/manpages-de/diff.1.de.html
says, by contrast:

  -u, -U ANZAHL, --unified[=ANZAHL]
    ANZAHL Zeilen (Vorgabe 3) des vereinheitlichten Kontexts ausgeben

That's completely unintelligible for a German native speaker unless they
are also fluent in English *and* already know what the technical meanings
of "unified" *and* of "context" are in this particular context.  The only
way to understand this particular German wording is to translate the word
"vereinheitlicht" back to English and then recognize that "unified" and
"context" here both function as highly specialized technical terms
and *neither* of them is to be interpreted in the everyday sense of the
plain English words "unified" and "context".

I discuss this here in so much detail because i do care about such
matters and because i think such considerations do have some bearing
on the question which functionality matters to which degree in a
formatting program for technical documentation.

You cannot design a program well without considering how it should
and how it should better not be used.

>> So while i'm not aggressively trying to *not* support translated manual
>> pages, i don't think translated manual pages are particularly relevant
>> either.

> OK, I understand. I don't expect any further efforts to get a better
> TOC creation from your side. Maybe I can discuss this with the
> Archmanweb developers.

I fear you misunderstand.  I didn't mean to say, "fuck you, go to hell".
I'd like to apologize if it sounded like that to you.

I regularly consider features for implementation even when i consider
them "not particularly relevant".  If something is not particularly
relevant and causes huge effort or disruption, it is likely to be
rejected.  But if something is easy to do, it might be worthwhile
even if it only provides marginal benefit.

No feature is implemented without carefully scrutinizing the design,
though.

Besides, i may be missing something and it might emerge that the defect
you are talking about causes more trouble than i so far think, and
the feature you are proposing provides more benefit than i so far
recognize.  In another mail, i said "i feel like sitting on the
fence."

And finally, while the questions of how the formatter should handle a
translated manual page and how translations can be improved to actually
become useable are clearly somewhat related, in the following sense, they
are at the same time close to orthogonal: *if* formatters get better at
handling translated manuals, that also helps to make translated manuals
better, no matter how the latter may be achieved in the text itself.
Maybe not all hope is lost for reviving at least some of the most widely
used native languages for this particular technical field, for example
German, Spanish, and Japanese.  As i said, i'm not sure whether that
is desirable, feasible, and if so, how.

Yours,
  Ingo

* Re: HTML output: section headers with diacritics not in table of contents
  2022-03-26 12:34     ` Ingo Schwarze
@ 2022-03-26 13:35       ` Mario Blättermann
  0 siblings, 0 replies; 19+ messages in thread
From: Mario Blättermann @ 2022-03-26 13:35 UTC (permalink / raw)
  To: discuss

Hello Ingo,

Am Sa., 26. März 2022 um 13:34 Uhr schrieb Ingo Schwarze <schwarze@usta.de>:
>
> Hi Mario,
>
> Mario Blättermann wrote on Fri, Mar 25, 2022 at 05:07:13PM +0100:
> > Am Fr., 25. März 2022 um 13:27 Uhr schrieb Ingo Schwarze <schwarze@usta.de>:
> >> Mario Blättermann wrote on Thu, Mar 24, 2022 at 06:13:23PM +0100:
>
> >>> I recently switched from GNU man-db to mandoc. It's really a big
> >>> step ahead, especially regarding the creation of HTML pages, but it
> >>> has its own peculiarities …
> >>>
> >>> For creating an HTML man page I use the following command:
> >>>
> >>> mandoc -T html -O toc ./manpage.1 > manpage.1.html
>
> >> You should really use the -O style=... and -O man=... options in
> >> addition to the options you are already using.  Without "style",
> >> CSS support is next to absent; no real style sheet is linked to,
> >> and only a minimal style sheet is embedded with <style>, so minimal
> >> that many features cannot work.
>
> > As far as I understand, proper TOC creation depends on a CSS file?
>
> No.  The TOC (in the sense Jan and i explained in earlier messages)
> does not need CSS.
>
> Then again, i guess what you meant here probably was "tagging depends
> on a CSS file".  That statement would be mostly misleading but arguably
> somewhat true to a lesser extent.  To understand what i mean, look at
> this HTML code (make sure to disable HTML in your mail user agent if you
> have it enabled):
>
>   <h1 class="Sh" id="DESCRIPTION">
>     <a class="permalink" href="#DESCRIPTION">DESCRIPTION</a>
>   </h1>
>
> Tagging involves four aspects:
>
>  1. The id= attribute of the h1 element shown above.
>     The value of that attribute, "DESCRIPTION", is called
>     the "tag" in mandoc(1), less(1), and ctags(1) parlance,
>     admittedly a bit unfortunately as "h1" is called a tag
>     in HTML parlance.  The mandoc/less/ctags tag is generated
>     if and only if the mdoc(7) or man(7) parser finds a section
>     title that looks sufficiently alphabetic.  It's intentional
>     that i use an imprecise wording here because what
>     "sufficiently alphabetic" means is the technical detail
>     that we are considering changing right now.  This tag
>     and id=-attribute is generated even if you do not use a CSS
>     file.
>
>  2. The 'a class="permalink"' element.  As long as a tag was
>     generated in no. 1 above, that element is also generated no
>     matter whether you use a CSS file or not.
>
>  3. Formatting of the h1 element depends on the stylesheet.
>     The following CSS properties are absent when you fail to
>     use a stylesheet:
>
>         margin-top: 1.2em;
>         margin-bottom: 0.6em;
>         margin-left: -3.2em;
>
>     Also, no tooltip is shown when you hover your mouse over
>     the h1 element unless you use the CSS file.
>
>     Arguably, none of this no. 3 is related to tagging.
>
>  4. Formatting of the "a" element depends on the stylesheet.
>     The following CSS properties are absent when you fail to
>     use a stylesheet:
>
>         color: inherit;
>         font: inherit;
>         text-decoration: inherit;
>         border-bottom: thin dotted;
>
>     Arguably, this no. 4 is related to tagging because these
>     properties determine how the presence of the tag is
>     indicated by the rendering.  Then again, without using
>     the stylesheet, the presence of the tag is usually also
>     indicated in whatever way is the default for the browser,
>     for example by a big blue font with a solid underline.
>
> >> Without "man", you get no hyperlinks from .Xr macros.
>
> > It's not about hyperlinks, this works at least in Archmanweb, and on
> > my local machine I don't need such links
>
> I have no idea how Archmanweb might create hyperlinks for manual
> page references unless you pass the -O man= option to mandoc.
> Well, Archmanweb might perhaps tinker around with the generated
> HTML code after the fact, using some crude heuristics.  I don't
> know what Archmanweb does.
>
> >> Such mistranslations obviously not only happen for reserved words
> >> like "SYNOPSIS", but also in the main text of manual pages.
> >> That's why i hate translated manual pages so much.  Reading German
> >> manual pages, i usually find them pretty unintelligible.
>
> > OK, if you hate German man pages anyway, why get upset...?
>
> For several reasons.
>
> First and foremost, i really care about good documentation, so bad
> documentation bothers me.
>
> Secondly, once in a while machines maintained by other people
> (not my own machines, of course) show me German error messages
> and/or German documentation even if i don't ask for that, and
> having to do extra configuration work just to get an intelligible
> user interface on some random machine feels annoying to me.
>
> Finally, i am interested in questions of languages (formal and
> living) in general, even though i'm not a specialist for language
> theory (neither for formal nor for living languages).
>
> My former teacher in theoretical physics, Prof. Dahmen (whom i greatly
> respected in other matters) always made a point of strongly insisting that
> a thesis ought to be written in German because developing professional and
> technical terminology in all possible fields is crucial (in his opinion)
> to keep a living language alive.  Even though i always found the idea
> intriguing, i never managed to make up my mind whether that opinion is
> true or false, or rather: to which degree it is reasonable.
>
> But for technical terms in computer science, i fear German already is
> a dead language (in Prof. Dahmen's sense) whether we like it or not.
> Firmly established translations do exist for many technical terms in
> computer science (for example input = Eingabe), even more technical terms
> are firmly established as loanwords in German (for example hyperlink =
> Hyperlink, patch = Patch), but huge numbers of technical terms do not
> have a generally accepted and used translation to German.  In such cases,
> people sometimes simply use the English word when talking in German (for
> example diff = Diff), which may sometimes indicate that the establishment
> of a new loanword is in progress.  In many cases, a translation does not
> really exist.  A striking example from the example manual page you picked
> is the English technical term "unified diff".  The (admittedly meager)
> German Wikipedia page https://de.wikipedia.org/wiki/Diff works around
> the gap in the language by using the somewhat lengthy wording "Das
> sogenannte vereinheitlichte Format (unified diff)".  This solution feels
> completely adequate to me: it is easy to understand by both professionals
> and beginners, and the wording is also elegant from the perspective of
> the language.
>
> https://manpages.debian.org/bullseye/manpages-de/diff.1.de.html
> says, by contrast:
>
>   -u, -U ANZAHL, --unified[=ANZAHL]
>     ANZAHL Zeilen (Vorgabe 3) des vereinheitlichten Kontexts ausgeben
>
> That's completely unintelligible for a German native speaker unless they
> are also fluent in English *and* already know what the technical meanings
> of "unified" *and* of "context" are in this particular context.  The only
> way to understand this particular German wording is to translate the word
> "vereinheitlicht" back to English and then recognize that "unified" and
> "context" here both function as highly specialized technical terms
> and *neither* of them is to be interpreted in the everyday sense of the
> plain English words "unified" and "context".
>
This is an example of a German term that translators settled on after
some discussion among themselves, without developers involved. See
below. Also keep in mind that translators often try to translate as
many terms as possible. Very often my reviewers complain about
»Denglish« terms. I remember a discussion some years ago where someone
really asked for »Herunterladung« instead of »Download« …

> I discuss this here in so much detail because i do care about such
> matters and because i think such considerations do have some bearing
> on the question which functionality matters to which degree in a
> formatting program for technical documentation.
>
> You cannot design a program well without considering how it should
> and how it should better not be used.
>
> >> So while i'm not aggressively trying to *not* support translated manual
> >> pages, i don't think translated manual pages are particularly relevant
> >> either.
>
> > OK, I understand. I don't expect any further efforts to get a better
> > TOC creation from your side. Maybe I can discuss this with the
> > Archmanweb developers.
>
> I fear you misunderstand.  I didn't mean to say, "fuck you, go to hell".
> I'd like to apologize if it sounded like that to you.
>
No, the problem was that I hadn't seen the rest of your mail because
it was truncated by the mail web interface. See my follow-up.

> I regularly consider features for implementation even when i consider
> them "not particularly relevant".  If something is not particularly
> relevant and causes huge effort or disruption, it is likely to be
> rejected.  But if something is easy to do, it might be worthwhile
> even if it only provides marginal benefit.
>
> No feature is implemented without carefully scrutinizing the design,
> though.
>
> Besides, i may be missing something and it might emerge that the defect
> you are talking about causes more trouble than i so far think, and
> the feature you are proposing provides more benefit than i so far
> recognize.  In another mail, i said "i feel like sitting on the
> fence."
>
> And finally, while the questions of how the formatter should handle a
> translated manual page and how translations can be improved to actually
> become useable are clearly somewhat related, in the following sense, they
> are at the same time close to orthogonal: *if* formatters get better at
> handling translated manuals, that also helps to make translated manuals
> better, no matter how the latter may be achieved in the text itself.
> Maybe not all hope is lost for reviving at least some of the most widely
> used native languages for this particular technical field, for example
> German, Spanish, and Japanese.  As i said, i'm not sure whether that
> is desirable, feasible, and if so, how.
>
Becoming a translator is one of the first steps for many users who
want to give something back to the community: »Oh, I speak reasonable
English, and I speak German, so maybe it's a good idea to translate
something«. But many such translations are sent to developers without
any review by an experienced translator, and older but still valid
translations have never been reviewed, even though the corresponding
.po file has been maintained by a translation team for years.

And besides that, people with programming skills usually don't bother
with translations. So there's practically no overlap between
developers and translators: terms expected and needed by developers and
advanced users are not covered by the translations written by
non-developers. Moreover, developers often don't use a locale matching
their native language, so they don't see at all what's wrong.

Of course, I understand your complaints about translation quality.
Especially in the case of man pages we have another problem: for some
languages, old textual translations are shipped by the distributions
again and again, drifting further and further away from the English
versions. In manpages-l10n, we use Po4a to ease the pain of keeping
the translated versions up to date, but some teams, like the
Japanese team, don't do so. The current bash.1 is from GNU Bash 5.1,
but the Japanese version is from Bash 4.2, twelve years old. That's
why translated man pages still have a bad reputation regarding their
age, although the manpages-l10n versions are released every three
months.

But as long as we don't get enough feedback from the target audience,
we can't improve that much. The translation of »SYNOPSIS« to
»ÜBERSICHT« is probably the result of a discussion between the
translators involved years ago, or was even written by the very first
translator and thoughtlessly adopted by all the others. As long as no
one complains about it, we keep it.

Best Regards,
Mario

* Re: HTML output: section headers with diacritics not in table of contents
  2022-03-26 10:33     ` Ingo Schwarze
@ 2022-03-26 17:55       ` Anthony J. Bentley
  2022-03-27 11:17         ` Ingo Schwarze
  0 siblings, 1 reply; 19+ messages in thread
From: Anthony J. Bentley @ 2022-03-26 17:55 UTC (permalink / raw)
  To: Ingo Schwarze; +Cc: discuss

Hi Ingo,

Ingo Schwarze writes:
> All the same, before starting work on an implementation, i would
> also appreciate your opinion, Anthony (and possibly of similarly
> prolific users and maintainers in other operating systems) whether
> such functionality seems desirable to you, because i feel like
> sitting on the fence myself.

As a user, I was mildly surprised to learn Mandoc limits URL fragments
to ASCII; I would expect it to create UTF-8 fragments seamlessly.
So consider me in favor of such a change.

In fact, if it’s workable, I would also like to be able to use UTF-8
tags in less(1) as well.

-- 
Anthony J. Bentley

* Re: HTML output: section headers with diacritics not in table of contents
  2022-03-26 17:55       ` Anthony J. Bentley
@ 2022-03-27 11:17         ` Ingo Schwarze
  2022-03-27 11:44           ` Ingo Schwarze
  0 siblings, 1 reply; 19+ messages in thread
From: Ingo Schwarze @ 2022-03-27 11:17 UTC (permalink / raw)
  To: anthony, mario.blaettermann; +Cc: discuss

Hi,

Anthony J. Bentley wrote on Sat, Mar 26, 2022 at 11:55:13AM -0600:
> Ingo Schwarze writes:

>> All the same, before starting work on an implementation, i would
>> also appreciate your opinion, Anthony (and possibly of similarly
>> prolific users and maintainers in other operating systems) whether
>> such functionality seems desirable to you, because i feel like
>> sitting on the fence myself.

> As a user, I was mildly surprised to learn Mandoc limits URL fragments
> to ASCII; I would expect it to create UTF-8 fragments seamlessly.
> So consider me in favor of such a change.
> 
> In fact, if it’s workable, I would also like to be able to use UTF-8
> tags in less(1) as well.

OK, i see.  So i have added a TODO entry, see below.
I'll not start working on it before the upcoming OpenBSD release,
but after that would be a good time to see what can be done.
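
To give an idea of what accepting UTF-8 in tags might involve, here
is a minimal C sketch.  It is not a proposed patch: the function
names are hypothetical, and the acceptance policy (non-empty,
well-formed UTF-8, no ASCII whitespace, the latter because HTML id
values must not contain whitespace) is an assumption for
illustration, not a design decision from this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Decode one UTF-8 sequence from s (at most len bytes) into *cp.
 * Returns the number of bytes consumed, or 0 for malformed input:
 * stray continuation byte, truncated sequence, overlong form,
 * surrogate, or value above U+10FFFF.
 */
size_t
utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
	/* Smallest code point that legitimately needs n bytes. */
	static const uint32_t min[] = { 0, 0, 0x80, 0x800, 0x10000 };
	size_t i, n;
	uint32_t c;

	assert(s != NULL && cp != NULL);
	if (len == 0)
		return 0;
	if (s[0] < 0x80) {
		*cp = s[0];
		return 1;
	} else if ((s[0] & 0xE0) == 0xC0) {
		n = 2;
		c = s[0] & 0x1F;
	} else if ((s[0] & 0xF0) == 0xE0) {
		n = 3;
		c = s[0] & 0x0F;
	} else if ((s[0] & 0xF8) == 0xF0) {
		n = 4;
		c = s[0] & 0x07;
	} else
		return 0;
	if (len < n)
		return 0;
	for (i = 1; i < n; i++) {
		if ((s[i] & 0xC0) != 0x80)
			return 0;
		c = (c << 6) | (s[i] & 0x3F);
	}
	if (c < min[n] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
		return 0;
	*cp = c;
	return n;
}

/* Hypothetical policy check for a candidate tag (see assumptions above). */
bool
tag_is_acceptable(const char *tag)
{
	const unsigned char *s;
	size_t len, n;
	uint32_t cp;

	assert(tag != NULL);
	s = (const unsigned char *)tag;
	len = strlen(tag);
	if (len == 0)
		return false;
	while (len > 0) {
		if ((n = utf8_decode(s, len, &cp)) == 0)
			return false;	/* malformed UTF-8 */
		if (cp == 0x09 || cp == 0x0A || cp == 0x0C ||
		    cp == 0x0D || cp == 0x20)
			return false;	/* ASCII whitespace */
		s += n;
		len -= n;
	}
	return true;
}
```

Under this sketch, the UTF-8 encoding of "ÜBERSICHT" is accepted,
while a title containing a blank is rejected; a real formatter might
prefer to replace blanks rather than reject the whole title.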

Thanks for your input,
  Ingo

* Re: HTML output: section headers with diacritics not in table of contents
  2022-03-27 11:17         ` Ingo Schwarze
@ 2022-03-27 11:44           ` Ingo Schwarze
  0 siblings, 0 replies; 19+ messages in thread
From: Ingo Schwarze @ 2022-03-27 11:44 UTC (permalink / raw)
  To: anthony, mario.blaettermann; +Cc: discuss

Duh, the classical gaffe yet again.

Ingo Schwarze wrote on Sun, Mar 27, 2022 at 01:17:58PM +0200:
> Anthony J. Bentley wrote on Sat, Mar 26, 2022 at 11:55:13AM -0600:
>> Ingo Schwarze writes:

>>> All the same, before starting work on an implementation, i would
>>> also appreciate your opinion, Anthony (and possibly of similarly
>>> prolific users and maintainers in other operating systems) whether
>>> such functionality seems desirable to you, because i feel like
>>> sitting on the fence myself.

>> As a user, I was mildly surprised to learn Mandoc limits URL fragments
>> to ASCII; I would expect it to create UTF-8 fragments seamlessly.
>> So consider me in favor of such a change.
>> 
>> In fact, if it’s workable, I would also like to be able to use UTF-8
>> tags in less(1) as well.

> OK, i see.  So i have added a TODO entry, see below.
> I'll not start working on it before the upcoming OpenBSD release,
> but after that would be a good time to see what can be done.


Log Message:
-----------
new TODO entry: handle Unicode letters in tags

Modified Files:
--------------
    mandoc:
        TODO

Revision Data
-------------
Index: TODO
===================================================================
RCS file: /home/cvs/mandoc/mandoc/TODO,v
retrieving revision 1.321
retrieving revision 1.322
diff -LTODO -LTODO -u -p -r1.321 -r1.322
--- TODO
+++ TODO
@@ -313,6 +313,11 @@ are mere guesses, and some may be wrong.
   weerd@ 28 Sep 2021 12:44:07 +0200
   loc **  exist *  algo *  size *  imp ***
 
+- handle Unicode letters in tags in both HTML and terminal output
+  thread "section headers with diacritics" starting with
+  Mario Blaettermann 24 Mar 2022 18:13:23 +0100
+  loc **  exist *  algo *  size *  imp **
+
 - -T man does not handle eqn(7) and tbl(7)
   Stephen Gregoratto 16 Feb 2020 01:28:07 +1100
   also https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=901636
