Date: Sat, 26 Mar 2022 11:33:28 +0100
From: Ingo Schwarze
To: anthony@anjbe.name
Cc: discuss@mandoc.bsd.lv
Subject: Re: HTML output: section headers with diacritics not in table of contents
References: <10474-1648225308.014815@KUMT.SLa5.YYhl>
In-Reply-To: <10474-1648225308.014815@KUMT.SLa5.YYhl>
Reply-To: discuss@mandoc.bsd.lv
X-Mailinglist: mandoc-discuss

Hi Anthony,

Anthony J. Bentley wrote on Fri, Mar 25, 2022 at 10:21:48AM -0600:
> Ingo Schwarze writes:

>> Maybe mandoc should treat any \[uXXXX] sequence as a letter for
>> the purposes of tagging?  The code needed for that will look rather
>> awkward though, and even when implemented perfectly, the tags will
>> be UTF-8 rather than ASCII-encoded.  Would links like
>>
>>   https://man.archlinux.org/man/diff.1.de#%C3%9CBERSICHT
>>
>> really be all that useful?  What do people think?

> There would be no need for Mandoc to percent-encode UTF-8 here.
> In HTML5, a URL fragment (that is, the portion after the '#') may
> contain unescaped "URL code points," which are:
>
> "ASCII alphanumeric, U+0021 (!), U+0024 ($), U+0026 (&), U+0027 ('),
> U+0028 LEFT PARENTHESIS, U+0029 RIGHT PARENTHESIS, U+002A (*),
> U+002B (+), U+002C (,), U+002D (-), U+002E (.), U+002F (/), U+003A
> (:), U+003B (;), U+003D (=), U+003F (?), U+0040 (@), U+005F (_),
> U+007E (~), and code points in the range U+00A0 to U+10FFFD,
> inclusive, excluding surrogates and noncharacters."

Thanks, that sounds like a useful hint.

Excluding surrogates is easy, and

  http://www.unicode.org/faq/private_use.html#noncharacters

tells me what "noncharacters" are.  Since those 66 codepoints are
stable, it is feasible to exclude them, too, without needing any
Unicode library.

All the same, before starting work on an implementation, i would
also appreciate your opinion, Anthony (and possibly that of similarly
prolific users and maintainers on other operating systems), on whether
such functionality seems desirable to you, because i am sitting on
the fence myself.

Yours,
  Ingo
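
The test discussed above (is a code point a "URL code point", with surrogates and the 66 noncharacters excluded) could be sketched in C roughly as follows. This is not mandoc code: the function names is_surrogate(), is_noncharacter() and is_url_code_point() are made up for illustration, and the sketch assumes an ASCII-compatible execution character set. It only shows that the check needs a few comparisons and no Unicode library.

```c
#include <stdint.h>
#include <string.h>

/* Surrogates occupy U+D800 to U+DFFF. */
static int
is_surrogate(uint32_t cp)
{
	return cp >= 0xD800 && cp <= 0xDFFF;
}

/*
 * The 66 noncharacters: U+FDD0 to U+FDEF (32 code points) plus
 * the last two code points of each of the 17 planes (34 code points),
 * i.e. every value whose low 16 bits are FFFE or FFFF.
 */
static int
is_noncharacter(uint32_t cp)
{
	if (cp > 0x10FFFF)
		return 0;
	return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
}

/*
 * Hypothetical helper: return 1 if cp is a URL code point as
 * listed in the quoted definition, 0 otherwise.
 */
int
is_url_code_point(uint32_t cp)
{
	if (cp < 0x80) {
		if ((cp >= '0' && cp <= '9') ||
		    (cp >= 'A' && cp <= 'Z') ||
		    (cp >= 'a' && cp <= 'z'))
			return 1;
		/* strchr() would also match the terminating NUL. */
		return cp != 0 &&
		    strchr("!$&'()*+,-./:;=?@_~", (int)cp) != NULL;
	}
	return cp >= 0xA0 && cp <= 0x10FFFD &&
	    !is_surrogate(cp) && !is_noncharacter(cp);
}
```

With such a helper, U+00DC (the first letter of the German heading in the example link) would be accepted unescaped, while '#', space, and the C1 control range U+0080 to U+009F would be rejected.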