From: Bruno Haible
To: bug-gnulib@gnu.org
Cc: Florian Weimer, Arjun Shankar, Rich Felker, "A. Wilcox",
 musl@lists.openwall.com
Date: Thu, 30 Jul 2020 11:39:43 +0200
Message-ID: <79808844.bqqDOferBU@omega>
In-Reply-To: <87d04djrz2.fsf@oldenburg2.str.redhat.com>
References: <2117749.CLknGyfR5K@omega> <87d04djrz2.fsf@oldenburg2.str.redhat.com>
Subject: [musl] Re: iconv replacements

[Dropping bug-bison from CC]

> > Yes and no. The code is not making assumptions about a particular iconv()
> > implementation. But it needs to distinguish two categories of replacements
> > done by iconv():
> >   - those that are harmless (for example when replacing a Unicode TAG
> >     character U+E00xx with an empty output),
> >   - those that are better not presented to the user, if the programmer has
> >     specified a fallback (for example, replacing all non-ASCII characters
> >     with NUL, '?', or '*').
> >
> > The standards don't help in making the distinction.
> >
> > Therefore whether you consider said glibc and libiconv behaviour as
> > "non-conforming" or not is irrelevant.
>
> Could you sketch briefly what you need?  We have identified some issues
> with the existing iconv interface.  If we add an enhancement, it would
> make sense to cover these requirements.

POSIX [1] says:

  "If iconv() encounters a character in the input buffer that is valid, but
   for which an identical character does not exist in the target codeset,
   iconv() shall perform an implementation-defined conversion on this
   character."

  "The iconv() function shall ... return the number of non-identical
   conversions performed."

This is sufficient for detecting that iconv() did something that the
application might or might not like.

For decent application behaviour in UTF-8, legacy 8-bit, and ASCII locales
I wrote a module 'unicodeio' that accepts an ASCII fallback given by the
programmer.
For example, for the string "François Pinard" a fallback
"Francois Pinard" can be given, and for the string "•" a fallback "."
can be given.

This code needs to analyze what iconv() actually did and distinguish
replacements that are OK (no need to activate the ASCII fallback) from those
that are worse than the ASCII fallback. For example:
  - Replacing 'ç' with '?' (NetBSD, Solaris 11) or '*' (musl) or NUL (IRIX)
    is worse than the ASCII fallback.
  - Replacing a Unicode tag character with an empty string is OK.
  - Replacing GREEK SMALL LETTER MU with MICRO SIGN is OK.
  - Replacing FULLWIDTH COLON with ':' is OK (most likely equivalent to the
    ASCII fallback).

That's my requirement from the application side.

I don't know whether an iconv() implementation can help here, given the
limited interface of iconv. Maybe there could be an alternative to //TRANSLIT
in the iconv_open() argument that would specify, e.g., that tag characters
and replacements in UnicodeData.txt are OK but other replacements are not OK?
Where either
  - OK means a conversion that does not increment the return value,
  - "not OK" means a conversion that increments the return value,
or
  - OK means a conversion that increments the return value,
  - "not OK" means an error return (-1 / EILSEQ).

Bruno

[1] https://pubs.opengroup.org/onlinepubs/9699919799/functions/iconv.html
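
For illustration, here is a minimal C sketch of the application-side check
described above. It is not the actual 'unicodeio' code: the function name
convert_or_fallback and its return convention are hypothetical, and it
assumes an iconv() whose second argument is char ** (as in glibc and musl).
It applies the most conservative policy that the current interface allows:
any error or any non-identical conversion activates the ASCII fallback. It
cannot tell a harmless replacement from a harmful one, which is exactly the
gap discussed in this mail.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of gnulib: convert the UTF-8 string UTF8
   to the encoding TOCODE.  If iconv() fails or reports at least one
   non-identical conversion, copy ASCII_FALLBACK into OUT instead.
   Returns 0 if the converted text was used, 1 if the fallback was used,
   -1 if the fallback does not fit into OUT either.  */
static int
convert_or_fallback (const char *tocode, const char *utf8,
                     const char *ascii_fallback, char *out, size_t outsize)
{
  int use_fallback = 1;
  iconv_t cd = iconv_open (tocode, "UTF-8");

  if (cd != (iconv_t) -1)
    {
      if (outsize > 0)
        {
          char *inptr = (char *) utf8;   /* iconv() wants a non-const pointer */
          size_t inleft = strlen (utf8);
          char *outptr = out;
          size_t outleft = outsize - 1;  /* keep room for the terminating NUL */

          /* The return value counts the non-identical conversions;
             (size_t) -1 signals an error such as EILSEQ or E2BIG.  */
          size_t res = iconv (cd, &inptr, &inleft, &outptr, &outleft);
          if (res == 0)
            /* Flush a possible shift sequence of a stateful encoding.  */
            res = iconv (cd, NULL, NULL, &outptr, &outleft);

          if (res == 0 && inleft == 0)
            {
              *outptr = '\0';
              use_fallback = 0;
            }
        }
      iconv_close (cd);
    }

  if (use_fallback)
    {
      size_t len = strlen (ascii_fallback);
      if (len >= outsize)
        return -1;
      memcpy (out, ascii_fallback, len + 1);
      return 1;
    }
  return 0;
}
```

Called with tocode "ASCII", the input "François Pinard" and the fallback
"Francois Pinard", this should yield the fallback both on implementations
that fail with EILSEQ and on those that substitute '?' or '*' and count the
substitution in the return value.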