* [musl] Building Bison 3.7 with musl (was Re: portability issues with unicodeio)
From: A. Wilcox @ 2020-07-29 23:23 UTC
To: bruno, bug-bison; +Cc: musl

Seeing some weird behaviour here building Bison 3.7 on musl libc.

Something seems to be "intelligent" enough to know that \u2022 is a
bullet character, and is replacing it with "*" instead of ".", causing
all the tests to fail:

awilcox on gwyn [17] bison: LC_ALL=C /bin/printf '\u2022\n' | od -t x1
0000000 2a 0a
0000002

Best,
--arw

--
A. Wilcox (awilfox)
Project Lead, Adélie Linux
https://www.adelielinux.org

* Re: [musl] Building Bison 3.7 with musl (was Re: portability issues with unicodeio)
From: A. Wilcox @ 2020-07-29 23:48 UTC
To: musl

On 29/07/2020 18:23, A. Wilcox wrote:
> Seeing some weird behaviour here building Bison 3.7 on musl libc.
>
> Something seems to be "intelligent" enough to know that \u2022 is a
> bullet character, and is replacing it with "*" instead of ".", causing
> all the tests to fail:
>
> awilcox on gwyn [17] bison: LC_ALL=C /bin/printf '\u2022\n' | od -t x1
> 0000000 2a 0a
> 0000002
>
> Best,
> --arw
>

ugh. The email address for Bison's list was copied wrong; if you reply
all to my original message, please change @lists.gnu.org to @gnu.org.

Apologies for the noise.

Best,
--arw

--
A. Wilcox (awilfox)
Project Lead, Adélie Linux
https://www.adelielinux.org

* Re: [musl] Building Bison 3.7 with musl (was Re: portability issues with unicodeio)
From: Rich Felker @ 2020-07-30 0:05 UTC
To: A. Wilcox; +Cc: bruno, bug-bison, musl

On Wed, Jul 29, 2020 at 06:23:19PM -0500, A. Wilcox wrote:
> Seeing some weird behaviour here building Bison 3.7 on musl libc.
>
> Something seems to be "intelligent" enough to know that \u2022 is a
> bullet character, and is replacing it with "*" instead of ".", causing
> all the tests to fail:
>
> awilcox on gwyn [17] bison: LC_ALL=C /bin/printf '\u2022\n' | od -t x1
> 0000000 2a 0a
> 0000002

I don't think the '*' has anything to do with it being a bullet
character. It's just the implementation-defined replacement character
musl's iconv uses.

I would guess the code in bison and coreutils printf is assuming the
non-conforming glibc behavior for iconv of returning an error if a
character from the input is not exactly representable in the output,
rather than making replacements and returning the number of inexact
conversions made.

Rich

* Re: [musl] Building Bison 3.7 with musl (was Re: portability issues with unicodeio)
From: A. Wilcox @ 2020-07-30 0:12 UTC
To: musl, bug-bison

On 29/07/2020 19:05, Rich Felker wrote:
> On Wed, Jul 29, 2020 at 06:23:19PM -0500, A. Wilcox wrote:
>> Seeing some weird behaviour here building Bison 3.7 on musl libc.
>>
>> Something seems to be "intelligent" enough to know that \u2022 is a
>> bullet character, and is replacing it with "*" instead of ".", causing
>> all the tests to fail:
>>
>> awilcox on gwyn [17] bison: LC_ALL=C /bin/printf '\u2022\n' | od -t x1
>> 0000000 2a 0a
>> 0000002
>
> I don't think the '*' has anything to do with it being a bullet
> character. It's just the implementation-defined replacement character
> musl's iconv uses.

Ah, ok.

> I would guess the code in bison and coreutils printf is assuming the
> non-conforming glibc behavior for iconv of returning an error if a
> character from the input is not exactly representable in the output,
> rather than making replacements and returning the number of inexact
> conversions made.

Actually, it's assuming iconv will replace \u2022 with '.', and failing
because it isn't:

@@ -1,9 +1,9 @@
 State 0

-    0 $accept: . S $end
-    1 S: . 'a' A 'a'
-    2  | . 'b' A 'b'
-    3  | . 'c' c
+    0 $accept: * S $end
+    1 S: * 'a' A 'a'
+    2  | * 'b' A 'b'
+    3  | * 'c' c

     'a'  shift, and go to state 1
     'b'  shift, and go to state 2

This test gets more and more "fun" the more platforms it's ported to.

--arw

--
A. Wilcox (awilfox)
Project Lead, Adélie Linux
https://www.adelielinux.org

* Re: [musl] Building Bison 3.7 with musl (was Re: portability issues with unicodeio)
From: Bruno Haible @ 2020-07-30 1:43 UTC
To: Rich Felker; +Cc: A. Wilcox, bug-bison, musl, bug-gnulib

[CCing bug-gnulib]

Rich Felker wrote:
> I don't think the '*' has anything to do with it being a bullet
> character. It's just the implementation-defined replacement character
> musl's iconv uses.

Correct.

> I would guess the code in bison and coreutils printf is assuming the
> non-conforming glibc behavior for iconv of returning an error if a
> character from the input is not exactly representable in the output,
> rather than making replacements and returning the number of inexact
> conversions made.

Yes and no. The code is not making assumptions about a particular iconv()
implementation. But it needs to distinguish two categories of replacements
done by iconv():
  - those that are harmless (for example when replacing a Unicode TAG
    character U+E00xx with an empty output),
  - those that are better not presented to the user, if the programmer has
    specified a fallback (for example, replacing all non-ASCII characters
    with NUL, '?', or '*').

The standards don't help in making the distinction.

Therefore whether you consider said glibc and libiconv behaviour as
"non-conforming" or not is irrelevant.

I have now adjusted the code to handle musl libc better.

Bruno

* Re: [musl] Building Bison 3.7 with musl (was Re: portability issues with unicodeio)
From: Florian Weimer @ 2020-07-30 9:02 UTC
To: Bruno Haible
Cc: Rich Felker, musl, A. Wilcox, bug-bison, bug-gnulib, Arjun Shankar

* Bruno Haible:

> Yes and no. The code is not making assumptions about a particular iconv()
> implementation. But it needs to distinguish two categories of replacements
> done by iconv():
>   - those that are harmless (for example when replacing a Unicode TAG
>     character U+E00xx with an empty output),
>   - those that are better not presented to the user, if the programmer has
>     specified a fallback (for example, replacing all non-ASCII characters
>     with NUL, '?', or '*').
>
> The standards don't help in making the distinction.
>
> Therefore whether you consider said glibc and libiconv behaviour as
> "non-conforming" or not is irrelevant.

Could you sketch briefly what you need? We have identified some issues
with the existing iconv interface. If we add an enhancement, it would
make sense to cover these requirements.

Thanks,
Florian

* [musl] Re: iconv replacements
From: Bruno Haible @ 2020-07-30 9:39 UTC
To: bug-gnulib; +Cc: Florian Weimer, Arjun Shankar, Rich Felker, A. Wilcox, musl

[Dropping bug-bison from CC]

> > Yes and no. The code is not making assumptions about a particular iconv()
> > implementation. But it needs to distinguish two categories of replacements
> > done by iconv():
> >   - those that are harmless (for example when replacing a Unicode TAG
> >     character U+E00xx with an empty output),
> >   - those that are better not presented to the user, if the programmer has
> >     specified a fallback (for example, replacing all non-ASCII characters
> >     with NUL, '?', or '*').
> >
> > The standards don't help in making the distinction.
> >
> > Therefore whether you consider said glibc and libiconv behaviour as
> > "non-conforming" or not is irrelevant.
>
> Could you sketch briefly what you need? We have identified some issues
> with the existing iconv interface. If we add an enhancement, it would
> make sense to cover these requirements.

POSIX [1] says:

  "If iconv() encounters a character in the input buffer that is valid,
   but for which an identical character does not exist in the target
   codeset, iconv() shall perform an implementation-defined conversion
   on this character."

  "The iconv() function shall ... return the number of non-identical
   conversions performed."

This is sufficient for detecting that iconv() did something that the
application might or might not like.

For decent application behaviour in UTF-8, legacy 8-bit, and ASCII locales
I wrote a module 'unicodeio' that accepts an ASCII fallback given by the
programmer. For example, for the string "François Pinard" a fallback
"Francois Pinard" can be given, and for the string "•" a fallback "." can
be given.

In this code, it needs to analyze what iconv() actually did and distinguish
replacements that are OK (no need to activate the ASCII fallback) and those
that are worse than the ASCII fallback. For example:

  - Replacing 'ç' with '?' (NetBSD, Solaris 11) or '*' (musl) or NUL (IRIX)
    is worse than the ASCII fallback.
  - Replacing a Unicode tag character with an empty string is OK.
  - Replacing GREEK SMALL LETTER MU with MICRO SIGN is OK.
  - Replacing FULLWIDTH COLON with ':' is OK (most likely equivalent to
    the ASCII fallback).

That's my requirement from the application side.

I don't know whether an iconv() implementation can help here, given the
limited interface of iconv. Maybe there could be an alternative to
//TRANSLIT in the iconv_open() argument, that would specify e.g. that tag
characters and <compat> and <wide> replacements in UnicodeData.txt are OK
but other replacements are not OK? Where either
  - OK means a conversion that does not increment the return value,
  - "not OK" means a conversion that increments the return value,
or
  - OK means a conversion that increments the return value,
  - "not OK" means an error return (-1 / EILSEQ).

Bruno

[1] https://pubs.opengroup.org/onlinepubs/9699919799/functions/iconv.html
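
For readers following along, the check described in the last message might be sketched in C roughly as below. This is not the gnulib 'unicodeio' code. The helper name put_char_or_fallback, the fixed "ASCII" target and the buffer size are all assumptions made for illustration; a real caller would presumably convert to the locale's charset. The sketch converts one UTF-8 character and uses the programmer's ASCII fallback when iconv() fails outright, or when it reports a non-identical conversion whose whole output is one of the placeholders named in the thread ('?', '*' or NUL).

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, for illustration only.  Converts one UTF-8 encoded
   character to ASCII and stores the result in out (NUL-terminated).  If
   iconv() fails, or substitutes a placeholder, the caller's ASCII fallback
   is stored instead.  Returns the number of bytes stored, not counting the
   terminating NUL.  */
static size_t
put_char_or_fallback (const char *utf8_char, const char *ascii_fallback,
                      char *out, size_t outsize)
{
  char buf[16];
  size_t produced = 0;
  int use_fallback = 1;   /* also covers iconv_open() failure */

  iconv_t cd = iconv_open ("ASCII", "UTF-8");
  if (cd != (iconv_t) -1)
    {
      char *inptr = (char *) utf8_char;   /* iconv() only reads the input */
      size_t inleft = strlen (utf8_char);
      char *outptr = buf;
      size_t outleft = sizeof buf;

      size_t res = iconv (cd, &inptr, &inleft, &outptr, &outleft);
      produced = (size_t) (outptr - buf);

      /* Error return (the glibc / libiconv style), or unconsumed input.  */
      int failed = (res == (size_t) -1 || inleft > 0);

      /* Non-identical conversion whose entire output is a placeholder:
         '?' (NetBSD, Solaris 11), '*' (musl) or NUL (IRIX).  */
      int placeholder = (!failed && res > 0 && produced == 1
                         && (buf[0] == '?' || buf[0] == '*'
                             || buf[0] == '\0'));

      use_fallback = failed || placeholder;
      iconv_close (cd);
    }

  const char *src = use_fallback ? ascii_fallback : buf;
  size_t len = use_fallback ? strlen (ascii_fallback) : produced;

  if (outsize == 0)
    return 0;
  if (len >= outsize)
    len = outsize - 1;    /* truncate rather than overflow */
  memcpy (out, src, len);
  out[len] = '\0';
  return len;
}
```

With this sketch, put_char_or_fallback ("\xe2\x80\xa2", ".", out, sizeof out) should store "." whether iconv() signals an error or emits '*'. Other non-identical conversions (an empty output for a tag character, ':' for FULLWIDTH COLON) are let through unchanged, which is the "OK" category in the message above.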