mailing list of musl libc
* [musl] Building Bison 3.7 with musl (was Re: portability issues with unicodeio)
@ 2020-07-29 23:23 A. Wilcox
  2020-07-29 23:48 ` A. Wilcox
  2020-07-30  0:05 ` Rich Felker
  0 siblings, 2 replies; 7+ messages in thread
From: A. Wilcox @ 2020-07-29 23:23 UTC (permalink / raw)
  To: bruno, bug-bison; +Cc: musl


Seeing some weird behaviour here building Bison 3.7 on musl libc.

Something seems to be "intelligent" enough to know that \u2022 is a
bullet character, and is replacing it with "*" instead of ".", causing
all the tests to fail:

awilcox on gwyn [17] bison: LC_ALL=C /bin/printf '\u2022\n' | od -t x1
0000000 2a 0a
0000002

Best,
--arw

-- 
A. Wilcox (awilfox)
Project Lead, Adélie Linux
https://www.adelielinux.org




* Re: [musl] Building Bison 3.7 with musl (was Re: portability issues with unicodeio)
  2020-07-29 23:23 [musl] Building Bison 3.7 with musl (was Re: portability issues with unicodeio) A. Wilcox
@ 2020-07-29 23:48 ` A. Wilcox
  2020-07-30  0:05 ` Rich Felker
  1 sibling, 0 replies; 7+ messages in thread
From: A. Wilcox @ 2020-07-29 23:48 UTC (permalink / raw)
  To: musl


On 29/07/2020 18:23, A. Wilcox wrote:
> Seeing some weird behaviour here building Bison 3.7 on musl libc.
> 
> Something seems to be "intelligent" enough to know that \u2022 is a
> bullet character, and is replacing it with "*" instead of ".", causing
> all the tests to fail:
> 
> awilcox on gwyn [17] bison: LC_ALL=C /bin/printf '\u2022\n' | od -t x1
> 0000000 2a 0a
> 0000002
> 
> Best,
> --arw
> 


ugh.  The email address for Bison's list was copied wrong; if you reply
all to my original message, please change @lists.gnu.org to @gnu.org.

Apologies for the noise.

Best,
--arw
-- 
A. Wilcox (awilfox)
Project Lead, Adélie Linux
https://www.adelielinux.org




* Re: [musl] Building Bison 3.7 with musl (was Re: portability issues with unicodeio)
  2020-07-29 23:23 [musl] Building Bison 3.7 with musl (was Re: portability issues with unicodeio) A. Wilcox
  2020-07-29 23:48 ` A. Wilcox
@ 2020-07-30  0:05 ` Rich Felker
  2020-07-30  0:12   ` A. Wilcox
  2020-07-30  1:43   ` Bruno Haible
  1 sibling, 2 replies; 7+ messages in thread
From: Rich Felker @ 2020-07-30  0:05 UTC (permalink / raw)
  To: A. Wilcox; +Cc: bruno, bug-bison, musl

On Wed, Jul 29, 2020 at 06:23:19PM -0500, A. Wilcox wrote:
> Seeing some weird behaviour here building Bison 3.7 on musl libc.
> 
> Something seems to be "intelligent" enough to know that \u2022 is a
> bullet character, and is replacing it with "*" instead of ".", causing
> all the tests to fail:
> 
> awilcox on gwyn [17] bison: LC_ALL=C /bin/printf '\u2022\n' | od -t x1
> 0000000 2a 0a
> 0000002

I don't think the '*' has anything to do with it being a bullet
character. It's just the implementation-defined replacement character
musl's iconv uses.

I would guess the code in bison and coreutils printf is assuming the
non-conforming glibc behavior for iconv of returning an error if a
character from the input is not exactly representable in the output,
rather than making replacements and returning the number of inexact
conversions made.

Rich
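
A minimal C sketch of the two behaviours described above, assuming a POSIX
iconv() and that iconv_open() accepts the names "ASCII" and "UTF-8". The enum
and the function name probe_utf8_to_ascii are hypothetical, chosen for
illustration; they are not taken from bison, coreutils, or musl. The caller
learns whether the implementation rejected the input (the behaviour attributed
to glibc above) or substituted a replacement and counted it (the behaviour
attributed to musl above).

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical outcome labels, for illustration only. */
enum conv_outcome {
    CONV_EXACT,       /* every character had an identical counterpart */
    CONV_REPLACED,    /* iconv() substituted something and counted it */
    CONV_REJECTED,    /* iconv() returned (size_t)-1, e.g. with EILSEQ */
    CONV_UNAVAILABLE  /* iconv_open() failed or the buffer is unusable */
};

/* Convert a NUL-terminated UTF-8 string to ASCII and report how the
   local iconv() dealt with characters that ASCII cannot represent.
   Whatever was written before the call returned is left in `out`. */
enum conv_outcome probe_utf8_to_ascii(const char *utf8, char *out, size_t outsz)
{
    if (outsz == 0)
        return CONV_UNAVAILABLE;

    iconv_t cd = iconv_open("ASCII", "UTF-8");
    if (cd == (iconv_t)-1)
        return CONV_UNAVAILABLE;

    char *in = (char *)utf8;        /* iconv() does not modify the input */
    size_t inleft = strlen(utf8);
    char *dst = out;
    size_t outleft = outsz - 1;     /* keep room for the terminator */

    size_t res = iconv(cd, &in, &inleft, &dst, &outleft);
    iconv_close(cd);
    *dst = '\0';

    if (res == (size_t)-1)
        return CONV_REJECTED;
    /* On success the return value is the number of non-identical
       conversions that were performed. */
    return res == 0 ? CONV_EXACT : CONV_REPLACED;
}
```

Calling probe_utf8_to_ascii("\xe2\x80\xa2", buf, sizeof buf) should yield
CONV_REJECTED where iconv() refuses the bullet and CONV_REPLACED where it
substitutes a character such as '*'. Code that wants to be portable has to
handle both outcomes.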


* Re: [musl] Building Bison 3.7 with musl (was Re: portability issues with unicodeio)
  2020-07-30  0:05 ` Rich Felker
@ 2020-07-30  0:12   ` A. Wilcox
  2020-07-30  1:43   ` Bruno Haible
  1 sibling, 0 replies; 7+ messages in thread
From: A. Wilcox @ 2020-07-30  0:12 UTC (permalink / raw)
  To: musl, bug-bison


On 29/07/2020 19:05, Rich Felker wrote:
> On Wed, Jul 29, 2020 at 06:23:19PM -0500, A. Wilcox wrote:
>> Seeing some weird behaviour here building Bison 3.7 on musl libc.
>>
>> Something seems to be "intelligent" enough to know that \u2022 is a
>> bullet character, and is replacing it with "*" instead of ".", causing
>> all the tests to fail:
>>
>> awilcox on gwyn [17] bison: LC_ALL=C /bin/printf '\u2022\n' | od -t x1
>> 0000000 2a 0a
>> 0000002
> 
> I don't think the '*' has anything to do with it being a bullet
> character. It's just the implementation-defined replacement character
> musl's iconv uses.


Ah, ok.


> I would guess the code in bison and coreutils printf is assuming the
> non-conforming glibc behavior for iconv of returning an error if a
> character from the input is not exactly representable in the output,
> rather than making replacements and returning the number of inexact
> conversions made.


Actually, it's assuming iconv will replace \u2022 with '.', and failing
because it isn't:


@@ -1,9 +1,9 @@
 State 0

-    0 $accept: . S $end
-    1 S: . 'a' A 'a'
-    2  | . 'b' A 'b'
-    3  | . 'c' c
+    0 $accept: * S $end
+    1 S: * 'a' A 'a'
+    2  | * 'b' A 'b'
+    3  | * 'c' c

     'a'  shift, and go to state 1
     'b'  shift, and go to state 2



This test gets more and more "fun" the more platforms it's ported to.

--arw

-- 
A. Wilcox (awilfox)
Project Lead, Adélie Linux
https://www.adelielinux.org




* Re: [musl] Building Bison 3.7 with musl (was Re: portability issues with unicodeio)
  2020-07-30  0:05 ` Rich Felker
  2020-07-30  0:12   ` A. Wilcox
@ 2020-07-30  1:43   ` Bruno Haible
  2020-07-30  9:02     ` Florian Weimer
  1 sibling, 1 reply; 7+ messages in thread
From: Bruno Haible @ 2020-07-30  1:43 UTC (permalink / raw)
  To: Rich Felker; +Cc: A. Wilcox, bug-bison, musl, bug-gnulib

[CCing bug-gnulib]

Rich Felker wrote:
> I don't think the '*' has anything to do with it being a bullet
> character. It's just the implementation-defined replacement character
> musl's iconv uses.

Correct.

> I would guess the code in bison and coreutils printf is assuming the
> non-conforming glibc behavior for iconv of returning an error if a
> character from the input is not exactly representable in the output,
> rather than making replacements and returning the number of inexact
> conversions made.

Yes and no. The code is not making assumptions about a particular iconv()
implementation. But it needs to distinguish two categories of replacements
done by iconv():
  - those that are harmless (for example when replacing a Unicode TAG
    character U+E00xx with an empty output),
  - those that are better not presented to the user, if the programmer has
    specified a fallback (for example, replacing all non-ASCII characters
    with NUL, '?', or '*').

The standards don't help in making the distinction.

Therefore whether you consider said glibc and libiconv behaviour as
"non-conforming" or not is irrelevant.

I have now adjusted the code to handle musl libc better.

Bruno



* Re: [musl] Building Bison 3.7 with musl (was Re: portability issues with unicodeio)
  2020-07-30  1:43   ` Bruno Haible
@ 2020-07-30  9:02     ` Florian Weimer
  2020-07-30  9:39       ` [musl] Re: iconv replacements Bruno Haible
  0 siblings, 1 reply; 7+ messages in thread
From: Florian Weimer @ 2020-07-30  9:02 UTC (permalink / raw)
  To: Bruno Haible
  Cc: Rich Felker, musl, A. Wilcox, bug-bison, bug-gnulib, Arjun Shankar

* Bruno Haible:

> Yes and no. The code is not making assumptions about a particular iconv()
> implementation. But it needs to distinguish two categories of replacements
> done by iconv():
>   - those that are harmless (for example when replacing a Unicode TAG
>     character U+E00xx with an empty output),
>   - those that are better not presented to the user, if the programmer has
>     specified a fallback (for example, replacing all non-ASCII characters
>     with NUL, '?', or '*').
>
> The standards don't help in making the distinction.
>
> Therefore whether you consider said glibc and libiconv behaviour as
> "non-conforming" or not is irrelevant.

Could you sketch briefly what you need?  We have identified some issues
with the existing iconv interface.  If we add an enhancement, it would
make sense to cover these requirements.

Thanks,
Florian



* [musl] Re: iconv replacements
  2020-07-30  9:02     ` Florian Weimer
@ 2020-07-30  9:39       ` Bruno Haible
  0 siblings, 0 replies; 7+ messages in thread
From: Bruno Haible @ 2020-07-30  9:39 UTC (permalink / raw)
  To: bug-gnulib; +Cc: Florian Weimer, Arjun Shankar, Rich Felker, A. Wilcox, musl

[Dropping bug-bison from CC]

> > Yes and no. The code is not making assumptions about a particular iconv()
> > implementation. But it needs to distinguish two categories of replacements
> > done by iconv():
> >   - those that are harmless (for example when replacing a Unicode TAG
> >     character U+E00xx with an empty output),
> >   - those that are better not presented to the user, if the programmer has
> >     specified a fallback (for example, replacing all non-ASCII characters
> >     with NUL, '?', or '*').
> >
> > The standards don't help in making the distinction.
> >
> > Therefore whether you consider said glibc and libiconv behaviour as
> > "non-conforming" or not is irrelevant.
> 
> Could you sketch briefly what you need?  We have identified some issues
> with the existing iconv interface.  If we add an enhancement, it would
> make sense to cover these requirements.

POSIX [1] says:

  "If iconv() encounters a character in the input buffer that is valid, but for
   which an identical character does not exist in the target codeset, iconv()
   shall perform an implementation-defined conversion on this character."

  "The iconv() function shall ... return the number of non-identical conversions performed."

This is sufficient for detecting that iconv() did something that the
application might or might not like.

For decent application behaviour in UTF-8, legacy 8-bit, and ASCII locales
I wrote a module 'unicodeio' that accepts an ASCII fallback given by the
programmer. For example, for the string "François Pinard" a fallback
"Francois Pinard" can be given, and for the string "•" a fallback "." can
be given.

In this code, it needs to analyze what iconv() actually did and distinguish
replacements that are OK (no need to activate the ASCII fallback) and those
that are worse than the ASCII fallback. For example:
  - Replacing 'ç' with '?' (NetBSD, Solaris 11) or '*' (musl) or NUL (IRIX)
    is worse than the ASCII fallback.
  - Replacing a Unicode tag character with an empty string is OK.
  - Replacing GREEK SMALL LETTER MU with MICRO SIGN is OK.
  - Replacing FULLWIDTH COLON with ':' is OK (most likely equivalent to the
    ASCII fallback).
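
The distinction above can be sketched in C roughly as follows. This is an
illustration under stated assumptions, not the actual 'unicodeio' code: the
names is_poor_replacement and ascii_or_fallback are hypothetical, the input is
assumed to be exactly one UTF-8 encoded character, and the set of "poor"
replacements is limited to the three single bytes named in the list (NUL, '?',
'*'). Any other non-identical conversion is accepted as is.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* True if the output is one of the single-byte stand-ins listed above. */
static int is_poor_replacement(const char *out, size_t len)
{
    return len == 1 && (out[0] == '\0' || out[0] == '?' || out[0] == '*');
}

/* Convert ONE UTF-8 encoded character to ASCII. If iconv() fails, or if
   it reports a non-identical conversion whose result is a poor stand-in,
   copy the programmer-supplied fallback instead.
   Returns 1 if the fallback was used, 0 if the iconv() output was kept,
   and -1 if `out` has no room at all. */
int ascii_or_fallback(const char *utf8_char, const char *fallback,
                      char *out, size_t outsz)
{
    char tmp[16];
    char *in = (char *)utf8_char;   /* iconv() does not modify the input */
    size_t inleft = strlen(utf8_char);
    char *dst = tmp;
    size_t outleft = sizeof tmp;
    size_t res = (size_t)-1;        /* treat "no converter" like an error */

    if (outsz == 0)
        return -1;

    iconv_t cd = iconv_open("ASCII", "UTF-8");
    if (cd != (iconv_t)-1) {
        res = iconv(cd, &in, &inleft, &dst, &outleft);
        iconv_close(cd);
    }

    size_t len = (size_t)(dst - tmp);
    int use_fallback = res == (size_t)-1
                       || (res > 0 && is_poor_replacement(tmp, len));

    const char *src = use_fallback ? fallback : tmp;
    size_t n = use_fallback ? strlen(fallback) : len;
    if (n >= outsz)
        n = outsz - 1;              /* truncate rather than overflow */
    memcpy(out, src, n);
    out[n] = '\0';
    return use_fallback;
}
```

With this sketch, ascii_or_fallback("\xe2\x80\xa2", ".", buf, sizeof buf)
stores "." whether the local iconv() rejects the bullet or replaces it with
'*'. A literal '*' in the input is kept, because iconv() then reports zero
non-identical conversions.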

That's my requirement from the application side. I don't know whether an
iconv() implementation can help here, given the limited interface of iconv.

Maybe there could be an alternative to //TRANSLIT in the iconv_open()
argument, that would specify e.g. that tag characters and <compat> and <wide>
replacements in UnicodeData.txt are OK but other replacements are not OK?
Where either
  - OK means a conversion that does not increment the return value,
  - "not OK" means a conversion that increments the return value,
or
  - OK means a conversion that increments the return value,
  - "not OK" means an error return (-1 / EILSEQ).

Bruno

[1] https://pubs.opengroup.org/onlinepubs/9699919799/functions/iconv.html

