mailing list of musl libc
From: Bruno Haible <bruno@clisp.org>
To: bug-gnulib@gnu.org
Cc: Florian Weimer <fweimer@redhat.com>,
	Arjun Shankar <ashankar@redhat.com>,
	Rich Felker <dalias@libc.org>,
	"A. Wilcox" <awilfox@adelielinux.org>,
	musl@lists.openwall.com
Subject: [musl] Re: iconv replacements
Date: Thu, 30 Jul 2020 11:39:43 +0200
Message-ID: <79808844.bqqDOferBU@omega>
In-Reply-To: <87d04djrz2.fsf@oldenburg2.str.redhat.com>

[Dropping bug-bison from CC]

> > Yes and no. The code is not making assumptions about a particular iconv()
> > implementation. But it needs to distinguish two categories of replacements
> > done by iconv():
> >   - those that are harmless (for example when replacing a Unicode TAG
> >     character U+E00xx with an empty output),
> >   - those that are better not presented to the user, if the programmer has
> >     specified a fallback (for example, replacing all non-ASCII characters
> >     with NUL, '?', or '*').
> >
> > The standards don't help in making the distinction.
> >
> > Therefore whether you consider said glibc and libiconv behaviour as
> > "non-conforming" or not is irrelevant.
> 
> Could you sketch briefly what you need?  We have identified some issues
> with the existing iconv interface.  If we add an enhancement, it would
> make sense to cover these requirements.

POSIX [1] says:

  "If iconv() encounters a character in the input buffer that is valid, but for
   which an identical character does not exist in the target codeset, iconv()
   shall perform an implementation-defined conversion on this character."

  "The iconv() function shall ... return the number of non-identical conversions performed."

This is sufficient for detecting that iconv() did something that the
application might or might not like.
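
As an illustration of the return-value semantics quoted above (a minimal
sketch, not code from unicodeio): the helper name convert_counting() and its
buffer handling are assumptions made for this example. It performs a one-shot
conversion and hands iconv()'s count of non-identical conversions back to the
caller.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert the NUL-terminated string `in` from
   `fromcode` to `tocode`, store a NUL-terminated result in `out`, and
   store iconv()'s return value (the number of non-identical conversions)
   in *nonidentical.  Returns 0 on success, -1 on failure with errno set. */
static int convert_counting(const char *tocode, const char *fromcode,
                            const char *in, char *out, size_t outsize,
                            size_t *nonidentical)
{
    assert(in != NULL && out != NULL && outsize > 0 && nonidentical != NULL);

    iconv_t cd = iconv_open(tocode, fromcode);
    if (cd == (iconv_t)-1)
        return -1;

    char *inptr = (char *)in;        /* iconv() wants char **, input is not modified */
    size_t inleft = strlen(in);
    char *outptr = out;
    size_t outleft = outsize - 1;    /* keep one byte for the terminating NUL */

    size_t res = iconv(cd, &inptr, &inleft, &outptr, &outleft);
    if (res != (size_t)-1) {
        /* Emit any pending shift sequence for stateful encodings. */
        if (iconv(cd, NULL, NULL, &outptr, &outleft) == (size_t)-1)
            res = (size_t)-1;
    }
    int saved_errno = errno;
    iconv_close(cd);

    if (res == (size_t)-1) {
        errno = saved_errno;
        return -1;
    }
    *outptr = '\0';
    *nonidentical = res;
    return 0;
}
```

A nonzero count only says that some implementation-defined substitution
happened; it does not say which one, and that is the gap described in the rest
of this message.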

For decent application behaviour in UTF-8, legacy 8-bit, and ASCII locales
I wrote a module 'unicodeio' that accepts an ASCII fallback given by the
programmer. For example, for the string "François Pinard" a fallback
"Francois Pinard" can be given, and for the string "•" a fallback "." can
be given.

This code needs to analyze what iconv() actually did and distinguish
replacements that are OK (no need to activate the ASCII fallback) from those
that are worse than the ASCII fallback. For example:
  - Replacing 'ç' with '?' (NetBSD, Solaris 11) or '*' (musl) or NUL (IRIX)
    is worse than the ASCII fallback.
  - Replacing a Unicode tag character with an empty string is OK.
  - Replacing GREEK SMALL LETTER MU with MICRO SIGN is OK.
  - Replacing FULLWIDTH COLON with ':' is OK (most likely equivalent to the
    ASCII fallback).
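
The distinction in the list above could be approximated by a check like the
following. This is a sketch under stated assumptions, not the actual unicodeio
code: the function name needs_ascii_fallback() is hypothetical, and it assumes
the caller converts exactly one character per iconv() call, so that `out` holds
only that character's replacement.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical heuristic: given iconv()'s return value `res` and the
   `outlen` output bytes produced for ONE input character, return 1 if the
   programmer-supplied ASCII fallback should be used, 0 if the output is
   acceptable. */
static int needs_ascii_fallback(size_t res, const char *out, size_t outlen)
{
    assert(out != NULL || outlen == 0);

    if (res == (size_t)-1)
        return 1;               /* conversion failed outright */
    if (res == 0)
        return 0;               /* identical conversion, nothing to judge */

    /* Non-identical conversion: reject the generic single-byte placeholders
       named in the list above ('?', '*', NUL). */
    if (outlen == 1 && (out[0] == '?' || out[0] == '*' || out[0] == '\0'))
        return 1;

    /* Anything else (empty output for a tag character, MICRO SIGN for mu,
       ':' for FULLWIDTH COLON) is treated as acceptable. */
    return 0;
}
```

Because the check only runs when the return value is positive, a genuine '?'
or '*' in the input, which converts identically, does not trigger the fallback.
The weakness is that the list of placeholder bytes has to be maintained by hand
for each iconv() implementation.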

That's my requirement from the application side. I don't know whether an
iconv() implementation can help here, given the limited interface of iconv.

Maybe there could be an alternative to //TRANSLIT in the iconv_open()
argument that would specify, e.g., that tag characters and the <compat> and
<wide> replacements from UnicodeData.txt are OK but other replacements are not?
Here, either
  - OK means a conversion that does not increment the return value,
  - "not OK" means a conversion that increments the return value,
or
  - OK means a conversion that increments the return value,
  - "not OK" means an error return (-1 / EILSEQ).
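
From the application side, the two candidate conventions would be consumed
roughly as follows. This is purely illustrative: no such iconv_open() option
exists in the text above, and the enum and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: which of the two reporting conventions the iconv()
   implementation follows for "not OK" replacements. */
enum report_convention {
    BAD_INCREMENTS_COUNT,   /* OK: count unchanged;    not OK: count incremented */
    BAD_IS_EILSEQ           /* OK: count incremented;  not OK: -1 / EILSEQ       */
};

/* Return 1 if the application should use its ASCII fallback, given
   iconv()'s return value `res` under the chosen convention. */
static int want_fallback(enum report_convention conv, size_t res)
{
    assert(conv == BAD_INCREMENTS_COUNT || conv == BAD_IS_EILSEQ);

    if (res == (size_t)-1)
        return 1;           /* no usable output under either convention */
    if (conv == BAD_INCREMENTS_COUNT)
        return res > 0;     /* any counted replacement is a "not OK" one */
    return 0;               /* counted replacements are the harmless ones */
}
```

Under the first convention the application keeps today's "return value > 0"
test and simply gets fewer false alarms; under the second it only has to check
for an error return.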

Bruno

[1] https://pubs.opengroup.org/onlinepubs/9699919799/functions/iconv.html

