Re: [ Guidance ] Potential New Routines; Requesting Help

mailing list of musl libc

From: JeanHeyd Meneide <phdofthehouse@gmail.com>
To: Rich Felker <dalias@libc.org>
Cc: Florian Weimer <fw@deneb.enyo.de>, musl@lists.openwall.com
Subject: Re: [ Guidance ] Potential New Routines; Requesting Help
Date: Thu, 26 Dec 2019 00:43:45 -0500
Message-ID: <CANHA4OhM_pFXd_GKDVL1yxSAMBRko_MPr=euJ7Y5MvEkv70+JA@mail.gmail.com>
In-Reply-To: <20191226021354.GE30412@brightrain.aerifal.cx>

Dear Rich Felker and Florian Weimer,

     Thank you for taking an interest in this!

On Wed, Dec 25, 2019 at 9:14 PM Rich Felker <dalias@libc.org> wrote:
>
> On Wed, Dec 25, 2019 at 09:07:05PM +0100, Florian Weimer wrote:
> >
> > I'm somewhat concerned that the C multibyte functions are too broken
> > to be useful.  There is a at least one widely implemented character
> > set (Big5 as specified for HTML5) which does not fit the model implied
> > by the standard.  Big5 does not have shift states, but a C
> > implementation using UTF-32 for wchar_t has to pretend it has because
> > correct conversion from Unicode to Big5 needs lookahead and cannot be
> > performed one character at a time.
>
> I don't think this can be modeled with shift states. C explicitly
> forbids a stateful wchar_t encoding/multi-wchar_t-characters. Shift
> states would be meaningful for the other direction.
>
> In any case I don't think it really matters. There are no existing
> implementations with this version of Big5 (with the offending HKSCS
> characters included) as the locale charset, since it can't work, and
> there really is no good reason to be adding *new* locale encodings.
> The reason we (speaking of the larger community; musl doesn't) have
> non-UTF-8 locales is legacy compatibility for users who need or insist
> on keeping them.

     I have no intention of adding new locale-based charsets (and I
absolutely agree that we should not be adding any more). That being
said, I want to focus on the main part of this, which is the ability
to model existing encodings that have shift states, multi-character
expansions, or both.

     It is true that wchar_t is invariably busted. There is no way a wide
character string can be multi-unit: that ship sailed when wchar_t
was specified as is, and when it was codified in various APIs such as
mbtowc/wctomb and friends. I will, however, note that the paper
specifically wants to add the restartable versions of the "single unit"
wc and mb to/from functions. I chose the restartable forms because,
in C2x, a defect report was accepted that clarified the original
intent of the single-character functions with respect to their
restartable ("r") versions:

> "After discussion, the committee concluded that mbstate was already specified to handle this case, and as such the second interpretation is intended. The committee believes that there is an underspecification, and solicited a further paper from the author along the lines of the second option. Although not discussed a Suggested Technical Corrigendum can be found at N2040." -- WG14, April 2016, http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2059.htm#dr_488

     This means that while the wcto* and *towc functions are broken, the
restartable versions need not be. By returning a value of 0 (e.g.,
from c8rtomb) or a value of (size_t)-3 (e.g., from mbrtoc8), we can
write out multiple characters based on the input and any data stored
in the mbstate_t. This allows us to handle, e.g., UTF-16 for c16,
multi-conversions for c8, and more. My understanding is that for one
of the encodings referenced in the linked mailing list post (TSCII),
this would cover that use case. My understanding is also that for some
Big5 encodings it would -- as you stated -- treat ambiguous leading
sequences as a shift state, accumulate data in the mbstate_t, and then
write out data once the sequence is made unambiguous by further data
provided to the next call.
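
     To make the return-value convention above concrete, here is a
minimal sketch using UTF-8 <-> UTF-16 as the worked case, since C11
already specifies the same protocol for mbrtoc16 and c16rtomb. The
names sketch_state, sketch_mbrtoc16 and sketch_c16rtomb are
hypothetical stand-ins, not libc functions, and the state struct is
only an assumption about what an implementation might keep in its
mbstate_t. It skips overlong/surrogate validation, does not
special-case NUL, and does not buffer partial input the way a real
implementation would.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for mbstate_t (which is opaque in real libcs). */
typedef struct {
    uint16_t pending_low; /* decode: low surrogate still owed to the caller */
    uint16_t saved_high;  /* encode: high surrogate waiting for its partner */
} sketch_state;

/* Decode UTF-8 to UTF-16, one code unit per call (mbrtoc16-style returns):
 *   > 0         bytes consumed, *out holds a code unit
 *   (size_t)-3  *out holds a unit stored by the previous call; no input used
 *   (size_t)-2  input too short (simplified: nothing is buffered)
 *   (size_t)-1  invalid input */
size_t sketch_mbrtoc16(uint16_t *out, const unsigned char *s, size_t n,
                       sketch_state *st)
{
    uint32_t cp;
    size_t len;

    if (st->pending_low) {           /* second half of a surrogate pair */
        *out = st->pending_low;
        st->pending_low = 0;
        return (size_t)-3;
    }
    if (n == 0) return (size_t)-2;

    if (s[0] < 0x80)                { cp = s[0];        len = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { cp = s[0] & 0x1F; len = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; len = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { cp = s[0] & 0x07; len = 4; }
    else return (size_t)-1;
    if (n < len) return (size_t)-2;

    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return (size_t)-1;
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    if (cp < 0x10000) { *out = (uint16_t)cp; return len; }

    /* Needs two units: emit the high one now, remember the low one. */
    cp -= 0x10000;
    *out = (uint16_t)(0xD800 | (cp >> 10));
    st->pending_low = (uint16_t)(0xDC00 | (cp & 0x3FF));
    return len;
}

/* Encode one UTF-16 code unit as UTF-8 (c16rtomb-style returns):
 * a high surrogate is only recorded in the state and 0 bytes are written;
 * the following low surrogate completes the pair and writes 4 bytes. */
size_t sketch_c16rtomb(unsigned char *out, uint16_t c, sketch_state *st)
{
    uint32_t cp;

    if (st->saved_high) {
        if (c < 0xDC00 || c > 0xDFFF) return (size_t)-1; /* unpaired high */
        cp = 0x10000 + (((uint32_t)(st->saved_high - 0xD800) << 10)
                        | (uint32_t)(c - 0xDC00));
        st->saved_high = 0;
    } else if (c >= 0xD800 && c <= 0xDBFF) {
        st->saved_high = c;          /* accumulate; nothing to write yet */
        return 0;
    } else if (c >= 0xDC00 && c <= 0xDFFF) {
        return (size_t)-1;           /* lone low surrogate */
    } else {
        cp = c;
    }

    if (cp < 0x80) { out[0] = (unsigned char)cp; return 1; }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

     Feeding the four bytes F0 9F 98 80 to sketch_mbrtoc16 consumes all
four and yields the first code unit; a second call with no further
input returns (size_t)-3 and yields the second. In the other
direction, sketch_c16rtomb returns 0 for the high surrogate and 4 for
the low one. An accumulate-then-emit scheme for an ambiguous Big5
prefix would presumably follow the same shape, with the state holding
the undecided prefix instead of a surrogate.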

     Is my understanding incorrect? Is there an implementation
limitation I am missing here? I would hate to do this the wrong way
and make the encoding situation even worse: my goal is to absolutely
ensure we can transition legacy encodings to statically-known
encodings.

Sincerely,
JeanHeyd Meneide
