From: Rich Felker
To: JeanHeyd Meneide
Cc: Florian Weimer, musl@lists.openwall.com
Subject: Re: [ Guidance ] Potential New Routines; Requesting Help
Date: Mon, 30 Dec 2019 12:28:22 -0500
Message-ID: <20191230172822.GH30412@brightrain.aerifal.cx>
References: <87zhfg185y.fsf@mid.deneb.enyo.de> <20191226021354.GE30412@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com

On Thu, Dec 26, 2019 at 12:43:45AM -0500, JeanHeyd Meneide wrote:
> Dear Rich Felker and Florian Weimer,
>
> Thank you for taking an interest in this!
>
> On Wed, Dec 25, 2019 at 9:14 PM Rich Felker wrote:
> >
> > On Wed, Dec 25, 2019 at 09:07:05PM +0100, Florian Weimer wrote:
> > >
> > > I'm somewhat concerned that the C multibyte functions are too broken
> > > to be useful. There is at least one widely implemented character
> > > set (Big5 as specified for HTML5) which does not fit the model implied
> > > by the standard. Big5 does not have shift states, but a C
> > > implementation using UTF-32 for wchar_t has to pretend it has because
> > > correct conversion from Unicode to Big5 needs lookahead and cannot be
> > > performed one character at a time.
> >
> > I don't think this can be modeled with shift states. C explicitly
> > forbids a stateful wchar_t encoding/multi-wchar_t-characters. Shift
> > states would be meaningful for the other direction.
> >
> > In any case I don't think it really matters. There are no existing
> > implementations with this version of Big5 (with the offending HKSCS
> > characters included) as the locale charset, since it can't work, and
> > there really is no good reason to be adding *new* locale encodings.
> > The reason we (speaking of the larger community; musl doesn't) have
> > non-UTF-8 locales is legacy compatibility for users who need or insist
> > on keeping them.
>
> I have no intentions on adding new locale-based charsets (and I
> absolutely agree that we should not be adding anymore). That being
> said, I want to focus on the main part of this, which is the ability
> to model existing encodings which have both/either shift states and/or
> multi-character expansions.

I think you misunderstood my remarks here. I was not talking about
invention of new charsets (which we seem to agree should not happen),
but making it possible to use existing legacy charsets which were
previously not usable as a locale's encoding due to limitations of the
C APIs. I see making that possible as counter-productive.
It does not serve to let users keep doing something they were already
doing (compatibility), only to do something newly backwards.

> It is true wchar_t is invariably busted. There is no way a wide
> character string can be multi-unit: that ship sailed when wchar_t
> was specified as is, and when it was codified in various APIs such as
> mbtowc/wctomb and friends. I will, however, note that the paper
> specifically wants to add the Restartable versions of "single unit" wc
> and mb to/from functions.

I don't follow. mbrtowc and wcrtomb already exist and have since at
least C99.

> > "After discussion, the committee concluded that mbstate was
> > already specified to handle this case, and as such the second
> > interpretation is intended. The committee believes that there is
> > an underspecification, and solicited a further paper from the
> > author along the lines of the second option. Although not
> > discussed a Suggested Technical Corrigendum can be found at
> > N2040." -- WG14, April 2016,
> > http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2059.htm#dr_488
>
> This means that while wcto* and *towc functions are broken, the

I don't see them as broken. They support every encoding that has ever
worked in the past as the encoding for a locale (tautologically). The
only way they're "broken" is if you want to add new locale encodings
that weren't previously supportable.

> Is my understanding incorrect? Is there an implementation
> limitation I am missing here? I would hate to do this the wrong way
> and make the encoding situation even worse: my goal is to absolutely
> ensure we can transition legacy encodings to statically-known
> encodings.

The C locale API does not exist to convert arbitrary encodings, only
the one in use as the locale's encoding.
Its purpose is to abstract the concept of how text is encoded in the
system's/user's environment such that applications can honor it while
(with recent versions of C) being able to determine the identity of
characters in terms of Unicode and emit specific characters provided
that they're representable in the encoding.

Conversion of arbitrary encodings other than the one in use by the
locale requires a different API that takes encodings by name or some
other identifier. The standard (POSIX) API for this is iconv, which
has plenty of limitations of its own, some the same as what you've
identified.

Rich
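
A minimal sketch of the name-based conversion interface mentioned above,
assuming a POSIX system whose iconv recognizes the source encoding name.
The helper name to_utf8 and its all-or-nothing error policy are
illustrative assumptions, not something proposed in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert inlen bytes of `in` from the encoding
 * named `from` into UTF-8, independent of the current locale.
 * Returns the number of bytes written to `out`, or (size_t)-1 on
 * failure (unknown encoding name, invalid or truncated input, or
 * insufficient output space), with errno left as set by iconv. */
size_t to_utf8(const char *from, const char *in, size_t inlen,
               char *out, size_t outcap)
{
    assert(from != NULL && in != NULL && out != NULL);

    /* The encoding is selected by name, not by the locale. */
    iconv_t cd = iconv_open("UTF-8", from);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    /* iconv takes char ** for historical reasons; the input bytes
     * themselves are not modified. */
    char *inp = (char *)in;
    char *outp = out;
    size_t inleft = inlen;
    size_t outleft = outcap;

    size_t r = iconv(cd, &inp, &inleft, &outp, &outleft);

    /* Ask the converter to emit any pending reset sequence. This is a
     * no-op for a stateless target such as UTF-8 but keeps the pattern
     * general. */
    if (r != (size_t)-1)
        r = iconv(cd, NULL, NULL, &outp, &outleft);

    int saved = errno;          /* iconv_close may clobber errno */
    iconv_close(cd);
    errno = saved;

    return r == (size_t)-1 ? (size_t)-1 : outcap - outleft;
}
```

A call such as to_utf8("UCS-4LE", "\xE9\x00\x00\x00", 4, buf, sizeof buf)
would be expected to store the two-byte UTF-8 form of U+00E9 in buf.
Which legacy encoding names are accepted (Big5 variants, for instance)
depends on the iconv implementation in use.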