From: Rich Felker
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: [PATCH] implement a private state for the uchar.h functions
Date: Tue, 11 Nov 2014 09:39:00 -0500
Message-ID: <20141111143900.GJ22465@brightrain.aerifal.cx>
References: <1415528228.2457.1188.camel@eris.loria.fr> <20141111032110.GG22465@brightrain.aerifal.cx> <1415713982.2457.1704.camel@eris.loria.fr>
In-Reply-To: <1415713982.2457.1704.camel@eris.loria.fr>
To: musl@lists.openwall.com

On Tue, Nov 11, 2014 at 02:53:02PM +0100, Jens Gustedt wrote:
> On Monday, 10.11.2014, 22:21 -0500, Rich Felker wrote:
> > On Sun, Nov 09, 2014 at 11:18:08AM +0100, Jens Gustedt wrote:
> > > The C standard is imperative on that:
> > >
> > > 7.28.1 ...
> > > If ps is a null pointer, each function uses its own internal
> > > mbstate_t object instead, which is initialized at program startup to
> > > the initial conversion state;
> >
> > Thanks. Actually I originally had this functionality and removed it
> > because it seemed to be unnecessary, due to the requirement being
> > buried in that introductory text rather than the descriptions of the
> > individual functions. I figured the committee had just intentionally
> > decided not to copy this backwards functionality from the old
> > multibyte functions into the new uchar ones, but sadly that's not the
> > case...
>
> Yes, these are bizarre additions. That makes almost a dozen different
> static states for all of the different restartable functions.
>
> Perhaps I misunderstood something, but isn't it that in the direction
> mbs -> charXX_t these functions can handle surrogates, but the other
> way around is not possible?

Both directions are possible. c16rtomb returns 0 and saves the first
surrogate as state for the next call. mbrtoc16 writes out the first
surrogate, saves the second in the state, and returns 4 on the first
call, then returns (size_t)-3 and writes out the second surrogate on
the next call. Yes, it's hideously ugly, but it was trivial to
implement.

> From that new unicode support in C11 I get some of the ideas, but some
> things remain quite mysterious
>
> - having a standard way to specify unicode characters inside a string
>   of any kind through \u and \U is really a great achievement

Yes and no. I don't think anyone really wants to use these. They're
unreadable except when used extremely sparingly, and embedding natural
language text in source is widely frowned upon anyway, which limits the
usefulness. But it is nice to at least have a way if/when you need it.

> - introducing types charXX_t and constants and literals with u and U
>   is already less clear. The only thing that can be done with them is
>   conversion, there are no auxiliary functions.
>   In particular the character counting and classification problems
>   for surrogates are still not solved.

The provided conversions to/from multibyte are useless because the
current multibyte character set cannot necessarily even represent the
characters. Initially I thought they should have provided conversions
to/from wchar_t, but that would also be useless since wchar_t is only
officially meaningful for characters in the current (multibyte)
character set. The only conversions that would actually be useful are
between UTF-8, UTF-16, and UTF-32, but those are all well-defined in an
implementation-independent manner and thus something you can provide
yourself (even though at least 70% of people doing so do it wrong...),
which I can only assume is the reason the language standard doesn't
provide them.

> - introducing a u8 prefix for strings that guarantees utf8 encoding
>   for mbs sounds nice. But then there is nothing that relates these
>   to "normal" string literals. What are we supposed to do with these?

Process them with your own code, or just pass them to external
interfaces that expect UTF-8 (e.g. filesystem structures, network
protocols, etc.).

Rich
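
For the conversions between UTF-8, UTF-16, and UTF-32 described above
as something you can provide yourself, here is a minimal sketch in C.
It is an illustration only, not musl code: the names utf8_decode and
utf16_encode are hypothetical, and the choice to reject overlong forms,
encoded surrogates, and values above U+10FFFF is an assumption about
what "doing it right" should mean here. It works on plain byte arrays,
so it does not depend on the current locale or on <uchar.h>.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one UTF-8 sequence from s (at most n
 * bytes) into *cp. Returns the number of bytes consumed, or 0 if the
 * input is empty, truncated, or not well-formed UTF-8. */
static size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    if (n == 0) return 0;
    unsigned char b = s[0];
    if (b < 0x80) { *cp = b; return 1; }

    size_t len;
    uint32_t c, min;
    /* Lead bytes 0xC0, 0xC1 and 0xF5..0xFF can never start a valid
     * sequence, so they fall through to the final else. */
    if (b >= 0xC2 && b <= 0xDF)      { len = 2; c = b & 0x1F; min = 0x80; }
    else if (b >= 0xE0 && b <= 0xEF) { len = 3; c = b & 0x0F; min = 0x800; }
    else if (b >= 0xF0 && b <= 0xF4) { len = 4; c = b & 0x07; min = 0x10000; }
    else return 0;

    if (n < len) return 0;                       /* truncated */
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;     /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min) return 0;                       /* overlong encoding */
    if (c >= 0xD800 && c <= 0xDFFF) return 0;    /* surrogate in UTF-8 */
    if (c > 0x10FFFF) return 0;                  /* beyond Unicode range */
    *cp = c;
    return len;
}

/* Hypothetical helper: encode cp as UTF-16 into out. Returns the number
 * of 16-bit units written (1 or 2), or 0 if cp is not a valid scalar
 * value. */
static size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
    if (cp > 0x10FFFF) return 0;
    if (cp < 0x10000) { out[0] = (uint16_t)cp; return 1; }
    cp -= 0x10000;                               /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high (first) surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low (second) surrogate */
    return 2;
}
```

Feeding the four bytes F0 9F 98 80 to utf8_decode yields U+1F600, and
utf16_encode then produces the pair D83D DE00. That pair corresponds to
the two surrogates mbrtoc16 would hand out across its two calls in a
UTF-8 locale, just without the hidden state.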