From mboxrd@z Thu Jan 1 00:00:00 1970
From: Florian Weimer
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: [ Guidance ] Potential New Routines; Requesting Help
Date: Wed, 25 Dec 2019 21:07:05 +0100
Message-ID: <87zhfg185y.fsf@mid.deneb.enyo.de>
Reply-To: musl@lists.openwall.com
To: JeanHeyd Meneide
Cc: musl@lists.openwall.com
In-Reply-To: (JeanHeyd Meneide's message of "Tue, 24 Dec 2019 18:06:50 -0500")

* JeanHeyd Meneide:

> I hope this e-mail finds you doing well this Holiday Season! I am
> interested in developing a few fast routines for text encoding for
> musl after the positive reception of a paper for the C Standard
> related to fast conversion routines:
>
> https://thephd.github.io/vendor/future_cxx/papers/source/C%20-%20Efficient%20Character%20Conversions.html

I'm somewhat concerned that the C multibyte functions are too broken to be useful.
There is at least one widely implemented character set (Big5 as specified for HTML5) which does not fit the model implied by the standard. Big5 does not have shift states, but a C implementation using UTF-32 for wchar_t has to pretend it does, because correct conversion from Unicode to Big5 needs lookahead and cannot be performed one character at a time. This would at least affect the proposed c8rtomb function. I posted a brief review of the problematic charsets in glibc here: