From: JeanHeyd Meneide
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: [ Guidance ] Potential New Routines; Requesting Help
Date: Thu, 26 Dec 2019 00:43:45 -0500
References: <87zhfg185y.fsf@mid.deneb.enyo.de> <20191226021354.GE30412@brightrain.aerifal.cx>
In-Reply-To: <20191226021354.GE30412@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com
To: Rich Felker
Cc: Florian Weimer, musl@lists.openwall.com

Dear Rich Felker and Florian Weimer,

     Thank you for taking an interest in this!

On Wed, Dec 25, 2019 at 9:14 PM Rich Felker wrote:
>
> On Wed, Dec 25, 2019 at 09:07:05PM +0100, Florian Weimer wrote:
> >
> > I'm somewhat concerned that the C multibyte functions are too broken
> > to be useful.  There is a at least one widely implemented character
> > set (Big5 as specified for HTML5) which does not fit the model implied
> > by the standard.  Big5 does not have shift states, but a C
> > implementation using UTF-32 for wchar_t has to pretend it has because
> > correct conversion from Unicode to Big5 needs lookahead and cannot be
> > performed one character at a time.
>
> I don't think this can be modeled with shift states. C explicitly
> forbids a stateful wchar_t encoding/multi-wchar_t-characters.
> Shift states would be meaningful for the other direction.
>
> In any case I don't think it really matters. There are no existing
> implementations with this version of Big5 (with the offending HKSCS
> characters included) as the locale charset, since it can't work, and
> there really is no good reason to be adding *new* locale encodings.
> The reason we (speaking of the larger community; musl doesn't) have
> non-UTF-8 locales is legacy compatibility for users who need or insist
> on keeping them.

I have no intention of adding new locale-based charsets (and I
absolutely agree that we should not be adding any more). That being
said, I want to focus on the main part of this, which is the ability
to model existing encodings that have shift states, multi-character
expansions, or both.

It is true that wchar_t is invariably busted. There is no way a wide
character string can be multi-unit: that ship sailed when wchar_t was
specified as is, and when it was codified in various APIs such as
mbtowc/wctomb and friends. I will, however, note that the paper
specifically wants to add the restartable versions of the "single
unit" wc and mb to/from functions. I chose the restartable forms
because, in C2x, a defect report was accepted that clarified the
original intent of the single-character functions with respect to
their R versions:

> "After discussion, the committee concluded that mbstate was already
> specified to handle this case, and as such the second interpretation
> is intended. The committee believes that there is an
> underspecification, and solicited a further paper from the author
> along the lines of the second option. Although not discussed a
> Suggested Technical Corrigendum can be found at N2040."
> -- WG14, April 2016,
> http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2059.htm#dr_488

This means that while the wcto* and *towc functions are broken, the
restartable versions need not be.
By returning a value of "0" (e.g., for c8rtomb) or a value of "-3"
(e.g., for mbrtoc8), we can write out multiple characters based on the
input and any data stored in mbstate_t. This allows us to handle e.g.
UTF-16 for c16, multi-conversions for c8, and more.

My understanding is that for one of the referenced encodings in the
linked mailing list post (TSCII), this would cover that use case. My
understanding is also that for some Big5 encodings it would -- as you
stated -- treat ambiguous leading sequences as a shift state,
accumulate data in the mbstate_t, and then write out data if the
sequence is made unambiguous by further data provided to the next
call.

Is my understanding incorrect? Is there an implementation limitation I
am missing here? I would hate to do this the wrong way and make the
encoding situation even worse: my goal is to absolutely ensure we can
transition legacy encodings to statically-known encodings.

Sincerely,
JeanHeyd Meneide
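
For illustration only, the "accumulate in the state, return 0, write
out on a later call" behaviour described above could be sketched as
below. Nothing here comes from the paper, musl, or any existing libc:
demo_state merely stands in for mbstate_t, demo_c32rtomb and
demo_encode are hypothetical names, and the byte values are intended
to follow the Big5-HKSCS pairs discussed in the thread but should be
treated as illustrative assumptions rather than a verified table.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for mbstate_t: remembers one ambiguous lead. */
typedef struct { uint32_t pending; } demo_state;

/* Largest number of bytes one call can write (flushed lead + ASCII). */
#define DEMO_MB_MAX 3

/*
 * Hypothetical restartable encoder, toy subset of a Big5-HKSCS-like
 * charset (assumed values): ASCII, U+00CA, U+00EA, and those two
 * letters followed by U+0304 or U+030C.
 * Returns the number of bytes written to out. A return of 0 means the
 * input was absorbed into *st and nothing was written yet.
 * Returns (size_t)-1 for code points outside the toy subset; in that
 * case a previously pending lead may already have been written to out.
 */
static size_t demo_c32rtomb(unsigned char *out, uint32_t c, demo_state *st)
{
    size_t n = 0;

    if (st->pending) {
        uint32_t base = st->pending;
        st->pending = 0;
        if (c == 0x0304 || c == 0x030C) {
            /* The lead is now unambiguous: emit the combined character. */
            out[0] = 0x88;
            if (base == 0xCA)
                out[1] = (c == 0x0304) ? 0x62 : 0x64;
            else
                out[1] = (c == 0x0304) ? 0xA3 : 0xA5;
            return 2;
        }
        /* Not a pair: flush the lead alone, then encode c below. */
        out[n++] = 0x88;
        out[n++] = (base == 0xCA) ? 0x66 : 0xA7;
    }

    if (c == 0xCA || c == 0xEA) {
        /* Ambiguous lead: remember it and wait for the next call. */
        st->pending = c;
        return n;
    }
    if (c < 0x80) {
        /* ASCII, including the terminating 0 that flushes the state. */
        out[n++] = (unsigned char)c;
        return n;
    }
    return (size_t)-1;
}

/* Encodes count code points into dst; dst needs DEMO_MB_MAX * count bytes. */
static size_t demo_encode(unsigned char *dst, const uint32_t *src, size_t count)
{
    demo_state st = { 0 };
    size_t total = 0;

    for (size_t i = 0; i < count; i++) {
        size_t n = demo_c32rtomb(dst + total, src[i], &st);
        if (n == (size_t)-1)
            return (size_t)-1;
        total += n;
    }
    return total;
}
```

In this sketch, feeding U+00CA, U+0304, U+00CA, 'A', 0 produces no
output on the first and third calls, two bytes on the second, three
bytes on the fourth (the flushed lead plus 'A'), and one byte for the
terminator. The caller never sees a multi-unit wide character; all of
the lookahead lives in the state object.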