From: JeanHeyd Meneide
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: [ Guidance ] Potential New Routines; Requesting Help
Date: Mon, 30 Dec 2019 22:58:27 -0500
In-Reply-To: <20191230195744.GJ30412@brightrain.aerifal.cx>
References: <20191230173106.GI30412@brightrain.aerifal.cx> <20191230195744.GJ30412@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com
To: Rich Felker
Cc: musl@lists.openwall.com

On Mon, Dec 30, 2019 at 2:57 PM Rich Felker wrote:
> I don't think these interfaces gives you an "out" in a way that's
> fully conforming. The C model is that there's a set of characters
> supported in the current locale, and each of them has one or more
> multibyte representations (possibly involving shift states) and a
> single wide character representation. Converting between UTF-16 or
> UTF-32 and wchar_t outside the scope of characters that exist in the
> current locale isn't presently a meaningful concept, and wouldn't
> enable you to get meaningful results from wctype.h functions, etc.
> (Would you propose having a second set of such functions for char32_t
> to handle that? Really it sounds like what you want is an out to
> deprecate wchar_t and use char32_t in its place, which wouldn't be a
> bad idea...)

This is actually something I am extremely interested in tackling.
But I need to make sure everyone can get their data in current
applications from mb/wide characters to char32_t. Then a potential
proposal can be worked on that takes case mapping, case folding, and
all of the other useful things Unicode has brought to the table and
works with Unicode code points. One of the things I saw before is that
there was a previous proposal to extend wctype.h with other functions
that was very large, and despite being well motivated it did not
succeed in WG14.

Also on my list of things is the fact that char16_t and char32_t do not
necessarily have to be Unicode (__STDC_UTF_16__, __STDC_UTF_32__). This
means that if we settle on char32_t for these interfaces, we may set a
potential trap for users who migrate and then try to port to platforms
where c16 does not mean UTF-16, and c32 does not mean UTF-32. In
coordinating with a few static analysis vendors who cover a very large
range of compiler implementations, both C and C++, they have reportedly
not yet found a compiler which makes char16/32_t not be UTF-16/32 (some
platforms forget to define the macros but still use those encodings). I
hope that in the future a paper can be brought to WG14 to make those
encodings required for char16/32_t, rather than checking the macro and
leaving users hung out to dry. Right now everything de-facto works, but
I worry...

Still. I want to introduce each logical piece of functionality in its
own paper, with its own scope and motivation. This, in my opinion,
seems to work much better. Work on transition and replacement, then
deprecate the things which are known from experience to be bad. I don't
know if my plan is going to work, but having nobody vote against my
first ever WG14 proposal is a good start and I want to be careful to
not get stuck in Committee on mega-proposals that scare people.
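
For reference, the macro check for char16_t/char32_t mentioned above
could be sketched roughly as follows. The helper names
(char16_is_utf16, char32_is_utf32, c32_is_scalar_value) are invented
for illustration and are not part of any proposal; the only standard
pieces are <uchar.h> and the __STDC_UTF_16__ / __STDC_UTF_32__ macros.

```c
#include <stdbool.h>
#include <uchar.h>

/* Hypothetical helper: true only when the implementation documents
 * that char16_t values are UTF-16 encoded. */
static bool char16_is_utf16(void)
{
#ifdef __STDC_UTF_16__
    return true;
#else
    return false; /* encoding is implementation-defined */
#endif
}

/* Hypothetical helper: true only when the implementation documents
 * that char32_t values are UTF-32 encoded. */
static bool char32_is_utf32(void)
{
#ifdef __STDC_UTF_32__
    return true;
#else
    return false; /* encoding is implementation-defined */
#endif
}

/* Hypothetical helper: is c a Unicode scalar value, i.e. at most
 * U+10FFFF and not a surrogate? The answer only means something
 * when char32_is_utf32() is true. */
static bool c32_is_scalar_value(char32_t c)
{
    return c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF);
}
```

A portable caller would consult char32_is_utf32() before treating a
char32_t as a code point. As noted above, a platform may use UTF-32
and still forget to define the macro, so a false result here does not
prove the encoding is something else.
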

> Solving these problems for implementations burdened by a legacy *wrong
> choice* of definition of wchar_t is not possible by adding more
> interfaces alone; it requires a lot of changes to the underlying
> abstract model of what a character is in C. I'm not really in favor of
> such changes. They complicate and burden existing working
> implementations for the sake of ones that made bad choices. Windows in
> particular *can* and *should* fix wchar_t to be 32-bit. The Windows
> API uses WCHAR, not wchar_t, anyway, so that a change in wchar_t is
> really not a big deal for interface compatibility, and has conformance
> problems like wprintf treating %s/%ls incorrectly that require
> breaking changes to fix. Good stdlib implementations on Windows
> already fix these things.

They should, absolutely. Still, I think that preventing lossy
conversions for wchar_t usage on platforms where the wide character is
used to interface with the system is a worthwhile endeavor. I don't
think it is feasible (or would ever fly in WG14) to change what wchar_t
is and how it behaves: but I would rather invest time in implementing
interfaces that can offer better and more complete functionality. I'm
trying to keep my changes well-scoped, motivated, and small.

> The __STDC_ISO_10646__ macro is the way to determine that the encoding
> of wchar_t is Unicode (or some subset if WCHAR_MAX doesn't admit the
> full range). Otherwise it's not something you can meaningfully work
> with except as an abstract number, but in that case you just want to
> avoid it as much as possible and convert directly between multibyte
> characters and char16_t/char32_t. I don't see how converting directly
> between wchar_t and char16_t/char32_t is more useful, even if it is a
> prettier factorization of the code.

It is an abstract number with no meaning to the developer, but the
platform (e.g., IBM using various GB encodings for wchar_t on certain
platforms where __STDC_ISO_10646__ is not defined) knows that meaning.
My intention is that by letting the Standard Library and platform
handle it, you can get from a blob of abstract numbers to meaningful
text in a Standard way. Not only for wchar_t, but for mb strings too.

> A far more useful thing to know than wchar_t encoding is the multibyte
> encoding. POSIX gives you this in nl_langinfo(CODESET) but plain C has
> no equivalent. I'd actually like to see WG14 adopt this into plain C.

This is actually something I am considering! There are a few sister
papers related to this percolating through another Standards Committee
right now; I want to see how that goes before bringing it to WG14. But,
I think that functionality should come in addition to - not instead of
- additional conversion functions. Platforms own wchar_t and multibyte
char encodings: if the user has to write conversion routines themselves
after checking the equivalent of nl_langinfo, we may end up with
incomplete or half-done support for encodings in many programs!

> On musl (where I'm familiar with performance properties),
> byte-at-a-time conversion is roughly half the speed of bulk, which
> looks big but is diminishingly so if you're actually doing something
> with the result (just converting to wchar_t for its own sake is not
> very useful). Character-at-a-time is probably somewhat less slow than
> byte-at-a-time. When I wrote this I put in heavy effort to make
> byte/character-at-a-time not horribly slow, because it's normally the
> natural programming model. Wide character strings are not an idiomatic
> type to work with in C.

If it is still okay, I will put my best effort into making sure the
character-at-a-time and similar functions are something you and other
musl contributors can be happy with!

Sincerely,
JeanHeyd
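
For readers following the thread, here is a rough sketch of the
character-at-a-time route from a multibyte string to char32_t that
exists today, using C11's mbrtoc32. The helper name mbs_to_c32s, its
signature, and its truncation behaviour are invented for illustration
and are not the interfaces being proposed. It assumes the caller has
already chosen a locale with setlocale and that the input is
NUL-terminated.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical helper: decode the multibyte string src (in the
 * current locale's encoding) into out, writing at most cap elements.
 * Returns the number of char32_t written, or (size_t)-1 if the input
 * is invalid or ends in the middle of a character. Input that does
 * not fit in cap elements is silently truncated. */
static size_t mbs_to_c32s(const char *src, char32_t *out, size_t cap)
{
    mbstate_t st;
    size_t len, n = 0;

    assert(src != NULL && out != NULL);
    memset(&st, 0, sizeof st);   /* start in the initial shift state */
    len = strlen(src);

    while (len > 0 && n < cap) {
        char32_t c;
        size_t r = mbrtoc32(&c, src, len, &st);

        if (r == (size_t)-1 || r == (size_t)-2)
            return (size_t)-1;   /* invalid or incomplete sequence */
        if (r == (size_t)-3) {   /* extra output, no input consumed */
            out[n++] = c;
            continue;
        }
        if (r == 0)              /* decoded a null character */
            break;
        out[n++] = c;            /* one character from r bytes */
        src += r;
        len -= r;
    }
    return n;
}
```

Whether the resulting values are Unicode code points still depends on
__STDC_UTF_32__; the loop itself only relies on the locale's multibyte
encoding, which the platform owns.
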