From: 罗勇刚(Yonggang Luo)
Newsgroups: gmane.linux.lib.musl.general,gmane.comp.standards.posix.austin.general,gmane.comp.compilers.clang.devel
Subject: Re: Re: [cfe-dev] Is that getting wchar_t to be 32bit on win32 a good idea for compatible with Unix world by implement posix layer on win32 API?
Date: Sun, 10 May 2015 20:19:46 +0800
References: <20150509103645.GG29035@port70.net> <20150509200535.GK17573@brightrain.aerifal.cx>
In-Reply-To: <20150509200535.GK17573@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com
To: Rich Felker
Cc: John Sully, Karsten Blees, musl@lists.openwall.com, dplakosh@cert.org, austin-group-l@opengroup.org, hsutter@microsoft.com, Clang Dev, James McNellis

2015-05-10 4:05 GMT+08:00 Rich Felker:
> On Sat, May 09, 2015 at 07:19:14PM +0800, 罗勇刚(Yonggang Luo) wrote:
>> 2015-05-09 18:36 GMT+08:00 Szabolcs Nagy:
>> > * John Sully [2015-05-09 00:55:12 -0700]:
>> >> In my opinion you almost never want 32-bit wide characters once you learn
>> >> of their limitations. Most people assume that if they use them they can
>> >> return to the one character -> one glyph idiom like ASCII. But Unicode is
>> >
>> > wchar_t must be at least 21 bits on a system that supports unicode
>> > in any locale: it has to be able to represent all code points of the
>> > supported character set.
>> >
>> > in practice this means that the only conforming definition to iso c
>> > (and thus posix, c++ and other standards based on c) is a 32bit wchar_t
>> > (the signedness can be chosen freely).
>> >
>> > so the definition is not based on what "you almost never want" or what
>> > "most people assume".
>> >
>> > if the goal is to provide a posix implementation then 16bit wchar_t
>> > is not an option (assuming the system wants to be able to communicate
>> > with the external world that uses unicode text).
>>
>> wchar_t is not the only way to communicate with the external world, and
>> it's also not suited for communicating with the external world,
>
> Of course it's not. UTF-8 is. But per both ISO C and POSIX, any
> character the locale supports has a representation as wchar_t. If
> wchar_t is only 16-bit, then you fundamentally can't support all of
> Unicode in the locale's encoding. mbrtowc has to fail with EILSEQ for
> 4-byte characters, regex functions cannot process 4-byte characters,
> etc. Such a system is conforming to the requirements for C and
> POSIX but does not support Unicode (in full) at the locale level.
>
>> the C11 standard never restricts wchar_t's width, and
>> for POSIX, most APIs are implemented in
>> utf8, and indeed, Windows needs the posix layer mainly because of those
>> APIs that use utf8, not the wchar_t APIs,
>> so for communication purposes, making wchar_t 32 bit on Win32 is
>> not a good idea,
>>
>> And portable text processing (including win32) apps or libs
>> would and should never depend on wchar_t being 32 bits wide.
>
> If __STDC_ISO_10646__ is defined, wchar_t must have at least 21 value
> bits. Applications which are portable only to systems where this macro
> is defined, or which have some fallback (like dropping multilingual
> text support) for systems where it's not defined, CAN make such
> assumptions.
>
>> And C11/C++11 already provide uchar.h to provide cross-platform
>> char16_t and char32_t, so there is no reason to make wchar_t
>> 32bit
>> on win32 in order to support posix on win32.
>
> If wchar_t is 16-bit, you can't represent non-BMP characters in
> char32_t because they can't be part of the locale's character set. All
> char32_t buys you then is 16 wasted zero bits.
>
>> We intend to create a usable posix layer on win32, not
>> a theoretical POSIX layer that would be useless. On win32, we should
>> consider the de facto state of things
>> on win32.
>
> Uselessness is a big assumption you're making that's not supported by
> data. If you actually provide a working POSIX layer, you'll have
> pretty much any application that's currently working on Linux, BSDs,
> etc. (with actual portable code, not system-specific #ifdefs) working
> on Windows with few or no changes. If you do that with 32-bit wchar_t,
> they'll support Unicode fully. If you do it with 16-bit wchar_t, then
> the ones that are using the locale system for character handling will
> have to be refitted with extra layers to support more than the BMP,
> and those patches probably (hopefully) won't be accepted upstream.
>
> The only applications that would benefit from having 16-bit wchar_t
> are existing Windows applications that are not going to have much use
> for a POSIX layer anyway, and they can be fixed very easily with
> search-and-replace (no new code layers).

Search-and-replace is not as easy as you say. Windows and POSIX have a
lot of incompatibilities, and that won't change. Or we could just
implement a virtual machine running on Win32 that would be compatible
with everything in POSIX, but that would be useless.

The intention of providing a POSIX layer is to reduce the burden on
developers who intend to create cross-platform software (including
Windows), not on developers who only intend to develop apps for
Linux/POSIX. So such a layer should preserve the usable parts of POSIX
and drop the parts that just create inconvenience. A 32-bit wchar_t is
obviously not suited to Win32.

My intention is not to develop a virtual-machine-like layer such as
cygwin, but a native Win32 layer that provides most POSIX functions
with utf8 support. That would solve most portability issues and work on
win32 just like a win32 app, not like a Unix/Linux app.

>
> Rich

-- 
Respectfully,
罗勇刚
Yours sincerely,
Yonggang Luo