From: Roy
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: Re: Re: iconv Korean and Traditional Chinese research so far
Date: Tue, 06 Aug 2013 23:11:23 +0800
References: <20130804165152.GA32076@brightrain.aerifal.cx> <20130805191246.GM221@brightrain.aerifal.cx> <20130806133205.GS221@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com
To: musl@lists.openwall.com

On Tue, 06 Aug 2013 21:32:05 +0800, Rich Felker wrote:

> On Tue, Aug 06, 2013 at 02:14:33PM +0800, Roy wrote:
>> >My impression (please correct me if I'm wrong) is that you can't use
>> >Big5-UAO as the system encoding on modern versions of Windows (just
>> >ancient ones where you install unmaintained third-party software that
>> >hacks the system charset tables)
>>
>> It doesn't "hack" the nls file but replaces it with a UAO-enabled
>> CP950 nls file.
>> The executable (setup program) is generated with NSIS (Nullsoft
>> Scriptable Install System).
>> Since the nls file format hasn't changed from NT 3.1 in 1993 until
>> now, NT 6.3 (i.e. Win 8.1 "Blue"), the UAO-enabled CP950 nls will
>> continue to work in newer versions of Windows unless MS replaces
>> the nls file format with something different.
>
> OK, thanks for clarifying. I'd still consider it a ways into the
> "hack" domain if the OS vendor still is not supporting it directly,
> but it does make a difference that it still works "cleanly". I was
> under the impression that these sorts of things change between
> Windows versions in ways that would preclude using old, unmaintained
> patches like this. I agree that just the fact that certain OS vendors
> do not support an encoding is not in itself a reason not to support
> it.
>
>> >and that it's not supported in GNU
>> >libiconv. If this is the case, and especially if Big5-UAO's main use
>> >is on a telnet-based BBS where everybody is using special telnet
>> >clients that have their own Big5-UAO converters,
>>
>> GNU libiconv doesn't even support IBM EBCDIC (both SBCS and stateful
>> SBCS+DBCS)!
>>
>> So does it matter whether GNU libiconv supports a given encoding?
>> (Yes, glibc iconv (or say, the gconv modules) does support both IBM
>> EBCDIC SBCS and stateful SBCS+DBCS encodings)
>
> I was under the impression that GNU libiconv was in sync with glibc's
> iconv, but I have not checked this. I actually was more interested in
> glibc's, which is in widespread use. glibc's inclusion or exclusion of
> a feature is not in itself a reason to include or exclude it, but
> supporting something that glibc supports does have the added
> motivation that it will increase compatibility with what programs are
> expecting.
>
>> >I'd find it really
>> >hard to justify trying to support this. But I'm open to hearing
>> >arguments on why we should, if you believe it's important.
>>
>> I think it would be nice to have a build/link time option for those
>> "unpopular" encodings.
>>
>> >>For static linking, can we have conditional linking like QT does?
>> >
>> >My feeling is that it's a tradeoff, and probably has more pros than
>> >cons. Unlike QT, musl's iconv is extremely small.
>>
>> I would add "right now" here. When we add more encodings later, the
>> iconv module will be bigger than now, and people will need to find a
>> way to conditionally compile the encodings they need (for both
>> dynamic and static linking)
>
> It's never been my intent to add more encodings later (aside from pure
> non-table-based variants of existing ones, like the ISO-2022 versions)
> once coverage is complete, at least not as built-in features. This can
> be discussed if you think there are reasons it needs to change, but up
> until now, the plan has been to support:
>
> - ISO-8859 based 8-bit encodings
> - Other 8-bit encodings with actual legacy usage (mainly Cyrillic)
> - JIS 0208 based encodings
> - KS X 1001 based encodings
> - GB 2312 and supersets
> - Big5 and supersets
>
> All of those except Big5 and supersets are now supported, so short of
> any change, my position is that right now we're discussing the "last"
> significant addition to musl's iconv.
>
> Some things that are definitely outside the scope of musl's iconv:
>
> - Anything whose characters are not present in Unicode
> - Anything PUA-based (really, same as above)
> - Newly invented encodings with no historical encoded data
>
> What's more borderline is where UAO falls: encodings that have neither
> governmental or language-body-authority support nor any vendor support
> from other software vendors, but for which there is at least one major
> corpus of historical data and/or current usage for the encoding by
> users of the language(s) whose characters are encoded.
>
> However, based on the file at
>
> http://moztw.org/docs/big5/table/uao250-b2u.txt
>
> a number of the mappings UAO defines are into the private use area.
> This would generally preclude support (as this is a font-specific
> encoding, not a Unicode encoding) unless the affected characters have
> since been added to Unicode and could be remapped to the correct
> codepoints. Do you know the status on this?

Those are in the Big5-2003 compatibility code range. Big5-2003 is in
the appendix section of CNS11643, but it is rarely used since no OS or
application supports it. So skipping the PUA mappings is fine.

>
> I'm also still unclear on whether this is a superset of HKSCS (it's
> definitely not directly, but maybe it is if the PUA mappings are
> corrected; I did not do any detailed checks but just noted the lack of
> mappings to the non-BMP codepoints HKSCS uses).

No, it isn't. There are some code conflicts between HKSCS (2001/2004)
and UAO.

>
>> >Even with all the
>> >above, the size of iconv.o will be under 130k, maybe closer to 110k.
>> >If you actually use iconv in your program, this is a small price to
>> >pay for having it fully functional. On the other hand, if linking it
>> >is conditional, you have to consider who makes the decision, and when.
>> >If it's at link time for each application, that's probably too much of
>> >a musl-specific version.
>>
>> Since statically linking libc iconv is a new area (other libcs
>> don't touch this topic much), I think we can create a standard for
>> statically linking specified encoding tables at link time.
>> (This is also a reason "why libc should provide a unique
>> identifier with a preprocessor define")
>
> I don't see how "creating a standard" for doing this would make the
> situation any better. Most software authors these days are at best
> tolerant of the existence of static linking, and more often hostile to
> it. They're not going to add specific build behavior for static
> linking, and even if they do, they're likely to get it wrong, in which
> case the user ends up stuck with binaries that can't process input in
> their language.
>
> Rich
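
To make the two table questions above concrete (which UAO mappings land
in the private use area, and where UAO and HKSCS assign the same code
differently), here is a rough sketch of how one might check a
byte-to-Unicode table. It assumes each line has the form
"0xBBBB 0xUUUU" with optional "#" comments, which may not match the
real uao250-b2u.txt layout. All function names are made up for
illustration, and this is not how musl builds its tables.

```python
# Sketch only: the "0xBBBB 0xUUUU" line format and every name below are
# assumptions for illustration; adjust the parser to the real table layout.

# Private use ranges as defined by Unicode.
PUA_RANGES = (
    (0xE000, 0xF8FF),      # BMP Private Use Area
    (0xF0000, 0xFFFFD),    # Supplementary Private Use Area-A (plane 15)
    (0x100000, 0x10FFFD),  # Supplementary Private Use Area-B (plane 16)
)


def is_pua(codepoint):
    """Return True if the codepoint lies in any Unicode private use range."""
    return any(lo <= codepoint <= hi for lo, hi in PUA_RANGES)


def parse_table(text):
    """Parse 'legacy-code unicode-codepoint' pairs, one pair per line.

    Anything after '#' is treated as a comment; lines with fewer than
    two fields are skipped.
    """
    mapping = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue
        mapping[int(fields[0], 16)] = int(fields[1], 16)
    return mapping


def drop_pua(mapping):
    """Return a copy of the mapping without entries that map into the PUA."""
    return {code: cp for code, cp in mapping.items() if not is_pua(cp)}


def conflicts(table_a, table_b):
    """List legacy codes that both tables define but map differently."""
    shared = table_a.keys() & table_b.keys()
    return sorted(code for code in shared if table_a[code] != table_b[code])
```

Running parse_table() on the UAO file and on an HKSCS table, then
calling conflicts() on the two results, would list the codes where one
encoding cannot be a superset of the other; drop_pua() gives the subset
of mappings that stays once the PUA entries are skipped.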