Status of Big5 and extensions

mailing list of musl libc

* Status of Big5 and extensions
@ 2013-08-07 16:50 Rich Felker
  2013-08-08  0:18 ` Roy
  0 siblings, 1 reply; 9+ messages in thread
From: Rich Felker @ 2013-08-07 16:50 UTC (permalink / raw)
  To: musl

OK, so after a lot of research and discussion, I'm about to commit the
first part of Big5 support in iconv. The plain "Big5" charset name is
going to be the maximal set everybody agrees on, which, as far as I
can tell, is just CP950. (Actually even IBM's Big5 variant in ICU
differs from CP950 in a few places, but it's just wrong. The one
ideograph where it differs conflicts with Unihan.txt, which is
authoritative on which Unicode characters encode which character
identities from historical CJK charsets.)

As for extensions, my understanding of HKSCS is to a point now where I
feel we can add it (charset name "Big5-HKSCS"), based on the 2008
government publication which has a few new characters beyond the old
versions in most software. (Thank you nsz for helping dig up all the
files and researching how they differ!) However there are a few
technical difficulties to implementation: the Unicode codepoints span
a 17-bit range rather than just a 16-bit range, so we need an
efficient way of doing the mappings. What's worse, several HKSCS
codepoints map to Latin characters with multiple combining marks which
have no precomposed representation. Supporting these will require
extending iconv to be able to output two or more characters of output
for each unit of input, which is mildly error-prone with the current
design, so I may hold off on HKSCS support until I overhaul some of
the core logic of the iconv function.
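
To make the multi-character output problem concrete, here is a minimal Python sketch (not musl's C code) of a decode step that can yield one or two code points per input unit, and of an output step that refuses to write a partial sequence. The names `MULTI`, `decode_unit`, and `emit` are hypothetical. The four two-code-point entries are recalled from the HKSCS-2008 tables and are an assumption that should be verified against the official publication.

```python
# Hypothetical sketch; names and table layout are assumptions, not musl's.
# Big5-HKSCS codes that decode to TWO code points (base letter + combining mark).
# Values recalled from the HKSCS-2008 tables; verify before relying on them.
MULTI = {
    0x8862: (0x00CA, 0x0304),
    0x8864: (0x00CA, 0x030C),
    0x88A3: (0x00EA, 0x0304),
    0x88A5: (0x00EA, 0x030C),
}

def decode_unit(code, single):
    """Decode one two-byte unit to a tuple of one or more code points.

    `single` is a dict from Big5 code to a single code point.
    """
    if code in MULTI:
        return MULTI[code]
    if code in single:
        return (single[code],)
    raise ValueError("unmapped code 0x%04X" % code)

def emit(out, room, cps):
    """Append all of `cps` to `out` only if they fit in `room` slots.

    Returns the number of slots used, or -1 with nothing written when
    there is not enough room (analogous to iconv failing with E2BIG).
    """
    if len(cps) > room:
        return -1
    out.extend(cps)
    return len(cps)
```

The all-or-nothing check in `emit` is the part that tends to go wrong when a converter was designed around exactly one output character per input unit.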

Now, the hard part: Taiwan extensions. I appreciate all the help from
Roy, but I'm still not to the point of having anything nearly ready to
be added. The UAO extension set has at least 290 mappings to
codepoints in the Unicode Private Use Areas (PUA), which makes it
unsuitable for inclusion as-is. This may be a resolvable issue if
these 290 characters all exist in Unicode and could be remapped, but I
do not feel any of us are qualified to determine this. So UAO is
probably still a long way away from being able to be adopted into
iconv, unless we have authoritative data on the identities of these
characters in the form of something from an official standards body or
at the very least multiple major vendors (enough to ensure that any
future standardization would be consistent and non-controversial).
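
As a rough illustration of how such PUA mappings can be counted, the following Python sketch filters a mapping table for code points in the three Unicode Private Use ranges. The function names and the tiny sample table are hypothetical; only the PUA range boundaries are taken from the Unicode Standard.

```python
# The three Private Use ranges defined by Unicode.
PUA_RANGES = (
    (0xE000, 0xF8FF),       # BMP Private Use Area
    (0xF0000, 0xFFFFD),     # Supplementary Private Use Area-A
    (0x100000, 0x10FFFD),   # Supplementary Private Use Area-B
)

def is_private_use(cp):
    """True if the code point lies in any Private Use range."""
    return any(lo <= cp <= hi for lo, hi in PUA_RANGES)

def pua_entries(mapping):
    """Return the subset of a {big5_code: code_point} table that maps into PUA."""
    return {code: cp for code, cp in mapping.items() if is_private_use(cp)}

# Hypothetical sample: one ordinary mapping and one PUA mapping.
sample = {0xA451: 0x5341, 0xFA40: 0xE000}
flagged = pua_entries(sample)
```

Running `len(pua_entries(table))` over a full extension table would give the kind of count quoted above.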

If there are other supersets of CP950 (possibly subsets of UAO) which
would be useful for supporting Taiwanese users, I would very much like
to understand the situation on them and whether it's feasible to
include them in iconv at this point. Perhaps the most important thing
to know at this point is what practical difficulties exist for
Taiwanese users limited to CP950, and which extension charsets could
solve this. I feel like issues like "I cannot write my own name" are
on a completely different level than "I can't mix Japanese text in
with my legacy-encoded data [when I really should be using Unicode for
this anyway]". The former sort of issue is something that demands some
sort of support, even if imperfect or mildly hackish, whereas the
latter is not a justification for hacks and bypassing proper standards
processes.

Rich


* Re: Status of Big5 and extensions
  2013-08-07 16:50 Status of Big5 and extensions Rich Felker
@ 2013-08-08  0:18 ` Roy
  2013-08-08  2:11   ` Rich Felker
  0 siblings, 1 reply; 9+ messages in thread
From: Roy @ 2013-08-08  0:18 UTC (permalink / raw)
  To: musl

On Thu, 08 Aug 2013 00:50:44 +0800, Rich Felker <dalias@aerifal.cx> wrote:

> OK, so after a lot of research and discussion, I'm about to commit the
> first part of Big5 support in iconv. The plain "Big5" charset name is
> going to be the maximal set everybody agrees on, which, as far as I
> can tell, is just CP950. (Actually even IBM's Big5 variant in ICU
> differs from CP950 in a few places, but it's just wrong. The one
> ideograph where it differs conflicts with Unihan.txt, which is
> authoritative on which Unicode characters encode which character
> identities from historical CJK charsets.)
>
> As for extensions, my understanding of HKSCS is to a point now where I
> feel we can add it (charset name "Big5-HKSCS"), based on the 2008
> government publication which has a few new characters beyond the old
> versions in most software. (Thank you nsz for helping dig up all the
> files and researching how they differ!) However there are a few
> technical difficulties to implementation: the Unicode codepoints span
> a 17-bit range rather than just a 16-bit range, so we need an
> efficient way of doing the mappings. What's worse, several HKSCS
> codepoints map to Latin characters with multiple combining marks which
> have no precomposed representation. Supporting these will require
> extending iconv to be able to output two or more characters of output
> for each unit of input, which is mildly error-prone with the current
> design, so I may hold off on HKSCS support until I overhaul some of
> the core logic of the iconv function.
>
> Now, the hard part: Taiwan extensions. I appreciate all the help from
> Roy, but I'm still not to the point of having anything nearly ready to
> be added. The UAO extension set has at least 290 mappings to
> codepoints in the Unicode Private Use Areas (PUA), which makes it
> unsuitable for inclusion as-is. This may be a resolvable issue if
> these 290 characters all exist in Unicode and could be remapped, but I
> do not feel any of us are qualified to determine this. So UAO is
> probably still a long way away from being able to be adopted into
> iconv, unless we have authoritative data on the identities of these
> characters in the form of something from an official standards body or
> at the very least multiple major vendors (enough to ensure that any
> future standardization would be consistent and non-controversial).
>

I asked one of the creators of UAO at
http://forum.moztw.org/viewtopic.php?f=10&t=40174
and got a reply explaining those PUA characters in terms of their Big5 codes:
0xFA40 - 0xFA63: Reserved for user-defined characters
0xC8A5 - 0xC8B0: for Big5-2003 compliance
Others: "Art characters" from the ChinaSea character set (csw 1.0, i.e.
cswsmin.tte), which are not mapped to Unicode and whose code positions
are not occupied by HKSCS.

> If there are other supersets of CP950 (possibly subsets of UAO) which
> would be useful for supporting Taiwanese users, I would very much like
> to understand the situation on them and whether it's feasible to
> include them in iconv at this point. Perhaps the most important thing
> to know at this point is what practical difficulties exist for
> Taiwanese users limited to CP950, and which extension charsets could
> solve this. I feel like issues like "I cannot write my own name" are
> on a completely different level than "I can't mix Japanese text in
> with my legacy-encoded data [when I really should be using Unicode for
> this anyway]". The former sort of issue is something that demands some
> sort of support, even if imperfect or mildly hackish, whereas the
> latter is not a justification for hacks and bypassing proper standards
> processes.
>
> Rich




* Re: Re: Status of Big5 and extensions
  2013-08-08  0:18 ` Roy
@ 2013-08-08  2:11   ` Rich Felker
  2013-08-08  3:48     ` Roy
  2013-08-08  3:53     ` Szabolcs Nagy
  0 siblings, 2 replies; 9+ messages in thread
From: Rich Felker @ 2013-08-08  2:11 UTC (permalink / raw)
  To: musl

On Thu, Aug 08, 2013 at 08:18:45AM +0800, Roy wrote:
> >Now, the hard part: Taiwan extensions. I appreciate all the help from
> >Roy, but I'm still not to the point of having anything nearly ready to
> >be added. The UAO extension set has at least 290 mappings to
> >codepoints in the Unicode Private Use Areas (PUA), which makes it
> >unsuitable for inclusion as-is. This may be a resolvable issue if
> >these 290 characters all exist in Unicode and could be remapped, but I
> >do not feel any of us are qualified to determine this. So UAO is
> >probably still a long way away from being able to be adopted into
> >iconv, unless we have authoritative data on the identities of these
> >characters in the form of something from an official standards body or
> >at the very least multiple major vendors (enough to ensure that any
> >future standardization would be consistent and non-controversial).
> >
> 
> I asked one of the creators of UAO at
> http://forum.moztw.org/viewtopic.php?f=10&t=40174
> and got a reply explaining those PUA characters in terms of their Big5 codes:
> 0xFA40 - 0xFA63: Reserved for user-defined characters

So these can simply be dropped from the mapping.

> 0xC8A5 - 0xC8B0: for Big5-2003 compliance

I do not see any C8xx mappings in Big5-2003, so this explanation does
not seem plausible.

Since you mentioned Big5-2003, I've been looking into it, and it seems
like it should be part of our base Big5 mapping. Diffing moztw's
version of it against CP950.TXT (after cleaning up both), I get:

-0xA156 0x2013
+0xA156 0x2015
...
-0xA1C2 0x00AF
+0xA1C2 0x203E
...
-0xA2A4 0x2550
-0xA2A5 0x255E
-0xA2A6 0x256A
-0xA2A7 0x2561
+0xA2A4 0x2501
+0xA2A5 0x251D
+0xA2A6 0x253F
+0xA2A7 0x2525

The above all looks like pure nonstandard Mozilla behavior. What's
next is more interesting:

-0xA2CC 0x5341
-0xA2CD 0x5344
-0xA2CE 0x5345
+0xA2CC 0x3038
+0xA2CD 0x3039
+0xA2CE 0x303A

This looks to me like an actual bug in CP950.TXT from Unicode.
Unihan.txt says U+5341 is Big5's A451, so it can't also be A2CC. Same
for the others. Indeed, CP950.TXT maps these in a non-one-to-one way,
which is in itself almost certainly a bug.
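
A check for this kind of non-one-to-one mapping can be sketched in a few lines of Python. The function name `find_many_to_one` is hypothetical, and the sample pairs are limited to mappings mentioned in this message (A2CC and A451 both going to U+5341, A2CD going to U+5344).

```python
from collections import defaultdict

def find_many_to_one(pairs):
    """Given (big5_code, code_point) pairs, return the code points that
    are the target of more than one Big5 code, with their sorted codes."""
    by_cp = defaultdict(list)
    for code, cp in pairs:
        by_cp[cp].append(code)
    return {cp: sorted(codes) for cp, codes in by_cp.items() if len(codes) > 1}

# Sample taken from the mappings discussed above.
sample = [(0xA2CC, 0x5341), (0xA451, 0x5341), (0xA2CD, 0x5344)]
dups = find_many_to_one(sample)
```

With this sample, only U+5341 is reported, because the second Big5 code for U+5344 is not part of the sample.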

+0xA3C0 0x2400
+0xA3C1 0x2401
+...
+0xA3DF 0x241F
+0xA3E0 0x2421

These are all part of ETEN omitted from CP950, and should definitely
be in Big5 base.

+0xC6A1 0x2460
+0xC6A2 0x2461
+0xC6A3 0x2462
+0xC6A4 0x2463
+...
+0xC7F1 0x30F5
+0xC7F2 0x30F6

These are also from ETEN. Notably, the Cyrillic block that immediately
follows these is still omitted in Big5-2003, for reasons that appear
political. Since ETEN, UAO, and HKSCS all have it, I see no reason not
to add the Cyrillic block back in here.

Finally:

-0xF9FA 0x256D
-0xF9FB 0x256E
-0xF9FC 0x2570
-0xF9FD 0x256F
+0xF9FA 0x2554
+0xF9FB 0x2557
+0xF9FC 0x255A
+0xF9FD 0x255D

This looks like pure Mozilla cruft. Is there any justification for
these sorts of changes?

Does the above analysis look correct? If so I will go ahead and merge
the above changes to Big5 support into musl.
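
For anyone wanting to reproduce a comparison like the one above, here is a minimal Python sketch. It assumes both tables have already been cleaned up into lines of the form `0xXXXX 0xYYYY` with optional `#` comments; the function names are hypothetical and the sample lines are copied from the hunks above.

```python
def parse_table(text):
    """Parse '0xCODE 0xUNICODE' lines into a dict, ignoring '#' comments
    and lines that lack a target code point."""
    table = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue
        table[int(fields[0], 16)] = int(fields[1], 16)
    return table

def diff_tables(old, new):
    """Return sorted (code, old_cp, new_cp) triples where the tables differ.
    A missing mapping on either side is reported as None."""
    result = []
    for code in sorted(set(old) | set(new)):
        a, b = old.get(code), new.get(code)
        if a != b:
            result.append((code, a, b))
    return result

old = parse_table("0xA156 0x2013\n0xA1C2 0x00AF\n")
new = parse_table("0xA156 0x2015\n0xA1C2 0x203E\n0xA3C0 0x2400\n")
changes = diff_tables(old, new)
```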

BTW, the only non-PUA part of UAO within the standard Big5 range
(89x157 grid) that won't be mapped with these changes is the stuff
right after the Cyrillic block. This part does not conflict with
current HKSCS, so if I had good sources from both the Taiwan and HK
sides supporting the position that these mappings will not conflict
with other extensions in current use or with future expansion of
HKSCS, we could consider including that part of UAO in the base Big5
mapping. At this point this is only an idea for consideration, but we
can keep it in mind.
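
For reference, the 89x157 grid can be indexed as in the Python sketch below. It assumes the 89 rows are lead bytes 0xA1 through 0xF9 and the 157 columns are trail bytes 0x40-0x7E followed by 0xA1-0xFE; that layout and the name `big5_index` are assumptions for illustration, not a description of musl's code.

```python
ROWS, COLS = 89, 157  # 63 trail bytes in 0x40-0x7E plus 94 in 0xA1-0xFE

def big5_index(lead, trail):
    """Map a (lead, trail) byte pair to a 0-based index into the grid,
    or return None if the pair lies outside the assumed standard range."""
    if not 0xA1 <= lead <= 0xF9:
        return None
    if 0x40 <= trail <= 0x7E:
        col = trail - 0x40
    elif 0xA1 <= trail <= 0xFE:
        col = trail - 0xA1 + 63
    else:
        return None
    return (lead - 0xA1) * COLS + col
```

A flat table of `ROWS * COLS` entries indexed this way covers every code discussed above between 0xA140 and 0xF9FE.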

> Others: "Art characters" from the ChinaSea character set (csw 1.0, i.e.
> cswsmin.tte), which are not mapped to Unicode and whose code positions
> are not occupied by HKSCS.

So basically dingbats? 

Rich


* Re: Re: Status of Big5 and extensions
  2013-08-08  2:11   ` Rich Felker
@ 2013-08-08  3:48     ` Roy
  2013-08-08  3:53     ` Szabolcs Nagy
  1 sibling, 0 replies; 9+ messages in thread
From: Roy @ 2013-08-08  3:48 UTC (permalink / raw)
  To: musl

On Thu, 08 Aug 2013 10:11:19 +0800, Rich Felker <dalias@aerifal.cx> wrote:

> On Thu, Aug 08, 2013 at 08:18:45AM +0800, Roy wrote:
>> >Now, the hard part: Taiwan extensions. I appreciate all the help from
>> >Roy, but I'm still not to the point of having anything nearly ready to
>> >be added. The UAO extension set has at least 290 mappings to
>> >codepoints in the Unicode Private Use Areas (PUA), which makes it
>> >unsuitable for inclusion as-is. This may be a resolvable issue if
>> >these 290 characters all exist in Unicode and could be remapped, but I
>> >do not feel any of us are qualified to determine this. So UAO is
>> >probably still a long way away from being able to be adopted into
>> >iconv, unless we have authoritative data on the identities of these
>> >characters in the form of something from an official standards body or
>> >at the very least multiple major vendors (enough to ensure that any
>> >future standardization would be consistent and non-controversial).
>> >
>>
>> I asked one of the creators of UAO at
>> http://forum.moztw.org/viewtopic.php?f=10&t=40174
>> and got a reply explaining those PUA characters in terms of their Big5 codes:
>> 0xFA40 - 0xFA63: Reserved for user-defined characters
>
> So these can simply be dropped from the mapping.
>
>> 0xC8A5 - 0xC8B0: for Big5-2003 compliance
>
> I do not see any C8xx mappings in Big5-2003, so this explanation does
> not seem plausible.

Neither do I, but this area appears in the draft, marked as reserved:
http://web.archive.org/web/20041210015709/http://pingyeh.net/big5/big5-2003-v3/summary.txt

>
> Since you mentioned Big5-2003, I've been looking into it, and it seems
> like it should be part of our base Big5 mapping. Diffing moztw's
> version of it against CP950.TXT (after cleaning up both), I get:
>
> -0xA156 0x2013
> +0xA156 0x2015
> ...
> -0xA1C2 0x00AF
> +0xA1C2 0x203E
> ...
> -0xA2A4 0x2550
> -0xA2A5 0x255E
> -0xA2A6 0x256A
> -0xA2A7 0x2561
> +0xA2A4 0x2501
> +0xA2A5 0x251D
> +0xA2A6 0x253F
> +0xA2A7 0x2525
>
> The above all looks like pure nonstandard Mozilla behavior. What's
> next is more interesting:
>
> -0xA2CC 0x5341
> -0xA2CD 0x5344
> -0xA2CE 0x5345
> +0xA2CC 0x3038
> +0xA2CD 0x3039
> +0xA2CE 0x303A
>
> This looks to me like an actual bug in CP950.TXT from Unicode.
> Unihan.txt says U+5341 is Big5's A451, so it can't also be A2CC. Same
> for the others. Indeed, CP950.TXT maps these in a non-one-to-one way,
> which is in itself almost certainly a bug.

Moving 0xA2CC-0xA2CE from ideographs to symbols was approved at the
Big5-2003 meeting (minutes, vote #1):
http://web.archive.org/web/20050307160806/http://pingyeh.net/big5/big5-2003-v3/1106-2003-note.txt

>
> +0xA3C0 0x2400
> +0xA3C1 0x2401
> +...
> +0xA3DF 0x241F
> +0xA3E0 0x2421
>
> These are all part of ETEN omitted from CP950, and should definitely
> be in Big5 base.

Those are control symbols, listed in summary.txt.

>
> +0xC6A1 0x2460
> +0xC6A2 0x2461
> +0xC6A3 0x2462
> +0xC6A4 0x2463
> +...
> +0xC7F1 0x30F5
> +0xC7F2 0x30F6
>
> These are also from ETEN. Notably, the Cyrillic block that immediately
> follows these is still omitted in Big5-2003, for reasons that appear
> political. Since ETEN, UAO, and HKSCS all have it, I see no reason not
> to add the Cyrillic block back in here.

The government said the Cyrillic block is not in CNS 11643, so it was
removed. But yes, we should have it.

>
> Finally:
>
> -0xF9FA 0x256D
> -0xF9FB 0x256E
> -0xF9FC 0x2570
> -0xF9FD 0x256F
> +0xF9FA 0x2554
> +0xF9FB 0x2557
> +0xF9FC 0x255A
> +0xF9FD 0x255D
>
> This looks like pure Mozilla cruft. Is there any justification for
> these sorts of changes?

That was another vote at the meeting (minutes, vote #2).

Big5-2003 draft v3 (should be identical to released version) for reference:
http://web.archive.org/web/20051017043312/http://pingyeh.net/big5/big5-2003-v3/big5-2003-draft-v3.csv

BTW, Microsoft voted at the meeting but has not used Big5-2003 in its
newer OSes. WTF?

>
> Does the above analysis look correct? If so I will go ahead and merge
> the above changes to Big5 support into musl.

Mostly correct IMHO.

>
> BTW, the only non-PUA part of UAO within the standard Big5 range
> (89x157 grid) that won't be mapped with these changes is the stuff
> right after the Cyrillic block. This part does not conflict with
> current HKSCS, so if I had good sources from both the Taiwan and HK
> sides supporting the position that these mappings will not conflict
> with other extensions in current use or with future expansion of
> HKSCS, we could consider including that part of UAO in the base Big5
> mapping. At this point this is only an idea for consideration, but we
> can keep it in mind.
>
>> Others: "Art characters" from the ChinaSea character set (csw 1.0, i.e.
>> cswsmin.tte), which are not mapped to Unicode and whose code positions
>> are not occupied by HKSCS.
>
> So basically dingbats?

Not only dingbats, AFAIK.

>
> Rich




* Re: Re: Status of Big5 and extensions
  2013-08-08  2:11   ` Rich Felker
  2013-08-08  3:48     ` Roy
@ 2013-08-08  3:53     ` Szabolcs Nagy
  2013-08-08  4:30       ` Rich Felker
  1 sibling, 1 reply; 9+ messages in thread
From: Szabolcs Nagy @ 2013-08-08  3:53 UTC (permalink / raw)
  To: musl

* Rich Felker <dalias@aerifal.cx> [2013-08-07 22:11:19 -0400]:
> Since you mentioned Big5-2003, I've been looking into it, and it seems
> like it should be part of our base Big5 mapping. Diffing moztw's
> version of it against CP950.TXT (after cleaning up both), I get:

i checked another source for big5-2003 and it is bug compatible
with the moztw one (so it might not be mozilla's fault)
http://www.csie.ntu.edu.tw/~r92030/project/big5/

this source maps C255 to 5F5E instead of 5F5D
(also observed in the icu version of cp950)

> -0xA156 0x2013
> +0xA156 0x2015
> ...
> -0xA1C2 0x00AF
> +0xA1C2 0x203E
> ...
> -0xA2A4 0x2550
> -0xA2A5 0x255E
> -0xA2A6 0x256A
> -0xA2A7 0x2561
> +0xA2A4 0x2501
> +0xA2A5 0x251D
> +0xA2A6 0x253F
> +0xA2A7 0x2525
> 
> The above all looks like pure nonstandard Mozilla behavior. What's
> next is more interesting:
> 
> -0xA2CC 0x5341
> -0xA2CD 0x5344
> -0xA2CE 0x5345
> +0xA2CC 0x3038
> +0xA2CD 0x3039
> +0xA2CE 0x303A
> 
> This looks to me like an actual bug in CP950.TXT from Unicode.
> Unihan.txt says U+5341 is Big5's A451, so it can't also be A2CC. Same
> for the others. Indeed, CP950.TXT maps these in a non-one-to-one way,
> which is in itself almost certainly a bug.
> 
> +0xA3C0 0x2400
> +0xA3C1 0x2401
> +...
> +0xA3DF 0x241F
> +0xA3E0 0x2421
> 
> These are all part of ETEN omitted from CP950, and should definitely
> be in Big5 base.
> 
> +0xC6A1 0x2460
> +0xC6A2 0x2461
> +0xC6A3 0x2462
> +0xC6A4 0x2463
> +...
> +0xC7F1 0x30F5
> +0xC7F2 0x30F6
> 
> These are also from ETEN. Notably, the Cyrillic block that immediately
> follows these is still omitted in Big5-2003, for reasons that appear
> political. Since ETEN, UAO, and HKSCS all have it, I see no reason not
> to add the Cyrillic block back in here.
> 

the C6BF-C6D9 part is incompatible in hkscs and big5-2003
hkscs == uao != big5-2003 for these codes
icu agrees with the old hkscs pua codes so this might be
just a bug in the big5-2003 source

> Finally:
> 
> -0xF9FA 0x256D
> -0xF9FB 0x256E
> -0xF9FC 0x2570
> -0xF9FD 0x256F
> +0xF9FA 0x2554
> +0xF9FB 0x2557
> +0xF9FC 0x255A
> +0xF9FD 0x255D
> 
> This looks like pure Mozilla cruft. Is there any justification for
> these sorts of changes?
> 

these are box drawing chars (like A2A4-A2A7 above),
the diff is double vs light lines

cp950 == hkscs == uao != big5-2003 (and missing from icu)

hkscs maps F9FE to FFED instead of 2593 (cp950,uao,icu)

> Does the above analysis look correct? If so I will go ahead and merge
> the above changes to Big5 support into musl.
> 
> BTW, the only non-PUA part of UAO within the standard Big5 range
> (89x157 grid) that won't be mapped with these changes is the stuff
> right after the Cyrillic block. This part does not conflict with
> current HKSCS, so if I had good sources from both the Taiwan and HK
> sides supporting the position that these mappings will not conflict
> with other extensions in current use or with future expansion of
> HKSCS, we could consider including that part of UAO in the base Big5
> mapping. At this point this is only an idea for consideration, but we
> can keep it in mind.

note that
C87A, C87C, C8A4 are mapped to 2xxxx in hkscs
(old hkscs pua codes agree with uao)



* Re: Re: Status of Big5 and extensions
  2013-08-08  3:53     ` Szabolcs Nagy
@ 2013-08-08  4:30       ` Rich Felker
  2013-08-08  4:50         ` Szabolcs Nagy
  0 siblings, 1 reply; 9+ messages in thread
From: Rich Felker @ 2013-08-08  4:30 UTC (permalink / raw)
  To: musl

On Thu, Aug 08, 2013 at 05:53:21AM +0200, Szabolcs Nagy wrote:
> * Rich Felker <dalias@aerifal.cx> [2013-08-07 22:11:19 -0400]:
> > Since you mentioned Big5-2003, I've been looking into it, and it seems
> > like it should be part of our base Big5 mapping. Diffing moztw's
> > version of it against CP950.TXT (after cleaning up both), I get:
> 
> i checked another source for big5-2003 and it is bug compatible
> with the moztw one (so it might not be mozilla's fault)
> http://www.csie.ntu.edu.tw/~r92030/project/big5/
> 
> this source maps C255 to 5F5E instead of 5F5D
> (also observed in the icu version of cp950)

Unfortunately this mismatches the normative Unihan.txt which says
U+5F5D corresponds to the historical Big5 character C255, so we need
at least some justification for the change if Unihan.txt is buggy.
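
A cross-check of a candidate table against Unihan's `kBigFive` field can be sketched as follows. This assumes Unihan's tab-separated `U+XXXX<TAB>field<TAB>value` line format; the function names are hypothetical, and the two sample lines restate what this thread says Unihan.txt contains rather than quoting the file itself.

```python
def parse_kbigfive(text):
    """Extract {big5_code: code_point} from Unihan-style lines,
    keeping only the kBigFive field."""
    result = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if line.startswith("#") or len(fields) < 3:
            continue
        if fields[1] == "kBigFive":
            result[int(fields[2], 16)] = int(fields[0][2:], 16)
    return result

def conflicts(table, unihan):
    """Return sorted (code, table_cp, unihan_cp) triples where a candidate
    mapping table disagrees with Unihan."""
    return sorted(
        (code, cp, unihan[code])
        for code, cp in table.items()
        if code in unihan and unihan[code] != cp
    )

unihan = parse_kbigfive("U+5F5D\tkBigFive\tC255\nU+5341\tkBigFive\tA451\n")
candidate = {0xC255: 0x5F5E, 0xA451: 0x5341}  # the C255 entry as in the other source
bad = conflicts(candidate, unihan)
```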

> > These are all part of ETEN omitted from CP950, and should definitely
> > be in Big5 base.
> > 
> > +0xC6A1 0x2460
> > +0xC6A2 0x2461
> > +0xC6A3 0x2462
> > +0xC6A4 0x2463
> > +...
> > +0xC7F1 0x30F5
> > +0xC7F2 0x30F6
> > 
> > These are also from ETEN. Notably, the Cyrillic block that immediately
> > follows these is still omitted in Big5-2003, for reasons that appear
> > political. Since ETEN, UAO, and HKSCS all have it, I see no reason not
> > to add the Cyrillic block back in here.
> > 
> 
> the C6BF-C6D9 part is incompatible in hkscs and big5-2003
> hkscs == uao != big5-2003 for these codes
> icu agrees with the old hkscs pua codes so this might be
> just a bug in the big5-2003 source

I believe I've dug up the story on this here:

http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2012-April/035389.html

In short, the Big5-2003 mappings from moztw.org are wrong. The "KANGXI
RADICAL" characters in Unicode are compatibility characters. This
means they have compatibility-equivalents which should be used in
Unicode documents in place of them, much like the Greek letter μ
should be used in place of Latin-1 MICRO SIGN (µ), and only exist for
round-trip compatibility with legacy character sets which encode the
character twice. Since Big5 (unlike CNS 11643) does not encode the
Kangxi radicals twice, using them in a mapping to Unicode is a misuse
of Unicode, regardless of what the mapping table from the standards
body says. Thus, I have no problem with going with the UAO/HKSCS way.
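
The compatibility relationship described above can be observed with Python's standard `unicodedata` module. This is only an illustration of the Unicode data, not part of any proposed iconv change; the helper name `fold_compat` is hypothetical.

```python
import unicodedata

def fold_compat(ch):
    """Replace a compatibility character by its NFKC equivalent."""
    return unicodedata.normalize("NFKC", ch)

# MICRO SIGN folds to GREEK SMALL LETTER MU.
micro = fold_compat("\u00b5")

# KANGXI RADICAL SHORT THREAD (U+2F33) folds to the unified ideograph U+5E7A.
radical = fold_compat("\u2f33")

# The raw decomposition field shows the <compat> tag explicitly.
tag = unicodedata.decomposition("\u2f33")
```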

According to the above link, however, HKSCS has introduced a problem.
They've double mapped U+5E7A and are thus mapping the one in the C6CD
slot to U+2F33 instead, since the FBF4 slot is mapping to U+5E7A. I'm
not sure what the right solution to this is; since we're not
interested in round-trip, it might make the most sense to just ignore
it and map them both to the (same) proper character.
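
If both slots decode to the same character, the reverse (Unicode to Big5) direction has to pick one of them. The Python sketch below shows one possible policy, assumed purely for illustration: the lowest Big5 code wins unless a code is explicitly preferred. The name `build_encoder` and the policy are hypothetical; the two codes are the ones named above.

```python
def build_encoder(decode_table, preferred=()):
    """Build {code_point: big5_code} from a many-to-one decode table.

    By default the lowest Big5 code wins for a duplicated code point;
    codes listed in `preferred` override that choice.
    """
    encode = {}
    for code in sorted(decode_table):
        encode.setdefault(decode_table[code], code)
    for code in preferred:
        encode[decode_table[code]] = code
    return encode

# Both slots decode to U+5E7A, so decoding never produces U+2F33.
decode_table = {0xC6CD: 0x5E7A, 0xFBF4: 0x5E7A}
enc_default = build_encoder(decode_table)
enc_hkscs = build_encoder(decode_table, preferred=[0xFBF4])
```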

> > -0xF9FA 0x256D
> > -0xF9FB 0x256E
> > -0xF9FC 0x2570
> > -0xF9FD 0x256F
> > +0xF9FA 0x2554
> > +0xF9FB 0x2557
> > +0xF9FC 0x255A
> > +0xF9FD 0x255D
> > 
> > This looks like pure Mozilla cruft. Is there any justification for
> > these sorts of changes?
> > 
> 
> these are box drawing chars (like A2A4-A2A7 above),
> the diff is double vs light lines
> 
> cp950 == hkscs == uao != big5-2003 (and missing from icu)
> 
> hkscs maps F9FE to FFED instead of 2593 (cp950,uao,icu)

I don't really care so much about these anyway since they do not
affect linguistic content, just warez .nfo files. ;-)

> > Does the above analysis look correct? If so I will go ahead and merge
> > the above changes to Big5 support into musl.
> > 
> > BTW, the only non-PUA part of UAO within the standard Big5 range
> > (89x157 grid) that won't be mapped with these changes is the stuff
> > right after the Cyrillic block. This part does not conflict with
> > current HKSCS, so if I had good sources from both the Taiwan and HK
> > sides supporting the position that these mappings will not conflict
> > with other extensions in current use or with future expansion of
> > HKSCS, we could consider including that part of UAO in the base Big5
> > mapping. At this point this is only an idea for consideration, but we
> > can keep it in mind.
> 
> note that
> C87A, C87C, C8A4 are mapped to 2xxxx in hkscs
> (old hkscs pua codes agree with uao)

OK, so is this non-conflicting?

Rich


* Re: Re: Status of Big5 and extensions
  2013-08-08  4:30       ` Rich Felker
@ 2013-08-08  4:50         ` Szabolcs Nagy
  2013-08-08  5:31           ` Rich Felker
  0 siblings, 1 reply; 9+ messages in thread
From: Szabolcs Nagy @ 2013-08-08  4:50 UTC (permalink / raw)
  To: musl

* Rich Felker <dalias@aerifal.cx> [2013-08-08 00:30:35 -0400]:
> On Thu, Aug 08, 2013 at 05:53:21AM +0200, Szabolcs Nagy wrote:
> > note that
> > C87A, C87C, C8A4 are mapped to 2xxxx in hkscs
> > (old hkscs pua codes agree with uao)
> 
> OK, so is this non-conflicting?
> 

i just wanted to note that these are the only codes
in the 89x157 table that map to non-16bit unicode
codepoints

there is no conflict



* Re: Re: Status of Big5 and extensions
  2013-08-08  4:50         ` Szabolcs Nagy
@ 2013-08-08  5:31           ` Rich Felker
  2013-08-08  7:19             ` Rich Felker
  0 siblings, 1 reply; 9+ messages in thread
From: Rich Felker @ 2013-08-08  5:31 UTC (permalink / raw)
  To: musl

On Thu, Aug 08, 2013 at 06:50:57AM +0200, Szabolcs Nagy wrote:
> * Rich Felker <dalias@aerifal.cx> [2013-08-08 00:30:35 -0400]:
> > On Thu, Aug 08, 2013 at 05:53:21AM +0200, Szabolcs Nagy wrote:
> > > note that
> > > C87A, C87C, C8A4 are mapped to 2xxxx in hkscs
> > > (old hkscs pua codes agree with uao)
> > 
> > OK, so is this non-conflicting?
> > 
> 
> i just wanted to note that these are the only codes
> in the 89x157 table that map to non-16bit unicode
> codepoints
> 
> there is no conflict

OK, great. I think rather than using any fancy tables for the 89x157
grid and HKSCS, we can just special-case these three in the code.
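
One way such a special case might look, sketched in Python: keep a 16-bit table and add the plane-2 offset for the codes Szabolcs listed (C87A, C87C, C8A4). The names `TABLE16`, `PLANE2_CODES`, and `lookup` are hypothetical, and the low 16-bit values for those three codes are PLACEHOLDERS, not the real HKSCS mappings, which this thread only gives as "2xxxx".

```python
# 16-bit table: Big5 code -> low 16 bits of the code point.
# The three C8xx values are PLACEHOLDERS, not real HKSCS data.
TABLE16 = {
    0xA451: 0x5341,  # ordinary BMP mapping mentioned earlier in the thread
    0xC87A: 0x0001,  # placeholder
    0xC87C: 0x0002,  # placeholder
    0xC8A4: 0x0003,  # placeholder
}

# Codes whose targets lie in plane 2 (U+2xxxx) per the note above.
PLANE2_CODES = frozenset((0xC87A, 0xC87C, 0xC8A4))

def lookup(code):
    """Return the full code point, adding 0x20000 for the special cases."""
    cp = TABLE16[code]
    if code in PLANE2_CODES:
        cp |= 0x20000
    return cp
```

This keeps every table entry at 16 bits, at the cost of one extra membership test per lookup.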

It also looks to me like HKSCS and UAO are essentially non-conflicting
in the 89x157 grid once you remove PUA junk; the only real conflict is
the half-width kana UAO mapped over part of HKSCS. So it may make
sense to just add all the extended mappings in this range except the
kana to the base Big5 table. This would definitely simplify HKSCS
support. If we later want UAO support, this range could just be
special-cased algorithmically since it seems to be direct range
mappings to Unicode.

Rich


* Re: Re: Status of Big5 and extensions
  2013-08-08  5:31           ` Rich Felker
@ 2013-08-08  7:19             ` Rich Felker
  0 siblings, 0 replies; 9+ messages in thread
From: Rich Felker @ 2013-08-08  7:19 UTC (permalink / raw)
  To: musl

On Thu, Aug 08, 2013 at 01:31:07AM -0400, Rich Felker wrote:
> On Thu, Aug 08, 2013 at 06:50:57AM +0200, Szabolcs Nagy wrote:
> > * Rich Felker <dalias@aerifal.cx> [2013-08-08 00:30:35 -0400]:
> > > On Thu, Aug 08, 2013 at 05:53:21AM +0200, Szabolcs Nagy wrote:
> > > > note that
> > > > C87A, C87C, C8A4 are mapped to 2xxxx in hkscs
> > > > (old hkscs pua codes agree with uao)
> > > 
> > > OK, so is this non-conflicting?
> > > 
> > 
> > i just wanted to note that these are the only codes
> > in the 89x157 table that map to non-16bit unicode
> > codepoints
> > 
> > there is no conflict
> 
> OK, great. I think rather than using any fancy tables for the 89x157
> grid and HKSCS, we can just special-case these three in the code.
> 
> It also looks to me like HKSCS and UAO are essentially non-conflicting
> in the 89x157 grid once you remove PUA junk; the only real conflict is
> the half-width kana UAO mapped over part of HKSCS. So it may make
> sense to just add all the extended mappings in this range except the
> kana to the base Big5 table. This would definitely simplify HKSCS
> support. If we later want UAO support, this range could just be
> special-cased algorithmically since it seems to be direct range
> mappings to Unicode.

It seems this sort of unification has been attempted already, and it
looks like it largely succeeded. See:

http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2012-April/035330.html

Rich

