From mboxrd@z Thu Jan  1 00:00:00 1970
From: bqt@update.uu.se (Johnny Billquist)
Date: Mon, 28 Mar 2016 01:56:31 +0200
Subject: [TUHS] Character sets
In-Reply-To: <20160327233049.GA11617@mercury.ccil.org>
References: <56F7B14A.7040201@update.uu.se>
 <20160327112920.GH12921@mercury.ccil.org>
 <56F7C85F.6060503@update.uu.se>
 <20160327214943.GS3766@eureka.lemis.com>
 <56F8565C.3080704@update.uu.se>
 <20160327233049.GA11617@mercury.ccil.org>
Message-ID: <56F8732F.4010004@update.uu.se>

On 2016-03-28 01:30, John Cowan wrote:
> Johnny Billquist scripsit:
>
>>>> Haha. Yes... Except that you now have multiple representations of each
>>>> character within one character set. So what has improved???
>
> Mojibake, though not unknown, is now much less common, and the number
> of documents on the web that are in UTF-8 (including its ASCII subset)
> is at 85% and rising.
>
>>> In the Good Old Days, characters were all the same size, and you could
>>> do nice, simple things like
>>>
>>>     while (*c && *c++ != ' ');
>
> That particular piece of code still works if the encoding is UTF-8.
> Fundamentally, Unicode is complicated because human writing systems
> are complicated.

While true, I do not agree that Unicode is complicated because of
writing systems. Unicode has surpassed the writing systems...

>> Another one I noted a while ago was that functions and commands in
>> Unix, such as lpq, which try to print things in nice columns now
>> fail, because the code doesn't actually know how many characters have
>> been output.
>
> Well, if the font isn't fixed-width, you're screwed anyway. But if
> it is, there is information in the Unicode tables that tells you which
> characters have widths of 0, 1, or 2. Print programs can be modified
> to use that information.

(...or 3)
Yeah, you just need to suck in a few gigabytes of Unicode libraries in
your 4K program. I'm not sure I agree that this is an acceptable solution.

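To make the column problem concrete, here is a minimal sketch in C,
assuming the input is valid UTF-8. The helper names utf8_codepoints and
utf8_padding are hypothetical, invented for this illustration. Counting
code points is only a rough stand-in for display width: it still
miscounts combining (width 0) and fullwidth (width 2) characters, which
is exactly the part that needs the width tables.

```c
#include <stddef.h>
#include <string.h>

/* Count code points in a NUL-terminated string, assuming valid UTF-8.
 * Continuation bytes have the form 10xxxxxx and are skipped, so each
 * multi-byte sequence is counted once, at its lead byte. */
size_t utf8_codepoints(const char *s)
{
    size_t n = 0;
    for (; *s; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Number of spaces needed to pad s out to `width` code points.
 * Hypothetical helper; it ignores width-0 and width-2 characters. */
size_t utf8_padding(const char *s, size_t width)
{
    size_t n = utf8_codepoints(s);
    return n < width ? width - n : 0;
}
```

With the string "sm\xc3\xb6rg" (the UTF-8 bytes for "smörg"), strlen()
reports 6 but utf8_codepoints() reports 5, so a program that pads on
byte counts comes out one column short.
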
>> And let's not even talk about such wonderful concepts as colors in
>> the character set definition... Unicode seems to have it all...
>
> Colors are optional.

Really. So how should Green Book (U+1F4D7) be rendered differently than
Blue Book (U+1F4D8), or Orange Book (U+1F4D9)?
Curious minds want to know...

>> I wonder how many code points exist for 'A'. It's definitely more than
>> one...
>
> Other than Greek and Cyrillic A letters, there are the math letters, which
> are used *in plain text* to designate semantic differences: plain A,
> italic A, and bold A mean different things mathematically. Using the
> math italics for emphasis or book titles is a Bad Thing.

And what are your thoughts on FULLWIDTH LATIN CAPITAL LETTER A (U+FF21)?
What is the semantic difference in having more whitespace around the letter?
(It should semantically be decomposed to LATIN CAPITAL LETTER A (U+41),
so for all Unicode string comparisons, it is equal to A, but it's still
a different code point.)

	Johnny

(Yes, I do not like Unicode...)

-- 
Johnny Billquist                  || "I'm on a bus
                                  ||  on a psychedelic trip
email: bqt at softjar.se          ||  Reading murder books
pdp is alive!                     ||  tryin' to stay hip" - B. Idol