[TUHS] Character sets (was: Command-line options)

The Unix Heritage Society mailing list

* [TUHS] Character sets (was: Command-line options)
       [not found] <mailman.169.1459059516.15972.tuhs@minnie.tuhs.org>
@ 2016-03-27 10:09 ` Johnny Billquist
  2016-03-27 11:29   ` John Cowan
  0 siblings, 1 reply; 19+ messages in thread
From: Johnny Billquist @ 2016-03-27 10:09 UTC (permalink / raw)


On 2016-03-27 08:18, Greg 'groggy' Lehey<grog at lemis.com> wrote:
> Isn't it wonderful that we no longer have issues with character
> representation?

I hope that comment was meant as a joke, ironic, cynical, or whatever...

	Johnny

-- 
Johnny Billquist                  || "I'm on a bus
                                   ||  on a psychedelic trip
email: bqt at softjar.se             ||  Reading murder books
pdp is alive!                     ||  tryin' to stay hip" - B. Idol



* [TUHS] Character sets (was: Command-line options)
  2016-03-27 10:09 ` [TUHS] Character sets (was: Command-line options) Johnny Billquist
@ 2016-03-27 11:29   ` John Cowan
  2016-03-27 11:47     ` [TUHS] Character sets Johnny Billquist
  0 siblings, 1 reply; 19+ messages in thread
From: John Cowan @ 2016-03-27 11:29 UTC (permalink / raw)


Johnny Billquist scripsit:

> On 2016-03-27 08:18, Greg 'groggy' Lehey<grog at lemis.com> wrote:
> >Isn't it wonderful that we no longer have issues with character
> >representation?
> 
> I hope that comment was meant as a joke, ironic, cynical, or whatever...

Undoubtedly.  But things *are* much better than they used to be:
we can now do everything within a single character set, and convert
only at the boundaries (and increasingly, only in one direction).

-- 
John Cowan          http://www.ccil.org/~cowan        cowan at ccil.org
Deshil Holles eamus.  Deshil Holles eamus.  Deshil Holles eamus.
Send us, bright one, light one, Horhorn, quickening, and wombfruit. (3x)
Hoopsa, boyaboy, hoopsa!  Hoopsa, boyaboy, hoopsa!  Hoopsa, boyaboy, hoopsa!
  --Joyce, Ulysses, "Oxen of the Sun"



* [TUHS] Character sets
  2016-03-27 11:29   ` John Cowan
@ 2016-03-27 11:47     ` Johnny Billquist
  2016-03-27 21:49       ` Greg 'groggy' Lehey
  0 siblings, 1 reply; 19+ messages in thread
From: Johnny Billquist @ 2016-03-27 11:47 UTC (permalink / raw)


On 2016-03-27 13:29, John Cowan wrote:
> Johnny Billquist scripsit:
>
>> On 2016-03-27 08:18, Greg 'groggy' Lehey<grog at lemis.com> wrote:
>>> Isn't it wonderful that we no longer have issues with character
>>> representation?
>>
>> I hope that comment was meant as a joke, ironic, cynical, or whatever...
>
> Undoubtedly.  But things *are* much better than they used to be:
> we can now do everything within a single character set, and convert
> only at the boundaries (and increasingly, only in one direction).

Haha. Yes... Except that you now have multiple representations of each 
character within one character set. So what has improved???

	Johnny

-- 
Johnny Billquist                  || "I'm on a bus
                                   ||  on a psychedelic trip
email: bqt at softjar.se             ||  Reading murder books
pdp is alive!                     ||  tryin' to stay hip" - B. Idol



* [TUHS] Character sets
  2016-03-27 11:47     ` [TUHS] Character sets Johnny Billquist
@ 2016-03-27 21:49       ` Greg 'groggy' Lehey
  2016-03-27 21:53         ` Johnny Billquist
  0 siblings, 1 reply; 19+ messages in thread
From: Greg 'groggy' Lehey @ 2016-03-27 21:49 UTC (permalink / raw)


On Sunday, 27 March 2016 at 13:47:43 +0200, Johnny Billquist wrote:
> On 2016-03-27 13:29, John Cowan wrote:
>> Johnny Billquist scripsit:
>>
>>> On 2016-03-27 08:18, Greg 'groggy' Lehey<grog at lemis.com> wrote:
>>>> Isn't it wonderful that we no longer have issues with character
>>>> representation?
>>>
>>> I hope that comment was meant as a joke, ironic, cynical, or whatever...
>>
>> Undoubtedly.  But things *are* much better than they used to be:
>> we can now do everything within a single character set, and convert
>> only at the boundaries (and increasingly, only in one direction).
>
> Haha. Yes... Except that you now have multiple representations of each
> character within one character set. So what has improved???

In the Good Old Days, characters were all the same size, and you could
do nice, simple things like

  while (*c && *c++ != ' ');

Now you need a whole library to do the same thing.

Greg
--
Sent from my desktop computer.
Finger grog at FreeBSD.org for PGP public key.
See complete headers for address and phone numbers.
This message is digitally signed.  If your Microsoft MUA reports
problems, please read http://tinyurl.com/broken-mua

* [TUHS] Character sets
  2016-03-27 21:49       ` Greg 'groggy' Lehey
@ 2016-03-27 21:53         ` Johnny Billquist
  2016-03-27 21:59           ` Greg 'groggy' Lehey
  2016-03-27 23:30           ` John Cowan
  0 siblings, 2 replies; 19+ messages in thread
From: Johnny Billquist @ 2016-03-27 21:53 UTC (permalink / raw)


On 2016-03-27 23:49, Greg 'groggy' Lehey wrote:
> On Sunday, 27 March 2016 at 13:47:43 +0200, Johnny Billquist wrote:
>> On 2016-03-27 13:29, John Cowan wrote:
>>> Johnny Billquist scripsit:
>>>
>>>> On 2016-03-27 08:18, Greg 'groggy' Lehey<grog at lemis.com> wrote:
>>>>> Isn't it wonderful that we no longer have issues with character
>>>>> representation?
>>>>
>>>> I hope that comment was meant as a joke, ironic, cynical, or whatever...
>>>
>>> Undoubtedly.  But things *are* much better than they used to be:
>>> we can now do everything within a single character set, and convert
>>> only at the boundaries (and increasingly, only in one direction).
>>
>> Haha. Yes... Except that you now have multiple representations of each
>> character within one character set. So what has improved???
>
> In the Good Old Days, characters were all the same size, and you could
> do nice, simple things like
>
>    while (*c && *c++ != ' ');
>
> Now you need a whole library to do the same thing.

Another one I noted a while ago was that functions and commands in Unix,
such as lpq, which try to print things in nice columns now fail, because
the code doesn't actually know how many characters have been output.
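
A minimal C sketch of the underlying problem, assuming the output is
valid UTF-8: strlen() counts bytes, so column arithmetic based on it
drifts as soon as a name contains a non-ASCII character. The helper name
utf8_code_points is made up for illustration and is not taken from lpq
or any other tool mentioned here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count code points in a NUL-terminated string assumed to be valid UTF-8.
 * Continuation bytes have the form 10xxxxxx, so every byte that does not
 * match that pattern starts a new code point. */
size_t utf8_code_points(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p; p++) {
        if ((*p & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the two-byte sequence "\xc3\xa5" (å), strlen() returns 2 while
utf8_code_points() returns 1. Code points are still not columns:
combining marks and East Asian wide characters need a width lookup on
top of this.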

And let's not even talk about such wonderful concepts as colors in the 
character set definition... Unicode seems to have it all... I wonder how 
many code points exist for 'A'. It's definitely more than one...

	Johnny

-- 
Johnny Billquist                  || "I'm on a bus
                                   ||  on a psychedelic trip
email: bqt at softjar.se             ||  Reading murder books
pdp is alive!                     ||  tryin' to stay hip" - B. Idol



* [TUHS] Character sets
  2016-03-27 21:53         ` Johnny Billquist
@ 2016-03-27 21:59           ` Greg 'groggy' Lehey
  2016-03-27 22:19             ` Johnny Billquist
  2016-03-27 22:21             ` Charles Anthony
  2016-03-27 23:30           ` John Cowan
  1 sibling, 2 replies; 19+ messages in thread
From: Greg 'groggy' Lehey @ 2016-03-27 21:59 UTC (permalink / raw)


On Sunday, 27 March 2016 at 23:53:32 +0200, Johnny Billquist wrote:
> On 2016-03-27 23:49, Greg 'groggy' Lehey wrote:
>> On Sunday, 27 March 2016 at 13:47:43 +0200, Johnny Billquist wrote:
>>> On 2016-03-27 13:29, John Cowan wrote:
>>>> Johnny Billquist scripsit:
>>>>
>>>>> On 2016-03-27 08:18, Greg 'groggy' Lehey<grog at lemis.com> wrote:
>>>>>> Isn't it wonderful that we no longer have issues with character
>>>>>> representation?
>>>>>
>>>>> I hope that comment was meant as a joke, ironic, cynical, or whatever...
>>>>
>>>> Undoubtedly.  But things *are* much better than they used to be:
>>>> we can now do everything within a single character set, and convert
>>>> only at the boundaries (and increasingly, only in one direction).
>>>
>>> Haha. Yes... Except that you now have multiple representations of each
>>> character within one character set. So what has improved???
>>
>> In the Good Old Days, characters were all the same size, and you could
>> do nice, simple things like
>>
>>    while (*c && *c++ != ' ');
>>
>> Now you need a whole library to do the same thing.
>
> Another one I noted a while ago was that functions and commands in Unix,
> such as lpq, which try to print things in nice columns now fail, because
> the code doesn't actually know how many characters have been output.
>
> And let's not even talk about such wonderful concepts as colors in the
> character set definition... Unicode seems to have it all... I wonder how
> many code points exist for 'A'. It's definitely more than one...

For some definition of A, of course.  In addition there's clearly at
least Α (0x391) and А (0x410).

Greg
--
Sent from my desktop computer.
Finger grog at FreeBSD.org for PGP public key.
See complete headers for address and phone numbers.
This message is digitally signed.  If your Microsoft MUA reports
problems, please read http://tinyurl.com/broken-mua

* [TUHS] Character sets
  2016-03-27 21:59           ` Greg 'groggy' Lehey
@ 2016-03-27 22:19             ` Johnny Billquist
  2016-03-27 22:21             ` Charles Anthony
  1 sibling, 0 replies; 19+ messages in thread
From: Johnny Billquist @ 2016-03-27 22:19 UTC (permalink / raw)


On 2016-03-27 23:59, Greg 'groggy' Lehey wrote:
> On Sunday, 27 March 2016 at 23:53:32 +0200, Johnny Billquist wrote:
>> On 2016-03-27 23:49, Greg 'groggy' Lehey wrote:
>>> On Sunday, 27 March 2016 at 13:47:43 +0200, Johnny Billquist wrote:
>>>> On 2016-03-27 13:29, John Cowan wrote:
>>>>> Johnny Billquist scripsit:
>>>>>
>>>>>> On 2016-03-27 08:18, Greg 'groggy' Lehey<grog at lemis.com> wrote:
>>>>>>> Isn't it wonderful that we no longer have issues with character
>>>>>>> representation?
>>>>>>
>>>>>> I hope that comment was meant as a joke, ironic, cynical, or whatever...
>>>>>
>>>>> Undoubtedly.  But things *are* much better than they used to be:
>>>>> we can now do everything within a single character set, and convert
>>>>> only at the boundaries (and increasingly, only in one direction).
>>>>
>>>> Haha. Yes... Except that you now have multiple representations of each
>>>> character within one character set. So what has improved???
>>>
>>> In the Good Old Days, characters were all the same size, and you could
>>> do nice, simple things like
>>>
>>>     while (*c && *c++ != ' ');
>>>
>>> Now you need a whole library to do the same thing.
>>
>> Another one I noted a while ago was that functions and commands in Unix,
>> such as lpq, which try to print things in nice columns now fail, because
>> the code doesn't actually know how many characters have been output.
>>
>> And let's not even talk about such wonderful concepts as colors in the
>> character set definition... Unicode seems to have it all... I wonder how
>> many code points exist for 'A'. It's definitely more than one...
>
> For some definition of A, of course.  In addition there's clearly at
> least Α (0x391) and А (0x410).

Oh, definitely. I'm trying to limit myself to Latin-A at the moment. 
Otherwise the list will just be ridiculously long.

You have, of course, U+41, but you also have U+FF21. But if you want to 
go slightly silly, you also have U+1F110, U+1F130, U+1F150, U+1F170, 
U+1F1E6, U+E0041... And God knows if I've missed some other ones.
Of course whitespace and other typographic details matter. That's why
we have different code points for the letter, depending on things like
whitespace.
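
To make the distinction concrete, here is a small C sketch of a UTF-8
encoder: each of the code points listed above encodes to its own byte
sequence, so a byte-wise comparison never treats them as the same 'A'.
The function name utf8_encode is an illustrative assumption, not part of
any standard library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Encode one Unicode scalar value as UTF-8 into out (room for 4 bytes).
 * Returns the number of bytes written, or 0 for surrogates and for
 * values above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;           /* surrogates are not scalar values */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

With this, U+41 encodes as the single byte 41, U+391 as CE 91, U+410 as
D0 90, U+FF21 as EF BC A1 and U+1F110 as F0 9F 84 90.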

	Johnny

-- 
Johnny Billquist                  || "I'm on a bus
                                   ||  on a psychedelic trip
email: bqt at softjar.se             ||  Reading murder books
pdp is alive!                     ||  tryin' to stay hip" - B. Idol



* [TUHS] Character sets
  2016-03-27 21:59           ` Greg 'groggy' Lehey
  2016-03-27 22:19             ` Johnny Billquist
@ 2016-03-27 22:21             ` Charles Anthony
  2016-03-27 23:23               ` Dave Horsfall
  2016-03-28  0:18               ` Johnny Billquist
  1 sibling, 2 replies; 19+ messages in thread
From: Charles Anthony @ 2016-03-27 22:21 UTC (permalink / raw)


On Sun, Mar 27, 2016 at 2:59 PM, Greg 'groggy' Lehey <grog at lemis.com> wrote:

> On Sunday, 27 March 2016 at 23:53:32 +0200, Johnny Billquist wrote:
> > On 2016-03-27 23:49, Greg 'groggy' Lehey wrote:
> >> On Sunday, 27 March 2016 at 13:47:43 +0200, Johnny Billquist wrote:
> >>> On 2016-03-27 13:29, John Cowan wrote:
> >>>> Johnny Billquist scripsit:
> >>>>
> >>>>> On 2016-03-27 08:18, Greg 'groggy' Lehey<grog at lemis.com> wrote:
> >>>>>> Isn't it wonderful that we no longer have issues with character
> >>>>>> representation?
> >
> > And let's not even talk about such wonderful concepts as colors in the
> > character set definition... Unicode seems to have it all... I wonder how
> > many code points exist for 'A'. It's definitely more than one...
>
> For some definition of A, of course.  In addition there's clearly at
> least Α (0x391) and А (0x410).

,∀, sᴉɥʇ ʇǝƃɹoɟ ʇ,uop

-- Charles

* [TUHS] Character sets
  2016-03-27 22:21             ` Charles Anthony
@ 2016-03-27 23:23               ` Dave Horsfall
  2016-03-28  0:20                 ` John Cowan
  2016-03-28  0:18               ` Johnny Billquist
  1 sibling, 1 reply; 19+ messages in thread
From: Dave Horsfall @ 2016-03-27 23:23 UTC (permalink / raw)


On Sun, 27 Mar 2016, Charles Anthony wrote:

> ,∀, sᴉɥʇ ʇǝƃɹoɟ ʇ,uop

That, sir, is sheer genius!  I dips me lid to you...

-- 
Dave Horsfall DTM (VK2KFU)  "Those who don't understand security will suffer."



* [TUHS] Character sets
  2016-03-27 21:53         ` Johnny Billquist
  2016-03-27 21:59           ` Greg 'groggy' Lehey
@ 2016-03-27 23:30           ` John Cowan
  2016-03-27 23:56             ` Johnny Billquist
  2016-03-28  1:20             ` Random832
  1 sibling, 2 replies; 19+ messages in thread
From: John Cowan @ 2016-03-27 23:30 UTC (permalink / raw)


Johnny Billquist scripsit:

> >>Haha. Yes... Except that you now have multiple representations of each
> >>character within one character set. So what has improved???

Mojibake, though not unknown, is now much less common, and the number
of documents on the web that are in UTF-8 (including its ASCII subset)
is at 85% and rising.

> >In the Good Old Days, characters were all the same size, and you could
> >do nice, simple things like
> >
> >   while (*c && *c++ != ' ');

That particular piece of code still works if the encoding is UTF-8.
Fundamentally, Unicode is complicated because human writing systems
are complicated.
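
As a sketch of why that holds, assuming the input is valid UTF-8: every
byte of a multibyte sequence has its high bit set, so none of them can
compare equal to the ASCII space (0x20). The loop is written here with
the character literal ' ', and the wrapper name skip_past_space is
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Advance past the first ASCII space in c, or to the terminating NUL if
 * there is none. Bytes belonging to multibyte UTF-8 sequences are all
 * >= 0x80, so they can never be mistaken for ' ' (0x20). */
const char *skip_past_space(const char *c)
{
    assert(c != NULL);
    while (*c && *c++ != ' ')
        ;
    return c;
}
```

On "sm\xc3\xb6rg\xc3\xa5s bord" the loop stops just past the space, at
the 'b', without mistaking any byte of ö or å for a delimiter. It finds
only U+0020, not the other Unicode blanks.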

> Another one I noted a while ago was that functions and commands in
> Unix, such as lpq, which try to print things in nice columns now
> fail, because the code doesn't actually know how many characters have
> been output.

Well, if the font isn't fixed-width, you're screwed anyway.  But if
it is, there is information in the Unicode tables that tells you which
characters have widths of 0, 1, or 2.  Print programs can be modified
to use that information.

> And let's not even talk about such wonderful concepts as colors in
> the character set definition... Unicode seems to have it all... 

Colors are optional.

> I wonder how many code points exist for 'A'. It's definitely more than
> one...

Other than Greek and Cyrillic A letters, there are the math letters, which
are used *in plain text* to designate semantic differences: plain A,
italic A, and bold A mean different things mathematically.  Using the
math italics for emphasis or book titles is a Bad Thing.

-- 
John Cowan          http://www.ccil.org/~cowan        cowan at ccil.org
   There was an old man                Said with a laugh, "I
     From Peru, whose lim'ricks all      Cut them in half, the pay is
       Look'd like haiku.  He              Much better for two."
                                             --Emmet O'Brien



* [TUHS] Character sets
  2016-03-27 23:30           ` John Cowan
@ 2016-03-27 23:56             ` Johnny Billquist
  2016-03-28  1:54               ` John Cowan
  2016-03-28  3:27               ` Steve Nickolas
  2016-03-28  1:20             ` Random832
  1 sibling, 2 replies; 19+ messages in thread
From: Johnny Billquist @ 2016-03-27 23:56 UTC (permalink / raw)


On 2016-03-28 01:30, John Cowan wrote:
> Johnny Billquist scripsit:
>
>>>> Haha. Yes... Except that you now have multiple representations of each
>>>> character within one character set. So what has improved???
>
> Mojibake, though not unknown, is now much less common, and the number
> of documents on the web that are in UTF-8 (including its ASCII subset)
> is at 85% and rising.
>
>>> In the Good Old Days, characters were all the same size, and you could
>>> do nice, simple things like
>>>
>>>    while (*c && *c++ != ' ');
>
> That particular piece of code still works if the encoding is UTF-8.
> Fundamentally, Unicode is complicated because human writing systems
> are complicated.

While true, I do not agree that Unicode is complicated because of
writing systems. Unicode has surpassed the writing systems...

>> Another one I noted a while ago was that functions and commands in
>> Unix, such as lpq, which try to print things in nice columns now
>> fail, because the code doesn't actually know how many characters have
>> been output.
>
> Well, if the font isn't fixed-width, you're screwed anyway.  But if
> it is, there is information in the Unicode tables that tells you which
> characters have widths of 0, 1, or 2.  Print programs can be modified
> to use that information.

(...or 3)
Yeah, you just need to suck in a few gigabytes of Unicode libraries in 
your 4K program. I'm not sure I agree that this is an acceptable solution.

>> And let's not even talk about such wonderful concepts as colors in
>> the character set definition... Unicode seems to have it all...
>
> Colors are optional.

Really. So how should Green Book (U+1F4D7) be rendered differently than
Blue Book (U+1F4D8), or Orange Book (U+1F4D9)?

Curious minds want to know...

>> I wonder how many code points exist for 'A'. It's definitely more than
>> one...
>
> Other than Greek and Cyrillic A letters, there are the math letters, which
> are used *in plain text* to designate semantic differences: plain A,
> italic A, and bold A mean different things mathematically.  Using the
> math italics for emphasis or book titles is a Bad Thing.

And what are your thoughts on FULLWIDTH LATIN CAPITAL LETTER A (U+FF21). 
What is the semantic difference in having more whitespace around the 
letter? (It should semantically be decomposed to LATIN CAPITAL LETTER A 
(U+41), so for all unicode string comparisons, it is equal to A, but 
it's still a different code point.)
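
The fullwidth ASCII variants sit at a fixed offset from their ASCII
counterparts, so this one compatibility mapping can be sketched in a few
lines of C. This is only an illustration with a made-up function name;
it is not Unicode normalization. A plain code point comparison still
sees U+FF21 and U+41 as different unless some folding of this kind (or
NFKC/NFKD normalization) is applied first.

```c
#include <assert.h>
#include <stdint.h>

/* Map the fullwidth ASCII variants U+FF01..U+FF5E onto U+0021..U+007E,
 * which lie at a constant offset of 0xFEE0. Every other code point is
 * returned unchanged. */
uint32_t fold_fullwidth(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    if (cp >= 0xFF01 && cp <= 0xFF5E)
        return cp - 0xFEE0;
    return cp;
}
```

So fold_fullwidth(0xFF21) yields 0x41, while 0x41 itself and unrelated
code points such as U+3000 pass through untouched.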

	Johnny (Yes, I do not like Unicode...)

-- 
Johnny Billquist                  || "I'm on a bus
                                   ||  on a psychedelic trip
email: bqt at softjar.se             ||  Reading murder books
pdp is alive!                     ||  tryin' to stay hip" - B. Idol



* [TUHS] Character sets
  2016-03-27 22:21             ` Charles Anthony
  2016-03-27 23:23               ` Dave Horsfall
@ 2016-03-28  0:18               ` Johnny Billquist
  1 sibling, 0 replies; 19+ messages in thread
From: Johnny Billquist @ 2016-03-28  0:18 UTC (permalink / raw)


On 2016-03-28 00:21, Charles Anthony wrote:
>
>
> On Sun, Mar 27, 2016 at 2:59 PM, Greg 'groggy' Lehey <grog at lemis.com
> <mailto:grog at lemis.com>> wrote:
>
>     On Sunday, 27 March 2016 at 23:53:32 +0200, Johnny Billquist wrote:
>      > On 2016-03-27 23:49, Greg 'groggy' Lehey wrote:
>      >> On Sunday, 27 March 2016 at 13:47:43 +0200, Johnny Billquist wrote:
>      >>> On 2016-03-27 13:29, John Cowan wrote:
>      >>>> Johnny Billquist scripsit:
>      >>>>
>      >>>>> On 2016-03-27 08:18, Greg 'groggy' Lehey<grog at lemis.com
>     <mailto:grog at lemis.com>> wrote:
>      >>>>>> Isn't it wonderful that we no longer have issues with character
>      >>>>>> representation?
>      >
>      > And let's not even talk about such wonderful concepts as colors
>     in the
>      > character set definition... Unicode seems to have it all... I
>     wonder how
>      > many code points exist for 'A'. It's definitely more than one...
>
>     For some definition of A, of course.  In addition there's clearly at
>     least Α (0x391) and А (0x410).
>
> ,∀, sᴉɥʇ ʇǝƃɹoɟ ʇ,uop

Damn! I did forget that one... :-)

	Johnny

-- 
Johnny Billquist                  || "I'm on a bus
                                   ||  on a psychedelic trip
email: bqt at softjar.se             ||  Reading murder books
pdp is alive!                     ||  tryin' to stay hip" - B. Idol



* [TUHS] Character sets
  2016-03-27 23:23               ` Dave Horsfall
@ 2016-03-28  0:20                 ` John Cowan
  2016-03-28  1:02                   ` Dave Horsfall
  0 siblings, 1 reply; 19+ messages in thread
From: John Cowan @ 2016-03-28  0:20 UTC (permalink / raw)


Dave Horsfall scripsit:

> That, sir, is sheer genius!  I dips me lid to you...

http://www.fileformat.info/convert/text/upside-down.htm

-- 
"Well, I'm back."  --Sam        John Cowan <cowan at ccil.org>



* [TUHS] Character sets
  2016-03-28  0:20                 ` John Cowan
@ 2016-03-28  1:02                   ` Dave Horsfall
  0 siblings, 0 replies; 19+ messages in thread
From: Dave Horsfall @ 2016-03-28  1:02 UTC (permalink / raw)


On Sun, 27 Mar 2016, John Cowan wrote:

> http://www.fileformat.info/convert/text/upside-down.htm

Wow - thanks!

-- 
Dave Horsfall DTM (VK2KFU)  "Those who don't understand security will suffer."



* [TUHS] Character sets
  2016-03-27 23:30           ` John Cowan
  2016-03-27 23:56             ` Johnny Billquist
@ 2016-03-28  1:20             ` Random832
  2016-03-28  1:58               ` John Cowan
  1 sibling, 1 reply; 19+ messages in thread
From: Random832 @ 2016-03-28  1:20 UTC (permalink / raw)

On Sun, Mar 27, 2016, at 19:30, John Cowan wrote:
> > >   while (*c && *c++ != ' ');
> 
> That particular piece of code still works if the encoding is UTF-8.

Sure it does, but replace that != ' ' with !isblank(*c), and it doesn't
work anymore since it ignores multibyte characters. Often you don't
care, but you've got to remember to set LC_ALL=C when running grep etc
on large data sets or it will be much slower, since \w and \s care about
multibyte characters (as does case-insensitive matching, etc).


* [TUHS] Character sets
  2016-03-27 23:56             ` Johnny Billquist
@ 2016-03-28  1:54               ` John Cowan
  2016-03-28  3:27               ` Steve Nickolas
  1 sibling, 0 replies; 19+ messages in thread
From: John Cowan @ 2016-03-28  1:54 UTC (permalink / raw)


Johnny Billquist scripsit:

> While true, I do not agree that Unicode is complicated because of
> writing systems. Unicode has surpassed the writing systems...

Yes, there is also incidental complexity, imposed by the need to
accommodate various pre-existing factors.

> Yeah, you just need to suck in a few gigabytes of Unicode libraries
> in your 4K program. I'm not sure I agree that this is an acceptable
> solution.

I doubt if the program is really just 4K any more, and there are such
things as shared libraries.  The Asian width table is not very big
by itself, especially if you use runs of characters rather than individual
characters and do a binary search.
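
A rough C sketch of the run-based lookup described above. The table is a
tiny hand-picked excerpt for illustration; a real one would be generated
from the Unicode data files, and the names width_run and column_width
are assumptions, not an existing API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One run of consecutive code points sharing the same column width. */
struct width_run {
    uint32_t first, last;
    int width;
};

/* Illustrative excerpt only, sorted by code point. Anything not listed
 * is assumed to occupy one column. */
static const struct width_run runs[] = {
    { 0x0300, 0x036F, 0 },  /* combining diacritical marks */
    { 0x1100, 0x115F, 2 },  /* Hangul Jamo initial consonants */
    { 0x3041, 0x3096, 2 },  /* Hiragana letters */
    { 0x4E00, 0x9FFF, 2 },  /* CJK unified ideographs block */
    { 0xFF01, 0xFF60, 2 },  /* fullwidth forms */
};

/* Binary search over the runs; returns 0, 1 or 2 columns. */
int column_width(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    size_t lo = 0, hi = sizeof runs / sizeof runs[0];
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (cp < runs[mid].first)
            hi = mid;
        else if (cp > runs[mid].last)
            lo = mid + 1;
        else
            return runs[mid].width;
    }
    return 1;
}
```

Anything not covered by a run falls through to width 1, so the table
only has to list the exceptions.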

> Really. So how should Green Book (U+1F4D7) be rendered differently
> than Blue Book (U+1F4D8), or Orange Book (U+1F4D9)?

See <http://unicode.org/emoji/charts/full-emoji-list.html> (slow to load)
and examine the fourth column ("Chart") for rows 1063-65.  Basically,
GREEN BOOK has vertical stripes on the cover, BLUE BOOK has horizontal
stripes, and ORANGE BOOK is black with white dots.

> And what are your thoughts on FULLWIDTH LATIN CAPITAL LETTER A
> (U+FF21). What is the semantic difference in having more whitespace
> around the letter? 

1-1 convertibility with various Japanese character sets.  Unicode is
not Cleanicode: it was designed not to do the best possible job, but
the best job possible under the circumstances.

-- 
John Cowan          http://www.ccil.org/~cowan        cowan at ccil.org
    "Any legal document draws most of its meaning from context.  A telegram
    that says 'SELL HUNDRED THOUSAND SHARES IBM SHORT' (only 190 bits in
    5-bit Baudot code plus appropriate headers) is as good a legal document
    as any, even sans digital signature." --me



* [TUHS] Character sets
  2016-03-28  1:20             ` Random832
@ 2016-03-28  1:58               ` John Cowan
  2016-03-28  5:12                 ` Random832
  0 siblings, 1 reply; 19+ messages in thread
From: John Cowan @ 2016-03-28  1:58 UTC (permalink / raw)


Random832 scripsit:

> Sure it does, but replace that != ' ' with !isblank(*c), and it doesn't
> work anymore since it ignores multibyte characters. 

In which locales does isblank() actually return true on characters other
than space and tab?  (This is a straight question.)

-- 
John Cowan          http://www.ccil.org/~cowan        cowan at ccil.org
You cannot enter here.  Go back to the abyss prepared for you!  Go back!
Fall into the nothingness that awaits you and your Master.  Go! --Gandalf



* [TUHS] Character sets
  2016-03-27 23:56             ` Johnny Billquist
  2016-03-28  1:54               ` John Cowan
@ 2016-03-28  3:27               ` Steve Nickolas
  1 sibling, 0 replies; 19+ messages in thread
From: Steve Nickolas @ 2016-03-28  3:27 UTC (permalink / raw)


On Mon, 28 Mar 2016, Johnny Billquist wrote:

> And what are your thoughts on FULLWIDTH LATIN CAPITAL LETTER A (U+FF21). What 
> is the semantic difference in having more whitespace around the letter? (It 
> should semantically be decomposed to LATIN CAPITAL LETTER A (U+41), so for 
> all unicode string comparisons, it is equal to A, but it's still a different 
> code point.)

Japanese text uses full-width ASCII a lot.

I just today ran across some closed captions that had two lines in 
English, and both lines were written full-width.

-uso.



* [TUHS] Character sets
  2016-03-28  1:58               ` John Cowan
@ 2016-03-28  5:12                 ` Random832
  0 siblings, 0 replies; 19+ messages in thread
From: Random832 @ 2016-03-28  5:12 UTC (permalink / raw)

On Sun, Mar 27, 2016, at 21:58, John Cowan wrote:
> Random832 scripsit:
> 
> > Sure it does, but replace that != ' ' with !isblank(*c), and it doesn't
> > work anymore since it ignores multibyte characters. 
> 
> In which locales does isblank() actually return true on characters other
> than space and tab?  (This is a straight question.)

See, no, that's a trick question. None of the other blank class
characters are single-byte, so of course isblank doesn't. The following
characters return true on is*w*blank for me: U+00a0 U+1680 U+2000 U+2001
U+2002 U+2003 U+2004 U+2005 U+2006 U+2007 U+2008 U+2009 U+200a U+200b
U+202f U+205f U+3000 (Oddly enough, isblank(0xA0) is true even in the
UTF-8 locale, though of course U+00a0 is actually a multibyte character
"\xc2\xa0".) So, if what you _want_ is to find the next blank character,
doing this loop with isblank won't work. If what you want is to find
space or tab, sure. But that's why grep for patterns containing \s is
so slow.
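
A sketch in C of what a blank search has to do once the blanks are
multibyte: decode each code point, then test it. The decoder assumes
valid UTF-8 and does no error checking, the blank list is simply the one
reported for iswblank() above (plus space and tab) rather than anything
taken from a standard, and all three function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from valid UTF-8 at s, store it in *cp and
 * return the number of bytes consumed. No validation is performed. */
static size_t utf8_decode(const unsigned char *s, uint32_t *cp)
{
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0) {
        *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0) {
        *cp = ((uint32_t)(s[0] & 0x0F) << 12) |
              ((uint32_t)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
        return 3;
    }
    *cp = ((uint32_t)(s[0] & 0x07) << 18) |
          ((uint32_t)(s[1] & 0x3F) << 12) |
          ((uint32_t)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
    return 4;
}

/* Space, tab, and the code points listed in the message above. */
static int is_blank_cp(uint32_t cp)
{
    return cp == ' ' || cp == '\t' || cp == 0x00A0 || cp == 0x1680 ||
           (cp >= 0x2000 && cp <= 0x200B) || cp == 0x202F ||
           cp == 0x205F || cp == 0x3000;
}

/* Return a pointer to the first blank code point in s, or to its NUL. */
const char *find_blank(const char *s)
{
    assert(s != NULL);
    const unsigned char *p = (const unsigned char *)s;
    while (*p) {
        uint32_t cp;
        size_t n = utf8_decode(p, &cp);
        if (is_blank_cp(cp))
            break;
        p += n;
    }
    return (const char *)p;
}
```

Given "a\xc2\xa0" "b", find_blank() stops at offset 1, where a byte-wise
search for ' ' or '\t' would run to the end of the string. This
per-character decoding is the kind of extra work referred to above.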


end of thread

Thread overview: 19+ messages
     [not found] <mailman.169.1459059516.15972.tuhs@minnie.tuhs.org>
2016-03-27 10:09 ` [TUHS] Character sets (was: Command-line options) Johnny Billquist
2016-03-27 11:29   ` John Cowan
2016-03-27 11:47     ` [TUHS] Character sets Johnny Billquist
2016-03-27 21:49       ` Greg 'groggy' Lehey
2016-03-27 21:53         ` Johnny Billquist
2016-03-27 21:59           ` Greg 'groggy' Lehey
2016-03-27 22:19             ` Johnny Billquist
2016-03-27 22:21             ` Charles Anthony
2016-03-27 23:23               ` Dave Horsfall
2016-03-28  0:20                 ` John Cowan
2016-03-28  1:02                   ` Dave Horsfall
2016-03-28  0:18               ` Johnny Billquist
2016-03-27 23:30           ` John Cowan
2016-03-27 23:56             ` Johnny Billquist
2016-03-28  1:54               ` John Cowan
2016-03-28  3:27               ` Steve Nickolas
2016-03-28  1:20             ` Random832
2016-03-28  1:58               ` John Cowan
2016-03-28  5:12                 ` Random832
