Re: UTF-8

zsh-users

* Re: UTF-8
       [not found] <BC9BC140-F1A5-11D5-BA73-000393164560@mas.ecp.fr>
@ 2001-12-18 16:51 ` Oliver Kiddle
  0 siblings, 0 replies; 32+ messages in thread
From: Oliver Kiddle @ 2001-12-18 16:51 UTC (permalink / raw)
  To: Olivier Verdier; +Cc: zsh-users

Olivier Verdier wrote:
> 
> I'm using Darwin and Mac OS X 10.1 together with zsh (zsh --version =
> zsh 4.0.4 (powerpc-apple-darwin1.4)), and I can't figure out how to make
> it work properly with UTF-8 encoding. All file names are indeed encoded
> in UTF-8 on macintosh hard-disk (HFS+ format). I use a terminal which is
> UTF-8 aware (apple Terminal.app). It works perfectly with
> UTF-8-configured 'less' and 'vim' commands.
> 
> Some examples of misbehaviors:
> 1) a 'ls' command for "Téléchargement" gives "Te??e??hargement"

The output of the ls command doesn't pass through zsh at all but goes
straight to the terminal so in this case, it is either ls or the
apple terminal which is failing to handle UTF-8.

>         *but* 'ls | less' gives "Téléchargement" if less is configured for
> UTF-8
>         so the output of 'ls' is correct, but is misinterpreted by the shell

That seems a little strange. I would suspect that the terminal is expecting
something like ISO-8859-1 and less is converting to that from UTF-8. Try
using a more unusual character and see what happens then.
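
A minimal Python sketch of that suspicion, assuming the name is stored with precomposed accents and that the terminal decodes incoming bytes as ISO-8859-1 (both are assumptions; the real Terminal.app settings are unknown here). The helper name `misread_as_latin1` is made up for illustration.

```python
def misread_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then decode those bytes as ISO-8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")

# "Téléchargement" written with escapes so the precomposed é (U+00E9) is explicit.
name = "T\u00e9l\u00e9chargement"
garbled = misread_as_latin1(name)

print(len(name), "characters,", len(name.encode("utf-8")), "bytes")
print(garbled)  # each é turns into two unrelated Latin-1 characters
```

If the terminal really were in ISO-8859-1 mode, every accented letter would expand into two wrong characters like this, which is the general shape of the reported symptom.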

> 2) completion doesn't work; if 'Télé' is on the directory, Té[tab] gives
> nothing, but 'cd Télé' works...
>         *moreover* 'cd Té' writes 'cd T@' on screen, but 'cd Té[tab]' turns
> itself into 'cd Té'
> 
> 3) 'cd Télé' together with the option 'printeightbit' prints correctly
> the pwd; mkdir Télé works as expected.

I'm not quite sure why the completion there doesn't work. It doesn't help
that I don't have a UTF-8 aware terminal to experiment with.

Unfortunately, zsh was never built to handle UTF-8 correctly. For many
things it would be transparent because of the way UTF-8 is designed.
Commands like echo and cd I would expect to work. In some areas though,
it won't work. For example, if you assign a UTF-8 string to a variable
and use $#var to get its length, it will report the length wrongly
because it counts bytes, so a two-byte character is counted as two.
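
The mismatch can be reproduced outside the shell. A minimal Python sketch (not zsh code; the sample string and function names are illustrative) comparing a byte count, which is what a program without multibyte support effectively reports, with a character count:

```python
def length_in_bytes(text: str) -> int:
    """Number of bytes in the UTF-8 encoding of text."""
    return len(text.encode("utf-8"))

def length_in_characters(text: str) -> int:
    """Number of Unicode code points in text."""
    return len(text)

word = "T\u00e9l\u00e9"  # "Télé": two ASCII letters and two two-byte letters

print(length_in_bytes(word))       # 6
print(length_in_characters(word))  # 4
```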

Fixing this would be quite a big job because it would affect virtually
all the code and need initial thought to work out where to use wide
characters, where to use UTF-8 and where to do conversions for input
and output.

For future reference, send any zsh questions to zsh-users@sunsite.dk or
zsh-workers@sunsite.dk. The address you used just goes to the people who
maintain the web pages.

Oliver Kiddle


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: utf-8
  2014-12-19 17:04                               ` utf-8 Ray Andrews
@ 2014-12-19 22:06                                 ` ZyX
  0 siblings, 0 replies; 32+ messages in thread
From: ZyX @ 2014-12-19 22:06 UTC (permalink / raw)
  To: Ray Andrews, zsh-users

19.12.2014, 20:07, "Ray Andrews" <rayandrews@eastlink.ca>:
>  On 12/18/2014 11:02 PM, Bart Schaefer wrote:
>
>  So often the answer is dead simple once the question is understood:
>>   Just hold down the "shift to English" key and type.  In the case of
>>   Did you read my messages? ***RUSSIANS ENTER THIS KIND OF THINGS USING ENGLISH KEYBOARD LAYOUT***.
>>   You'll see they have keys that have latin top left and cyrillic bottom
>>   right. Quite how they switch between the two modes, I can't tell you but
>>   they can.
>  ... there.  Non Latin keyboards normally have the ability to switch to a
>  universal ASCII/Latin/English mode, which is required for the input of special
>  characters, like '\n', and would in fact be used for all shell programming since
>  zsh keywords and such are not translated into other languages or alphabets
>  in any case. Simple: everyone codes in English.
>
>  Thank you gentlemen.

There is some misunderstanding. *Keyboards* do not have a way to switch layouts: those that were once manufactured with a РУС/ЛАТ button (Russian/Latin switch; I am not sure whether this button was implemented in hardware or software) are long since dead and buried. Except for some very rare cases, local keyboards are regular keyboards you may find in the nearest computer store, with the only addition being the local characters printed on the keys (note: the keyboard does not even know, and cannot tell the OS, what is printed there). What does the switching is some software (usually considered to be a part of the operating system), and it may do much more than just switching: e.g. search for “X11 Japanese input method” (far more characters are needed to write in Japanese than are present on the keyboard).

The only keyboards left that handle switching themselves are various virtual keyboards (usually those you may find on your smartphone).
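
A toy Python model of that division of labour, offered only as a sketch: the key names and the `LAYOUTS` table are hypothetical, and the Russian entries follow the common ЙЦУКЕН arrangement as far as I know it. The key reports a position; software decides which character that position means.

```python
# Hypothetical model: a key is identified only by its physical position
# (named here after the US legend); the layout table lives in software.
LAYOUTS = {
    "us": {"KEY_Y": "y", "KEY_N": "n", "KEY_H": "h"},
    "ru": {"KEY_Y": "\u043d", "KEY_N": "\u0442", "KEY_H": "\u0440"},  # н, т, р
}

def character_for(key: str, layout: str) -> str:
    """Return the character the active layout assigns to a physical key."""
    return LAYOUTS[layout][key]

for layout in ("us", "ru"):
    ch = character_for("KEY_N", layout)
    print(f"{layout}: KEY_N -> U+{ord(ch):04X}")
```

Typing `\n` on such a setup means switching the active layout to "us" first, so the same physical key yields U+006E again.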


* Re: utf-8
  2014-12-19  7:02                             ` utf-8 Bart Schaefer
@ 2014-12-19 17:04                               ` Ray Andrews
  2014-12-19 22:06                                 ` utf-8 ZyX
  0 siblings, 1 reply; 32+ messages in thread
From: Ray Andrews @ 2014-12-19 17:04 UTC (permalink / raw)
  To: zsh-users

On 12/18/2014 11:02 PM, Bart Schaefer wrote:


So often the answer is dead simple once the question is understood:

> Just hold down the "shift to English" key and type.  In the case of

> Did you read my messages? ***RUSSIANS ENTER THIS KIND OF THINGS USING ENGLISH KEYBOARD LAYOUT***.
>

> You'll see they have keys that have latin top left and cyrillic bottom
> right. Quite how they switch between the two modes, I can't tell you but
> they can.
>
... there.  Non Latin keyboards normally have the ability to switch to a
universal ASCII/Latin/English mode, which is required for the input of special
characters, like '\n', and would in fact be used for all shell programming since
zsh keywords and such are not translated into other languages or alphabets
in any case. Simple: everyone codes in English.

Thank you gentlemen.



* Re: utf-8
  2014-12-19  6:34                           ` utf-8 Ray Andrews
  2014-12-19  7:02                             ` utf-8 Bart Schaefer
@ 2014-12-19  7:29                             ` Павлов Николай Александрович
  1 sibling, 0 replies; 32+ messages in thread
From: Павлов Николай Александрович @ 2014-12-19  7:29 UTC (permalink / raw)
  To: Ray Andrews, zsh-users

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

On December 19, 2014 9:34:45 AM EAT, Ray Andrews <rayandrews@eastlink.ca> wrote:
>On 12/18/2014 06:45 PM, Bart Schaefer wrote:
>> Did you read the entirety of my last message or did you stop after
>the
>> first two sentences?  There's more below the part of your message
>that
>> I quoted.
>Yes. But where I lose the scent is in thinking about, specifically,
>Cyrillic.
>Cyrillic does not have a 'real' 'n' character.  They have two
>'backwards'
>capital ens that are vowels, and the sound 'en' is represented by 'our'
>
>'H'.
>So, since their keyboard would be devoted to their own alphabet (better
>than
>ours, BTW), where would the 'n' character be so as to type '\n' ...
>there *is* no
>'n' character!  Would they use '\H' ('H' being the Cyrillic letter
>representing the sound 'en'?
>> Re-read the last paragraph of my previous message, please.  In UTF-8
>the
>> code points are based on the visual representation of the character;
>in
>> multi-byte Unicode there are multiple similar characters based on the
>> source language and semantics, but the ASCII subset is always the
>same
>> code points as is ASCII, including "control characters" like newline.
>What would constitute the 'visual representation of the char'?  And
>ASCII may
>maintain a consistency, but if my KB is Cyrillic, where does that leave
>me?
>Would their 'H' produce ASCII #78 (which is as close as you can get
>to Latin
>'N')?  Cyrillic has more letters, so how would  one begin to ASCII
>Cyrillic? Nope,
>that can't be the way.  But I'm sure the Russians can print newlines
>somehow
>and I can't believe they have an 'n' sitting there for just that
>purpose.  Do they
>enter the needed value in hex? Or does zsh not worry about it, and the
>Russians
>have to bind a key? If so, then yes, it's off topic.

Did you read my messages? ***RUSSIANS ENTER THIS KIND OF THINGS USING ENGLISH KEYBOARD LAYOUT***.

If you are in a non-English environment, what you need to be concerned about is a way to switch keyboard layouts, input methods or something similar, and to have one of them able to produce English text.

>
>Sorry for being stupid, but I really don't get it.

-----BEGIN PGP SIGNATURE-----
Version: APG v1.1.1

iQJNBAEBCgA3BQJUk9PaMBwfMDI7PjIgHTg6PjswOSAQOzU6QTA9NEA+MjhHIDxr
cC1wYXZAeWFuZGV4LnJ1PgAKCRBu+P2/AXZZIrzrD/9896N5Xe//AZGGObXaTNUi
1g6M2JMmKLxgNeNiwaLkg5hktK2feSRTUzm1qqWjPtC51LAAh0xh5MZRE0KSWIyC
kBj/VrqghP7gtN33fwRCXedxRbHEW2flbZ6YqiEnnYvRezn6Pqrs645lxafr2+Or
14r5S7EO1OE25ozNMSwaa713Q4RYMWb/kLs4R8mumK85xnCSgkqiIy4Zj/owLRZz
lJpTXJzZ7TJWmZVW0CPattz0eZ/EGmN7tovQiByA8j6kiMf9SFV1A2iZk7XtXpjx
8J5NPOHQhFOPGRn7OKjCSQRxqYagc3KbFXtf7Td3mmlb57/Q6YiztPl74S2lROiR
xW+RjViiZx4duIX5CPU5V47BItmF5xcJ5yeigB84AifnbQ6MX2WTaiUgparebCXB
le0CAAnJtY4E4BjkNfMkyOvyr24QAGQ3lIhvDFMLisiMROKc1ljMOtrywKObD2BC
LuvhCQRuvmcnQOQlegvcCGVk8WntI8K+51ucNs9Vd0aeVyFnBDwNJf3SDqRbda4I
WQ11TveMRNNAgJpn54VuaXnDLik/rfp01A3jatZpJ4SZ6xjN6VxdnrqCfpKahkyC
BON7qw+pwbBhaF1P205Z1EbvRBdOb07BGolfu90KLvR5EvHG3WVLEXHVHESqSyaz
1VAje917TTgDapHcb5x4Qw==
=LQ5+
-----END PGP SIGNATURE-----



* Re: utf-8
  2014-12-19  6:34                           ` utf-8 Ray Andrews
@ 2014-12-19  7:02                             ` Bart Schaefer
  2014-12-19 17:04                               ` utf-8 Ray Andrews
  2014-12-19  7:29                             ` utf-8 Павлов Николай Александрович
  1 sibling, 1 reply; 32+ messages in thread
From: Bart Schaefer @ 2014-12-19  7:02 UTC (permalink / raw)
  To: zsh-users

On Dec 18, 10:34pm, Ray Andrews wrote:
}
} Yes. But where I lose the scent is in thinking about, specifically,
} Cyrillic. Cyrillic does not have a 'real' 'n' character. They have
} two 'backwards' capital ens that are vowels, and the sound 'en' is
} represented by 'our' 'H'. So, since their keyboard would be devoted
} to their own alphabet (better than ours, BTW), where would the 'n'
} character be so as to type '\n' ... there *is* no 'n' character! Would
} they use '\H' ('H' being the Cyrillic letter representing the sound
} 'en'?

OK, now we really are way off topic.

What appears on the keyboard has almost nothing to do with the shell and
barely anything to do with the character set used to represent letters,
numbers, etc. internally.

} But I'm sure the Russians can print newlines somehow and I can't
} believe they have an 'n' sitting there for just that purpose. Do they
} enter the needed value in hex? Or does zsh not worry about it, and the
} Russians have to bind a key? If so, then yes, it's off topic.

Indeed zsh does not worry about this, and the Russians have to figure
out for themselves how to enter characters that don't appear on their
keyboard.

I haven't used a Russian keyboard but I have used a Japanese one and in
fact it DID have an 'n' sitting there for just that purpose.  The same
way your keyboard has a % sign sitting there above the 5, and so on.
Just hold down the "shift to English" key and type.  In the case of
keyboards that have no such shift key, there is language front-end
software that interprets multiple keystrokes and converts them to one
of the "missing" characters before anything gets sent through
the terminal to the shell.


* Re: utf-8
  2014-12-19  2:45                         ` utf-8 Bart Schaefer
@ 2014-12-19  6:34                           ` Ray Andrews
  2014-12-19  7:02                             ` utf-8 Bart Schaefer
  2014-12-19  7:29                             ` utf-8 Павлов Николай Александрович
  0 siblings, 2 replies; 32+ messages in thread
From: Ray Andrews @ 2014-12-19  6:34 UTC (permalink / raw)
  To: zsh-users

On 12/18/2014 06:45 PM, Bart Schaefer wrote:
> Did you read the entirety of my last message or did you stop after the
> first two sentences?  There's more below the part of your message that
> I quoted.
Yes. But where I lose the scent is in thinking about, specifically, 
Cyrillic.
Cyrillic does not have a 'real' 'n' character.  They have two 'backwards'
capital ens that are vowels, and the sound 'en' is represented by 'our' 
'H'.
So, since their keyboard would be devoted to their own alphabet (better than
ours, BTW), where would the 'n' character be so as to type '\n' ... 
there *is* no
'n' character!  Would they use '\H' ('H' being the Cyrillic letter 
representing the sound 'en'?
> Re-read the last paragraph of my previous message, please.  In UTF-8 the
> code points are based on the visual representation of the character; in
> multi-byte Unicode there are multiple similar characters based on the
> source language and semantics, but the ASCII subset is always the same
> code points as is ASCII, including "control characters" like newline.
What would constitute the 'visual representation of the char'?  And 
ASCII may
maintain a consistency, but if my KB is Cyrillic, where does that leave me?
Would their 'H' produce ASCII #78 (which is as close as you can get 
to Latin
'N')?  Cyrillic has more letters, so how would  one begin to ASCII 
Cyrillic? Nope,
that can't be the way.  But I'm sure the Russians can print newlines somehow
and I can't believe they have an 'n' sitting there for just that 
purpose.  Do they
enter the needed value in hex? Or does zsh not worry about it, and the 
Russians
have to bind a key? If so, then yes, it's off topic.

Sorry for being stupid, but I really don't get it.





* Re: utf-8
  2014-12-19  2:27                       ` utf-8 Ray Andrews
                                           ` (2 preceding siblings ...)
  2014-12-19  3:50                         ` utf-8 Lawrence Velázquez
@ 2014-12-19  5:24                         ` Павлов Николай Александрович
  3 siblings, 0 replies; 32+ messages in thread
From: Павлов Николай Александрович @ 2014-12-19  5:24 UTC (permalink / raw)
  To: Ray Andrews, zsh-users

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

On December 19, 2014 5:27:15 AM EAT, Ray Andrews <rayandrews@eastlink.ca> wrote:
>On 12/18/2014 06:04 PM, Bart Schaefer wrote:
>> This has gone way off topic for the zsh-users list.  I don't recall
>if
>> the ietf-charsets list is still active, but that might be a better
>place
>> to go looking if it is.
>>
>Bart,
>
>Off topic? I'm wondering how one enters the newline character in zsh
>when
>one is using a different locale/alphabet.  I've only ever used English,
>
>and I'd expect that in, say, Cyrillic there would be some char that's a
>
>dead ringer for 'n' (as in '\n'), but in *principle* a Cyrillic 'n'
>might not be the same utf-8 code as 'our' 'n', so I'm wondering what
>zsh
>does about that. Spanish would have at least two 'candidates' for 'n'.
>
>What does zsh do once we are outside of good old ASCII?

I know of no encodings which are unable to represent ASCII characters, and I am pretty sure you cannot have one as a system locale even if such an encoding exists. The way the user enters characters is outside the scope of zsh's responsibilities. It has a number of ways to input characters not present in a particular keyboard layout, but usually users just have more than one layout, and one of the layouts they use is some English variant.
-----BEGIN PGP SIGNATURE-----
Version: APG v1.1.1

iQJNBAEBCgA3BQJUk7aYMBwfMDI7PjIgHTg6PjswOSAQOzU6QTA9NEA+MjhHIDxr
cC1wYXZAeWFuZGV4LnJ1PgAKCRBu+P2/AXZZIgdUEADDFMCJE7GqsJgjYN6lF307
YYN73HGCcZgIN1hS89SwI8q1xj/CsclstIwV1xC5E1lsiFV8p56t3Mn1g1cMcozp
8BHCtNDFkDfzWu+PZUOGdPArCpbh5M0fHzR0zWRPjO7E02k09omfmUihV+Y4RNcF
2p6SAIImhNVhVsogSezckyD5u8K4j/oQX+9WM0pZdprPMQIuLxvpPPytmtCw7RHE
uAoXW2ksbX6a4XUgDxyMde9QQSGYdAgEFCIbCHiLx/yNO17uLPbx+8vk9PjsIOCq
Ur9AqCtbKYO+GZXC/PfR9mOHxNS3hMO5U612gflpLL7SLtasmQBq0FEEWF+U85zS
wqkbH9Fsbg4U8MzLI8/27njpndtHPUMMd3Z9BLq3+R3FJXgeusTJ1fD9xEPhijeC
lM575Vxwa6FsLOCYXCVKAjmHk5IJ3OCrjgXLyomIaocXVq8lauZ1J3NVXwyHZ0D/
j35ez2yJPkf2hxu4u0tz6hglLYyZok7aoTTW65f/3JjHNcT+qL9Ae+i4vicgNkjX
ctQbBOJx5h/rdhz0PTT9//dya2ye+QPyZFi5em1lUyLaRZhv80P/HVJY4FyZ4wXb
zcipriZAj0RTW7pE32slM/jjeYonoTFqUBmtQ0inshgAO8ozU+XgtKv+j9s5Kkcf
fmreQXkv0o0ayMb/dp4jQQ==
=xpO1
-----END PGP SIGNATURE-----



* Re: utf-8
  2014-12-18 23:55                   ` utf-8 Ray Andrews
  2014-12-19  2:04                     ` utf-8 Bart Schaefer
@ 2014-12-19  5:18                     ` Павлов Николай Александрович
  1 sibling, 0 replies; 32+ messages in thread
From: Павлов Николай Александрович @ 2014-12-19  5:18 UTC (permalink / raw)
  To: Ray Andrews, zsh-users

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

On December 19, 2014 2:55:07 AM EAT, Ray Andrews <rayandrews@eastlink.ca> wrote:
>On 12/18/2014 01:38 PM, ZyX wrote:
>> `\n`. Escapes are defined by zsh parser, not by anything else. Same
>> for any other language. There is not much reasoning behind
>translating
>> characters after `\` and I have never seen them actually translated
>in
>> any language, no matter whether it allows unicode identifiers or not.
>
>
>Sorry, I don't understand.  Of course this is defined by zsh, but what
>char in Cyrillic will be used for '\n' in Latin?  See what I mean?  Or

No. There is no such thing as "'\n' in Latin". There is "'\n' in $encoding", and UTF-8 as $encoding is able to represent all characters from all alphabets defined in the Unicode standard.

>some even more different alphabet that has nothing like 'n' at all?  Or
>
>do  you have 'n' available to you exactly as in Latin?

Usually people using non-English locales have their input method set up so they can switch between languages. If not, then e.g. with a Russian keyboard you have much bigger problems than inputting '\n': what about "e", "c", "h", "o" and "'"? The standard Russian keyboard layout has no "'" and commands are not translated.
-----BEGIN PGP SIGNATURE-----
Version: APG v1.1.1

iQJNBAEBCgA3BQJUk7UvMBwfMDI7PjIgHTg6PjswOSAQOzU6QTA9NEA+MjhHIDxr
cC1wYXZAeWFuZGV4LnJ1PgAKCRBu+P2/AXZZItEDD/9olnBIDHsRMGdosWhzATmf
Ujv7QzFDbnk7VT9lRGgoMT6ewsNla3ZeUnrxrh+yFSNfeucmugNbWQ7Otaetrnzo
PRBZQx8tDxW4RqAs6kpOBxbjY9O7iG4M3mo97WnVwlKKTensop9IAQffXcqag3dr
Nu7nLKdbt93EuriKIG62kgES8qpBTLxFcTioZRVhACON6XnY0v2P5tKMsTBPGcyG
1f8b/358WVNhApcWVdPMGlnLnknQGFARt5LmCQxhBYNhAVZOcLT3SeHkR1c2D8Kb
YxkrN9n9hHSyNZj0uHjmk+dtbUfgj9oUp+avHYPm0/U5ESH+EN1tLotTiHPW7X0D
9c/FUoeZZMrCvwUh5Gb0bGDSmJTSZ7rR2Ic4WDUbD332JeXCgX+M0A8XO7OkU+s0
zsHGwBIR/WjY3E0c6fugbkJVepVuEbREUQqUaVnjmKGCbzbeOFBtKUq2/DosnS5w
sE0tLE8xo/EgZY0xENV0Ocb7r6zookGNskofWyL2I3j9+JbRZM1DOeNtj2D4o4py
lgOU9EO23r5isoanHgy/tigD0OKvNEtSN6Z1dKnGzieR+8a7LKjgZH+VNEKsyzsS
hFdV8gsMtSrEHtNv+cjoG0RAq1Zsuo6Ss/5ex7nxCVS9Elr62BaBTzwRyw54W6W9
BZOoW6DU/tb8phHsCgPCxQ==
=gxQ8
-----END PGP SIGNATURE-----



* Re: utf-8
  2014-12-19  2:27                       ` utf-8 Ray Andrews
  2014-12-19  2:32                         ` utf-8 Mikael Magnusson
  2014-12-19  2:45                         ` utf-8 Bart Schaefer
@ 2014-12-19  3:50                         ` Lawrence Velázquez
  2014-12-19  5:24                         ` utf-8 Павлов Николай Александрович
  3 siblings, 0 replies; 32+ messages in thread
From: Lawrence Velázquez @ 2014-12-19  3:50 UTC (permalink / raw)
  To: Ray Andrews; +Cc: zsh-users

On Dec 18, 2014, at 9:27 PM, Ray Andrews <rayandrews@eastlink.ca> wrote:

> Spanish would have at least two 'candidates' for 'n'.

No. The Spanish alphabet has one 'n'. The letters 'n' and 'ñ' are distinct, both visually and semantically.

vq


* Re: utf-8
  2014-12-19  2:27                       ` utf-8 Ray Andrews
  2014-12-19  2:32                         ` utf-8 Mikael Magnusson
@ 2014-12-19  2:45                         ` Bart Schaefer
  2014-12-19  6:34                           ` utf-8 Ray Andrews
  2014-12-19  3:50                         ` utf-8 Lawrence Velázquez
  2014-12-19  5:24                         ` utf-8 Павлов Николай Александрович
  3 siblings, 1 reply; 32+ messages in thread
From: Bart Schaefer @ 2014-12-19  2:45 UTC (permalink / raw)
  To: zsh-users

On Dec 18,  6:27pm, Ray Andrews wrote:
} Subject: Re: utf-8
}
} On 12/18/2014 06:04 PM, Bart Schaefer wrote:
} > This has gone way off topic for the zsh-users list.  I don't recall if
} > the ietf-charsets list is still active, but that might be a better place
} > to go looking if it is.
} >
} Bart,
} 
} Off topic? I'm wondering how one enters the newline character in zsh when 
} one is using a different locale/alphabet.

Did you read the entirety of my last message or did you stop after the
first two sentences?  There's more below the part of your message that
I quoted.

Anyway yes, it's off topic because how you enter a character is part of
the language processing layer, which generally happens before zsh receives
the input from the terminal.  There are some hacks to let zsh compensate
for a system that doesn't have a functioning language layer, but details
of language-to-character-set mapping are not a topic for this list unless
you are asking about the scripting for such a hack.

} and I'd expect that in, say, Cyrillic there would be some char that's a 
} dead ringer for 'n' (as in '\n'), but in *principle* a Cyrillic 'n' 
} might not be the same utf-8 code as 'our' 'n'

Re-read the last paragraph of my previous message, please.  In UTF-8 the
code points are based on the visual representation of the character; in
multi-byte Unicode there are multiple similar characters based on the
source language and semantics, but the ASCII subset is always the same
code points as is ASCII, including "control characters" like newline.


* Re: utf-8
  2014-12-19  2:27                       ` utf-8 Ray Andrews
@ 2014-12-19  2:32                         ` Mikael Magnusson
  2014-12-19  2:45                         ` utf-8 Bart Schaefer
                                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 32+ messages in thread
From: Mikael Magnusson @ 2014-12-19  2:32 UTC (permalink / raw)
  To: Ray Andrews; +Cc: Zsh Users

On Fri, Dec 19, 2014 at 3:27 AM, Ray Andrews <rayandrews@eastlink.ca> wrote:
> On 12/18/2014 06:04 PM, Bart Schaefer wrote:
>>
>> This has gone way off topic for the zsh-users list.  I don't recall if
>> the ietf-charsets list is still active, but that might be a better place
>> to go looking if it is.
>>
> Bart,
>
> Off topic? I'm wondering how one enters the newline character in zsh when one
> is using a different locale/alphabet.  I've only ever used English, and I'd
> expect that in, say, Cyrillic there would be some char that's a dead ringer
> for 'n' (as in '\n'), but in *principle* a Cyrillic 'n' might not be the
> same utf-8 code as 'our' 'n', so I'm wondering what zsh does about that.
> Spanish would have at least two 'candidates' for 'n'.  What does zsh do once
> we are outside of good old ASCII?

Escape sequences like \n and \t are always exactly those characters.

-- 
Mikael Magnusson



* Re: utf-8
  2014-12-19  2:04                     ` utf-8 Bart Schaefer
@ 2014-12-19  2:27                       ` Ray Andrews
  2014-12-19  2:32                         ` utf-8 Mikael Magnusson
                                           ` (3 more replies)
  0 siblings, 4 replies; 32+ messages in thread
From: Ray Andrews @ 2014-12-19  2:27 UTC (permalink / raw)
  To: zsh-users

On 12/18/2014 06:04 PM, Bart Schaefer wrote:
> This has gone way off topic for the zsh-users list.  I don't recall if
> the ietf-charsets list is still active, but that might be a better place
> to go looking if it is.
>
Bart,

Off topic? I'm wondering how one enters the newline character in zsh when 
one is using a different locale/alphabet.  I've only ever used English, 
and I'd expect that in, say, Cyrillic there would be some char that's a 
dead ringer for 'n' (as in '\n'), but in *principle* a Cyrillic 'n' 
might not be the same utf-8 code as 'our' 'n', so I'm wondering what zsh 
does about that. Spanish would have at least two 'candidates' for 'n'.  
What does zsh do once we are outside of good old ASCII?


* Re: utf-8
  2014-12-18 23:55                   ` utf-8 Ray Andrews
@ 2014-12-19  2:04                     ` Bart Schaefer
  2014-12-19  2:27                       ` utf-8 Ray Andrews
  2014-12-19  5:18                     ` utf-8 Павлов Николай Александрович
  1 sibling, 1 reply; 32+ messages in thread
From: Bart Schaefer @ 2014-12-19  2:04 UTC (permalink / raw)
  To: zsh-users

This has gone way off topic for the zsh-users list.  I don't recall if
the ietf-charsets list is still active, but that might be a better place
to go looking if it is.

On Dec 18,  3:55pm, Ray Andrews wrote:
} Subject: Re: utf-8
}
} On 12/18/2014 01:38 PM, ZyX wrote:
} > `\n`. Escapes are defined by zsh parser, not by anything else. Same 
} > for any other language. There is not much reasoning behind translating 
} > characters after `\` and I have never seen them actually translated in 
} > any language, no matter whether it allows unicode identifiers or not. 
} 
} Sorry, I don't understand.  Of course this is defined by zsh, but what 
} char in Cyrillic will be used for '\n' in Latin?  See what I mean?

Do you mean for '\n' to be interpreted as newline, or to be interpreted
as "a literal 'n'"?

} Or some even more different alphabet that has nothing like 'n' at all?
} Or do you have 'n' available to you exactly as in Latin?

Unicode is intended to be a "universal" character encoding, meaning that
all characters in all character sets are included.  So a literal 'n' is
always a literal 'n', and a newline is always a newline.

Unlike say the ISO set of character encodings, which have all of ASCII
in common but may use the same code point for different characters in
different languages, Unicode has only one code point for each possible
character regardless of language.  UTF-8 is the classic compromise that
made almost no one happy, because to fit everything into that code point
range a number of visually similar but semantically distinct ideograms
in various languages had to be combined on the same code points.
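
The first point above, that the ISO encodings share ASCII but reuse the upper byte values for different characters, can be checked with Python's standard codecs. This is a minimal sketch; the byte 0xCE and the three encodings are arbitrary picks for illustration, and `decode_byte` is a made-up helper name.

```python
ENCODINGS = ["iso8859-1", "iso8859-5", "iso8859-7"]  # Western, Cyrillic, Greek

def decode_byte(value: int, encoding: str) -> str:
    """Decode a single byte value under the given 8-bit encoding."""
    return bytes([value]).decode(encoding)

for enc in ENCODINGS:
    ascii_char = decode_byte(0x6E, enc)  # inside the shared ASCII range
    high_char = decode_byte(0xCE, enc)   # outside it, meaning depends on enc
    print(f"{enc}: 0x6E -> U+{ord(ascii_char):04X}, 0xCE -> U+{ord(high_char):04X}")
```

The byte 0x6E comes out as U+006E ('n') every time, while 0xCE maps to a different Unicode code point under each encoding.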


* Re: utf-8
  2014-12-18 21:38                 ` utf-8 ZyX
@ 2014-12-18 23:55                   ` Ray Andrews
  2014-12-19  2:04                     ` utf-8 Bart Schaefer
  2014-12-19  5:18                     ` utf-8 Павлов Николай Александрович
  0 siblings, 2 replies; 32+ messages in thread
From: Ray Andrews @ 2014-12-18 23:55 UTC (permalink / raw)
  To: ZyX, zsh-users

On 12/18/2014 01:38 PM, ZyX wrote:
> `\n`. Escapes are defined by zsh parser, not by anything else. Same 
> for any other language. There is not much reasoning behind translating 
> characters after `\` and I have never seen them actually translated in 
> any language, no matter whether it allows unicode identifiers or not. 

Sorry, I don't understand.  Of course this is defined by zsh, but what 
char in Cyrillic will be used for '\n' in Latin?  See what I mean?  Or 
some even more different alphabet that has nothing like 'n' at all?  Or 
do  you have 'n' available to you exactly as in Latin?



* Re: utf-8
  2014-12-18 21:15               ` utf-8 Ray Andrews
@ 2014-12-18 21:38                 ` ZyX
  2014-12-18 23:55                   ` utf-8 Ray Andrews
  0 siblings, 1 reply; 32+ messages in thread
From: ZyX @ 2014-12-18 21:38 UTC (permalink / raw)
  To: Ray Andrews, zsh-users

19.12.2014, 00:17, "Ray Andrews" <rayandrews@eastlink.ca>:
> On 12/18/2014 12:52 PM, ZyX wrote:
>>  http://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt, third
>>  column. Read
>>  http://www.unicode.org/reports/tr44/tr44-14.html#General_Category_Values
>>  for the explanation of the values, you need L* and N* (note: testing
>>  shows that not all N* are relevant: No is not (tests: CIRCLED DIGIT
>>  ONE, VULGAR FRACTION ONE QUARTER), while Nd (DIGIT ONE, FULLWIDTH
>>  DIGIT ONE) and Nl (RUNIC ARLAUG SYMBOL) are). I highly suggest
>>  seeking the answer in libc sources if you need better precision.
>
> It  is very generous.  I can think of only one more question.  What
> happens in a language 'above' normal ASCII with things like escapes?
> Like if you were writing in Russian:
>
> echo "\nRussian is a very expressive language.\n"
>
> .... if that was in Cyrillic characters, how does one indicate '\n' ?

`\n`. Escapes are defined by zsh parser, not by anything else. Same for any other language. There is not much reasoning behind translating characters after `\` and I have never seen them actually translated in any language, no matter whether it allows unicode identifiers or not.



* Re: utf-8
  2014-12-18 20:52             ` utf-8 ZyX
@ 2014-12-18 21:15               ` Ray Andrews
  2014-12-18 21:38                 ` utf-8 ZyX
  0 siblings, 1 reply; 32+ messages in thread
From: Ray Andrews @ 2014-12-18 21:15 UTC (permalink / raw)
  To: ZyX, zsh-users

On 12/18/2014 12:52 PM, ZyX wrote:
> http://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt, third 
> column. Read 
> http://www.unicode.org/reports/tr44/tr44-14.html#General_Category_Values 
> for the explanation of the values, you need L* and N* (note: testing 
> shows that not all N* are relevant: No is not (tests: CIRCLED DIGIT 
> ONE, VULGAR FRACTION ONE QUARTER), while Nd (DIGIT ONE, FULLWIDTH 
> DIGIT ONE) and Nl (RUNIC ARLAUG SYMBOL) are). I highly suggest 
> seeking the answer in libc sources if you need better precision. 

It  is very generous.  I can think of only one more question.  What 
happens in a language 'above' normal ASCII with things like escapes? 
Like if you were writing in Russian:

echo "\nRussian is a very expressive language.\n"

.... if that was in Cyrillic characters, how does one indicate '\n' ?



* Re: utf-8
  2014-12-18 20:04           ` utf-8 Ray Andrews
  2014-12-18 20:12             ` utf-8 Peter Stephenson
@ 2014-12-18 20:52             ` ZyX
  2014-12-18 21:15               ` utf-8 Ray Andrews
  1 sibling, 1 reply; 32+ messages in thread
From: ZyX @ 2014-12-18 20:52 UTC (permalink / raw)
  To: Ray Andrews, zsh-users

18.12.2014, 23:04, "Ray Andrews" <rayandrews@eastlink.ca>:
> On 12/18/2014 10:52 AM, ZyX wrote:
>> You are missing the main point. Identifiers consist of the characters for which `iswalnum` is true
>>
>> ...
>> “☠” is U+2620 SKULL AND CROSSBONES which does *not* have unicode
>> category “Letter” or “Number” and thus cannot be used in an identifier.
>
> Ok, I see what you are saying.  So 'anything' can be data, but an
> identifier must be a 'letter' or 'number'.  Where can I see a table of
> what iswalnum() accepts out of unicode?

http://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt, third column. Read http://www.unicode.org/reports/tr44/tr44-14.html#General_Category_Values for the explanation of the values; you need L* and N* (note: testing shows that not all N* are relevant: No is not (tests: CIRCLED DIGIT ONE, VULGAR FRACTION ONE QUARTER), while Nd (DIGIT ONE, FULLWIDTH DIGIT ONE) and Nl (RUNIC ARLAUG SYMBOL) are). I highly suggest seeking the answer in libc sources if you need better precision.
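
The General_Category of each test character named here can also be looked up with Python's `unicodedata` module instead of reading UnicodeData.txt by hand. This is only a sketch of the lookup: it reports Unicode data, and whether libc's `iswalnum` accepts a given character still depends on the libc and locale, which this code does not test. The helper name `general_category` is made up.

```python
import unicodedata

TEST_CHARACTERS = [
    "SKULL AND CROSSBONES",
    "CIRCLED DIGIT ONE",
    "VULGAR FRACTION ONE QUARTER",
    "DIGIT ONE",
    "FULLWIDTH DIGIT ONE",
    "RUNIC ARLAUG SYMBOL",
]

def general_category(name: str) -> str:
    """Return the two-letter General_Category for a Unicode character name."""
    return unicodedata.category(unicodedata.lookup(name))

for name in TEST_CHARACTERS:
    code_point = ord(unicodedata.lookup(name))
    print(f"U+{code_point:04X}  {general_category(name)}  {name}")
```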


* Re: utf-8
  2014-12-18 20:04           ` utf-8 Ray Andrews
@ 2014-12-18 20:12             ` Peter Stephenson
  2014-12-18 20:52             ` utf-8 ZyX
  1 sibling, 0 replies; 32+ messages in thread
From: Peter Stephenson @ 2014-12-18 20:12 UTC (permalink / raw)
  To: zsh-users

On Thu, 18 Dec 2014 12:04:01 -0800
Ray Andrews <rayandrews@eastlink.ca> wrote:

> On 12/18/2014 10:52 AM, ZyX wrote:
> >
> > You are missing the main point. Identifiers consist of the characters for which `iswalnum` is true
> >
> > ...
> > “☠” is U+2620 SKULL AND CROSSBONES which does *not* have unicode 
> > category “Letter” or “Number” and thus cannot be used in an identifier.
> 
> Ok, I see what you are saying.  So 'anything' can be data, but an 
> identifier must be a 'letter' or 'number'.  Where can I see a table of 
> what iswalnum() accepts out of unicode?

http://www.unicode.org/charts/

describes the type for each unicode character.

You'll need to direct questions about that elsewhere.

pws



* Re: utf-8
  2014-12-18 18:52         ` utf-8 ZyX
@ 2014-12-18 20:04           ` Ray Andrews
  2014-12-18 20:12             ` utf-8 Peter Stephenson
  2014-12-18 20:52             ` utf-8 ZyX
  0 siblings, 2 replies; 32+ messages in thread
From: Ray Andrews @ 2014-12-18 20:04 UTC (permalink / raw)
  To: ZyX, zsh-users

On 12/18/2014 10:52 AM, ZyX wrote:
> You are missing the main point. Identifiers consist of the characters for which `iswalnum` is true
>
> ...
> “☠” is U+2620 SKULL AND CROSSBONES which does *not* have unicode 
> category “Letter” or “Number” and thus cannot be used in an identifier.

Ok, I see what you are saying.  So 'anything' can be data, but an 
identifier must be a 'letter' or 'number'.  Where can I see a table of 
what iswalnum() accepts out of unicode?


* Re: utf-8
  2014-12-18 18:41       ` utf-8 Ray Andrews
@ 2014-12-18 18:52         ` ZyX
  2014-12-18 20:04           ` utf-8 Ray Andrews
  0 siblings, 1 reply; 32+ messages in thread
From: ZyX @ 2014-12-18 18:52 UTC (permalink / raw)
  To: Ray Andrews, zsh-users

18.12.2014, 21:41, "Ray Andrews" <rayandrews@eastlink.ca>:
> On 12/18/2014 10:05 AM, ZyX wrote:
>> It is permitted at least in variable and function names: though I cannot find anything relevant in
> ...
>
> Seems I can use unicode 'one way' but not the other:

You are missing the main point. Identifiers consist of the characters for which `iswalnum` is true. (As an implementation detail, zsh uses its own internal equivalent for ASCII characters, so glibc has no chance to claim that U+0041 LATIN CAPITAL LETTER A is not an alphanumeric character, or that U+003D EQUALS SIGN is one; the manual page actually says it must not do this in any locale.) “☠” is U+2620 SKULL AND CROSSBONES, which does *not* have the Unicode category “Letter” or “Number” and thus cannot be used in an identifier. To use it in an identifier you must create a custom libc locale (or even a custom libc) that returns true for `iswalnum(0x2620)`.

This is the usual behaviour for many languages that allow Unicode identifiers: they use Unicode character classes to decide which code points may and which may not form an identifier.

>
>> $ howdy=☠
>>
>> $ echo $howdy
>> ☠
>>
>> $ ☠=howdy
>> zsh: command not found: ☠=howdy
>>
>> $ var☠=howdy
>> zsh: command not found: var☠=howdy
>
> multibyte is on, all 'posix*' options are off.

Try testing with something like `ПЕРЕМЕННАЯ` (Russian translation of “VARIABLE”) or `αβγ` (first three Greek letters). They do work, at least on my system.
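
As a rough illustration of the rule described above, the sketch below (Python, not zsh) accepts a name only if every character is `_` or has a General_Category starting with L or N. The function `looks_like_identifier` is hypothetical and only approximates the idea. zsh's actual check goes through libc's `iswalnum` under the current LC_CTYPE, which may disagree for individual characters, and the special handling of leading digits is not modelled here.

```python
import unicodedata


def looks_like_identifier(name):
    """Approximate the 'Letter or Number' rule from this thread.

    Assumption: a character qualifies if it is '_' or its Unicode
    General_Category starts with 'L' or 'N'. This is not zsh's
    isident(); libc's iswalnum() may accept a different set.
    """
    if not name:
        return False
    return all(
        ch == "_" or unicodedata.category(ch)[0] in ("L", "N")
        for ch in name
    )


for candidate in ["howdy", "\u2620", "var\u2620", "ПЕРЕМЕННАЯ", "αβγ"]:
    print(repr(candidate), looks_like_identifier(candidate))
```

Under this approximation the skull fails on its own and inside `var☠`, while the Cyrillic and Greek names pass, which matches the behaviour reported in the session quoted above.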


* Re: utf-8
  2014-12-18 18:05     ` utf-8 ZyX
@ 2014-12-18 18:41       ` Ray Andrews
  2014-12-18 18:52         ` utf-8 ZyX
  0 siblings, 1 reply; 32+ messages in thread
From: Ray Andrews @ 2014-12-18 18:41 UTC (permalink / raw)
  To: ZyX, zsh-users


On 12/18/2014 10:05 AM, ZyX wrote:
> It is permitted at least in variable and function names: though I 
> cannot find anything relevant in 
...

Seems I can use unicode 'one way' but not the other:

    $ howdy=☠

    $ echo $howdy
    ☠

    $ ☠=howdy
    zsh: command not found: ☠=howdy

    $ var☠=howdy
    zsh: command not found: var☠=howdy


multibyte is on, all 'posix*' options are off.



* Re: utf-8
  2014-12-18 18:14       ` utf-8 Ray Andrews
@ 2014-12-18 18:22         ` ZyX
  0 siblings, 0 replies; 32+ messages in thread
From: ZyX @ 2014-12-18 18:22 UTC (permalink / raw)
  To: Ray Andrews, zsh-users

18.12.2014, 21:16, "Ray Andrews" <rayandrews@eastlink.ca>:
> On 12/18/2014 09:48 AM, Peter Stephenson wrote:
>>  Yes, correct. Most syntax is pinned down --- either something is a
>>  keyword or something like a decimal number from a fixed set, or it's
>>  any old string. Identifiers are an exception. There's an option for
>>  this. POSIX_IDENTIFIERS <K> <S> When this option is set, only the
>>  ASCII characters a to z, A to Z, 0 to 9 and _ may be used in
>>  identifiers (names of shell parameters and modules). When the option
>>  is unset and multibyte character support is enabled (i.e. it is
>>  compiled in and the option MULTIBYTE is set), then additionally any
>>  alphanumeric characters in the local character set may be used in
>>  identifiers. Note that scripts and functions written with this feature
>>  are not portable, and also that both options must be set before the
>>  script or function is parsed; setting them during execution is not
>>  sufficient as the syntax variable=value has already been parsed as a
>>  command rather than an assignment. If multibyte character support is
>>  not compiled into the shell this option is ignored; all octets with
>>  the top bit set may be used in identifiers. This is non-standard but
>>  is the traditional zsh behaviour. pws
>
> Ok thanks.  Now if I can just figger out how to enter one of these
> unicodes in xfce terminal. You'd think their doc might say something
> about it.

Zsh has `insert-unicode-char` if you know the codepoint and `insert-composed-char` for a more human-friendly input of a limited set of characters.



* Re: utf-8
  2014-12-18 17:48     ` utf-8 Peter Stephenson
@ 2014-12-18 18:14       ` Ray Andrews
  2014-12-18 18:22         ` utf-8 ZyX
  0 siblings, 1 reply; 32+ messages in thread
From: Ray Andrews @ 2014-12-18 18:14 UTC (permalink / raw)
  To: zsh-users

On 12/18/2014 09:48 AM, Peter Stephenson wrote:
> Yes, correct. Most syntax is pinned down --- either something is a 
> keyword or something like a decimal number from a fixed set, or it's 
> any old string. Identifiers are an exception. There's an option for 
> this. POSIX_IDENTIFIERS <K> <S> When this option is set, only the 
> ASCII characters a to z, A to Z, 0 to 9 and _ may be used in 
> identifiers (names of shell parameters and modules). When the option 
> is unset and multibyte character support is enabled (i.e. it is 
> compiled in and the option MULTIBYTE is set), then additionally any 
> alphanumeric characters in the local character set may be used in 
> identifiers. Note that scripts and functions written with this feature 
> are not portable, and also that both options must be set before the 
> script or function is parsed; setting them during execution is not 
> sufficient as the syntax variable=value has already been parsed as a 
> command rather than an assignment. If multibyte character support is 
> not compiled into the shell this option is ignored; all octets with 
> the top bit set may be used in identifiers. This is non-standard but 
> is the traditional zsh behaviour. pws 

Ok thanks.  Now if I can just figger out how to enter one of these 
unicodes in xfce terminal. You'd think their doc might say something 
about it.



* Re: utf-8
  2014-12-18 17:36   ` utf-8 Ray Andrews
  2014-12-18 17:48     ` utf-8 Peter Stephenson
@ 2014-12-18 18:05     ` ZyX
  2014-12-18 18:41       ` utf-8 Ray Andrews
  1 sibling, 1 reply; 32+ messages in thread
From: ZyX @ 2014-12-18 18:05 UTC (permalink / raw)
  To: Ray Andrews, zsh-users

18.12.2014, 20:38, "Ray Andrews" <rayandrews@eastlink.ca>:
> On 12/18/2014 01:25 AM, Peter Stephenson wrote:
>
> Mikael, Peter:
>>  Chapter 5 of the FAQ is the best place to start. You can see this
>>  online at http://zsh.sourceforge.net/FAQ/zshfaq05.html#l52. The
>>  version in Etc of the source is newer but I don't think there are
>>  significant differences. pws
>
> Very nicely written. That's exactly what I wanted to learn.  And tho I
> knew it
> previously, I had semi forgotten the difference between unicode and utf-8,
> which lead to the fuzzy question. To ask it again more accurately, where are
> extended unicode characters permitted? Or perhaps that's better reversed,
> where are they *not* permitted? Can a variable have a name beyond ASCII?
> I see that zsh is transparent to utf-8 everywhere, but that does not presume
> that one has use of the entire unicode charset in all situations.

They are permitted at least in variable and function names. I cannot find anything relevant in the manual, but the code implementing the `isident` function, which is used to check variable names (not function names; I do not know that part), indirectly uses the library function `iswalnum`, which in turn knows about Unicode character classes (depending on LC_CTYPE).

AFAIK a function name can be anything that is not parsed as something else. The following definition works:

    '()' () {
        echo Test
    }

    \(\)
    # Outputs Test.

More:

    $PATH () {
        echo Test
    }

    /home/zyx/.gem/ruby/1.9.1/bin:<skip>:/opt/ekopath/bin
    # Outputs Test as well.

It looks like the zsh code was intentionally modified to use `iswalnum` in `itype_end`, which is called from `isident`. It also appears that UTF-8 characters in IFS are recognized: `itype_end` handles them as well, and I do not think such handling was added without a reason. Everything is locale-bound in any case, because libc functions are used rather than something like ICU.


* Re: utf-8
  2014-12-18 17:36   ` utf-8 Ray Andrews
@ 2014-12-18 17:48     ` Peter Stephenson
  2014-12-18 18:14       ` utf-8 Ray Andrews
  2014-12-18 18:05     ` utf-8 ZyX
  1 sibling, 1 reply; 32+ messages in thread
From: Peter Stephenson @ 2014-12-18 17:48 UTC (permalink / raw)
  To: zsh-users

On Thu, 18 Dec 2014 09:36:33 -0800
Ray Andrews <rayandrews@eastlink.ca> wrote:

> On 12/18/2014 01:25 AM, Peter Stephenson wrote:
> 
> Mikael, Peter:
> 
> > Chapter 5 of the FAQ is the best place to start. You can see this 
> > online at http://zsh.sourceforge.net/FAQ/zshfaq05.html#l52. The 
> > version in Etc of the source is newer but I don't think there are 
> > significant differences. pws 
> 
> Very nicely written. That's exactly what I wanted to learn.  And tho I 
> knew it
> previously, I had semi forgotten the difference between unicode and utf-8,
> which lead to the fuzzy question. To ask it again more accurately, where are
> extended unicode characters permitted? Or perhaps that's better reversed,
> where are they *not* permitted? Can a variable have a name beyond ASCII?
> I see that zsh is transparent to utf-8 everywhere, but that does not presume
> that one has use of the entire unicode charset in all situations.

Yes, correct.  Most syntax is pinned down --- either something is
a keyword or something like a decimal number from a fixed set, or it's
any old string.  Identifiers are an exception.  There's an option for this.

POSIX_IDENTIFIERS <K> <S>
       When  this option is set, only the ASCII characters a to z, A to
       Z, 0 to 9 and _ may be  used  in  identifiers  (names  of  shell
       parameters and modules).

       When  the  option  is  unset  and multibyte character support is
       enabled (i.e. it is compiled in  and  the  option  MULTIBYTE  is
       set), then additionally any alphanumeric characters in the local
       character set may be used in identifiers.  Note that scripts and
       functions  written  with this feature are not portable, and also
       that both options must be set before the script or  function  is
       parsed;  setting  them during execution is not sufficient as the
       syntax variable=value has  already  been  parsed  as  a  command
       rather than an assignment.

       If  multibyte  character  support is not compiled into the shell
       this option is ignored; all octets with the top bit set  may  be
       used  in  identifiers.   This  is non-standard but is the
       traditional zsh behaviour.

pws



* Re: utf-8
  2014-12-18  9:25 ` utf-8 Peter Stephenson
@ 2014-12-18 17:36   ` Ray Andrews
  2014-12-18 17:48     ` utf-8 Peter Stephenson
  2014-12-18 18:05     ` utf-8 ZyX
  0 siblings, 2 replies; 32+ messages in thread
From: Ray Andrews @ 2014-12-18 17:36 UTC (permalink / raw)
  To: zsh-users

On 12/18/2014 01:25 AM, Peter Stephenson wrote:

Mikael, Peter:

> Chapter 5 of the FAQ is the best place to start. You can see this 
> online at http://zsh.sourceforge.net/FAQ/zshfaq05.html#l52. The 
> version in Etc of the source is newer but I don't think there are 
> significant differences. pws 

Very nicely written. That's exactly what I wanted to learn.  And tho I
knew it previously, I had semi forgotten the difference between unicode
and utf-8, which led to the fuzzy question. To ask it again more
accurately, where are extended unicode characters permitted? Or perhaps
that's better reversed, where are they *not* permitted? Can a variable
have a name beyond ASCII? I see that zsh is transparent to utf-8
everywhere, but that does not presume that one has use of the entire
unicode charset in all situations.


* Re: utf-8
  2014-12-17 18:05 utf-8 Ray Andrews
  2014-12-17 20:31 ` utf-8 ZyX
@ 2014-12-18  9:25 ` Peter Stephenson
  2014-12-18 17:36   ` utf-8 Ray Andrews
  1 sibling, 1 reply; 32+ messages in thread
From: Peter Stephenson @ 2014-12-18  9:25 UTC (permalink / raw)
  To: Zsh Users

On Wed, 17 Dec 2014 10:05:27 -0800
Ray Andrews <rayandrews@eastlink.ca> wrote:
> When we talk about utf-8 and zsh, what is the relevance of that?

Chapter 5 of the FAQ is the best place to start.  You can see this
online at http://zsh.sourceforge.net/FAQ/zshfaq05.html#l52.  The version
in Etc of the source is newer but I don't think there are significant
differences.

pws


* Re: utf-8
  2014-12-18  6:48     ` utf-8 Павлов Николай Александрович
@ 2014-12-18  9:25       ` Mikael Magnusson
  0 siblings, 0 replies; 32+ messages in thread
From: Mikael Magnusson @ 2014-12-18  9:25 UTC (permalink / raw)
  To: Павлов
	Николай
	Александрович
  Cc: Ray Andrews, Zsh Users

On Thu, Dec 18, 2014 at 7:48 AM, Павлов Николай Александрович
<kp-pav@yandex.ru> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA512
>
> On December 18, 2014 3:39:56 AM EAT, Ray Andrews <rayandrews@eastlink.ca> wrote:
>>On 12/17/2014 12:31 PM, ZyX wrote:
>>
>>
>>ZyX,
>>> It looks like it is the following: - Explicit support in RE patterns.
>>
>>> - COMBINING_CHARS option that tells zsh that terminal is able to
>>display
>>... I did some reading, but it's too 'zoomed in' for me, it presumes
>>one
>>already more  or less knows what's going on.  I don't.
>
> Your question is too broad to give more detailed answer and the intent is not clear. You are also posting to zsh users and developers mainly live in zsh workers, reading users with lower priority. I know some internals of zsh (not the part you are requesting though) and know some "dark corners" of unicode processing in general, but I cannot give more detailed explanation without knowing what you are after.

All mails to zsh-users are automatically sent to subscribers of
zsh-workers as well. The main issue with non-singlebyte encodings is
that almost all the code used to assume that one byte equals one
character equals one on-screen character cell. This took a couple of
years to fix, but is more or less done now. There is nothing specific
to UTF-8 in the code as far as I know, except in getkeystring, but
that looks more like an optimization to avoid calling iconv(). Eg, zsh
works fine if you run under EUC-JP too, but then you can of course
only type japanese characters (and the ascii set).

Most of what Pavlov (if my Cyrillic isn't too rusty) said applies to
Unicode, which is a character set, rather than to UTF-8, which is a
character encoding. All the Unicode things should work fine in any
encoding/character set, assuming the character you want exists in it.
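
The character set versus encoding distinction can be sketched in a few lines of Python (not zsh). One and the same Unicode character, HIRAGANA LETTER A, is turned into different byte sequences by UTF-8 and by EUC-JP. The helper `encodings_of` is a made-up name for illustration; the codec names are the ones Python ships with.

```python
def encodings_of(ch, codecs=("utf-8", "euc_jp")):
    """Encode one character with several codecs and return the raw bytes.

    Hypothetical helper for illustration: the character (code point) is
    the same in every case, only the byte representation differs.
    """
    return {codec: ch.encode(codec) for codec in codecs}


hiragana_a = "\u3042"  # HIRAGANA LETTER A
for codec, raw in encodings_of(hiragana_a).items():
    print(f"{codec:7} {len(raw)} bytes  {raw.hex(' ')}")
```

This is also why code that assumes one byte per character breaks under either encoding, just with different byte counts.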

-- 
Mikael Magnusson


* Re: utf-8
  2014-12-18  0:39   ` utf-8 Ray Andrews
@ 2014-12-18  6:48     ` Павлов Николай Александрович
  2014-12-18  9:25       ` utf-8 Mikael Magnusson
  0 siblings, 1 reply; 32+ messages in thread
From: Павлов Николай Александрович @ 2014-12-18  6:48 UTC (permalink / raw)
  To: Ray Andrews, Zsh Users

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

On December 18, 2014 3:39:56 AM EAT, Ray Andrews <rayandrews@eastlink.ca> wrote:
>On 12/17/2014 12:31 PM, ZyX wrote:
>
>
>ZyX,
>> It looks like it is the following: - Explicit support in RE patterns.
>
>> - COMBINING_CHARS option that tells zsh that terminal is able to
>display
>... I did some reading, but it's too 'zoomed in' for me, it presumes
>one
>already more  or less knows what's going on.  I don't.

Your question is too broad to give a more detailed answer, and the intent is not clear. You are also posting to zsh-users, while developers mainly live in zsh-workers and read zsh-users with lower priority. I know some internals of zsh (though not the part you are asking about) and some "dark corners" of Unicode processing in general, but I cannot give a more detailed explanation without knowing what you are after.
-----BEGIN PGP SIGNATURE-----
Version: APG v1.1.1

iQJNBAEBCgA3BQJUkni1MBwfMDI7PjIgHTg6PjswOSAQOzU6QTA9NEA+MjhHIDxr
cC1wYXZAeWFuZGV4LnJ1PgAKCRBu+P2/AXZZImtuEACkB8Jl9eFkpQuscxrvJ6Ep
IuqgVKFNT75YlNVod6LBEqIdXg4uM8lctgwvcA+2s/19hPgOvpe+t7TGQM4n7OVw
TpVArqG4S9WTiAL1ml4uXbA2bkpxw9IrAqtq9J8KigkwhKKcdFhguDph7Boe/CIU
NCysyUHLPE5e2rwHJ1+6HiLkw2rRijj5Ki2ku+Re8uQVm1DuFYa+EWihEzNKMesm
JDb8pllNXd2iK4D4865Uy2YKdvtIPBXSfFhmL/OcSUNowLwhMh7Tg15B2XAY9dDh
vTKmnewq/xUWJvAdOs8hT5FHUvN+cC+KzXdrTAPM/D8VDIWz90oZfHBgqzWvWNO7
HvCr4FaOFG/FCZ4WryZ4IKd+WSrCZ8JRTMG6L5+oHz1duBaLno7ArE0/D4epj+kc
zUREFtQFrHnVEGNkFpKouehQ9/CVBBqmUxzvrnnI7SEL9W/SLIz/a4mTvyuz+Ade
/L/yaD+5ZM1Cq0ySOp93ZZmnq8Z9+u8a6mZbRCBG7WpEB/Aaj+I6YXAah9ueJ47N
C/y1mPVYJpluGMKDRWUIApIZuSA2UeodcUq8IEjZsA7PiEaQ0Krowihpp+JUJtj3
ZXJL+Y4ufAk/GPAMjmRgZroVGy5EL9Q1+ybgtEODK6YiAu7OqUVdOPnYcuiIREos
PSDwbPC4GbJyQ0Y/k+xxIw==
=/842
-----END PGP SIGNATURE-----



* Re: utf-8
  2014-12-17 20:31 ` utf-8 ZyX
@ 2014-12-18  0:39   ` Ray Andrews
  2014-12-18  6:48     ` utf-8 Павлов Николай Александрович
  0 siblings, 1 reply; 32+ messages in thread
From: Ray Andrews @ 2014-12-18  0:39 UTC (permalink / raw)
  To: ZyX, Zsh Users

On 12/17/2014 12:31 PM, ZyX wrote:


ZyX,
> It looks like it is the following: - Explicit support in RE patterns. 
> - COMBINING_CHARS option that tells zsh that terminal is able to display 
... I did some reading, but it's too 'zoomed in' for me, it presumes one 
already more  or less knows what's going on.  I don't.



* Re: utf-8
  2014-12-17 18:05 utf-8 Ray Andrews
@ 2014-12-17 20:31 ` ZyX
  2014-12-18  0:39   ` utf-8 Ray Andrews
  2014-12-18  9:25 ` utf-8 Peter Stephenson
  1 sibling, 1 reply; 32+ messages in thread
From: ZyX @ 2014-12-17 20:31 UTC (permalink / raw)
  To: Ray Andrews, Zsh Users

17.12.2014, 21:37, "Ray Andrews" <rayandrews@eastlink.ca>:
> When we talk about utf-8 and zsh, what is the relevance of that?  I mean
> what/when/where is zsh concerned with character encoding?  Filenames I
> guess, and inside strings too, perhaps? Not in zsh syntax itself I
> presume.  I guess that any data stream would/could be utf-8erized.
> Anywhere else?  Or is this something where I'm not even asking the right
> question?

You can check out explicit `utf-8` support by searching for `(?i)utf-?8|unicode` in `man zshall`.

It looks like it is the following:

- Explicit support in RE patterns.
- COMBINING_CHARS option that tells zsh that terminal is able to display combining characters correctly (i.e. when calculating width zsh should assume that combining characters are joined with non-combining ones and thus are effectively zero cells wide).
- MULTIBYTE option that affects string indexing and string length calculations, as well as the `${(#)SOME_INTEGER_THAT_IS_GREATER_THAN_127}` parameter expansion flag.
- `$'\uXXXX'` and `$'\UXXXXXXXX'`.
- Width calculations for unicode characters with East Asian width property equal to F and W (i.e. fullwidth or double-width characters).
- `insert-unicode-char` widget.

Otherwise zsh supports whatever encoding the system locale specifies (which may or may not be UTF-8), not UTF-8 specifically.

// Note: I did not actually check the code, I only checked the documentation.
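
To make the difference between bytes, characters and screen cells in the list above concrete, here is a minimal sketch in Python (not zsh). The helpers `cell_width` and `measure` are hypothetical names; the width rule is a simplification (0 cells for combining marks, 2 for East Asian Wide or Fullwidth, 1 otherwise) and does not reproduce zsh's or `wcwidth`'s exact tables.

```python
import unicodedata


def cell_width(text):
    """Rough terminal width of a string, in character cells.

    Simplifying assumptions: combining marks take 0 cells, characters
    with East Asian Width W or F take 2 cells, everything else takes 1.
    """
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # joined onto the previous character
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


def measure(text):
    """Return (UTF-8 byte count, character count, approximate cell width)."""
    return len(text.encode("utf-8")), len(text), cell_width(text)


samples = {
    "precomposed": "T\u00e9l\u00e9",   # Télé with U+00E9
    "combining": "e\u0301",            # e + COMBINING ACUTE ACCENT
    "wide": "\u65e5\u672c",            # two CJK ideographs
    "fullwidth": "\uff11",             # FULLWIDTH DIGIT ONE
}
for label, text in samples.items():
    print(label, measure(text))
```

The three numbers differ for every sample except plain ASCII, and that gap is what the MULTIBYTE and COMBINING_CHARS options and the width handling are there to account for.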

Also note that it would be very, very strange if zsh assumed filenames were in any particular encoding. File systems usually store filenames as plain byte strings that simply cannot contain certain bytes: on a POSIX filesystem the only forbidden ones are `/` (because it is the directory separator) and `\0` (because C strings are zero-terminated, which makes it almost impossible to support). Any sane language knows that a filename is a zero-terminated, `/`-separated byte string (with some additional assumptions if it intends to run on Windows) *and nothing beyond that*. It cannot even assume that `abc/./../def` can be transformed to `def`: this is not true in general, so such normalization is only ever done explicitly. (Note: Python-3 is *not* sane.)
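
The point about `abc/./../def` can be checked with a small experiment. The sketch below uses Python purely as a convenient scripting tool on a POSIX system, and the directory layout and the name `lexical_vs_real` are invented for the example. It makes `abc` a symlink to `real/sub` and then compares string-only normalization with what the filesystem resolves.

```python
import os
import tempfile


def lexical_vs_real(root):
    """Resolve root/abc/./../def in two ways.

    Hypothetical layout: root/abc is a symlink to root/real/sub.
    Returns (path after pure string normalization, path after letting
    the filesystem follow the symlink).
    """
    os.makedirs(os.path.join(root, "real", "sub"))
    os.symlink(os.path.join("real", "sub"), os.path.join(root, "abc"))
    path = os.path.join(root, "abc", ".", "..", "def")
    lexical = os.path.normpath(path)  # edits the string only
    real = os.path.realpath(path)     # follows abc before applying ".."
    return lexical, real


# On disk a filename is just bytes; here they are produced by UTF-8.
name_bytes = "T\u00e9l\u00e9".encode("utf-8")
print(len(name_bytes), name_bytes)

with tempfile.TemporaryDirectory() as root:
    lexical, real = lexical_vs_real(root)
    print(os.path.relpath(lexical, root))
    print(os.path.relpath(real, os.path.realpath(root)))
```

The string-only version ends up at `def`, while the filesystem ends up at `real/def`, so collapsing `..` without asking the filesystem silently changes which file is meant.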


* utf-8
@ 2014-12-17 18:05 Ray Andrews
  2014-12-17 20:31 ` utf-8 ZyX
  2014-12-18  9:25 ` utf-8 Peter Stephenson
  0 siblings, 2 replies; 32+ messages in thread
From: Ray Andrews @ 2014-12-17 18:05 UTC (permalink / raw)
  To: Zsh Users

When we talk about utf-8 and zsh, what is the relevance of that?  I mean 
what/when/where is zsh concerned with character encoding?  Filenames I 
guess, and inside strings too, perhaps? Not in zsh syntax itself I 
presume.  I guess that any data stream would/could be utf-8erized. 
Anywhere else?  Or is this something where I'm not even asking the right 
question?


Thread overview: 32+ messages
     [not found] <BC9BC140-F1A5-11D5-BA73-000393164560@mas.ecp.fr>
2001-12-18 16:51 ` UTF-8 Oliver Kiddle
2014-12-17 18:05 utf-8 Ray Andrews
2014-12-17 20:31 ` utf-8 ZyX
2014-12-18  0:39   ` utf-8 Ray Andrews
2014-12-18  6:48     ` utf-8 Павлов Николай Александрович
2014-12-18  9:25       ` utf-8 Mikael Magnusson
2014-12-18  9:25 ` utf-8 Peter Stephenson
2014-12-18 17:36   ` utf-8 Ray Andrews
2014-12-18 17:48     ` utf-8 Peter Stephenson
2014-12-18 18:14       ` utf-8 Ray Andrews
2014-12-18 18:22         ` utf-8 ZyX
2014-12-18 18:05     ` utf-8 ZyX
2014-12-18 18:41       ` utf-8 Ray Andrews
2014-12-18 18:52         ` utf-8 ZyX
2014-12-18 20:04           ` utf-8 Ray Andrews
2014-12-18 20:12             ` utf-8 Peter Stephenson
2014-12-18 20:52             ` utf-8 ZyX
2014-12-18 21:15               ` utf-8 Ray Andrews
2014-12-18 21:38                 ` utf-8 ZyX
2014-12-18 23:55                   ` utf-8 Ray Andrews
2014-12-19  2:04                     ` utf-8 Bart Schaefer
2014-12-19  2:27                       ` utf-8 Ray Andrews
2014-12-19  2:32                         ` utf-8 Mikael Magnusson
2014-12-19  2:45                         ` utf-8 Bart Schaefer
2014-12-19  6:34                           ` utf-8 Ray Andrews
2014-12-19  7:02                             ` utf-8 Bart Schaefer
2014-12-19 17:04                               ` utf-8 Ray Andrews
2014-12-19 22:06                                 ` utf-8 ZyX
2014-12-19  7:29                             ` utf-8 Павлов Николай Александрович
2014-12-19  3:50                         ` utf-8 Lawrence Velázquez
2014-12-19  5:24                         ` utf-8 Павлов Николай Александрович
2014-12-19  5:18                     ` utf-8 Павлов Николай Александрович
