zsh-workers
* Problems with non-ascii filenames
@ 2009-02-28  9:37 İsmail Dönmez
  2009-02-28 18:22 ` Bart Schaefer
  2009-02-28 21:28 ` Andrey Borzenkov
  0 siblings, 2 replies; 12+ messages in thread
From: İsmail Dönmez @ 2009-02-28  9:37 UTC (permalink / raw)
  To: zsh workers

Hi,

Using latest zsh CVS on OSX 10.5.6, observe :

[~]> touch xöööx

[~]> echo xo<0308>o<0308>o<0308>x
xöööx

Somehow "ö" character is replaced by <0308> while tab completing. Any
help is appreciated.

Regards.

-- 
İsmail DÖNMEZ



* Re: Problems with non-ascii filenames
  2009-02-28  9:37 Problems with non-ascii filenames İsmail Dönmez
@ 2009-02-28 18:22 ` Bart Schaefer
  2009-02-28 19:35   ` Andrey Borzenkov
  2009-02-28 21:28 ` Andrey Borzenkov
  1 sibling, 1 reply; 12+ messages in thread
From: Bart Schaefer @ 2009-02-28 18:22 UTC (permalink / raw)
  To: ismail, zsh workers

On Feb 28, 11:37am, Ismail wrote:
} 
} Using latest zsh CVS on OSX 10.5.6, observe :
} 
} [~]> touch xÃÃÃx
} 
} [~]> echo xo<0308>o<0308>o<0308>x
} xÃÃÃx
} 
} Somehow "Ã" character is replaced by <0308> while tab completing. Any
} help is appreciated.

The multibyte character handling on OSX appears to be particularly
sensitive to the LANG setting (see my previous mail to Wolfgang).
At the same time, OSX doesn't appear to export a LANG value (or at
least it doesn't on my iMac at work).

I can't precisely reproduce the above; I get things like

schaefer<263> touch x<00c3><00c3><00c3>x

or

schaefer<263> touch xinsert-composed-char:180: character not in range

before I ever get as far as creating the file.  Maybe there's some
additional character munging happening in transit of the email so
I'm not using the correct input.

However, I suggest checking your $LANG value and adjusting it if
necessary.  Tab-completion after LANG= works quite nicely.

Wolfgang, if you're reading this, something that I forgot to mention in
my reply to you is that sometime during 4.3.x zsh began to pay closer
attention to characters that are absent from the declared LANG character
set and to either refuse to process them at all, or to render them as
digits surrounded by angle brackets.  It no longer blindly passes those
characters around unprocessed, so things that "worked" before because
xterm dealt with the processing will now appear to "fail" because the
shell is trying harder to do the right thing internally.



* Re: Problems with non-ascii filenames
  2009-02-28 18:22 ` Bart Schaefer
@ 2009-02-28 19:35   ` Andrey Borzenkov
  2009-02-28 23:06     ` Bart Schaefer
  0 siblings, 1 reply; 12+ messages in thread
From: Andrey Borzenkov @ 2009-02-28 19:35 UTC (permalink / raw)
  To: zsh-workers

On 28 February 2009 21:22:50 Bart Schaefer wrote:
> On Feb 28, 11:37am, Ismail wrote:
> } [~]> echo xo<0308>o<0308>o<0308>x
> } xÃÃÃx
> }
> } Somehow "Ã" character is replaced by <0308> while tab completing.
[...]
> Wolfgang, if you're reading this, something that I forgot to mention
> in my reply to you is that sometime during 4.3.x zsh began to pay
> closer attention to characters that are absent from the declared LANG
> character set and to either refuse to process them at all, or to
> render them as digits surrounded by angle brackets. 

Unfortunately that does not play nicely with PRINTEIGHTBIT. Currently 
the manual states:

PRINT_EIGHT_BIT
     Print eight bit characters literally in completion lists, etc.
     This option is not necessary if your system correctly returns the
     printability of eight bit characters (see man page ctype(3)).


But PRINTEIGHTBIT affects only one function, (wcs_)niceputchar. The 
conversion to <XXXX> still happens directly in zrefresh() if the 
character is deemed unprintable *and* MULTIBYTE_SUPPORT is set.

So either the documentation must be fixed (by clearly stating that this 
option has no effect if zsh is compiled with multibyte support), which 
is bad IMHO, as the average user is not supposed to know the build 
options.

Or the code should be fixed. I am not exactly sure: is it correct that 
the raw character is stuffed into the output buffer? Or are they all 
supposed to go via niceputchar in the first place?
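
To make the mismatch described above concrete, here is a toy model in Python. It is not zsh code: list_display() and line_display() are hypothetical stand-ins for the niceputchar and zrefresh paths, and is_unprintable() is only a rough approximation of whatever test the C library applies. The sketch merely shows how an option honoured in one display path and ignored in the other gives inconsistent output.

```python
import unicodedata

def is_unprintable(ch):
    # Assumption for this sketch: control characters and combining marks
    # count as "unprintable"; the real decision depends on the locale.
    return (not ch.isprintable()) or unicodedata.combining(ch) != 0

def angle_form(ch):
    # Render a character as hex digits in angle brackets, e.g. <0308>.
    return f"<{ord(ch):04x}>"

def list_display(s, print_eight_bit):
    # Hypothetical stand-in for the path that honours the option.
    return "".join(
        ch if print_eight_bit or not is_unprintable(ch) else angle_form(ch)
        for ch in s
    )

def line_display(s):
    # Hypothetical stand-in for the path that ignores the option.
    return "".join(angle_form(ch) if is_unprintable(ch) else ch for ch in s)

name = "xo\u0308x"  # o followed by U+0308 COMBINING DIAERESIS
print(list_display(name, print_eight_bit=False))
print(line_display(name))
```

With print_eight_bit=True the first function passes the string through unchanged, while the second still emits the angle-bracket form.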




* Re: Problems with non-ascii filenames
  2009-02-28  9:37 Problems with non-ascii filenames İsmail Dönmez
  2009-02-28 18:22 ` Bart Schaefer
@ 2009-02-28 21:28 ` Andrey Borzenkov
  2009-02-28 22:28   ` İsmail Dönmez
  1 sibling, 1 reply; 12+ messages in thread
From: Andrey Borzenkov @ 2009-02-28 21:28 UTC (permalink / raw)
  To: zsh-workers; +Cc: İsmail Dönmez

On 28 February 2009 12:37:00 İsmail Dönmez wrote:
> Hi,
>
> Using latest zsh CVS on OSX 10.5.6, observe :
>
> [~]> touch xöööx
>
> [~]> echo xo<0308>o<0308>o<0308>x
> xöööx
>
> Somehow "ö" character is replaced by <0308> while tab completing. Any
> help is appreciated.
>

No, it is not replaced. U+0308 is the combining diacritical mark for the 
umlaut (COMBINING DIAERESIS). Unfortunately, Unicode allows several 
different representations of the same character.

I believe this specific issue was already mentioned in the past w.r.t 
MacOS - it seems to prefer combining characters.

Try setting combiningchars - this could help.
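
The two representations can be inspected with a short Python sketch using the standard unicodedata module. The variable names are only for illustration; the strings correspond to the file name from the original report in its precomposed and decomposed forms.

```python
import unicodedata

precomposed = "x\u00f6\u00f6\u00f6x"    # each ö is the single code point U+00F6
decomposed = "xo\u0308o\u0308o\u0308x"  # each ö is o followed by U+0308

# Both strings look the same on screen but are different sequences.
print(precomposed == decomposed)
print(len(precomposed), len(decomposed))

# Normalizing to a common form makes them compare equal:
# NFC composes where possible, NFD decomposes.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
print(unicodedata.normalize("NFD", precomposed) == decomposed)

# The Unicode name of the code point shown as <0308>.
print(unicodedata.name("\u0308"))
```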


* Re: Problems with non-ascii filenames
  2009-02-28 21:28 ` Andrey Borzenkov
@ 2009-02-28 22:28   ` İsmail Dönmez
  2009-03-01  2:15     ` Vincent Lefevre
  0 siblings, 1 reply; 12+ messages in thread
From: İsmail Dönmez @ 2009-02-28 22:28 UTC (permalink / raw)
  To: Andrey Borzenkov; +Cc: zsh-workers, Bart Schaefer

Hi,

On Sat, Feb 28, 2009 at 11:28 PM, Andrey Borzenkov <arvidjaar@gmail.com> wrote:
> On 28 February 2009 12:37:00 İsmail Dönmez wrote:
>> Hi,
>>
>> Using latest zsh CVS on OSX 10.5.6, observe :
>>
>> [~]> touch xöööx
>>
>> [~]> echo xo<0308>o<0308>o<0308>x
>> xöööx
>>
>> Somehow "ö" character is replaced by <0308> while tab completing. Any
>> help is appreciated.
>>
>
> No, it is not replaced. 0308 is combining diacritical mark for umlaut.
> Unfortunately, UNICODE allows for several different representations of
> the same character.
>
> I believe this specific issue was already mentioned in the past w.r.t
> MacOS - it seems to prefer combining characters.
>
> Try setting combiningchars - this could help.
>

Setting LANG to anything didn't help; setting combiningchars helps, kinda:

touch xööx
echo x<tab>

completes to xööx

echo xö<tab>

does nothing.

Seems weird.

Regards.
-- 
İsmail DÖNMEZ



* Re: Problems with non-ascii filenames
  2009-02-28 19:35   ` Andrey Borzenkov
@ 2009-02-28 23:06     ` Bart Schaefer
  0 siblings, 0 replies; 12+ messages in thread
From: Bart Schaefer @ 2009-02-28 23:06 UTC (permalink / raw)
  To: zsh-workers

On Feb 28, 10:35pm, Andrey Borzenkov wrote:
}
} > Wolfgang, if you're reading this, something that I forgot to mention
} > in my reply to you is that sometime during 4.3.x zsh began to pay
} > closer attention to characters that are absent from the declared LANG
} > character set and to either refuse to process them at all, or to
} > render them as digits surrounded by angle brackets. 
} 
} Unfortunately that does not play nicely with PRINTEIGHTBIT.

Well, yes and no.  One could argue that PRINTEIGHTBIT doesn't mean
"print multibyte as raw bytes" but rather "if the character set is
not multibyte then print characters even if the high bit is set."

However, I agree that --enable-multibyte should not have the effect
of changing the behavior on single-byte I/O where the high bit happens
to be set.



* Re: Problems with non-ascii filenames
  2009-02-28 22:28   ` İsmail Dönmez
@ 2009-03-01  2:15     ` Vincent Lefevre
  2009-03-01  7:19       ` Mikael Magnusson
                         ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Vincent Lefevre @ 2009-03-01  2:15 UTC (permalink / raw)
  To: İsmail Dönmez; +Cc: Andrey Borzenkov, zsh-workers, Bart Schaefer

On 2009-03-01 00:28:39 +0200, İsmail Dönmez wrote:
> touch xööx
> echo x<tab>
> 
> completes to xööx
> 
> echo xö<tab>
> 
> does nothing.

Same problem under Linux:

  http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=374913

> Seems weird.

Not weird. Normalization insensitivity is not implemented in zsh.
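
A minimal sketch in Python of what normalization-insensitive prefix matching could look like, assuming the name on disk is in decomposed form while the typed prefix is precomposed. complete_prefix() is a hypothetical helper for illustration only and says nothing about how zsh's completion code is structured.

```python
import unicodedata

def nfc(s):
    # Bring a string into Normalization Form C (composed).
    return unicodedata.normalize("NFC", s)

def complete_prefix(prefix, names, normalize=False):
    """Return the names starting with prefix (hypothetical helper)."""
    if normalize:
        key = nfc(prefix)
        return [n for n in names if nfc(n).startswith(key)]
    return [n for n in names if n.startswith(prefix)]

names = ["xo\u0308o\u0308x"]  # file name stored with combining characters
typed = "x\u00f6"             # what the keyboard sends for "xö"

print(complete_prefix("x", names))                    # plain ASCII prefix matches
print(complete_prefix(typed, names))                  # code-point comparison fails
print(complete_prefix(typed, names, normalize=True))  # matches once both sides are normalized
```

The second call returns nothing because U+00F6 is not a prefix of the sequence o, U+0308, which matches the behaviour reported above.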

-- 
Vincent Lefèvre <vincent@vinc17.org> - Web: <http://www.vinc17.org/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.org/blog/>
Work: CR INRIA - computer arithmetic / Arenaire project (LIP, ENS-Lyon)



* Re: Problems with non-ascii filenames
  2009-03-01  2:15     ` Vincent Lefevre
@ 2009-03-01  7:19       ` Mikael Magnusson
  2009-03-01  7:45       ` Andrey Borzenkov
  2009-03-01  8:05       ` Andrey Borzenkov
  2 siblings, 0 replies; 12+ messages in thread
From: Mikael Magnusson @ 2009-03-01  7:19 UTC (permalink / raw)
  To: zsh-workers

2009/3/1 Vincent Lefevre <vincent@vinc17.org>:
> On 2009-03-01 00:28:39 +0200, İsmail Dönmez wrote:
>> touch xööx
>> echo x<tab>
>>
>> completes to xööx
>>
>> echo xö<tab>
>>
>> does nothing.
>
> Same problem under Linux:
>
>  http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=374913
>
>> Seems weird.
>
> Not weird. Normalization insensitivity is not implemented in zsh.

Assuming a filename usually only has one or two accented chars,
couldn't this be handled by using the _approximate completer?

% touch xo<0308>bap
% ls xöbap<tab>

for me this changes the commandline back to ls xo<0308>bap (without
the combining option set).

Unfortunately it seems you can only match the whole filename, not
prefixes, so maybe it's not as useful as I first thought...

-- 
Mikael Magnusson



* Re: Problems with non-ascii filenames
  2009-03-01  2:15     ` Vincent Lefevre
  2009-03-01  7:19       ` Mikael Magnusson
@ 2009-03-01  7:45       ` Andrey Borzenkov
  2009-03-02  3:11         ` Vincent Lefevre
  2009-03-01  8:05       ` Andrey Borzenkov
  2 siblings, 1 reply; 12+ messages in thread
From: Andrey Borzenkov @ 2009-03-01  7:45 UTC (permalink / raw)
  To: İsmail Dönmez, zsh-workers

On 1 March 2009 05:15:16 Vincent Lefevre wrote:
> On 2009-03-01 00:28:39 +0200, İsmail Dönmez wrote:
> > touch xööx
> > echo x<tab>
> >
> > completes to xööx
> >
> > echo xö<tab>
> >
> > does nothing.
>
> Same problem under Linux:
>
>   http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=374913
>

Well ... assume zsh implements equivalence based on character 
normalization, and you have two files named <00E4> ("standard" 
single character ä) and <0061><0308> (a with combining diaeresis). Now 
the completion listing will present you with

ä ä

Which one are you going to select?

It is different from case insensitive file systems where you never can 
get both forms at once (FOO and foo).
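
The ambiguity can be reproduced in a few lines of Python. This is only a sketch: grouping by the NFC form is one possible equivalence, and show_code_points() is a hypothetical helper that labels each candidate with its code points so the two can be told apart.

```python
import unicodedata
from collections import defaultdict

def show_code_points(name):
    # Label a name by its code points, e.g. <0061><0308>.
    return "".join(f"<{ord(ch):04X}>" for ch in name)

names = ["\u00e4", "a\u0308"]  # two distinct names that render the same

# Group names that become identical after normalization.
groups = defaultdict(list)
for name in names:
    groups[unicodedata.normalize("NFC", name)].append(name)

# Any group with more than one member is ambiguous in a plain listing.
for key, members in groups.items():
    if len(members) > 1:
        print([show_code_points(m) for m in members])
```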


* Re: Problems with non-ascii filenames
  2009-03-01  2:15     ` Vincent Lefevre
  2009-03-01  7:19       ` Mikael Magnusson
  2009-03-01  7:45       ` Andrey Borzenkov
@ 2009-03-01  8:05       ` Andrey Borzenkov
  2009-03-02  3:16         ` Vincent Lefevre
  2 siblings, 1 reply; 12+ messages in thread
From: Andrey Borzenkov @ 2009-03-01  8:05 UTC (permalink / raw)
  To: zsh-workers

On 1 March 2009 05:15:16 Vincent Lefevre wrote:
>
> Normalization insensitivity is not implemented in zsh.

This is not applicable to zsh because zsh does not use Unicode in the 
first place. Zsh uses whatever character set and encoding the underlying 
operating system happens to use.

I am not saying it is not doable, just that this is not how it is 
implemented right now.



* Re: Problems with non-ascii filenames
  2009-03-01  7:45       ` Andrey Borzenkov
@ 2009-03-02  3:11         ` Vincent Lefevre
  0 siblings, 0 replies; 12+ messages in thread
From: Vincent Lefevre @ 2009-03-02  3:11 UTC (permalink / raw)
  To: zsh-workers

On 2009-03-01 10:45:01 +0300, Andrey Borzenkov wrote:
> Well ... assuming zsh implements equivalence based on character 
> normalization. And you have two files with names: <00E4> ("standard" 
> single character ä) and <0061><0308> (a with combining diaeresis). Now 
> completion listing will present to you
> 
> ä ä
> 
> Which one are you going to select?

zsh should propose the choice (even though such a FS is badly designed).

> It is different from case insensitive file systems where you never can 
> get both forms at once (FOO and foo).

But you missed the point that on a case-sensitive file system you can
have both forms and you can still choose case-insensitive completion
(very useful...).




* Re: Problems with non-ascii filenames
  2009-03-01  8:05       ` Andrey Borzenkov
@ 2009-03-02  3:16         ` Vincent Lefevre
  0 siblings, 0 replies; 12+ messages in thread
From: Vincent Lefevre @ 2009-03-02  3:16 UTC (permalink / raw)
  To: zsh-workers

On 2009-03-01 11:05:47 +0300, Andrey Borzenkov wrote:
> On 1 March 2009 05:15:16 Vincent Lefevre wrote:
> > Normalization insensitivity is not implemented in zsh.
> 
> This is not applicable to zsh because zsh does not use UNICODE in the 
> first place. Zsh is using whatever character set and encoding underlying 
> operating system happens to use.

But zsh has multibyte support, which is synonymous with Unicode in zsh
(as said in the manual).




