UTF-8 support

zsh-workers

* UTF-8 support
@ 2004-09-30  8:29 David Gómez
  2004-09-30  9:24 ` Peter Stephenson
  2004-10-01 18:41 ` David Gómez
  0 siblings, 2 replies; 10+ messages in thread
From: David Gómez @ 2004-09-30  8:29 UTC (permalink / raw)
  To: Zsh-workers

Hi all ;),

I've been searching the list archives and the zsh documentation for an
answer to the question in the subject, with little success. From what I
understand, there is some kind of partial UTF-8 support in zsh... My
questions are: are there plans to add full UTF-8 support to zsh/zle? Is
somebody currently working on it?

Thanks,

-- 
David Gómez

"The question of whether computers can think is just like the question of
whether submarines can swim." -- Edsger W. Dijkstra


* Re: UTF-8 support
  2004-09-30  8:29 UTF-8 support David Gómez
@ 2004-09-30  9:24 ` Peter Stephenson
  2004-10-01 18:41 ` David Gómez
  1 sibling, 0 replies; 10+ messages in thread
From: Peter Stephenson @ 2004-09-30  9:24 UTC (permalink / raw)
  To: Zsh-workers

David Gómez wrote:
> I've been searching the list archives and the zsh documentation for an
> answer to the question in the subject, with little success. From what I
> understand, there is some kind of partial UTF-8 support in zsh... My
> questions are: are there plans to add full UTF-8 support to zsh/zle? Is
> somebody currently working on it?

Unfortunately, we get lots and lots of questions like this, but nobody
has offered to take the (probably considerable) time to do it.

-- 
Peter Stephenson <pws@csr.com>                  Software Engineer
CSR Ltd., Science Park, Milton Road,
Cambridge, CB4 0WH, UK                          Tel: +44 (0)1223 692070

**********************************************************************
This email and any files transmitted with it are confidential and
intended solely for the use of the individual or entity to whom they
are addressed. If you have received this email in error please notify
the system manager.

This footnote also confirms that this email message has been swept by
MIMEsweeper for the presence of computer viruses.

www.mimesweeper.com
**********************************************************************


* Re: UTF-8 support
  2004-09-30  8:29 UTF-8 support David Gómez
  2004-09-30  9:24 ` Peter Stephenson
@ 2004-10-01 18:41 ` David Gómez
  2004-10-01 19:46   ` Oliver Kiddle
  1 sibling, 1 reply; 10+ messages in thread
From: David Gómez @ 2004-10-01 18:41 UTC (permalink / raw)
  To: Zsh-workers

Hi Peter ;),

> > I've been searching the list archives and the zsh documentation for an
> > answer to the question in the subject, with little success. From what I
> > understand, there is some kind of partial UTF-8 support in zsh... My
> > questions are: are there plans to add full UTF-8 support to zsh/zle? Is
> > somebody currently working on it?
>
> Unfortunately, we get lots and lots of questions like this, but nobody
> has offered to take the (probably considerable) time to do it.

So I conclude from your response that nobody is working on it ;).
I understand the time problem; everybody is short on time, including
myself. The thing is that I have zero knowledge of the zsh source,
and this seems a good time to start reading it ;). I don't want to go
back to readline hell ;) just to have full UTF-8 support.

But I need help knowing where to start. Which parts of zsh would need
to be worked on? Only zle? Is there already some kind of support for
UTF-8, however minimal? Also, if you know of any documentation on zsh
internals besides the source ;), please point me to it.

Thanks,

-- 
David Gómez

"The question of whether computers can think is just like the question of
whether submarines can swim." -- Edsger W. Dijkstra


* Re: UTF-8 support
  2004-10-01 18:41 ` David Gómez
@ 2004-10-01 19:46   ` Oliver Kiddle
  2004-10-04 16:08     ` David Gómez
  2004-10-04 16:20     ` Peter Stephenson
  0 siblings, 2 replies; 10+ messages in thread
From: Oliver Kiddle @ 2004-10-01 19:46 UTC (permalink / raw)
  To: David Gómez; +Cc: Zsh-workers

David Gómez wrote:
> So I conclude from your response that nobody is working on it ;).
> I understand the time problem; everybody is short on time, including

Nothing has been done. A few people may have done some work that was
never posted. I got as far as reading up, thinking about what the right
approach would be, and adding support for things like the following,
which prints a character given its Unicode code point:
  echo '\u20ac'
It seemed a good point to start because it'll be useful for testing.
Unfortunately, I'm very short on time for the rest of this year.
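
[Editor's sketch] To make concrete what an escape such as '\u20ac' has to
produce in a UTF-8 locale, here is a minimal C sketch of encoding one code
point as UTF-8. It is an illustration only: utf8_encode is a hypothetical
helper name and is not taken from the zsh source.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8 into out (room for 4 bytes).
 * Returns the number of bytes written, or 0 if cp cannot be encoded
 * (a UTF-16 surrogate, or a value above U+10FFFF).
 * Hypothetical helper for illustration, not a zsh function. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                      /* 7-bit ASCII: one byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* two bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)     /* surrogates are not characters */
        return 0;
    if (cp < 0x10000) {                   /* three bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For U+20AC (the euro sign) this yields the three bytes e2 82 ac.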

> But I need help to know where to start. What parts of zsh would need
> to be worked on, only zle? Is there already, some kind of, although

Most parts of the source will need work but it is possible to add
support in individual areas. So don't start with completion, find
something simple like the print builtin (in particular -c and -C
options). Builtins in general are simple because they are relatively
self-contained. If you try to attack zle first, you'll just get fed up
with it being too hard. Once you've got something simple like print
done, another idea for something simple would be to add a Test/U01 test
and add code to make it search for a UTF-8 locale ($langinfo[CODESET] in
the langinfo module will help) and use it for LC_CTYPE.

> minimal, support for UTF-8? Also, if you know of any documentation
> about zsh internals, besides the source ;), please point me to it.

The source and comments are the only documentation I know of but you can
always ask on the list. Do you know much about unicode/UTF-8? For the
minimum, read http://www.joelonsoftware.com/articles/Unicode.html
and then read http://www.cl.cam.ac.uk/~mgk25/unicode.html

In my opinion it would be sensible to support multibyte encodings in
general and not just UTF-8. Doing this isn't much effort beyond handling
UTF-8 if we assume basic ASCII compatibility and don't worry about
stateful encodings. There are a few characters which are defined to
display as double width even in proportional fonts so keep that in mind.
You can detect whether UTF-8 is enabled with the C library's locale
functions but we shouldn't need to: functions such as mbrlen do all the
work for us.
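
[Editor's sketch] For reference, one way to ask the C library whether the
current locale uses UTF-8 is nl_langinfo(CODESET), the same information
$langinfo[CODESET] exposes in the shell. The sketch below is only an
illustration: codeset_is_utf8 and locale_is_utf8 are hypothetical names,
and the exact spelling of the codeset string varies between systems, so
the two spellings checked here are an assumption rather than a complete
list.

```c
#include <langinfo.h>
#include <string.h>

/* Returns 1 if the codeset name looks like UTF-8, else 0.
 * Only two common spellings are checked (an assumption). */
int codeset_is_utf8(const char *codeset)
{
    return codeset != NULL &&
           (strcmp(codeset, "UTF-8") == 0 || strcmp(codeset, "utf8") == 0);
}

/* Returns 1 if the current LC_CTYPE locale reports a UTF-8 codeset.
 * The caller is expected to have called setlocale(LC_CTYPE, "") first;
 * otherwise the program is still in the "C" locale. */
int locale_is_utf8(void)
{
    return codeset_is_utf8(nl_langinfo(CODESET));
}
```

As the paragraph above says, code built on mbrlen and friends should not
normally need this check at all.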

Once we've got a few basic areas working, we might want to think about
whether there are any common constructs we should create general
functions for in utils.c.

Oliver


* Re: UTF-8 support
  2004-10-01 19:46   ` Oliver Kiddle
@ 2004-10-04 16:08     ` David Gómez
  2004-10-04 16:15       ` Clint Adams
  2004-10-05 11:13       ` Oliver Kiddle
  2004-10-04 16:20     ` Peter Stephenson
  1 sibling, 2 replies; 10+ messages in thread
From: David Gómez @ 2004-10-04 16:08 UTC (permalink / raw)
  To: Oliver Kiddle; +Cc: David Gómez, Zsh-workers

Hi Oliver ;),

> > So I conclude from your response that nobody is working on it ;).
> > I understand the time problem; everybody is short on time, including
> 
> Nothing has been done. A few people may have done some work that was
> never posted.

Is it still possible to find that work ;)?

> I got as far as reading up, thinking about what the right
> approach would be and adding support for stuff like the following to
> print characters given their unicode code point:
>   echo '\u20ac'
> It seemed a good point to start because it'll be useful for testing.

Yes, being able to pass Unicode code points to echo is useful for
testing. I've used it to do some testing myself ;)

> Most parts of the source will need work but it is possible to add
> support in individual areas. So don't start with completion, find
> something simple like the print builtin (in particular -c and -C
> options).

I see: to split its arguments into columns, the print builtin needs to
know the real display width of each one when the input is UTF-8.
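
[Editor's sketch] As an illustration of "real width", the following C
sketch sums the terminal columns of a string by decoding it with mbrtowc
and asking wcwidth for each character. It is a sketch under assumptions,
not print's actual code: display_width is a hypothetical name, the
feature-test macro is what glibc needs to declare wcwidth, and the result
depends on the LC_CTYPE locale in effect.

```c
#define _XOPEN_SOURCE 700   /* must precede all includes; exposes wcwidth */
#include <string.h>
#include <wchar.h>

/* Number of terminal columns s occupies in the current LC_CTYPE locale.
 * Returns -1 if s contains an invalid or incomplete multibyte sequence
 * or a non-printable character. Hypothetical helper, for illustration. */
int display_width(const char *s)
{
    mbstate_t st;
    size_t left = strlen(s);
    int cols = 0;

    memset(&st, 0, sizeof st);            /* start in the initial shift state */
    while (left > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s, left, &st);
        int w;

        if (n == (size_t)-1 || n == (size_t)-2)
            return -1;                    /* invalid or truncated sequence */
        if (n == 0)
            break;                        /* embedded NUL: stop counting */
        w = wcwidth(wc);                  /* 0, 1 or 2 columns; -1 if unprintable */
        if (w < 0)
            return -1;
        cols += w;
        s += n;
        left -= n;
    }
    return cols;
}
```

A column formatter would pad each entry using this value instead of the
byte count returned by strlen.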

> Builtins in general are simple because they are relatively
> self-contained. If you try to attack zle first, you'll just get fed up
> with it being too hard.

I think you're totally right. zle is too hard to start with, given that
I have no experience with the zsh source. I'll take a look at the print
builtin and play with the zsh code a bit to learn more.

> done, another idea for something simple would be to add a Test/U01 test
> and add code to make it search for a UTF-8 locale ($langinfo[CODESET] in
> the langinfo module will help) 

Good, I didn't know about that module ;)

> The source and comments are the only documentation I know of but you can
> always ask on the list.

Thanks, I will ;)

> Do you know much about unicode/UTF-8? For the
> minimum, read http://www.joelonsoftware.com/articles/Unicode.html
> and then read http://www.cl.cam.ac.uk/~mgk25/unicode.html

I knew a bit, but I've been reading your links these past few days and
have refreshed my rusty UTF-8 knowledge ;).

> In my opinion it would be sensible to support multibyte encodings in
> general and not just UTF-8.

I think the point of using UTF-8 is not having to use any other
encodings at all, so in my opinion support for other multibyte
encodings wouldn't be needed. On the other hand, using the mbs*
functions from libc would make it easy to support whatever multibyte
encoding the current locale selects.

> stateful encodings. There are a few characters which are defined to
> display as double width even in proportional fonts so keep that in mind.

In which scripts do these characters occur?

> You can detect whether UTF-8 is enabled with the C library's locale
> functions but we shouldn't need to: functions such as mbrlen do all the
> work for us.

Shouldn't mbrlen and company only be used when a UTF-8 locale is selected?

Thanks,

-- 
David Gómez

"The question of whether computers can think is just like the question of
whether submarines can swim." -- Edsger W. Dijkstra


* Re: UTF-8 support
  2004-10-04 16:08     ` David Gómez
@ 2004-10-04 16:15       ` Clint Adams
  2004-10-05 11:13       ` Oliver Kiddle
  1 sibling, 0 replies; 10+ messages in thread
From: Clint Adams @ 2004-10-04 16:15 UTC (permalink / raw)
  To: Oliver Kiddle, David Gómez, Zsh-workers

> Is it still possible to find that work ;)?

I don't know, but you can find some more information by
going to http://www.zsh.org/cgi-bin/mla/wilma/workers and searching
for things such as "multibyte".



* Re: UTF-8 support
  2004-10-01 19:46   ` Oliver Kiddle
  2004-10-04 16:08     ` David Gómez
@ 2004-10-04 16:20     ` Peter Stephenson
  2004-10-05 11:01       ` Oliver Kiddle
  1 sibling, 1 reply; 10+ messages in thread
From: Peter Stephenson @ 2004-10-04 16:20 UTC (permalink / raw)
  To: Zsh-workers

Oliver Kiddle wrote:
> In my opinion it would be sensible to support multibyte encodings in
> general and not just UTF-8. Doing this isn't much effort beyond handling
> UTF-8 if we assume basic ASCII compatibility and don't worry about
> stateful encodings.

I came to the conclusion that this was going to be very time-consuming --- it
means unmetafying potentially a long string (we don't know where the
characters end) and calling a function every time we want to compare multibyte
characters.  Doing it only for UTF-8 can be optimised to work with
extensions to the current tests; it's simple to test for the length of a
UTF-8 character (although some error checking is also necessary).
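
[Editor's sketch] To illustrate the kind of length test meant here, the
following C sketch derives the length of a UTF-8 sequence from its lead
byte and applies the error checks the encoding requires (continuation
bytes, overlong forms, surrogates, the U+10FFFF limit). utf8_seq_len is a
hypothetical name; this is not code from the zsh source.

```c
#include <stddef.h>

/* Length (1..4) of the UTF-8 sequence starting at s, where n bytes are
 * available, or 0 if the bytes do not start a valid, complete sequence.
 * Hypothetical helper for illustration. */
size_t utf8_seq_len(const unsigned char *s, size_t n)
{
    size_t len, i;

    if (n == 0)
        return 0;
    if (s[0] < 0x80)
        return 1;                         /* plain ASCII */
    else if (s[0] >= 0xC2 && s[0] <= 0xDF)
        len = 2;                          /* 0xC0 and 0xC1 would be overlong */
    else if ((s[0] & 0xF0) == 0xE0)
        len = 3;
    else if (s[0] >= 0xF0 && s[0] <= 0xF4)
        len = 4;
    else
        return 0;                         /* stray continuation or bad lead byte */

    if (n < len)
        return 0;                         /* truncated sequence */
    for (i = 1; i < len; i++)
        if ((s[i] & 0xC0) != 0x80)
            return 0;                     /* not a continuation byte */

    if (s[0] == 0xE0 && s[1] < 0xA0) return 0;   /* overlong 3-byte form */
    if (s[0] == 0xED && s[1] > 0x9F) return 0;   /* UTF-16 surrogate */
    if (s[0] == 0xF0 && s[1] < 0x90) return 0;   /* overlong 4-byte form */
    if (s[0] == 0xF4 && s[1] > 0x8F) return 0;   /* above U+10FFFF */
    return len;
}
```

Unlike mbrlen, this needs no locale state, which is the optimisation being
argued for; the price is that it only ever understands UTF-8.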

Given that the whole point of Unicode is to replace all other schemes,
I'm not so keen about supporting other schemes if it's that much less
efficient.

-- 
Peter Stephenson <pws@csr.com>                  Software Engineer
CSR Ltd., Science Park, Milton Road,
Cambridge, CB4 0WH, UK                          Tel: +44 (0)1223 692070



* Re: UTF-8 support
  2004-10-04 16:20     ` Peter Stephenson
@ 2004-10-05 11:01       ` Oliver Kiddle
  2004-10-05 11:32         ` Peter Stephenson
  0 siblings, 1 reply; 10+ messages in thread
From: Oliver Kiddle @ 2004-10-05 11:01 UTC (permalink / raw)
  To: Zsh-workers

Peter wrote:
> I came to the conclusion that this was going to be very time-consuming --- it
> means unmetafying potentially a long string (we don't know where the
> characters end) and calling a function every time we want to compare multibyte
> characters.  Doing it only for UTF-8 can be optimised to work with
> extensions to the current tests; it's simple to test for the length of a
> UTF-8 character (although some error checking is also necessary).

If you want to find a short string in a long string you can surely
metafy the short string instead of unmetafying the long string.

The approach I was suggesting has the big advantage that we can add
support in isolated areas without first breaking the entire shell.

I think it would be a bad mistake to write our own UTF-8-specific
versions of all the routines that libc already provides, even if we can
make one or two slightly more efficient by handling the meta process at
the same time. And if we're going to restrict the code to UTF-8, we
could ditch the meta stuff and use overcoding. This amounts to storing
the null character as the overlong two-byte sequence c0 80. The code for
that would be a lot simpler, but you can't expect to pass overlong
sequences elsewhere without getting errors. At least UTF-8 allows you to
strchr for 7-bit ASCII characters in a UTF-8 string (other multi-byte
encodings allow this only for /). Can we perhaps change the Meta
character to 0xc0? We could then use overcoding for UTF-8 but make the
UTF-8 specific code much more minimal in the metafy process.
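
[Editor's sketch] The overcoding idea can be illustrated with a few lines
of C: every NUL byte in the input is stored as the overlong pair c0 80, so
the result contains no NUL and can be kept in an ordinary C string.
overcode is a hypothetical name, and this is a sketch of the idea only,
not a proposal for how zsh's metafy code would change.

```c
#include <stddef.h>

/* Copy n bytes from in to out, storing each NUL byte as the overlong
 * sequence C0 80, then append a terminating NUL. out must have room for
 * 2 * n + 1 bytes. Returns the number of bytes written, not counting the
 * terminator. Hypothetical helper for illustration. */
size_t overcode(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t i, o = 0;

    for (i = 0; i < n; i++) {
        if (in[i] == 0) {
            out[o++] = 0xC0;              /* overlong encoding of U+0000 */
            out[o++] = 0x80;
        } else {
            out[o++] = in[i];             /* every other byte is unchanged */
        }
    }
    out[o] = 0;                           /* now the only NUL in the buffer */
    return o;
}
```

As noted above, the pair c0 80 is not valid UTF-8, so it has to be
converted back to a real NUL before the data leaves the shell.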

The most efficient way would be to maintain string lengths, Pascal style
(length in bytes not characters). Possibly even using wchar_t instead
of multi-byte encodings. We could perhaps do that for limited sections
of code such as parameters. That would cope better when someone decides
to change the current locale. If we extend that elsewhere, we need to
be careful if we want to maintain portability of word code files, however.

> Given that the whole point of Unicode is to replace all other schemes,
> I'm not so keen about supporting other schemes if it's that much less
> efficient.

I'm not suggesting supporting alternatives to Unicode but alternatives
to UTF-8. I'd bet that single-byte 8-bit encodings will stick around on
small or embedded systems for longer than you might expect. My main
objection is to any suggestion of not using library calls to handle the
work. mblen may be easy to reimplement but wcwidth is not so we'd end up
with a mixture.

I don't mind so much whether we support other multibyte encodings with
more limited ASCII compatibility than UTF-8. It'd be better to have
limited support than an error message followed by setting LC_CTYPE to
C, though.

Oliver


* Re: UTF-8 support
  2004-10-04 16:08     ` David Gómez
  2004-10-04 16:15       ` Clint Adams
@ 2004-10-05 11:13       ` Oliver Kiddle
  1 sibling, 0 replies; 10+ messages in thread
From: Oliver Kiddle @ 2004-10-05 11:13 UTC (permalink / raw)
  To: Zsh-workers

David =?iso-8859-15?Q?G=F3mez?= wrote:

Seems my mail program needs some work too. :-)

> > stateful encodings. There are a few characters which are defined to
> > display as double width even in proportional fonts so keep that in mind.
> 
> In which scripts do these characters occur?

Chinese and Korean I believe. If you want some examples, try looking at
some of the spam that gets sent to you. I've got some examples at the
moment if you want. We might have to be careful about putting any of
that in our test scripts if we don't want to offer our Chinese users
drugs, inkjets and larger members. :)

> Shouldn't mbrlen and company only be used when a UTF-8 locale is selected?

No. They'll do the right thing if other locales are selected: mbrlen
will return 1 in a single byte locale.
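
[Editor's sketch] A locale-independent character count built on mbrlen
could look like this. count_chars is a hypothetical name; the same loop
works whether the current LC_CTYPE locale is single-byte or multibyte,
which is the point being made above.

```c
#include <string.h>
#include <wchar.h>

/* Number of characters (not bytes) in s under the current LC_CTYPE
 * locale, or (size_t)-1 if s contains an invalid or incomplete
 * multibyte sequence. Hypothetical helper for illustration. */
size_t count_chars(const char *s)
{
    mbstate_t st;
    size_t left = strlen(s);
    size_t count = 0;

    memset(&st, 0, sizeof st);            /* start in the initial shift state */
    while (left > 0) {
        size_t n = mbrlen(s, left, &st);  /* bytes in the next character */

        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;            /* invalid or truncated sequence */
        if (n == 0)
            break;                        /* embedded NUL: stop counting */
        s += n;
        left -= n;
        count++;
    }
    return count;
}
```

In a single-byte locale each call to mbrlen consumes one byte, so the
result equals strlen; after setlocale(LC_CTYPE, "") in a UTF-8 locale the
same code counts multibyte characters.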

Oliver


* Re: UTF-8 support
  2004-10-05 11:01       ` Oliver Kiddle
@ 2004-10-05 11:32         ` Peter Stephenson
  0 siblings, 0 replies; 10+ messages in thread
From: Peter Stephenson @ 2004-10-05 11:32 UTC (permalink / raw)
  To: Zsh-workers

Oliver Kiddle wrote:
> If you want to find a short string in a long string you can surely
> metafy the short string instead of unmetafying the long string.

Both strings are likely to be metafied anyway, internally, but that
doesn't help if you're using the library routines for comparisons, since
they don't know about meta characters; and because you don't know where
a character ends, you also don't know at what byte two characters differ
without using library functions.  Unless you guess where it ends you
need the entire string from the first multibyte character in the
representation used by the library.

Indeed, unless we start with some assumption about the encoding we have
to compare every single character with library functions on an
unmetafied string.  This is very messy if we have to support systems
where the library functions aren't available (and we break quite a lot
unless we do that).  So, while I can't say for sure, I strongly suspect
we're going to end up with having to make some of the assumptions which
are already encoded into the library.  Thus some kind of hybrid is
forced on us for practical reasons.  Given this, I suspect that assuming
UTF-8 and avoiding the library functions where we don't need them is
actually going to be the neatest.  However, this remains to be seen.

I can't see an advantage in assuming UTF-8 and then relying on the
library for comparisons etc.  This seems to give the worst of both
worlds.

> The approach I was suggesting has the big advantage that we can add
> support in isolated areas without first breaking the entire shell.

That can be done however we decide, at least if we keep the current Meta
scheme.  Indeed, that's probably the way to go; we can experiment with
different methods locally before altering the rest of the shell.  The
pattern code is probably the most time-critical for comparing multibyte
characters.  Maybe this is a good time to look at removing the
requirement for NUL-terminated strings after all.

> mblen may be easy to reimplement but wcwidth is not so we'd end up
> with a mixture.

Yes, we certainly need library calls in zle.  However, formatting
strings for interactive output doesn't need to go particularly fast.
As I said, I think that in practice we're stuck with a mixture anyway.

pws


Thread overview: 10+ messages
2004-09-30  8:29 UTF-8 support David Gómez
2004-09-30  9:24 ` Peter Stephenson
2004-10-01 18:41 ` David Gómez
2004-10-01 19:46   ` Oliver Kiddle
2004-10-04 16:08     ` David Gómez
2004-10-04 16:15       ` Clint Adams
2004-10-05 11:13       ` Oliver Kiddle
2004-10-04 16:20     ` Peter Stephenson
2004-10-05 11:01       ` Oliver Kiddle
2004-10-05 11:32         ` Peter Stephenson
