From: "François Pinard" <pinard@iro.umontreal.ca>
Cc: ding@gnus.org
Subject: Re: "Coding system"? Eh?
Date: 09 Sep 1998 22:50:30 +-400
Message-ID: <oq1zpkisuh.fsf@icule.progiciels-bpi.ca>
In-Reply-To: davidk@lysator.liu.se's message of "07 Sep 1998 17:12:40 +0200"
davidk@lysator.liu.se (David Kågedal) writes:
> Lars Magne Ingebrigtsen <larsi@gnus.org> writes:
> > Michael Welsh Duggan <md5i@cs.cmu.edu> writes:
> > > No, not really. A character set is merely a set of characters.
> > > [...] A coding-system is just that: a coding-system.
I'm no specialist, but my impression is that MULE does not make such a
clear separation. Internally, each Mule "character" (I'm not sure of the
terminology) holds information about both the code and its encoding.
> Unicode defines a character set where LATIN-LETTER-A-WITH-UMLAUT has a
> specific number (228 I believe), but Unicode also defines several
> character encodings. There is UCS-2 where all characters occupy two
> bytes. Then there is UTF-8 where most characters can be encoded using
> one byte, while 'ä' needs at least two. Actually, all characters can
> be encoded with, say, three bytes in UTF-8.
You mean, all Unicode characters. ISO 10646 might need more than three,
as UTF-8 is also available for ISO 10646.
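To make the byte counts concrete, here is a minimal Python sketch (not part
of the original exchange). The helper name utf8_length and the sample
characters are illustrative assumptions. Python's codec follows the current
UTF-8 definition, which stops at U+10FFFF and four bytes, so it cannot show
the longer sequences that the full ISO 10646 code space would have needed.

```python
def utf8_length(ch: str) -> int:
    """Return how many bytes the single character ch occupies in UTF-8."""
    return len(ch.encode("utf-8"))


# Hypothetical sample set: ASCII, Latin-1 range, rest of the BMP, beyond the BMP.
samples = ["A", "\u00e4", "\u20ac", "\U00010348"]

for ch in samples:
    # ord() gives the number in the character set; the encoding is a separate step.
    print(f"U+{ord(ch):04X} (decimal {ord(ch)}) -> {utf8_length(ch)} byte(s)")
```

In this sketch the characters inside the BMP take at most three bytes, and
the one outside it takes four.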
> Unicode also defines UTF-7 which is so ugly that I won't say anything
> further about it.
Does Unicode now define UTF-7? It originated from the IETF, and UTF-7
is specifically for MIME contexts, which Unicode does not address.
> Then ISO-10646, which is in principle a superset of Unicode (but does
> not contain any more defined characters) [...]
Some convergence happened, indeed, but the details are a bit more complex.
> also defines UCS-4, where all characters are encoded using four bytes,
> and UTF-16, where all characters are encoded using two bytes.
I do not remember that ISO 10646 introduced UTF-16, I thought it was a
Unicode invention, but once again, I'm no specialist and may easily be
wrong. ISO 10646 redefined the BMP so there is room for UTF-16 coding,
so ISO 10646 is aware of and compatible with Unicode on this. By the way,
UTF-16 encodes characters using either two or four bytes.
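The two-or-four-byte behaviour can be sketched in Python as follows. This is
an illustration only, with hypothetical helper names (surrogate_pair,
utf16_be); it assumes big-endian output without a byte order mark and does
not handle lone surrogate code points.

```python
def surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above the BMP into a high and a low surrogate."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need surrogates")
    offset = code_point - 0x10000          # 20 bits remain to be encoded
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def utf16_be(ch: str) -> bytes:
    """Encode one character as big-endian UTF-16 (two or four bytes)."""
    code_point = ord(ch)
    if code_point < 0x10000:
        # BMP character: a single 16-bit unit.
        return code_point.to_bytes(2, "big")
    high, low = surrogate_pair(code_point)
    return high.to_bytes(2, "big") + low.to_bytes(2, "big")


for ch in ["\u00e4", "\U00010348"]:
    encoded = utf16_be(ch)
    # Cross-check the hand-written encoder against Python's own codec.
    assert encoded == ch.encode("utf-16-be")
    print(f"U+{ord(ch):04X} -> {len(encoded)} bytes: {encoded.hex()}")
```

The surrogate values come from the range that was set aside in the BMP,
which is the "room for UTF-16 coding" mentioned above.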
> > When one talks about character sets (in, say, MIME) one talks about
> > encoded character sets.
One should be aware that MIME and ISO 10646/Unicode use different meanings
for the same terms. I have often seen people hotly debating such things,
without realising they were using definitions from different sources.
> > Abstract character sets aren't all that interesting when fiddling
> > with data. iso-8859-1, which MULE calls a coding system, is something
> > everyone else calls a character set. The same with old-jis and
> > iso-2022-jp.
> ISO 8859-1 is both a character set, and an encoding (one-to-one from
> character to byte), I believe. But I'm not sure how it is defined.
And to make things more confusing, when an encoding is used for only one
character set, there is a trend to not make the distinction, and consider the
encoding itself as a character set. I'm a moderate purist on those things,
yet people finally convinced me that practical considerations should prevail.
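A small Python sketch of why the distinction tends to vanish for ISO 8859-1
(an illustration, not part of the original message; the function name
latin1_is_one_to_one is hypothetical, and it relies on Python's "latin-1"
codec, which maps all 256 byte values):

```python
def latin1_is_one_to_one() -> bool:
    """Check that every byte value decodes to the character with the same number."""
    return all(bytes([b]).decode("latin-1") == chr(b) for b in range(256))


a_umlaut = "\u00e4"  # character number 228

print(latin1_is_one_to_one())
# Same character, same number, two different byte sequences:
print(a_umlaut.encode("latin-1").hex())  # the byte value equals the number
print(a_umlaut.encode("utf-8").hex())    # UTF-8 needs two bytes for it
```

With a one-to-one mapping like this, the number and the byte coincide, so
naming the encoding and naming the character set amount to the same thing in
practice.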
--
François Pinard mailto:pinard@iro.umontreal.ca
Join the free Translation Project! http://www.iro.umontreal.ca/~pinard