Re: "Coding system"? Eh? - François Pinard

Gnus development mailing list

From: "François Pinard" <pinard@iro.umontreal.ca>
Cc: ding@gnus.org
Subject: Re: "Coding system"?  Eh?
Date: 09 Sep 1998 22:50:30 +-400
Message-ID: <oq1zpkisuh.fsf@icule.progiciels-bpi.ca>
In-Reply-To: davidk@lysator.liu.se's message of "07 Sep 1998 17:12:40 +0200"

davidk@lysator.liu.se (David Kågedal) writes:

> Lars Magne Ingebrigtsen <larsi@gnus.org> writes:

> > Michael Welsh Duggan <md5i@cs.cmu.edu> writes:

> > > No, not really.  A character set is merely a set of characters.
> > > [...]  A coding-system is just that: a coding-system.

I'm no specialist, but my impression is that MULE does not make such a
clear separation.  Internally, each Mule "character" (I'm not sure of the
terminology) holds information about both the code and its encoding.

> Unicode defines a character set where LATIN-LETTER-A-WITH-UMLAUT has a
> specific number (228 I believe), but Unicode also defines several
> character encodings.  There is UCS-2 where all characters occupy two
> bytes.  Then there is UTF-8 where most characters can be encoded using
> one byte, while 'ä' needs at least two.  Actually, all characters can
> be encoded with, say, three bytes in UTF-8.

You mean, all Unicode characters.  ISO 10646 might need more than three,
as UTF-8 is also available for ISO 10646.
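
As a rough illustration of those byte counts, here is a small sketch that
assumes Python 3 and its built-in "utf-8" codec.  The helper name and the
sample code points are mine, chosen only for illustration.  Note that this
codec stops at U+10FFFF and so never emits more than four bytes, whereas
the UTF-8 definition for the full ISO 10646 range allowed up to six.

```python
# Sketch: count the bytes UTF-8 needs for a few code points.
def utf8_length(code_point: int) -> int:
    """Return the number of bytes in the UTF-8 encoding of one code point."""
    return len(chr(code_point).encode("utf-8"))

# Hypothetical sample points: ASCII 'A', a-with-umlaut (228),
# a character near the top of the BMP, and the first one beyond the BMP.
SAMPLES = [0x41, 0xE4, 0xFFFD, 0x10000]

for cp in SAMPLES:
    print(f"U+{cp:04X}: {utf8_length(cp)} byte(s)")
```

Everything in the BMP fits in at most three bytes; the first code point
past it already needs four.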

> Unicode also defines UTF-7 which is so ugly that I won't say anything
> further about it.

Does Unicode now define UTF-7?  It originated from the IETF, and UTF-7
is specifically for MIME contexts, which Unicode does not address.

> Then ISO-10646, which is in principle a superset of Unicode (but does
> not contain any more defined characters) [...]

Some convergence happened, indeed, but the details are a bit more complex.

> also defines UCS-4, where all characters are encoded using four bytes,
> and UTF-16, where all characters are encoded using two bytes.

I do not remember that ISO 10646 introduced UTF-16, I thought it was a
Unicode invention, but once again, I'm no specialist and may easily be
wrong.  ISO 10646 redefined the BMP so there is room for UTF-16 coding,
so ISO 10646 is aware of and compatible with Unicode on this.  By the way,
UTF-16 encodes characters using either two or four bytes.
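
To make the "two or four bytes" point concrete, a minimal sketch, again
assuming Python 3 and its built-in "utf-16-be" codec (big-endian, no byte
order mark).  The helper name is hypothetical.

```python
# Sketch: UTF-16 uses one 16-bit unit inside the BMP and a surrogate
# pair (two 16-bit units) for code points beyond it.
def utf16_bytes(code_point: int) -> bytes:
    """Return the big-endian UTF-16 encoding of one code point, without BOM."""
    return chr(code_point).encode("utf-16-be")

in_bmp = utf16_bytes(0xE4)         # a-with-umlaut, inside the BMP
beyond_bmp = utf16_bytes(0x10000)  # first code point beyond the BMP

print(len(in_bmp), in_bmp.hex())          # one 16-bit unit
print(len(beyond_bmp), beyond_bmp.hex())  # high surrogate, then low surrogate
```

The four-byte form is a pair of values taken from the surrogate range
(D800..DFFF), which is the room in the BMP mentioned above.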

> > When one talks about character sets (in, say, MIME) one talks about
> > encoded character sets.

One should be aware that MIME and ISO 10646/Unicode use different meanings
for the same terms.  I have often seen people hotly debating such things,
without realising they were using definitions from different sources.

> > Abstract character sets aren't all that interesting when fiddling
> > with data.  iso-8859-1, which MULE calls a coding system, is something
> > everyone else calls a character set.  The same with old-jis and
> > iso-2022-jp.

> ISO 8859-1 is both a character set, and an encoding (one-to-one from
> character to byte), I believe.  But I'm not sure how it is defined.

And to make things more confusing, when an encoding is used for only one
character set, there is a trend to not make the distinction, and consider the
encoding itself as a character set.  I'm a moderate purist on those things,
yet people finally convinced me that practical considerations should prevail.
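
A small sketch of why the two notions blur so easily for ISO 8859-1,
assuming Python 3 and its "iso-8859-1" codec (which, as far as I know, maps
every byte value 0..255 to the code point of the same number).  The variable
names are only illustrative.

```python
# Sketch: in ISO 8859-1 each byte value is its own character number,
# so "character set" and "encoding" are hard to tell apart.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("iso-8859-1")
code_points = [ord(ch) for ch in decoded]

# The mapping is one-to-one and reversible.
round_trip = decoded.encode("iso-8859-1")

# The same abstract character under two different encodings:
a_umlaut = "\u00e4"  # character number 228
as_latin1 = a_umlaut.encode("iso-8859-1")
as_utf8 = a_umlaut.encode("utf-8")

print(code_points == list(range(256)), round_trip == all_bytes)
print(as_latin1.hex(), as_utf8.hex())
```

With UTF-8 the character number and the bytes differ, so the distinction
becomes visible again.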

-- 
François Pinard                            mailto:pinard@iro.umontreal.ca
Join the free Translation Project!    http://www.iro.umontreal.ca/~pinard
