From mboxrd@z Thu Jan  1 00:00:00 1970
X-Msuck: nntp://news.gmane.io/gmane.emacs.gnus.general/16769
Path: main.gmane.org!not-for-mail
From: davidk@lysator.liu.se (David =?ISO-8859-1?Q?K=E5gedal?=)
Newsgroups: gmane.emacs.gnus.general
Subject: Re: "Coding system"?  Eh?
Date: 07 Sep 1998 17:12:40 +0200
Sender: owner-ding@hpc.uh.edu
Message-ID: <jpd8982byv.fsf@tinget.lysator.liu.se>
References: <m31zpqtups.fsf@sparky.gnus.org> <v1tbtouwmf6.fsf@peoria.mt.cs.cmu.edu> <m3ww7ie31s.fsf@sparky.gnus.org>
NNTP-Posting-Host: coloc-standby.netfonds.no
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
X-Trace: main.gmane.org 1035155587 29776 80.91.224.250 (20 Oct 2002 23:13:07 GMT)
X-Complaints-To: usenet@main.gmane.org
NNTP-Posting-Date: Sun, 20 Oct 2002 23:13:07 +0000 (UTC)
Return-Path: <owner-ding@hpc.uh.edu>
Original-Received: from gizmo.hpc.uh.edu (gizmo.hpc.uh.edu [129.7.102.31])
	by sclp3.sclp.com (8.8.5/8.8.5) with ESMTP id LAA11727
	for <jason@mailhost.sclp.com>; Mon, 7 Sep 1998 11:15:20 -0400 (EDT)
Original-Received: from sina.hpc.uh.edu (sina.hpc.uh.edu [129.7.3.5]) by gizmo.hpc.uh.edu (8.7.6/8.7.3) with ESMTP id JAF23277; Mon, 7 Sep 1998 09:46:26 -0500
Original-Received: by sina.hpc.uh.edu (TLB v0.09a (1.20 tibbs 1996/10/09 22:03:07)); Mon, 07 Sep 1998 10:15:15 -0500 (CDT)
Original-Received: from sclp3.sclp.com (root@sclp3.sclp.com [209.195.19.139]) by sina.hpc.uh.edu (8.7.3/8.7.3) with ESMTP id KAA08547 for <ding@hpc.uh.edu>; Mon, 7 Sep 1998 10:15:04 -0500 (CDT)
Original-Received: from samantha.lysator.liu.se (samantha.lysator.liu.se [130.236.254.202])
	by sclp3.sclp.com (8.8.5/8.8.5) with ESMTP id LAA11714
	for <ding@gnus.org>; Mon, 7 Sep 1998 11:14:46 -0400 (EDT)
Original-Received: from tinget.lysator.liu.se (davidk@tinget.lysator.liu.se [130.236.254.66])
	by samantha.lysator.liu.se (8.8.7/8.8.7) with ESMTP id RAA19673;
	Mon, 7 Sep 1998 17:12:48 +0200 (MET DST)
Original-Received: (from davidk@localhost)
	by tinget.lysator.liu.se (8.8.8/8.8.7) id RAA03233;
	Mon, 7 Sep 1998 17:12:42 +0200 (MET DST)
Original-To: ding@gnus.org
In-Reply-To: Lars Magne Ingebrigtsen's message of "05 Sep 1998 22:07:43 +0200"
Original-Lines: 49
X-Mailer: Gnus v5.6.24/Emacs 19.34
Precedence: list
X-Majordomo: 1.94.jlt7
Xref: main.gmane.org gmane.emacs.gnus.general:16769
X-Report-Spam: http://spam.gmane.org/gmane.emacs.gnus.general:16769

Lars Magne Ingebrigtsen <larsi@gnus.org> writes:

> Michael Welsh Duggan <md5i@cs.cmu.edu> writes:
> 
> > No, not really.  A character set is merely a set of characters.
> > latin-1, etc, are often called character sets because they use the
> > same number of characters as extended ASCII, etc.  A coding-system is
> > just that: a coding-system.  The characters could be encoded any which
> > way (including encrypted!).  For example, old-jis uses escapes around
> > sequences of 7-bit characters.  This is an encoding, which you can
> > display using a character set, but not a character set in and of
> > itself.
> 
> All texts consists of characters (from some character set) encoded
> (using some coding system).  iso-8859-1, for instance, represents the
> character LATIN-LETTER-A-WITH-UMLAUT ("ä") with one byte that contains
> the number 0xe4.  The same letter encoded in a different charset (say,
> Unicode) would occupy two bytes.  Other character sets use multiple
> bytes to represent characters, like iso-2022-jp.

Now you are mixing things.  The phrase "encoded in a different charset
(say, Unicode)" is a semantic error.

Unicode defines a character set where LATIN-LETTER-A-WITH-UMLAUT has a
specific number (228, I believe), but Unicode also defines several
character encodings.  There is UCS-2 where all characters occupy two
bytes.  Then there is UTF-8 where most characters can be encoded using
one byte, while 'ä' needs at least two.  Actually, all characters can
be encoded with, say, three bytes in UTF-8.  Unicode also defines
UTF-7 which is so ugly that I won't say anything further about it.
Then ISO-10646, which is in principle a superset of Unicode (but does
not contain any more defined characters) also defines UCS-4, where all
characters are encoded using four bytes, and UTF-16, where all
characters are encoded using two bytes.

But the character set is always the same, with numbers ranging from 0
to 65535.
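
As a rough illustration of one character number with several
encodings, here is a small sketch.  It assumes Python 3 and its
built-in codecs; the codec names "utf-16-be" and "utf-32-be" are only
stand-ins for UCS-2 and UCS-4, which give the same bytes for this
particular character.

```python
# Sketch only: Python codec names stand in for the encodings discussed.
# "utf-16-be" and "utf-32-be" approximate UCS-2 and UCS-4 for this character.
ch = "\u00e4"  # LATIN SMALL LETTER A WITH DIAERESIS

# The number of the character in the character set.
code_point = ord(ch)
print("character number:", code_point)

# The same character encoded in several different ways.
encoded = {
    name: ch.encode(name)
    for name in ("iso-8859-1", "utf-8", "utf-16-be", "utf-32-be")
}

for name, data in encoded.items():
    print(f"{name:11} {len(data)} byte(s): {data.hex()}")
```

The character number stays at 228 throughout; only the byte
sequences differ (one byte in ISO 8859-1, two in UTF-8, two in the
UCS-2 stand-in, four in the UCS-4 stand-in).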

> When one talks about character sets (in, say, MIME) one talks about
> encoded character sets.  Abstract character sets aren't all that
> interesting when fiddling with data.  iso-8859-1, which MULE calls a
> coding system, is something everyone else calls a character set.  The
> same with old-jis and iso-2022-jp.

ISO 8859-1 is both a character set and an encoding (one-to-one from
character to byte), I believe.  But I'm not sure how it is defined.
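
The one-to-one claim can at least be checked against one
implementation.  The sketch below assumes Python 3, whose
"iso-8859-1" codec follows the IANA-registered charset (control
codes included), so it says nothing certain about how the ISO
standard itself is worded.

```python
# Sketch only: checks Python's "iso-8859-1" codec, not the ISO text itself.
all_bytes = bytes(range(256))

# Decode every possible byte value to a character.
text = all_bytes.decode("iso-8859-1")

# Each byte value should map to the character with the same number.
same_numbers = all(ord(c) == b for c, b in zip(text, all_bytes))

# Encoding the characters again should give back the original bytes.
round_trip = text.encode("iso-8859-1") == all_bytes

print("byte value equals character number:", same_numbers)
print("round trip is lossless:", round_trip)
```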

-- 
David Kågedal        <davidk@lysator.liu.se> http://www.lysator.liu.se/~davidk/