From: davidk@lysator.liu.se (David Kågedal)
Newsgroups: gmane.emacs.gnus.general
Subject: Re: "Coding system"? Eh?
Date: 07 Sep 1998 17:12:40 +0200
In-Reply-To: Lars Magne Ingebrigtsen's message of "05 Sep 1998 22:07:43 +0200"
X-Mailer: Gnus v5.6.24/Emacs 19.34

Lars Magne Ingebrigtsen writes:

> Michael Welsh Duggan writes:
>
> > No, not really. A character set is merely a set of characters.
> > latin-1, etc, are often called character sets because they use the
> > same number of characters as extended ASCII, etc. A coding-system is
> > just that: a coding-system. The characters could be encoded any which
> > way (including encrypted!). For example, old-jis uses escapes around
> > sequences of 7-bit characters. This is an encoding, which you can
> > display using a character set, but not a character set in and of
> > itself.
>
> All texts consists of characters (from some character set) encoded
> (using some coding system). iso-8859-1, for instance, represents the
> character LATIN-LETTER-A-WITH-UMLAUT ("ä") with one byte that contains
> the number 0xe4. The same letter encoded in a different charset (say,
> Unicode) would occupy two bytes. Other character sets use multiple
> bytes to represent characters, like iso-2022-jp.

Now you are mixing things. The phrase "encoded in a different charset
(say, Unicode)" is a semantic error. Unicode defines a character set
where LATIN-LETTER-A-WITH-UMLAUT has a specific number (228, I
believe), but Unicode also defines several character encodings. There
is UCS-2, where all characters occupy two bytes. Then there is UTF-8,
where most characters can be encoded using one byte, while 'ä' needs
at least two. Actually, all characters can be encoded with, say, three
bytes in UTF-8. Unicode also defines UTF-7, which is so ugly that I
won't say anything further about it. Then ISO-10646, which is in
principle a superset of Unicode (but does not contain any more defined
characters), also defines UCS-4, where all characters are encoded using
four bytes, and UTF-16, where all characters are encoded using two
bytes.
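
To make the distinction concrete, here is a minimal sketch, assuming
Python 3 and its codec names. Python has no codec called UCS-2 or
UCS-4, so "utf-16-be" and "utf-32-be" stand in for them here; that
substitution is my assumption, and it yields the same bytes for this
particular character. The character number stays fixed while the bytes
change with the encoding:

```python
# One abstract character, one number in the character set.
ch = "\u00e4"          # the letter a-umlaut
code_point = ord(ch)   # its number in the character set

# The same character under several encodings.
# Assumption: "utf-16-be" and "utf-32-be" stand in for UCS-2 and UCS-4.
encoded = {
    "iso-8859-1": ch.encode("iso-8859-1"),
    "utf-8": ch.encode("utf-8"),
    "utf-16-be": ch.encode("utf-16-be"),
    "utf-32-be": ch.encode("utf-32-be"),
}

print("character number:", code_point)
for name, data in encoded.items():
    # Show how many bytes each encoding needs, and which bytes.
    print(name, len(data), "byte(s):", data.hex())

# Decoding each byte string with its own codec gives back the
# same character: the encoding differs, the character does not.
decoded = {name: data.decode(name) for name, data in encoded.items()}
```

Decoding with the wrong codec (for example, reading the UTF-8 bytes as
iso-8859-1) would yield two unrelated characters instead, which is
exactly why a message has to say which encoding it uses.
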
But the character set is always the same, with numbers ranging from 0
to 65535.

> When one talks about character sets (in, say, MIME) one talks about
> encoded character sets. Abstract character sets aren't all that
> interesting when fiddling with data. iso-8859-1, which MULE calls a
> coding system, is something everyone else calls a character set. The
> same with old-jis and iso-2022-jp.

ISO 8859-1 is both a character set and an encoding (one-to-one from
character to byte), I believe. But I'm not sure how it is defined.

-- 
David Kågedal        http://www.lysator.liu.se/~davidk/