"Coding system"? Eh?

Gnus development mailing list

* "Coding system"?  Eh?
@ 1998-09-05 16:01 Lars Magne Ingebrigtsen
  1998-09-05 16:31 ` Michael Welsh Duggan
  0 siblings, 1 reply; 15+ messages in thread
From: Lars Magne Ingebrigtsen @ 1998-09-05 16:01 UTC (permalink / raw)


Isn't what MULE calls a "coding system" what the entire rest of the
world calls a "character set"?  So `decode-coding-system' should
really have been called `decode-charset'?

MULE is confusing.

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@ifi.uio.no * Lars Magne Ingebrigtsen



* Re: "Coding system"?  Eh?
  1998-09-05 16:01 "Coding system"? Eh? Lars Magne Ingebrigtsen
@ 1998-09-05 16:31 ` Michael Welsh Duggan
  1998-09-05 20:07   ` Lars Magne Ingebrigtsen
  0 siblings, 1 reply; 15+ messages in thread
From: Michael Welsh Duggan @ 1998-09-05 16:31 UTC (permalink / raw)

Lars Magne Ingebrigtsen <larsi@ifi.uio.no> writes:

> Isn't what MULE calls a "coding system" what the entire rest of the
> world call a "character set"?  So `decode-coding-system' should
> really have been called `decode-charset'?

No, not really.  A character set is merely a set of characters.
latin-1, etc, are often called character sets because they use the
same number of characters as extended ASCII, etc.  A coding-system is
just that: a coding-system.  The characters could be encoded any which
way (including encrypted!).  For example, old-jis uses escapes around
sequences of 7-bit characters.  This is an encoding, which you can
display using a character set, but not a character set in and of
itself.
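
To make the distinction concrete, here is a minimal sketch in Python
rather than Emacs Lisp.  It assumes that Python's `iso2022_jp' codec is a
fair stand-in for the escape-based Japanese encoding mentioned above; the
sample string is arbitrary.  The characters are not ASCII, yet every byte
stays within 7 bits because the encoding brackets them with escapes:

```python
# Sketch only: Python's "iso2022_jp" codec stands in for the MULE
# coding system discussed above; the sample text is arbitrary.
text = "日本語"

encoded = text.encode("iso2022_jp")

# The encoding switches into a two-byte character set with ESC $ B
# and switches back to ASCII with ESC ( B at the end.
starts_with_escape = encoded.startswith(b"\x1b$B")
ends_with_escape = encoded.endswith(b"\x1b(B")

# Every byte is 7-bit, although none of the characters is ASCII.
all_seven_bit = all(byte < 0x80 for byte in encoded)

print(encoded, starts_with_escape, ends_with_escape, all_seven_bit)
```

The set of characters is the same whichever encoding carries them; only
the byte-level wrapping differs.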

More information on {decode,encode}-coding-system: The way the
function is handled internally is that it deletes the region and
replaces it with the {de,en}coded text.  This means markers in the
region are screwed.  Regions are still buggy though; they shouldn't
work the way they do currently.  I am looking into how hard it would
be to fix things such that markers at least can be preserved.

-- 
Michael Duggan
(md5i@cs.cmu.edu)
.


* Re: "Coding system"?  Eh?
  1998-09-05 16:31 ` Michael Welsh Duggan
@ 1998-09-05 20:07   ` Lars Magne Ingebrigtsen
  1998-09-05 20:45     ` Hrvoje Niksic
  1998-09-07 15:12     ` David Kågedal
  0 siblings, 2 replies; 15+ messages in thread
From: Lars Magne Ingebrigtsen @ 1998-09-05 20:07 UTC (permalink / raw)

Michael Welsh Duggan <md5i@cs.cmu.edu> writes:

> No, not really.  A character set is merely a set of characters.
> latin-1, etc, are often called character sets because they use the
> same number of characters as extended ASCII, etc.  A coding-system is
> just that: a coding-system.  The characters could be encoded any which
> way (including encrypted!).  For example, old-jis uses escapes around
> sequences of 7-bit characters.  This is an encoding, which you can
> display using a character set, but not a character set in and of
> itself.

All text consists of characters (from some character set) encoded
(using some coding system).  iso-8859-1, for instance, represents the
character LATIN-LETTER-A-WITH-UMLAUT ("ä") with one byte that contains
the number 0xe4.  The same letter encoded in a different charset (say,
Unicode) would occupy two bytes.  Other character sets use multiple
bytes to represent characters, like iso-2022-jp.

When one talks about character sets (in, say, MIME) one talks about
encoded character sets.  Abstract character sets aren't all that
interesting when fiddling with data.  iso-8859-1, which MULE calls a
coding system, is something everyone else calls a character set.  The
same with old-jis and iso-2022-jp.

Or something.  

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@gnus.org * Lars Magne Ingebrigtsen


* Re: "Coding system"?  Eh?
  1998-09-05 20:07   ` Lars Magne Ingebrigtsen
@ 1998-09-05 20:45     ` Hrvoje Niksic
  1998-09-05 21:12       ` Lars Magne Ingebrigtsen
  1998-09-07 15:12     ` David Kågedal
  1 sibling, 1 reply; 15+ messages in thread
From: Hrvoje Niksic @ 1998-09-05 20:45 UTC (permalink / raw)

Lars Magne Ingebrigtsen <larsi@gnus.org> writes:

> iso-8859-1, which MULE calls a coding system, is something everyone
> else calls a character set.  The same with old-jis and iso-2022-jp.

I believe Michael's point was that, under Mule, you can create coding
systems that have nothing to do with character sets, such as a `gzip'
coding-system.  Coding systems are Emacs-specific hybrids between
character sets and their external representation.  This probably makes
them different enough from "character sets" to warrant a separate
name.

-- 
Hrvoje Niksic <hniksic@srce.hr> | Student at FER Zagreb, Croatia
--------------------------------+--------------------------------
Try to use "ad nauseam" at least once per flame. It doesn't mean
anything; but it gives that polished feel to your postings.


* Re: "Coding system"?  Eh?
  1998-09-05 20:45     ` Hrvoje Niksic
@ 1998-09-05 21:12       ` Lars Magne Ingebrigtsen
  1998-09-05 21:47         ` Hrvoje Niksic
  0 siblings, 1 reply; 15+ messages in thread
From: Lars Magne Ingebrigtsen @ 1998-09-05 21:12 UTC (permalink / raw)

Hrvoje Niksic <hniksic@srce.hr> writes:

> I believe Michael's point was that, under Mule, you can create coding
> systems that have nothing to do with character sets, such as a `gzip'
> coding-system.

They do that?  And if the unzipped file results in something that's
iso-2022-jp, do they run it through the decoding twice, or do they have
a gzip-iso-2022-jp coding system, as Morioka almost suggested for
base64?  (base64-iso-2022-jp, etc.)  The latter would be a nightmare,
and the former would be just yucky.

> Coding systems are Emacs-specific hybrids between
> character sets and their external representation.  This probably makes
> them different enough from "character sets" to warrant for a separate
> name.

Hm.  I did a `M-x list-coding-systems', and it listed nothing but
character sets.

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@gnus.org * Lars Magne Ingebrigtsen


* Re: "Coding system"?  Eh?
  1998-09-05 21:12       ` Lars Magne Ingebrigtsen
@ 1998-09-05 21:47         ` Hrvoje Niksic
  0 siblings, 0 replies; 15+ messages in thread
From: Hrvoje Niksic @ 1998-09-05 21:47 UTC (permalink / raw)


Lars Magne Ingebrigtsen <larsi@gnus.org> writes:

> Hrvoje Niksic <hniksic@srce.hr> writes:
> 
> > I believe Michael's point was that, under Mule, you can create
> > coding systems that have nothing to do with character sets, such
> > as a `gzip' coding-system.
> 
> They do that?  And if the unzipped file results in something that's
> iso-2022-jp, do they run it though the decoding twice, or do they
> have a gzip-iso-2022-jp coding system, as Morioka almost suggested
> for base64?  (base64-iso-2022-jp, etc.)  The latter would be a
> nightmare, and the former would be just yucky.

I believe XEmacs/Mule allows you to create coding-system chains (at
least I seem to recall seeing internal code to that effect), the yucky
solution, whereas in FSF Emacs you get the nightmare one.
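
As a rough illustration of the chained approach, here is a sketch in
Python, not in Emacs Lisp and not taken from either Emacs code base.  It
assumes the chain is simply "gunzip, then decode iso-2022-jp"; the helper
name `decode_chain' is made up for this example:

```python
import gzip

# Sketch only: gzip is not a Python text codec, so the chain is spelled
# out as two explicit steps.  The sample text is arbitrary.
text = "日本語"

# Encoding side: characters -> iso-2022-jp bytes -> gzip-compressed bytes.
wire = gzip.compress(text.encode("iso2022_jp"))


def decode_chain(data: bytes) -> str:
    """Undo the layers in reverse order: gunzip first, then decode."""
    return gzip.decompress(data).decode("iso2022_jp")


decoded = decode_chain(wire)
print(decoded == text)
```

The alternative, one combined coding system per pair of layers, would
need a separate name for every combination, which is the "nightmare"
case.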

> > Coding systems are Emacs-specific hybrids between character sets
> > and their external representation.  This probably makes them
> > different enough from "character sets" to warrant for a separate
> > name.
> 
> Hm.  I did a `M-x list-coding-systems', and it listed nothing but
> character sets.

Well, I never said all of this was implemented.  :-)  I was trying to
explain the concept, the way I see it.

-- 
Hrvoje Niksic <hniksic@srce.hr> | Student at FER Zagreb, Croatia
--------------------------------+--------------------------------
Ask not for whom the <CONTROL-G> tolls.



* Re: "Coding system"?  Eh?
  1998-09-05 20:07   ` Lars Magne Ingebrigtsen
  1998-09-05 20:45     ` Hrvoje Niksic
@ 1998-09-07 15:12     ` David Kågedal
  1998-09-09 18:50       ` François Pinard
  2002-10-20 23:13       ` Lars Magne Ingebrigtsen
  1 sibling, 2 replies; 15+ messages in thread
From: David Kågedal @ 1998-09-07 15:12 UTC (permalink / raw)

Lars Magne Ingebrigtsen <larsi@gnus.org> writes:

> Michael Welsh Duggan <md5i@cs.cmu.edu> writes:
> 
> > No, not really.  A character set is merely a set of characters.
> > latin-1, etc, are often called character sets because they use the
> > same number of characters as extended ASCII, etc.  A coding-system is
> > just that: a coding-system.  The characters could be encoded any which
> > way (including encrypted!).  For example, old-jis uses escapes around
> > sequences of 7-bit characters.  This is an encoding, which you can
> > display using a character set, but not a character set in and of
> > itself.
> 
> All texts consists of characters (from some character set) encoded
> (using some coding system).  iso-8859-1, for instance, represents the
> character LATIN-LETTER-A-WITH-UMLAUT ("ä") with one byte that contains
> the number 0xe4.  The same letter encoded in a different charset (say,
> Unicode) would occupy two bytes.  Other character sets use multiple
> bytes to represent characters, like iso-2022-jp.

Now you are mixing things.  The phrase "encoded in a different charset
(say, Unicode)" is a semantic error.

Unicode defines a character set where LATIN-LETTER-A-WITH-UMLAUT has a
specific number (228, I believe), but Unicode also defines several
character encodings.  There is UCS-2 where all characters occupy two
bytes.  Then there is UTF-8 where most characters can be encoded using
one byte, while 'ä' needs at least two.  Actually, all characters can
be encoded with, say, three bytes in UTF-8.  Unicode also defines
UTF-7 which is so ugly that I won't say anything further about it.
Then ISO-10646, which is in principle a superset of Unicode (but does
not contain any more defined characters) also defines UCS-4, where all
characters are encoded using four bytes, and UTF-16, where all
characters are encoded using two bytes.

But the character set is always the same, with numbers ranging from 0
to 65535.
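
A small sketch of this, in Python rather than anything MULE-specific.
The codec names are Python's and are used here as assumptions:
`utf-16-be' stands in for the two-byte form and `utf-32-be' for UCS-4,
both big-endian and without a byte order mark:

```python
# Sketch only: one character, one number, several byte representations.
char = "ä"
code_point = ord(char)  # the character's number in the character set

encodings = ["latin-1", "utf-8", "utf-16-be", "utf-32-be"]
encoded = {name: char.encode(name) for name in encodings}

for name, data in encoded.items():
    print(f"{name:10} {len(data)} byte(s): {data.hex()}")
```

The number stays 228 throughout; only the bytes change with the
encoding.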

> When one talks about character sets (in, say, MIME) one talks about
> encoded character sets.  Abstract character sets aren't all that
> interesting when fiddling with data.  iso-8859-1, which MULE calls a
> coding system, is something everyone else calls a character set.  The
> same with old-jis and iso-2022-jp.

ISO 8859-1 is both a character set, and an encoding (one-to-one from
character to byte), I believe.  But I'm not sure how it is defined.

-- 
David Kågedal        <davidk@lysator.liu.se> http://www.lysator.liu.se/~davidk/


* Re: "Coding system"?  Eh?
  1998-09-07 15:12     ` David Kågedal
@ 1998-09-09 18:50       ` François Pinard
  1998-09-10 12:45         ` David Kågedal
  1998-09-11 16:14         ` Hallvard B Furuseth
  2002-10-20 23:13       ` Lars Magne Ingebrigtsen
  1 sibling, 2 replies; 15+ messages in thread
From: François Pinard @ 1998-09-09 18:50 UTC (permalink / raw)
  Cc: ding

davidk@lysator.liu.se (David Kågedal) écrit:

> Lars Magne Ingebrigtsen <larsi@gnus.org> writes:

> > Michael Welsh Duggan <md5i@cs.cmu.edu> writes:

> > > No, not really.  A character set is merely a set of characters.
> > > [...]  A coding-system is just that: a coding-system.

I'm no specialist, but my impression is that MULE does not make such a
clear separation.  Internally, each Mule "character" (I'm not sure of the
terminology) holds information about both the code and its encoding.

> Unicode defines a character set where LATIN-LETTER-A-WITH-UMLAUT has a
> specific number (228, I believe), but Unicode also defines several
> character encodings.  There is UCS-2 where all characters occupy two
> bytes.  Then there is UTF-8 where most characters can be encoded using
> one byte, while 'ä' needs at least two.  Actually, all characters can
> be encoded with, say, three bytes in UTF-8.

You mean, all Unicode characters.  ISO 10646 might need more than three,
as UTF-8 is also available for ISO 10646.

> Unicode also defines UTF-7 which is so ugly that I won't say anything
> further about it.

Does Unicode now define UTF-7?  It originated from the IETF, and UTF-7
is specifically for MIME contexts, which Unicode does not address.

> Then ISO-10646, which is in principle a superset of Unicode (but does
> not contain any more defined characters) [...]

Some convergence happened, indeed, but the details are a bit more complex.

> also defines UCS-4, where all characters are encoded using four bytes,
> and UTF-16, where all characters are encoded using two bytes.

I do not remember that ISO 10646 introduced UTF-16, I thought it was a
Unicode invention, but once again, I'm no specialist and may easily be
wrong.  ISO 10646 redefined the BMP so there is room for UTF-16 coding,
so ISO 10646 is aware of and compatible with Unicode on this.  By the way,
UTF-16 encodes characters using either two or four bytes.
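
The two-or-four-byte behaviour can be sketched as follows.  This is
Python, written for illustration only; the function name `utf16_units'
and the sample code points are assumptions of this example, and the
sketch does not validate its input (code points in the surrogate range
itself are not rejected):

```python
def utf16_units(code_point: int) -> list:
    """Return the 16-bit code units UTF-16 uses for one code point."""
    if code_point < 0x10000:
        return [code_point]            # one unit: two bytes
    offset = code_point - 0x10000      # 20 bits remain
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return [high, low]                 # two units: four bytes


inside_bmp = utf16_units(0x00E4)    # 'ä'
outside_bmp = utf16_units(0x10400)  # an arbitrary code point above U+FFFF

print([hex(u) for u in inside_bmp], [hex(u) for u in outside_bmp])
```

Two blocks of 1024 reserved values in the BMP thus address 1024 * 1024
additional code points.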

> > When one talks about character sets (in, say, MIME) one talks about
> > encoded character sets.

One should be aware that MIME and ISO 10646/Unicode use different meanings
for the same terms.  I often saw people debating hotly such things,
without realising they were using definitions from different sources.

> > Abstract character sets aren't all that interesting when fiddling
> > with data.  iso-8859-1, which MULE calls a coding system, is something
> > everyone else calls a character set.  The same with old-jis and
> > iso-2022-jp.

> ISO 8859-1 is both a character set, and an encoding (one-to-one from
> character to byte), I believe.  But I'm not sure how it is defined.

And to make things more confusing, when an encoding is used for only one
character set, there is a trend to not make the distinction, and consider the
encoding itself as a character set.  I'm a moderate purist on those things,
yet people finally convinced me that practical considerations should prevail.

-- 
François Pinard                            mailto:pinard@iro.umontreal.ca
Join the free Translation Project!    http://www.iro.umontreal.ca/~pinard


* Re: "Coding system"?  Eh?
  1998-09-09 18:50       ` François Pinard
@ 1998-09-10 12:45         ` David Kågedal
  1998-09-10 20:21           ` Gisle Aas
  1998-09-11  6:16           ` François Pinard
  1998-09-11 16:14         ` Hallvard B Furuseth
  1 sibling, 2 replies; 15+ messages in thread
From: David Kågedal @ 1998-09-10 12:45 UTC (permalink / raw)


François Pinard <pinard@iro.umontreal.ca> writes:

> davidk@lysator.liu.se (David Kågedal) écrit:
> 
> > Unicode defines a character set where LATIN-LETTER-A-WITH-UMLAUT has a
> > specific number (228, I believe), but Unicode also defines several
> > character encodings.  There is UCS-2 where all characters occupy two
> > bytes.  Then there is UTF-8 where most characters can be encoded using
> > one byte, while 'ä' needs at least two.  Actually, all characters can
> > be encoded with, say, three bytes in UTF-8.
> 
> You mean, all Unicode characters.  ISO 10646 might need more than three,
> as UTF-8 is also available for ISO 10646.

True.  I was talking about Unicode.

> > Unicode also defines UTF-7 which is so ugly that I won't say anything
> > further about it.
> 
> Does Unicode now define UTF-7?  It originated from the IETF, and UTF-7
> is specifically for MIME contexts, which Unicode does not address.

I might be wrong about the origin of UTF-7.  But it's still ugly.

> > Then ISO-10646, which is in principle a superset of Unicode (but does
> > not contain any more defined characters) [...]
> 
> Some convergence happened, indeed, but the details are a bit more complex.
> 
> > also defines UCS-4, where all characters are encoded using four bytes,
> > and UTF-16, where all characters are encoded using two bytes.
> 
> I do not remember that ISO 10646 introduced UTF-16, I thought it was a
> Unicode invention, but once again, I'm no specialist and may easily be
> wrong.  ISO 10646 redefined the BMP so there is room for UTF-16 coding,
> so ISO 10646 is aware and compatible with Unicode on this.  By the way,
> UTF-16 encodes characters using either two or four bytes.

The difference between UTF-16 and UCS-2 is that UTF-16 can encode some
of the characters outside the Unicode range (BMP).  So I guess Unicode
has no need for UTF-16.

-- 
David Kågedal        <davidk@lysator.liu.se> http://www.lysator.liu.se/~davidk/



* Re: "Coding system"?  Eh?
  1998-09-10 12:45         ` David Kågedal
@ 1998-09-10 20:21           ` Gisle Aas
  1998-09-11  6:27             ` François Pinard
  1998-09-11  6:16           ` François Pinard
  1 sibling, 1 reply; 15+ messages in thread
From: Gisle Aas @ 1998-09-10 20:21 UTC (permalink / raw)
  Cc: ding


davidk@lysator.liu.se (David Kågedal) writes:

> > You mean, all Unicode characters.  ISO 10646 might need more than three,
> > as UTF-8 is also available for ISO 10646.
> 
> True.  I was talking about Unicode.

Unicode is in sync with ISO 10646.  Also Unicode allocates characters
above U+FFFF.  http://www.unicode.org/unicode/alloc/Pipeline.html.

> The difference between UTF-16 and UCS-2 is that it can encode some of
> the characters outside the Unicode range (BMP).  So I guess Unicode has
> no need for UTF-16.

Unicode 2.x is in a way UTF-16.

-- 
Gisle Aas



* Re: "Coding system"?  Eh?
  1998-09-10 20:21           ` Gisle Aas
@ 1998-09-11  6:27             ` François Pinard
  0 siblings, 0 replies; 15+ messages in thread
From: François Pinard @ 1998-09-11  6:27 UTC (permalink / raw)
  Cc: David Kågedal, ding

Gisle Aas <aas@sn.no> écrit:

> Unicode is in sync with ISO 10646.  Also Unicode allocates characters
> above U+FFFF.  http://www.unicode.org/unicode/alloc/Pipeline.html.

Thanks for the reference, I'll later take a look when I'll be on the net.

> > The difference between UTF-16 and UCS-2 is that it can encode some of
> > the characters outside the Unicode range (BMP).  So I guess Unicode has
> > no need for UTF-16.

> Unicode 2.x is in a way UTF-16.

Yes, I got that feeling, even if I did not buy the books (it becomes
expensive after a while, when you use your own money for it :-).

There is some sadness in all this.  The original idea was to have the
capability of a set of fixed width characters covering all spoken languages.
Look where we are now.  UTF-16 is a variable width code, we have a lot of
combining characters, and various marks for byte order, for directionality,
and so forth.  Many characters are missing (to the point that this creates
problems for me within `recode'), and a non-negligible fraction of Japanese users
are highly irritated by Han unification, and other things.

Many people still think Unicode / ISO 10646 is another step of mankind
towards God, but if you look a bit inside, you'll see that reality has
caught up with progress pretty fast, sadly enough.

-- 
François Pinard                            mailto:pinard@iro.umontreal.ca
Join the free Translation Project!    http://www.iro.umontreal.ca/~pinard


* Re: "Coding system"?  Eh?
  1998-09-10 12:45         ` David Kågedal
  1998-09-10 20:21           ` Gisle Aas
@ 1998-09-11  6:16           ` François Pinard
  1 sibling, 0 replies; 15+ messages in thread
From: François Pinard @ 1998-09-11  6:16 UTC (permalink / raw)
  Cc: ding

davidk@lysator.liu.se (David Kågedal) écrit:

> > Does Unicode now define UTF-7?  It originated from the IETF, and UTF-7
> > is specifically for MIME contexts, which Unicode does not address.

> I might be wrong about the origin of UTF-7.  But it's still ugly.

We are all helping each other, here, it is not that important that we
are wrong or right, as long as we improve.  If UTF-7 has been adopted
by Unicode, I would surely have liked to know, because the `recode'
documentation would then need to be adjusted.

About the ugliness of UTF-7, I agree to a certain extent.  For ding readers
who do not know, UTF-7 is a kind of quoted-printable for characters using
more than 8 bits, and is suited for transmission over 7 bit channels.
Very roughly said, instead of `=', it uses `+', and instead of hexadecimal
values, it uses in-lined Base64.
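
The analogy can be sketched in Python; this is an illustration only and
has nothing to do with how `recode' implements either scheme.  The
standard `quopri' module and the built-in `utf-7' codec are used, and
the sample character is arbitrary:

```python
import quopri

char = "ä"

# Quoted-printable works on bytes: '=' followed by two hexadecimal digits.
qp = quopri.encodestring(char.encode("latin-1"))

# UTF-7 works on characters: '+', then Base64 of the 16-bit value, then '-'.
utf7 = char.encode("utf-7")

print(qp, utf7)
```

Both results consist of 7-bit characters only, which is the whole
point of either scheme.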

I found it a bit painful to write a UTF-7 encoder and decoder, but now that
it's done, the algorithmic ugliness (which is another kind of ugliness)
is all hidden in black boxes, and we might consider that it is not in the
way anymore.  UTF-8 has its elegances, but still it is slightly painful
to write _efficient_ encoders/decoders.

For transmission of Unicode or ISO 10646 message bodies, it looks to me that
we have the choice between UTF-8 and UTF-7.  UCS-2 and UCS-4 are internal
formats not well suited for transmission, UTF-1 is obsolete, and UTF-16
is not much better than UCS-2 for transmission before all machines replace
8-bit bytes with 16-bit bytes, and this will not happen in this century :-).

In fact, we have to look at things with a cold eye, here.  If you do
not have an integrated decoder in Gnus or in your other mail readers,
I would not be sure which of UTF-8 or UTF-7 looks uglier.  UTF-8 would
look like a mix of ASCII and binary dump, UTF-7 would look like ASCII
with fragments of Base64 in it.  I might prefer UTF-7, after all, maybe.
And if you have a decoder well integrated, you do not see the ugliness.
The algorithmic ugliness is hidden once and for all in black boxes anyway,
and then, it does not really matter.

> The difference between UTF-16 and UCS-2 is that it can encode some of
> the characters outside the Unicode range (BMP).  So I guess Unicode has
> no need for UTF-16.

Unicode needs it, because people are beginning to see that 65.000
characters are not as _enough_ as it was once thought (hmph! I suspect
this is strange English :-).  I mean that a few years ago, it was believed
that 65.000 characters would satisfy all our needs for many years,
but relatively soon, people began to see that it is not enough, and that
we need a way to get more characters.  ISO 10646 had much higher goals to
start with, so it did not have that problem.  UTF-16 extends the Unicode
set to around 1.000.000 characters, still much less than ISO 10646, but
yet, much more comfortable than 65.000 -- and ISO 10646 later made room
in its BMP so that the UTF-16 technique could be implemented more simply.  I do
not think ISO 10646 ever needed UTF-16, but it wanted Unicode compatibility.

-- 
François Pinard                            mailto:pinard@iro.umontreal.ca
Join the free Translation Project!    http://www.iro.umontreal.ca/~pinard


* Re: "Coding system"? Eh?
  1998-09-09 18:50       ` François Pinard
  1998-09-10 12:45         ` David Kågedal
@ 1998-09-11 16:14         ` Hallvard B Furuseth
  1 sibling, 0 replies; 15+ messages in thread
From: Hallvard B Furuseth @ 1998-09-11 16:14 UTC (permalink / raw)



[Michael Welsh Duggan]

> No, not really.  A character set is merely a set of characters.
> [...]  A coding-system is just that: a coding-system.

[François Pinard]

> I'm no specialist, but my impression is that MULE does not make such a
> clear separation.  Internally, each Mule "character" (I'm not sure of the
> terminology) holds information about both the code and its encoding.

That's sort of true for *latin-N* character sets in MULE: They have a
"natural" encoding which is equivalent to the character set.  However,
two other Cyrillic coding systems map to (subsets of) the MULE
character set latin-iso8859-9 (that's latin-5).  And it's not that way
for asian MULE character sets, I think even a single MULE character can
have several encodings in the same coding system (iso2022 or whatever).

-- 
Hallvard



* Re: "Coding system"?  Eh?
  1998-09-07 15:12     ` David Kågedal
  1998-09-09 18:50       ` François Pinard
@ 2002-10-20 23:13       ` Lars Magne Ingebrigtsen
  1998-09-09 18:59         ` François Pinard
  1 sibling, 1 reply; 15+ messages in thread
From: Lars Magne Ingebrigtsen @ 2002-10-20 23:13 UTC (permalink / raw)

davidk@lysator.liu.se (David Kågedal) writes:

> ISO 8859-1 is both a character set, and an encoding (one-to-one from
> character to byte), I believe.  But I'm not sure how it is defined.

A character set is an encoding in normal usage.  Quoth RFC2045:

2.2.  Character Set

   The term "character set" is used in MIME to refer to a method of
   converting a sequence of octets into a sequence of characters.  Note
   that unconditional and unambiguous conversion in the other direction
   is not required, in that not all characters may be representable by a
   given character set and a character set may provide more than one
   sequence of octets to represent a particular sequence of characters.

   This definition is intended to allow various kinds of character
   encodings, from simple single-table mappings such as US-ASCII to
   complex table switching methods such as those that use ISO 2022's
   techniques, to be used as character sets.  However, the definition
   associated with a MIME character set name must fully specify the
   mapping to be performed.  In particular, use of external profiling
   information to determine the exact mapping is not permitted.

   NOTE: The term "character set" was originally to describe such
   straightforward schemes as US-ASCII and ISO-8859-1 which have a
   simple one-to-one mapping from single octets to single characters.
   Multi-octet coded character sets and switching techniques make the
   situation more complex. For example, some communities use the term
   "character encoding" for what MIME calls a "character set", while
   using the phrase "coded character set" to denote an abstract mapping
   from integers (not octets) to characters.
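
Read that way, a MIME character set is roughly what a decode call does.
Here is a minimal sketch in Python, offered as an illustration and not
as anything the RFC prescribes.  The helper name `to_characters' is
made up, and Python's `iso2022_jp' and `us-ascii' codecs are assumed to
match the MIME charsets of the same names:

```python
def to_characters(octets: bytes, charset: str) -> str:
    """A 'character set' in the RFC 2045 sense: octets -> characters."""
    return octets.decode(charset)


# Two different octet sequences, one sequence of characters: the second
# starts with a redundant ESC ( B (designate ASCII), which changes nothing.
plain = to_characters(b"A", "iso2022_jp")
with_escape = to_characters(b"\x1b(BA", "iso2022_jp")

# The other direction is not guaranteed: US-ASCII cannot represent 'ä'.
try:
    "ä".encode("us-ascii")
    representable = True
except UnicodeEncodeError:
    representable = False

print(plain == with_escape, representable)
```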

-- 
(domestic pets only, the antidote for overdose, milk.)
  larsi@gnus.org * Lars Magne Ingebrigtsen


* Re: "Coding system"?  Eh?
  2002-10-20 23:13       ` Lars Magne Ingebrigtsen
@ 1998-09-09 18:59         ` François Pinard
  0 siblings, 0 replies; 15+ messages in thread
From: François Pinard @ 1998-09-09 18:59 UTC (permalink / raw)


Lars Magne Ingebrigtsen <larsi@gnus.org> écrit:

> A character set is an encoding in normal usage.  Quoth RFC2045:

>    Multi-octet coded character sets and switching techniques make the
>    situation more complex. For example, some communities use the term
>    "character encoding" for what MIME calls a "character set", while
>    using the phrase "coded character set" to denote an abstract mapping
>    from integers (not octets) to characters.

There is another distinction between MIME and ISO terminology.  Roughly
said, MIME considers a character set as mapping possible code values to
an encoding of those (often trivial), while ISO considers a character set
as a mere set of characters, not necessarily covering all code positions.
It is sometimes needed to make the set union of many ISO character sets
to get the equivalent of a MIME character set.

P.S. - Take everything I say with a grain of salt.  People who are deep
in these matters are quite susceptible to the detailed wording, and for such
strict readers, I've almost no chance of expressing myself correctly! :-)

-- 
François Pinard                            mailto:pinard@iro.umontreal.ca
Join the free Translation Project!    http://www.iro.umontreal.ca/~pinard


