* "Coding system"? Eh?
From: Lars Magne Ingebrigtsen @ 1998-09-05 16:01 UTC

Isn't what MULE calls a "coding system" what the entire rest of the
world calls a "character set"? So `decode-coding-system' should
really have been called `decode-charset'?

MULE is confusing.

--
(domestic pets only, the antidote for overdose, milk.)
   larsi@ifi.uio.no * Lars Magne Ingebrigtsen
* Re: "Coding system"? Eh?
From: Michael Welsh Duggan @ 1998-09-05 16:31 UTC

Lars Magne Ingebrigtsen <larsi@ifi.uio.no> writes:

> Isn't what MULE calls a "coding system" what the entire rest of the
> world calls a "character set"? So `decode-coding-system' should
> really have been called `decode-charset'?

No, not really. A character set is merely a set of characters.
latin-1, etc., are often called character sets because they use the
same number of characters as extended ASCII, etc. A coding-system is
just that: a coding-system. The characters could be encoded any which
way (including encrypted!). For example, old-jis uses escapes around
sequences of 7-bit characters. This is an encoding, which you can
display using a character set, but not a character set in and of
itself.

More information on {decode,encode}-coding-system: The way the
function is handled internally is that it deletes the region and
replaces it with the {de,en}coded text. This means markers in the
region are screwed. Regions are still buggy though; they shouldn't
work the way they do currently. I am looking into how hard it would
be to fix things such that markers at least can be preserved.

--
Michael Duggan
(md5i@cs.cmu.edu)
.
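To make the "escapes around sequences of 7-bit characters" point concrete, here is a minimal Python sketch. It assumes Python's standard `iso2022_jp` codec as a stand-in for the old-jis coding system; it is not MULE code, and the sample string is arbitrary.

```python
# Sketch: an ISO-2022 style encoding wraps 7-bit byte sequences in escapes.
# Python's "iso2022_jp" codec is a stand-in here; this is not MULE code.

text = "abc あ xyz"  # ASCII mixed with one Japanese character

encoded = text.encode("iso2022_jp")

# Every byte stays in the 7-bit range, even for the Japanese character.
all_seven_bit = all(b < 0x80 for b in encoded)

# ESC $ B switches to JIS X 0208; ESC ( B switches back to ASCII.
has_switch_to_jis = b"\x1b$B" in encoded
has_switch_to_ascii = b"\x1b(B" in encoded

# Decoding interprets the escapes and recovers the original characters.
decoded = encoded.decode("iso2022_jp")

print(encoded)
print(all_seven_bit, has_switch_to_jis, has_switch_to_ascii, decoded == text)
```

The byte stream is only meaningful together with the rule for interpreting the escapes, which is what makes it an encoding rather than a plain set of characters.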
* Re: "Coding system"? Eh?
From: Lars Magne Ingebrigtsen @ 1998-09-05 20:07 UTC

Michael Welsh Duggan <md5i@cs.cmu.edu> writes:

> No, not really. A character set is merely a set of characters.
> latin-1, etc., are often called character sets because they use the
> same number of characters as extended ASCII, etc. A coding-system is
> just that: a coding-system. The characters could be encoded any which
> way (including encrypted!). For example, old-jis uses escapes around
> sequences of 7-bit characters. This is an encoding, which you can
> display using a character set, but not a character set in and of
> itself.

All texts consist of characters (from some character set) encoded
(using some coding system). iso-8859-1, for instance, represents the
character LATIN-LETTER-A-WITH-UMLAUT ("ä") with one byte that contains
the number 0xe4. The same letter encoded in a different charset (say,
Unicode) would occupy two bytes. Other character sets use multiple
bytes to represent characters, like iso-2022-jp.

When one talks about character sets (in, say, MIME) one talks about
encoded character sets. Abstract character sets aren't all that
interesting when fiddling with data. iso-8859-1, which MULE calls a
coding system, is something everyone else calls a character set. The
same with old-jis and iso-2022-jp.

Or something.

--
(domestic pets only, the antidote for overdose, milk.)
   larsi@gnus.org * Lars Magne Ingebrigtsen
* Re: "Coding system"? Eh?
From: Hrvoje Niksic @ 1998-09-05 20:45 UTC

Lars Magne Ingebrigtsen <larsi@gnus.org> writes:

> iso-8859-1, which MULE calls a coding system, is something everyone
> else calls a character set. The same with old-jis and iso-2022-jp.

I believe Michael's point was that, under Mule, you can create coding
systems that have nothing to do with character sets, such as a `gzip'
coding-system. Coding systems are Emacs-specific hybrids between
character sets and their external representation. This probably makes
them different enough from "character sets" to warrant a separate
name.

--
Hrvoje Niksic <hniksic@srce.hr> | Student at FER Zagreb, Croatia
--------------------------------+--------------------------------
Try to use "ad nauseam" at least once per flame. It doesn't mean
anything; but it gives that polished feel to your postings.
* Re: "Coding system"? Eh?
From: Lars Magne Ingebrigtsen @ 1998-09-05 21:12 UTC

Hrvoje Niksic <hniksic@srce.hr> writes:

> I believe Michael's point was that, under Mule, you can create coding
> systems that have nothing to do with character sets, such as a `gzip'
> coding-system.

They do that? And if the unzipped file results in something that's
iso-2022-jp, do they run it through the decoding twice, or do they have
a gzip-iso-2022-jp coding system, as Morioka almost suggested for
base64? (base64-iso-2022-jp, etc.) The latter would be a nightmare,
and the former would be just yucky.

> Coding systems are Emacs-specific hybrids between
> character sets and their external representation. This probably makes
> them different enough from "character sets" to warrant a separate
> name.

Hm. I did a `M-x list-coding-systems', and it listed nothing but
character sets.

--
(domestic pets only, the antidote for overdose, milk.)
   larsi@gnus.org * Lars Magne Ingebrigtsen
* Re: "Coding system"? Eh?
From: Hrvoje Niksic @ 1998-09-05 21:47 UTC

Lars Magne Ingebrigtsen <larsi@gnus.org> writes:

> Hrvoje Niksic <hniksic@srce.hr> writes:
>
> > I believe Michael's point was that, under Mule, you can create
> > coding systems that have nothing to do with character sets, such
> > as a `gzip' coding-system.
>
> They do that? And if the unzipped file results in something that's
> iso-2022-jp, do they run it through the decoding twice, or do they
> have a gzip-iso-2022-jp coding system, as Morioka almost suggested
> for base64? (base64-iso-2022-jp, etc.) The latter would be a
> nightmare, and the former would be just yucky.

I believe XEmacs/Mule allows you to create coding-system chains (at
least I seem to recall seeing internal code to that effect), the yucky
solution, whereas in FSF Emacs you get the nightmare one.

> > Coding systems are Emacs-specific hybrids between character sets
> > and their external representation. This probably makes them
> > different enough from "character sets" to warrant a separate
> > name.
>
> Hm. I did a `M-x list-coding-systems', and it listed nothing but
> character sets.

Well, I never said all of this was implemented. :-) I was trying to
explain the concept, the way I see it.

--
Hrvoje Niksic <hniksic@srce.hr> | Student at FER Zagreb, Croatia
--------------------------------+--------------------------------
Ask not for whom the <CONTROL-G> tolls.
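The "chain" idea discussed above can be sketched in Python as two independent decoding steps applied one after the other. This is only an illustration of the concept: `chain_decode` is a hypothetical helper, the sample data is made up, and nothing here reflects how XEmacs/Mule or FSF Emacs actually implement coding systems.

```python
import gzip


def chain_decode(data, steps):
    """Apply each decoding step in order, feeding each output to the next step."""
    for step in steps:
        data = step(data)
    return data


# Build a sample "file": Japanese text, encoded as iso-2022-jp, then gzipped.
original = "日本語"
on_disk = gzip.compress(original.encode("iso2022_jp"))

# The chained approach: decompress first, then decode the character encoding.
steps = [gzip.decompress, lambda raw: raw.decode("iso2022_jp")]
result = chain_decode(on_disk, steps)

print(result == original)
```

The alternative from the discussion, a single combined gzip-iso-2022-jp coding system, would need one such combination per pair of formats, which is why it is described as the nightmare.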
* Re: "Coding system"? Eh?
From: David Kågedal @ 1998-09-07 15:12 UTC

Lars Magne Ingebrigtsen <larsi@gnus.org> writes:

> Michael Welsh Duggan <md5i@cs.cmu.edu> writes:
>
> > No, not really. A character set is merely a set of characters.
> > latin-1, etc., are often called character sets because they use the
> > same number of characters as extended ASCII, etc. A coding-system is
> > just that: a coding-system. The characters could be encoded any which
> > way (including encrypted!). For example, old-jis uses escapes around
> > sequences of 7-bit characters. This is an encoding, which you can
> > display using a character set, but not a character set in and of
> > itself.
>
> All texts consist of characters (from some character set) encoded
> (using some coding system). iso-8859-1, for instance, represents the
> character LATIN-LETTER-A-WITH-UMLAUT ("ä") with one byte that contains
> the number 0xe4. The same letter encoded in a different charset (say,
> Unicode) would occupy two bytes. Other character sets use multiple
> bytes to represent characters, like iso-2022-jp.

Now you are mixing things. The phrase "encoded in a different charset
(say, Unicode)" is a semantic error.

Unicode defines a character set where LATIN-LETTER-A-WITH-UMLAUT has a
specific number (228, I believe), but Unicode also defines several
character encodings. There is UCS-2, where all characters occupy two
bytes. Then there is UTF-8, where most characters can be encoded using
one byte, while 'ä' needs at least two. Actually, all characters can
be encoded with, say, three bytes in UTF-8.

Unicode also defines UTF-7, which is so ugly that I won't say anything
further about it.

Then ISO-10646, which is in principle a superset of Unicode (but does
not contain any more defined characters) also defines UCS-4, where all
characters are encoded using four bytes, and UTF-16, where all
characters are encoded using two bytes. But the character set is
always the same, with numbers ranging from 0 to 65535.

> When one talks about character sets (in, say, MIME) one talks about
> encoded character sets. Abstract character sets aren't all that
> interesting when fiddling with data. iso-8859-1, which MULE calls a
> coding system, is something everyone else calls a character set. The
> same with old-jis and iso-2022-jp.

ISO 8859-1 is both a character set and an encoding (one-to-one from
character to byte), I believe. But I'm not sure how it is defined.

--
David Kågedal <davidk@lysator.liu.se>
http://www.lysator.liu.se/~davidk/
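The distinction between a character's number and its encoded bytes can be checked with a short Python sketch. The codec names are Python's, not Emacs's; Python has no separate UCS-2 or UCS-4 codec, so `utf-16-be` and `utf-32-be` are used here as stand-ins, which is an assumption that holds for a character like 'ä' inside the BMP.

```python
# Sketch: one character, one code point, several byte representations.

ch = "ä"
code_point = ord(ch)  # the character's number in the character set

# Python codec names; utf-16-be / utf-32-be stand in for UCS-2 / UCS-4 here.
encodings = ["iso-8859-1", "utf-8", "utf-16-be", "utf-32-be"]

# Number of bytes the same character occupies under each encoding.
sizes = {name: len(ch.encode(name)) for name in encodings}

print(code_point, hex(code_point))
for name in encodings:
    print(name, ch.encode(name).hex(), sizes[name])
```

The code point is the same in every row; only the byte sequence and its length change with the encoding.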
* Re: "Coding system"? Eh?
From: François Pinard @ 1998-09-09 18:50 UTC
Cc: ding

davidk@lysator.liu.se (David Kågedal) writes:

> Lars Magne Ingebrigtsen <larsi@gnus.org> writes:
> > Michael Welsh Duggan <md5i@cs.cmu.edu> writes:

> > > No, not really. A character set is merely a set of characters.
> > > [...] A coding-system is just that: a coding-system.

I'm no specialist, but my impression is that MULE does not make such a
clear separation. Internally, each Mule "character" (I'm not sure of
the terminology) holds information about both the code and its
encoding.

> Unicode defines a character set where LATIN-LETTER-A-WITH-UMLAUT has a
> specific number (228, I believe), but Unicode also defines several
> character encodings. There is UCS-2, where all characters occupy two
> bytes. Then there is UTF-8, where most characters can be encoded using
> one byte, while 'ä' needs at least two. Actually, all characters can
> be encoded with, say, three bytes in UTF-8.

You mean, all Unicode characters. ISO 10646 might need more than
three, as UTF-8 is also available for ISO 10646.

> Unicode also defines UTF-7, which is so ugly that I won't say anything
> further about it.

Does Unicode now define UTF-7? It originated from the IETF, and UTF-7
is specifically for MIME contexts, which Unicode does not address.

> Then ISO-10646, which is in principle a superset of Unicode (but does
> not contain any more defined characters) [...]

Some convergence happened, indeed, but the details are a bit more
complex.

> also defines UCS-4, where all characters are encoded using four bytes,
> and UTF-16, where all characters are encoded using two bytes.

I do not remember that ISO 10646 introduced UTF-16; I thought it was a
Unicode invention, but once again, I'm no specialist and may easily be
wrong. ISO 10646 redefined the BMP so there is room for UTF-16 coding,
so ISO 10646 is aware of and compatible with Unicode on this. By the
way, UTF-16 encodes characters using either two or four bytes.

> > When one talks about character sets (in, say, MIME) one talks about
> > encoded character sets.

One should be aware that MIME and ISO 10646/Unicode use different
meanings for the same terms. I have often seen people debating such
things hotly, without realising they were using definitions from
different sources.

> > Abstract character sets aren't all that interesting when fiddling
> > with data. iso-8859-1, which MULE calls a coding system, is something
> > everyone else calls a character set. The same with old-jis and
> > iso-2022-jp.

> ISO 8859-1 is both a character set and an encoding (one-to-one from
> character to byte), I believe. But I'm not sure how it is defined.

And to make things more confusing, when an encoding is used for only
one character set, there is a trend to not make the distinction, and
consider the encoding itself as a character set. I'm a moderate purist
on those things, yet people finally convinced me that practical
considerations should prevail.

--
François Pinard   mailto:pinard@iro.umontreal.ca
Join the free Translation Project!   http://www.iro.umontreal.ca/~pinard
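The remark that UTF-16 uses either two or four bytes per character can be illustrated with a small Python sketch. `utf16_units` is a hypothetical helper written for this example, and the two sample characters are arbitrary choices (one inside the BMP, one just above it).

```python
def utf16_units(ch):
    """Return the 16-bit UTF-16 code units (as ints) for a single character."""
    raw = ch.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


bmp_char = "ä"               # U+00E4, inside the BMP
astral_char = "\U00010000"   # first code point above the BMP

bmp_units = utf16_units(bmp_char)        # one unit: two bytes
astral_units = utf16_units(astral_char)  # a surrogate pair: four bytes

# 1024 high surrogates times 1024 low surrogates give the extra code points.
extra_code_points = 0x400 * 0x400

print([hex(u) for u in bmp_units])
print([hex(u) for u in astral_units])
print(extra_code_points)
```

The product of the two surrogate ranges, 1,048,576, is the room that UTF-16 adds on top of the 65,536 positions of the BMP.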
* Re: "Coding system"? Eh?
From: David Kågedal @ 1998-09-10 12:45 UTC

François Pinard <pinard@iro.umontreal.ca> writes:

> davidk@lysator.liu.se (David Kågedal) writes:
>
> > Unicode defines a character set where LATIN-LETTER-A-WITH-UMLAUT has a
> > specific number (228, I believe), but Unicode also defines several
> > character encodings. There is UCS-2, where all characters occupy two
> > bytes. Then there is UTF-8, where most characters can be encoded using
> > one byte, while 'ä' needs at least two. Actually, all characters can
> > be encoded with, say, three bytes in UTF-8.
>
> You mean, all Unicode characters. ISO 10646 might need more than three,
> as UTF-8 is also available for ISO 10646.

True. I was talking about Unicode.

> > Unicode also defines UTF-7, which is so ugly that I won't say anything
> > further about it.
>
> Does Unicode now define UTF-7? It originated from the IETF, and UTF-7
> is specifically for MIME contexts, which Unicode does not address.

I might be wrong about the origin of UTF-7. But it's still ugly.

> > Then ISO-10646, which is in principle a superset of Unicode (but does
> > not contain any more defined characters) [...]
>
> Some convergence happened, indeed, but the details are a bit more complex.
>
> > also defines UCS-4, where all characters are encoded using four bytes,
> > and UTF-16, where all characters are encoded using two bytes.
>
> I do not remember that ISO 10646 introduced UTF-16; I thought it was a
> Unicode invention, but once again, I'm no specialist and may easily be
> wrong. ISO 10646 redefined the BMP so there is room for UTF-16 coding,
> so ISO 10646 is aware of and compatible with Unicode on this. By the
> way, UTF-16 encodes characters using either two or four bytes.

The difference between UTF-16 and UCS-2 is that it can encode some of
the characters outside the Unicode range (BMP). So I guess Unicode has
no need for UTF-16.

--
David Kågedal <davidk@lysator.liu.se>
http://www.lysator.liu.se/~davidk/
* Re: "Coding system"? Eh?
From: Gisle Aas @ 1998-09-10 20:21 UTC
Cc: ding

davidk@lysator.liu.se (David Kågedal) writes:

> > You mean, all Unicode characters. ISO 10646 might need more than three,
> > as UTF-8 is also available for ISO 10646.
>
> True. I was talking about Unicode.

Unicode is in sync with ISO 10646. Also Unicode allocates characters
above U+FFFF. http://www.unicode.org/unicode/alloc/Pipeline.html.

> The difference between UTF-16 and UCS-2 is that it can encode some of
> the characters outside the Unicode range (BMP). So I guess Unicode has
> no need for UTF-16.

Unicode 2.x is in a way UTF-16.

--
Gisle Aas
* Re: "Coding system"? Eh?
From: François Pinard @ 1998-09-11 6:27 UTC
Cc: David Kågedal, ding

Gisle Aas <aas@sn.no> writes:

> Unicode is in sync with ISO 10646. Also Unicode allocates characters
> above U+FFFF. http://www.unicode.org/unicode/alloc/Pipeline.html.

Thanks for the reference, I'll take a look later when I'm on the net.

> > The difference between UTF-16 and UCS-2 is that it can encode some of
> > the characters outside the Unicode range (BMP). So I guess Unicode has
> > no need for UTF-16.

> Unicode 2.x is in a way UTF-16.

Yes, I got that feeling, even if I did not buy the books (it becomes
expensive after a while, when you use your own money for it :-).

There is some sadness in all this. The original idea was to have a set
of fixed-width characters capable of covering all spoken languages.
Look where we are now. UTF-16 is a variable-width code, we have a lot
of combining characters, and various marks for byte order, for
directionality, and so forth. Many characters are missing (to the
point that this creates problems for me within `recode'), and a
non-negligible fraction of Japanese users are highly irritated by Han
unification, and other things. Many people still think Unicode / ISO
10646 is another step of mankind towards God, but if you look a bit
inside, you'll see that reality has caught up with progress, pretty
fast, sadly enough.

--
François Pinard   mailto:pinard@iro.umontreal.ca
Join the free Translation Project!   http://www.iro.umontreal.ca/~pinard
* Re: "Coding system"? Eh?
From: François Pinard @ 1998-09-11 6:16 UTC
Cc: ding

davidk@lysator.liu.se (David Kågedal) writes:

> > Does Unicode now define UTF-7? It originated from the IETF, and UTF-7
> > is specifically for MIME contexts, which Unicode does not address.

> I might be wrong about the origin of UTF-7. But it's still ugly.

We are all helping each other here; it is not that important whether we
are wrong or right, as long as we improve. If UTF-7 has been adopted
by Unicode, I would surely have liked to know, because the `recode'
documentation would then need to be adjusted.

About the ugliness of UTF-7, I agree to a certain extent. For ding
readers who do not know, UTF-7 is a kind of quoted-printable for
characters using more than 8 bits, and is suited for transmission over
7-bit channels. Very roughly said, instead of `=', it uses `+', and
instead of hexadecimal values, it uses in-lined Base64. I found it a
bit painful to write a UTF-7 encoder and decoder, but now that it's
done, the algorithmic ugliness (which is another kind of ugliness) is
all hidden in black boxes, and we might consider that it is not in the
way anymore. UTF-8 has its elegances, but still it is slightly painful
to write _efficient_ encoders/decoders.

For transmission of Unicode or ISO 10646 message bodies, it looks to me
that we have the choice between UTF-8 and UTF-7. UCS-2 and UCS-4 are
internal formats not well suited for transmission, UTF-1 is obsolete,
and UTF-16 is not much better than UCS-2 for transmission before all
machines replace 8-bit bytes with 16-bit bytes, and this will not
happen in this century :-).

In fact, we have to look at things with a cold eye, here. If you do
not have an integrated decoder in Gnus or in your other mail readers, I
would not be sure which of UTF-8 or UTF-7 looks uglier. UTF-8 would
look like a mix of ASCII and binary dump, UTF-7 would look like ASCII
with fragments of Base64 in it. I might prefer UTF-7, after all,
maybe. And if you have a decoder well integrated, you do not see the
ugliness. The algorithmic ugliness is hidden once and for all in black
boxes anyway, and then, it does not really matter.

> The difference between UTF-16 and UCS-2 is that it can encode some of
> the characters outside the Unicode range (BMP). So I guess Unicode has
> no need for UTF-16.

Unicode needs it, because people are beginning to see that 65,000
characters are not as _enough_ as it was once thought (hmph! I suspect
this is strange English :-). I mean that a few years ago, it was
believed that 65,000 characters would satisfy all our needs for many
years, but relatively soon, people began to see that it is not enough,
and that we need a way to get more characters. ISO 10646 had much
higher goals to start with, so it did not have that problem.

UTF-16 extends the Unicode set to around 1,000,000 characters, still
much less than ISO 10646, but yet, much more comfortable than 65,000;
and ISO 10646 later made room in its BMP so that the UTF-16 technique
could be implemented more simply. I do not think ISO 10646 ever needed
UTF-16, but it wanted Unicode compatibility.

--
François Pinard   mailto:pinard@iro.umontreal.ca
Join the free Translation Project!   http://www.iro.umontreal.ca/~pinard
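For readers who want to see what an undecoded UTF-7 or UTF-8 body looks like, here is a minimal Python sketch using Python's built-in `utf-7` and `utf-8` codecs. The sample name is arbitrary, and the exact bytes a given UTF-7 encoder emits can vary between implementations, so the sketch only checks general properties.

```python
# Sketch: the same text as raw UTF-7 and raw UTF-8 bytes.

name = "Kågedal"

as_utf7 = name.encode("utf-7")
as_utf8 = name.encode("utf-8")

# UTF-7 output is pure 7-bit ASCII; the non-ASCII letter becomes a
# '+'-introduced run of Base64 characters.
utf7_is_ascii = all(b < 0x80 for b in as_utf7)

# UTF-8 output contains bytes with the high bit set for the same letter.
utf8_is_ascii = all(b < 0x80 for b in as_utf8)

print(as_utf7)
print(as_utf8)
print(utf7_is_ascii, utf8_is_ascii)
```

On a 7-bit channel only the UTF-7 form passes through unchanged, which is the trade-off against its Base64 fragments.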
* Re: "Coding system"? Eh?
From: Hallvard B Furuseth @ 1998-09-11 16:14 UTC

[Michael Welsh Duggan]
> No, not really. A character set is merely a set of characters.
> [...] A coding-system is just that: a coding-system.

[François Pinard]
> I'm no specialist, but my impression is that MULE does not make such a
> clear separation. Internally, each Mule "character" (I'm not sure of the
> terminology) holds information about both the code and its encoding.

That's sort of true for *latin-N* character sets in MULE: They have a
"natural" encoding which is equivalent to the character set.

However, two other Cyrillic coding systems map to (subsets of) the MULE
character set latin-iso8859-9 (that's latin-5).

And it's not that way for Asian MULE character sets. I think even a
single MULE character can have several encodings in the same coding
system (iso2022 or whatever).

--
Hallvard
* Re: "Coding system"? Eh?
From: Lars Magne Ingebrigtsen @ 2002-10-20 23:13 UTC

davidk@lysator.liu.se (David Kågedal) writes:

> ISO 8859-1 is both a character set and an encoding (one-to-one from
> character to byte), I believe. But I'm not sure how it is defined.

A character set is an encoding in normal usage. Quoth RFC2045:

   2.2. Character Set

   The term "character set" is used in MIME to refer to a method of
   converting a sequence of octets into a sequence of characters. Note
   that unconditional and unambiguous conversion in the other direction
   is not required, in that not all characters may be representable by
   a given character set and a character set may provide more than one
   sequence of octets to represent a particular sequence of characters.

   This definition is intended to allow various kinds of character
   encodings, from simple single-table mappings such as US-ASCII to
   complex table switching methods such as those that use ISO 2022's
   techniques, to be used as character sets. However, the definition
   associated with a MIME character set name must fully specify the
   mapping to be performed. In particular, use of external profiling
   information to determine the exact mapping is not permitted.

   NOTE: The term "character set" was originally to describe such
   straightforward schemes as US-ASCII and ISO-8859-1 which have a
   simple one-to-one mapping from single octets to single characters.
   Multi-octet coded character sets and switching techniques make the
   situation more complex. For example, some communities use the term
   "character encoding" for what MIME calls a "character set", while
   using the phrase "coded character set" to denote an abstract mapping
   from integers (not octets) to characters.

--
(domestic pets only, the antidote for overdose, milk.)
   larsi@gnus.org * Lars Magne Ingebrigtsen
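The RFC 2045 sense of "character set" as a method of converting octets into characters can be tried out with a short Python sketch. The charset names below are Python codec names that happen to match the MIME names; the choice of octet and of the second charset is arbitrary.

```python
# Sketch: a MIME "character set" maps octets to characters.

octets = b"\xe4"

# The same octet yields different characters under different charsets.
as_latin1 = octets.decode("iso-8859-1")    # Latin small letter a with diaeresis
as_cyrillic = octets.decode("iso-8859-5")  # Cyrillic small letter ef

# The reverse direction is not guaranteed: a Cyrillic letter has no
# representation in iso-8859-1.
try:
    as_cyrillic.encode("iso-8859-1")
    representable = True
except UnicodeEncodeError:
    representable = False

print(as_latin1, as_cyrillic, representable)
```

This matches the RFC's note that conversion from characters back to octets need not be unconditional.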
* Re: "Coding system"? Eh?
From: François Pinard @ 1998-09-09 18:59 UTC

Lars Magne Ingebrigtsen <larsi@gnus.org> writes:

> A character set is an encoding in normal usage. Quoth RFC2045:

> Multi-octet coded character sets and switching techniques make the
> situation more complex. For example, some communities use the term
> "character encoding" for what MIME calls a "character set", while
> using the phrase "coded character set" to denote an abstract mapping
> from integers (not octets) to characters.

There is another distinction between MIME and ISO terminology. Roughly
said, MIME considers a character set as mapping possible code values to
an encoding of those (often trivial), while ISO considers a character
set as a mere set of characters, not necessarily covering all code
positions. It is sometimes necessary to take the set union of many ISO
character sets to get the equivalent of a MIME character set.

P.S. - Take everything I say with a grain of salt. People who are deep
in these matters are quite sensitive to the detailed wording, and for
such strict readers, I've almost no chance of expressing myself
correctly! :-)

--
François Pinard   mailto:pinard@iro.umontreal.ca
Join the free Translation Project!   http://www.iro.umontreal.ca/~pinard