Re: Unicode (was RE: JIT-compilation for OCaml?)

caml-list - the Caml user's mailing list

From: Pierpaolo BERNARDI <bernardp@cli.di.unipi.it>
To: John Max Skaller <skaller@ozemail.com.au>
Cc: OCAML <caml-list@inria.fr>
Subject: Re: Unicode (was RE: JIT-compilation for OCaml?)
Date: Mon, 22 Jan 2001 22:44:26 +0100 (MET)
Message-ID: <Pine.GSO.4.00.10101222155260.697-100000@carlotta.cli.di.unipi.it>
In-Reply-To: <3A6C97C3.C109DC15@ozemail.com.au>

On Tue, 23 Jan 2001, John Max Skaller wrote:

> > Unicode can be encoded in several ways, for example, UTF-8, UTF-16,
> > UTF-32, UCS2, etc..  This has nothing to do with the number of characters
> > that can be encoded.
> 
> 	This is not quite right. Unicode is 16 bit, it supports
> only 2^16 code points: again, unless this has
> changed recently. 

Sorry, no. The idea that "Unicode is 16 bit" is a relic of the
prehistory of Unicode, when it was thought that 64K characters would
suffice.

Unicode associates abstract characters with a numeric index (the scalar
value). To be stored in a computer, these scalar values must be
serialized. Some serialization methods are, for example, UTF-8 (based
on 8-bit chunks), UTF-16 (based on 16-bit chunks), and UTF-32 (based on
32-bit chunks). With UTF-32 there's a simple correspondence between
Unicode scalar values and chunks: the numeric value of the 32-bit
chunk is the scalar value. With the other two formats, Unicode
characters use a variable number of chunks: 1 to 4 for UTF-8, 1 or 2
for UTF-16 (the 2-chunk case being a surrogate pair). There are other
serialization methods, for example the archaic UCS-2 (a fixed 16-bit
form without surrogates, which I think is used by Microsoft), UTF-7,
UTF-EBCDIC, UTF-7,5, etc.
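
As a rough illustration of the chunk counts above, here is a small
sketch in Python (my choice for the example, not something used in this
thread). The helper name code_units is made up for illustration; it
relies only on the standard str.encode codecs.

```python
def code_units(scalar: int) -> dict:
    """Count the 8-, 16- and 32-bit chunks needed for one scalar value."""
    ch = chr(scalar)
    return {
        # One UTF-8 chunk is one byte.
        "UTF-8": len(ch.encode("utf-8")),
        # The "-be" codecs emit no byte order mark, so bytes / 2 = chunks.
        "UTF-16": len(ch.encode("utf-16-be")) // 2,
        # UTF-32 always needs exactly one 32-bit chunk.
        "UTF-32": len(ch.encode("utf-32-be")) // 4,
    }


# 'A', e-acute, euro sign, and a character beyond the first 64K values.
for scalar in (0x41, 0xE9, 0x20AC, 0x10400):
    print(f"U+{scalar:04X}", code_units(scalar))
```

Only the last value, which lies above 0xFFFF, needs two UTF-16 chunks
and four UTF-8 chunks.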

Unicode scalar values range from 0x0000 to 0x10FFFF, excluding the
surrogate range 0xD800 to 0xDFFF. The forthcoming Unicode 3.1 assigns
characters to 94,140 of these values.
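
To make the size of that space concrete, here is a minimal sketch in
Python, taking 0x10FFFF as the upper bound and leaving out the
surrogate range 0xD800 to 0xDFFF, as the standard defines scalar
values. The names MAX_SCALAR, SURROGATES and is_scalar_value are
illustrative only.

```python
MAX_SCALAR = 0x10FFFF                 # highest Unicode scalar value
SURROGATES = range(0xD800, 0xE000)    # reserved for UTF-16, never scalar values


def is_scalar_value(n: int) -> bool:
    """True if n is a Unicode scalar value."""
    return 0 <= n <= MAX_SCALAR and n not in SURROGATES


# Number of scalar values available in total.
total = (MAX_SCALAR + 1) - len(SURROGATES)
print(total)  # 1112064
```

So the 94,140 assigned characters occupy well under a tenth of the
available values.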

And the same is true for ISO/IEC 10646. The remaining discrepancies are
due only to the different publication dates of the respective standards.

> However, some of the code points are reserved
> for UCS-16 encoding of a larger space of 2^20 code points (another

You mean UTF-16.  UTF-16 is just one encoding among many others; it is
not identical to Unicode. Yes, there are code points (the surrogates,
0xD800 to 0xDFFF) whose only purpose is to be used with this particular
encoding.
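
The reserved code points in question are the surrogates. A minimal
sketch of how the 16-bit surrogate mechanism (as specified for UTF-16)
maps a scalar value above 0xFFFF onto a pair of them, written in
Python; the function names to_surrogates and from_surrogates are
hypothetical, the arithmetic is the standard one.

```python
from typing import Tuple


def to_surrogates(scalar: int) -> Tuple[int, int]:
    """Split a scalar value in 0x10000..0x10FFFF into a (high, low) pair."""
    if not 0x10000 <= scalar <= 0x10FFFF:
        raise ValueError("only values above 0xFFFF need a surrogate pair")
    offset = scalar - 0x10000           # 20 bits remain
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


def from_surrogates(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into the scalar value."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x10400)
print(hex(high), hex(low))  # 0xd801 0xdc00
```

The pair carries 20 bits, which is where the extra 2^20 values on top
of the first 2^16 come from.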

> 	Note that this is only loosely connected with
> the encoding of _characters_, since some code points are
> not characters (such as 'newline'), and some sequences
> of code points represent a single (accented) character :-)

Yes, although the word 'character' has many meanings:

  Character. (1) The smallest component of written language that has
  semantic value; refers to the abstract meaning and/or shape, rather than
  a specific shape (see also glyph), though in code tables some form of
  visual representation is essential for the reader's understanding. (2)
  Synonym for abstract character. (See Definition D3 in Section 3.3,
  Characters and Coded Representations.) (3) The basic unit of encoding
  for the Unicode character encoding. (4) The English name for the
  ideographic written elements of Chinese origin. (See ideograph (2).)
     (from the Unicode Glossary)

You mean meaning 1. In computer-related discussions, meaning 3 is
usually intended.
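
The gap between the two meanings is easy to see with an accented
letter. A small sketch in Python, using the standard unicodedata
module; the variable names are mine.

```python
import unicodedata

# 'e' followed by COMBINING ACUTE ACCENT: one character in meaning 1,
# two characters in meaning 3.
decomposed = "e\u0301"

# NFC normalization composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))  # 2 1
```

Both strings display as the same accented letter, yet they differ in
length and do not compare equal unless they are normalized first.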

If you are interested in sorting out the details, you can read UTR-17:

 http://www.unicode.org/unicode/reports/tr17/

Hope this helps.

  Pierpaolo
