From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID:
Date: Fri, 19 May 2006 19:13:33 -0500
From: quanstro@quanstro.net
To: 9fans@cse.psu.edu
Subject: Re: [9fans] combining characters
In-Reply-To: <20060520001201.GF14448@submarine>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Topicbox-Message-UUID: 5386a100-ead1-11e9-9d60-3106f5b1d025

On Fri May 19 19:13:39 CDT 2006, rvs@sun.com wrote:
> > no. the unicode sequences (e.g. U+0069 U+0361) are correct.
> > i checked this and several other examples with the actual books.
>
> How did you check it ? Visual inspection ?

since these were actual books, i know of no other way. ;-)

> Since I'm no expert
> in UNICODE I'm quite curious to know how one is supposed to
> tell between a real character and a combination of a diacritic
> and some other character when they are visually indistinguishable ?

say i have a random accented letter. suppose that U+x is the cp for
the letter. suppose U+y is the cp for the accent. suppose that we're
lucky and there exists U+w ≡ U+xU+y. then U+w should be the same glyph
as U+xU+y. canonical composition would yield

    compose(U+xU+y)      U+w
    compose(U+w)         U+w

while canonical decomposition would yield

    decompose(U+xU+y)    U+xU+y
    decompose(U+w)       U+xU+y

> I would expect unicode to always favor single glyphs from a particular
> page over anything else.

it's always a single glyph. don't confuse letters, codepoints, and glyphs.

>
> btw, could you send me a .png with the actual title ?

i'll send you a png of the character. i don't have the books.
what language rule are you trying to get at?

- erik

>
> > i think you misunderstand how unicode works.
>
> That could very well be the case ;-) But I know how Russian language
> works regardless of what committee members think.
>
> > a base cp like U+0069 followed by a combining cp like U+0361
> > make a single character. this identification is called "composition".
> > unicode contains some precomposed cps, but not U+0069 U+0361.
>
> That's ok. My only point is -- I would expect anybody who enters
> titles into a database to adhere to the rules of the language the
> title is written in. Maybe it's too much to expect, though.
>
> Thanks,
> Roman.
>
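
The compose/decompose behaviour described in the reply can be checked with a short sketch. This is only an illustration, not something from the thread: it assumes Python's standard unicodedata module, takes NFC as canonical composition and NFD as canonical decomposition, and picks U+0065 U+0301 (e plus combining acute, precomposed as U+00E9) as a stand-in for the "lucky" case U+xU+y with a precomposed U+w. The helper names compose, decompose and cps are made up for this example.

```python
import unicodedata


def compose(s: str) -> str:
    """Canonical composition (assumed here to mean Unicode NFC)."""
    return unicodedata.normalize("NFC", s)


def decompose(s: str) -> str:
    """Canonical decomposition (assumed here to mean Unicode NFD)."""
    return unicodedata.normalize("NFD", s)


def cps(s: str) -> str:
    """Render a string as a list of U+XXXX codepoints."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


# Illustrative "lucky" case: a precomposed cp exists.
xy = "e\u0301"   # U+0065 U+0301, base letter plus combining acute
w = "\u00e9"     # U+00E9, the precomposed equivalent

print(cps(compose(xy)))    # U+00E9
print(cps(compose(w)))     # U+00E9
print(cps(decompose(xy)))  # U+0065 U+0301
print(cps(decompose(w)))   # U+0065 U+0301

# The sequence from the thread: no precomposed cp exists,
# so composition leaves the two codepoints as they are.
title_char = "i\u0361"     # U+0069 U+0361
print(cps(compose(title_char)))  # U+0069 U+0361
```

In both cases the sequence is still one character to the reader; normalization only changes which codepoints encode it.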