Date: Fri, 19 May 2006 17:16:22 -0500
From: quanstro@quanstro.net
To: 9fans@cse.psu.edu
Subject: Re: [9fans] combining characters
In-Reply-To: <8ccc8ba40605191504s415ebf15q3fb23d63947ef64d@mail.gmail.com>

i was really surprised by this, too.

i did a bit of work for a company that does searchable literature a couple
of months ago. they were having trouble with "bad unicode". the problem
was stuff like this:

	CA: Corporate Author
	Nizhegorodskai͡a͡ gosudarstvennai͡a͡ selʹskokhozi͡a͡ĭstvennai͡a͡ akademii͡a͡

the character that probably doesn't look right is a combining double
inverted breve (U+0361). it's actually good data. i tracked down the cover
of this book and it's really spelled like that. the problem is that the
unicode folk didn't have the foresight to include letters like this as
single codepoints.

- erik

On Fri May 19 17:05:24 CDT 2006, nemo@lsub.org wrote:
> isn´t there enough space to keep them all there?
>
> On 5/19/06, quanstro@quanstro.net wrote:
> > á is a single codepoint. sure. but there are useful letters that don't
> > exist in unicode unless they are composed. e.g. romanized russian,
> > accented cyrillic, etc.
> >
> > - erik
> >
> > On Fri May 19 17:00:38 CDT 2006, nemo@lsub.org wrote:
> > > I think that á is just a single rune, not two different ones composed. If
> > > to type them, you have to type several keys, it´s just a keyboard issue,
> > > isn´t it? I don´t understand why this could go to an upper layer. Is there
> > > any other problem? (besides having to use utf8 for i/o, I mean).
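
a minimal sketch of the distinction discussed in this thread, assuming Python 3 and its standard unicodedata module (neither appears in the original messages; the variable and function names are illustrative). it composes "a" + U+0301, which has a precomposed codepoint, and "i" + U+0361 + "a", which as far as the normalization tables go does not, so the second sequence stays as three codepoints after NFC:

```python
import unicodedata

# "a" followed by COMBINING ACUTE ACCENT; Unicode has a precomposed
# equivalent (U+00E1), so NFC folds the pair into one code point.
a_acute = "a\u0301"
nfc_a = unicodedata.normalize("NFC", a_acute)

# "i" + COMBINING DOUBLE INVERTED BREVE (U+0361) + "a", as used in
# romanized Russian. No precomposed form exists, so NFC leaves it alone.
ia_tie = "i\u0361a"
nfc_ia = unicodedata.normalize("NFC", ia_tie)


def describe(s):
    """Return (code point, Unicode name) pairs for each character in s."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c)) for c in s]


print(len(nfc_a), describe(nfc_a))
print(len(nfc_ia), describe(nfc_ia))
```

the practical upshot is that code which treats one rune as one letter cannot rely on normalization alone for text like the corporate author line above; the combining marks remain separate runes.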