From: Quintile
Date: Mon, 5 Jan 2015 22:27:33 +0000
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
Subject: Re: [9fans] I don't understand utf8 (it seems)
References: <5e85d81160785cdf717f01a8c0649731@quintile.net> <2926311e74db0ddc2310008906226507@lilly.quanstro.net>
In-Reply-To: <2926311e74db0ddc2310008906226507@lilly.quanstro.net>

ok, I understand, what I thought was UTF8 is in fact latin1.

thanks. that makes sense.

-Steve

> On 5 Jan 2015, at 22:05, erik quanstrom wrote:
>
>> On Mon Jan 5 13:48:47 PST 2015, steve@quintile.net wrote:
>> I am trying to parse a stream from a tcp connection.
>>
>> I think the data is utf8, here is a sample
>>
>>     20 2d 20 c8 65 73 6b fd 20 72 6f 7a 68 6c 61 73
>>
>> which when I print it I get:
>>
>>   -     e s k     r o z h l a s
>>       ^       ^
>>    missing missing
>>
>> there are two missing characters. Ok, bad UTF8 perhaps?
>> but when I try unicode(1) I see:
>>
>>     unicode c8 fd
>>     È
>>     ý
>>
>> Is this 8 bit runes? (!)
>> Is there a name for such a thing?
>> Is this common?
>> Is it just MS code pages but the >0x7f values happen (designed to) to map onto the same letters as utf8?
>
> latin1 has this property that if you embed the byte in a rune-sized chunk, then it's
> a valid Rune. but latin1 is invalid utf-8.
>
> the reason that unicode(1) failed to meet expectations, is that the desire was
> to convert the supposed utf-8 0xc8 to a codepoint, but what unicode did was
> convert the codepoint 0xc8 into utf-8.
>
> - erik
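
The distinction erik draws can be checked with a short sketch. It is written in Python rather than Plan 9 C, purely for illustration, and the helper names is_valid_utf8 and latin1_to_utf8 are made up for this example; they are not part of any Plan 9 tool. It assumes, as the thread concludes, that the stream is latin1, and it shows three things: the sample bytes are rejected by a UTF-8 decoder, each latin1 byte maps to the code point of the same value, and turning the code point 0xc8 into UTF-8 gives the two bytes c3 88, which is the direction unicode(1) went.

```python
# Sample bytes quoted in the thread.
sample = bytes.fromhex("20 2d 20 c8 65 73 6b fd 20 72 6f 7a 68 6c 61 73")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def latin1_to_utf8(data: bytes) -> bytes:
    """Treat each byte as a code point (latin1), then encode as UTF-8."""
    return data.decode("latin-1").encode("utf-8")


# 0xc8 looks like the lead byte of a two-byte UTF-8 sequence, but the next
# byte, 0x65, is not a continuation byte, so the sample is not valid UTF-8.
print(is_valid_utf8(sample))               # False

# Read as latin1, byte 0xc8 becomes code point U+00C8 and 0xfd becomes U+00FD.
text = sample.decode("latin-1")
print([hex(ord(c)) for c in text[3:8]])    # ['0xc8', '0x65', '0x73', '0x6b', '0xfd']

# Code point -> UTF-8, the conversion the thread says unicode(1) performed.
print(chr(0xC8).encode("utf-8").hex(" "))  # c3 88
print(chr(0xFD).encode("utf-8").hex(" "))  # c3 bd

# Re-encoding the whole sample gives a stream that is valid UTF-8.
print(is_valid_utf8(latin1_to_utf8(sample)))  # True
```

The bytes c3 88 and c3 bd are the same ones that show up as =C3=88 and =C3=BD in the quoted-printable form of the unicode(1) output above.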