From mboxrd@z Thu Jan 1 00:00:00 1970
From: erik quanstrom
Date: Mon, 5 Jan 2015 14:05:17 -0800
To: 9fans@9fans.net
Message-ID: <2926311e74db0ddc2310008906226507@lilly.quanstro.net>
In-Reply-To: <5e85d81160785cdf717f01a8c0649731@quintile.net>
References: <5e85d81160785cdf717f01a8c0649731@quintile.net>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Subject: Re: [9fans] I don't understand utf8 (it seems)
Topicbox-Message-UUID: 39531eaa-ead9-11e9-9d60-3106f5b1d025

On Mon Jan 5 13:48:47 PST 2015, steve@quintile.net wrote:
> I am trying to parse a stream from a tcp connection.
>
> I think the data is utf8, here is a sample
>
> 20 2d 20 c8 65 73 6b fd 20 72 6f 7a 68 6c 61 73
>
> which when I print it I get:
>
>  -   e s k   r o z h l a s
>    ^       ^
> missing missing
>
> there are two missing characters. Ok, bad UTF8 perhaps?
> but when I try unicode(1) I see:
>
> unicode c8 fd
> È
> ý
>
> Is this 8 bit runes? (!)
> Is there a name for such a thing?
> Is this common?
> Is it just MS code pages but the >0x7f values happen (designed to) to map onto the same letters as utf8?

latin1 has this property that if you embed the byte in a rune-sized
chunk, then it's a valid Rune. but latin1 is invalid utf-8.

the reason that unicode(1) failed to meet expectations, is that the
desire was to convert the supposed utf-8 0xc8 to a codepoint, but what
unicode did was convert the codepoint 0xc8 into utf-8.

- erik
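
The two directions described in the reply can be checked with a short Python sketch. This is an illustration only and does not claim to show how unicode(1) works internally. The helper names (is_valid_utf8, codepoint_to_utf8, latin1_to_text) are made up for this example, and it assumes Python 3.8 or later.

```python
# Sample bytes copied from the quoted message.
sample = bytes.fromhex("20 2d 20 c8 65 73 6b fd 20 72 6f 7a 68 6c 61 73")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def codepoint_to_utf8(cp: int) -> bytes:
    """Codepoint -> UTF-8 bytes, the direction the reply says unicode(1) took."""
    return chr(cp).encode("utf-8")


def latin1_to_text(data: bytes) -> str:
    """Treat each byte as the codepoint of the same value (the latin1 property)."""
    return data.decode("latin-1")


if __name__ == "__main__":
    # 0xc8 looks like a UTF-8 lead byte but is followed by 0x65,
    # which is not a continuation byte, so strict decoding fails.
    print("valid utf-8:", is_valid_utf8(sample))

    # Encoding the codepoints 0xc8 and 0xfd yields two bytes each.
    print("U+00C8 ->", codepoint_to_utf8(0xC8).hex(" "))
    print("U+00FD ->", codepoint_to_utf8(0xFD).hex(" "))

    # Read as latin1, every byte becomes a valid codepoint.
    print("as latin1:", repr(latin1_to_text(sample)))
```

Run as is, this should report that the sample is not valid UTF-8, and that the codepoints 0xc8 and 0xfd encode to c3 88 and c3 bd, which are the UTF-8 encodings of È and ý, the characters unicode(1) printed.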