Date: Tue, 6 Jan 2015 00:15:33 +0200
From: Antons Suspans
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
Subject: Re: [9fans] I don't understand utf8 (it seems)
Message-ID: <20150105221533.GA7830@ax.s16>
In-Reply-To: <5e85d81160785cdf717f01a8c0649731@quintile.net>

On Mon, Jan 05, 2015 at 09:52:12PM +0000, Steve Simon wrote:
> I am trying to parse a stream from a tcp connection.
>
> I think the data is utf8, here is a sample
>
> 20 2d 20 c8 65 73 6b fd 20 72 6f 7a 68 6c 61 73
>
> which when I print it I get:
>
>  -  e s k  r o z h l a s
>    ^      ^
> missing  missing
>
> there are two missing characters. Ok, bad UTF8 perhaps?
> but when I try unicode(1) I see:
>
> unicode c8 fd
> È
> ý
>
> Is this 8 bit runes? (!)
> Is there a name for such a thing?
> Is this common?
> Is it just MS code pages but the >0x7f values happen (designed to) to map onto the same letters as utf8?
>
> thanks in advance of useful suggestions ☺
>
> -Steve
>

Those might be ISO-8859-1 octets.
The first 256 codepoints of Unicode are those of ISO-8859-1, which is
why unicode(1), taking its arguments as codepoints, prints È and ý for
c8 and fd.

% unicode -t 20 2d 20 c8 65 73 6b fd 20 72 6f 7a 68 6c 61 73 | xd -

directly produces 18 octets of UTF-8 representing the 16 Unicode
codepoints given as arguments.

% ascii -t 20 2d 20 c8 65 73 6b fd 20 72 6f 7a 68 6c 61 73 | tcs -f 8859-1 | xd -

first yields the 16 raw octets; tcs then treats them as 16 codepoints
encoded in ISO-8859-1 and converts them to 18 octets of UTF-8
representing those same 16 codepoints.

Hope this helps,
-- 
Antons
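
The octet counts above can be checked outside Plan 9 as well. What follows is a minimal sketch in Python, not part of the original exchange; it assumes, as the reply does, that the sample is ISO-8859-1. The variable names are only illustrative.

```python
# The 16 octets from the sample, exactly as quoted in the thread.
sample = bytes.fromhex("20 2d 20 c8 65 73 6b fd 20 72 6f 7a 68 6c 61 73")

# Strict UTF-8 decoding fails: 0xc8 announces a two-octet sequence,
# but the following octet (0x65) is not a continuation octet.
try:
    sample.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False

# Assumption: the octets are ISO-8859-1, i.e. one octet per codepoint.
text = sample.decode("iso-8859-1")

# Re-encode the same 16 codepoints as UTF-8 (what tcs -f 8859-1 does).
utf8 = text.encode("utf-8")

print(is_valid_utf8)                      # False
print(len(sample), len(text), len(utf8))  # 16 16 18
print(utf8.hex(" "))                      # c8 -> c3 88, fd -> c3 bd
```

The two extra octets come from c8 and fd, each of which needs two octets in UTF-8. Whether the sender really used ISO-8859-1 is not established here; under a different 8-bit charset such as ISO-8859-2 the same octets would map to other letters (0xc8 would be Č), so the codec name is the part to vary when experimenting.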