To: edbrowse-dev@lists.the-brannons.com
From: Karl Dahlke
Reply-to: Karl Dahlke
References: <20160122130158.GA2555@Kraftkrust> <20160122171305.GE2555@Kraftkrust>
User-Agent: edbrowse/3.6.1+
Date: Mon, 25 Jan 2016 07:39:59 -0500
Message-ID: <20160025073959.eklhad@comcast.net>
Subject: [Edbrowse-dev] Edbrowse recognizing site as binary data
List-Id: Edbrowse Development List

As of this commit, edbrowse recognizes utf16 or utf32, according to the byte order mark, and converts to utf8,
the internal edbrowse format, and the only format understood by pcre.
Text is converted back if the same file is written.
If text is sent anywhere else it remains in utf8.
This is consistent with our iso utf8 conversions.
Big and little endian are recognized.
I ran a few tests, but it is not thoroughly tested; there are lots of corner cases.

This has been much discussed, and didn't seem worth doing,
but Geoff pointed out that such files are more common on Windows
(in fact I think he first discovered the problem),
and much of the Asian world uses utf16 in files and websites
because it is the most efficient way to represent such text, more efficient than utf8.
So this web page, coming down as utf16, now works.

https://portal.slm.tu-dresden.de

Geoff, if you have some 16 or 32 files, you may wish to test:

edbrowse whatever-file-utf32.txt

See if it looks right, and beyond this, make some edits and write the file,
and see if the edits stick and if the file remains in its original format.

Ok, I already found a Windows bug just by thinking about it.
Text files are opened in text mode, but when mapping back to utf 16 or 32
they need to be in binary mode.
I may even have to stick in \r\0\0\0 manually.
Arrgghh.
I'll look into it.

Karl Dahlke
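
For readers following the thread, here is a minimal sketch of the round trip described above: detect the encoding from the byte order mark, convert to utf8 for internal use, and map back to the original format on write. It is written in Python for brevity and is not taken from the edbrowse source (which is C). The function names detect_bom, to_utf8 and from_utf8, and the crlf flag, are hypothetical and exist only for this illustration.

```python
import codecs

# Check the 4-byte marks first: the UTF-32LE BOM (FF FE 00 00)
# begins with the UTF-16LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes):
    """Return (encoding, bom) if data starts with a UTF-16/32 BOM, else (None, b"")."""
    for bom, enc in BOMS:
        if data.startswith(bom):
            return enc, bom
    return None, b""


def to_utf8(data: bytes):
    """Convert BOM-marked UTF-16/32 data to UTF-8 and remember the original encoding."""
    enc, bom = detect_bom(data)
    if enc is None:
        return data, None  # no BOM: leave the bytes untouched
    text = data[len(bom):].decode(enc)
    return text.encode("utf-8"), enc


def from_utf8(data: bytes, enc, crlf=False):
    """Map UTF-8 back to the remembered encoding, restoring the BOM.

    crlf=True inserts the carriage returns by hand, as a text-mode
    write on Windows would otherwise be expected to do.
    """
    if enc is None:
        return data
    text = data.decode("utf-8")
    if crlf:
        text = text.replace("\n", "\r\n")
    bom = next(b for b, e in BOMS if e == enc)
    return bom + text.encode(enc)
```

The output of from_utf8 would have to be written in binary mode, for example open(path, "wb"). A text-mode write on Windows translates each 0x0A byte to 0x0D 0x0A, which breaks the 2 or 4 byte code units; that is why the carriage return has to be encoded by hand, giving \r\0\0\0 in little endian utf32. One corner case this sketch ignores: a little endian utf16 file whose first character after the BOM is U+0000 would be taken for utf32.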