* [Edbrowse-dev] Edbrowse recognizing site as binary data
From: Sebastian Humenda @ 2016-01-22 13:01 UTC
To: edbrowse-dev

Hi,

with Edbrowse 3.6.0.1, I get the following output when executing

$ edbrowse https://portal.slm.tu-dresden.de
no errors
binary data
4002
cannot browse a binary file

I presume that's a bug? Is there anything I can do to help resolve this issue?

Thanks
Sebastian
-- 
Web: http://www.crustulus.de (English|Deutsch) | Blog: http://www.crustulus.de/blog
FreeDict: Free multilingual dictionaries - http://www.freedict.org
Free Latin-German dictionary: http://www.crustulus.de/freedict.de.html

* [Edbrowse-dev] Edbrowse recognizing site as binary data
From: Karl Dahlke @ 2016-01-22 14:30 UTC
To: edbrowse-dev

> edbrowse https://portal.slm.tu-dresden.de
> cannot browse a binary file
> I presume that's a bug? Is there anything I can do to help resolve this issue?

Not a bug technically, it's a feature not yet implemented.
Save the data to a file and run /bin/file on it and get this

HTML document, Little-endian UTF-16 Unicode text

The problem is utf-16, which edbrowse does not recognize.
This has been discussed on the group.
It is very rare on the internet and becoming rarer still,
so not sure if it is worth the bother to implement.
Similar comments for utf-32.
Though these are more common on Windows, I'm told, so maybe we should implement.
Anyways, this is the first time a user has reported such a page to us.

Use the bd command to get around it, like this.
But any unicode chars above 256 will be garbled.

bd
e edbrowse https://portal.slm.tu-dresden.de
1s/^..//
,s/\0//g
b

Karl Dahlke
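
The effect of the two substitutions above can be illustrated outside edbrowse. The following Python sketch is not edbrowse code; `strip_utf16le_naive` is a hypothetical helper that mimics the commands on a little-endian UTF-16 buffer, assuming it starts with a two-byte byte order mark. The sample string is made up for illustration.

```python
def strip_utf16le_naive(raw: bytes) -> bytes:
    """Mimic the workaround: drop the first two bytes (the BOM),
    then delete every NUL byte. This is not a real decoder."""
    return raw[2:].replace(b"\x00", b"")


# Hypothetical sample page, encoded as little-endian UTF-16 with a BOM.
page = "<p>Grüße €</p>"
raw = b"\xff\xfe" + page.encode("utf-16-le")

naive = strip_utf16le_naive(raw)

# ASCII survives. "ü" and "ß" come out as single Latin-1 bytes rather than
# UTF-8, and "€" (U+20AC, bytes AC 20 in UTF-16 LE) turns into 0xAC plus a space.
print(naive)
print(naive.decode("latin-1"))
```

For plain ASCII markup the result is readable, which is why the workaround is good enough for a quick look at the page. Anything whose high byte is not zero leaves stray bytes behind, which matches the warning about garbled characters.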

* Re: [Edbrowse-dev] Edbrowse recognizing site as binary data
From: Sebastian Humenda @ 2016-01-22 17:13 UTC
To: edbrowse-dev

Hi,

Karl Dahlke wrote on 22.01.2016, 9:30 -0500:
>> edbrowse https://portal.slm.tu-dresden.de
>> cannot browse a binary file
>> I presume that's a bug? Is there anything I can do to help resolve this issue?
>
>Not a bug technically, it's a feature not yet implemented.
>Save the data to a file and run /bin/file on it and get this
>
>HTML document, Little-endian UTF-16 Unicode text

Ah right.

>The problem is utf-16, which edbrowse does not recognize.
>This has been discussed on the group.
>It is very rare on the internet and becoming rarer still,

This is partially true. Especially for Asian languages, UTF-16 is a better
choice than UTF-8. Not true for the above site though.

Anyway, there's
http://llvm.org/svn/llvm-project/llvm/trunk/include/llvm/Support/ConvertUTF.h.
That should provide a conversion function for both UTF-32->UTF-8 and
UTF-16->UTF-8 (together with the appropriate C file, of course). Wouldn't it be
easy to just detect UTF-16 and convert it to UTF-8 before doing anything else?

Cheers
Sebastian
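
The size argument can be checked with a few lines of Python. This is only an illustration with made-up sample strings, and `encoded_sizes` is a hypothetical helper: characters in the range U+0800 to U+FFFF, which covers most CJK text, take three bytes in UTF-8 and two in UTF-16, while ASCII-heavy markup takes one byte in UTF-8 and two in UTF-16.

```python
def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count without a BOM)."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


japanese = "日本語のテキスト"                 # 8 characters, all in the BMP
markup = "<html><body>Hello</body></html>"   # 31 ASCII characters

print(encoded_sizes(japanese))  # (24, 16)
print(encoded_sizes(markup))    # (31, 62)
```

A real web page mixes both kinds of content, so which encoding is smaller depends on the ratio of markup to text.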

* [Edbrowse-dev] Edbrowse recognizing site as binary data
From: Karl Dahlke @ 2016-01-25 12:39 UTC
To: edbrowse-dev

As of this commit, edbrowse recognizes utf16 or utf32,
according to the byte order mark, and converts to utf8,
the internal edbrowse format, and the only format understood by pcre.
Text is converted back if the same file is written.
If text is sent anywhere else it remains in utf8.
This is consistent with our iso utf8 conversions.
Big and little endian are recognized.
I ran a few tests, but it is not thoroughly tested; there are lots of corner cases.

This has been much discussed, and didn't seem worth doing,
but Geoff pointed out that such files are more common on Windows,
in fact I think he first discovered the problem,
and much of the Asian world uses utf16 in files and websites
because it is the most efficient way to represent such text,
more efficient than utf8.

So this web page, coming down as utf16, now works.

https://portal.slm.tu-dresden.de

Geoff, if you have some utf16 or utf32 files, you may wish to test,

edbrowse whatever-file-utf32.txt

and see if it looks right, and beyond this, make some edits and write the file
and see if the edits stick and if the file remains in its original format.

Ok, I already found a Windows bug just by thinking about it.
Text files are opened in text mode, but when mapping back to utf16 or utf32
they need to be in binary mode.
I may even have to stick in \r\0\0\0 manually.
Arrgghh.
I'll look into it.

Karl Dahlke
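
The behavior described in this message (detect the encoding from the byte order mark, convert to utf8, convert back on write) can be sketched in Python. This is not the edbrowse implementation, which is written in C; `to_utf8` and `from_utf8` are hypothetical names, and only the BOM constants from Python's standard `codecs` module are real.

```python
import codecs

# UTF-32 must be checked before UTF-16, because the UTF-32 LE mark
# (FF FE 00 00) begins with the UTF-16 LE mark (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def to_utf8(raw: bytes):
    """Return (utf8_bytes, original_encoding); encoding is None if no BOM matched."""
    for bom, enc in BOMS:
        if raw.startswith(bom):
            return raw[len(bom):].decode(enc).encode("utf-8"), enc
    return raw, None


def from_utf8(buf: bytes, enc):
    """Map a utf8 buffer back to its original encoding, restoring the BOM."""
    if enc is None:
        return buf
    bom = next(b for b, e in BOMS if e == enc)
    return bom + buf.decode("utf-8").encode(enc)


raw = codecs.BOM_UTF16_LE + "héllo €\n".encode("utf-16-le")
buf, enc = to_utf8(raw)
print(enc, buf)
print(from_utf8(buf, enc) == raw)
```

One corner case this sketch does not resolve: a UTF-16 LE file whose first character after the mark is U+0000 is indistinguishable from UTF-32 LE by the mark alone. On the text-mode problem, the converted bytes would have to be written with a binary-mode file (in Python, `open(path, "wb")`), since newline translation applied to single bytes would corrupt multi-byte code units.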