edbrowse-dev - development list for edbrowse
* [Edbrowse-dev] Edbrowse recognizing site as binary data
From: Sebastian Humenda @ 2016-01-22 13:01 UTC
  To: edbrowse-dev


Hi,

with Edbrowse 3.6.0.1, I get the following output when executing

    $ edbrowse https://portal.slm.tu-dresden.de
    no errors
    binary data
    4002
    cannot browse a binary file

I presume that's a bug? Is there anything I can do to help resolve this issue?

Thanks
Sebastian
-- 
Web: http://www.crustulus.de (English|Deutsch)  | Blog: http://www.crustulus.de/blog
FreeDict: Free multilingual dictionaries - http://www.freedict.org
Freies Latein-Deutsch-Wörterbuch: http://www.crustulus.de/freedict.de.html


* [Edbrowse-dev]  Edbrowse recognizing site as binary data
From: Karl Dahlke @ 2016-01-22 14:30 UTC
  To: edbrowse-dev

> edbrowse https://portal.slm.tu-dresden.de
> cannot browse a binary file
> I presume that's a bug? Is there anything I can do to help resolving this issue?

Technically not a bug; it's a feature that is not yet implemented.
If you save the data to a file and run /bin/file on it, you get this:

HTML document, Little-endian UTF-16 Unicode text

The problem is utf-16, which edbrowse does not recognize.
This has been discussed on the group.
It is very rare on the internet and becoming rarer still,
so I'm not sure it is worth the bother to implement.
Similar comments apply to utf-32.
Though these are more common on Windows, I'm told,
so maybe we should implement it.
Anyway, this is the first time a user has reported such a page to us.

Use the bd command to get around it, like this.
But any Unicode characters above 255 will be garbled.

bd
e https://portal.slm.tu-dresden.de
1s/^..//
,s/\0//g
b


Karl Dahlke
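
For readers unfamiliar with the commands in the workaround above: `1s/^..//` drops the first two bytes of line 1 (the UTF-16 byte order mark) and `,s/\0//g` deletes every NUL byte in the buffer. The following is only a rough Python sketch of that byte-level effect, not edbrowse code; the function name and the sample strings are invented for illustration. It shows why plain ASCII survives while a character above 255 does not.

```python
def strip_utf16le_naively(raw: bytes) -> bytes:
    """Mimic the workaround: drop the 2-byte BOM, then delete all NUL bytes."""
    without_bom = raw[2:]
    return without_bom.replace(b"\x00", b"")


# Invented sample data: UTF-16 little-endian text preceded by its BOM (FF FE).
ascii_page = b"\xff\xfe" + "<html>".encode("utf-16-le")
euro_page = b"\xff\xfe" + "\u20ac".encode("utf-16-le")  # U+20AC EURO SIGN

print(strip_utf16le_naively(ascii_page))  # b'<html>'
print(strip_utf16le_naively(euro_page))   # b'\xac ' (no longer the euro sign)
```

The euro sign is stored as the bytes AC 20 in little-endian UTF-16; neither byte is NUL, so both are kept and the result is not the UTF-8 encoding of the original character.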


* Re: [Edbrowse-dev] Edbrowse recognizing site as binary data
From: Sebastian Humenda @ 2016-01-22 17:13 UTC
  To: edbrowse-dev


Hi,

Karl Dahlke wrote on 22.01.2016,  9:30 -0500:
>> edbrowse https://portal.slm.tu-dresden.de
>> cannot browse a binary file
>> I presume that's a bug? Is there anything I can do to help resolving this issue?
>
>Not a bug technically, it's a feature not yet implemented.
>Save the data to a file and run /bin/file on it and get this
>
>HTML document, Little-endian UTF-16 Unicode text
Ah right.

>The problem is utf-16, which edbrowse does not recognize.
>This has been discussed on the group.
>It is very rare on the internet and becoming rarer still,
This is partially true. Especially for Asian languages, UTF-16 is a better
choice than UTF8. Not true for the above site though.
Anyway, there's
http://llvm.org/svn/llvm-project/llvm/trunk/include/llvm/Support/ConvertUTF.h.
That should provide a conversion function for both UTF-32->UTF-8 and
UTF-16->UTF-8 (together with the appropriate C file, of course). Wouldn't it be
easy to just detect UTF-16 and convert it to UTF-8 before doing anything else?

Cheers
Sebastian

* [Edbrowse-dev]  Edbrowse recognizing site as binary data
From: Karl Dahlke @ 2016-01-25 12:39 UTC
  To: edbrowse-dev

As of this commit, edbrowse recognizes utf16 or utf32, according to the
byte order mark, and converts to utf8, the internal edbrowse format,
and the only format understood by pcre.
Text is converted back if the same file is written.
If text is sent anywhere else it remains in utf8.
This is consistent with our iso utf8 conversions.
Big and little endian are recognized.
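
The behavior described above (detect by byte order mark, convert to utf8, convert back on write) could be sketched roughly as follows. This is a Python illustration under stated assumptions, not the C code from the commit: the names `BOMS`, `to_utf8` and `from_utf8` are hypothetical, and input without a byte order mark is simply passed through unchanged.

```python
import codecs

# Order matters: the UTF-32 LE mark (FF FE 00 00) begins with the
# UTF-16 LE mark (FF FE), so the 4-byte marks are checked first.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def to_utf8(raw: bytes):
    """Return (utf8_bytes, original_encoding); encoding is None if no BOM matched."""
    for bom, name in BOMS:
        if raw.startswith(bom):
            text = raw[len(bom):].decode(name)
            return text.encode("utf-8"), name
    return raw, None


def from_utf8(data: bytes, encoding):
    """Convert UTF-8 back to the original encoding and restore its BOM."""
    if encoding is None:
        return data
    bom_for = {name: bom for bom, name in BOMS}
    return bom_for[encoding] + data.decode("utf-8").encode(encoding)


# Round trip on an invented sample: UTF-16 LE in, UTF-8 inside, UTF-16 LE out.
original = codecs.BOM_UTF16_LE + "h\u00e9llo \u20ac".encode("utf-16-le")
internal, detected = to_utf8(original)
print(detected)                                  # utf-16-le
print(from_utf8(internal, detected) == original)  # True
```

One of the corner cases is visible in the table itself: a UTF-16 LE file whose first character after the mark is U+0000 starts with the same four bytes as the UTF-32 LE mark, so a detector that relies on the byte order mark alone cannot tell the two apart.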

I ran a few tests, but it is not thoroughly tested;
there are lots of corner cases.

This has been much discussed, and didn't seem worth doing,
but Geoff pointed out that such files are more common on Windows
(in fact I think he first discovered the problem),
and much of the Asian world uses utf16 in files and websites
because it is the most efficient way to represent such text,
more efficient than utf8.
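
The size argument can be checked with a small Python sketch. The sample strings and the helper name `encoded_sizes` are invented for illustration, and the counts cover only the text itself, without a byte order mark or any surrounding markup.

```python
def encoded_sizes(text: str):
    """Return (utf8_length, utf16_length) in bytes for the given text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


print(encoded_sizes("hello"))             # (5, 10)
print(encoded_sizes("\u65e5\u672c\u8a9e"))  # (9, 6), the three characters of "Japanese language"
```

Each of the three CJK characters takes three bytes in utf8 but two in utf16, while ASCII text goes the other way: one byte per character in utf8 and two in utf16.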

So this web page, coming down as utf16, now works.
https://portal.slm.tu-dresden.de

Geoff, if you have some utf16 or utf32 files, you may wish to test,
	edbrowse whatever-file-utf32.txt
and see if it looks right,
and beyond this, make some edits and write the file
and see if the edits stick and if the file remains in its original format.

Ok, I already found a windows bug just by thinking about it.
Text files are opened in text mode, but when mapping back to utf16 or utf32
they need to be opened in binary mode.
I may even have to stick in \r\0\0\0 manually. Arrgghh.
I'll look into it.
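
To illustrate the problem: a Windows text-mode write turns every LF byte into the two bytes CR LF, which is harmless for single-byte text but splits a multi-byte code unit. The Python sketch below only simulates that translation with a byte replacement (an assumption made so it runs on any platform); the function name is hypothetical and this is not how edbrowse does its I/O.

```python
def simulate_text_mode_write(data: bytes) -> bytes:
    """Simulate a Windows text-mode write: every LF byte becomes CR LF."""
    return data.replace(b"\n", b"\r\n")


text = "a\nb"

# Wrong: encode first, then let text mode translate at the byte level.
broken = simulate_text_mode_write(text.encode("utf-32-le"))

# Right: insert the carriage return as a character, then write the
# encoded bytes untouched, as binary mode would.
correct = text.replace("\n", "\r\n").encode("utf-32-le")

print(len(broken), len(correct))      # 13 16
print(b"\r\x00\x00\x00" in correct)   # True
```

The broken output gains a single stray 0D byte, so its length is no longer a multiple of four and everything after the newline is misaligned. The correct output contains the carriage return as a full four-byte unit, the `\r\0\0\0` mentioned above.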

Karl Dahlke
