9fans - fans of the OS Plan 9 from Bell Labs
* Re: [9fans] utf-8 text files from httpd
       [not found] <<df49a7370910190649k3179f0b1r4c877d5ca72af232@mail.gmail.com>
@ 2009-10-19 13:55 ` erik quanstrom
  2009-10-19 14:32   ` roger peppe
  0 siblings, 1 reply; 15+ messages in thread
From: erik quanstrom @ 2009-10-19 13:55 UTC (permalink / raw)
  To: 9fans

On Mon Oct 19 09:51:33 EDT 2009, rogpeppe@gmail.com wrote:
> there's another problem with file -m that
> i've been bitten by before: it ignores any
> stuff after the first 6000 bytes.
>
> so if you've got a mostly-ascii file with some
> utf-8 characters 8K in, then it won't be picked up.
>
> i think file -m should read the whole file, but that's just IMHO.

a relic trying to avoid ken's read ahead
and firing up the worm drives.

why try that hard?  just call it utf-8.  i can't think of
any browsers that would have a problem with that today.

- erik




* Re: [9fans] utf-8 text files from httpd
  2009-10-19 13:55 ` [9fans] utf-8 text files from httpd erik quanstrom
@ 2009-10-19 14:32   ` roger peppe
  2009-10-19 17:36     ` lucio
  0 siblings, 1 reply; 15+ messages in thread
From: roger peppe @ 2009-10-19 14:32 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

2009/10/19 erik quanstrom <quanstro@quanstro.net>:
> why try that hard?  just call it utf-8.  i can't think of
> any browsers that would have a problem with that today.

the instance of the problem that i had was when
adding an attachment to a upas mail.
file -m is useful when the attachment might be
binary.




* Re: [9fans] utf-8 text files from httpd
  2009-10-19 14:32   ` roger peppe
@ 2009-10-19 17:36     ` lucio
  0 siblings, 0 replies; 15+ messages in thread
From: lucio @ 2009-10-19 17:36 UTC (permalink / raw)
  To: 9fans

> 2009/10/19 erik quanstrom <quanstro@quanstro.net>:
>> why try that hard?  just call it utf-8.  i can't think of
>> any browsers that would have a problem with that today.
> 
> the instance of the problem that i had was when
> adding an attachment to a upas mail.
> file -m is useful when the attachment might be
> binary.

Why not enhance "file -m" so that it is instructed to read the entire
file, then?  Knowing the context, adding, say, a "b" option (for
"big") would not do any damage, right?

++L





* Re: [9fans] utf-8 text files from httpd
       [not found] <<df49a7370910190732i526a15b6o6d2822cd2d14bff0@mail.gmail.com>
@ 2009-10-19 14:50 ` erik quanstrom
  0 siblings, 0 replies; 15+ messages in thread
From: erik quanstrom @ 2009-10-19 14:50 UTC (permalink / raw)
  To: 9fans

On Mon Oct 19 10:36:51 EDT 2009, rogpeppe@gmail.com wrote:
> 2009/10/19 erik quanstrom <quanstro@quanstro.net>:
> > why try that hard?  just call it utf-8.  i can't think of
> > any browsers that would have a problem with that today.
>
> the instance of the problem that i had was when
> adding an attachment to a upas mail.
> file -m is useful when the attachment might be
> binary.

/sys/src/cmd/upas/marshal/marshal.c:/^body

already scans the whole file.  it could never
call something that's not ascii ascii.  unfortunately it
could be fooled by a bucky bit that's not
utf-8, since it doesn't check for valid utf-8.

it would be better to at least have a flag to file
that tells it to read the whole file and to have
file always return the character set to avoid
distributing various and sundry hacks
about the system.

- erik
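
The check described above (scan the whole input, and only call it utf-8 if the bytes really decode as utf-8) might be sketched as follows. This is a minimal Python illustration, not the code in marshal.c or file.c; the function name `classify` and the label "unknown-8bit" are assumptions made for the example.

```python
def classify(data: bytes) -> str:
    """Label a whole buffer as us-ascii, utf-8, or unknown-8bit.

    Hypothetical helper: it scans every byte rather than a prefix,
    and it validates UTF-8 instead of trusting the high bit alone.
    """
    # No byte with the high ("bucky") bit set: plain ASCII.
    if all(b < 0x80 for b in data):
        return "us-ascii"
    # High bits are present; only call it UTF-8 if it decodes strictly.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "unknown-8bit"
    return "utf-8"
```

With this sketch, the bytes of "fu☺" come back as utf-8, while a lone 0xE9 byte (a Latin-1 "é") is reported as unknown-8bit rather than being mistaken for utf-8. Detecting binary data such as NUL bytes is deliberately left out.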




* Re: [9fans] utf-8 text files from httpd
  2009-10-19 13:14 ` erik quanstrom
@ 2009-10-19 13:49   ` roger peppe
  0 siblings, 0 replies; 15+ messages in thread
From: roger peppe @ 2009-10-19 13:49 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

there's another problem with file -m that
i've been bitten by before: it ignores any
stuff after the first 6000 bytes.

so if you've got a mostly-ascii file with some
utf-8 characters 8K in, then it won't be picked up.

i think file -m should read the whole file, but that's just IMHO.
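
A rough Python sketch of the failure described here, assuming a sniffer that only looks at a fixed-size prefix. The 6000-byte limit is the figure quoted in this message; the function names are hypothetical, and treating any non-ASCII byte as utf-8 is a simplification for the example.

```python
SNIFF_LIMIT = 6000  # prefix size quoted in the thread, used here as an assumption


def has_non_ascii(data: bytes) -> bool:
    """True if any byte has its high bit set."""
    return any(b >= 0x80 for b in data)


def sniff_prefix(data: bytes, limit: int = SNIFF_LIMIT) -> str:
    """Guess the charset from the first `limit` bytes only."""
    return "utf-8" if has_non_ascii(data[:limit]) else "us-ascii"


def sniff_whole(data: bytes) -> str:
    """Guess the charset from the entire buffer."""
    return "utf-8" if has_non_ascii(data) else "us-ascii"


# A mostly-ASCII file with one UTF-8 character 8K in.
sample = b"a" * 8192 + "\u263a".encode("utf-8")
prefix_guess = sniff_prefix(sample)
whole_guess = sniff_whole(sample)
```

Here the prefix-only guess never sees the bytes past the limit, so the two guesses disagree for `sample`.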




* Re: [9fans] utf-8 text files from httpd
       [not found] <<fe41879c0910190300l51480646pf9630e90c6f30207@mail.gmail.com>
@ 2009-10-19 13:14 ` erik quanstrom
  2009-10-19 13:49   ` roger peppe
  0 siblings, 1 reply; 15+ messages in thread
From: erik quanstrom @ 2009-10-19 13:14 UTC (permalink / raw)
  To: 9fans

> Is the output of file(1) appropriate for this purpose?
> Shouldn't your sample file also be sent as UTF-8?

it should be.  for example since
	; echo ☺ | file
	stdin: short UTF text	# sic
one would expect that echo ☺ | file -m
would yield text/plain; charset=utf-8.

> file(1) speaks only MIME type but not charset.

file does sometimes return a character set.

minooka;  grep -n charset /sys/src/cmd/file.c | sed 1q
594: 	0xfeff0000,	0xffffffff,	"utf-32be\n",
	"text/plain charset=utf-32be",

it doesn't make sense to me for file to be
inconsistent.  if file emits character sets, it
should always emit character sets.

i'm not sure why the ';' is dropped.  this would force
a client to parse the output.
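
To illustrate the parsing burden this puts on a client, here is a small Python sketch that accepts both the "text/plain charset=utf-32be" form shown above and the usual "text/plain; charset=utf-8" form. The function name `split_mime` is hypothetical and nothing here is taken from file.c.

```python
def split_mime(line: str):
    """Return (media_type, charset-or-None) from a MIME-ish string.

    Tolerates a missing ';' before the charset parameter by treating
    ';' and whitespace as equivalent separators.
    """
    parts = line.replace(";", " ").split()
    if not parts:
        return ("", None)
    media_type = parts[0].lower()
    charset = None
    for part in parts[1:]:
        key, sep, value = part.partition("=")
        if sep and key.lower() == "charset":
            charset = value.strip('"').lower()
    return (media_type, charset)
```

If file always emitted the ';' and always emitted a charset for text, a client could pass the string straight through as a Content-Type value and would not need a helper like this.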

> it is difficult or impossible to determine charset from a few japanese
> letters.

plan 9 is a utf-8 system.  if we have files in another
character set that's not a proper subset, most plan 9
tools will not work properly on them.

also, since it is hard to guess the charset of particular
japanese-encoded files, it would probably be good to
force their encoding with html decoration.

- erik




* Re: [9fans] utf-8 text files from httpd
  2009-10-19 10:00   ` Akshat Kumar
@ 2009-10-19 12:45     ` Kenji Arisawa
  0 siblings, 0 replies; 15+ messages in thread
From: Kenji Arisawa @ 2009-10-19 12:45 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

I think it is difficult to make a web server work correctly when the
server holds text files in a variety of charsets.
Although we can manually select the charset from the browser menu, that
selection is useless when the page is written in JavaScript that fills
some portion of the page by reading a text file.
(Note that the text file will be interpreted as ASCII without
"charset" in the HTTP header.)
I believe the only solution that makes everything work correctly is to
write all text files in UTF-8 and put "charset=utf-8" in the HTTP header,
as Erik is trying.

P.S.
file(1) speaks only MIME type but not charset.
it is difficult or impossible to determine charset from a few japanese
letters.

Kenji Arisawa


On 2009/10/19, at 19:00, Akshat Kumar wrote:

> new/sendfd.c:243 c old/sendfd.c:243
> <
> ---
>> /*
> new/sendfd.c:246 c old/sendfd.c:246
> <
> ---
>> */
>
> (context: text/plain -> text/plain; charset=utf-8)
>
> Now my text files can be read in the proper encoding
> by default, and are not interpreted by browsers (as
> well as certain applications) to be whack ASCII.
>
> Is the output of file(1) appropriate for this purpose?
> Shouldn't your sample file also be sent as UTF-8?
>
> Thank you for the input, Mr. Arisawa. I agree with
> Erik in this case, as you wouldn't be doing much with
> files of other encodings on Plan 9 (well, prior to a
> tcs(1)), you really only need to worry about getting
> across UTF-8.
>
> The point about file handling being up to browsers is
> appropriate. However, I'd like to push as much standard
> behaviour from the server as I can. If there's an explicit
> account of the encoding and type of a file, then there
> ought to be no ambiguity.
>
>
> Thanks,
> ak
>





* Re: [9fans] utf-8 text files from httpd
  2009-10-19  1:37 ` erik quanstrom
@ 2009-10-19 10:00   ` Akshat Kumar
  2009-10-19 12:45     ` Kenji Arisawa
  0 siblings, 1 reply; 15+ messages in thread
From: Akshat Kumar @ 2009-10-19 10:00 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

new/sendfd.c:243 c old/sendfd.c:243
<
---
> /*
new/sendfd.c:246 c old/sendfd.c:246
<
---
> */

(context: text/plain -> text/plain; charset=utf-8)

Now my text files can be read in the proper encoding
by default, and are not interpreted by browsers (as
well as certain applications) to be whack ASCII.
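
The effect of the change (text/plain becoming text/plain; charset=utf-8) can be sketched in Python as below. This only illustrates the header rewrite, under the assumption that every text file on the server is UTF-8; it is not the code in sendfd.c, and the function name is hypothetical.

```python
def with_utf8_charset(content_type: str) -> str:
    """Append '; charset=utf-8' to text/* types that carry no charset."""
    base = content_type.strip()
    lowered = base.lower()
    # Only text types get a charset; leave images, PDFs, etc. alone.
    if not lowered.startswith("text/"):
        return base
    # Respect a charset that is already present.
    if "charset=" in lowered:
        return base
    return base + "; charset=utf-8"
```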

Is the output of file(1) appropriate for this purpose?
Shouldn't your sample file also be sent as UTF-8?

Thank you for the input, Mr. Arisawa. I agree with
Erik in this case, as you wouldn't be doing much with
files of other encodings on Plan 9 (well, prior to a
tcs(1)), you really only need to worry about getting
across UTF-8.

The point about file handling being up to browsers is
appropriate. However, I'd like to push as much standard
behaviour from the server as I can. If there's an explicit
account of the encoding and type of a file, then there
ought to be no ambiguity.


Thanks,
ak




* Re: [9fans] utf-8 text files from httpd
@ 2009-10-19  9:05 Eris Discordia
  0 siblings, 0 replies; 15+ messages in thread
From: Eris Discordia @ 2009-10-19  9:05 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

The decision whether to open in place or save to disk based on MIME type is
up to the browser. For example, I set my browsers to ask to save to disk
application/pdf documents (rather than opening them with Adobe Acrobat's
problem plugin). A MIME type of text/plain (without any specification of
encoding) is correct (and expected by any mainstream browser) for text
files. Opera opens those by default but can be set to do any one of a
variety of tasks when encountering text/plain. All mainstream browsers also
include encoding autodetection routines which may or may not fail depending
on your file's contents. All mainstream browsers also allow you to select
an encoding to decode and view your document in.

Assuming the right bytes arrive at your client it is always possible to
read the file in the right encoding. The encoding specified in response
header has no say in the bytes that are transmitted.
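
A short Python sketch of that point: the bytes on the wire are the same either way, and only the decoding chosen by the client differs. The sample string "fu☺" is borrowed from elsewhere in this thread; windows-1252 is used because it is the fallback Opera reports below.

```python
# The same payload, as it would travel over the wire.
payload = "fu\u263a".encode("utf-8")  # b'fu\xe2\x98\xba'

# Decoded with the right encoding, the text is intact.
as_utf8 = payload.decode("utf-8")

# Decoded as windows-1252, each byte of the 3-byte sequence
# becomes its own (wrong) character.
as_cp1252 = payload.decode("cp1252")

# The bytes themselves were never altered: re-encoding recovers them.
round_trip = as_cp1252.encode("cp1252")
```

So a wrong or missing charset changes what the reader sees by default, not what was transmitted, which is why picking UTF-8 by hand in the browser repairs the display.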

If your "any browser" includes Opera try Preferences > Advanced > Downloads
> (Uncheck "Hide file types opened with Opera") > Quick Search text/plain >
Edit > Action: Open with Opera (if the setting has been altered). Then
retry visiting your remote file. Even if response header contains the wrong
encoding (ISO-8859-1, EUC-KR, whatever) or no encoding specification at all
Opera should retrieve the document and display it. If the display is wrong,
try View > Encoding > Unicode > UTF-8.

The behavior you describe of "having to download the file" and "characters
being garbled" is not "any browser" sort of behavior. Neither Opera, nor
Firefox, nor Chrome display such behavior for the example I have supplied
below.

If all else fails... why not wget -S [URI] and check (and probably post)
the response header?

This resource, for example:

<http://www.phrack.org/issues.html?issue=66&id=3&mode=txt>

results in this response header:

>   HTTP/1.1 200 OK
>   Date: Sun, 18 Oct 2009 10:45:56 GMT
>   Server: Apache
>   X-Powered-By: PHP/5.2.8-pl2-gentoo
>   Cache-Control: no-store, no-cache
>   Connection: close
>   Content-Type: text/plain

And there's no problem whatsoever with its display in Opera, Chrome,
or Firefox. Opera Info Panel says, by the way:

> Encoding (used by Opera):
> - not supplied - (windows-1252)




--On Sunday, October 18, 2009 20:34 -0400 Akshat Kumar
<akumar@mail.nanosouffle.net> wrote:

> I'm trying to put up a plain text file containing UTF-8
> characters from httpd, but when viewing it from any
> browser, it comes off as an ASCII file that needs to
> be downloaded (so, those characters are garbled).
> Is this due to some behaviour of httpd?
>
> ak
>




* Re: [9fans] utf-8 text files from httpd
       [not found] <<A6127A93-8E78-4E11-9284-56A16D2A2093@ar.aichi-u.ac.jp>
@ 2009-10-19  4:46 ` erik quanstrom
  0 siblings, 0 replies; 15+ messages in thread
From: erik quanstrom @ 2009-10-19  4:46 UTC (permalink / raw)
  To: 9fans

> Thus, hard coding "charset=utf-8" in the http header will bring another
> problem, because that coding disables a line in the html header such as:
> 	<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">

that should not be a problem on a plan 9 system;
plan 9's character set is utf-8.

- erik




* Re: [9fans] utf-8 text files from httpd
  2009-10-19  2:16 ` Kenji Arisawa
@ 2009-10-19  3:35   ` Kenji Arisawa
  0 siblings, 0 replies; 15+ messages in thread
From: Kenji Arisawa @ 2009-10-19  3:35 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

we should note also http://www.w3.org/TR/html4/charset.html#h-5.2.2.
the document says:

	To sum up, conforming user agents must observe the following
	priorities when determining a document's character encoding
	(from highest priority to lowest):
	1. An HTTP "charset" parameter in a "Content-Type" field.
	2. A META declaration with "http-equiv" set to "Content-Type" and a
	   value set for "charset".
	3. The charset attribute set on an element that designates an
	   external resource.

Thus, hard coding "charset=utf-8" in the http header will bring another
problem, because that coding disables a line in the html header such as:
	<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">

Kenji Arisawa
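
The priority order quoted above can be sketched in Python as a first-match lookup. The function and parameter names are hypothetical; real user agents do considerably more (autodetection, user overrides) than this shows.

```python
from typing import Optional


def pick_encoding(http_charset: Optional[str] = None,
                  meta_charset: Optional[str] = None,
                  link_charset: Optional[str] = None) -> Optional[str]:
    """Return the first charset present, in HTML 4 priority order:
    HTTP header, then META declaration, then the charset attribute
    on the element that referenced the resource."""
    for candidate in (http_charset, meta_charset, link_charset):
        if candidate:
            return candidate.lower()
    return None
```

With a hard-coded HTTP charset, the first argument is always set, so a Shift_JIS META declaration in the page is never consulted. That is the conflict described in the message above.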

On 2009/10/19, at 11:16, Kenji Arisawa wrote:

> according to rfc2616, the default charset when sending a text file is
> ISO-8859-1:
>
>   The "charset" parameter is used with some media types to define the
>   character set (section 3.4) of the data. When no explicit charset
>   parameter is provided by the sender, media subtypes of the "text"
>   type are defined to have a default charset value of "ISO-8859-1"
>   when received via HTTP. Data in character sets other than
>   "ISO-8859-1" or its subsets MUST be labeled with an appropriate
>   charset value. See section 3.4.1 for compatibility problems.
>
> httpd needs an explicit charset=utf-8 in the http header when sending
> utf-8 text.
>
> Kenji Arisawa
>
> On 2009/10/19, at 9:34, Akshat Kumar wrote:
>
>> I'm trying to put up a plain text file containing UTF-8
>> characters from httpd, but when viewing it from any
>> browser, it comes off as an ASCII file that needs to
>> be downloaded (so, those characters are garbled).
>> Is this due to some behaviour of httpd?
>>
>> ak
>>
>
>





* Re: [9fans] utf-8 text files from httpd
  2009-10-19  0:34 Akshat Kumar
  2009-10-19  1:39 ` andrey mirtchovski
@ 2009-10-19  2:16 ` Kenji Arisawa
  2009-10-19  3:35   ` Kenji Arisawa
  1 sibling, 1 reply; 15+ messages in thread
From: Kenji Arisawa @ 2009-10-19  2:16 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

according to rfc2616, the default charset when sending a text file is
ISO-8859-1:

    The "charset" parameter is used with some media types to define the
    character set (section 3.4) of the data. When no explicit charset
    parameter is provided by the sender, media subtypes of the "text"
    type are defined to have a default charset value of "ISO-8859-1"
    when received via HTTP. Data in character sets other than
    "ISO-8859-1" or its subsets MUST be labeled with an appropriate
    charset value. See section 3.4.1 for compatibility problems.

httpd needs an explicit charset=utf-8 in the http header when sending
utf-8 text.
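
As a rough Python sketch of the rule quoted above: a receiver that follows RFC 2616 to the letter would compute the charset of a response like this. The function name is hypothetical, and real browsers deviate from the rule (for example by autodetecting).

```python
from typing import Optional


def effective_charset(content_type: str) -> Optional[str]:
    """Charset a strict RFC 2616 receiver would assume for a
    Content-Type value: the explicit parameter if present,
    otherwise ISO-8859-1 for text/* and nothing for other types."""
    media_type, _, params = content_type.partition(";")
    media_type = media_type.strip().lower()
    for param in params.split(";"):
        key, sep, value = param.partition("=")
        if sep and key.strip().lower() == "charset":
            return value.strip().strip('"').lower()
    if media_type.startswith("text/"):
        return "iso-8859-1"
    return None
```

So a bare "text/plain" response carrying UTF-8 bytes is, by the letter of the RFC, mislabeled.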

Kenji Arisawa

On 2009/10/19, at 9:34, Akshat Kumar wrote:

> I'm trying to put up a plain text file containing UTF-8
> characters from httpd, but when viewing it from any
> browser, it comes off as an ASCII file that needs to
> be downloaded (so, those characters are garbled).
> Is this due to some behaviour of httpd?
>
> ak
>





* Re: [9fans] utf-8 text files from httpd
  2009-10-19  0:34 Akshat Kumar
@ 2009-10-19  1:39 ` andrey mirtchovski
  2009-10-19  2:16 ` Kenji Arisawa
  1 sibling, 0 replies; 15+ messages in thread
From: andrey mirtchovski @ 2009-10-19  1:39 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

your mimetypes are probably maim-typed (heh). see /sys/lib/mimetype
for a fix, or put this in your page's <head> section:

	<META HTTP-EQUIV="Content-Type" content="text/html; charset=utf-8">



On Sun, Oct 18, 2009 at 6:34 PM, Akshat Kumar
<akumar@mail.nanosouffle.net> wrote:
> I'm trying to put up a plain text file containing UTF-8
> characters from httpd, but when viewing it from any
> browser, it comes off as an ASCII file that needs to
> be downloaded (so, those characters are garbled).
> Is this due to some behaviour of httpd?
>
> ak
>
>




* Re: [9fans] utf-8 text files from httpd
       [not found] <<fe41879c0910181734l6363baebsa896bda992d690@mail.gmail.com>
@ 2009-10-19  1:37 ` erik quanstrom
  2009-10-19 10:00   ` Akshat Kumar
  0 siblings, 1 reply; 15+ messages in thread
From: erik quanstrom @ 2009-10-19  1:37 UTC (permalink / raw)
  To: 9fans

On Sun Oct 18 20:37:23 EDT 2009, akumar@mail.nanosouffle.net wrote:
> I'm trying to put up a plain text file containing UTF-8
> characters from httpd, but when viewing it from any
> browser, it comes off as an ASCII file that needs to
> be downloaded (so, those characters are garbled).
> Is this due to some behaviour of httpd?

httpd(8) is dropping the ball on this one, and i don't
see an easy way to fix it without a hack, since the
specification of /sys/lib/mimetype lacks a way to add
a charset.

(as /sys/lib/mimetype(6) is missing, it's somewhat of
a guess what the format really is.)
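
To make the gap concrete, here is a hypothetical Python sketch of a suffix table that does carry a charset column. It is not the format of /sys/lib/mimetype (which, as noted, is undocumented); the table contents and names are invented for illustration.

```python
import os

# Hypothetical table: suffix -> (media type, charset or None).
SUFFIX_TYPES = {
    ".txt": ("text/plain", "utf-8"),
    ".html": ("text/html", "utf-8"),
    ".png": ("image/png", None),
}


def content_type_for(path: str) -> str:
    """Build a Content-Type value from the file suffix alone."""
    suffix = os.path.splitext(path)[1].lower()
    media_type, charset = SUFFIX_TYPES.get(
        suffix, ("application/octet-stream", None))
    if charset:
        return f"{media_type}; charset={charset}"
    return media_type
```

A suffix lookup like this never reads the file, so the charset column is a promise about the server's contents rather than something detected.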

httpd is (again) being one step too cute, relying on the
suffix of the file, rather than the output of file(1).
but if it did use file(1), we would be dealing with a bug in file:
it returns "short Ascii" for a file with the contents
"fu☺".

there is already a hack in /sys/src/cmd/ip/httpd/sendfd.c
but it's commented out.  i am not sure why.  that hack
should never cause problems today and can only solve
them.

i'd recommend submitting a patch without the comment.

- erik

* if you haven't chased a nonexistent httpd.rewrite bug
because httpd was caching /sys/lib/httpd.rewrite and you'd
forgotten to issue the magic 50 spurious requests, you likely
haven't used it.




* [9fans] utf-8 text files from httpd
@ 2009-10-19  0:34 Akshat Kumar
  2009-10-19  1:39 ` andrey mirtchovski
  2009-10-19  2:16 ` Kenji Arisawa
  0 siblings, 2 replies; 15+ messages in thread
From: Akshat Kumar @ 2009-10-19  0:34 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

I'm trying to put up a plain text file containing UTF-8
characters from httpd, but when viewing it from any
browser, it comes off as an ASCII file that needs to
be downloaded (so, those characters are garbled).
Is this due to some behaviour of httpd?

ak




