From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
Date: Sun, 18 Oct 2009 20:34:56 -0400
Message-ID:
From: Akshat Kumar
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
Content-Type: text/plain; charset=ISO-8859-1
Subject: [9fans] utf-8 text files from httpd
Topicbox-Message-UUID: 89f955e4-ead5-11e9-9d60-3106f5b1d025

I'm trying to put up a plain text file containing UTF-8
characters from httpd, but when viewing it from any
browser, it comes off as an ASCII file that needs to
be downloaded (so, those characters are garbled).
Is this due to some behaviour of httpd?

ak

From mboxrd@z Thu Jan 1 00:00:00 1970
From: erik quanstrom
Date: Sun, 18 Oct 2009 21:37:04 -0400
To: 9fans@9fans.net
Message-ID:
In-Reply-To: <>
References: <>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8bit
Subject: Re: [9fans] utf-8 text files from httpd
Topicbox-Message-UUID: 8a0d871c-ead5-11e9-9d60-3106f5b1d025

On Sun Oct 18 20:37:23 EDT 2009, akumar@mail.nanosouffle.net wrote:
> I'm trying to put up a plain text file containing UTF-8
> characters from httpd, but when viewing it from any
> browser, it comes off as an ASCII file that needs to
> be downloaded (so, those characters are garbled).
> Is this due to some behaviour of httpd?

httpd(8) is dropping the ball on this one, and i don't see
an easy way to fix it without a hack, since the specification
of /sys/lib/mimetype lacks a way to add a charset. (as
/sys/lib/mimetype(6) is missing, it's somewhat of a guess
what the format really is.)

httpd is (again) being one step too cute, relying on the suffix
of the file, rather than the output of file(1). but if it did,
we would be dealing with a bug in file. it returns "short Ascii"
for a file with the contents "fu☺".

there is already a hack in /sys/src/cmd/ip/httpd/sendfd.c
but it's commented out. i am not sure why. that hack should
never cause problems today and can only solve them. i'd
recommend submitting a patch without the comment.

- erik

* if you haven't chased a nonexistent httpd.rewrite bug
because httpd was caching /sys/lib/httpd.rewrite and you'd
forgotten to issue the magic 50 spurious requests, you likely
haven't used it.

From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
In-Reply-To:
References:
Date: Sun, 18 Oct 2009 19:39:20 -0600
Message-ID: <14ec7b180910181839t3103644dsa6a9a24079f14652@mail.gmail.com>
From: andrey mirtchovski
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
Content-Type: text/plain; charset=UTF-8
Subject: Re: [9fans] utf-8 text files from httpd
Topicbox-Message-UUID: 8a1a40e2-ead5-11e9-9d60-3106f5b1d025

your mimetypes are probably maim-typed (heh). see
/sys/lib/mimetype for a fix, or put this in your page's section:

On Sun, Oct 18, 2009 at 6:34 PM, Akshat Kumar wrote:
> I'm trying to put up a plain text file containing UTF-8
> characters from httpd, but when viewing it from any
> browser, it comes off as an ASCII file that needs to
> be downloaded (so, those characters are garbled).
> Is this due to some behaviour of httpd?
>
> ak
>
>

From mboxrd@z Thu Jan 1 00:00:00 1970
Content-Type: text/plain; charset=us-ascii; format=flowed; delsp=yes
Mime-Version: 1.0 (Apple Message framework v1076)
From: Kenji Arisawa
In-Reply-To:
Date: Mon, 19 Oct 2009 11:16:28 +0900
Content-Transfer-Encoding: 7bit
Message-Id:
References:
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
Subject: Re: [9fans] utf-8 text files from httpd
Topicbox-Message-UUID: 8a20ed16-ead5-11e9-9d60-3106f5b1d025

according to rfc2616, the default charset when sending a text file
is ISO-8859-1:

    The "charset" parameter is used with some media types to define the
    character set (section 3.4) of the data. When no explicit charset
    parameter is provided by the sender, media subtypes of the "text"
    type are defined to have a default charset value of "ISO-8859-1"
    when received via HTTP. Data in character sets other than
    "ISO-8859-1" or its subsets MUST be labeled with an appropriate
    charset value.
    See section 3.4.1 for compatibility problems.

httpd needs an explicit charset=utf-8 in the http header when sending
utf-8 text.

Kenji Arisawa

On 2009/10/19, at 9:34, Akshat Kumar wrote:

> I'm trying to put up a plain text file containing UTF-8
> characters from httpd, but when viewing it from any
> browser, it comes off as an ASCII file that needs to
> be downloaded (so, those characters are garbled).
> Is this due to some behaviour of httpd?
>
> ak
>

From mboxrd@z Thu Jan 1 00:00:00 1970
Content-Type: text/plain; charset=us-ascii; format=flowed; delsp=yes
Mime-Version: 1.0 (Apple Message framework v1076)
From: Kenji Arisawa
In-Reply-To:
Date: Mon, 19 Oct 2009 12:35:50 +0900
Content-Transfer-Encoding: 7bit
Message-Id:
References:
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
Subject: Re: [9fans] utf-8 text files from httpd
Topicbox-Message-UUID: 8a29fe06-ead5-11e9-9d60-3106f5b1d025

we should also note
http://www.w3.org/TR/html4/charset.html#h-5.2.2.
the document says:

    To sum up, conforming user agents must observe the following
    priorities when determining a document's character encoding
    (from highest priority to lowest):

    1. An HTTP "charset" parameter in a "Content-Type" field.
    2. A META declaration with "http-equiv" set to "Content-Type"
       and a value set for "charset".
    3. The charset attribute set on an element that designates an
       external resource.

Thus, hard coding "charset=utf-8" in the http header will bring another
problem, because that coding disables a line in the html header such as:

Kenji Arisawa

On 2009/10/19, at 11:16, Kenji Arisawa wrote:

> according to rfc2616, the default charset when sending a text file
> is ISO-8859-1:
>
> The "charset" parameter is used with some media types to define the
> character set (section 3.4) of the data. When no explicit charset
> parameter is provided by the sender, media subtypes of the "text"
> type are defined to have a default charset value of "ISO-8859-1" when
> received via HTTP.
> Data in character sets other than "ISO-8859-1" or
> its subsets MUST be labeled with an appropriate charset value. See
> section 3.4.1 for compatibility problems.
>
> httpd needs an explicit charset=utf-8 in the http header when sending
> utf-8 text.
>
> Kenji Arisawa
>
> On 2009/10/19, at 9:34, Akshat Kumar wrote:
>
>> I'm trying to put up a plain text file containing UTF-8
>> characters from httpd, but when viewing it from any
>> browser, it comes off as an ASCII file that needs to
>> be downloaded (so, those characters are garbled).
>> Is this due to some behaviour of httpd?
>>
>> ak
>>
>
>

From mboxrd@z Thu Jan 1 00:00:00 1970
From: erik quanstrom
Date: Mon, 19 Oct 2009 00:46:48 -0400
To: 9fans@9fans.net
Message-ID:
In-Reply-To: <>
References: <>
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
Subject: Re: [9fans] utf-8 text files from httpd
Topicbox-Message-UUID: 8a47e358-ead5-11e9-9d60-3106f5b1d025

> Thus, hard coding "charset=utf-8" in the http header will bring another
> problem, because that coding disables a line in the html header such as:
>

that should not be a problem on a plan 9 system;
plan 9's character set is utf-8.

- erik

From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Mon, 19 Oct 2009 10:05:39 +0100
From: Eris Discordia
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
Message-ID: <1EE8D30F45EA3DBC0F877B25@[192.168.1.2]>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Subject: Re: [9fans] utf-8 text files from httpd
Topicbox-Message-UUID: 8a545976-ead5-11e9-9d60-3106f5b1d025

The decision whether to open in place or save to disk based on MIME type is
up to the browser. For example, I set my browsers to ask to save to disk
application/pdf documents (rather than opening them with Adobe Acrobat's
problem plugin).
A MIME type of text/plain (without any specification of
encoding) is correct (and expected by any mainstream browser) for text
files. Opera opens those by default but can be set to do any one of a
variety of tasks when encountering text/plain.

All mainstream browsers also include encoding autodetection routines which
may or may not fail depending on your file's contents. All mainstream
browsers also allow you to select an encoding to decode and view your
document in. Assuming the right bytes arrive at your client it is always
possible to read the file in the right encoding. The encoding specified in
the response header has no say in the bytes that are transmitted.

If your "any browser" includes Opera try Preferences > Advanced > Downloads
> (Uncheck "Hide file types opened with Opera") > Quick Search text/plain >
Edit > Action: Open with Opera (if the setting has been altered). Then
retry visiting your remote file. Even if the response header contains the
wrong encoding (ISO-8859-1, EUC-KR, whatever) or no encoding specification
at all, Opera should retrieve the document and display it. If the display
is wrong, try View > Encoding > Unicode > UTF-8.

The behavior you describe of "having to download the file" and "characters
being garbled" is not "any browser" sort of behavior. Neither Opera, nor
Firefox, nor Chrome display such behavior for the example I have supplied
below.

If all else fails... why not wget -S [URI] and check (and probably post)
the response header?

This resource, for example:

results in this response header:

> HTTP/1.1 200 OK
> Date: Sun, 18 Oct 2009 10:45:56 GMT
> Server: Apache
> X-Powered-By: PHP/5.2.8-pl2-gentoo
> Cache-Control: no-store, no-cache
> Connection: close
> Content-Type: text/plain

And there's no problem whatsoever with its display in any of Opera,
Chrome, or Firefox.
Opera Info Panel says, by the way:

> Encoding (used by Opera):
> - not supplied - (windows-1252)

--On Sunday, October 18, 2009 20:34 -0400 Akshat Kumar wrote:

> I'm trying to put up a plain text file containing UTF-8
> characters from httpd, but when viewing it from any
> browser, it comes off as an ASCII file that needs to
> be downloaded (so, those characters are garbled).
> Is this due to some behaviour of httpd?
>
> ak
>

From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
In-Reply-To:
References:
Date: Mon, 19 Oct 2009 06:00:26 -0400
Message-ID:
From: Akshat Kumar
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
Content-Type: text/plain; charset=ISO-8859-1
Subject: Re: [9fans] utf-8 text files from httpd
Topicbox-Message-UUID: 8a590098-ead5-11e9-9d60-3106f5b1d025

new/sendfd.c:243 c old/sendfd.c:243
<
---
> /*
new/sendfd.c:246 c old/sendfd.c:246
<
---
> */

(context: text/plain -> text/plain; charset=utf-8)

Now my text files can be read in the proper encoding
by default, and are not interpreted by browsers (as
well as certain applications) to be whack ASCII.

Is the output of file(1) appropriate for this purpose?
Shouldn't your sample file also be sent as UTF-8?

Thank you for the input, Mr. Arisawa. I agree with
Erik in this case, as you wouldn't be doing much with
files of other encodings on Plan 9 (well, prior to a
tcs(1)), you really only need to worry about getting
across UTF-8.

The point about file handling being up to browsers is
appropriate. However, I'd like to push as much standard
behaviour from the server as I can. If there's an explicit
account of the encoding and type of a file, then there
ought to be no ambiguity.
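
For illustration only, the change summarized above as "text/plain -> text/plain; charset=utf-8" can be sketched in portable C. This is not the code in sendfd.c; the function name contenttype and its signature are hypothetical, and the match on "text/" and "charset=" is case-sensitive, which a real server would probably want to relax.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not taken from httpd: write a Content-Type
 * value into buf.  A text/ type that carries no charset parameter
 * gets "; charset=utf-8" appended; anything else is copied as is.
 * Returns what snprintf returns (the length that was needed).
 */
static int
contenttype(char *buf, size_t n, const char *mime)
{
	assert(buf != NULL && mime != NULL);
	if(strncmp(mime, "text/", 5) == 0 && strstr(mime, "charset=") == NULL)
		return snprintf(buf, n, "%s; charset=utf-8", mime);
	return snprintf(buf, n, "%s", mime);
}
```

Calling contenttype(buf, sizeof buf, "text/plain") leaves "text/plain; charset=utf-8" in buf, while "image/png" or a type that already names a charset passes through unchanged.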

Thanks,
ak

From mboxrd@z Thu Jan 1 00:00:00 1970
Content-Type: text/plain; charset=us-ascii; format=flowed; delsp=yes
Mime-Version: 1.0 (Apple Message framework v1076)
From: Kenji Arisawa
In-Reply-To:
Date: Mon, 19 Oct 2009 21:45:57 +0900
Content-Transfer-Encoding: 7bit
Message-Id: <602B15C1-A53C-467F-AB4C-5C820228D998@ar.aichi-u.ac.jp>
References:
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
Subject: Re: [9fans] utf-8 text files from httpd
Topicbox-Message-UUID: 8a68ca8c-ead5-11e9-9d60-3106f5b1d025

I think it is difficult to make a web server work correctly when we
have text files in a variety of charsets on the server.
Although we can manually select the charset in the browser menu, the
selection is useless when the page is written in Javascript that
fills some portion of the page by reading a text file.
(note that the text file will be interpreted as ascii without "charset"
in the http header.)

I believe the only solution in which everything works correctly is to
write all text files in utf-8 and put "charset=utf-8" in the http
header, as Erik is trying.

P.S.
file(1) speaks only MIME type but not charset.
it is difficult or impossible to determine the charset from a few
japanese letters.

Kenji Arisawa

On 2009/10/19, at 19:00, Akshat Kumar wrote:

> new/sendfd.c:243 c old/sendfd.c:243
> <
> ---
>> /*
> new/sendfd.c:246 c old/sendfd.c:246
> <
> ---
>> */
>
> (context: text/plain -> text/plain; charset=utf-8)
>
> Now my text files can be read in the proper encoding
> by default, and are not interpreted by browsers (as
> well as certain applications) to be whack ASCII.
>
> Is the output of file(1) appropriate for this purpose?
> Shouldn't your sample file also be sent as UTF-8?
>
> Thank you for the input, Mr. Arisawa. I agree with
> Erik in this case, as you wouldn't be doing much with
> files of other encodings on Plan 9 (well, prior to a
> tcs(1)), you really only need to worry about getting
> across UTF-8.
>
> The point about file handling being up to browsers is
> appropriate.
> However, I'd like to push as much standard
> behaviour from the server as I can. If there's an explicit
> account of the encoding and type of a file, then there
> ought to be no ambiguity.
>
>
> Thanks,
> ak
>

From mboxrd@z Thu Jan 1 00:00:00 1970
From: erik quanstrom
Date: Mon, 19 Oct 2009 09:14:41 -0400
To: 9fans@9fans.net
Message-ID: <25526834e4974523e25c09565df13029@brasstown.quanstro.net>
In-Reply-To: <>
References: <>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8bit
Subject: Re: [9fans] utf-8 text files from httpd
Topicbox-Message-UUID: 8a6cf4d6-ead5-11e9-9d60-3106f5b1d025

> Is the output of file(1) appropriate for this purpose?
> Shouldn't your sample file also be sent as UTF-8?

it should be. for example since

	; echo ☺ | file
	stdin: short UTF text	# sic

one would expect that echo ☺ | file -m would yield
text/plain; charset=utf-8.

> file(1) speaks only MIME type but not charset.

file does sometimes return a character set.

	minooka; grep -n charset /sys/src/cmd/file.c | sed 1q
	594: 0xfeff0000, 0xffffffff, "utf-32be\n", "text/plain charset=utf-32be",

it doesn't make sense to me for file to be inconsistent.
if file emits character sets, it should always emit character
sets. i'm not sure why the ';' is dropped. this would force
a client to parse the output.

> it is difficult or impossible to determine charset from a few japanese
> letters.

plan 9 is a utf-8 system. if we have files in another character
set that's not a proper subset, most plan 9 tools will not work
properly on them.

also, since it is hard to guess the charset of particular
japanese-encoded files, it would probably be good to force
their encoding with html decoration.

- erik

From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
In-Reply-To: <25526834e4974523e25c09565df13029@brasstown.quanstro.net>
References: <25526834e4974523e25c09565df13029@brasstown.quanstro.net>
Date: Mon, 19 Oct 2009 14:49:07 +0100
Message-ID:
From: roger peppe
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
Content-Type: text/plain; charset=UTF-8
Subject: Re: [9fans] utf-8 text files from httpd
Topicbox-Message-UUID: 8a8ab7f0-ead5-11e9-9d60-3106f5b1d025

there's another problem with file -m that
i've been bitten by before: it ignores any
stuff after the first 6000 bytes.

so if you've got a mostly-ascii file with some
utf-8 characters 8K in, then it won't be picked up.

i think file -m should read the whole file, but that's just IMHO.

From mboxrd@z Thu Jan 1 00:00:00 1970
From: erik quanstrom
Date: Mon, 19 Oct 2009 09:55:48 -0400
To: 9fans@9fans.net
Message-ID:
In-Reply-To: <>
References: <>
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
Subject: Re: [9fans] utf-8 text files from httpd
Topicbox-Message-UUID: 8a909170-ead5-11e9-9d60-3106f5b1d025

On Mon Oct 19 09:51:33 EDT 2009, rogpeppe@gmail.com wrote:
> there's another problem with file -m that
> i've been bitten by before: it ignores any
> stuff after the first 6000 bytes.
>
> so if you've got a mostly-ascii file with some
> utf-8 characters 8K in, then it won't be picked up.
>
> i think file -m should read the whole file, but that's just IMHO.

a relic trying to avoid ken's read ahead and firing up the
worm drives.

why try that hard? just call it utf-8. i can't think of
any browsers that would have a problem with that today.

- erik

From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
In-Reply-To:
References:
Date: Mon, 19 Oct 2009 15:32:54 +0100
Message-ID:
From: roger peppe
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Subject: Re: [9fans] utf-8 text files from httpd
Topicbox-Message-UUID: 8aa23916-ead5-11e9-9d60-3106f5b1d025

2009/10/19 erik quanstrom :
> why try that hard? just call it utf-8. i can't think of
> any browsers that would have a problem with that today.

the instance of the problem that i had was when
adding an attachment to a upas mail.
file -m is useful when the attachment might be
binary.

From mboxrd@z Thu Jan 1 00:00:00 1970
From: erik quanstrom
Date: Mon, 19 Oct 2009 10:50:56 -0400
To: 9fans@9fans.net
Message-ID: <7bfbca621ccbe0a6befa6fb02bdb8609@ladd.quanstro.net>
In-Reply-To: <>
References: <>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8bit
Subject: Re: [9fans] utf-8 text files from httpd
Topicbox-Message-UUID: 8aaa6a6e-ead5-11e9-9d60-3106f5b1d025

On Mon Oct 19 10:36:51 EDT 2009, rogpeppe@gmail.com wrote:
> 2009/10/19 erik quanstrom :
> > why try that hard? just call it utf-8. i can't think of
> > any browsers that would have a problem with that today.
>
> the instance of the problem that i had was when
> adding an attachment to a upas mail.
> file -m is useful when the attachment might be
> binary.

/sys/src/cmd/upas/marshal/marshal.c:/^body already scans
the whole file. it could never label as ascii something that's
not ascii. unfortunately it could be fooled by a bucky bit
that's not utf-8, since it doesn't check for valid utf-8.

it would be better to at least have a flag to file that tells it
to read the whole file and to have file always return the
character set to avoid distributing various and sundry hacks
about the system.

- erik

From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <34182b25dd5a7d619dde2a7c15fae485@proxima.alt.za>
To: 9fans@9fans.net
Date: Mon, 19 Oct 2009 19:36:28 +0200
From: lucio@proxima.alt.za
In-Reply-To:
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Subject: Re: [9fans] utf-8 text files from httpd
Topicbox-Message-UUID: 8ba33ca2-ead5-11e9-9d60-3106f5b1d025

> 2009/10/19 erik quanstrom :
>> why try that hard? just call it utf-8. i can't think of
>> any browsers that would have a problem with that today.
>
> the instance of the problem that i had was when
> adding an attachment to a upas mail.
> file -m is useful when the attachment might be
> binary.

Why not enhance "file -m" so that it is instructed to read the entire
file, then? Knowing the context, adding, say, a "b" option (for
"big") would not do any damage, right?

++L
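
The whole-file check discussed in the last few messages (scan every byte, and do not call a file utf-8 merely because a high bit is set) might look roughly like the sketch below. It is portable C rather than Plan 9 C and is not the code in file.c or marshal.c; the names classify, Ascii, Utf8 and Other are hypothetical, and the caller is assumed to have read the entire file into memory.

```c
#include <assert.h>
#include <stddef.h>

enum { Ascii, Utf8, Other };

/*
 * Classify an entire buffer (hypothetical helper):
 *   Ascii  every byte is below 0x80
 *   Utf8   at least one multi-byte sequence, all of them well formed
 *   Other  some high-bit byte is not part of valid utf-8
 *          (binary data, or text in a different charset)
 */
static int
classify(const unsigned char *p, size_t n)
{
	size_t i, k, need;
	unsigned long c, min;
	int kind;

	assert(p != NULL || n == 0);
	kind = Ascii;
	i = 0;
	while(i < n){
		c = p[i++];
		if(c < 0x80)
			continue;
		if((c & 0xE0) == 0xC0){
			need = 1; min = 0x80; c &= 0x1F;
		}else if((c & 0xF0) == 0xE0){
			need = 2; min = 0x800; c &= 0x0F;
		}else if((c & 0xF8) == 0xF0){
			need = 3; min = 0x10000; c &= 0x07;
		}else
			return Other;	/* stray continuation or invalid lead byte */
		if(n - i < need)
			return Other;	/* sequence cut off at end of buffer */
		for(k = 0; k < need; k++){
			if((p[i] & 0xC0) != 0x80)
				return Other;
			c = c<<6 | (p[i++] & 0x3F);
		}
		/* reject overlong forms, surrogates and values past U+10FFFF */
		if(c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
			return Other;
		kind = Utf8;
	}
	return kind;
}
```

With this sketch the bytes of "fu☺" classify as Utf8, plain "fu" as Ascii, and a Latin-1 "café" (a lone 0xE9 at the end) as Other. Because the scan covers the whole buffer, a few utf-8 characters 8K into a mostly-ascii file are still seen.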