From mboxrd@z Thu Jan  1 00:00:00 1970
From: erik quanstrom <quanstro@quanstro.net>
Date: Mon, 19 Oct 2009 09:14:41 -0400
To: 9fans@9fans.net
Message-ID: <25526834e4974523e25c09565df13029@brasstown.quanstro.net>
In-Reply-To: <<fe41879c0910190300l51480646pf9630e90c6f30207@mail.gmail.com>>
References: <<fe41879c0910190300l51480646pf9630e90c6f30207@mail.gmail.com>>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8bit
Subject: Re: [9fans] utf-8 text files from httpd
Topicbox-Message-UUID: 8a6cf4d6-ead5-11e9-9d60-3106f5b1d025

> Is the output of file(1) appropriate for this purpose?
> Shouldn't your sample file also be sent as UTF-8?

it should be.  for example, since
	; echo ☺ | file
	stdin: short UTF text	# sic
one would expect that echo ☺ | file -m
would yield text/plain; charset=utf-8.

> file(1) speaks only mime type but not charset.

file does sometimes return a character set.

minooka;  grep -n charset /sys/src/cmd/file.c | sed 1q
594: 	0xfeff0000,	0xffffffff,	"utf-32be\n",
	"text/plain charset=utf-32be",

it doesn't make sense to me for file to be
inconsistent.  if file emits character sets, it
should always emit character sets.

i'm not sure why the ';' is dropped in that entry.  without
it a client can't use the string as a content-type value
directly and is forced to parse the output.
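
to illustrate the parsing a client ends up doing, here is a
rough python sketch.  the function names parse_mime and
normalize are made up for this example and are not part of
file(1) or httpd; it only assumes the two output shapes shown
above, with and without the ';'.

```python
def parse_mime(line):
    """split file -m style output into (type, charset or None)."""
    line = line.strip()
    # accept both "type; charset=x" and the ';'-less "type charset=x"
    head, sep, rest = line.partition("charset=")
    mtype = head.strip().rstrip(";").strip()
    charset = rest.strip().lower() if sep else None
    return mtype, charset or None


def normalize(line):
    """rebuild the value with the ';' separator in place."""
    mtype, charset = parse_mime(line)
    if charset is None:
        return mtype
    return f"{mtype}; charset={charset}"


print(normalize("text/plain charset=utf-32be"))
```

if file always emitted "type; charset=x", none of this would
be needed and the output could be passed through unchanged.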

> it is difficult or impossible to determine charset from a few japanese
> letters.

plan 9 is a utf-8 system.  if we have files in another
character set that's not a proper subset, most plan 9
tools will not work properly on them.
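
as a sketch of what that assumption buys a server: if the
bytes decode as utf-8 (ascii included, since it is a proper
subset), they can be labelled utf-8; anything else gets no
charset.  the name content_type and the bare "text/plain"
fallback are assumptions for this example, not what plan 9's
httpd does.

```python
def content_type(data: bytes) -> str:
    """label data as utf-8 text when it decodes cleanly."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # assumption: unknown encoding, so claim no charset at all
        return "text/plain"
    return "text/plain; charset=utf-8"


print(content_type("☺\n".encode("utf-8")))
```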

also, since it is hard to guess the charset of particular
japanese-encoded files, it would probably be good to
force their encoding with html decoration.

- erik