9fans - fans of the OS Plan 9 from Bell Labs
* [9fans] conversion of charsets in upas/fs
@ 2005-01-29 13:57 Heiko Dudzus
  2005-01-31  1:21 ` Kenji Okamoto
  0 siblings, 1 reply; 18+ messages in thread
From: Heiko Dudzus @ 2005-01-29 13:57 UTC (permalink / raw)
  To: 9fans

upas/fs seems to convert mails with iso-8859-1 (and some other
charsets) to UTF-8 automatically.  (I looked into the source but
didn't understand how it actually works, as I couldn't find the
definition of every function.)

This conversion works fine with mails that are encoded
quoted-printable.  It doesn't work here with mails that are 8bit
encoded (at least not with 8859-1).

Isn't that also possible with 8bit encoded mails?

Regards, Heiko




* Re: [9fans] conversion of charsets in upas/fs
  2005-01-29 13:57 [9fans] conversion of charsets in upas/fs Heiko Dudzus
@ 2005-01-31  1:21 ` Kenji Okamoto
  2005-01-31 14:05   ` Heiko Dudzus
  0 siblings, 1 reply; 18+ messages in thread
From: Kenji Okamoto @ 2005-01-31  1:21 UTC (permalink / raw)
  To: 9fans

> Isn't that also possible with 8bit encoded mails?

What about its header line?
If the header doesn't have an appropriate line that Upas
supports, it may fail.

Kenji




* Re: [9fans] conversion of charsets in upas/fs
  2005-01-31  1:21 ` Kenji Okamoto
@ 2005-01-31 14:05   ` Heiko Dudzus
  2005-01-31 18:40     ` Russ Cox
  2005-02-01  1:42     ` Kenji Okamoto
  0 siblings, 2 replies; 18+ messages in thread
From: Heiko Dudzus @ 2005-01-31 14:05 UTC (permalink / raw)
  To: 9fans

Hello Kenji,

thanks for your answer,

>> Isn't that also possible with 8bit encoded mails?
> 
> What about its header line?
> If the header doesn't have an appropriate line that Upas
> supports, it may fail.

I will quote an example:

| Content-Type: text/plain; charset=ISO-8859-15; format=flowed
| Content-Transfer-Encoding: 8bit

in the header doesn't lead to properly displayed 'umlauts', but

| Content-Type: text/plain; charset=ISO-8859-15; format=flowed
| Content-Transfer-Encoding: quoted-printable

does lead to properly displayed 'umlauts'.




* Re: [9fans] conversion of charsets in upas/fs
  2005-01-31 14:05   ` Heiko Dudzus
@ 2005-01-31 18:40     ` Russ Cox
  2005-01-31 19:19       ` boyd, rounin
  2005-02-01  8:24       ` Heiko Dudzus
  2005-02-01  1:42     ` Kenji Okamoto
  1 sibling, 2 replies; 18+ messages in thread
From: Russ Cox @ 2005-01-31 18:40 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> | Content-Type: text/plain; charset=ISO-8859-15; format=flowed
> | Content-Transfer-Encoding: 8bit
> 
> in the header doesn't lead to properly displayed 'umlauts', but
> 
> | Content-Type: text/plain; charset=ISO-8859-15; format=flowed
> | Content-Transfer-Encoding: quoted-printable
> 
> does lead to properly displayed 'umlauts'.

I don't believe Plan 9 is causing the problem.  I just 
manually sent myself an 8-bit 8859-15 message on
Plan 9 with the script below and it came through just fine,
at least using "mail" to read (didn't try acme Mail but they
should be the same).

Probably some mail server before Plan 9 is screwing
with the 8-bit characters.

Russ


#!/usr/local/plan9/bin/rc

{
	echo 'HELO rsc'
	sleep 2
	echo 'MAIL FROM: <rsc@swtch.com>'
	sleep 2
	echo 'RCPT TO: <glenda@plan9.bell-labs.com>'
	sleep 2
	echo 'DATA'
	sleep 2
	echo 'Content-Type: text/plain; charset=ISO-8859-15; format=flowed'
	echo 'Content-Transfer-Encoding: 8bit'
	echo
	echo Hello world.
	echo Here is an umlaut: ü | tcs -t 8859-15
	echo
	echo .
	sleep 2
	echo QUIT
	sleep 2
} | dial -e tcp!204.178.31.2!25



* Re: [9fans] conversion of charsets in upas/fs
  2005-01-31 18:40     ` Russ Cox
@ 2005-01-31 19:19       ` boyd, rounin
  2005-02-01  8:24       ` Heiko Dudzus
  1 sibling, 0 replies; 18+ messages in thread
From: boyd, rounin @ 2005-01-31 19:19 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

    Probably some mail server before Plan 9 is screwing
    with the 8-bit characters.

yeah i've seen this before.  some ESMTP implementations screw up
and some blendmail mailers need the undocumented mailer flag '9'.
--
MGRS 31U DQ 52572 12604





* Re: [9fans] conversion of charsets in upas/fs
  2005-01-31 14:05   ` Heiko Dudzus
  2005-01-31 18:40     ` Russ Cox
@ 2005-02-01  1:42     ` Kenji Okamoto
  2005-02-01  3:42       ` Russ Cox
  1 sibling, 1 reply; 18+ messages in thread
From: Kenji Okamoto @ 2005-02-01  1:42 UTC (permalink / raw)
  To: 9fans

> | Content-Type: text/plain; charset=ISO-8859-15; format=flowed
> | Content-Transfer-Encoding: 8bit

I think this is not supported in Upas.
You have to add support for it yourself.

Kenji




* Re: [9fans] conversion of charsets in upas/fs
  2005-02-01  1:42     ` Kenji Okamoto
@ 2005-02-01  3:42       ` Russ Cox
  0 siblings, 0 replies; 18+ messages in thread
From: Russ Cox @ 2005-02-01  3:42 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> > | Content-Type: text/plain; charset=ISO-8859-15; format=flowed
> > | Content-Transfer-Encoding: 8bit
> 
> I think this is not supported in Upas.
> You have to add support for it yourself.

No.  It is supported, and it works well.
Russ



* Re: [9fans] conversion of charsets in upas/fs
  2005-01-31 18:40     ` Russ Cox
  2005-01-31 19:19       ` boyd, rounin
@ 2005-02-01  8:24       ` Heiko Dudzus
  2005-02-01 10:29         ` Heiko Dudzus
  1 sibling, 1 reply; 18+ messages in thread
From: Heiko Dudzus @ 2005-02-01  8:24 UTC (permalink / raw)
  To: russcox, 9fans

> I don't believe Plan 9 is causing the problem.  I just
> manually sent myself an 8-bit 8859-15 message on
> Plan 9 with the script below and it came through just fine,
> at least using "mail" to read (didn't try acme Mail but they
> should be the same).
>
> Probably some mail server before Plan 9 is screwing
> with the 8-bit characters.

The problem remained when I sent this test mail to the local smtp
server.  But I found out that all is fine when I move away my pipeto
file.

It seems as if /mail/lib/pipeto.lib introduces the problem somewhere.
I hope to find it.

Heiko




* Re: [9fans] conversion of charsets in upas/fs
  2005-02-01  8:24       ` Heiko Dudzus
@ 2005-02-01 10:29         ` Heiko Dudzus
  2005-02-01 17:11           ` Heiko Dudzus
  2005-02-01 18:26           ` Russ Cox
  0 siblings, 2 replies; 18+ messages in thread
From: Heiko Dudzus @ 2005-02-01 10:29 UTC (permalink / raw)
  To: 9fans

> The problem remained when I sent this test mail to the local smtp
> server.  But I found out that all is fine when I move away my pipeto
> file.
> 
> It seems as if /mail/lib/pipeto.lib introduces the problem somewhere.
> I hope to find it.

Ok, i took a mail, made by Russ' smtp dialogue script, and did
manually what pipeto.lib does with every mail.

% cd /mail/fs/mbox/124
% cat rawunix | sed '/^$/,$ s/^From / From /' > /tmp/msg

This file is already screwed.  I compared to the original rawunix file
with xd:

term% diff <{xd -c rawunix} <{xd -c /tmp/msg}
21,23c21,23
< 0000140      a  n     u  m  l  a  u  t  :    fc \n 04 04
< 0000150  \n
< 0000151 
---
> 0000140      a  n     u  m  l  a  u  t  :    c2 80 \n 04
> 0000150  04 \n
> 0000152 
term% 

0xfc represents 'ü' in iso-8859-15 but sed replaces it by 0xc2 and
0x80.  Why?  It should only hide bogus 'From ' lines in the mail body.
Is sed allowed to replace 0xfc by something different here?

Heiko




* Re: [9fans] conversion of charsets in upas/fs
  2005-02-01 10:29         ` Heiko Dudzus
@ 2005-02-01 17:11           ` Heiko Dudzus
  2005-02-01 20:41             ` Sape Mullender
  2005-02-01 18:26           ` Russ Cox
  1 sibling, 1 reply; 18+ messages in thread
From: Heiko Dudzus @ 2005-02-01 17:11 UTC (permalink / raw)
  To: 9fans

> 0xfc represents 'ü' in iso-8859-15 but sed replaces it by 0xc2 and
> 0x80.  Why?  It should only hide bogus 'From ' lines in the mail body.
> Is sed allowed to replace 0xfc by something different here?

I don't feel good replying to my own mails so often.  But someone
mailed me off-list and made clear that sed doesn't have to be able
to read non-UTF input.  I understand that point now.

Is it then desirable for pipeto.lib to deal with the situation?  I am
trying to modify it to fit my needs.  If pipeto.lib should deal with
incoming 8bit ISO-* mails I will put it as a patch on sources when I
have finished.

Heiko




* Re: [9fans] conversion of charsets in upas/fs
  2005-02-01 10:29         ` Heiko Dudzus
  2005-02-01 17:11           ` Heiko Dudzus
@ 2005-02-01 18:26           ` Russ Cox
  2005-02-01 18:37             ` rog
  1 sibling, 1 reply; 18+ messages in thread
From: Russ Cox @ 2005-02-01 18:26 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Sed operates on UTF, so if you give it non-UTF (aka garbage)
it replaces bad UTF sequences with error runes, which is what
you are seeing.

Probably the best thing to do is write a program that applies
the transformations you want but works a byte at a time and
is character-set-ignorant.

Russ



On Tue, 1 Feb 2005 11:29:32 +0100, Heiko Dudzus <heiko.dudzus@gmx.de> wrote:
> > The problem remained when I sent this test mail to the local smtp
> > server.  But I found out that all is fine when I move away my pipeto
> > file.
> >
> > It seems as if /mail/lib/pipeto.lib introduces the problem somewhere.
> > I hope to find it.
> 
> Ok, i took a mail, made by Russ' smtp dialogue script, and did
> manually what pipeto.lib does with every mail.
> 
> % cd /mail/fs/mbox/124
> % cat rawunix | sed '/^$/,$ s/^From / From /' > /tmp/msg
> 
> This file is already screwed.  I compared to the original rawunix file
> with xd:
> 
> term% diff <{xd -c rawunix} <{xd -c /tmp/msg}
> 21,23c21,23
> < 0000140      a  n     u  m  l  a  u  t  :    fc \n 04 04
> < 0000150  \n
> < 0000151
> ---
> > 0000140      a  n     u  m  l  a  u  t  :    c2 80 \n 04
> > 0000150  04 \n
> > 0000152
> term%
> 
> 0xfc represents 'ü' in iso-8859-15 but sed replaces it by 0xc2 and
> 0x80.  Why?  It should only hide bogus 'From ' lines in the mail body.
> Is sed allowed to replace 0xfc by something different here?
> 
> Heiko
> 
>



* Re: [9fans] conversion of charsets in upas/fs
  2005-02-01 18:26           ` Russ Cox
@ 2005-02-01 18:37             ` rog
  2005-02-01 19:50               ` boyd, rounin
  2005-02-02 10:59               ` Heiko Dudzus
  0 siblings, 2 replies; 18+ messages in thread
From: rog @ 2005-02-01 18:37 UTC (permalink / raw)
  To: 9fans

> Probably the best thing to do is write a program that applies
> the transformations you want but works a byte at a time and
> is character-set-ignorant.

alternatively, a very quick solution might be to get pipeto.lib
to take an extra verbatim copy of the message before running
upas/fs on it.

e.g.
	# save and parse the mail file
	cat > $TMP.rawmsg
	sed '/^$/,$ s/^From / From /' < $TMP.rawmsg >$TMP.msg
	upas/fs -p -f $TMP.msg || exit $status

and in fn spool:

	$BIN/deliver $RECIP $D/from $_mbox < $TMP.rawmsg || exit $status

this does mean that more space is used, and that the
spam classification isn't strictly accurate for non-utf charsets,
but it does avoid having to write any code...

N.B. i haven't tried this out!




* Re: [9fans] conversion of charsets in upas/fs
  2005-02-01 18:37             ` rog
@ 2005-02-01 19:50               ` boyd, rounin
  2005-02-02 10:59               ` Heiko Dudzus
  1 sibling, 0 replies; 18+ messages in thread
From: boyd, rounin @ 2005-02-01 19:50 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

this might help to debug rune errors (well that's what it was written for):

     http://www.insultant.net/code/plan9/vr.c
--
MGRS 31U DQ 52572 12604





* Re: [9fans] conversion of charsets in upas/fs
  2005-02-01 17:11           ` Heiko Dudzus
@ 2005-02-01 20:41             ` Sape Mullender
  2005-02-01 20:45               ` boyd, rounin
  0 siblings, 1 reply; 18+ messages in thread
From: Sape Mullender @ 2005-02-01 20:41 UTC (permalink / raw)
  To: 9fans

[-- Attachment #1: Type: text/plain, Size: 67 bytes --]

This might help to change your umlauts into real ones :-)

	Sape

[-- Attachment #2: utf2latin.c --]
[-- Type: text/plain, Size: 1858 bytes --]

#include <stdio.h>

FILE *f;

int main(int argc, char ** argv)
{	int c, c1, c2;
	unsigned int r;

	if (argc > 2) {
		fprintf(stderr, "Usage: %s [ file ]\n", argv[0]);
		return 1;
	}
	if (argc == 2) {
		f = fopen(argv[1],"r");
		if (f == NULL) {
			fprintf(stderr, "Can't open %s\n", argv[1]);
			return 1;
		}
	} else {
		f = stdin;
	}
	while ((c = getc(f)) != EOF) {
		if (c < 0x80) putchar(c);
		else if ((c & 0xe0) == 0xc0) {
			/* two-char rune */
			c1 = c;
			r = (c & 0x1f) << 6;
			if ((c = getc(f)) == EOF) {
				fprintf(stderr, "EOF in rune\n");
				putchar(c1);
				break;
			}
			if ((c & 0xc0) != 0x80) {
				fprintf(stderr, "Bad rune %x, %x\n", r, c);
				putchar(c1);
				putchar(c);
				continue;
			}
			r = r | (c & 0x3f);
			if (r < 0x100) putchar(r);
			else {
				fprintf(stderr, "Rune too big %x\n", r);
				putchar(c1);
				putchar(c);
			}
		} else if ((c & 0xf0) == 0xe0) {
			/* three-char rune */
			r = (c & 0xf) << 12;
			c1 = c;
			if ((c = getc(f)) == EOF) {
				fprintf(stderr, "EOF in rune\n");
				putchar(c1);
				break;
			}
			if ((c & 0xc0) != 0x80) {
				fprintf(stderr, "Bad rune %x, %x\n", r, c);
				putchar(c1);
				putchar(c);
				continue;
			}
			c2 = c;
			r = r | ((c & 0x3f) << 6);
			if ((c = getc(f)) == EOF) {
				fprintf(stderr, "EOF in rune\n");
				putchar(c1);
				putchar(c2);
				break;
			}
			if ((c & 0xc0) != 0x80) {
				fprintf(stderr, "Bad rune %x, %x\n", r, c);
				putchar(c1);
				putchar(c2);
				putchar(c);
				continue;
			}
			r = r | (c & 0x3f);
			if (r < 0x100) putchar(r);
			else {
				fprintf(stderr, "Rune too big %x\n", r);
				putchar(c1);
				putchar(c2);
				putchar(c);
			}
		} else {
			/* a continuation byte or unsupported lead byte on its own */
			fprintf(stderr, "Bad byte %x\n", c);
			putchar(c);
		}
	}
	fflush(stdout);
	return 0;
}

[-- Attachment #3: latin2utf.c --]
[-- Type: text/plain, Size: 555 bytes --]

#include <stdio.h>

FILE *f;

int main(int argc, char ** argv)
{	int c;

	if (argc > 2) {
		fprintf(stderr, "Usage: %s [ file ]\n", argv[0]);
		return 1;
	}
	if (argc == 2) {
		f = fopen(argv[1],"r");
		if (f == NULL) {
			fprintf(stderr, "Can't open %s\n", argv[1]);
			return 1;
		}
	} else {
		f = stdin;
	}
	while ((c = getc(f)) != EOF) {
		if (c < 0x80) putchar(c);
		else {
			/* print a two-char rune */
			putchar(0xc0 | (c >> 6));
			putchar(0x80 | (c & 0x3f));
		}
	}
	fflush(stdout);
	return 0;
}


* Re: [9fans] conversion of charsets in upas/fs
  2005-02-01 20:41             ` Sape Mullender
@ 2005-02-01 20:45               ` boyd, rounin
  0 siblings, 0 replies; 18+ messages in thread
From: boyd, rounin @ 2005-02-01 20:45 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> This might help to change your umlauts into real ones :-)

if yer into troff this might help too:

     http://www.insultant.net/repo/plan9/ralph.c
--
MGRS 31U DQ 52572 12604





* Re: [9fans] conversion of charsets in upas/fs
  2005-02-01 18:37             ` rog
  2005-02-01 19:50               ` boyd, rounin
@ 2005-02-02 10:59               ` Heiko Dudzus
  2005-02-02 13:48                 ` Russ Cox
  1 sibling, 1 reply; 18+ messages in thread
From: Heiko Dudzus @ 2005-02-02 10:59 UTC (permalink / raw)
  To: 9fans

> alternatively, a very quick solution might be to get pipeto.lib
> to take an extra verbatim copy of the message before running
> upas/fs on it.

Ah, good idea. I will try it.

I have a question, though, about the sed pipe here: In the usual case,
where one doesn't use pipeto.lib, upas/fs has to deal with mails that
have 'From '-lines in the body.

There is no sed '/^$/,$ s/^From / From /' applied to mails before they
get read by upas/fs.

So, does pipeto.lib really have to do it better?  How about dropping
the sed pipe from pipeto.lib altogether?

BTW: Thanks also to the others who posted code.

Heiko




* Re: [9fans] conversion of charsets in upas/fs
  2005-02-02 10:59               ` Heiko Dudzus
@ 2005-02-02 13:48                 ` Russ Cox
  2005-02-02 23:41                   ` Heiko Dudzus
  0 siblings, 1 reply; 18+ messages in thread
From: Russ Cox @ 2005-02-02 13:48 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> I have a question, though, about the sed pipe here: In the usual case,
> where one doesn't use pipeto.lib, upas/fs has to deal with mails that
> have 'From '-lines in the body.
>
> There is no sed '/^$/,$ s/^From / From /' applied to mails before they
> get read by upas/fs.

That's not true.  Upas/deliver and the other programs that write to
the mail boxes do this transformation.  If you have lines beginning with
"From ", then upas/fs will think there are multiple messages.

Russ



* Re: [9fans] conversion of charsets in upas/fs
  2005-02-02 13:48                 ` Russ Cox
@ 2005-02-02 23:41                   ` Heiko Dudzus
  0 siblings, 0 replies; 18+ messages in thread
From: Heiko Dudzus @ 2005-02-02 23:41 UTC (permalink / raw)
  To: russcox, 9fans

> That's not true.  Upas/deliver and the other programs that write to
> the mail boxes do this transformation.

Ok.

I finally made up my mind to use Steve Simon's upas/padfrom.

Thanks to all who helped or corrected me.
Heiko


