* [9fans] conversion of charsets in upas/fs @ 2005-01-29 13:57 Heiko Dudzus 2005-01-31 1:21 ` Kenji Okamoto 0 siblings, 1 reply; 18+ messages in thread From: Heiko Dudzus @ 2005-01-29 13:57 UTC (permalink / raw) To: 9fans upas/fs seems to convert mails with iso-8859-1 (and some other charsets) to UTF-8 automatically. (I looked into the source but didn't understand how it actually works, as I couldn't find the definition of every function.) This conversion works fine with mails that are encoded quoted-printable. It doesn't work here with mails that are 8bit encoded (at least with 8859-1). Isn't that also possible with 8bit encoded mails? Regards, Heiko ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [9fans] conversion of charsets in upas/fs 2005-01-29 13:57 [9fans] conversion of charsets in upas/fs Heiko Dudzus @ 2005-01-31 1:21 ` Kenji Okamoto 2005-01-31 14:05 ` Heiko Dudzus 0 siblings, 1 reply; 18+ messages in thread From: Kenji Okamoto @ 2005-01-31 1:21 UTC (permalink / raw) To: 9fans > Isn't that also possible with 8bit encoded mails? What about its header line? If the header doesn't have an appropriate line supported by Upas, it may fail. Kenji ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [9fans] conversion of charsets in upas/fs 2005-01-31 1:21 ` Kenji Okamoto @ 2005-01-31 14:05 ` Heiko Dudzus 2005-01-31 18:40 ` Russ Cox 2005-02-01 1:42 ` Kenji Okamoto 0 siblings, 2 replies; 18+ messages in thread From: Heiko Dudzus @ 2005-01-31 14:05 UTC (permalink / raw) To: 9fans

Hello Kenji,

thanks for your answer,

>> Isn't that also possible with 8bit encoded mails?
>
> What about its header line?
> If the header doesn't have an appropriate line supported by
> Upas, it may fail.

I will quote an example:

| Content-Type: text/plain; charset=ISO-8859-15; format=flowed
| Content-Transfer-Encoding: 8bit

in the header doesn't lead to properly displayed 'umlauts', but

| Content-Type: text/plain; charset=ISO-8859-15; format=flowed
| Content-Transfer-Encoding: quoted-printable

does lead to properly displayed 'umlauts'.

^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [9fans] conversion of charsets in upas/fs 2005-01-31 14:05 ` Heiko Dudzus @ 2005-01-31 18:40 ` Russ Cox 2005-01-31 19:19 ` boyd, rounin 2005-02-01 8:24 ` Heiko Dudzus 2005-02-01 1:42 ` Kenji Okamoto 1 sibling, 2 replies; 18+ messages in thread From: Russ Cox @ 2005-01-31 18:40 UTC (permalink / raw) To: Fans of the OS Plan 9 from Bell Labs

> | Content-Type: text/plain; charset=ISO-8859-15; format=flowed
> | Content-Transfer-Encoding: 8bit
>
> in the header doesn't lead to properly displayed 'umlauts', but
>
> | Content-Type: text/plain; charset=ISO-8859-15; format=flowed
> | Content-Transfer-Encoding: quoted-printable
>
> does lead to properly displayed 'umlauts'.

I don't believe Plan 9 is causing the problem. I just manually sent myself an 8-bit 8859-15 message on Plan 9 with the script below and it came through just fine, at least using "mail" to read (didn't try acme Mail but they should be the same).

Probably some mail server before Plan 9 is screwing with the 8-bit characters.

Russ

#!/usr/local/plan9/bin/rc
{
	echo 'HELO rsc'
	sleep 2
	echo 'MAIL FROM: <rsc@swtch.com>'
	sleep 2
	echo 'RCPT TO: <glenda@plan9.bell-labs.com>'
	sleep 2
	echo 'DATA'
	sleep 2
	echo 'Content-Type: text/plain; charset=ISO-8859-15; format=flowed'
	echo 'Content-Transfer-Encoding: 8bit'
	echo
	echo Hello world.
	echo Here is an umlaut: ü | tcs -t 8859-15
	echo
	echo .
	sleep 2
	echo QUIT
	sleep 2
} | dial -e tcp!204.178.31.2!25

^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [9fans] conversion of charsets in upas/fs 2005-01-31 18:40 ` Russ Cox @ 2005-01-31 19:19 ` boyd, rounin 2005-02-01 8:24 ` Heiko Dudzus 1 sibling, 0 replies; 18+ messages in thread From: boyd, rounin @ 2005-01-31 19:19 UTC (permalink / raw) To: Fans of the OS Plan 9 from Bell Labs > Probably some mail server before Plan 9 is screwing with the 8-bit characters. yeah i've seen this before. some ESMTP implementations screw up and some blendmail mailers need the undocumented mailer flag '9'. -- MGRS 31U DQ 52572 12604 ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [9fans] conversion of charsets in upas/fs 2005-01-31 18:40 ` Russ Cox 2005-01-31 19:19 ` boyd, rounin @ 2005-02-01 8:24 ` Heiko Dudzus 2005-02-01 10:29 ` Heiko Dudzus 1 sibling, 1 reply; 18+ messages in thread From: Heiko Dudzus @ 2005-02-01 8:24 UTC (permalink / raw) To: russcox, 9fans > I don't believe Plan 9 is causing the problem. I just > manually sent myself an 8-bit 8859-15 message on > Plan 9 with the script below and it came through just fine, > at least using "mail" to read (didn't try acme Mail but they > should be the same). > > Probably some mail server before Plan 9 is screwing > with the 8-bit characters. The problem remained when I sent this test mail to the local smtp server. But I found out that all is fine when I move away my pipeto file. It seems as if /mail/lib/pipeto.lib introduces the problem somewhere. I hope to find it. Heiko ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [9fans] conversion of charsets in upas/fs 2005-02-01 8:24 ` Heiko Dudzus @ 2005-02-01 10:29 ` Heiko Dudzus 2005-02-01 17:11 ` Heiko Dudzus 2005-02-01 18:26 ` Russ Cox 0 siblings, 2 replies; 18+ messages in thread From: Heiko Dudzus @ 2005-02-01 10:29 UTC (permalink / raw) To: 9fans

> The problem remained when I sent this test mail to the local smtp
> server But I found out, that all is fine when I move away my pipeto
> file.
>
> It seems as if /mail/lib/pipeto.lib introduces the problem somewhere.
> I hope to find it.

Ok, I took a mail, made by Russ' smtp dialogue script, and did manually what pipeto.lib does with every mail.

% cd /mail/fs/mbox/124
% cat rawunix | sed '/^$/,$ s/^From / From /' > /tmp/msg

This file is already screwed. I compared it to the original rawunix file with xd:

term% diff <{xd -c rawunix} <{xd -c /tmp/msg}
21,23c21,23
< 0000140 a n u m l a u t : fc \n 04 04
< 0000150 \n
< 0000151
---
> 0000140 a n u m l a u t : c2 80 \n 04
> 0000150 04 \n
> 0000152
term%

0xfc represents 'ü' in iso-8859-15 but sed replaces it by 0xc2 and 0x80. Why? It should only hide bogus 'From ' lines in the mail body. Is sed allowed to replace 0xfc by something different here?

Heiko

^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [9fans] conversion of charsets in upas/fs 2005-02-01 10:29 ` Heiko Dudzus @ 2005-02-01 17:11 ` Heiko Dudzus 2005-02-01 20:41 ` Sape Mullender 2005-02-01 18:26 ` Russ Cox 1 sibling, 1 reply; 18+ messages in thread From: Heiko Dudzus @ 2005-02-01 17:11 UTC (permalink / raw) To: 9fans > 0xfc represents 'ü' in iso-8859-15 but sed replaces it by 0xc2 and > 0x80. Why? It should only hide bogus 'From ' lines in the mail body. > Is sed allowed to replace 0xfc by something different here? I don't feel good replying to my own mails so often. But someone mailed me off-list and made clear that sed doesn't have to be able to read non-utf input. I understand that point now. Is it then desirable for pipeto.lib to deal with the situation? I am trying to modify it to fit my needs. If pipeto.lib should deal with incoming 8bit ISO-* mails, I will put it as a patch on sources when I have finished. Heiko ^ permalink raw reply [flat|nested] 18+ messages in thread
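The byte change in the xd dump above can be reproduced with a small decoder. The sketch below is not the Plan 9 library code. It assumes a UTF decoder limited to three-byte sequences and an error rune of 0x80 (an assumption inferred from the c2 80 in the dump), and the names ERRRUNE, decode and encode are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { ERRRUNE = 0x80 };	/* assumption: inferred from the c2 80 in the dump */

/* Decode one rune of at most three bytes from s[0..n); return bytes consumed. */
size_t
decode(const unsigned char *s, size_t n, unsigned *r)
{
	assert(s != NULL && r != NULL && n > 0);
	if (s[0] < 0x80) {			/* plain ASCII */
		*r = s[0];
		return 1;
	}
	if ((s[0] & 0xe0) == 0xc0 && n >= 2 && (s[1] & 0xc0) == 0x80) {
		*r = ((unsigned)(s[0] & 0x1f) << 6) | (s[1] & 0x3f);
		return 2;
	}
	if ((s[0] & 0xf0) == 0xe0 && n >= 3
	    && (s[1] & 0xc0) == 0x80 && (s[2] & 0xc0) == 0x80) {
		*r = ((unsigned)(s[0] & 0x0f) << 12)
		    | ((unsigned)(s[1] & 0x3f) << 6) | (s[2] & 0x3f);
		return 3;
	}
	*r = ERRRUNE;	/* 0xfc ends up here: not a valid lead byte in this scheme */
	return 1;
}

/* Encode rune r (below 0x10000) as UTF-8 into out; return bytes written. */
size_t
encode(unsigned r, unsigned char *out)
{
	assert(out != NULL);
	if (r < 0x80) {
		out[0] = r;
		return 1;
	}
	if (r < 0x800) {
		out[0] = 0xc0 | (r >> 6);
		out[1] = 0x80 | (r & 0x3f);
		return 2;
	}
	out[0] = 0xe0 | (r >> 12);
	out[1] = 0x80 | ((r >> 6) & 0x3f);
	out[2] = 0x80 | (r & 0x3f);
	return 3;
}
```

Under these assumptions, a tool that decodes its input to runes and writes them back as UTF turns the lone byte 0xfc into the two bytes c2 80, which matches the extra byte in the second dump (0000152 instead of 0000151).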
* Re: [9fans] conversion of charsets in upas/fs 2005-02-01 17:11 ` Heiko Dudzus @ 2005-02-01 20:41 ` Sape Mullender 2005-02-01 20:45 ` boyd, rounin 0 siblings, 1 reply; 18+ messages in thread From: Sape Mullender @ 2005-02-01 20:41 UTC (permalink / raw) To: 9fans

[-- Attachment #1: Type: text/plain, Size: 67 bytes --]

This might help to change your umlauts into real ones :-)

Sape

[-- Attachment #2: utf2latin.c --]
[-- Type: text/plain, Size: 1858 bytes --]

#include <stdio.h>

FILE *f;

/* Convert UTF-8 input to Latin-1; sequences that do not fit are copied through. */
int
main(int argc, char **argv)
{
	int c, c1, c2;
	unsigned int r;

	if (argc > 2) {
		fprintf(stderr, "Usage: %s [ file ]\n", argv[0]);
		return 1;
	}
	if (argc == 2) {
		f = fopen(argv[1], "r");
		if (f == NULL) {
			fprintf(stderr, "Can't open %s\n", argv[1]);
			return 1;
		}
	} else {
		f = stdin;
	}
	while ((c = getc(f)) != EOF) {
		if (c < 0x80)
			putchar(c);
		else if ((c & 0xe0) == 0xc0) {
			/* two-char rune */
			c1 = c;
			r = (c & 0x1f) << 6;
			if ((c = getc(f)) == EOF) {
				fprintf(stderr, "EOF in rune\n");
				putchar(c1);
				break;
			}
			if ((c & 0xc0) != 0x80) {
				fprintf(stderr, "Bad rune %x, %x\n", r, c);
				putchar(c1);
				putchar(c);
				continue;
			}
			r = r | (c & 0x3f);
			if (r < 0x100)
				putchar(r);
			else {
				fprintf(stderr, "Rune too big %x\n", r);
				putchar(c1);
				putchar(c);
			}
		} else if ((c & 0xf0) == 0xe0) {
			/* three-char rune */
			r = (c & 0xf) << 12;
			c1 = c;
			if ((c = getc(f)) == EOF) {
				fprintf(stderr, "EOF in rune\n");
				putchar(c1);
				break;
			}
			if ((c & 0xc0) != 0x80) {
				fprintf(stderr, "Bad rune %x, %x\n", r, c);
				putchar(c1);
				putchar(c);
				continue;
			}
			c2 = c;
			r = r | ((c & 0x3f) << 6);
			if ((c = getc(f)) == EOF) {
				fprintf(stderr, "EOF in rune\n");
				putchar(c1);
				putchar(c2);
				break;
			}
			if ((c & 0xc0) != 0x80) {
				fprintf(stderr, "Bad rune %x, %x\n", r, c);
				putchar(c1);
				putchar(c2);
				putchar(c);
				continue;
			}
			r = r | (c & 0x3f);
			if (r < 0x100)
				putchar(r);
			else {
				fprintf(stderr, "Rune too big %x\n", r);
				putchar(c1);
				putchar(c2);
				putchar(c);
			}
		} else {
			/* not a valid lead byte; r holds nothing useful here */
			fprintf(stderr, "Bad rune %x\n", c);
			putchar(c);
		}
	}
	fflush(stdout);
	return 0;
}

[-- Attachment #3: latin2utf.c --]
[-- Type: text/plain, Size: 555 bytes --]

#include <stdio.h>

FILE *f;

/* Convert Latin-1 input to UTF-8: bytes of 0x80 and above become two-byte runes. */
int
main(int argc, char **argv)
{
	int c;

	if (argc > 2) {
		fprintf(stderr, "Usage: %s [ file ]\n", argv[0]);
		return 1;
	}
	if (argc == 2) {
		f = fopen(argv[1], "r");
		if (f == NULL) {
			fprintf(stderr, "Can't open %s\n", argv[1]);
			return 1;
		}
	} else {
		f = stdin;
	}
	while ((c = getc(f)) != EOF) {
		if (c < 0x80)
			putchar(c);
		else {
			/* print a two-char rune */
			putchar(0xc0 | (c >> 6));
			putchar(0x80 | (c & 0x3f));
		}
	}
	fflush(stdout);
	return 0;
}

^ permalink raw reply [flat|nested] 18+ messages in thread
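For the 'ü' discussed in this thread (0xfc), the two putchar expressions in latin2utf.c work out to the bytes c3 bc. A minimal sketch of the same arithmetic as a function follows; latin1toutf is a hypothetical name and not part of the attachment. It assumes ISO-8859-1 input. ISO-8859-15 differs from it in eight code points (0xa4, the euro sign, is one of them), so this mapping would be wrong for those bytes, although 0xfc is 'ü' in both.

```c
#include <assert.h>
#include <stddef.h>

/* Encode one Latin-1 byte as UTF-8 into out; return the number of bytes written. */
size_t
latin1toutf(unsigned char c, unsigned char out[2])
{
	assert(out != NULL);
	if (c < 0x80) {			/* ASCII passes through unchanged */
		out[0] = c;
		return 1;
	}
	out[0] = 0xc0 | (c >> 6);	/* 0xfc >> 6 is 3, so 0xc3 */
	out[1] = 0x80 | (c & 0x3f);	/* 0xfc & 0x3f is 0x3c, so 0xbc */
	return 2;
}
```

The result for 0xfc is therefore c3 bc, not the c2 80 that sed produced in the earlier dump.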
* Re: [9fans] conversion of charsets in upas/fs 2005-02-01 20:41 ` Sape Mullender @ 2005-02-01 20:45 ` boyd, rounin 0 siblings, 0 replies; 18+ messages in thread From: boyd, rounin @ 2005-02-01 20:45 UTC (permalink / raw) To: Fans of the OS Plan 9 from Bell Labs > This might help to change your umlauts into real ones :-) if yer into troff this might help too: http://www.insultant.net/repo/plan9/ralph.c -- MGRS 31U DQ 52572 12604 ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [9fans] conversion of charsets in upas/fs 2005-02-01 10:29 ` Heiko Dudzus 2005-02-01 17:11 ` Heiko Dudzus @ 2005-02-01 18:26 ` Russ Cox 2005-02-01 18:37 ` rog 1 sibling, 1 reply; 18+ messages in thread From: Russ Cox @ 2005-02-01 18:26 UTC (permalink / raw) To: Fans of the OS Plan 9 from Bell Labs

Sed operates on UTF, so if you give it non-UTF (aka garbage) it replaces bad UTF sequences with error runes, which is what you are seeing.

Probably the best thing to do is write a program that applies the transformations you want but works a byte at a time and is character-set-ignorant.

Russ

On Tue, 1 Feb 2005 11:29:32 +0100, Heiko Dudzus <heiko.dudzus@gmx.de> wrote:
> > The problem remained when I sent this test mail to the local smtp
> > server But I found out, that all is fine when I move away my pipeto
> > file.
> >
> > It seems as if /mail/lib/pipeto.lib introduces the problem somewhere.
> > I hope to find it.
>
> Ok, i took a mail, made by Russ' smtp dialogue script, and did
> manually what pipeto.lib does with every mail.
>
> % cd /mail/fs/mbox/124
> % cat rawunix | sed '/^$/,$ s/^From / From /' > /tmp/msg
>
> This file is already screwed. I compared to the original rawunix file
> with xd:
>
> term% diff <{xd -c rawunix} <{xd -c /tmp/msg}
> 21,23c21,23
> < 0000140 a n u m l a u t : fc \n 04 04
> < 0000150 \n
> < 0000151
> ---
> > 0000140 a n u m l a u t : c2 80 \n 04
> > 0000150 04 \n
> > 0000152
> term%
>
> 0xfc represents 'ü' in iso-8859-15 but sed replaces it by 0xc2 and
> 0x80. Why? It should only hide bogus 'From ' lines in the mail body.
> Is sed allowed to replace 0xfc by something different here?
>
> Heiko
>
>

^ permalink raw reply [flat|nested] 18+ messages in thread
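A byte-at-a-time version of the sed command from pipeto.lib could look roughly like the following. This is only a sketch of the idea: escapefrom is a hypothetical name, the function works on a buffer already read into memory, and it is not the code of any existing upas program. Like the sed expression, it leaves everything before the first empty line alone and puts one space in front of every later line that starts with "From ". It never interprets bytes as characters, so 0xfc passes through unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy in[0..n) to out, inserting a space before each body line that
 * starts with "From ". The body begins at the first empty line.
 * out needs room for n + n/5 bytes. Returns the number of bytes written.
 */
size_t
escapefrom(const unsigned char *in, size_t n, unsigned char *out)
{
	size_t i, o = 0;
	int inbody = 0;		/* set once the first empty line has been seen */
	int bol = 1;		/* at the beginning of a line */

	assert(in != NULL && out != NULL);
	for (i = 0; i < n; i++) {
		if (bol) {
			if (!inbody && in[i] == '\n')
				inbody = 1;
			else if (inbody && n - i >= 5
			    && memcmp(in + i, "From ", 5) == 0)
				out[o++] = ' ';
		}
		out[o++] = in[i];	/* every input byte is copied verbatim */
		bol = (in[i] == '\n');
	}
	return o;
}
```

A driver would read the whole message from standard input into a buffer, call the function once and write the result with fwrite. A streaming variant is possible too, but it has to hold back up to five bytes at the start of each line before deciding whether to emit the space.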
* Re: [9fans] conversion of charsets in upas/fs 2005-02-01 18:26 ` Russ Cox @ 2005-02-01 18:37 ` rog 2005-02-01 19:50 ` boyd, rounin 2005-02-02 10:59 ` Heiko Dudzus 0 siblings, 2 replies; 18+ messages in thread From: rog @ 2005-02-01 18:37 UTC (permalink / raw) To: 9fans

> Probably the best thing to do is write a program that applies
> the transformations you want but works a byte at a time and
> is character-set-ignorant.

alternatively, a very quick solution might be to get pipeto.lib to take an extra verbatim copy of the message before running upas/fs on it.

e.g.

	# save and parse the mail file
	cat > $TMP.rawmsg
	sed '/^$/,$ s/^From / From /' < $TMP.rawmsg >$TMP.msg
	upas/fs -p -f $TMP.msg || exit $status

and in fn spool:

	$BIN/deliver $RECIP $D/from $_mbox < $TMP.rawmsg || exit $status

this does mean that more space is used, and that the spam classification isn't strictly accurate for non-utf charsets, but it does avoid having to write any code...

N.B. i haven't tried this out!

^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [9fans] conversion of charsets in upas/fs 2005-02-01 18:37 ` rog @ 2005-02-01 19:50 ` boyd, rounin 2005-02-02 10:59 ` Heiko Dudzus 1 sibling, 0 replies; 18+ messages in thread From: boyd, rounin @ 2005-02-01 19:50 UTC (permalink / raw) To: Fans of the OS Plan 9 from Bell Labs this might help to debug rune errors (well that's what it was written for): http://www.insultant.net/code/plan9/vr.c -- MGRS 31U DQ 52572 12604 ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [9fans] conversion of charsets in upas/fs 2005-02-01 18:37 ` rog 2005-02-01 19:50 ` boyd, rounin @ 2005-02-02 10:59 ` Heiko Dudzus 2005-02-02 13:48 ` Russ Cox 1 sibling, 1 reply; 18+ messages in thread From: Heiko Dudzus @ 2005-02-02 10:59 UTC (permalink / raw) To: 9fans > alternatively, a very quick solution might be to get pipeto.lib > to take an extra verbatim copy of the message before running > upas/fs on it. Ah, good idea. I will try it. I have a question, though, about the sed pipe here: In the usual case, where one doesn't use pipeto.lib, upas/fs has to deal with mails that have 'From '-lines in the body. There is no sed '/^$/,$ s/^From / From /' applied to mails before they get read by upas/fs. So, does pipeto.lib really have to do it better? How about dropping the sed pipe from pipeto.lib altogether? BTW: Thanks also to the others who posted code. Heiko ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [9fans] conversion of charsets in upas/fs 2005-02-02 10:59 ` Heiko Dudzus @ 2005-02-02 13:48 ` Russ Cox 2005-02-02 23:41 ` Heiko Dudzus 0 siblings, 1 reply; 18+ messages in thread From: Russ Cox @ 2005-02-02 13:48 UTC (permalink / raw) To: Fans of the OS Plan 9 from Bell Labs > I have a question, though, about the sed pipe here: In the usual case, > where one doesn't use pipeto.lib, upas/fs has to deal with mails that > have 'From '-lines in the body. > > There is no sed '/^$/,$ s/^From / From /' applied to mails before they > get read by upas/fs. That's not true. Upas/deliver and the other programs that write to the mail boxes do this transformation. If you have lines beginning with "From ", then upas/fs will think there are multiple messages. Russ ^ permalink raw reply [flat|nested] 18+ messages in thread
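Why an unescaped "From " line matters can be shown with a deliberately naive mailbox reader. The sketch below is not how upas/fs is implemented; countmsgs is a hypothetical function that simply treats every line starting with "From " as the start of a new message, which is the behaviour Russ describes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count the messages a naive mbox reader would see in buf[0..n). */
size_t
countmsgs(const char *buf, size_t n)
{
	size_t i, count = 0;
	int bol = 1;		/* at the beginning of a line */

	assert(buf != NULL);
	for (i = 0; i < n; i++) {
		if (bol && n - i >= 5 && memcmp(buf + i, "From ", 5) == 0)
			count++;
		bol = (buf[i] == '\n');
	}
	return count;
}
```

With a body line reading "From here on it breaks", such a reader counts two messages where only one was delivered. With the leading space added by the transformation, it counts one.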
* Re: [9fans] conversion of charsets in upas/fs 2005-02-02 13:48 ` Russ Cox @ 2005-02-02 23:41 ` Heiko Dudzus 0 siblings, 0 replies; 18+ messages in thread From: Heiko Dudzus @ 2005-02-02 23:41 UTC (permalink / raw) To: russcox, 9fans > That's not true. Upas/deliver and the other programs that write to > the mail boxes do this tranformation. Ok. I finally made up my mind to use Steve Simons' upas/padfrom. Thanks to all who helped or corrected me. Heiko ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [9fans] conversion of charsets in upas/fs 2005-01-31 14:05 ` Heiko Dudzus 2005-01-31 18:40 ` Russ Cox @ 2005-02-01 1:42 ` Kenji Okamoto 2005-02-01 3:42 ` Russ Cox 1 sibling, 1 reply; 18+ messages in thread From: Kenji Okamoto @ 2005-02-01 1:42 UTC (permalink / raw) To: 9fans > | Content-Type: text/plain; charset=ISO-8859-15; format=flowed > | Content-Transfer-Encoding: 8bit I think this is not supported in Upas. You have to add its support by yourself. Kenji ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [9fans] conversion of charsets in upas/fs 2005-02-01 1:42 ` Kenji Okamoto @ 2005-02-01 3:42 ` Russ Cox 0 siblings, 0 replies; 18+ messages in thread From: Russ Cox @ 2005-02-01 3:42 UTC (permalink / raw) To: Fans of the OS Plan 9 from Bell Labs > > | Content-Type: text/plain; charset=ISO-8859-15; format=flowed > > | Content-Transfer-Encoding: 8bit > > I think this is not supported in Upas. > You have to add its support by yourself. No. It is supported, and it works well. Russ ^ permalink raw reply [flat|nested] 18+ messages in thread