* [9fans] tcs bug
@ 2005-08-31 6:07 arisawa
2005-08-31 9:11 ` arisawa
0 siblings, 1 reply; 10+ messages in thread
From: arisawa @ 2005-08-31 6:07 UTC
To: Fans of the OS Plan 9 from Bell Labs
Sorry, I should have sent the previous mail in UTF-8.
The following is the same as the previous one except for the character encoding.
Hello,
tcs, both for Plan 9 and for Unix, has a bug in reading UTF text.
It comes from this loop in utf.c:
utf_in(int fd, long *notused, struct convert *out){
	char buf[N];
	...
	while((n = read(fd, buf+tot, N-tot)) >= 0){
		...
	}
N is defined as 10000 in hdr.h.
If you set N to 10, the problem shows up more clearly:
tcs does not correctly handle a UTF character that straddles a read boundary.
For example, assume a.txt has the content:
aaaaaaaこの
term% xd -c a.txt
0000000 a a a a a a a e3 81 93 e3 81 ae \n
000000e
tcs can handle this text because with N=10 the first read ends exactly on a UTF character boundary (7 + 3 bytes),
but tcs fails if there are 6 or 8 'a's ...
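A minimal sketch of the failure, in portable C rather than Plan 9 C: when a file is read in fixed-size chunks, a chunk ends inside a character exactly when the first byte of the next chunk is a UTF-8 continuation byte (10bbbbbb). The name chunk_splits_char is hypothetical and not part of tcs; the chunk argument plays the role of N.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not part of tcs: report whether reading data
 * in pieces of `chunk` bytes would cut a UTF-8 character in two.
 * A cut is bad when the byte right after it is a continuation byte.
 */
int
chunk_splits_char(const unsigned char *data, size_t len, size_t chunk)
{
	size_t off;

	if(chunk == 0)
		return 0;
	for(off = chunk; off < len; off += chunk)
		if((data[off] & 0xc0) == 0x80)
			return 1;
	return 0;
}
```

With the bytes of a.txt shown above and a chunk size of 10, this reports no split for seven 'a's and a split for six or eight.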
tcs is very important for me.
Who maintains tcs ?
I might help debugging.
Kenji Arisawa
* Re: [9fans] tcs bug
2005-08-31 6:07 [9fans] tcs bug arisawa
@ 2005-08-31 9:11 ` arisawa
2005-08-31 9:17 ` Rob Pike
0 siblings, 1 reply; 10+ messages in thread
From: arisawa @ 2005-08-31 9:11 UTC
To: Fans of the OS Plan 9 from Bell Labs
Below is a first-aid bug fix.
We define a read function for UTF-8:
/* read until utf boundary */
int
readu(int fd, char *buf, int n)
{
	static char b[3];
	static int nb;
	int m;
	char *s, *e;

	/* put back the partial character saved by the previous call */
	if(nb)
		memcpy(buf, b, nb);
	m = read(fd, buf + nb, n - nb);
	if(m < 0)
		return m;
	/*
	01. x in [00000000.0bbbbbbb] → 0bbbbbbb
	10. x in [00000bbb.bbbbbbbb] → 110bbbbb, 10bbbbbb
	11. x in [bbbbbbbb.bbbbbbbb] → 1110bbbb, 10bbbbbb,10bbbbbb
	*/
	e = buf + m + nb;
	for(s = buf; s < e; s++){
		if((*s & 0x80) == 0)
			continue;
		if((*s & 0xe0) == 0xc0){
			/* *s is 110bbbbb: needs one more byte */
			if(s+1 >= e)
				break;
			s++;
			continue;
		}
		/* then *s is 111bbbbb */
		if(s+2 >= e)
			break;
		s += 2;
		continue;
	}
	/* we have e - s bytes in s */
	nb = e - s;
	memcpy(b, s, nb);
	return s - buf;
}
and replace 'read' by 'readu' in utf.c
utf_in(int fd, long *notused, struct convert *out)
{
	...
	while((n = readu(fd, buf+tot, N-tot)) >= 0){
		...
	}
Kenji Arisawa
* Re: [9fans] tcs bug
2005-08-31 9:11 ` arisawa
@ 2005-08-31 9:17 ` Rob Pike
2005-08-31 10:48 ` arisawa
0 siblings, 1 reply; 10+ messages in thread
From: Rob Pike @ 2005-08-31 9:17 UTC
To: Fans of the OS Plan 9 from Bell Labs
one problem with this fix is that it assumes valid utf-8 input.
you're better off using fullrune.
-rob
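For readers without Plan 9's libc at hand: fullrune(s, n) reports whether the n bytes at s hold a complete rune, so the caller can decide whether to decode or wait for more input. The following is a rough portable stand-in, not the libc implementation. The name full_utf8 is hypothetical, and it also accepts four-byte sequences, which goes beyond the three-byte forms discussed in this thread. Malformed input counts as "complete" so that the decoder can reject it instead of waiting forever.

```c
#include <stddef.h>

/*
 * Hypothetical stand-in for fullrune(): returns 1 when the first n
 * bytes of s are enough for a decoder to make progress, 0 when more
 * input is needed to finish the sequence started by s[0].
 */
int
full_utf8(const char *s, size_t n)
{
	unsigned char c;
	size_t need, i;

	if(n == 0)
		return 0;
	c = (unsigned char)s[0];
	if(c < 0x80)
		need = 1;
	else if((c & 0xe0) == 0xc0)
		need = 2;
	else if((c & 0xf0) == 0xe0)
		need = 3;
	else if((c & 0xf8) == 0xf0)
		need = 4;
	else
		need = 1;	/* stray continuation or invalid lead byte */
	/* a non-continuation byte inside the sequence is already an error */
	for(i = 1; i < need && i < n; i++)
		if(((unsigned char)s[i] & 0xc0) != 0x80)
			return 1;
	return n >= need;
}
```

A read loop would decode while full_utf8 returns 1 and carry the remaining bytes over to the next read otherwise.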
* Re: [9fans] tcs bug
2005-08-31 9:17 ` Rob Pike
@ 2005-08-31 10:48 ` arisawa
2005-08-31 11:22 ` arisawa
0 siblings, 1 reply; 10+ messages in thread
From: arisawa @ 2005-08-31 10:48 UTC
To: Fans of the OS Plan 9 from Bell Labs
> one problem with this fix is that it assumes valid utf-8 input.
> you're better off using fullrune.
>
a simpler and more robust solution
that follows forsyth's suggestion:
/* read until utf boundary */
int
readu(int fd, char *buf, int n)
{
	static char b[3];
	static int nb;
	int m;
	char *s, *e;

	if(nb)
		memcpy(buf, b, nb);
	m = read(fd, buf + nb, n - nb);
	/*
	01. x in [00000000.0bbbbbbb] → 0bbbbbbb
	10. x in [00000bbb.bbbbbbbb] → 110bbbbb, 10bbbbbb
	11. x in [bbbbbbbb.bbbbbbbb] → 1110bbbb, 10bbbbbb, 10bbbbbb
	*/
	e = buf + m + nb;
	for(s = e - 2; s < e; s++){
		if((*s & 0xc0) == 0x80)
			continue;
		if((*s & 0xc0) == 0xc0)
			break;
	}
	/* we have e - s bytes in s */
	nb = e - s;
	memcpy(b, s, nb);
	return s - buf;
}
Kenji Arisawa
* Re: [9fans] tcs bug
2005-08-31 10:48 ` arisawa
@ 2005-08-31 11:22 ` arisawa
0 siblings, 0 replies; 10+ messages in thread
From: arisawa @ 2005-08-31 11:22 UTC
To: Fans of the OS Plan 9 from Bell Labs
> 	for(s = e - 2; s < e; s++){
> 		if((*s & 0xc0) == 0x80)
> 			continue;
> 		if((*s & 0xc0) == 0xc0)
> 			break;
> 	}
>
this is redundant;
replace it by
	for(s = e - 2; s < e; s++)
		if((*s & 0xc0) == 0xc0)
			break;
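One way to pick the cut point, sketched in portable C under assumptions that go beyond this thread: look back over at most three trailing bytes for a lead byte, and hold the tail back only when its sequence is really incomplete. The loop above breaks on any lead byte found in the last two bytes, so a complete two-byte character at the end of the buffer is also held back until the next read. The names seqlen and utf8_prefix are hypothetical, four-byte sequences are handled although the code above considers at most three bytes, and the input is still not validated.

```c
#include <stddef.h>

/* Length a sequence should have, judged by its first byte alone. */
static int
seqlen(unsigned char c)
{
	if(c < 0x80)
		return 1;
	if((c & 0xe0) == 0xc0)
		return 2;
	if((c & 0xf0) == 0xe0)
		return 3;
	if((c & 0xf8) == 0xf0)
		return 4;
	return 1;	/* stray continuation or invalid lead: one byte */
}

/*
 * Hypothetical helper: return how many leading bytes of buf[0..n)
 * can be processed now; the remaining n - result bytes are the start
 * of a character whose rest has not been read yet.
 */
size_t
utf8_prefix(const unsigned char *buf, size_t n)
{
	size_t i, back;

	for(back = 1; back <= 3 && back <= n; back++){
		i = n - back;
		if((buf[i] & 0xc0) == 0x80)
			continue;	/* continuation byte: keep looking back */
		if((size_t)seqlen(buf[i]) > back)
			return i;	/* sequence is cut off: hold it back */
		break;		/* last character is complete */
	}
	return n;
}
```

For the 10-byte reads from the first message this yields 9 when there are six 'a's, 10 when there are seven, and 8 when there are eight.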
Kenji Arisawa
* [9fans] tcs bug
@ 2005-09-01 0:36 quanstro
0 siblings, 0 replies; 10+ messages in thread
From: quanstro @ 2005-09-01 0:36 UTC
To: 9fans
well, somebody's got to do it. ;-)
i guess i didn't think of using bio, having never had access before
p9p.
thanks, russ.
erik
* Re: [9fans] tcs bug.
2005-08-31 10:51 quanstro
@ 2005-08-31 21:36 ` Russ Cox
0 siblings, 0 replies; 10+ messages in thread
From: Russ Cox @ 2005-08-31 21:36 UTC
To: Fans of the OS Plan 9 from Bell Labs
You've invented buffered I/O.
#include <u.h>
#include <libc.h>
#include <bio.h>

void
usage(void)
{
	fprint(2, "usage: runecvt [-l | -t | -u] [file...]\n");
	exits("usage");
}

void
convert(Biobuf *bin, Biobuf *bout, Rune (*fn)(Rune))
{
	int c;

	while((c = Bgetrune(bin)) != -1)
		Bputrune(bout, fn(c));
}

void
main(int argc, char **argv)
{
	int i;
	Biobuf *b, bin, bout;
	Rune (*fn)(Rune);

	fn = toupperrune;
	ARGBEGIN{
	case 'l':
		fn = tolowerrune;
		break;
	case 't':
		fn = totitlerune;
		break;
	case 'u':
		fn = toupperrune;
		break;
	default:
		usage();
	}ARGEND

	Binit(&bout, 1, OWRITE);
	if(argc == 0){
		Binit(&bin, 0, OREAD);
		convert(&bin, &bout, fn);
	}else{
		for(i=0; i<argc; i++){
			if((b = Bopen(argv[i], OREAD)) == nil)
				sysfatal("open %s: %r", argv[i]);
			convert(b, &bout, fn);
		}
	}
	Bterm(&bout);
	exits(nil);
}
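The same idea can be sketched with standard C stdio for readers outside Plan 9: a buffered stream hands out one byte at a time, so the decoder never sees a read boundary and the partial rune problem does not arise. This is an illustration only, not a port of bio. The name get_utf8 and the BADRUNE value are assumptions, and overlong forms and surrogates are not rejected.

```c
#include <stdio.h>

#define BADRUNE 0xFFFDL	/* returned for malformed input (assumed convention) */

/*
 * Hypothetical helper: read one UTF-8 encoded code point from a
 * buffered stream. Returns -1 at end of input.
 */
long
get_utf8(FILE *f)
{
	int c, need, i;
	long r;

	c = getc(f);
	if(c == EOF)
		return -1;
	if(c < 0x80)
		return c;
	if((c & 0xe0) == 0xc0){
		need = 1;
		r = c & 0x1f;
	}else if((c & 0xf0) == 0xe0){
		need = 2;
		r = c & 0x0f;
	}else if((c & 0xf8) == 0xf0){
		need = 3;
		r = c & 0x07;
	}else
		return BADRUNE;	/* stray continuation or invalid lead byte */
	for(i = 0; i < need; i++){
		c = getc(f);
		if(c == EOF)
			return BADRUNE;	/* input ended inside a sequence */
		if((c & 0xc0) != 0x80){
			ungetc(c, f);	/* let the next call see this byte */
			return BADRUNE;
		}
		r = (r << 6) | (c & 0x3f);
	}
	return r;
}
```

A caller loops until get_utf8 returns -1, much as convert() above loops on Bgetrune.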
* [9fans] tcs bug.
@ 2005-08-31 10:51 quanstro
2005-08-31 21:36 ` Russ Cox
0 siblings, 1 reply; 10+ messages in thread
From: quanstro @ 2005-08-31 10:51 UTC
To: 9fans
i just had a similar problem a day or two ago.
i needed to change some capitalization and the
tr 'A-Z' 'a-z' idiom doesn't work on random utf.
i solved it a bit differently -- lifting the fullrune()
check into the main loop. so i don't have a readu()
function. also (unlike tcs) at the cost of 1 extra check
at the end-of-input, the output buffer is dumped only
when full. on japanese, greek or other text with
>1 byte/char, this will save calls to OUT() --
or in my case print().
okay, total overkill. i know. but it was more interesting
to do that way.
here's upper.c. convert to upper/lower/title case:
#include <u.h>
#include <libc.h>

enum { BLOCK = 1024*4 };

typedef Rune (*Rconv)(Rune);

void output(Rune* r, int nrunes, Rconv R){
	int i;

	for(i=0; i<nrunes; i++){
		r[i] = R(r[i]);
	}
	print("%.*S", nrunes, r);
}

const char* casify(int fd, Rconv R){
	char in[BLOCK + UTFmax];
	Rune r[BLOCK + UTFmax];
	long rem_len;
	long blen;
	long j;
	long i;

	rem_len=0;
	j = 0;
again:	while (0 < (blen = read(fd, in + rem_len, BLOCK))){
		blen += rem_len;
		rem_len = 0;	/* carried-over bytes are now part of this block */
		for(i=0; i<blen; ){
			if (!fullrune(in + i, blen - i)){
				/* partial rune at end of block: carry it into the next read */
				rem_len = blen - i;
				memmove(in, in + i, rem_len);
				goto again;
			}
			i += chartorune(r + j++, in + i);
			if (j > BLOCK){
				output(r, j, R);
				j=0;
			}
		}
	}
	if (rem_len){
		// non unicode garbage.
		fprint(2, "non-utf8 garbage %.*s at eof\n", (int)rem_len, in);
	}
	if (j){
		output(r, j, R);
	}
	/* the loop ends with blen == 0 at eof and blen < 0 on a read error */
	if (blen < 0){
		return "read";
	}
	return 0;
}

void main(int argc, /* pfft const */ char** argv){
	Rconv R;
	const char* v;
	const char* status;
	const char* s;
	int fd;

	v = strrchr(argv[0], '/');
	if (v){
		v++;
	} else {
		v = argv[0];
	}
	if (0 == strcmp(v, "tolower")){
		R = tolowerrune;
	} else if (0 == strcmp(v, "totitle")){
		R = totitlerune;
	} else {
		R = toupperrune;
	}
	ARGBEGIN{
	case 'u':
		R = toupperrune;
		break;
	case 'l':
		R = tolowerrune;
		break;
	case 't':
		R = totitlerune;
		break;
	default:
		fprint(2, "%s: bad option %c\n", argv0, ARGC());
		fprint(2, "usage: %s -[ult]\n", argv0);
		exits("usage");
	} ARGEND

	status = 0;	/* first error seen, if any */
	if (!*argv){
		status = casify(0, R);
	} else {
		for(; *argv; argv++){
			fd = open(*argv, OREAD);
			if (-1 == fd){
				if (!status){
					status = "open";
				}
				continue;
			}
			s = casify(fd, R);
			if (s && !status){
				status = s;
			}
			close(fd);
		}
	}
	exits(status ? status : "");
}
* Re: [9fans] tcs bug
2005-08-31 5:54 ` [9fans] tcs bug arisawa
@ 2005-08-31 5:57 ` Rob Pike
0 siblings, 0 replies; 10+ messages in thread
From: Rob Pike @ 2005-08-31 5:57 UTC
To: Fans of the OS Plan 9 from Bell Labs
ah yes, the dreaded partial rune problem. lots of programs
must cope with this issue.
-rob
* [9fans] tcs bug
2005-08-30 17:46 ` Russ Cox
@ 2005-08-31 5:54 ` arisawa
2005-08-31 5:57 ` Rob Pike
0 siblings, 1 reply; 10+ messages in thread
From: arisawa @ 2005-08-31 5:54 UTC
To: Fans of the OS Plan 9 from Bell Labs
Hello,
tcs, both for Plan 9 and for Unix, has a bug in reading UTF text.
It comes from this loop in utf.c:
utf_in(int fd, long *notused, struct convert *out){
	char buf[N];
	...
	while((n = read(fd, buf+tot, N-tot)) >= 0){
		...
	}
N is defined as 10000 in hdr.h.
If you set N to 10, the problem shows up more clearly:
tcs does not correctly handle a UTF character that straddles a read boundary.
For example, assume a.txt has the content:
aaaaaaaこの
term% xd -c a.txt
0000000 a a a a a a a e3 81 93 e3 81 ae \n
000000e
tcs can handle this text because with N=10 the first read ends exactly on a UTF character boundary (7 + 3 bytes),
but tcs fails if there are 6 or 8 'a's ...
tcs is very important for me.
Who maintains tcs ?
I might help debugging.
Kenji Arisawa
Thread overview: 10+ messages
2005-08-31 6:07 [9fans] tcs bug arisawa
2005-08-31 9:11 ` arisawa
2005-08-31 9:17 ` Rob Pike
2005-08-31 10:48 ` arisawa
2005-08-31 11:22 ` arisawa
-- strict thread matches above, loose matches on Subject: below --
2005-09-01 0:36 quanstro
2005-08-31 10:51 quanstro
2005-08-31 21:36 ` Russ Cox
2005-08-29 23:23 [9fans] some Plan9 related ideas Bhanu Nagendra Pisupati
2005-08-30 17:07 ` [9fans] " Dave Eckhardt
2005-08-30 17:33 ` Francisco Ballesteros
2005-08-30 17:46 ` Russ Cox
2005-08-31 5:54 ` [9fans] tcs bug arisawa
2005-08-31 5:57 ` Rob Pike