[9fans] tcs bug

9fans - fans of the OS Plan 9 from Bell Labs

* [9fans] tcs bug
@ 2005-08-31  6:07 arisawa
  2005-08-31  9:11 ` arisawa
  0 siblings, 1 reply; 10+ messages in thread
From: arisawa @ 2005-08-31  6:07 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Sorry, I should have sent the previous mail in utf-8.
The following is the same as the previous one except for the character encoding.

Hello,

tcs both for plan 9 and for unix has a bug in reading utf text.
that comes from:
utf_in(int fd, long *notused, struct convert *out){
     char buf[N];
     ...
     while((n = read(fd, buf+tot, N-tot)) >= 0){
         ...
}

in utf.c

N is defined as 10000 in hdr.h

if you set N to 10, you will see the problem more clearly:
tcs does not correctly handle a utf character that straddles a read boundary.

for example, assume a.txt has the content:
aaaaaaaこの

term% xd -c a.txt
0000000   a  a  a  a  a  a  a e3 81 93 e3 81 ae \n
000000e

tcs can handle this text because with N=10 the first read ends exactly on a utf boundary,
but tcs fails if the number of 'a' is 6 or 8 ...
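
The boundary condition above can be checked mechanically. The following is a minimal sketch, not code from tcs: splits_rune() is a hypothetical helper that reports whether a read boundary at a given byte offset lands inside a multibyte UTF-8 sequence, by testing whether the next chunk would start with a continuation byte (10bbbbbb).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of tcs: report whether a read
 * boundary at byte offset `off` would fall inside a multibyte
 * UTF-8 sequence, i.e. whether the next chunk would begin with
 * a continuation byte (10bbbbbb).
 */
int
splits_rune(const unsigned char *p, size_t len, size_t off)
{
	assert(p != NULL);
	if(off == 0 || off >= len)
		return 0;	/* nothing before or nothing after the boundary */
	return (p[off] & 0xc0) == 0x80;
}
```

With the bytes of a.txt and a boundary at offset 10, this reports a clean boundary for seven leading 'a' characters and a split sequence for six or eight.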

tcs is very important for me.
Who maintains tcs ?
I might help debugging.

Kenji Arisawa


* Re: [9fans] tcs bug
  2005-08-31  6:07 [9fans] tcs bug arisawa
@ 2005-08-31  9:11 ` arisawa
  2005-08-31  9:17   ` Rob Pike
  0 siblings, 1 reply; 10+ messages in thread
From: arisawa @ 2005-08-31  9:11 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Below is a first-aid bug fix.

we define a read function for utf-8

/* read until utf boundary */
int
readu(int fd, char *buf, int n)
{
     static char b[3];
     static int nb;
     int m;
     char *s, *e;
     if(nb)
         memcpy(buf, b, nb);
     m = read(fd, buf + nb, n - nb);
     if(m < 0)
         return m;    /* read error: pass it on to the caller */

     /*
     01.   x in [00000000.0bbbbbbb] → 0bbbbbbb
     10.   x in [00000bbb.bbbbbbbb] → 110bbbbb, 10bbbbbb
     11.   x in [bbbbbbbb.bbbbbbbb] → 1110bbbb, 10bbbbbb,10bbbbbb
     */

     e = buf + m + nb;
     for(s = buf; s < e; s++){
         if((*s & 0x80) == 0)
             continue;
         if((*s & 0xe0) == 0xc0){
             /* *s is 110bbbbb: one more byte must follow */
             if(s+1 >= e)
                 break;
             s++;
             continue;
         }
         /* then *s is 111bbbbb */
         if(s+2 >= e)
             break;
         s += 2;
         continue;
     }
     /* we have e - s bytes in s    */
     nb = e - s;
     memcpy(b, s, nb);
     return s - buf;
}

and replace 'read' by 'readu' in utf.c

utf_in(int fd, long *notused, struct convert *out)
{

     ...
     while((n = readu(fd, buf+tot, N-tot)) >= 0){
         ...
}

Kenji Arisawa




* Re: [9fans] tcs bug
  2005-08-31  9:11 ` arisawa
@ 2005-08-31  9:17   ` Rob Pike
  2005-08-31 10:48     ` arisawa
  0 siblings, 1 reply; 10+ messages in thread
From: Rob Pike @ 2005-08-31  9:17 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

one problem with this fix is that it assumes valid utf-8 input.
you're better off using fullrune.

-rob
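
A minimal sketch of the fullrune idea in portable C; this is not code from tcs or from this thread. Plan 9's libc provides fullrune(); here a hypothetical stand-in, full_seq(), plays a similar role so that the example compiles outside Plan 9, and complete_prefix() (also a hypothetical name) returns how many leading bytes of a buffer form whole sequences. As an assumption of this sketch, sequences of up to four bytes are accepted, and a malformed byte counts as complete so that it is passed on to be rejected instead of being held back forever.

```c
#include <assert.h>
#include <stddef.h>

/* Length implied by a lead byte; a stray or invalid byte counts as 1. */
static size_t
seqlen(unsigned char c)
{
	if(c < 0x80)
		return 1;
	if((c & 0xe0) == 0xc0)
		return 2;
	if((c & 0xf0) == 0xe0)
		return 3;
	if((c & 0xf8) == 0xf0)
		return 4;
	return 1;
}

/*
 * Hypothetical stand-in for fullrune(): 1 if s[0..n) starts with a
 * complete sequence (or with malformed bytes that can be rejected
 * right away), 0 if more input is needed to decide.
 */
int
full_seq(const unsigned char *s, size_t n)
{
	size_t need, i;

	assert(s != NULL);
	if(n == 0)
		return 0;
	need = seqlen(s[0]);
	for(i = 1; i < need; i++){
		if(i >= n)
			return 0;	/* ran out of bytes */
		if((s[i] & 0xc0) != 0x80)
			return 1;	/* malformed, but decidable now */
	}
	return 1;
}

/* Number of leading bytes of buf[0..n) that form whole sequences. */
size_t
complete_prefix(const unsigned char *buf, size_t n)
{
	size_t i, k, need;

	i = 0;
	while(i < n && full_seq(buf + i, n - i)){
		need = seqlen(buf[i]);
		/* step over the lead byte and the continuation bytes present */
		for(k = 1; k < need && (buf[i + k] & 0xc0) == 0x80; k++)
			;
		i += k;
	}
	return i;
}
```

A reader would convert the first complete_prefix(buf, n) bytes and move the remaining n - complete_prefix(buf, n) bytes to the front of the buffer before the next read.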




* Re: [9fans] tcs bug
  2005-08-31  9:17   ` Rob Pike
@ 2005-08-31 10:48     ` arisawa
  2005-08-31 11:22       ` arisawa
  0 siblings, 1 reply; 10+ messages in thread
From: arisawa @ 2005-08-31 10:48 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs


> one problem with this fix is that it assumes valid utf-8 input.
> you're better off using fullrune.
>

a simpler and more robust solution,
following forsyth's suggestion


/* read until utf boundary */
int
readu(int fd, char *buf, int n)
{
         static char b[3];
         static int nb;
         int m;
         char *s, *e;
         if(nb)
                 memcpy(buf, b, nb);
         m = read(fd, buf + nb, n - nb);

         /*
         01.   x in [00000000.0bbbbbbb] → 0bbbbbbb
         10.   x in [00000bbb.bbbbbbbb] → 110bbbbb, 10bbbbbb
         11.   x in [bbbbbbbb.bbbbbbbb] → 1110bbbb, 10bbbbbb, 10bbbbbb
         */

         e = buf + m + nb;
         for(s = e - 2; s < e; s++){
                 if((*s & 0xc0) == 0x80)
                         continue;
                 if((*s & 0xc0) == 0xc0)
                         break;
         }

         /* we have e - s bytes in s     */
         nb = e - s;
         memcpy(b, s, nb);
         return s - buf;
}

Kenji Arisawa




* Re: [9fans] tcs bug
  2005-08-31 10:48     ` arisawa
@ 2005-08-31 11:22       ` arisawa
  0 siblings, 0 replies; 10+ messages in thread
From: arisawa @ 2005-08-31 11:22 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs


>         for(s = e - 2; s < e; s++){
>                 if((*s & 0xc0) == 0x80)
>                         continue;
>                 if((*s & 0xc0) == 0xc0)
>                         break;
>         }
>

this is redundant;
replace it with

         for(s = e - 2; s < e; s++)
                 if((*s & 0xc0) == 0xc0)
                         break;
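
A sketch of the same backward scan with the edge cases spelled out; this is not the code posted above. hold_back() is a hypothetical name. As assumptions of this sketch, sequences of up to four bytes are allowed, the scan never looks before the start of the buffer, and a sequence that is already complete at the end of the buffer is not held back, so input ending in a two-byte character loses nothing at end of file.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: return how many bytes at the end of buf[0..n)
 * belong to an unfinished UTF-8 sequence and should be carried over
 * to the next read. Returns 0 when the buffer ends on a boundary.
 */
size_t
hold_back(const unsigned char *buf, size_t n)
{
	size_t back, need;
	unsigned char c;

	assert(buf != NULL);
	/* an unfinished sequence is at most 3 bytes long */
	for(back = 1; back <= 3 && back <= n; back++){
		c = buf[n - back];
		if((c & 0xc0) == 0x80)
			continue;	/* continuation byte: keep looking for the lead */
		if((c & 0xe0) == 0xc0)
			need = 2;
		else if((c & 0xf0) == 0xe0)
			need = 3;
		else if((c & 0xf8) == 0xf0)
			need = 4;
		else
			return 0;	/* ASCII or invalid lead: nothing to hold back */
		return need > back ? back : 0;
	}
	/* only continuation bytes seen: a complete 4-byte sequence or malformed input */
	return 0;
}
```

The caller would process n - hold_back(buf, n) bytes and copy the held-back tail to the front of the buffer before reading again.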

Kenji Arisawa





* [9fans] tcs bug
@ 2005-09-01  0:36 quanstro
  0 siblings, 0 replies; 10+ messages in thread
From: quanstro @ 2005-09-01  0:36 UTC (permalink / raw)
  To: 9fans

well, somebody's got to do it. ;-)

i guess i didn't think of using bio, having never had access to it
before p9p.

thanks, russ.

erik


* Re: [9fans] tcs bug.
  2005-08-31 10:51 quanstro
@ 2005-08-31 21:36 ` Russ Cox
  0 siblings, 0 replies; 10+ messages in thread
From: Russ Cox @ 2005-08-31 21:36 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

You've invented buffered I/O.

#include <u.h>
#include <libc.h>
#include <bio.h>

void
usage(void)
{
	fprint(2, "usage: runecvt [-l | -t | -u] [file...]\n");
	exits("usage");
}

void
convert(Biobuf *bin, Biobuf *bout, Rune (*fn)(Rune))
{
	int c;
	
	while((c = Bgetrune(bin)) != -1)
		Bputrune(bout, fn(c));
}

void
main(int argc, char **argv)
{
	int i;
	Biobuf *b, bin, bout;
	Rune (*fn)(Rune);
	
	fn = toupperrune;
	ARGBEGIN{
	case 'l':
		fn = tolowerrune;
		break;
	case 't':
		fn = totitlerune;
		break;
	case 'u':
		fn = toupperrune;
		break;
	default:
		usage();
	}ARGEND
	
	Binit(&bout, 1, OWRITE);
	if(argc == 0){
		Binit(&bin, 0, OREAD);
		convert(&bin, &bout, fn);
	}else{
		for(i=0; i<argc; i++){
			if((b = Bopen(argv[i], OREAD)) == nil)
				sysfatal("open %s: %r", argv[i]);
			convert(b, &bout, fn);
			Bterm(b);	/* close the file opened by Bopen */
		}
	}
	Bterm(&bout);
	exits(nil);
}



* [9fans] tcs bug.
@ 2005-08-31 10:51 quanstro
  2005-08-31 21:36 ` Russ Cox
  0 siblings, 1 reply; 10+ messages in thread
From: quanstro @ 2005-08-31 10:51 UTC (permalink / raw)
  To: 9fans

i just had a similar problem a day or two ago.

i needed to change some capitalization and the 
tr 'A-Z' 'a-z' idiom doesn't work on random utf.

i solved it a bit differently -- lifting the fullrune()
check into the main loop. so i don't have a readu() 
function. also (unlike tcs) at the cost of 1 extra check 
at the end-of-input, the output buffer is dumped only 
when full. on japanese, greek or other text with 
>1 byte/char, this will save calls to OUT() --
or in my case print().

okay, total overkill. i know. but it was more interesting
to do that way. 

here's upper.c. convert to upper/lower/title case:



#include <u.h>
#include <libc.h>

enum { BLOCK = 1024*4 };

typedef Rune (*Rconv)(Rune);

void output(Rune* r, int nrunes, Rconv R){
	int i;

	for(i=0; i<nrunes; i++){
		r[i] = R(r[i]);
	}
	print("%.*S", nrunes, r);
}

const char* casify(int fd, Rconv R){
	char in[BLOCK + UTFmax];
	Rune r[BLOCK + UTFmax];
	long rem_len;
	long blen;
	long j;
	long i;

	rem_len=0;
	j = 0;
again:	while (0 < (blen = read(fd, in + rem_len, BLOCK))){
		blen += rem_len;
		rem_len = 0;	/* the carried-over bytes are now part of this block */

		for(i=0; i<blen; ){
			if (!fullrune(in + i, blen - i)){
				rem_len = blen - i;
				memcpy(in, in + i, rem_len);
				goto again;
			}
			i += chartorune(r + j++, in + i);
			if (j > BLOCK){
				output(r, j, R);
				j=0;
			}
		}
	}

	if (rem_len){
		// non unicode garbage.
		fprint(2, "non-utf8 garbage %.*s at eof\n", rem_len, in);
	}

	if (j){
		output(r, j, R);
	}

	if (blen == 0){	/* clean end of input; blen < 0 is a read error */
		return 0;
	}
	return "read";
}

void main(int argc, /* pfft const */ char** argv){
	Rconv R;
	const char* v;
	const char* status;
	const char* s;
	int fd;

	v = strrchr(argv[0], '/');
	if (v){
		v++;
	} else {
		v = argv[0];
	}
	
	if (0 == strcmp(v, "tolower")){
		R = tolowerrune;
	} else if (0 == strcmp(v, "totitle")){
		R = totitlerune;
	} else {
		R = toupperrune;
	}

	ARGBEGIN{
	case 'u':
		R = toupperrune;
		break;
	case 'l':
		R = tolowerrune;
		break;
	case 't':
		R = totitlerune;
		break;
	default:
		fprint(2, "%s: bad option %c\n", argv0, ARGC());
		fprint(2, "usage: %s -[ult]\n", argv0);
		exits("usage");
	} ARGEND

	if (!*argv){
		status = casify(0, R);
	} else {
		for(status = 0; *argv; argv++){
			fd = open(*argv, OREAD);
			if (-1 == fd){
				if (!status){
					status = "open";
				}
				continue;
			}
			s = casify(fd, R);
			if (s && !status){
				status = s;
			}
			close(fd);
		}
	}

	exits(status ? status : "");
}






* Re: [9fans] tcs bug
  2005-08-31  5:54       ` [9fans] tcs bug arisawa
@ 2005-08-31  5:57         ` Rob Pike
  0 siblings, 0 replies; 10+ messages in thread
From: Rob Pike @ 2005-08-31  5:57 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

ah yes, the dreaded partial rune problem. lots of programs
must cope with this issue.

-rob




* [9fans] tcs bug
  2005-08-30 17:46     ` Russ Cox
@ 2005-08-31  5:54       ` arisawa
  2005-08-31  5:57         ` Rob Pike
  0 siblings, 1 reply; 10+ messages in thread
From: arisawa @ 2005-08-31  5:54 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Hello,

tcs both for plan 9 and for unix has a bug in reading utf text.
that comes from:
utf_in(int fd, long *notused, struct convert *out){
     char buf[N];
     ...
     while((n = read(fd, buf+tot, N-tot)) >= 0){
         ...
}

in utf.c

N is defined as 10000 in hdr.h

if you set N to 10, you will see the problem more clearly:
tcs does not correctly handle a utf character that straddles a read boundary.

for example, assume a.txt has the content:
aaaaaaaこの

term% xd -c a.txt
0000000   a  a  a  a  a  a  a e3 81 93 e3 81 ae \n
000000e

tcs can handle this text because with N=10 the first read ends exactly on a utf boundary,
but tcs fails if the number of 'a' is 6 or 8 ...

tcs is very important for me.
Who maintains tcs ?
I might help debugging.

Kenji Arisawa



