From mboxrd@z Thu Jan 1 00:00:00 1970
From: erik quanstrom
Date: Sun, 30 Mar 2014 02:10:27 -0400
To: 9fans@9fans.net
Message-ID:
In-Reply-To: <13277e55555fc0e249f7ce04f144a19f@felloff.net>
References: <13277e55555fc0e249f7ce04f144a19f@felloff.net>
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
Subject: Re: [9fans] a strange bug in grep
Topicbox-Message-UUID: d2be6b5e-ead8-11e9-9d60-3106f5b1d025

On Sat Mar 29 21:46:33 EDT 2014, cinap_lenrek@felloff.net wrote:
> very good.
>
> one question about:
>
> -	x = re2or(x, rclass(ov, Runemask));
> +	x = re2or(x, rclass(ov, 0xffff));
>
> this seems wrong for 21 bit runes (the old is also wrong i think).
>
> shouldnt that be:
>
> +	x = re2or(x, rclass(ov, Runemax));
>
> as Runemask (0x1fffff) is not a valid rune for 21-bit rune
> as it is >Runemax.

yes, that's correct.  i left it at 0xffff because there was still a bug.
tab2 still needs to burst the leading bytes so we enum all the cases.
i think tab2 should be

	Rune	tab2[] =
	{
		0x003f,
		0x0fff,
		0x07ffff,
	};

since the first byte of the 21-bit rune is 0b11110xxx.

what do you think?

> as i understand it, tab1[] array contains the last valid rune
> in a range of the same utf8 encoding length.
>
> basically:
>
> 0-0x7f -> 1 byte, 0x80-0x7ff -> 2 byte etc...
>
> so adding 0xffff is right. the next would be 0x10ffff for 21 bit
> runes but there shouldnt be any runes above 0x10ffff.
>
> makes any sense?

since the tab1 array is bursting at byte boundaries, the next
burst is at 0x1fffff.  but that's in undefined territory.

- erik
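
The boundaries discussed above (0x7f, 0x7ff, 0xffff, then 0x1fffff versus 0x10ffff) follow from how many payload bits each utf-8 encoding length carries. A minimal sketch in portable C, not taken from grep's source: the helpers maxrune() and lastrune() are hypothetical names introduced only for illustration, and Runemax is assumed to be 0x10ffff as stated in the thread.

```c
#include <assert.h>

/* assumed value, as stated in the thread; not copied from any header */
enum { Runemax = 0x10FFFF };

/*
 * hypothetical helper: largest value that fits in an n-byte
 * utf-8 sequence, n = 1..4.
 *	1 byte:  0xxxxxxx                       7 bits
 *	n bytes: lead byte carries 7-n bits,
 *	         each of the n-1 trailing bytes carries 6
 */
static unsigned long
maxrune(int n)
{
	int bits;

	assert(n >= 1 && n <= 4);
	if(n == 1)
		bits = 7;
	else
		bits = (7 - n) + 6*(n - 1);	/* 11, 16, 21 */
	return (1UL << bits) - 1;
}

/*
 * hypothetical helper: last valid rune of a given encoding
 * length, i.e. the bit capacity clamped to Runemax.
 */
static unsigned long
lastrune(int n)
{
	unsigned long m;

	m = maxrune(n);
	return m > Runemax ? Runemax : m;
}
```

For n = 4 the bit capacity is 0x1fffff (the byte-boundary burst) while the last valid rune is 0x10ffff, which is the gap the two messages are talking about.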