From: "grimmware"
To: <9front@9front.org>
Date: Sun, 23 Apr 2023 14:53:21 +0100
Subject: [9front] [PATCH] Fix rune size in `acid`
Reply-To: 9front@9front.org

Hello!

The other day I ended up writing my own function in acid for iterating through
a rune string because the native '\R' format didn't appear to be working as
expected. Whilst looking through the documentation for unrelated
functionality, I spotted that runes were still described as 2-byte rather than
4-byte. Turns out this is also still true in the code itself.

I've put together a patch and a test case to demonstrate the bug and verify
the fix.
Verifying the rune string ('\R' format string) was pretty e= asy, whereas wrapping my brain around the correct way to do an acid dereference = such that `print` didn't just make up for acid's deficiencies by still treating = the reference as a pointer to a 4-byte value took me a little bit more time. Th= is is due to the concert of UTF-8 to UTF-32 conversion (most notably that a bu= nch of 3 byte UTF-8 characters still only have nonzero data in the bottom 2 byt= es and due to being on a little-endian architecture these still render perfect= ly well as individual 16 bit runes!), and the way acid handles variable references. If you want to see what I'm talking about, try replacing `r=3D(*main:r\r)` with `r=3D*(*main:r\r)` and returning `r` instead of `*r`= in the acid script and watch it still manage to produce the correct character desp= ite the bug! Anyway, for my test case, I've used =F0=92=81=83 (U+12043) which won't rend= er in the vga font but is a Cuneiform character chosen at random (https://www.compart.com/en/unicode/U+12043) for the virtue of having nonze= ro bits in the upper 2 bytes. Apparently it means "potter". ``` test.c #include void test(Rune *r) {} void main() { Rune *r; r =3D L"=F0=92=81=83=CE=86=CF=81=CF=87=CE=B9=CE=BC=CE=AE=CE=B4=CE=B7=CF=82= "; print("%x\n", (u32int) *r); print("%x\n", (u16int) *r); test(r); print("%S\n", r); /* Archimedes */ } ``` I've also got an acid script: ``` demo.acid new(); bpset(test); cont(); r=3D(*main:r\r); *r; rs=3D(*main:r\R); *rs; ``` Before the patch: ``` cpu% 6c test.c; 6l test.6; acid 6.out 6.out:amd64 plan 9 executable /sys/lib/acid/port /sys/lib/acid/amd64 acid: include("demo.acid") 404005: system call _main SUBQ $0x90,SP 404005: breakpoint main+0x4 MOVL $.string(SB),CX 12043 2043 404005: breakpoint test RET =E2=81=83 =F0=92=81=83=CE=86=EF=BF=BD=01=EF=BF=BD ``` As you can see, the individual rune (`\r`) format is producing the `=E2=81= =83` character which is U+2043 (i.e. 
U+12043 truncated to the lower 2 bytes), and the rune string (`\R`) format is
garbage.

With the patch applied:

```
cpu% 6c test.c; 6l test.6; acid 6.out
6.out:amd64 plan 9 executable
/sys/lib/acid/port
/sys/lib/acid/amd64
acid: include("demo.acid")
404064: system call _main SUBQ $0x90,SP
404064: breakpoint main+0x4 MOVL $.string(SB),CX
12043
2043
404064: breakpoint test RET
𒁃
𒁃Άρχιμήδης
```

The patch is as follows. Thanks to sigrid for pointing me toward the correct
bit of code, to ori for helping me verify my understanding, and to both for
encouraging me to submit a patch.

grimmware

--- a/sys/doc/acid.ms
+++ b/sys/doc/acid.ms
@@ -285,9 +285,9 @@
 Interpret the addressed bytes as UTF characters
 and print successive characters until a zero byte is reached.
 .IP \f(CWr\fP
-Print a two-byte integer as a rune.
+Print a four-byte integer as a rune.
 .IP \f(CWR\fP
-Print successive two-byte integers as runes
+Print successive four-byte integers as runes
 until a zero rune is reached.
 .IP \f(CWi\fP
 Print as machine instructions.
--- a/sys/src/cmd/acid/exec.c
+++ b/sys/src/cmd/acid/exec.c
@@ -244,6 +244,7 @@
 	uchar cval;
 	ushort sval;
 	char buf[512], reg[12];
+	int rsize;
 
 	r->op = OCONST;
 	r->fmt = fmt;
@@ -264,7 +265,6 @@
 	case 'u':
 	case 'o':
 	case 'q':
-	case 'r':
 		r->type = TINT;
 		ret = get2(m, addr, &sval);
 		if (ret < 0)
@@ -286,6 +286,7 @@
 	case 'U':
 	case 'O':
 	case 'Q':
+	case 'r':
 		r->type = TINT;
 		ret = get4(m, addr, &lval);
 		if (ret < 0)
@@ -317,12 +318,13 @@
 		r->string = strnode(buf);
 		break;
 	case 'R':
+		rsize = sizeof(Rune);
 		r->type = TSTRING;
-		for(i = 0; i < sizeof(buf)-2; i += 2) {
-			ret = get1(m, addr, (uchar*)&buf[i], 2);
+		for(i = 0; i < sizeof(buf)-rsize; i += rsize) {
+			ret = get1(m, addr, (uchar*)&buf[i], rsize);
 			if (ret < 0)
 				error("indir: %r");
-			addr += 2;
+			addr += rsize;
 			if(buf[i] == 0 && buf[i+1] == 0)
 				break;
 		}
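For anyone without a Plan 9 system to hand, the truncation and the rune-string
walk can be reproduced in portable C. This is only a minimal sketch under
stated assumptions, not acid's code: `readrune` and `readrunes` are
hypothetical helpers, runes are assumed to be stored little-endian (as on
amd64), and `uint32_t` stands in for Plan 9's 4-byte `Rune`. One deliberate
difference from the loop in the patch: the patch appears to test only `buf[i]`
and `buf[i+1]` for the terminator, whereas the sketch compares the whole rune
against zero, so a rune whose low 16 bits are zero (U+10000, say) is not
mistaken for the end of the string.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Read one little-endian code point of `width` bytes from mem.
 * width is 2 (the old, truncating behaviour) or 4 (a full rune).
 * Hypothetical helper for illustration; not part of acid.
 */
static uint32_t
readrune(const unsigned char *mem, size_t width)
{
	uint32_t r = 0;
	size_t i;

	assert(width == 2 || width == 4);
	for(i = 0; i < width; i++)
		r |= (uint32_t)mem[i] << (8*i);
	return r;
}

/*
 * Copy code points of `width` bytes from mem (len bytes long) into out,
 * stopping at the first all-zero code point, at the end of mem, or after
 * max entries. Returns the number of code points stored.
 */
static size_t
readrunes(const unsigned char *mem, size_t len, size_t width,
	uint32_t *out, size_t max)
{
	size_t n = 0;
	uint32_t r;

	while(n < max && (n+1)*width <= len){
		r = readrune(mem + n*width, width);
		if(r == 0)	/* terminator: every byte of the rune is zero */
			break;
		out[n++] = r;
	}
	return n;
}
```

Given the bytes `43 20 01 00` (U+12043 as stored in memory), `readrune(mem, 2)`
yields 0x2043 and `readrune(mem, 4)` yields 0x12043, matching the two `print`
lines in the test program's output.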