9front - general discussion about 9front
 help / color / mirror / Atom feed
* [9front] [PATCH] Fix rune size in `acid`
@ 2023-04-23 13:53 grimmware
  2023-05-03 17:54 ` ori
  0 siblings, 1 reply; 4+ messages in thread
From: grimmware @ 2023-04-23 13:53 UTC (permalink / raw)
  To: 9front

Hello!

The other day I ended up writing my own function in acid for iterating through
a rune string because the native '\R' format didn't appear to be working as
expected, and whilst looking through the documentation for unrelated
functionality, I spotted that they were still referred to as 2-byte rather than
4-byte. Turns out this is also still true in the code itself.

I've put together a patch and a some test case to demonstrate the bug and
verify the fix. Verifying the rune string ('\R' format string) was pretty easy,
whereas wrapping my brain around the correct way to do an acid dereference such
that `print` didn't just make up for acid's deficiencies by still treating the
reference as a pointer to a 4-byte value took me a little bit more time. This
is due to the concert of UTF-8 to UTF-32 conversion (most notably that a bunch
of 3 byte UTF-8 characters still only have nonzero data in the bottom 2 bytes
and due to being on a little-endian architecture these still render perfectly
well as individual 16 bit runes!), and the way acid handles variable
references. If you want to see what I'm talking about, try replacing
`r=(*main:r\r)` with `r=*(*main:r\r)` and returning `r` instead of `*r` in the
acid script and watch it still manage to produce the correct character despite
the bug!

Anyway, for my test case, I've used 𒁃 (U+12043) which won't render in the vga
font but is a Cuneiform character chosen at random
(https://www.compart.com/en/unicode/U+12043) for the virtue of having nonzero
bits in the upper 2 bytes. Apparently it means "potter".

``` test.c
#include <libc.h>

void
test(Rune *r)
{}

void
main()
{
	Rune *r;
	r = L"𒁃Άρχιμήδης";
	print("%x\n", (u32int) *r);
	print("%x\n", (u16int) *r);
	test(r);
	print("%S\n", r); /* Archimedes */
}
```

I've also got an acid script:

``` demo.acid
new();
bpset(test);
cont();
r=(*main:r\r);
*r;
rs=(*main:r\R);
*rs;
```

Before the patch:
```
cpu% 6c test.c; 6l test.6; acid 6.out
6.out:amd64 plan 9 executable
/sys/lib/acid/port
/sys/lib/acid/amd64
acid: include("demo.acid")
404005: system call	_main	SUBQ	$0x90,SP
404005: breakpoint	main+0x4	MOVL	$.string(SB),CX
12043
2043
404005: breakpoint	test	RET
⁃
𒁃Ά�\x01�
```

As you can see, the individual rune (`\r`) format is producing the `⁃`
character which is U+2043 (i.e. U=12043 truncated to the lower 2 bytes) and the
rune string (`\R`) format is garbage.

With the patch applied:

```
cpu% 6c test.c; 6l test.6; acid 6.out
6.out:amd64 plan 9 executable
/sys/lib/acid/port
/sys/lib/acid/amd64
acid: include("demo.acid")
404064: system call	_main	SUBQ	$0x90,SP
404064: breakpoint	main+0x4	MOVL	$.string(SB),CX
12043
2043
404064: breakpoint	test	RET
𒁃
𒁃Άρχιμήδης
```

The patch is as follows. Thanks to sigrid for pointing me toward the correct
bit of code, to ori for helping me verify my understanding and both for
encouraging me to submit a patch.

grimmware

--- a/sys/doc/acid.ms
+++ b/sys/doc/acid.ms
@@ -285,9 +285,9 @@
 Interpret the addressed bytes as UTF characters
 and print successive characters until a zero byte is reached.
 .IP \f(CWr\fP
-Print a two-byte integer as a rune.
+Print a four-byte integer as a rune.
 .IP \f(CWR\fP
-Print successive two-byte integers as runes
+Print successive four-byte integers as runes
 until a zero rune is reached.
 .IP \f(CWi\fP
 Print as machine instructions.
--- a/sys/src/cmd/acid/exec.c
+++ b/sys/src/cmd/acid/exec.c
@@ -244,6 +244,7 @@
 	uchar cval;
 	ushort sval;
 	char buf[512], reg[12];
+	int rsize;
 
 	r->op = OCONST;
 	r->fmt = fmt;
@@ -264,7 +265,6 @@
 	case 'u':
 	case 'o':
 	case 'q':
-	case 'r':
 		r->type = TINT;
 		ret = get2(m, addr, &sval);
 		if (ret < 0)
@@ -286,6 +286,7 @@
 	case 'U':
 	case 'O':
 	case 'Q':
+	case 'r':
 		r->type = TINT;
 		ret = get4(m, addr, &lval);
 		if (ret < 0)
@@ -317,12 +318,13 @@
 		r->string = strnode(buf);
 		break;
 	case 'R':
+		rsize = sizeof(Rune);
 		r->type = TSTRING;
-		for(i = 0; i < sizeof(buf)-2; i += 2) {
-			ret = get1(m, addr, (uchar*)&buf[i], 2);
+		for(i = 0; i < sizeof(buf)-rsize; i += rsize) {
+			ret = get1(m, addr, (uchar*)&buf[i], rsize);
 			if (ret < 0)
 				error("indir: %r");
-			addr += 2;
+			addr += rsize;
 			if(buf[i] == 0 && buf[i+1] == 0)
 				break;
 		}

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2023-05-03 19:16 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-04-23 13:53 [9front] [PATCH] Fix rune size in `acid` grimmware
2023-05-03 17:54 ` ori
2023-05-03 18:54   ` Matt Carroll
2023-05-03 19:05     ` Matt Carroll

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).