zsh-workers
 help / color / Atom feed
* [BUG] zformat -f has no multibyte support
@ 2019-12-23 22:24 zsugabubus
  2019-12-24 20:23 ` Daniel Shahaf
  0 siblings, 1 reply; 5+ messages in thread
From: zsugabubus @ 2019-12-23 22:24 UTC (permalink / raw)
  To: zsh-workers

Hi,

  $ setopt multibyte
  $ zformat -f X '%-3s' 's:ő'; echo $X
  "ő"
  $ zformat -f X '%.1s' 's:ő'; echo $X
  (garbage)
  $ zformat -f X '%-3s' 's:o'; echo $X
  "  o"

--
zsugabubus

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [BUG] zformat -f has no multibyte support
  2019-12-23 22:24 [BUG] zformat -f has no multibyte support zsugabubus
@ 2019-12-24 20:23 ` Daniel Shahaf
  2019-12-26  4:35   ` dana
  0 siblings, 1 reply; 5+ messages in thread
From: Daniel Shahaf @ 2019-12-24 20:23 UTC (permalink / raw)
  To: zsh-workers

[-- Attachment #1: Type: text/plain, Size: 510 bytes --]

zsugabubus wrote on Mon, Dec 23, 2019 at 23:24:36 +0100:
>   $ setopt multibyte
>   $ zformat -f X '%-3s' 's:ő'; echo $X
>   "ő"
>   $ zformat -f X '%.1s' 's:ő'; echo $X
>   (garbage)
>   $ zformat -f X '%-3s' 's:o'; echo $X
>   "  o"

The printf builtin handles this correctly, so this should be fairly easy to fix.

Actually, I don't suppose we could just call into the printf code directly, can
we?  It _works_ (see attachment), but it's not elegant.

Aside: It is customary to use a valid from address.

[-- Attachment #2: zformat-printf-PoC.patch --]
[-- Type: text/x-diff, Size: 3519 bytes --]

diff --git a/Src/Modules/zutil.c b/Src/Modules/zutil.c
index 7d9bf05d6..bb00c8a24 100644
--- a/Src/Modules/zutil.c
+++ b/Src/Modules/zutil.c
@@ -775,7 +775,7 @@ static char *zformat_substring(char* instr, char **specs, char **outp,
 
     for (s = instr; *s && *s != endchar; s++) {
 	if (*s == '%') {
-	    int right, min = -1, max = -1, outl, testit;
+	    int right, min = -1, max = -1, testit;
 	    char *spec, *start = s;
 
 	    if ((right = (*++s == '-')))
@@ -835,11 +835,49 @@ static char *zformat_substring(char* instr, char **specs, char **outp,
 	    } else if (skip) {
 		continue;
 	    } else if ((spec = specs[STOUC(*s)])) {
-		int len;
+		int outl;
+		Param pm;
+		LinkList args = newlinklist();
+
+		/* '%', '-', min, ',', max, 's', NUL. */
+		char fmt[1 + 1 + DIGBUFSIZE-1 + 1 + DIGBUFSIZE-1 + 1 + 1];
+
+		/* zformat uses minus to mean "pad on the left".
+		 * printf uses minus to mean "pad on the right". */
+		const char *optional_minus = (right ? "" : "-");
+
+		startparamscope();
+		pm = createparam("REPLY", PM_LOCAL|PM_SCALAR);
+		if (pm)
+		    pm->level = locallevel; /* because createparam() doesn't */
+
+		addlinknode(args, "printf");
+		addlinknode(args, "-v");
+		addlinknode(args, "REPLY");
+		
+		if (min >= 0 && max >= 0) {
+		    snprintf(fmt, sizeof(fmt), "%%%s%d.%ds", optional_minus, min, max);
+		} else if (min >= 0) {
+		    snprintf(fmt, sizeof(fmt), "%%%s%ds", optional_minus, min);
+		} else if (max >= 0) {
+		    snprintf(fmt, sizeof(fmt), "%%.%ds", max);
+		} else {
+		    snprintf(fmt, sizeof(fmt), "%%%ss", optional_minus);
+		}
+		addlinknode(args, fmt);
 
-		if ((len = strlen(spec)) > max && max >= 0)
-		    len = max;
-		outl = (min >= 0 ? (min > len ? min : len) : len);
+		addlinknode(args, spec);
+
+		{
+		    Builtin builtin_printf =
+			(Builtin)builtintab->getnode(builtintab, "printf");
+		    local_list0(assigns);
+
+		    init_list0(assigns);
+		    execbuiltin(args, &assigns, builtin_printf);
+		}
+
+		outl = strlen(getsparam("REPLY"));
 
 		if (*ousedp + outl >= *olenp) {
 		    int nlen = *olenp + outl + 128;
@@ -849,24 +887,11 @@ static char *zformat_substring(char* instr, char **specs, char **outp,
 		    *olenp = nlen;
 		    *outp = tmp;
 		}
-		if (len >= outl) {
-		    memcpy(*outp + *ousedp, spec, outl);
+		{
+		    memcpy(*outp + *ousedp, getsparam("REPLY"), outl);
 		    *ousedp += outl;
-		} else {
-		    int diff = outl - len;
-
-		    if (right) {
-			while (diff--)
-			    (*outp)[(*ousedp)++] = ' ';
-			memcpy(*outp + *ousedp, spec, len);
-			*ousedp += len;
-		    } else {
-			memcpy(*outp + *ousedp, spec, len);
-			*ousedp += len;
-			while (diff--)
-			    (*outp)[(*ousedp)++] = ' ';
-		    }
 		}
+		endparamscope();
 	    } else {
 		int len = s - start + 1;
 
diff --git a/Test/D07multibyte.ztst b/Test/D07multibyte.ztst
index e20315340..3e7ec061f 100644
--- a/Test/D07multibyte.ztst
+++ b/Test/D07multibyte.ztst
@@ -585,3 +585,12 @@
 >OK
 F:A failure here may indicate the system regex library does not
 F:support character sets outside the portable 7-bit range.
+
+ if zmodload zsh/zutil 2>/dev/null; then
+   zformat -f REPLY '%.3s' 's:ヌxéfoo'
+   echo $REPLY
+ else
+   ZTST_skip="can't load the zsh/zutil module for testing"
+ fi
+0:zformat multibyte test
+>ヌxé
diff --git a/Test/V13zformat.ztst b/Test/V13zformat.ztst
index 982866e13..91901cbf4 100644
--- a/Test/V13zformat.ztst
+++ b/Test/V13zformat.ztst
@@ -65,3 +65,5 @@
 >ipsum.bar
 >bazbaz
 >\esc:ape
+
+# Multibyte tests in D07multibyte.ztst

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [BUG] zformat -f has no multibyte support
  2019-12-24 20:23 ` Daniel Shahaf
@ 2019-12-26  4:35   ` dana
  2019-12-26  5:04     ` Daniel Shahaf
  0 siblings, 1 reply; 5+ messages in thread
From: dana @ 2019-12-26  4:35 UTC (permalink / raw)
  To: Daniel Shahaf; +Cc: zsh-workers

On 24 Dec 2019, at 14:23, Daniel Shahaf <d.s@daniel.shahaf.name> wrote:
> Actually, I don't suppose we could just call into the printf code directly, can
> we?  It _works_ (see attachment), but it's not elegant.

A very quick before/after test shows that it reduces performance quite a bit,
especially as the number of format specs increases. Admittedly, it's only a
few µs per spec in absolute numbers (at least on my Mac, with the handful of
operations i tested), and i don't know how that would compare with a
'lower-level' approach. I assume that multi-byte support will incur *some*
penalty no matter how it's done, so maybe it's reasonable, idk.

The only other thing i'd worry about is the user having to micro-manage REPLY,
but it looks like the way you've done it avoids any issues with that.

dana


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [BUG] zformat -f has no multibyte support
  2019-12-26  4:35   ` dana
@ 2019-12-26  5:04     ` Daniel Shahaf
  2019-12-26 11:28       ` Daniel Shahaf
  0 siblings, 1 reply; 5+ messages in thread
From: Daniel Shahaf @ 2019-12-26  5:04 UTC (permalink / raw)
  To: zsh-workers

dana wrote on Thu, 26 Dec 2019 04:35 +00:00:
> On 24 Dec 2019, at 14:23, Daniel Shahaf <d.s@daniel.shahaf.name> wrote:
> > Actually, I don't suppose we could just call into the printf code directly, can
> > we?  It _works_ (see attachment), but it's not elegant.
> 
> A very quick before/after test shows that it reduces performance quite a bit,
> especially as the number of format specs increases. Admittedly, it's only a
> few µs per spec in absolute numbers (at least on my Mac, with the handful of
> operations i tested),

What is the _factor_ of slowdown?  (Multiplicative, as opposed to additive)
I.e., does the patch make things slower by 1%, or 10%, or 100%, or 1000%?

> and i don't know how that would compare with a
> 'lower-level' approach. I assume that multi-byte support will incur *some*
> penalty no matter how it's done, so maybe it's reasonable, idk.

I don't have an intuition for what the penalty should be, but to get a
sense of the orders of magnitude involved we could the behaviour of the
'printf' builtin when built with and without multibyte support.

Thanks,

Daniel

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [BUG] zformat -f has no multibyte support
  2019-12-26  5:04     ` Daniel Shahaf
@ 2019-12-26 11:28       ` Daniel Shahaf
  0 siblings, 0 replies; 5+ messages in thread
From: Daniel Shahaf @ 2019-12-26 11:28 UTC (permalink / raw)
  To: zsh-workers

Daniel Shahaf wrote on Thu, 26 Dec 2019 05:04 +00:00:
> dana wrote on Thu, 26 Dec 2019 04:35 +00:00:
> > On 24 Dec 2019, at 14:23, Daniel Shahaf <d.s@daniel.shahaf.name> wrote:
> > > Actually, I don't suppose we could just call into the printf code directly, can
> > > we?  It _works_ (see attachment), but it's not elegant.
> > 
> > A very quick before/after test shows that it reduces performance quite a bit,
> > especially as the number of format specs increases. Admittedly, it's only a
> > few µs per spec in absolute numbers (at least on my Mac, with the handful of
> > operations i tested),
> 
> What is the _factor_ of slowdown?  (Multiplicative, as opposed to additive)
> I.e., does the patch make things slower by 1%, or 10%, or 100%, or 1000%?

According to dana and I's results, the factor is between x4 and x29:

[[[
% for 1 in */Src/zsh; do printf "%-16s%s\n" $1:h:h $($1 -fc 'time ( zformat -f x ${(r.1000000..%a.):-} a:o )'); done
( zformat -f x ${(r.1000000..%a.):-} a:o; )  0.10s user 0.27s system 99% cpu 0.369 total
master          
( zformat -f x ${(r.1000000..%a.):-} a:o; )  1.04s user 0.33s system 99% cpu 1.372 total
patch           
]]]

[[[
# Without patch
[DEV] zsh % repeat 5 utime zformat -f x ${(r.10000..%a.):-} a:o
[429 µs]
[507 µs]
[499 µs]
[468 µs]
[609 µs]

# With patch
[DEV] zsh % repeat 5 utime zformat -f x ${(r.10000..%a.):-} a:o
[16577 µs]
[14063 µs]
[13941 µs]
[13972 µs]
[13569 µs]
]]]

I don't know how to explain the difference (we've both had other things to look into).


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, back to index

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-12-23 22:24 [BUG] zformat -f has no multibyte support zsugabubus
2019-12-24 20:23 ` Daniel Shahaf
2019-12-26  4:35   ` dana
2019-12-26  5:04     ` Daniel Shahaf
2019-12-26 11:28       ` Daniel Shahaf

zsh-workers

Archives are clonable: git clone --mirror http://inbox.vuxu.org/zsh-workers

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://inbox.vuxu.org/vuxu.archive.zsh.workers


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git