zsh-users
* Match length and multibyte characters
@ 2015-09-10 11:35 Erik Bernstein
  2015-09-11 18:02 ` Jun T.
  0 siblings, 1 reply; 4+ messages in thread
From: Erik Bernstein @ 2015-09-10 11:35 UTC (permalink / raw)
  To: zsh-users

Hello everybody,

While playing around with zsh expansions, I stumbled across a small
annoyance that I think is worth asking the list about.

Let's suppose I have an array and I want to know the length of the
longest string it contains. After going through zshexpn(1), the first
thing I came up with was:

% array=(a bbb cc)
% print ${${(O)array//(#m)*/${#MATCH}}[1]}
3

which is perfectly fine and seems to do the job. Later I found that
the same thing can be accomplished like this:

% print ${${(ON)array%%*}[1]}
3

However, the second version seems to break on multibyte characters
while the first one works just fine:

% array=(a ä a)
% print ${${(O)array//(#m)*/${#MATCH}}[1]} ${${(ON)array%%*}[1]}
1 2

Can someone shed some light on whether the second version is
supposed to work with multibyte characters and, if so, what has to be
done to make it count each multibyte character only once, just like
the first version does?

regards

erik



* Re: Match length and multibyte characters
  2015-09-10 11:35 Match length and multibyte characters Erik Bernstein
@ 2015-09-11 18:02 ` Jun T.
  2015-09-11 19:40   ` Jun T.
  2015-09-17 15:14   ` Erik Bernstein
  0 siblings, 2 replies; 4+ messages in thread
From: Jun T. @ 2015-09-11 18:02 UTC (permalink / raw)
  To: Erik Bernstein, zsh-users


2015/09/10 20:35, Erik Bernstein <erik@fscking.org> wrote:
> % array=(a ä a)
> % print ${${(O)array//(#m)*/${#MATCH}}[1]} ${${(ON)array%%*}[1]}
> 1 2
> 
> Can maybe someone shed some light on whether the second version is
> supposed to work with multibyte characters and,

The second version returns 2 because ä is a 2-byte character in UTF-8.
This is a bug in the current zsh: the flags N, B and E all report bytes
rather than characters in ${...#...}, ${...%...} etc.
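
To make the byte/character distinction concrete, here is a minimal
sketch. The names byte_len and char_len are hypothetical helpers, not
part of zsh, and the character count assumes a UTF-8 locale with the
multibyte option set (the default):

```shell
# Hypothetical helpers for illustration; neither is part of zsh.

# Length in bytes: wc -c counts bytes regardless of the locale.
byte_len() { printf '%s' "$1" | wc -c | tr -d ' '; }

# Length as the shell itself counts it (characters in a UTF-8 locale).
char_len() { printf '%s\n' "${#1}"; }

s='ä'
byte_len "$s"   # 2 when the string is UTF-8 encoded
char_len "$s"   # depends on the locale and the shell's multibyte handling
```

The buggy (N) result corresponds to the first number, while ${#MATCH}
in the first version corresponds to the second.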

The patch below may fix the bug.

BTW, in your example, it is better to replace the flag (O) by (On)
so that the lengths are sorted in numerical order. Otherwise, 10 comes
before 2.


diff --git a/Src/glob.c b/Src/glob.c
index dea1bf5..43d135b 100644
--- a/Src/glob.c
+++ b/Src/glob.c
@@ -2491,17 +2491,17 @@ get_match_ret(char *s, int b, int e, int fl, char *replstr,
 	ll += 1 + (l - (e - b));
     if (fl & SUB_BIND) {
 	/* position of start of matched portion */
-	sprintf(buf, "%d ", b + 1);
+	sprintf(buf, "%d ", MB_METASTRLEN2END(s, 0, s+b) + 1);
 	ll += (bl = strlen(buf));
     }
     if (fl & SUB_EIND) {
 	/* position of end of matched portion */
-	sprintf(buf + bl, "%d ", e + 1);
+	sprintf(buf + bl, "%d ", MB_METASTRLEN2END(s, 0, s+e) + 1);
 	ll += (bl = strlen(buf));
     }
     if (fl & SUB_LEN) {
 	/* length of matched portion */
-	sprintf(buf + bl, "%d ", e - b);
+	sprintf(buf + bl, "%d ", MB_METASTRLEN2END(s+b, 0, s+e));
 	ll += (bl = strlen(buf));
     }
     if (bl)







* Re: Match length and multibyte characters
  2015-09-11 18:02 ` Jun T.
@ 2015-09-11 19:40   ` Jun T.
  2015-09-17 15:14   ` Erik Bernstein
  1 sibling, 0 replies; 4+ messages in thread
From: Jun T. @ 2015-09-11 19:40 UTC (permalink / raw)
  To: zsh-users


2015/09/12 03:02, I wrote:
> Otherwise, 10 comes before 2.

Sorry, that applies only when sorting in ascending order, i.e., with
the (o) flag.

(o)	1 10 2 20 3
(O)	3 20 2 10 1
(n)	1 2 3 10 20
(On)	20 10 3 2 1	this is what you want.



* Re: Match length and multibyte characters
  2015-09-11 18:02 ` Jun T.
  2015-09-11 19:40   ` Jun T.
@ 2015-09-17 15:14   ` Erik Bernstein
  1 sibling, 0 replies; 4+ messages in thread
From: Erik Bernstein @ 2015-09-17 15:14 UTC (permalink / raw)
  To: Jun T.; +Cc: zsh-users

>> % array=(a ä a)
>> % print ${${(O)array//(#m)*/${#MATCH}}[1]} ${${(ON)array%%*}[1]}
>> 1 2
>>
>> Can maybe someone shed some light on whether the second version is
>> supposed to work with multibyte characters and,
>
> The second version returns 2 because ä is a 2 byte character in UTF-8.
> This is a bug of the current zsh; all the flags N, B and E do not work
> well with multibyte characters in ${...#...}, ${...%...} etc.

Thanks for clearing that up. I was unsure whether this is really a bug
or whether there's another flag that I have to apply in order to make
it work with Unicode characters, too.


> The patch below may fix the bug.

This is what I get after applying your patch:

/home/debian/zsh-5.0.7/obj/Src/../../Src/glob.c:2489: undefined reference to `MB_METASTRLEN2END'
/home/debian/zsh-5.0.7/obj/Src/../../Src/glob.c:2495: undefined reference to `MB_METASTRLEN2END'
/home/debian/zsh-5.0.7/obj/Src/../../Src/glob.c:2483: undefined reference to `MB_METASTRLEN2END'

This might be due to my old version, 5.0.7; I didn't try 5.1.1. In any
case, I'd rather work around this bug until it gets fixed upstream
than patch each zsh on all of my machines individually.
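
One possible workaround that avoids the (N) flag entirely is a plain
loop over ${#s}. This is only a sketch: longest_len is a hypothetical
helper name, and counting ä as one character assumes a UTF-8 locale
with the multibyte option set.

```shell
# Hypothetical helper: print the length of the longest argument,
# measured the way the shell measures ${#s}.
longest_len() {
  local s max=0
  for s in "$@"; do
    # Keep the largest length seen so far.
    if [ "${#s}" -gt "$max" ]; then max=${#s}; fi
  done
  printf '%s\n' "$max"
}

longest_len a bbb cc   # prints 3
longest_len a ä a      # expected to count ä once in a UTF-8 locale
```

In zsh it would be called as `longest_len $array`. It is more verbose
than the one-liners, but it does not depend on the affected flags.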

> BTW, in your example, it is better to replace the flag (O) by (On)

True. I've used (On) during my tests but then forgot the crucial n in
my posting.

Best

erik


