zsh-workers
 help / color / mirror / code / Atom feed
* [BUG] SIGSEGV under certain circumstances
@ 2017-03-01 15:38 Chi-Hsuan Yen
  2017-03-04 23:11 ` Bart Schaefer
  0 siblings, 1 reply; 16+ messages in thread
From: Chi-Hsuan Yen @ 2017-03-01 15:38 UTC (permalink / raw)
  To: Zsh hackers list

Hello zsh experts,

Yesterday I got SIGSEGV in zsh under certain circumstances. I'm on
Arch Linux x86_64. Steps to reproduce are:

1. Install the mpv media player. Seems the bug is related to its
completion script _mpv. On Arch Linux it's stored in
/usr/share/zsh/site-functions/_mpv. I have uploaded a copy to [1]
2. Use .zshrc at [2]
3. Create an empty file called 突然好想你-3565536.mp3 in $HOME
    Seems the filename affects how SIGSEGV occurs or not. Its content
is irrelevant.
4. Open a new terminal and run `mpv 突然好想你-3565536.mp3`. Then exit the
terminal to make sure that the mpv command is written to
~/.zsh_history
5. Open another new terminal, hit arrow up one or more times to the
previous mpv command.
6. Hit tab, then zsh crashes with SIGSEGV. The log under gdb can be found at [3]

My zsh version is d00931de5c2b7aa846daf137865dd05ac8d4be8a. I replaced
the value of `source` to git-master and rebuild the Arch Linux zsh
package. [4]

I can reproduce this bug with both QTerminal (a fork of Konsole) and
xfce4-terminal. So it's not a terminal-specific issue.

By the way, after those steps ~/.zsh_history seems corrupted:

$ head -n 2 ~/.zsh_history | tail -n 1 | xxd
00000000: 6d70 7620 e7aa 81e7 83a4 b6e5 a5bd e683  mpv ............
00000010: a3b3 e4bd 8380 2d33 3536 3535 3336 2e6d  ......-3565536.m
00000020: 7033 0a                                  p3.

$ echo -n 突然好想你 | xxd
00000000: e7aa 81e7 84b6 e5a5 bde6 83b3 e4bd a0  ...............

Chinese characters 突然好想你 map to:

e7aa81 e784b6 e5a5bd e683b3 e4bda0

in utf-8 (15 bytes, 3 bytes for each character). However, in
~/.zsh_history, the saved content is: (I reformatted it for easier
comparision with the correct version)

e7aa81 e783a4b6 e5a5bd e683a3b3 e4bd8380

Apparently the 2nd, 4th and 5th characters are corrupted. I'm not sure
whether it's related to the crash, though.

I know this bug is not easy to reproduce. Thanks for the patience in
reading this long letter!

Best,

Yen, Chi-Hsuan

[1] https://gist.github.com/yan12125/014c7a7510d1d9bd9ac8edf142a0c65d#file-_mpv
[2] https://gist.github.com/yan12125/014c7a7510d1d9bd9ac8edf142a0c65d#file-zshrc
[3] https://gist.github.com/yan12125/014c7a7510d1d9bd9ac8edf142a0c65d#file-backtrace
[4] https://git.archlinux.org/svntogit/packages.git/tree/trunk/PKGBUILD?h=packages/zsh


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [BUG] SIGSEGV under certain circumstances
  2017-03-01 15:38 [BUG] SIGSEGV under certain circumstances Chi-Hsuan Yen
@ 2017-03-04 23:11 ` Bart Schaefer
  2017-03-05 12:55   ` Chi-Hsuan Yen
  2017-03-05 13:09   ` Chi-Hsuan Yen
  0 siblings, 2 replies; 16+ messages in thread
From: Bart Schaefer @ 2017-03-04 23:11 UTC (permalink / raw)
  To: Chi-Hsuan Yen, Zsh hackers list

On Mar 1, 11:38pm, Chi-Hsuan Yen wrote:
}
} My zsh version is d00931de5c2b7aa846daf137865dd05ac8d4be8a.
} 
} By the way, after those steps ~/.zsh_history seems corrupted:

I'm not sure the history is actually corrupted, because the history
file is stored with metafied characters which means you can't easily
examine it independently of reading it into zsh.  There was a small
C program posted here some months ago that would unmetafy the file.

However based on your steps to reproduce there likely is a problem
related to reading the history file back in again.

The stack trace you provided is not helpful because it reveals the
point at which the corrupted memory is accessed but not the point
at which it became corrupted.  A pass through valgrind might give
better information.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [BUG] SIGSEGV under certain circumstances
  2017-03-04 23:11 ` Bart Schaefer
@ 2017-03-05 12:55   ` Chi-Hsuan Yen
  2017-03-05 13:09   ` Chi-Hsuan Yen
  1 sibling, 0 replies; 16+ messages in thread
From: Chi-Hsuan Yen @ 2017-03-05 12:55 UTC (permalink / raw)
  To: Bart Schaefer; +Cc: Zsh hackers list

On 5 March 2017 at 07:11, Bart Schaefer <schaefer@brasslantern.com> wrote:
> On Mar 1, 11:38pm, Chi-Hsuan Yen wrote:
> }
> } My zsh version is d00931de5c2b7aa846daf137865dd05ac8d4be8a.
> }
> } By the way, after those steps ~/.zsh_history seems corrupted:
>
> I'm not sure the history is actually corrupted, because the history
> file is stored with metafied characters which means you can't easily
> examine it independently of reading it into zsh.  There was a small
> C program posted here some months ago that would unmetafy the file.
>
> However based on your steps to reproduce there likely is a problem
> related to reading the history file back in again.
>
> The stack trace you provided is not helpful because it reveals the
> point at which the corrupted memory is accessed but not the point
> at which it became corrupted.  A pass through valgrind might give
> better information.

Thanks for the tip. I re-compiled with commit
8522e996ecc88697344dcc4814367ec7e32e7deb and run zsh under valgrind.
The log can be found in [1]. (The GDB backtrace is also updated to the
new build for consistency).

[1] https://gist.github.com/yan12125/014c7a7510d1d9bd9ac8edf142a0c65d#file-valgrind-log


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [BUG] SIGSEGV under certain circumstances
  2017-03-04 23:11 ` Bart Schaefer
  2017-03-05 12:55   ` Chi-Hsuan Yen
@ 2017-03-05 13:09   ` Chi-Hsuan Yen
  2017-03-05 16:00     ` Bart Schaefer
  1 sibling, 1 reply; 16+ messages in thread
From: Chi-Hsuan Yen @ 2017-03-05 13:09 UTC (permalink / raw)
  To: Bart Schaefer; +Cc: Zsh hackers list

On 5 March 2017 at 07:11, Bart Schaefer <schaefer@brasslantern.com> wrote:
> On Mar 1, 11:38pm, Chi-Hsuan Yen wrote:
> }
> } My zsh version is d00931de5c2b7aa846daf137865dd05ac8d4be8a.
> }
> } By the way, after those steps ~/.zsh_history seems corrupted:
>
> I'm not sure the history is actually corrupted, because the history
> file is stored with metafied characters which means you can't easily
> examine it independently of reading it into zsh.  There was a small
> C program posted here some months ago that would unmetafy the file.
>
> However based on your steps to reproduce there likely is a problem
> related to reading the history file back in again.
>
> The stack trace you provided is not helpful because it reveals the
> point at which the corrupted memory is accessed but not the point
> at which it became corrupted.  A pass through valgrind might give
> better information.

(I forgot that gmail servers are blocked, so I re-post my logs)

Thanks for the tip. I re-compiled with commit
8522e996ecc88697344dcc4814367ec7e32e7deb and run zsh under valgrind.
The log can be found in [1]. (The GDB backtrace is also updated to the
new build for consistency). [2] and [3] contains logs of vgdb, which I
heard that can give more information than either valgrind or gdb.

[1] https://gist.github.com/yan12125/014c7a7510d1d9bd9ac8edf142a0c65d#file-valgrind-log
[2] https://gist.github.com/yan12125/014c7a7510d1d9bd9ac8edf142a0c65d#file-valgrind-log-with-vgdb
[3] https://gist.github.com/yan12125/014c7a7510d1d9bd9ac8edf142a0c65d#file-gdb-backtrace-from-vgdb


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [BUG] SIGSEGV under certain circumstances
  2017-03-05 13:09   ` Chi-Hsuan Yen
@ 2017-03-05 16:00     ` Bart Schaefer
  2017-03-05 16:17       ` Peter Stephenson
  0 siblings, 1 reply; 16+ messages in thread
From: Bart Schaefer @ 2017-03-05 16:00 UTC (permalink / raw)
  To: Chi-Hsuan Yen; +Cc: Zsh hackers list

On Mar 5,  9:09pm, Chi-Hsuan Yen wrote:
}
} Thanks for the tip. I re-compiled with commit
} 8522e996ecc88697344dcc4814367ec7e32e7deb and run zsh under valgrind.

Yes, this is much better, thank you.

The bad pointer dereference is in filename completion, not in the
history as I first suspected.

In computil.c:cfp_matcher_pats there is a loop that walks the string
from the command line, in this case the file name recalled from the
history, Comparing each character to the matcher pattern.  If it gets
a match it adjusts some counters that are initialized from strlen() of
the candidate string, exiting the loop when the counters reach the
end of the string.  It also adjusts pointers into string, and one of
those pointers is running past the end.

The pattern is m:{a-zA-Z}={A-Za-z} m:{a-zA-Z}={A-Za-z} but I don't
think that matters, it's the candidate string that's causing the
confusion.

You can see the string in Yen's third backtrace - I won't attempt to
paste the string here because my mail client will probably mangle it:
https://gist.github.com/yan12125/014c7a7510d1d9bd9ac8edf142a0c65d#file-gdb-backtrace-from-vgdb

It does not appear to be metafied but it contains some bytes that
were not in the original file name in his reproducing example.

So there seem to be two problems, one that the history is either not
saving or not reloading the Chinese characters correctly, and two
that the loop in cfp_matcher_pats is not counting correctly when it
examines that garbage string recalled from history.

I'm not in a good position multilingual-environment-wise to debug this
much further.  Peter, you were the last person to touch the computil.c
code in this area, though it was a long time ago --

commit 7f470ebcb09972d46e947dddf0c16197dd3a312f
Author: Peter Stephenson <pws@users.sourceforge.net>
Date:   Sat Oct 18 19:16:24 2008 +0000

    25912: fix another metafication bug in completion
    and remove lies about matchers

-- do you have any thoughts on what's going on?


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [BUG] SIGSEGV under certain circumstances
  2017-03-05 16:00     ` Bart Schaefer
@ 2017-03-05 16:17       ` Peter Stephenson
  2017-03-05 18:42         ` Bart Schaefer
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Stephenson @ 2017-03-05 16:17 UTC (permalink / raw)
  To: Zsh hackers list; +Cc: Chi-Hsuan Yen

On Sun, 5 Mar 2017 08:00:54 -0800
Bart Schaefer <schaefer@brasslantern.com> wrote:
> In computil.c:cfp_matcher_pats there is a loop that walks the string
> from the command line, in this case the file name recalled from the
> history, Comparing each character to the matcher pattern.  If it gets
> a match it adjusts some counters that are initialized from strlen() of
> the candidate string, exiting the loop when the counters reach the
> end of the string.  It also adjusts pointers into string, and one of
> those pointers is running past the end.
> 
> The pattern is m:{a-zA-Z}={A-Za-z} m:{a-zA-Z}={A-Za-z} but I don't
> think that matters, it's the candidate string that's causing the
> confusion.
>...
> So there seem to be two problems, one that the history is either not
> saving or not reloading the Chinese characters correctly, and two
> that the loop in cfp_matcher_pats is not counting correctly when it
> examines that garbage string recalled from history.

The matcher code doesn't handle non-ASCII characters, and probably never
wlll --- I spent ages looking at this some years ago until I realised I
was getting nowhere.  The most we can hope is it's safe about pointing
off the end of the string.  That's complicated by the fact that the rest
of completion does handle multibyte characters.

There's a good chance this is another problem with the handling of Meta
characters.  We know that broken history parsing can get these wrong ---
we've had problems like that not so long ago.  If it gets past that
stage, we make assumptions about what the presence of a Meta in the
string means that, if it fails, can lead to any number of problems.  In
particular, we assume that any Meta anywhere in the string has the
standard meaning.

We might be able to debug the second half of this by testing for
incorrect metafication in the matcher code, but I'm not sure how far
that gets us.  We're not going to be able to be safe at every point in
the code.

I suspect tracking down the problem at the input stage is the only good
bet.   You might have thought that would be much more mechanical and
hence easy...

pws


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [BUG] SIGSEGV under certain circumstances
  2017-03-05 16:17       ` Peter Stephenson
@ 2017-03-05 18:42         ` Bart Schaefer
  2017-03-05 18:52           ` Peter Stephenson
  0 siblings, 1 reply; 16+ messages in thread
From: Bart Schaefer @ 2017-03-05 18:42 UTC (permalink / raw)
  To: Zsh hackers list

On Mar 5,  4:17pm, Peter Stephenson wrote:
}
} The matcher code doesn't handle non-ASCII characters, and probably never
} wlll --- I spent ages looking at this some years ago until I realised I
} was getting nowhere.

Maybe there's a way to have cfp_matcher_pats() bail out with failure as
soon as it encounters any non-ascii character?


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [BUG] SIGSEGV under certain circumstances
  2017-03-05 18:42         ` Bart Schaefer
@ 2017-03-05 18:52           ` Peter Stephenson
  2017-03-05 21:45             ` Bart Schaefer
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Stephenson @ 2017-03-05 18:52 UTC (permalink / raw)
  To: zsh-workers



On 5 March 2017 18:42:39 GMT+00:00, Bart Schaefer <schaefer@brasslantern.com> wrote:
>On Mar 5,  4:17pm, Peter Stephenson wrote:
>}
>} The matcher code doesn't handle non-ASCII characters, and probably
>never
>} wlll --- I spent ages looking at this some years ago until I realised
>I
>} was getting nowhere.
>
>Maybe there's a way to have cfp_matcher_pats() bail out with failure as
>soon as it encounters any non-ascii character?

It needs at least to be able to do trivial matching,
character for character, or it becomes unusable.

pws
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [BUG] SIGSEGV under certain circumstances
  2017-03-05 18:52           ` Peter Stephenson
@ 2017-03-05 21:45             ` Bart Schaefer
  2017-03-05 22:31               ` Bart Schaefer
                                 ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Bart Schaefer @ 2017-03-05 21:45 UTC (permalink / raw)
  To: zsh-workers

On Mar 5,  6:52pm, Peter Stephenson wrote:
}
} It needs at least to be able to do trivial matching,
} character for character, or it becomes unusable.

Point taken, though arguably it's *already* unusable if it crashes.

Let's go back to the history angle ... from Yen's original message:

} Chinese characters map to:
} 
} e7aa81 e784b6 e5a5bd e683b3 e4bda0
} 
} in utf-8 (15 bytes, 3 bytes for each character). However, in
} ~/.zsh_history, the saved content is: (I reformatted it for easier
} comparision with the correct version)
} 
} e7aa81 e783a4b6 e5a5bd e683a3b3 e4bd8380

Adding brackets to show metafied bytes:

e7aa81 e7[83a4]b6 e5a5bd e6[83a3]b3 e4bd[8380]

That is the same string that is passed to cfp_matcher_pats() according to
the backtrace.  So history is reading/writing correctly, it appears.

However, note that it's not the first byte of the three in any of these
examples that is metafied, and if we look at compmatch.c:pattern_match()
we find this:

	/*
	 * Now the line character.
	 */
#ifdef MULTIBYTE_SUPPORT
	len = mb_metacharlenconv_r(s, &c, &lstate);
#else
	/* We have the character itself. */
	if (*s == Meta) {
	    c = STOUC(s[1]) ^ 32;
	    len = 2;
	} else {
	    c = STOUC(*s);
	    len = 1;
	}
#endif

This will fail in this case, I would think?  We have to unmetafy first,
possibly several bytes ahead, and THEN attempt multibyte conversion;
not do either one or the other?

So we're likely to get a false hit on pattern_match(), and in fact the
bad dereference is of *mp in the branch where pattern_match() succeeds.

[Somebody will have to explain to me why (mp += cl) makes sense, as that
seems to be using a byte count to increment a (Cmatcher*), but it's done
in several places.]

Am I going astray here?


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [BUG] SIGSEGV under certain circumstances
  2017-03-05 21:45             ` Bart Schaefer
@ 2017-03-05 22:31               ` Bart Schaefer
  2017-03-05 22:41               ` Daniel Shahaf
  2017-03-06  9:47               ` Peter Stephenson
  2 siblings, 0 replies; 16+ messages in thread
From: Bart Schaefer @ 2017-03-05 22:31 UTC (permalink / raw)
  To: zsh-workers; +Cc: Chi-Hsuan Yen

On Mar 5,  1:45pm, Bart Schaefer wrote:
}
} This will fail in this case, I would think?  We have to unmetafy first,
} possibly several bytes ahead, and THEN attempt multibyte conversion;
} not do either one or the other?

So the following is probably deeply in need of optimization (needs a
version of unmeta() that stops after the widest possible mutltibyte
character), but just to check my hypothesis:

diff --git a/Src/Zle/compmatch.c b/Src/Zle/compmatch.c
index aedf463..b0252c2 100644
--- a/Src/Zle/compmatch.c
+++ b/Src/Zle/compmatch.c
@@ -1548,7 +1548,7 @@ pattern_match(Cpattern p, char *s, Cpattern wp, char *ws)
 {
     convchar_t c, wc;
     convchar_t ind, wind;
-    int len = 0, wlen, mt, wmt;
+    int len = 0, wlen = 0, mt, wmt;
 #ifdef MULTIBYTE_SUPPORT
     mbstate_t lstate, wstate;
 
@@ -1559,7 +1559,13 @@ pattern_match(Cpattern p, char *s, Cpattern wp, char *ws)
     while (p && wp && *s && *ws) {
 	/* First test the word character */
 #ifdef MULTIBYTE_SUPPORT
-	wlen = mb_metacharlenconv_r(ws, &wc, &wstate);
+	int ulen = mb_metacharlenconv_r(unmeta(ws), &wc, &wstate);
+	while (ulen-- > 0) {
+	    if (ws[wlen] == Meta)
+		wlen += 2;
+	    else
+		wlen += 1;
+	}
 #else
 	if (*ws == Meta) {
 	    wc = STOUC(ws[1]) ^ 32;
@@ -1577,7 +1583,13 @@ pattern_match(Cpattern p, char *s, Cpattern wp, char *ws)
 	 * Now the line character.
 	 */
 #ifdef MULTIBYTE_SUPPORT
-	len = mb_metacharlenconv_r(s, &c, &lstate);
+	ulen = mb_metacharlenconv_r(unmeta(s), &c, &lstate);
+	while (ulen-- > 0) {
+	    if (s[len] == Meta)
+		len += 2;
+	    else
+		len += 1;
+	}
 #else
 	/* We have the character itself. */
 	if (*s == Meta) {
@@ -1624,11 +1636,20 @@ pattern_match(Cpattern p, char *s, Cpattern wp, char *ws)
 	ws += wlen;
 	p = p->next;
 	wp = wp->next;
+#ifdef MULTIBYTE_SUPPORT
+	len = wlen = 0;
+#endif
     }
 
     while (p && *s) {
 #ifdef MULTIBYTE_SUPPORT
-	len = mb_metacharlenconv_r(s, &c, &lstate);
+	int ulen = mb_metacharlenconv_r(unmeta(s), &c, &lstate);
+	while (ulen-- > 0) {
+	    if (s[len] == Meta)
+		len += 2;
+	    else
+		len += 1;
+	}
 #else
 	if (*s == Meta) {
 	    c = STOUC(s[1]) ^ 32;
@@ -1642,11 +1663,20 @@ pattern_match(Cpattern p, char *s, Cpattern wp, char *ws)
 	    return 0;
 	p = p->next;
 	s += len;
+#ifdef MULTIBYTE_SUPPORT
+	len = 0;
+#endif
     }
 
     while (wp && *ws) {
 #ifdef MULTIBYTE_SUPPORT
-	wlen = mb_metacharlenconv_r(ws, &wc, &wstate);
+	int ulen = mb_metacharlenconv_r(unmeta(ws), &wc, &wstate);
+	while (ulen-- > 0) {
+	    if (ws[wlen] == Meta)
+		wlen += 2;
+	    else
+		wlen += 1;
+	}
 #else
 	if (*ws == Meta) {
 	    wc = STOUC(ws[1]) ^ 32;
@@ -1660,6 +1690,9 @@ pattern_match(Cpattern p, char *s, Cpattern wp, char *ws)
 	    return 0;
 	wp = wp->next;
 	ws += wlen;
+#ifdef MULTIBYTE_SUPPORT
+	wlen = 0;
+#endif
     }
 
     return 1;


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [BUG] SIGSEGV under certain circumstances
  2017-03-05 21:45             ` Bart Schaefer
  2017-03-05 22:31               ` Bart Schaefer
@ 2017-03-05 22:41               ` Daniel Shahaf
  2017-03-05 22:51                 ` Bart Schaefer
  2017-03-06  9:47               ` Peter Stephenson
  2 siblings, 1 reply; 16+ messages in thread
From: Daniel Shahaf @ 2017-03-05 22:41 UTC (permalink / raw)
  To: zsh-workers

Bart Schaefer wrote on Sun, Mar 05, 2017 at 13:45:13 -0800:
> [Somebody will have to explain to me why (mp += cl) makes sense, as that
> seems to be using a byte count to increment a (Cmatcher*), but it's done
> in several places.]

[ Context: this is about cfp_matcher_pats(). ]

Using a byte count to modify a pointer, in itself, is fine.  It's just
pointer arithmetic that makes the pointer 'mp' point at a different
element of the array 'ms'.

The part that confuses me is why the line says 'mp += cl' rather than
simply 'mp++'.  The inner loop executes precisely 'zl' times (because
the loop variable, 'tl', is initialized to 'al' and skips Meta bytes),
and 'mp' points into an array of size 'zl'.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [BUG] SIGSEGV under certain circumstances
  2017-03-05 22:41               ` Daniel Shahaf
@ 2017-03-05 22:51                 ` Bart Schaefer
  2017-03-05 23:07                   ` Bart Schaefer
  0 siblings, 1 reply; 16+ messages in thread
From: Bart Schaefer @ 2017-03-05 22:51 UTC (permalink / raw)
  To: Daniel Shahaf; +Cc: zsh-workers

On Sun, Mar 5, 2017 at 2:41 PM, Daniel Shahaf <d.s@daniel.shahaf.name> wrote:
>
> The part that confuses me is why the line says 'mp += cl' rather than
> simply 'mp++'.  The inner loop executes precisely 'zl' times (because
> the loop variable, 'tl', is initialized to 'al' and skips Meta bytes),
> and 'mp' points into an array of size 'zl'.

Yes, that's the part that confuses me as well, i.e., why is mb
incremented once for every byte rather than once for every character?


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [BUG] SIGSEGV under certain circumstances
  2017-03-05 22:51                 ` Bart Schaefer
@ 2017-03-05 23:07                   ` Bart Schaefer
  2017-03-06  0:23                     ` Bart Schaefer
  0 siblings, 1 reply; 16+ messages in thread
From: Bart Schaefer @ 2017-03-05 23:07 UTC (permalink / raw)
  To: zsh-workers

On Sun, Mar 5, 2017 at 2:51 PM, Bart Schaefer <schaefer@brasslantern.com> wrote:
> On Sun, Mar 5, 2017 at 2:41 PM, Daniel Shahaf <d.s@daniel.shahaf.name> wrote:
>>
>> The part that confuses me is why the line says 'mp += cl' rather than
>> simply 'mp++'.  The inner loop executes precisely 'zl' times (because
>> the loop variable, 'tl', is initialized to 'al' and skips Meta bytes),
>> and 'mp' points into an array of size 'zl'.
>
> Yes, that's the part that confuses me as well, i.e., why is mb
> incremented once for every byte rather than once for every character?

Also, it seems to me cl needs to be the number of bytes compared by
pattern_match(), which is not always just 1 or 2 with wide characters.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [BUG] SIGSEGV under certain circumstances
  2017-03-05 23:07                   ` Bart Schaefer
@ 2017-03-06  0:23                     ` Bart Schaefer
  0 siblings, 0 replies; 16+ messages in thread
From: Bart Schaefer @ 2017-03-06  0:23 UTC (permalink / raw)
  To: zsh-workers

On Mar 5,  3:07pm, Bart Schaefer wrote:
}
} Also, it seems to me cl needs to be the number of bytes compared by
} pattern_match(), which is not always just 1 or 2 with wide characters.

Let's try this?  Hopefully zero'ing the mbstate_t is both correct and
not too time consuming in unmeta_one().  It'd be nice if pattern_match()
could return the size but there are too many possibilities (including a
successful match of zero length, it seems).

This is against master, NOT on top of my previous patch.

diff --git a/Src/Zle/compmatch.c b/Src/Zle/compmatch.c
index aedf463..1cdbb8a 100644
--- a/Src/Zle/compmatch.c
+++ b/Src/Zle/compmatch.c
@@ -1548,27 +1548,11 @@ pattern_match(Cpattern p, char *s, Cpattern wp, char *ws)
 {
     convchar_t c, wc;
     convchar_t ind, wind;
-    int len = 0, wlen, mt, wmt;
-#ifdef MULTIBYTE_SUPPORT
-    mbstate_t lstate, wstate;
-
-    memset(&lstate, 0, sizeof(lstate));
-    memset(&wstate, 0, sizeof(wstate));
-#endif
+    int len = 0, wlen = 0, mt, wmt;
 
     while (p && wp && *s && *ws) {
 	/* First test the word character */
-#ifdef MULTIBYTE_SUPPORT
-	wlen = mb_metacharlenconv_r(ws, &wc, &wstate);
-#else
-	if (*ws == Meta) {
-	    wc = STOUC(ws[1]) ^ 32;
-	    wlen = 2;
-	} else {
-	    wc = STOUC(*ws);
-	    wlen = 1;
-	}
-#endif
+	wc = unmeta_one(ws, &wlen);
 	wind = pattern_match1(wp, wc, &wmt);
 	if (!wind)
 	    return 0;
@@ -1576,18 +1560,7 @@ pattern_match(Cpattern p, char *s, Cpattern wp, char *ws)
 	/*
 	 * Now the line character.
 	 */
-#ifdef MULTIBYTE_SUPPORT
-	len = mb_metacharlenconv_r(s, &c, &lstate);
-#else
-	/* We have the character itself. */
-	if (*s == Meta) {
-	    c = STOUC(s[1]) ^ 32;
-	    len = 2;
-	} else {
-	    c = STOUC(*s);
-	    len = 1;
-	}
-#endif
+	c = unmeta_one(s, &len);
 	/*
 	 * If either is "?", they match each other; no further tests.
 	 * Apply this even if the character wasn't convertable;
@@ -1627,17 +1600,7 @@ pattern_match(Cpattern p, char *s, Cpattern wp, char *ws)
     }
 
     while (p && *s) {
-#ifdef MULTIBYTE_SUPPORT
-	len = mb_metacharlenconv_r(s, &c, &lstate);
-#else
-	if (*s == Meta) {
-	    c = STOUC(s[1]) ^ 32;
-	    len = 2;
-	} else {
-	    c = STOUC(*s);
-	    len = 1;
-	}
-#endif
+	c = unmeta_one(s, &len);
 	if (!pattern_match1(p, c, &mt))
 	    return 0;
 	p = p->next;
@@ -1645,17 +1608,7 @@ pattern_match(Cpattern p, char *s, Cpattern wp, char *ws)
     }
 
     while (wp && *ws) {
-#ifdef MULTIBYTE_SUPPORT
-	wlen = mb_metacharlenconv_r(ws, &wc, &wstate);
-#else
-	if (*ws == Meta) {
-	    wc = STOUC(ws[1]) ^ 32;
-	    wlen = 2;
-	} else {
-	    wc = STOUC(*ws);
-	    wlen = 1;
-	}
-#endif
+	wc = unmeta_one(ws, &wlen);
 	if (!pattern_match1(wp, wc, &wmt))
 	    return 0;
 	wp = wp->next;
diff --git a/Src/Zle/computil.c b/Src/Zle/computil.c
index 325da6d..4ea409d 100644
--- a/Src/Zle/computil.c
+++ b/Src/Zle/computil.c
@@ -4485,10 +4485,10 @@ cfp_matcher_pats(char *matcher, char *add)
 			    } else
 				*mp = m;
 			}
-			cl = (*tmp == Meta) ? 2 : 1;
+			(void) unmeta_one(tmp, &cl);
 			tl -= cl;
 			tmp += cl;
-			mp += cl;
+			mp++;
 		    }
 		} else {
 		    stopp = m->line;
@@ -4505,10 +4505,10 @@ cfp_matcher_pats(char *matcher, char *add)
 			    } else
 				*mp = m;
 			}
-			cl = (*tmp == Meta) ? 2 : 1;
+			(void) unmeta_one(tmp, &cl);
 			tl -= cl;
 			tmp += cl;
-			mp += cl;
+			mp++;
 		    }
 		} else if (m->llen) {
 		    stopp = m->line;
@@ -4531,7 +4531,7 @@ cfp_matcher_pats(char *matcher, char *add)
 			al = tmp - add;
 			break;
 		    }
-		    cl = (*tmp == Meta) ? 2 : 1;
+		    (void) unmeta_one(tmp, &cl);
 		    tl -= cl;
 		    tmp += cl;
 		}
diff --git a/Src/utils.c b/Src/utils.c
index 7f3ddad..58983d3 100644
--- a/Src/utils.c
+++ b/Src/utils.c
@@ -4788,6 +4788,56 @@ unmeta(const char *file_name)
 }
 
 /*
+ * Unmetafy just one character and store the number of bytes it occupied.
+ */
+/**/
+mod_export convchar_t
+unmeta_one(const char *in, int *sz)
+{
+#ifdef MB_CUR_MAX
+    VARARR(char,out,MB_CUR_MAX+1);
+#else
+#define MB_CUR_MAX 2
+    char out[MB_CUR_MAX+1];
+#endif
+    convchar_t wc;
+    char *p;
+    const char *t;
+    int newsz;
+#ifdef MULTIBYTE_SUPPORT
+    int ulen;
+    mbstate_t wstate;
+#endif
+
+    if (!sz)
+	sz = &newsz;
+    *sz = 0;
+
+    if (!in)
+	return 0;
+
+    for (t = in, p = out; *t && (p - out <= MB_CUR_MAX); p++)
+	if ((*p = *t++) == Meta && *t)
+	    *p = *t++ ^ 32;
+    *p = '\0';	/* Unnecessary? */
+
+#ifdef MULTIBYTE_SUPPORT
+    memset(&wstate, 0, sizeof(wstate));
+    ulen = mb_metacharlenconv_r(out, &wc, &wstate);
+    while (ulen-- > 0) {
+	if (in[*sz] == Meta)
+		*sz += 2;
+	    else
+		*sz += 1;
+    }
+#else
+    *sz = p - out;
+    wc = STOUC(out[0]);
+#endif
+    return wc;
+}
+
+/*
  * Unmetafy and compare two strings, comparing unsigned character values.
  * "a\0" sorts after "a".
  *


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [BUG] SIGSEGV under certain circumstances
  2017-03-05 21:45             ` Bart Schaefer
  2017-03-05 22:31               ` Bart Schaefer
  2017-03-05 22:41               ` Daniel Shahaf
@ 2017-03-06  9:47               ` Peter Stephenson
  2017-03-06 17:10                 ` Bart Schaefer
  2 siblings, 1 reply; 16+ messages in thread
From: Peter Stephenson @ 2017-03-06  9:47 UTC (permalink / raw)
  To: Bart Schaefer; +Cc: zsh-workers

On Sun, 5 Mar 2017 13:45:13 -0800
Bart Schaefer <schaefer@brasslantern.com> wrote:
> [Somebody will have to explain to me why (mp += cl) makes sense, as that
> seems to be using a byte count to increment a (Cmatcher*), but it's done
> in several places.]

We're getting down to the point where the matcher hasn't been fully
converted to multibyte chracters (though I can't swear this is it), and
indeed right at the bottom I seem to remember it uses arrays to such an
extent it would be more natural to use wide characters.  But that turned
out to be easier said than done too --- partly I think it was a question
of being able to store pattern characters.

> Am I going astray here?

I don't understand your point since the functions being called here are
all adapted for metafied multibyte characters, but I'm obviously missing
something.

pws


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [BUG] SIGSEGV under certain circumstances
  2017-03-06  9:47               ` Peter Stephenson
@ 2017-03-06 17:10                 ` Bart Schaefer
  0 siblings, 0 replies; 16+ messages in thread
From: Bart Schaefer @ 2017-03-06 17:10 UTC (permalink / raw)
  To: zsh-workers

On Mar 6,  9:47am, Peter Stephenson wrote:
} Subject: Re: [BUG] SIGSEGV under certain circumstances
}
} Bart Schaefer <schaefer@brasslantern.com> wrote:
} > Am I going astray here?
} 
} I don't understand your point since the functions being called here are
} all adapted for metafied multibyte characters, but I'm obviously missing
} something.

No, it's me that's missing something.  The patch I sent in 40754 is more
complicated than necessary, because I didn't bother to check whether the
word "meta" in the name "mb_metacharlenconv" actually meant that it was
handling zsh metafied pairs.

At the same time, this still means that pattern_match() was using the
unmetafied wide characters to do comparisons, but cfp_matcher_pats()
was counting bytes as if there were at most one metafied pair making
up the whole of each character.

There's also the question of whether there is one entry in the "ms"
array of Cmatcher for every byte in the input string (including the
Meta bytes), or one entry for every character, or (the worst case)
one entry for every byte-or-metafied-pair of which there may be more
than one of each per wide character.

I think the answer is that since the ms array is being filled here to
be passed down to cfp_matcher_range(), it should be one entry for each
character.  cfp_matcher_range() does mp++ at the end of its loop.  So
incrementing mp by one each time in cfp_matcher_pats() must also be
the right thing.

Here's a corrected patch with unmeta_one() simplified.  Again this is
against the master, not on top of previous patches.  I'm reasonably
confident in this now, so inclined to commit and wait for explosions.
Y02compmatch passes, so I don't think anything basic is broken.

diff --git a/Src/Zle/compmatch.c b/Src/Zle/compmatch.c
index aedf463..1cdbb8a 100644
--- a/Src/Zle/compmatch.c
+++ b/Src/Zle/compmatch.c
@@ -1548,27 +1548,11 @@ pattern_match(Cpattern p, char *s, Cpattern wp, char *ws)
 {
     convchar_t c, wc;
     convchar_t ind, wind;
-    int len = 0, wlen, mt, wmt;
-#ifdef MULTIBYTE_SUPPORT
-    mbstate_t lstate, wstate;
-
-    memset(&lstate, 0, sizeof(lstate));
-    memset(&wstate, 0, sizeof(wstate));
-#endif
+    int len = 0, wlen = 0, mt, wmt;
 
     while (p && wp && *s && *ws) {
 	/* First test the word character */
-#ifdef MULTIBYTE_SUPPORT
-	wlen = mb_metacharlenconv_r(ws, &wc, &wstate);
-#else
-	if (*ws == Meta) {
-	    wc = STOUC(ws[1]) ^ 32;
-	    wlen = 2;
-	} else {
-	    wc = STOUC(*ws);
-	    wlen = 1;
-	}
-#endif
+	wc = unmeta_one(ws, &wlen);
 	wind = pattern_match1(wp, wc, &wmt);
 	if (!wind)
 	    return 0;
@@ -1576,18 +1560,7 @@ pattern_match(Cpattern p, char *s, Cpattern wp, char *ws)
 	/*
 	 * Now the line character.
 	 */
-#ifdef MULTIBYTE_SUPPORT
-	len = mb_metacharlenconv_r(s, &c, &lstate);
-#else
-	/* We have the character itself. */
-	if (*s == Meta) {
-	    c = STOUC(s[1]) ^ 32;
-	    len = 2;
-	} else {
-	    c = STOUC(*s);
-	    len = 1;
-	}
-#endif
+	c = unmeta_one(s, &len);
 	/*
 	 * If either is "?", they match each other; no further tests.
 	 * Apply this even if the character wasn't convertable;
@@ -1627,17 +1600,7 @@ pattern_match(Cpattern p, char *s, Cpattern wp, char *ws)
     }
 
     while (p && *s) {
-#ifdef MULTIBYTE_SUPPORT
-	len = mb_metacharlenconv_r(s, &c, &lstate);
-#else
-	if (*s == Meta) {
-	    c = STOUC(s[1]) ^ 32;
-	    len = 2;
-	} else {
-	    c = STOUC(*s);
-	    len = 1;
-	}
-#endif
+	c = unmeta_one(s, &len);
 	if (!pattern_match1(p, c, &mt))
 	    return 0;
 	p = p->next;
@@ -1645,17 +1608,7 @@ pattern_match(Cpattern p, char *s, Cpattern wp, char *ws)
     }
 
     while (wp && *ws) {
-#ifdef MULTIBYTE_SUPPORT
-	wlen = mb_metacharlenconv_r(ws, &wc, &wstate);
-#else
-	if (*ws == Meta) {
-	    wc = STOUC(ws[1]) ^ 32;
-	    wlen = 2;
-	} else {
-	    wc = STOUC(*ws);
-	    wlen = 1;
-	}
-#endif
+	wc = unmeta_one(ws, &wlen);
 	if (!pattern_match1(wp, wc, &wmt))
 	    return 0;
 	wp = wp->next;
diff --git a/Src/Zle/computil.c b/Src/Zle/computil.c
index 325da6d..e704f9f 100644
--- a/Src/Zle/computil.c
+++ b/Src/Zle/computil.c
@@ -4465,17 +4465,24 @@ cfp_matcher_pats(char *matcher, char *add)
     if (m && m != pcm_err) {
 	char *tmp;
 	int al = strlen(add), zl = ztrlen(add), tl, cl;
-	VARARR(Cmatcher, ms, zl);
+	VARARR(Cmatcher, ms, zl);	/* One Cmatcher per character */
 	Cmatcher *mp;
 	Cpattern stopp;
 	int stopl = 0;
 
+	/* zl >= (number of wide characters) is guaranteed */
 	memset(ms, 0, zl * sizeof(Cmatcher));
 
 	for (; m && *add; m = m->next) {
 	    stopp = NULL;
 	    if (!(m->flags & (CMF_LEFT|CMF_RIGHT))) {
 		if (m->llen == 1 && m->wlen == 1) {
+		    /*
+		     * In this loop and similar loops below we step
+		     * through tmp one (possibly wide) character at a
+		     * time.  pattern_match() compares only the first
+		     * character using unmeta_one() so keep in step.
+		     */
 		    for (tmp = add, tl = al, mp = ms; tl; ) {
 			if (pattern_match(m->line, tmp, NULL, NULL)) {
 			    if (*mp) {
@@ -4485,10 +4492,10 @@ cfp_matcher_pats(char *matcher, char *add)
 			    } else
 				*mp = m;
 			}
-			cl = (*tmp == Meta) ? 2 : 1;
+			(void) unmeta_one(tmp, &cl);
 			tl -= cl;
 			tmp += cl;
-			mp += cl;
+			mp++;
 		    }
 		} else {
 		    stopp = m->line;
@@ -4505,10 +4512,10 @@ cfp_matcher_pats(char *matcher, char *add)
 			    } else
 				*mp = m;
 			}
-			cl = (*tmp == Meta) ? 2 : 1;
+			(void) unmeta_one(tmp, &cl);
 			tl -= cl;
 			tmp += cl;
-			mp += cl;
+			mp++;
 		    }
 		} else if (m->llen) {
 		    stopp = m->line;
@@ -4531,7 +4538,7 @@ cfp_matcher_pats(char *matcher, char *add)
 			al = tmp - add;
 			break;
 		    }
-		    cl = (*tmp == Meta) ? 2 : 1;
+		    (void) unmeta_one(tmp, &cl);
 		    tl -= cl;
 		    tmp += cl;
 		}
diff --git a/Src/utils.c b/Src/utils.c
index 7f3ddad..839575b 100644
--- a/Src/utils.c
+++ b/Src/utils.c
@@ -4788,6 +4788,48 @@ unmeta(const char *file_name)
 }
 
 /*
+ * Unmetafy just one character and store the number of bytes it occupied.
+ */
+/**/
+mod_export convchar_t
+unmeta_one(const char *in, int *sz)
+{
+    convchar_t wc;
+    int newsz;
+#ifdef MULTIBYTE_SUPPORT
+    int ulen;
+    mbstate_t wstate;
+#endif
+
+    if (!sz)
+	sz = &newsz;
+    *sz = 0;
+
+    if (!in || !*in)
+	return 0;
+
+#ifdef MULTIBYTE_SUPPORT
+    memset(&wstate, 0, sizeof(wstate));
+    ulen = mb_metacharlenconv_r(in, &wc, &wstate);
+    while (ulen-- > 0) {
+	if (in[*sz] == Meta)
+	    *sz += 2;
+	else
+	    *sz += 1;
+    }
+#else
+    if (in[0] == Meta) {
+      *sz = 2;
+      wc = STOUC(in[1] ^ 32);
+    } else {
+      *sz = 1;
+      wc = STOUC(in[0]);
+    }
+#endif
+    return wc;
+}
+
+/*
  * Unmetafy and compare two strings, comparing unsigned character values.
  * "a\0" sorts after "a".
  *


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2017-03-06 20:01 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-01 15:38 [BUG] SIGSEGV under certain circumstances Chi-Hsuan Yen
2017-03-04 23:11 ` Bart Schaefer
2017-03-05 12:55   ` Chi-Hsuan Yen
2017-03-05 13:09   ` Chi-Hsuan Yen
2017-03-05 16:00     ` Bart Schaefer
2017-03-05 16:17       ` Peter Stephenson
2017-03-05 18:42         ` Bart Schaefer
2017-03-05 18:52           ` Peter Stephenson
2017-03-05 21:45             ` Bart Schaefer
2017-03-05 22:31               ` Bart Schaefer
2017-03-05 22:41               ` Daniel Shahaf
2017-03-05 22:51                 ` Bart Schaefer
2017-03-05 23:07                   ` Bart Schaefer
2017-03-06  0:23                     ` Bart Schaefer
2017-03-06  9:47               ` Peter Stephenson
2017-03-06 17:10                 ` Bart Schaefer

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/zsh/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).