zsh-workers
 help / color / mirror / code / Atom feed
* [BUG] Long line makes pattern matching (by //) hog Zsh
@ 2016-06-05 14:36 Sebastian Gniazdowski
  2016-06-05 19:10 ` Bart Schaefer
  0 siblings, 1 reply; 7+ messages in thread
From: Sebastian Gniazdowski @ 2016-06-05 14:36 UTC (permalink / raw)
  To: Zsh hackers list

[-- Attachment #1: Type: text/plain, Size: 568 bytes --]

Hello,
attached is a test program that in general blocks Zsh from ending.
Github user kapsh reported "fatal error: out of heap memory" and I
investigated that it's about long lines in his data. The pattern:

([^ /\\\\]##[^0-9/\\\\ ]##[^/\\\\]#(#e))

is:
1. not backslash nor slash nor space [^ /\\\\]##
2. not number, slash, backslash, space [^0-9/\\\\ ]##
3. not slash, backslash [^/\\\\]#
4. end of line (#e)

Removing first segment makes Zsh end after a couple of seconds (but
not on zsh-5.0.8). Tested on zsh-5.0.2, 5.0.8, 5.2

Best regards,
Sebastian Gniazdowski

[-- Attachment #2: 2pat.zsh --]
[-- Type: application/octet-stream, Size: 3284 bytes --]

#!/usr/bin/env zsh

emulate -LR zsh
setopt extendedglob

typeset -a list disp_list
list=( "3226  1802 /opt/google/chrome/chrome --type=renderer --enable-lcd-text --enable-features=DownloadResumption,IncidentReportingModuleLoadAnalysis<SafeBrowsingIncidentReportingServiceFeatures,IncidentReportingSuspiciousModuleReporting<SafeBrowsingIncidentReportingServiceFeatures,LinuxObsoleteSystemIsEndOfTheLine<LinuxObsoleteSystemIsEndOfTheLine,PrintPreviewDistiller,*UsePasswordSeparatedSigninFlow<PasswordSeparatedSigninFlow,WebFontsIntervention<WebFontsIntervention,*WebRTC-EnableWebRtcEcdsa<WebRTC-EnableWebRtcEcdsa,brotli-encoding<BrotliEncoding,enable-automatic-password-saving --disable-features=RenderingPipelineThrottling<RenderingPipelineThrottling --force-fieldtrials=AppBannerTriggering/Aggressive/*AutoReloadExperiment/FlagEnabled/*AutoReloadVisibleOnlyExperiment/FlagEnabled/AutofillProfileOrderByFrecency/Enabled/*BrotliEncoding/Enabled/CaptivePortalInterstitial/Enabled/*ChildAccountDetection/Disabled/*ClientSideDetectionModel/Model0/*CrossDevicePromo/28DaySingleProfile/EnableSessionCrashedBubbleUI/EnabledByFlag/*ExtensionActionRedesign/Enabled/*ExtensionDeveloperModeWarning/Default/*GFE/Default/InstanceID/Enabled/LinuxObsoleteSystemIsEndOfTheLine/EndOfLine/MaterialDesignDownloads/Enabled/*NewProfileManagement/Command-Line-Disabled/*OmniboxBundledExperimentV1/Unused_2/*OutOfProcessPac/Default/*PageRevisitInstrumentation/Default/PasswordBranding/Disabled/*PasswordGeneration/Disabled/*PasswordManagerSettingsMigration/Disable/PasswordSeparatedSigninFlow/Default/*QUIC/FlagEnabled/*RenderingPipelineThrottling/Disabled/ReportCertificateErrors/ShowAndPossiblySend/*ResourcePriorities/Launch50pct_11011_1_1_10/SHA1IdentityUIWarning/Enabled/SHA1ToolbarUIJanuary2016/Warning/SHA1ToolbarUIJanuary2017/Error/SSLCommonNameMismatchHandling/Enabled/*SafeBrowsingIncidentReportingService/Default/SafeBrowsingIncidentReportingServiceFeatures/WithSuspiciousModuleReporting/SafeBrowsingUnverifiedDownloads/DisableByParameterMostSbTypes2/SafeBrowsingUpdateFrequency/Default/*SimpleCacheTrial/ExperimentYes2/SyncHttpContentCompression/Disabled/*UMA-Population-Restrict/normal/*UMA-Uniformity-Trial-1-Percent/group_61/*UMA-Uniformity-Trial-10-Percent/default/*UMA-Uniformity-Trial-100-Percent/group_01/*UMA-Uniformity-Trial-20-Percent/group_03/*UMA-Uniformity-Trial-5-Percent/group_04/*UMA-Uniformity-Trial-50-Percent/group_01/*UseDelayAgnosticAEC/DefaultEnabled/*WebFontsIntervention/Enabled/WebRTC-EnableWebRtcEcdsa/Default/ --primordial-pipe-token=F02F1A91C1338A95D3C111314BA6B87A --lang=en-US --enable-crash-reporter=831AB4A2-FDF7-4D2B-9BE0-F22377348434, --extension-process --enable-webrtc-hw-h264-encoding --enable-offline-auto-reload --enable-offline-auto-reload-visible-only --show-saved-copy=secondary --blink-settings=fetchDeferLateScripts=true,fetchIncreaseFontPriority=true,fetchIncreasePriorities=true --enable-suggestions-with-substring-match -" )

_nlist_colorify_disp_list() {
    local col=$'\x1b[00;34m' reset=$'\x1b[0m'
    NLIST_COLORING_PATTERN="([^ /\\\\]##[^0-9/\\\\ ]##[^/\\\\]#(#e))"
    disp_list=( "${(@)disp_list//(#mi)$~NLIST_COLORING_PATTERN/$col${MATCH}$reset}" )
}

disp_list=( "${(@)list[1, 50]}" )
echo "Starting"
_nlist_colorify_disp_list
echo "Done"

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [BUG] Long line makes pattern matching (by //) hog Zsh
  2016-06-05 14:36 [BUG] Long line makes pattern matching (by //) hog Zsh Sebastian Gniazdowski
@ 2016-06-05 19:10 ` Bart Schaefer
  2016-06-05 19:37   ` Peter Stephenson
  2016-06-05 19:47   ` Sebastian Gniazdowski
  0 siblings, 2 replies; 7+ messages in thread
From: Bart Schaefer @ 2016-06-05 19:10 UTC (permalink / raw)
  To: Zsh hackers list

On Jun 5,  4:36pm, Sebastian Gniazdowski wrote:
}
} 1. not backslash nor slash nor space [^ /\\\\]##
} 2. not number, slash, backslash, space [^0-9/\\\\ ]##
} 3. not slash, backslash [^/\\\\]#
} 4. end of line (#e)

It's in the block in pattern.c:patchmatch() that begins with the
comment:

		/*
		 * Lookahead to avoid useless matches. This is not possible
		 * with approximation.
		 */

Specifically, in the "if (no >= min) for (;;) ..." loop, at each charater
in the input string patmatch() is called recursively to look at the rest
of the string, which again enters this same loop because the next thing
is also a one-or-more expression, which calls recursively and again
enters the loop because the thing after that is a zero-or-more.

It consumes a LOT of memory while doing this, even if I add a hack to
prevent it from recursively re-entering lookahead (or to skip the
lookahead entirely).


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [BUG] Long line makes pattern matching (by //) hog Zsh
  2016-06-05 19:10 ` Bart Schaefer
@ 2016-06-05 19:37   ` Peter Stephenson
  2016-06-05 20:39     ` Peter Stephenson
  2016-06-06  0:07     ` Bart Schaefer
  2016-06-05 19:47   ` Sebastian Gniazdowski
  1 sibling, 2 replies; 7+ messages in thread
From: Peter Stephenson @ 2016-06-05 19:37 UTC (permalink / raw)
  Cc: Zsh hackers list

0On Sun, 5 Jun 2016 12:10:20 -0700
Bart Schaefer <schaefer@brasslantern.com> wrote:

> On Jun 5,  4:36pm, Sebastian Gniazdowski wrote:
> }
> } 1. not backslash nor slash nor space [^ /\\\\]##
> } 2. not number, slash, backslash, space [^0-9/\\\\ ]##
> } 3. not slash, backslash [^/\\\\]#
> } 4. end of line (#e)
>
> It's in the block in pattern.c:patchmatch() that begins with the
> comment:
> 
> 		/*
> 		 * Lookahead to avoid useless matches. This is not possible
> 		 * with approximation.
> 		 */

That comment happens to be an irrelevance --- that's where it's looking
for something that's an exact match to follow.  There isn't one in this
case.  If there was, we would have latched onto it and we wouldn't need
to try quite so hard rearranging the other elements of the pattern.

The containing block is where it's handling # and ##, so there's no
great surprise it's'= there.

> Specifically, in the "if (no >= min) for (;;) ..." loop, at each charater
> in the input string patmatch() is called recursively to look at the rest
> of the string, which again enters this same loop because the next thing
> is also a one-or-more expression, which calls recursively and again
> enters the loop because the thing after that is a zero-or-more.

The problem is the patterns are pathological.  Each of them can match
the same characters.  So it's spending a lot of time repartitioning the
mathches  between the possibilities of 1. and 2. and 3. in the above.
That's not polynomially bounded.  I'm not sure if it's even
exponentially bounded.

What I'm not sure is if there's a way of improving this without some
special case or, obviously, making the patterns more specific.

pws


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [BUG] Long line makes pattern matching (by //) hog Zsh
  2016-06-05 19:10 ` Bart Schaefer
  2016-06-05 19:37   ` Peter Stephenson
@ 2016-06-05 19:47   ` Sebastian Gniazdowski
  1 sibling, 0 replies; 7+ messages in thread
From: Sebastian Gniazdowski @ 2016-06-05 19:47 UTC (permalink / raw)
  To: Bart Schaefer; +Cc: Zsh hackers list

On 5 June 2016 at 21:10, Bart Schaefer <schaefer@brasslantern.com> wrote:
> It's in the block in pattern.c:patchmatch() that begins with the
> comment:
>
>                 /*
>                  * Lookahead to avoid useless matches. This is not possible
>                  * with approximation.
>                  */
>
> Specifically, in the "if (no >= min) for (;;) ..." loop, at each charater
> in the input string patmatch() is called recursively to look at the rest
> of the string, which again enters this same loop because the next thing
> is also a one-or-more expression, which calls recursively and again
> enters the loop because the thing after that is a zero-or-more.
>
> It consumes a LOT of memory while doing this, even if I add a hack to
> prevent it from recursively re-entering lookahead (or to skip the
> lookahead entirely).

Too bad the input isn't even that large, ~2850 characters. And that
the pattern even with first segment removed – ([^0-9/\\\\
]##[^/\\\\]#(#e)) – i.e. two 0/1-or-more segments, is also slow,
although doesn't hog Zsh. Maybe your changes would make a difference
for that input size, and for first segment removed, i.e. two # or ##
blocks? I had simplified highlighting in my tool and should be fine
now, but maybe in future I could restore the more sophisticated
pattern.

Best regards,
Sebastian Gniazdowski


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [BUG] Long line makes pattern matching (by //) hog Zsh
  2016-06-05 19:37   ` Peter Stephenson
@ 2016-06-05 20:39     ` Peter Stephenson
  2016-06-05 21:21       ` Sebastian Gniazdowski
  2016-06-06  0:07     ` Bart Schaefer
  1 sibling, 1 reply; 7+ messages in thread
From: Peter Stephenson @ 2016-06-05 20:39 UTC (permalink / raw)
  Cc: Zsh hackers list

On Sun, 5 Jun 2016 20:37:08 +0100
Peter Stephenson <p.w.stephenson@ntlworld.com> wrote:
> The problem is the patterns are pathological.  Each of them can match
> the same characters.  So it's spending a lot of time repartitioning the
> mathches  between the possibilities of 1. and 2. and 3. in the above.
> That's not polynomially bounded.  I'm not sure if it's even
> exponentially bounded.
> 
> What I'm not sure is if there's a way of improving this without some
> special case or, obviously, making the patterns more specific.

We are maybe being naive in treating [...] expressions as if they are
"simple".  I think we have better protection against pathological
backtracking for more complicated patterns.

Does the following help?

pws

diff --git a/Src/pattern.c b/Src/pattern.c
index 4e2f236..bec980f 100644
--- a/Src/pattern.c
+++ b/Src/pattern.c
@@ -1425,7 +1425,7 @@ patcomppiece(int *flagp, int paren)
 	case Inbrack:
 	    DPUTS(zpc_special[ZPC_INBRACK] == Marker,
 		  "Treating '[' as pattern character although disabled");
-	    flags |= P_SIMPLE;
+	    /*flags |= P_SIMPLE;*/
 	    if (*patparse == Hat || *patparse == Bang) {
 		patparse++;
 		starter = patnode(P_ANYBUT);


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [BUG] Long line makes pattern matching (by //) hog Zsh
  2016-06-05 20:39     ` Peter Stephenson
@ 2016-06-05 21:21       ` Sebastian Gniazdowski
  0 siblings, 0 replies; 7+ messages in thread
From: Sebastian Gniazdowski @ 2016-06-05 21:21 UTC (permalink / raw)
  To: Peter Stephenson; +Cc: Zsh hackers list

On 5 June 2016 at 22:39, Peter Stephenson <p.w.stephenson@ntlworld.com> wrote:
> We are maybe being naive in treating [...] expressions as if they are
> "simple".  I think we have better protection against pathological
> backtracking for more complicated patterns.
>
> Does the following help?

It's slower with this patch:

3,02s user 0,02s system 99% cpu 3,064 total
vs
1,55s user 0,24s system 93% cpu 1,909 total

for pattern with first segment removed, i.e.: ([^0-9/\\\\ ]##[^/\\\\]#(#e))

Best regards,
Sebastian Gniazdowski


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [BUG] Long line makes pattern matching (by //) hog Zsh
  2016-06-05 19:37   ` Peter Stephenson
  2016-06-05 20:39     ` Peter Stephenson
@ 2016-06-06  0:07     ` Bart Schaefer
  1 sibling, 0 replies; 7+ messages in thread
From: Bart Schaefer @ 2016-06-06  0:07 UTC (permalink / raw)
  To: Zsh hackers list

On Jun 5,  8:37pm, Peter Stephenson wrote:
}
} > On Jun 5,  4:36pm, Sebastian Gniazdowski wrote:
} > }
} > } 1. not backslash nor slash nor space [^ /\\\\]##
} > } 2. not number, slash, backslash, space [^0-9/\\\\ ]##
} > } 3. not slash, backslash [^/\\\\]#
} > } 4. end of line (#e)
} 
} The problem is the patterns are pathological.  Each of them can match
} the same characters.

Indeed, given that the first sub-pattern is "at least one" and the
second sub-pattern includes all the same characters as the first, I
think the first ## can be removed without changing what is matched;
i.e.:

    NLIST_COLORING_PATTERN="([^ /\\\\][^0-9/\\\\ ]##[^/\\\\]#(#e))"

And with that pattern the whole test script completes in a couple of
seconds.

} So it's spending a lot of time repartitioning the
} mathches  between the possibilities of 1. and 2. and 3. in the above.
} That's not polynomially bounded.  I'm not sure if it's even
} exponentially bounded.

I tried to work that out and came up with O(n**((n-1)**(n-2))) before
deciding I didn't remember enough of that math to believe it.

The other problem, which I hadn't really thought about before, is that
this is all being applied inside ${str//pat/repl} expression, bounded
only on the right, which means it has to try every position in $str
even when the pattern does NOT match, which for a fair chunk of the
example input it does not.

There may be an optimization possible in the way that igetmatch()
interacts with pattrylen(), so as to skip forward over the longest
NON-match as well as skipping over the longest match.


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2016-06-06  0:07 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-05 14:36 [BUG] Long line makes pattern matching (by //) hog Zsh Sebastian Gniazdowski
2016-06-05 19:10 ` Bart Schaefer
2016-06-05 19:37   ` Peter Stephenson
2016-06-05 20:39     ` Peter Stephenson
2016-06-05 21:21       ` Sebastian Gniazdowski
2016-06-06  0:07     ` Bart Schaefer
2016-06-05 19:47   ` Sebastian Gniazdowski

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/zsh/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).