This change makes most-recent-file[1] completion take a few milliseconds instead of close to a minute in a directory of 30000 files. I'm not sure exactly what's going on with these flags, but surely if they are not set, then we should not deduplicate the set of matches. Most likely this was a thinko-copy-paste-o from the first NOSORT case which does have to be negated. I'm not sure why regular tab-press was always fast, but it is still fast after this change for me. I don't know when these flags are supposed to be set, so I haven't tested that case. [1] zstyle ':completion:(*-|)most-recent-file:*' match-original both zstyle ':completion:(*-|)most-recent-file:*' file-sort modification zstyle ':completion:(*-|)most-recent-file:*' file-patterns '*:all\ files' zstyle ':completion:most-recent-file:*' hidden all zstyle ':completion:(*-|)most-recent-file:*' completer _files --- This bug is present since before cvs, according to git blame. Src/Zle/compcore.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/Src/Zle/compcore.c b/Src/Zle/compcore.c index ebcd7d85bb..778a4ff571 100644 --- a/Src/Zle/compcore.c +++ b/Src/Zle/compcore.c @@ -3254,7 +3254,7 @@ makearray(LinkList l, int type, int flags, int *np, int *nlp, int *llp) qsort((void *) rp, n, sizeof(Cmatch), (int (*) _((const void *, const void *)))matchcmp); - if (!(flags & CGF_UNIQCON)) { + if (flags & CGF_UNIQCON) { int dup; /* And delete the ones that occur more than once. */ @@ -3280,7 +3280,7 @@ makearray(LinkList l, int type, int flags, int *np, int *nlp, int *llp) nl++; } } else { - if (!(flags & CGF_UNIQALL) && !(flags & CGF_UNIQCON)) { + if (flags & CGF_UNIQALL) { int dup; for (ap = rp; *ap; ap++) { @@ -3303,7 +3303,7 @@ makearray(LinkList l, int type, int flags, int *np, int *nlp, int *llp) (*ap)->flags |= CMF_FMULT; } } - } else if (!(flags & CGF_UNIQCON)) { + } else if (flags & CGF_UNIQCON) { int dup; for (ap = cp = rp; *ap; ap++) { -- 2.15.1
On 3/15/22, Mikael Magnusson <mikachu@gmail.com> wrote: > This change makes most-recent-file[1] completion take a few milliseconds > instead of close to a minute in a directory of 30000 files. I'm not sure > exactly what's going on with these flags, but surely if they are not set, > then we should not deduplicate the set of matches. Most likely this was > a thinko-copy-paste-o from the first NOSORT case which does have to be > negated. I'm not sure why regular tab-press was always fast, but it is > still fast after this change for me. > > I don't know when these flags are supposed to be set, so I haven't tested > that case. > > [1] > zstyle ':completion:(*-|)most-recent-file:*' match-original both > zstyle ':completion:(*-|)most-recent-file:*' file-sort modification > zstyle ':completion:(*-|)most-recent-file:*' file-patterns '*:all\ files' > zstyle ':completion:most-recent-file:*' hidden all > zstyle ':completion:(*-|)most-recent-file:*' completer _files Okay, I set out to test the case when the flags are set, and waded through some confusion but I think the conclusion is that the comments in the header are wrong, and the names of the constants are misleading, and _path_files should probably use -V -2 since files can't be duplicated anyway. Is there a useful case when the same file could be added twice? (eg not just running _files twice by mistake). Here's the wading through confusion part: compcore.c has code like this (without my patch) if (!(flags & CGF_UNIQCON)) { int dup; /* And delete the ones that occur more than once. */ ... if (!(flags & CGF_UNIQALL) && !(flags & CGF_UNIQCON)) { int dup; } else if (!(flags & CGF_UNIQCON)) { int dup; ... Checking the comments for these flags: #define CGF_UNIQALL 8 /* remove all duplicates */ #define CGF_UNIQCON 16 /* remove consecutive duplicates */ So far it clearly seems like the checks are inverted. addmatches() does this: gflags = (((dat->aflags & CAF_NOSORT ) ? CGF_NOSORT : 0) | ... ((dat->aflags & CAF_UNIQALL) ? CGF_UNIQALL : 0) | ((dat->aflags & CAF_UNIQCON) ? CGF_UNIQCON : 0)); Okay, so we've copied over some other flags (note the CAF vs CGF), what do those say? #define CAF_UNIQCON 8 /* compadd -2: don't deduplicate */ #define CAF_UNIQALL 16 /* compadd -1: deduplicate */ okay... that's somewhat contradictory. Let's check the manpage: -1 If given together with the -V option, makes only consecutive duplicates in the group be removed. If combined with the -J option, this has no visible effect. Note that groups with and without this flag are in different name spaces. -2 If given together with the -J or -V option, makes all dupli‐ cates be kept. Again, groups with and without this flag are in different name spaces. Okay, so UNIQALL means to remove only consecutive duplicates, and UNIQCON means keep all duplicates. Right. That doesn't make any sense, but it does match what the original code does (I think). With that in mind, I have the following patch instead, any objections to this? (Again, it makes completion be instant instead of taking minutes in large directories). If there are objections, we can probably improve the deduplication algorithm to from n^2 to nlog n (sort array of pointers and use that to manipulate the original array keeping relative order of kept elements. In fact, the input data to the function is a list of pointers which we copy to an array of elements which is then manipulated). PS the manpage says -V is required for -1/-2 but -J with -o nosort works as well which is what happens below. -- 8< -- Subject: PATCH: _path_files: keep duplicates when not sorting The deduplication algorithm is too slow to run on very large sets of matches, and files can't be duplicate anyway. --- Completion/Unix/Type/_path_files | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Completion/Unix/Type/_path_files b/Completion/Unix/Type/_path_files index c09ca6fa9e..2ee3030785 100644 --- a/Completion/Unix/Type/_path_files +++ b/Completion/Unix/Type/_path_files @@ -168,7 +168,7 @@ if zstyle -s ":completion:${curcontext}:" file-sort tmp1; then if [[ "$sort" = on ]]; then sort= else - mopts=( -o nosort "${mopts[@]}" ) + mopts=( -o nosort -2 "${mopts[@]}" ) tmp2=() for tmp1 in "$pats[@]"; do -- Mikael Magnusson
On Thu, Mar 17, 2022 at 10:09 AM Mikael Magnusson <mikachu@gmail.com> wrote: > > Okay, so UNIQALL means to remove only consecutive duplicates, and > UNIQCON means keep all duplicates. Right. That doesn't make any sense, > but it does match what the original code does (I think). I suppose it depends on what CON is supposed to mean. I think the meaning is to collapse only consecutive duplicates down to one occurrence, so duplicates with intervening non-duplicates may still be left. E.g., similar to the difference between "sort -u" (UNIQALL) and "uniq" (UNIQCON). > With that in mind, I have the following patch instead, any objections > to this? I don't immediately see any problem with it, unless my remark above changes your understanding somehow. > PS the manpage says -V is required for -1/-2 but -J with -o nosort > works as well which is what happens below. That makes some sense, yes.
On 3/17/22, Bart Schaefer <schaefer@brasslantern.com> wrote: > On Thu, Mar 17, 2022 at 10:09 AM Mikael Magnusson <mikachu@gmail.com> > wrote: >> >> Okay, so UNIQALL means to remove only consecutive duplicates, and >> UNIQCON means keep all duplicates. Right. That doesn't make any sense, >> but it does match what the original code does (I think). > > I suppose it depends on what CON is supposed to mean. I think the > meaning is to collapse only consecutive duplicates down to one > occurrence, so duplicates with intervening non-duplicates may still be > left. E.g., similar to the difference between "sort -u" (UNIQALL) and > "uniq" (UNIQCON). Yeah, testing shows that one of the options keeps all duplicates and the other keeps non-consecutive duplicates, while the default is to remove all duplicates. I think the reason it's only noticably slow when sorting by file date is that when the completion code sorts by name, the deduplication algorithm is fairly linear. All the confusion stems from the fact that the comments say the exact opposite of what the code does and is documented to do :). >> With that in mind, I have the following patch instead, any objections >> to this? > > I don't immediately see any problem with it, unless my remark above > changes your understanding somehow. > >> PS the manpage says -V is required for -1/-2 but -J with -o nosort >> works as well which is what happens below. > > That makes some sense, yes. -- Mikael Magnusson
Hopefully this can save future generations some confusion. --- Src/Zle/comp.h | 6 +++--- Src/Zle/compcore.c | 8 +++++++- 2 files changed, 10 insertions(+), 4 deletions(-) diff --git a/Src/Zle/comp.h b/Src/Zle/comp.h index 0c5fbd4f00..a8480c2bac 100644 --- a/Src/Zle/comp.h +++ b/Src/Zle/comp.h @@ -85,8 +85,8 @@ struct cmgroup { #define CGF_NOSORT 1 /* don't sort this group */ #define CGF_LINES 2 /* these are to be printed on different lines */ #define CGF_HASDL 4 /* has display strings printed on separate lines */ -#define CGF_UNIQALL 8 /* remove all duplicates */ -#define CGF_UNIQCON 16 /* remove consecutive duplicates */ +#define CGF_UNIQALL 8 /* remove consecutive duplicates (if neither are set, */ +#define CGF_UNIQCON 16 /* don't deduplicate */ /* remove all dupes) */ #define CGF_PACKED 32 /* LIST_PACKED for this group */ #define CGF_ROWS 64 /* LIST_ROWS_FIRST for this group */ #define CGF_FILES 128 /* contains file names */ @@ -299,7 +299,7 @@ struct menuinfo { #define CAF_NOSORT 2 /* compadd -V: don't sort */ #define CAF_MATCH 4 /* compadd without -U: do matching */ #define CAF_UNIQCON 8 /* compadd -2: don't deduplicate */ -#define CAF_UNIQALL 16 /* compadd -1: deduplicate */ +#define CAF_UNIQALL 16 /* compadd -1: deduplicate consecutive only */ #define CAF_ARRAYS 32 /* compadd -a or -k: array/assoc parameter names */ #define CAF_KEYS 64 /* compadd -k: assoc parameter names */ #define CAF_ALL 128 /* compadd -C: _all_matches */ diff --git a/Src/Zle/compcore.c b/Src/Zle/compcore.c index c6deff7568..a9ace5587b 100644 --- a/Src/Zle/compcore.c +++ b/Src/Zle/compcore.c @@ -3254,10 +3254,13 @@ makearray(LinkList l, int type, int flags, int *np, int *nlp, int *llp) qsort((void *) rp, n, sizeof(Cmatch), (int (*) _((const void *, const void *)))matchcmp); + /* since the matches are sorted and the default is to remove + * all duplicates, -1 (remove only consecutive dupes) is a no-op, + * so this condition only checks for -2 */ if (!(flags & CGF_UNIQCON)) { int dup; - /* And delete the ones that occur more than once. */ + /* we did not pass -2 so go ahead and remove those dupes */ for (ap = cp = rp; *ap; ap++) { *cp++ = *ap; for (bp = ap; bp[1] && matcheq(*ap, bp[1]); bp++, n--); @@ -3279,7 +3282,9 @@ makearray(LinkList l, int type, int flags, int *np, int *nlp, int *llp) if ((*ap)->flags & (CMF_NOLIST | CMF_MULT)) nl++; } + /* used -O nosort or -V, don't sort */ } else { + /* didn't use -1 or -2, so remove all duplicates (inefficient) */ if (!(flags & CGF_UNIQALL) && !(flags & CGF_UNIQCON)) { int dup; @@ -3303,6 +3308,7 @@ makearray(LinkList l, int type, int flags, int *np, int *nlp, int *llp) (*ap)->flags |= CMF_FMULT; } } + /* passed -1 but not -2, so remove consecutive duplicates (efficient) */ } else if (!(flags & CGF_UNIQCON)) { int dup; -- 2.15.1
This implements my idea for sorting a temporary array and then using that for deduplication. This is fast enough that -2 isn't needed in _path_files, and will also help for potential other cases other than file completion. PS I realized a bit late that Cmatch is typedeffed to a pointer so the double pointer shenanigans are a bit pointless, but I'll leave reworking that until another evening... --- Src/Zle/comp.h | 1 + Src/Zle/compcore.c | 66 +++++++++++++++++++++++++++++++++++++----------------- 2 files changed, 46 insertions(+), 21 deletions(-) diff --git a/Src/Zle/comp.h b/Src/Zle/comp.h index a8480c2bac..2ca779fe53 100644 --- a/Src/Zle/comp.h +++ b/Src/Zle/comp.h @@ -140,6 +140,7 @@ struct cmatch { #define CMF_ALL (1<<13) /* a match representing all other matches */ #define CMF_DUMMY (1<<14) /* unselectable dummy match */ #define CMF_MORDER (1<<15) /* order by matches, not display strings */ +#define CMF_DELETE (1<<16) /* used for deduplication of unsorted matches, don't set */ /* Stuff for completion matcher control. */ diff --git a/Src/Zle/compcore.c b/Src/Zle/compcore.c index a9ace5587b..126cdd3ae9 100644 --- a/Src/Zle/compcore.c +++ b/Src/Zle/compcore.c @@ -3191,6 +3191,13 @@ matchcmp(Cmatch *a, Cmatch *b) matchorder & CGF_NUMSORT) ? SORTIT_NUMERICALLY : 0)); } +/**/ +static int +matchcmp_pointer(Cmatch **a, Cmatch **b) +{ + return matchcmp(*a, *b); +} + /* This tests whether two matches are equal (would produce the same * strings on the command line). */ @@ -3284,30 +3291,47 @@ makearray(LinkList l, int type, int flags, int *np, int *nlp, int *llp) } /* used -O nosort or -V, don't sort */ } else { - /* didn't use -1 or -2, so remove all duplicates (inefficient) */ + /* didn't use -1 or -2, so remove all duplicates (efficient) */ if (!(flags & CGF_UNIQALL) && !(flags & CGF_UNIQCON)) { - int dup; - - for (ap = rp; *ap; ap++) { - for (bp = cp = ap + 1; *bp; bp++) { - if (!matcheq(*ap, *bp)) - *cp++ = *bp; - else - n--; + int dup, i; + + /* To avoid O(n^2) here, sort a temporary list of pointers to the real array */ + /* TODO: this can probably just be a copy of the array, i forgot Cmatch is typedef to pointer */ + matchorder = flags; + Cmatch **sp, **asp; + sp = (Cmatch **) zhalloc((n + 1) * sizeof(Cmatch *)); + asp = sp; + for (i = 0; i < n; i++) + *asp++ = rp + i; + *asp = NULL; + qsort((void *) sp, n, sizeof(Cmatch *), + (int (*) _((const void *, const void *)))matchcmp_pointer); + for (asp = sp + 1; *asp; asp++) { + Cmatch *ap = asp[-1], *bp = asp[0]; + if (matcheq(*ap, *bp)) { + bp[0]->flags = CMF_DELETE; + } else if (!ap[0]->disp) { + /* Mark those, that would show the same string in the list. */ + /* Mikael: I haven't tested this other than commenting out matcheq above */ + Cmatch **bsp = sp; + for (dup = 0; bp[0] && !(bp[0])->disp && + !strcmp((*ap)->str, (bp[0])->str); bp = *++sp) { + (bp[0])->flags |= CMF_MULT; + dup = 1; + } + if (dup) + (*ap)->flags |= CMF_FMULT; } - *cp = NULL; - if (!(*ap)->disp) { - for (dup = 0, bp = ap + 1; *bp; bp++) - if (!(*bp)->disp && - !((*bp)->flags & CMF_MULT) && - !strcmp((*ap)->str, (*bp)->str)) { - (*bp)->flags |= CMF_MULT; - dup = 1; - } - if (dup) - (*ap)->flags |= CMF_FMULT; - } } + int n_orig = n; + for (bp = rp, ap = rp; bp < rp + n_orig; ap++, bp++) { + while (bp[0]->flags & CMF_DELETE) { + bp++; + n--; + } + *ap = *bp; + } + *ap = NULL; /* passed -1 but not -2, so remove consecutive duplicates (efficient) */ } else if (!(flags & CGF_UNIQCON)) { int dup; -- 2.15.1
Got rid of the pointless extra layer of pointers. --- Src/Zle/comp.h | 1 + Src/Zle/compcore.c | 54 ++++++++++++++++++++++++++++++++++-------------------- 2 files changed, 35 insertions(+), 20 deletions(-) diff --git a/Src/Zle/comp.h b/Src/Zle/comp.h index a8480c2bac..2ca779fe53 100644 --- a/Src/Zle/comp.h +++ b/Src/Zle/comp.h @@ -140,6 +140,7 @@ struct cmatch { #define CMF_ALL (1<<13) /* a match representing all other matches */ #define CMF_DUMMY (1<<14) /* unselectable dummy match */ #define CMF_MORDER (1<<15) /* order by matches, not display strings */ +#define CMF_DELETE (1<<16) /* used for deduplication of unsorted matches, don't set */ /* Stuff for completion matcher control. */ diff --git a/Src/Zle/compcore.c b/Src/Zle/compcore.c index a9ace5587b..0b5b22a306 100644 --- a/Src/Zle/compcore.c +++ b/Src/Zle/compcore.c @@ -3284,30 +3284,44 @@ makearray(LinkList l, int type, int flags, int *np, int *nlp, int *llp) } /* used -O nosort or -V, don't sort */ } else { - /* didn't use -1 or -2, so remove all duplicates (inefficient) */ + /* didn't use -1 or -2, so remove all duplicates (efficient) */ if (!(flags & CGF_UNIQALL) && !(flags & CGF_UNIQCON)) { - int dup; - - for (ap = rp; *ap; ap++) { - for (bp = cp = ap + 1; *bp; bp++) { - if (!matcheq(*ap, *bp)) - *cp++ = *bp; - else + int dup, i, del = 0; + + /* To avoid O(n^2) here, sort a copy of the list, then remove marked elements */ + matchorder = flags; + Cmatch *sp, *asp; + sp = (Cmatch *) zhalloc((n + 1) * sizeof(Cmatch)); + memcpy(sp, rp, (n + 1) * sizeof(Cmatch)); + qsort((void *) sp, n, sizeof(Cmatch), + (int (*) _((const void *, const void *)))matchcmp); + for (asp = sp + 1; *asp; asp++) { + Cmatch *ap = asp - 1, *bp = asp; + if (matcheq(*ap, *bp)) { + bp[0]->flags = CMF_DELETE; + del = 1; + } else if (!ap[0]->disp) { + /* Mark those, that would show the same string in the list. */ + for (dup = 0; bp[0] && !(bp[0])->disp && + !strcmp((*ap)->str, (bp[0])->str); bp = ++sp) { + (bp[0])->flags |= CMF_MULT; + dup = 1; + } + if (dup) + (*ap)->flags |= CMF_FMULT; + } + } + if (del) { + int n_orig = n; + for (bp = rp, ap = rp; bp < rp + n_orig; ap++, bp++) { + while (bp[0]->flags & CMF_DELETE) { + bp++; n--; + } + *ap = *bp; } - *cp = NULL; - if (!(*ap)->disp) { - for (dup = 0, bp = ap + 1; *bp; bp++) - if (!(*bp)->disp && - !((*bp)->flags & CMF_MULT) && - !strcmp((*ap)->str, (*bp)->str)) { - (*bp)->flags |= CMF_MULT; - dup = 1; - } - if (dup) - (*ap)->flags |= CMF_FMULT; - } } + *ap = NULL; /* passed -1 but not -2, so remove consecutive duplicates (efficient) */ } else if (!(flags & CGF_UNIQCON)) { int dup; -- 2.15.1
On Fri, Mar 25, 2022 at 6:44 PM Mikael Magnusson <mikachu@gmail.com> wrote:
>
> This implements my idea for sorting a temporary array and then using
> that for deduplication.
Minor nit, but after this patch I get:
compcore.c: In function ‘makearray’:
compcore.c:3289:26: warning: unused variable ‘i’ [-Wunused-variable]
3289 | int dup, i, del = 0;
| ^
On 3/30/22, Bart Schaefer <schaefer@brasslantern.com> wrote:
> On Fri, Mar 25, 2022 at 6:44 PM Mikael Magnusson <mikachu@gmail.com> wrote:
>>
>> This implements my idea for sorting a temporary array and then using
>> that for deduplication.
>
> Minor nit, but after this patch I get:
>
> compcore.c: In function ‘makearray’:
> compcore.c:3289:26: warning: unused variable ‘i’ [-Wunused-variable]
> 3289 | int dup, i, del = 0;
> | ^
i is actually used in this patch, but not in the fixed version where I
forgot to remove the declaration, thanks.
--
Mikael Magnusson