From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 29360 invoked by alias); 5 Jun 2017 11:54:53 -0000 Mailing-List: contact zsh-workers-help@zsh.org; run by ezmlm Precedence: bulk X-No-Archive: yes List-Id: Zsh Workers List List-Post: List-Help: X-Seq: 41228 Received: (qmail 26499 invoked from network); 5 Jun 2017 11:54:53 -0000 X-Qmail-Scanner-Diagnostics: from mail-wm0-f45.google.com by f.primenet.com.au (envelope-from , uid 7791) with qmail-scanner-2.11 (clamdscan: 0.99.2/21882. spamassassin: 3.4.1. Clear:RC:0(74.125.82.45):SA:0(-0.0/5.0):. Processed in 1.666537 secs); 05 Jun 2017 11:54:53 -0000 X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on f.primenet.com.au X-Spam-Level: X-Spam-Status: No, score=-0.0 required=5.0 tests=FREEMAIL_FROM, RCVD_IN_DNSWL_NONE,RCVD_IN_MSPIKE_H3,RCVD_IN_MSPIKE_WL,SPF_PASS, T_DKIM_INVALID autolearn=unavailable autolearn_force=no version=3.4.1 X-Envelope-From: stephane.chazelas@gmail.com X-Qmail-Scanner-Mime-Attachments: | X-Qmail-Scanner-Zip-Files: | Received-SPF: pass (ns1.primenet.com.au: SPF record at _netblocks.google.com designates 74.125.82.45 as permitted sender) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=date:from:to:cc:subject:message-id:mail-followup-to:references :mime-version:content-disposition:in-reply-to:user-agent; bh=E1g9BKlreswAbv/94jLrrmyvOlli9z94wNBupB0l6Gs=; b=dTfboxDtI4Ro9BvKc+PJAipbKaMW2I8SJF7A02De/lVwEO94wnqOcCC/gR8Ss5NPKg gVUvqr4TumViTGjoXKmt9qgKKW/c4z7GKOXq/cFE4iW276kZwsGD+7+bbW7zDIo8zSzx IHSCENC33NU6zs47sBsPczNRQcbi/q//kMIFht+LWTGuCJxcrJ6mHlnkHy2u6LJzRJat ExiC3LMNLznBOEJpZsqqB5ibVfqLS5ByCPIBm3BwHjv4GvnOzcdoQOc+Awxex4ThO3MM WmWnfl1ebAaN6ybUycXgOSkVLDhcQb6C2w6s7l60TjnI0KsYyrAaDE/k6yejQo7DGCN5 Jsnw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id :mail-followup-to:references:mime-version:content-disposition :in-reply-to:user-agent; bh=E1g9BKlreswAbv/94jLrrmyvOlli9z94wNBupB0l6Gs=; b=txYYTVM8Gn5WD/soHqZ8HKYf9h9cJ8ksA8LStyF9Q+Gvk+jFnEjOpQSNHTOtDYJDBl 9RR3nMkuw/NrlyrdZlYtbRWCgX014yd0RR0QBlKY30z8DaYW1UJl0CoND4nLwT8rd733 uNdCs85gr4uboYvEtTEPNot0OjL8qNo+/Nf/JW7g/8cFpplHG2C9YMOjxtRB0Lfew0QW OnmuvQKkpdWiwfctJb7zJPpbaadwprFd0MOT/rcm82CPc21nMZyrS0D/CBm8hJW+0GVH cQJUW7ejL41Fu24VvOPwvOvBqivzwIHB74poobhLtmCNQBa5k0k5rdnLXyjKWvt9G3p4 VPqA== X-Gm-Message-State: AODbwcAgjcMfKawiUnUCc/40/F0NfVNCM9zAga6RvwZNa2iafJaKjTJl H8adk/9jlgIe509g X-Received: by 10.28.54.154 with SMTP id y26mr7249340wmh.53.1496663682905; Mon, 05 Jun 2017 04:54:42 -0700 (PDT) Date: Mon, 5 Jun 2017 12:54:39 +0100 From: Stephane Chazelas To: Bart Schaefer Cc: Zsh hackers list Subject: Re: Surprising behaviour with numeric glob sort Message-ID: <20170605115439.GA15325@chaz.gmail.com> Mail-Followup-To: Bart Schaefer , Zsh hackers list References: <20170531212453.GA31563@chaz.gmail.com> <170601152943.ZM4783@torch.brasslantern.com> <20170602090332.GA6574@chaz.gmail.com> <170602161905.ZM10488@torch.brasslantern.com> <20170603211645.GA17785@chaz.gmail.com> <170603170724.ZM15645@torch.brasslantern.com> <20170604173157.GB9094@chaz.gmail.com> <170604150135.ZM13291@torch.brasslantern.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <170604150135.ZM13291@torch.brasslantern.com> User-Agent: Mutt/1.5.24 (2015-08-30) 2017-06-04 15:01:35 -0700, Bart Schaefer: > On Jun 4, 6:31pm, Stephane Chazelas wrote: > } > } "Slow" (though probably quite negligible compared to strcoll() > } which already does things like in addition to a lot more hard > } wark) but working. > > According to one strcoll man page I have, the right thing to do is > convert all the strings with strxfrm and then strcmp those. > > It provides no advice on how to order the original array to match the > sorted result of the xfrm'd array (the transform is not reversible), > nor how to determine how much space to allocate for each transform. > > Zsh has the additional complication of needing to deal with strings > having embedded '\0' bytes, which neither strcoll nor strxfrm is able > to deal with. I'm not 100% confident that zsh's current algorithm > deals correct with this either. >>From what I can see (by using ltrace -e strcoll), zsh removes the NUL bytes before calling strcoll, so $'\0\0x' sorts the same as $'\0x' or 'x'. That's as if $'\0' had IGNORE for all weights in the collation order which I suppose is as valid as any. Per POSIX, such input would not be text, so "sorting" them may not be very meaningful. Same for strings that contain byte sequences that don't form valid characters where there's no saying what strcoll() may do with them. Having NUL sort before any other character would be preferable in the C locale that sorts by code point though (like to sort UTF-16 text based on codepoint). > A possible approach would be to pre-allocate a hash table and fill > it on the fly with keys the original strings and values the strxfrm > results. An additional strxfrm could be called on the trailing bits > after any embedded nul. Then sort the original array by comparing > the values in the hash. Doesn't solve the question of how much space > to reserve, but it would allow the current algorithm for picking out > numeric substrings to be used. You mean a Schwartzian transform (https://en.wikipedia.org/wiki/Schwartzian_transform) that sorts on (here using $'\0zsh\0-1.2\0x' as an example): { original <= $'\0zsh\0-1.2\0x' parts <= [{string: $'\0' . strxfrm("zsh") . $'\0' . strxfrm("-")}, {num: 1}, {string: strxfrm(".")}, {num: 2}, {string: $'\0' . strxfrm("x")}] }, ... With a comparison function that does memcmp() on the "string" parts and a number comparison on the "num" parts? Yes, that sounds like a good idea that trades memory for efficiency (by doing the strxfrm() hard work up front and only once), but if we're going to transform upfront anyway, the memory is traded off already. Or same using zero-padding { original <= $'\0zsh\0-1.2\0x' transformed <= $'\0' . strxfrm("zsh") . $'\0' . strxfrm("-") . "001" . strxfrm(".") . "002" . $'\0' . strxfrm("x")}] }, ... (and using memcmp() in the comparison function) With same benefits (a-1 a+2 a-3 sorted in that order where "-" and "+" are ignored in the first pass like without numericglobsort) and same drawback (need to find out the right padding width). > For parameter array sorting we could extend the (m) flag to indicate > that this transformation is required [it already means to count byte > widths for (l), (r), etc.] so as to avoid the overhead when it is not > needed. For globbing, we'd have to rely on something else such as > whether MULTIBYTE is set. Note that for globbing, the "numeric" sort applies after the "o+myfunc" or "oe:...:" transformation, so the strings to sort on may still contain all sorts of things like NUL bytes and byte sequences that don't form valid characters even if they were not in the file names in the first place. Like in a by_content() REPLY=$mapfile[$REPLY] echo *(no+by_content) -- Stephane