From: Rich Felker
To: musl@lists.openwall.com
Subject: Further strstr findings
Date: Sat, 17 Nov 2018 17:09:38 -0500
Message-ID: <20181117220938.GN5150@brightrain.aerifal.cx>

I've been doing some additional reading on the topic, mainly from the
following sources which I'm citing here for reference:

[1] Galil and Seiferas. Time-Space-Optimal String Matching.
https://urresearch.rochester.edu/fileDownloadForInstitutionalItem.action?itemId=10186&itemFileId=22371

[2] Breslauer. Saving Comparisons in the Crochemore-Perrin String
Matching Algorithm.
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.34.6641&rep=rep1&type=pdf

[3] Breslauer, Grossi, and Mignosi.
Simple real-time constant-space string matching.
https://www.sciencedirect.com/science/article/pii/S0304397512010900/pdf?md5=1bc807e14b362af266ff04b78fa8c4df&pid=1-s2.0-S0304397512010900-main.pdf&_valck=1

In particular, [1] offers a big-O-equivalent of Two-Way a decade
earlier, with the same basic search-phase concept but a different (and
probably more expensive, though I'm not sure) factorization approach.
[2] is interesting in that it provides an approach to optimize the
skip forward to prevent repeated comparisons.

However, I think the academic work in this area is mostly misplaced.
The interesting real-world problem does not seem to be further
optimizing pathological needles with nasty periodicity properties. We
already assure non-pathologically-bad performance for them by virtue
of Two-Way being O(n) with a bound of roughly 2n comparisons. Rather,
the interesting problem is avoiding penalizing common real-world
non-periodic (or minimally periodic) needles for the sake of ensuring
O(n) performance even in the worst case.

One of the things that always struck me as ugly about Two-Way is that
the factorization that pops out depends on a choice of ordering on the
alphabet. In the non-periodic needle case, the lower bound on the
period in turn depends on the factorization, which means the
performance depends on the factorization, which in turn depends on an
arbitrary choice of ordering. As an example, assuming an alphabet
[a-z] in the standard order, taking a non-periodic needle of length m
in which "a" and "z" do not appear and inserting "az" in the middle of
it ensures that the shorter maximal suffix will have length m/2+1 and
that the estimated period will be m/2+2, vs the real period of m+2.

This inspires a motivation to make the choice of order on the alphabet
non-arbitrary, and tailor it to achieving an optimal factorization. So
far the most promising approach I've found is to order the alphabet
according to first appearance in the needle.
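
To make the ordering concrete, here is a rough sketch in C. It builds
a rank table from first appearance in the needle and runs the standard
Crochemore-Perrin style maximal-suffix loop with byte comparisons
replaced by rank comparisons. The names first_appearance_rank and
maximal_suffix and the rev flag are made up for illustration; this is
not taken from any existing strstr implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: rank[c] is the 1-based order in which byte c
 * first appears in the needle n[0..l), or 0 if c does not appear. */
static void first_appearance_rank(const unsigned char *n, size_t l,
                                  unsigned short rank[256])
{
    unsigned short next = 1;
    memset(rank, 0, 256 * sizeof rank[0]);
    for (size_t i = 0; i < l; i++)
        if (!rank[n[i]]) rank[n[i]] = next++;
}

/* Hypothetical helper: returns the start index of the maximal suffix
 * of n[0..l) under the order given by rank (reversed if rev is
 * nonzero) and stores the period of that suffix in *period. */
static size_t maximal_suffix(const unsigned char *n, size_t l,
                             const unsigned short rank[256], int rev,
                             size_t *period)
{
    assert(n && rank && period);
    /* ip is "one before" the candidate start; it begins at -1 and
     * relies on well-defined unsigned wraparound in ip + k. */
    size_t ip = (size_t)-1, jp = 0, k = 1, p = 1;
    while (jp + k < l) {
        unsigned a = rank[n[ip + k]], b = rank[n[jp + k]];
        if (a == b) {
            /* Ranks are distinct per byte, so equal rank means equal byte. */
            if (k == p) { jp += p; k = 1; }
            else k++;
        } else if (rev ? a < b : a > b) {
            /* Candidate at ip still wins; extend the known period. */
            jp += k; k = 1; p = jp - ip;
        } else {
            /* Suffix at jp beats the candidate; restart from there. */
            ip = jp++; k = p = 1;
        }
    }
    *period = p;
    return ip + 1;
}
```

For the needle "abcd", the forward order gives a maximal suffix
starting at index 3 (the right factor "d", length 1), and the reversed
order gives index 0 with period 4, i.e. the whole needle.
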
For typical real-world needles, this results in the whole needle being
its own maximal suffix in one direction (in which case, the exact
period of the whole needle falls out as a side effect, allowing
optimal skip-forward if this period is kept rather than using the
estimate), and a very short maximal suffix in the other direction,
which usually becomes the right factor -- the length of this maximal
suffix is bounded by the distance of the last first appearance of a
new character from the end of the needle. In particular, if the last
character is one that has not appeared before in the needle, the
needle factors into components of length m-1 and 1, and strchr can be
used to search for the right factor.

This choice of order is still only heuristic; in particular, it does
not help when the tail of the needle is all repetitions of characters
that appeared early in the needle, which includes all the canonical
pathological cases that can be realized with small (e.g. just
2-character) alphabets. However, it still does have the nice property
of being invariant under permutations of the alphabet.

Note that the special case where the right factor will have length 1
can be detected early, allowing the whole MS-decomposition to be
skipped. When building the order for the alphabet, if the last slot of
the needle is a character that was not seen before, period=m and
suffix_pos=m-1 can be inferred automatically. This does nothing for
the search phase but it does save on setup time.

Rich
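
A minimal sketch of the early-out for the length-1 right factor, in C.
The function name last_char_is_new is hypothetical, and memchr is used
here only for brevity; in practice the check would fall out of the
same pass that builds the alphabet order.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 and fills in *period and *suffix_pos
 * if the last byte of the needle n[0..m) does not occur earlier in
 * it. Returns 0 otherwise, in which case the full maximal-suffix
 * decomposition is still needed. */
static int last_char_is_new(const unsigned char *n, size_t m,
                            size_t *period, size_t *suffix_pos)
{
    assert(n && period && suffix_pos);
    if (m == 0) return 0;
    /* Any period p < m would force n[m-1] == n[m-1-p], so a unique
     * last byte means the period of the whole needle is m. */
    if (memchr(n, n[m - 1], m - 1)) return 0;
    *period = m;
    *suffix_pos = m - 1;
    return 1;
}
```

With the needle "hello!" this reports period 6 and suffix_pos 5; with
"abca" it returns 0 because "a" already appeared.
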