From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-lf0-f44.google.com ([209.85.215.44]) by ur; Sun Apr 10 17:25:11 EDT 2016 Received: by mail-lf0-f44.google.com with SMTP id j11so133213477lfb.1 for <9front@9front.org>; Sun, 10 Apr 2016 14:25:09 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:date:message-id:subject:from:to; bh=U6Og8u2NbckWwFqy7QWxxM2WaEIm092xd1lviJsM6Wo=; b=qxPG9UeTunTYUzwzaLxIJsc57seP7AlrXgWulx5Rv/bEAFRzF2ODSSO2aOBe06y9hy Ii64YTvL41tdjYOs2er8HqVOxQ3yXXCmtGH/hR09hjzhbhyyPqrKwcYcPr0qL6DDNJa5 AI8O+v50gdXuHbvBsaq+p7/rH/W1rqSGFLkj2+YfXY7FOzx1iVvh1WWr9ZLhN4mlTi9B jWjjl2GMtbAIGyups8UQuNkNH5meWawhC9oTM03Sqv7E8jDqQ9qWebKZjIdOugZFtagR CPDBLifj8QlHyaVbIbZYpZ3wwaaV2+OBzDrN6lX+OpYroAs6+EMPQuuAhO5Sot4Fx3hL lqUA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:date:message-id:subject:from:to; bh=U6Og8u2NbckWwFqy7QWxxM2WaEIm092xd1lviJsM6Wo=; b=CUcaJUWqOy1nyc585H0bl0WrLXJErYbhJ0d2y/WRoN14+qzToC4Vdl+GA/021zutwp YXVdhKdvBa6uIwCnWaS6L/SRqvEPDzL829lzK7GEDc4QdRwIBLppUq7RySU/8nZjNOf3 dEjnZeZOBkiGzTonRRSHJyFFaoxxv5JB6kGwiXIoSWqMuGqphMhnP2sqisIHOEHKEIu1 SdShkxTuXUox4WpKeLkN4l1ooqwOkqRheRAZo6ykKGQz5DV8FKbaG5P40J6AIsUQNZyZ eCO3wa3+1I+poE8ZUWs/68kwpAc5GUbGeUuCqSGzWekSPUvjKmMighJUxtW5Q3DSgNRv eSuw== X-Gm-Message-State: AOPr4FUkF6RETXkhtw4kx700ljzgAQoKBTxxJ8MKuLjAEwzpkb/RH4OtcSY6CWBUbWyjU3yjq/AeIDg1JIB4Cw== MIME-Version: 1.0 X-Received: by 10.25.208.80 with SMTP id h77mr706860lfg.11.1460323508308; Sun, 10 Apr 2016 14:25:08 -0700 (PDT) Received: by 10.25.20.224 with HTTP; Sun, 10 Apr 2016 14:25:08 -0700 (PDT) Date: Sun, 10 Apr 2016 16:25:08 -0500 Message-ID: Subject: New Regular Expression Implementation From: Benjamin Purcell To: 9front@9front.org Content-Type: text/plain; charset=UTF-8 List-ID: <9front.9front.org> List-Help: X-Glyph: ➈ X-Bullshit: content-addressed proxy-aware cache map/reduce controller I mentioned a couple weeks ago on #cat-v that I wrote a new regular expression implementation for 9front. A couple people told me to write an email to the list about it. It is a drop-in replacement for the existing libregexp. Here are the key differences: (1) It is faster. For typical regular expressions, this implementation is typically 2.6 to 3 times faster than the system implementation. For abnormal regular expressions--something like 'a?a?a?a?a?a?a?'--I have observed this implementation to be 55 times faster. (2) It is correct. The system implementation will fail to capture sub-matches for certain regular expressions--typically abnormal ones. The rune implementation fails faster than the character based one. This implementation will always capture sub-matches. (3) It is a smaller implementation. The system implementation's wc is: 1395 3919 27249 total This implementation's wc is: 1269 3052 21597 total (4) It is thread-safe/re-entrant. The system implementation references external variables during compilation (the execution system is thread-safe). This implementation is fully thread-safe at all times. I have not checked memory usage but I believe that it is more efficient for simple regular expressions and possibly less efficient for abnormal expressions that cause the system implementation to fail. A full copy of the source code is available at /n/contrib/spew/newlibregexp. If you are interested in performing your own comparison, the tests directory contains two programs to compare this implementation to the system's implementation. Usage of those is: regextest -lnp -r reps -c creps -m nsubm regex matchstr usage: reps is the number of times to run regex against the matchstr (default 1). creps is the number of times to compile the regex (default 1). nsubm is the number of submatches to capture (default 10). -l is to compile a literal regular expression. -n is to compile a regular expression where . does not match '\n'. -p is to print the compiled regular expression. sysregextest uses the system implementation. This implementation also includes a fmt routine to print compiled regular expressions. You can install it with e.g. fmtinstall('R', reprogfmt); The regular expressions 'foo|bar' and 'a+b*c?(foo)+|bar' are printed below as an example. 402070 OSAVE 0 402090 OSPLIT 4020b0 402130 4020b0 ORUNE f 4020d0 ORUNE o 4020f0 ORUNE o 402110 OJMP 402190 402130 ORUNE b 402150 ORUNE a 402170 ORUNE r 402190 OUNSAVE 0 4023f0 OSAVE 0 402410 OSPLIT 402430 402630 402430 ORUNE a 402450 OSPLIT 402430 402470 402470 OSAVE 1 402490 OSPLIT 4024b0 4024f0 4024b0 ORUNE b 4024d0 OJMP 402490 4024f0 OUNSAVE 1 402510 OSPLIT 402530 402550 402530 ORUNE c 402550 OSAVE 2 402570 ORUNE f 402590 ORUNE o 4025b0 ORUNE o 4025d0 OUNSAVE 2 4025f0 OSPLIT 402550 402610 402610 OJMP 402690 402630 ORUNE b 402650 ORUNE a 402670 ORUNE r 402690 OUNSAVE 0