public inbox archive for pandoc-discuss@googlegroups.com
From: John MacFarlane <jgm-TVLZxgkOlNX2fBVCVOL8/A@public.gmane.org>
To: Gwern Branwen <gwern-v26ZT+9V8bxeoWH0uzbU5w@public.gmane.org>,
	pandoc-discuss
	<pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
Subject: Re: LinkAuto.hs: automatically turning regexp-matching strings into Links
Date: Mon, 28 Jun 2021 10:06:59 -0700
Message-ID: <m2lf6tncjw.fsf@MacBook-Pro-2.hsd1.ca.comcast.net>
In-Reply-To: <CAMwO0gy25T38dmNG4y514n5zWZCONP21Woxi5YX6EHojuvtMaQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>


Instead of squashing strings and using a string-based regex, you
could try Text.Regex.Applicative (from the regex-applicative
package), which provides a type

data RE s a

so you could define regexes of type

RE Inline Inline

or

RE Inline MyLinkStructure

or whatever.  You could write a parser that takes standard regular
expression syntax and converts it into these.
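
For instance, something like this (an untested sketch; strTok,
spaceTok, and phrase are just illustrative helpers, but psym and
replace are real regex-applicative combinators):

    {-# LANGUAGE OverloadedStrings #-}
    import Data.Text (Text)
    import Text.Pandoc.Definition
    import Text.Regex.Applicative (RE, psym, replace)

    -- Match a Str node containing exactly the given text.
    strTok :: Text -> RE Inline Inline
    strTok t = psym (== Str t)

    -- Match a Space node.
    spaceTok :: RE Inline Inline
    spaceTok = psym (== Space)

    -- A two-word phrase rewritten into a Link; 'replace' wants a
    -- RE Inline [Inline], so the result is a singleton list.
    phrase :: Text -> Text -> Text -> RE Inline [Inline]
    phrase w1 w2 url =
      (\a s b -> [Link nullAttr [a, s, b] (url, "")])
        <$> strTok w1 <*> spaceTok <*> strTok w2

    -- Rewrite every match in a run of inlines, leaving the rest alone.
    linkify :: [Inline] -> [Inline]
    linkify = replace (phrase "batch" "normalization"
                         "https://en.wikipedia.org/wiki/Batch_normalization")

That keeps the Space nodes intact and avoids the squash/reconstitute
round-trip entirely.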

Gwern Branwen <gwern-v26ZT+9V8bxeoWH0uzbU5w@public.gmane.org> writes:

> LinkAuto.hs is a Pandoc library I am prototyping:
> https://www.gwern.net/static/build/LinkAuto.hs
>
> It lets you define a regexp and a corresponding URL; matching text
> in a Pandoc document is then turned into a Link to that URL. It is
> intended for annotating technical jargon, terms, proper names,
> phrases, etc. Because it is automated, the dictionary can be
> updated at any time and all documents will be updated, without any
> manual annotation at all (which every competing tool I am aware of
> requires).
> A link is inserted only if it is not already present in the
> document; after insertion, subsequent instances are left unlinked
> (although this can easily be changed).
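> The rewrite dictionary is, roughly, just a list of (regexp, URL)
> pairs, something like this (illustrative entries only, not the
> real dictionary):
>
>     {-# LANGUAGE OverloadedStrings #-}
>     import qualified Data.Text as T
>
>     definitions :: [(T.Text, T.Text)]
>     definitions =
>       [ ("BigGAN",              "https://arxiv.org/abs/1809.11096")
>       , ("batch normalization", "https://arxiv.org/abs/1502.03167")
>       ]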
> It appears to be *mostly* correct, barring a few odd corner cases,
> like definitions being inserted inside Header elements and breaking
> the HTML. It is also not *too* slow: with 528 regexp rewrites
> defined, my site compilation is maybe only 2-3x slower.
> Attached is a screenshot of a popup in which all links have been added
> automatically by LinkAuto.hs.
>
> I wrote this to help define technical jargon on gwern.net in a
> site-wide consistent way. This includes the thousands of
> auto-generated abstracts from Arxiv, Biorxiv, etc. I particularly
> wanted to define all the machine learning terms - it is just too
> difficult to define *every* term by hand, looking up URLs each time,
> when an Arxiv abstract might mention a dozen of them ("We benchmark X,
> Y, Z on dataset A with metrics B, C, D, finding another instance of
> the E law...").
> No one is going to annotate those by hand, not even with
> search-and-replace support, and certainly no one will go back and
> annotate all past examples, or annotate every existing instance of
> each newly added definition - not with thousands of pages and
> references! So it needs to be automated and site-wide, require no
> manual annotation beyond the definition itself, and allow new
> definitions to be added at any time.
>
> This, unfortunately, implies parsing all of the raw Strs, with the
> further logical choice of using regexps. (I'm not familiar with parser
> combinators, and from what I've seen of them, they would be quite
> verbose in this application.)
>
> So the basic idea is: take a Pandoc AST; squash together all of the
> Str/Space nodes (if you leave the Space nodes in, how would you
> match multi-word phrases?) to produce long Str runs; run each of a
> list of regexps over those runs; break a run apart wherever there
> is a match; substitute the Link in; reconstitute; and continue
> matching the remaining regexps. (A sketch of the squashing step is
> below.)
> The links themselves are then handled as normal links by the rest of
> the site infrastructure, receiving popups, being smallcaps-formatted
> as necessary, etc.
> There are many small details to get right here: for example, you
> need to delimit regexps by punctuation & whitespace, so that
> " BigGAN's", " BigGAN", and " GAN." all get rewritten as the user
> would expect.
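> A simplified sketch of the squashing step (the real code handles
> more node types, but this is the shape of it):
>
>     {-# LANGUAGE OverloadedStrings #-}
>     import Text.Pandoc.Definition (Inline(..))
>
>     -- Merge adjacent Str/Space runs into single long Strs, so that
>     -- multi-word regexps can match across former word boundaries.
>     squash :: [Inline] -> [Inline]
>     squash (Str a : Str b : rest) = squash (Str (a <> b)   : rest)
>     squash (Str a : Space : rest) = squash (Str (a <> " ") : rest)
>     squash (Space : Str b : rest) = squash (Str (" " <> b) : rest)
>     squash (x : rest)             = x : squash rest
>     squash []                     = []
>
> This gets applied document-wide with 'walk' from Text.Pandoc.Walk.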
>
> The straightforward approach of matching every regexp against every
> Str proves infeasibly slow (it would lead to ~17h compile-times for
> gwern.net, getting worse with every definition & page). Matching a
> single 'master' regexp - using '|' alternation to concatenate every
> regexp, so that most nodes can be skipped without further
> processing - runs into a nasty exponential explosion in RAM: with
> even a few hundred regexps, 75GB of RAM is inadequate. (Apparently
> most regexp engines have some sort of quadratic term in the *total*
> regexp length, even though that appears totally unnecessary in a
> case like this; the engine is just doing a bad job of optimizing
> the compiled regexp. A 'regexp trie' would probably solve this, but
> the only such Haskell library bitrotted long ago.) It is also still
> quite slow.
>
> To optimize it, I preprocess each document to try to throw away as
> many regexps as possible (sketched below).
> First, the document is queried for its Links; if a Link in the
> regexp rewrites is already present in the document, then that
> regexp can be skipped. Second, the document is compiled to the
> 'plain text' version (with very long lines), showing only
> user-visible text, and the regexps are run on that; any regexp
> which doesn't trigger a (possibly false) positive is thrown away as
> well. It is far faster to run a regexp once over an entire document
> than to run it on every substring. So most documents, having no
> possible hits, get just R regexp scans through the plain-text
> version as a single big Text string, and then no further work needs
> to be done. Regexps are quite fast, so R scans is cheap. (It's
> Strs × R which is the problem.) This brought it into the realm of
> usability for me.
> If I need further efficiency, because R keeps increasing to the
> point where R scans become too expensive, I can retry the master
> regexp approach, perhaps in blocks of 20 or 50 regexps (whatever
> avoids the exponential blowup), and divide-and-conquer to find out
> which of the member regexps actually triggered the hit and should
> be kept.
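> The two pruning passes look something like this (simplified; it
> assumes regex-tdfa's Text instances for (=~), and the rule shape
> from above):
>
>     import qualified Data.Text as T
>     import Text.Pandoc.Definition
>     import Text.Pandoc.Walk (query)
>     import Text.Regex.TDFA ((=~))
>
>     -- All link targets already present in the document.
>     existingURLs :: Pandoc -> [T.Text]
>     existingURLs = query $ \i -> case i of
>       Link _ _ (url, _) -> [url]
>       _                 -> []
>
>     -- Keep only rules whose URL is absent from the document and
>     -- whose regexp gets at least one (possibly false-positive) hit
>     -- on the plain-text rendering.
>     pruneRules :: Pandoc -> T.Text
>                -> [(T.Text, T.Text)] -> [(T.Text, T.Text)]
>     pruneRules doc plain rules =
>       [ (re, url) | (re, url) <- rules
>                   , url `notElem` existingURLs doc
>                   , plain =~ re ]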
>
> -- 
> gwern


