public inbox archive for pandoc-discuss@googlegroups.com
 help / color / mirror / Atom feed
* LinkAuto.hs: automatically turning regexp-matching strings into Links
@ 2021-06-28  2:36 Gwern Branwen
       [not found] ` <CAMwO0gy25T38dmNG4y514n5zWZCONP21Woxi5YX6EHojuvtMaQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 3+ messages in thread
From: Gwern Branwen @ 2021-06-28  2:36 UTC (permalink / raw)
  To: pandoc-discuss

[-- Attachment #1: Type: text/plain, Size: 5322 bytes --]

LinkAuto.hs is a Pandoc library I am prototyping:
https://www.gwern.net/static/build/LinkAuto.hs

It lets you define a regexp and a corresponding URL, and matching text
in a Pandoc document will be turned into that text but as a Link to
the URL. It is intended for annotating technical jargon, terms, proper
names, phrases, etc. Because it is automated, the dictionary can be
updated at any time and all documents will be updated, without any
manual annotation at all (as required by all competitors I am aware
of).
A link is inserted only if it is not already present in a document,
and after insertion, subsequent instances are left unlinked (although
this can easily be changed).
It appears to be *mostly* correct, barring a few odd corner cases like
definitions being inserted inside Header elements, breaking the HTML.
It is also not *too* slow. With 528 regexp rewrites defined, my site
compilation is maybe only 2-3x slower.
Attached is a screenshot of a popup in which all links have been added
automatically by LinkAuto.hs.

I wrote this to help define technical jargon on gwern.net in a
site-wide consistent way. This includes the thousands of
auto-generated abstracts from Arxiv, Biorxiv, etc. I particularly
wanted to define all the machine learning terms - it is just too
difficult to define *every* term by hand, looking up URLs each time,
when an Arxiv abstract might mention a dozen of them ("We benchmark X,
Y, Z on dataset A with metrics B, C, D, finding another instance of
the E law...").
No one is going to annotate those by hand, not even with some
search-and-replace support, and certainly will not be able to go back
and annotate all past examples, or add a new definition and annotate
all past examples of the new one, not with thousands of pages and
references! So, it needs to be automated, site-wide, not require
manual annotations beyond the definition itself, and allow new
definitions to be added at any time.

This, unfortunately, implies parsing all of the raw Strs, with the
further logical choice of using regexps. (I'm not familiar with parser
combinators, and from what I've seen of them, they would be quite
verbose in this application.)

So the basic idea is to take a Pandoc AST, squash together all of the
Str/Space nodes (because if you leave the Space nodes in, how do you
match multi-word phrases?) to produce long Str runs, which you can run
each of a list of regexps over, break apart when there is a match,
substitute in, reconstitute, and continue matching regexps.
The links themselves are then handled as normal links by the rest of
the site infrastructure, receiving popups, being smallcaps-formatted
as necessary, etc.
There are many small details to get right here: for example, you need
to delimit regexps by punctuation & whitespace, so " BigGAN's", "
BigGAN", and " GAN." all get rewritten as the user would expect them
to.

The straightforward approach of matching every regexp against every
Str proves to be infeasibly slow (it would lead to ~17h compile-times
for gwern.net, getting worse with every definition & page). Matching a
single 'master' regexp, to try to skip doing more processing of most
nodes, by using '|' alternation to concatenate each regexp, runs into
a nasty exponential explosion in RAM - with even a few hundred, 75GB
RAM is inadequate. (Apparently most regexp engines have some sort of
quadratic term in the *total* regexp length, even though that appears
totally unnecessary in a case like this and the regexp engine is just
doing a bad job optimizing the compiled regexp; a 'regexp trie' would
solve this, probably, but the only such library in Haskell bitrot long
ago.) It is also still quite slow.

To optimize it, I preprocess each document to try to throw away as
many regexps as possible.
First, the document is queried for its Links; if a Link in the regexp
rewrites is already in the document, then that regexp can be skipped.
Second, the document is compiled to the 'plain text' version (with
very long lines), showing only user-visible text, and the regexps are
run on that, and any regexp which doesn't trigger a (possibly false)
positive is thrown away as well. It is fast to run a regexp once on an
entire document, than to run it on every substring. So for most
documents, lacking any possible hits, they get R regexp scans through
the plain text version as a single big Text string, and then no
further work needs to be done. Regexps are quite fast, so R scans is
cheap. (It's Strs^R which is the problem.) This brought it into the
realm of usability for me.
If I need further efficiency, because R keeps increasing to the point
where R scans is too expensive, I can retry the master regexp
approach, perhaps in blocks of 20 or 50 regexps (whatever avoids the
exponential blowup), and divide-and-conquer to find out which of the
member regexps actually triggered the hit and should be kept.

-- 
gwern

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/CAMwO0gy25T38dmNG4y514n5zWZCONP21Woxi5YX6EHojuvtMaQ%40mail.gmail.com.

[-- Attachment #2: xwd-16248477022199362.png --]
[-- Type: image/png, Size: 353429 bytes --]

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2021-06-29 22:52 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-28  2:36 LinkAuto.hs: automatically turning regexp-matching strings into Links Gwern Branwen
     [not found] ` <CAMwO0gy25T38dmNG4y514n5zWZCONP21Woxi5YX6EHojuvtMaQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2021-06-28 17:06   ` John MacFarlane
     [not found]     ` <m2lf6tncjw.fsf-jF64zX8BO0+FqBokazbCQ6OPv3vYUT2dxr7GGTnW70NeoWH0uzbU5w@public.gmane.org>
2021-06-29 22:52       ` Gwern Branwen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).