LinkAuto.hs: automatically turning regexp-matching strings into Links

public inbox archive for pandoc-discuss@googlegroups.com
 help / color / mirror / Atom feed

* LinkAuto.hs: automatically turning regexp-matching strings into Links
@ 2021-06-28  2:36 Gwern Branwen
       [not found] ` <CAMwO0gy25T38dmNG4y514n5zWZCONP21Woxi5YX6EHojuvtMaQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 3+ messages in thread
From: Gwern Branwen @ 2021-06-28  2:36 UTC (permalink / raw)
  To: pandoc-discuss

[-- Attachment #1: Type: text/plain, Size: 5322 bytes --]

LinkAuto.hs is a Pandoc library I am prototyping:
https://www.gwern.net/static/build/LinkAuto.hs

It lets you define a regexp and a corresponding URL, and matching text
in a Pandoc document will be turned into that text but as a Link to
the URL. It is intended for annotating technical jargon, terms, proper
names, phrases, etc. Because it is automated, the dictionary can be
updated at any time and all documents will be updated, without any
manual annotation at all (as required by all competitors I am aware
of).
A link is inserted only if it is not already present in a document,
and after insertion, subsequent instances are left unlinked (although
this can easily be changed).
It appears to be *mostly* correct, barring a few odd corner cases like
definitions being inserted inside Header elements, breaking the HTML.
It is also not *too* slow. With 528 regexp rewrites defined, my site
compilation is maybe only 2-3x slower.
Attached is a screenshot of a popup in which all links have been added
automatically by LinkAuto.hs.

I wrote this to help define technical jargon on gwern.net in a
site-wide consistent way. This includes the thousands of
auto-generated abstracts from Arxiv, Biorxiv, etc. I particularly
wanted to define all the machine learning terms - it is just too
difficult to define *every* term by hand, looking up URLs each time,
when an Arxiv abstract might mention a dozen of them ("We benchmark X,
Y, Z on dataset A with metrics B, C, D, finding another instance of
the E law...").
No one is going to annotate those by hand, not even with some
search-and-replace support, and certainly will not be able to go back
and annotate all past examples, or add a new definition and annotate
all past examples of the new one, not with thousands of pages and
references! So, it needs to be automated, site-wide, not require
manual annotations beyond the definition itself, and allow new
definitions to be added at any time.

This, unfortunately, implies parsing all of the raw Strs, with the
further logical choice of using regexps. (I'm not familiar with parser
combinators, and from what I've seen of them, they would be quite
verbose in this application.)

So the basic idea is to take a Pandoc AST, squash together all of the
Str/Space nodes (because if you leave the Space nodes in, how do you
match multi-word phrases?) to produce long Str runs, which you can run
each of a list of regexps over, break apart when there is a match,
substitute in, reconstitute, and continue matching regexps.
The links themselves are then handled as normal links by the rest of
the site infrastructure, receiving popups, being smallcaps-formatted
as necessary, etc.
There are many small details to get right here: for example, you need
to delimit regexps by punctuation & whitespace, so " BigGAN's", "
BigGAN", and " GAN." all get rewritten as the user would expect them
to.

The straightforward approach of matching every regexp against every
Str proves to be infeasibly slow (it would lead to ~17h compile-times
for gwern.net, getting worse with every definition & page). Matching a
single 'master' regexp, to try to skip doing more processing of most
nodes, by using '|' alternation to concatenate each regexp, runs into
a nasty exponential explosion in RAM - with even a few hundred, 75GB
RAM is inadequate. (Apparently most regexp engines have some sort of
quadratic term in the *total* regexp length, even though that appears
totally unnecessary in a case like this and the regexp engine is just
doing a bad job optimizing the compiled regexp; a 'regexp trie' would
solve this, probably, but the only such library in Haskell bitrot long
ago.) It is also still quite slow.

To optimize it, I preprocess each document to try to throw away as
many regexps as possible.
First, the document is queried for its Links; if a Link in the regexp
rewrites is already in the document, then that regexp can be skipped.
Second, the document is compiled to the 'plain text' version (with
very long lines), showing only user-visible text, and the regexps are
run on that, and any regexp which doesn't trigger a (possibly false)
positive is thrown away as well. It is fast to run a regexp once on an
entire document, than to run it on every substring. So for most
documents, lacking any possible hits, they get R regexp scans through
the plain text version as a single big Text string, and then no
further work needs to be done. Regexps are quite fast, so R scans is
cheap. (It's Strs^R which is the problem.) This brought it into the
realm of usability for me.
If I need further efficiency, because R keeps increasing to the point
where R scans is too expensive, I can retry the master regexp
approach, perhaps in blocks of 20 or 50 regexps (whatever avoids the
exponential blowup), and divide-and-conquer to find out which of the
member regexps actually triggered the hit and should be kept.

-- 
gwern

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/CAMwO0gy25T38dmNG4y514n5zWZCONP21Woxi5YX6EHojuvtMaQ%40mail.gmail.com.

[-- Attachment #2: xwd-16248477022199362.png --]
[-- Type: image/png, Size: 353429 bytes --]

^ permalink raw reply	[flat|nested] 3+ messages in thread

[parent not found: <CAMwO0gy25T38dmNG4y514n5zWZCONP21Woxi5YX6EHojuvtMaQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]

* Re: LinkAuto.hs: automatically turning regexp-matching strings into Links
       [not found] ` <CAMwO0gy25T38dmNG4y514n5zWZCONP21Woxi5YX6EHojuvtMaQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2021-06-28 17:06   ` John MacFarlane
       [not found]     ` <m2lf6tncjw.fsf-jF64zX8BO0+FqBokazbCQ6OPv3vYUT2dxr7GGTnW70NeoWH0uzbU5w@public.gmane.org>
  0 siblings, 1 reply; 3+ messages in thread
From: John MacFarlane @ 2021-06-28 17:06 UTC (permalink / raw)
  To: Gwern Branwen, pandoc-discuss


Instead of squashing strings and using a string-based regex, you
could try using Text.Regex.Applicative (regex-applicative), which
has a type

data RE s a

so you could define regexes of type

RE Inline Inline

or

RE Inline MyLinkStructure

or whatever.  You could create a parser to take standard regex
expressions and convert them to these.

Gwern Branwen <gwern-v26ZT+9V8bxeoWH0uzbU5w@public.gmane.org> writes:

> LinkAuto.hs is a Pandoc library I am prototyping:
> https://www.gwern.net/static/build/LinkAuto.hs
>
> It lets you define a regexp and a corresponding URL, and matching text
> in a Pandoc document will be turned into that text but as a Link to
> the URL. It is intended for annotating technical jargon, terms, proper
> names, phrases, etc. Because it is automated, the dictionary can be
> updated at any time and all documents will be updated, without any
> manual annotation at all (as required by all competitors I am aware
> of).
> A link is inserted only if it is not already present in a document,
> and after insertion, subsequent instances are left unlinked (although
> this can easily be changed).
> It appears to be *mostly* correct, barring a few odd corner cases like
> definitions being inserted inside Header elements, breaking the HTML.
> It is also not *too* slow. With 528 regexp rewrites defined, my site
> compilation is maybe only 2-3x slower.
> Attached is a screenshot of a popup in which all links have been added
> automatically by LinkAuto.hs.
>
> I wrote this to help define technical jargon on gwern.net in a
> site-wide consistent way. This includes the thousands of
> auto-generated abstracts from Arxiv, Biorxiv, etc. I particularly
> wanted to define all the machine learning terms - it is just too
> difficult to define *every* term by hand, looking up URLs each time,
> when an Arxiv abstract might mention a dozen of them ("We benchmark X,
> Y, Z on dataset A with metrics B, C, D, finding another instance of
> the E law...").
> No one is going to annotate those by hand, not even with some
> search-and-replace support, and certainly will not be able to go back
> and annotate all past examples, or add a new definition and annotate
> all past examples of the new one, not with thousands of pages and
> references! So, it needs to be automated, site-wide, not require
> manual annotations beyond the definition itself, and allow new
> definitions to be added at any time.
>
> This, unfortunately, implies parsing all of the raw Strs, with the
> further logical choice of using regexps. (I'm not familiar with parser
> combinators, and from what I've seen of them, they would be quite
> verbose in this application.)
>
> So the basic idea is to take a Pandoc AST, squash together all of the
> Str/Space nodes (because if you leave the Space nodes in, how do you
> match multi-word phrases?) to produce long Str runs, which you can run
> each of a list of regexps over, break apart when there is a match,
> substitute in, reconstitute, and continue matching regexps.
> The links themselves are then handled as normal links by the rest of
> the site infrastructure, receiving popups, being smallcaps-formatted
> as necessary, etc.
> There are many small details to get right here: for example, you need
> to delimit regexps by punctuation & whitespace, so " BigGAN's", "
> BigGAN", and " GAN." all get rewritten as the user would expect them
> to.
>
> The straightforward approach of matching every regexp against every
> Str proves to be infeasibly slow (it would lead to ~17h compile-times
> for gwern.net, getting worse with every definition & page). Matching a
> single 'master' regexp, to try to skip doing more processing of most
> nodes, by using '|' alternation to concatenate each regexp, runs into
> a nasty exponential explosion in RAM - with even a few hundred, 75GB
> RAM is inadequate. (Apparently most regexp engines have some sort of
> quadratic term in the *total* regexp length, even though that appears
> totally unnecessary in a case like this and the regexp engine is just
> doing a bad job optimizing the compiled regexp; a 'regexp trie' would
> solve this, probably, but the only such library in Haskell bitrot long
> ago.) It is also still quite slow.
>
> To optimize it, I preprocess each document to try to throw away as
> many regexps as possible.
> First, the document is queried for its Links; if a Link in the regexp
> rewrites is already in the document, then that regexp can be skipped.
> Second, the document is compiled to the 'plain text' version (with
> very long lines), showing only user-visible text, and the regexps are
> run on that, and any regexp which doesn't trigger a (possibly false)
> positive is thrown away as well. It is fast to run a regexp once on an
> entire document, than to run it on every substring. So for most
> documents, lacking any possible hits, they get R regexp scans through
> the plain text version as a single big Text string, and then no
> further work needs to be done. Regexps are quite fast, so R scans is
> cheap. (It's Strs^R which is the problem.) This brought it into the
> realm of usability for me.
> If I need further efficiency, because R keeps increasing to the point
> where R scans is too expensive, I can retry the master regexp
> approach, perhaps in blocks of 20 or 50 regexps (whatever avoids the
> exponential blowup), and divide-and-conquer to find out which of the
> member regexps actually triggered the hit and should be kept.
>
> -- 
> gwern
>
> -- 
> You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
> To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/CAMwO0gy25T38dmNG4y514n5zWZCONP21Woxi5YX6EHojuvtMaQ%40mail.gmail.com.


^ permalink raw reply	[flat|nested] 3+ messages in thread

[parent not found: <m2lf6tncjw.fsf-jF64zX8BO0+FqBokazbCQ6OPv3vYUT2dxr7GGTnW70NeoWH0uzbU5w@public.gmane.org>]

* Re: LinkAuto.hs: automatically turning regexp-matching strings into Links
       [not found]     ` <m2lf6tncjw.fsf-jF64zX8BO0+FqBokazbCQ6OPv3vYUT2dxr7GGTnW70NeoWH0uzbU5w@public.gmane.org>
@ 2021-06-29 22:52       ` Gwern Branwen
  0 siblings, 0 replies; 3+ messages in thread
From: Gwern Branwen @ 2021-06-29 22:52 UTC (permalink / raw)
  To: pandoc-discuss

On Mon, Jun 28, 2021 at 1:07 PM John MacFarlane <jgm-TVLZxgkOlNX2fBVCVOL8/A@public.gmane.org> wrote:
> You could create a parser to take standard regex
> expressions and convert them to these.

Perhaps. That would be above my paygrade, however, and I suspect the
end result would be both buggy and a horror to use or debug in the
worst Haskell type zoo fashion. The concrete string approach is
relatively straightforward, and I had the merging code mostly worked
out from my previous failed attempt at a rewrite library.

-- 
gwern
https://www.gwern.net

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2021-06-29 22:52 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-28  2:36 LinkAuto.hs: automatically turning regexp-matching strings into Links Gwern Branwen
     [not found] ` <CAMwO0gy25T38dmNG4y514n5zWZCONP21Woxi5YX6EHojuvtMaQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2021-06-28 17:06   ` John MacFarlane
     [not found]     ` <m2lf6tncjw.fsf-jF64zX8BO0+FqBokazbCQ6OPv3vYUT2dxr7GGTnW70NeoWH0uzbU5w@public.gmane.org>
2021-06-29 22:52       ` Gwern Branwen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).