* LinkAuto.hs: automatically turning regexp-matching strings into Links @ 2021-06-28 2:36 Gwern Branwen [not found] ` <CAMwO0gy25T38dmNG4y514n5zWZCONP21Woxi5YX6EHojuvtMaQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 3+ messages in thread From: Gwern Branwen @ 2021-06-28 2:36 UTC (permalink / raw) To: pandoc-discuss [-- Attachment #1: Type: text/plain, Size: 5322 bytes --] LinkAuto.hs is a Pandoc library I am prototyping: https://www.gwern.net/static/build/LinkAuto.hs It lets you define a regexp and a corresponding URL, and matching text in a Pandoc document will be turned into that text but as a Link to the URL. It is intended for annotating technical jargon, terms, proper names, phrases, etc. Because it is automated, the dictionary can be updated at any time and all documents will be updated, without any manual annotation at all (as required by all competitors I am aware of). A link is inserted only if it is not already present in a document, and after insertion, subsequent instances are left unlinked (although this can easily be changed). It appears to be *mostly* correct, barring a few odd corner cases like definitions being inserted inside Header elements, breaking the HTML. It is also not *too* slow. With 528 regexp rewrites defined, my site compilation is maybe only 2-3x slower. Attached is a screenshot of a popup in which all links have been added automatically by LinkAuto.hs. I wrote this to help define technical jargon on gwern.net in a site-wide consistent way. This includes the thousands of auto-generated abstracts from Arxiv, Biorxiv, etc. I particularly wanted to define all the machine learning terms - it is just too difficult to define *every* term by hand, looking up URLs each time, when an Arxiv abstract might mention a dozen of them ("We benchmark X, Y, Z on dataset A with metrics B, C, D, finding another instance of the E law..."). No one is going to annotate those by hand, not even with some search-and-replace support, and certainly will not be able to go back and annotate all past examples, or add a new definition and annotate all past examples of the new one, not with thousands of pages and references! So, it needs to be automated, site-wide, not require manual annotations beyond the definition itself, and allow new definitions to be added at any time. This, unfortunately, implies parsing all of the raw Strs, with the further logical choice of using regexps. (I'm not familiar with parser combinators, and from what I've seen of them, they would be quite verbose in this application.) So the basic idea is to take a Pandoc AST, squash together all of the Str/Space nodes (because if you leave the Space nodes in, how do you match multi-word phrases?) to produce long Str runs, which you can run each of a list of regexps over, break apart when there is a match, substitute in, reconstitute, and continue matching regexps. The links themselves are then handled as normal links by the rest of the site infrastructure, receiving popups, being smallcaps-formatted as necessary, etc. There are many small details to get right here: for example, you need to delimit regexps by punctuation & whitespace, so " BigGAN's", " BigGAN", and " GAN." all get rewritten as the user would expect them to. The straightforward approach of matching every regexp against every Str proves to be infeasibly slow (it would lead to ~17h compile-times for gwern.net, getting worse with every definition & page). Matching a single 'master' regexp, to try to skip doing more processing of most nodes, by using '|' alternation to concatenate each regexp, runs into a nasty exponential explosion in RAM - with even a few hundred, 75GB RAM is inadequate. (Apparently most regexp engines have some sort of quadratic term in the *total* regexp length, even though that appears totally unnecessary in a case like this and the regexp engine is just doing a bad job optimizing the compiled regexp; a 'regexp trie' would solve this, probably, but the only such library in Haskell bitrot long ago.) It is also still quite slow. To optimize it, I preprocess each document to try to throw away as many regexps as possible. First, the document is queried for its Links; if a Link in the regexp rewrites is already in the document, then that regexp can be skipped. Second, the document is compiled to the 'plain text' version (with very long lines), showing only user-visible text, and the regexps are run on that, and any regexp which doesn't trigger a (possibly false) positive is thrown away as well. It is fast to run a regexp once on an entire document, than to run it on every substring. So for most documents, lacking any possible hits, they get R regexp scans through the plain text version as a single big Text string, and then no further work needs to be done. Regexps are quite fast, so R scans is cheap. (It's Strs^R which is the problem.) This brought it into the realm of usability for me. If I need further efficiency, because R keeps increasing to the point where R scans is too expensive, I can retry the master regexp approach, perhaps in blocks of 20 or 50 regexps (whatever avoids the exponential blowup), and divide-and-conquer to find out which of the member regexps actually triggered the hit and should be kept. -- gwern -- You received this message because you are subscribed to the Google Groups "pandoc-discuss" group. To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/CAMwO0gy25T38dmNG4y514n5zWZCONP21Woxi5YX6EHojuvtMaQ%40mail.gmail.com. [-- Attachment #2: xwd-16248477022199362.png --] [-- Type: image/png, Size: 353429 bytes --] ^ permalink raw reply [flat|nested] 3+ messages in thread
[parent not found: <CAMwO0gy25T38dmNG4y514n5zWZCONP21Woxi5YX6EHojuvtMaQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: LinkAuto.hs: automatically turning regexp-matching strings into Links [not found] ` <CAMwO0gy25T38dmNG4y514n5zWZCONP21Woxi5YX6EHojuvtMaQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2021-06-28 17:06 ` John MacFarlane [not found] ` <m2lf6tncjw.fsf-jF64zX8BO0+FqBokazbCQ6OPv3vYUT2dxr7GGTnW70NeoWH0uzbU5w@public.gmane.org> 0 siblings, 1 reply; 3+ messages in thread From: John MacFarlane @ 2021-06-28 17:06 UTC (permalink / raw) To: Gwern Branwen, pandoc-discuss Instead of squashing strings and using a string-based regex, you could try using Text.Regex.Applicative (regex-applicative), which has a type data RE s a so you could define regexes of type RE Inline Inline or RE Inline MyLinkStructure or whatever. You could create a parser to take standard regex expressions and convert them to these. Gwern Branwen <gwern-v26ZT+9V8bxeoWH0uzbU5w@public.gmane.org> writes: > LinkAuto.hs is a Pandoc library I am prototyping: > https://www.gwern.net/static/build/LinkAuto.hs > > It lets you define a regexp and a corresponding URL, and matching text > in a Pandoc document will be turned into that text but as a Link to > the URL. It is intended for annotating technical jargon, terms, proper > names, phrases, etc. Because it is automated, the dictionary can be > updated at any time and all documents will be updated, without any > manual annotation at all (as required by all competitors I am aware > of). > A link is inserted only if it is not already present in a document, > and after insertion, subsequent instances are left unlinked (although > this can easily be changed). > It appears to be *mostly* correct, barring a few odd corner cases like > definitions being inserted inside Header elements, breaking the HTML. > It is also not *too* slow. With 528 regexp rewrites defined, my site > compilation is maybe only 2-3x slower. > Attached is a screenshot of a popup in which all links have been added > automatically by LinkAuto.hs. > > I wrote this to help define technical jargon on gwern.net in a > site-wide consistent way. This includes the thousands of > auto-generated abstracts from Arxiv, Biorxiv, etc. I particularly > wanted to define all the machine learning terms - it is just too > difficult to define *every* term by hand, looking up URLs each time, > when an Arxiv abstract might mention a dozen of them ("We benchmark X, > Y, Z on dataset A with metrics B, C, D, finding another instance of > the E law..."). > No one is going to annotate those by hand, not even with some > search-and-replace support, and certainly will not be able to go back > and annotate all past examples, or add a new definition and annotate > all past examples of the new one, not with thousands of pages and > references! So, it needs to be automated, site-wide, not require > manual annotations beyond the definition itself, and allow new > definitions to be added at any time. > > This, unfortunately, implies parsing all of the raw Strs, with the > further logical choice of using regexps. (I'm not familiar with parser > combinators, and from what I've seen of them, they would be quite > verbose in this application.) > > So the basic idea is to take a Pandoc AST, squash together all of the > Str/Space nodes (because if you leave the Space nodes in, how do you > match multi-word phrases?) to produce long Str runs, which you can run > each of a list of regexps over, break apart when there is a match, > substitute in, reconstitute, and continue matching regexps. > The links themselves are then handled as normal links by the rest of > the site infrastructure, receiving popups, being smallcaps-formatted > as necessary, etc. > There are many small details to get right here: for example, you need > to delimit regexps by punctuation & whitespace, so " BigGAN's", " > BigGAN", and " GAN." all get rewritten as the user would expect them > to. > > The straightforward approach of matching every regexp against every > Str proves to be infeasibly slow (it would lead to ~17h compile-times > for gwern.net, getting worse with every definition & page). Matching a > single 'master' regexp, to try to skip doing more processing of most > nodes, by using '|' alternation to concatenate each regexp, runs into > a nasty exponential explosion in RAM - with even a few hundred, 75GB > RAM is inadequate. (Apparently most regexp engines have some sort of > quadratic term in the *total* regexp length, even though that appears > totally unnecessary in a case like this and the regexp engine is just > doing a bad job optimizing the compiled regexp; a 'regexp trie' would > solve this, probably, but the only such library in Haskell bitrot long > ago.) It is also still quite slow. > > To optimize it, I preprocess each document to try to throw away as > many regexps as possible. > First, the document is queried for its Links; if a Link in the regexp > rewrites is already in the document, then that regexp can be skipped. > Second, the document is compiled to the 'plain text' version (with > very long lines), showing only user-visible text, and the regexps are > run on that, and any regexp which doesn't trigger a (possibly false) > positive is thrown away as well. It is fast to run a regexp once on an > entire document, than to run it on every substring. So for most > documents, lacking any possible hits, they get R regexp scans through > the plain text version as a single big Text string, and then no > further work needs to be done. Regexps are quite fast, so R scans is > cheap. (It's Strs^R which is the problem.) This brought it into the > realm of usability for me. > If I need further efficiency, because R keeps increasing to the point > where R scans is too expensive, I can retry the master regexp > approach, perhaps in blocks of 20 or 50 regexps (whatever avoids the > exponential blowup), and divide-and-conquer to find out which of the > member regexps actually triggered the hit and should be kept. > > -- > gwern > > -- > You received this message because you are subscribed to the Google Groups "pandoc-discuss" group. > To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org > To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/CAMwO0gy25T38dmNG4y514n5zWZCONP21Woxi5YX6EHojuvtMaQ%40mail.gmail.com. ^ permalink raw reply [flat|nested] 3+ messages in thread
[parent not found: <m2lf6tncjw.fsf-jF64zX8BO0+FqBokazbCQ6OPv3vYUT2dxr7GGTnW70NeoWH0uzbU5w@public.gmane.org>]
* Re: LinkAuto.hs: automatically turning regexp-matching strings into Links [not found] ` <m2lf6tncjw.fsf-jF64zX8BO0+FqBokazbCQ6OPv3vYUT2dxr7GGTnW70NeoWH0uzbU5w@public.gmane.org> @ 2021-06-29 22:52 ` Gwern Branwen 0 siblings, 0 replies; 3+ messages in thread From: Gwern Branwen @ 2021-06-29 22:52 UTC (permalink / raw) To: pandoc-discuss On Mon, Jun 28, 2021 at 1:07 PM John MacFarlane <jgm-TVLZxgkOlNX2fBVCVOL8/A@public.gmane.org> wrote: > You could create a parser to take standard regex > expressions and convert them to these. Perhaps. That would be above my paygrade, however, and I suspect the end result would be both buggy and a horror to use or debug in the worst Haskell type zoo fashion. The concrete string approach is relatively straightforward, and I had the merging code mostly worked out from my previous failed attempt at a rewrite library. -- gwern https://www.gwern.net ^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2021-06-29 22:52 UTC | newest] Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-06-28 2:36 LinkAuto.hs: automatically turning regexp-matching strings into Links Gwern Branwen [not found] ` <CAMwO0gy25T38dmNG4y514n5zWZCONP21Woxi5YX6EHojuvtMaQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2021-06-28 17:06 ` John MacFarlane [not found] ` <m2lf6tncjw.fsf-jF64zX8BO0+FqBokazbCQ6OPv3vYUT2dxr7GGTnW70NeoWH0uzbU5w@public.gmane.org> 2021-06-29 22:52 ` Gwern Branwen
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).