LinkAuto.hs is a Pandoc library I am prototyping: https://www.gwern.net/static/build/LinkAuto.hs

It lets you define a regexp and a corresponding URL, and matching text in a Pandoc document will be turned into that same text, but as a Link to the URL. It is intended for annotating technical jargon, terms, proper names, phrases, etc. Because it is automated, the dictionary can be updated at any time and all documents will be updated, with no manual annotation at all (which every competing approach I am aware of requires). A link is inserted only if it is not already present in a document, and after insertion, subsequent instances are left unlinked (although this can easily be changed). It appears to be *mostly* correct, barring a few odd corner cases like definitions being inserted inside Header elements, which breaks the HTML. It is also not *too* slow: with 528 regexp rewrites defined, my site compilation is maybe only 2-3x slower. Attached is a screenshot of a popup in which all links have been added automatically by LinkAuto.hs.

I wrote this to help define technical jargon on gwern.net in a site-wide consistent way, including in the thousands of auto-generated abstracts from Arxiv, Biorxiv, etc. I particularly wanted to define all the machine learning terms: it is just too difficult to define *every* term by hand, looking up URLs each time, when a single Arxiv abstract might mention a dozen of them ("We benchmark X, Y, Z on dataset A with metrics B, C, D, finding another instance of the E law..."). No one is going to annotate those by hand, not even with some search-and-replace support, and certainly no one will go back and annotate all past examples, or add a new definition and then annotate all past examples of that one, not with thousands of pages and references! So it needs to be automated, site-wide, require no manual annotation beyond the definition itself, and allow new definitions to be added at any time.

This, unfortunately, implies parsing all of the raw Strs, with the further logical choice of using regexps. (I'm not familiar with parser combinators, and from what I've seen of them, they would be quite verbose in this application.) So the basic idea is to take a Pandoc AST, squash together all of the Str/Space nodes (because if you leave the Space nodes in, how do you match multi-word phrases?) to produce long Str runs, run each of a list of regexps over them, break a run apart when there is a match, substitute in the Link, reconstitute the rest, and continue matching regexps. The links themselves are then handled as normal links by the rest of the site infrastructure, receiving popups, being smallcaps-formatted as necessary, etc. There are many small details to get right here: for example, you need to delimit regexps by punctuation & whitespace, so " BigGAN's", " BigGAN", and " GAN." all get rewritten as the user would expect them to.
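To make that concrete, here is a heavily simplified sketch of the shape of the per-document pass. This is not the real LinkAuto.hs: the rewrite entries and helper names are made up, regex-tdfa is just the backend I assume here, and the bookkeeping (the link-once rule, skipping nodes that must not be rewritten) is omitted:

    {-# LANGUAGE OverloadedStrings #-}
    module LinkAutoSketch where

    import qualified Data.Text as T
    import Text.Pandoc.Definition (Inline(..), Pandoc, nullAttr)
    import Text.Pandoc.Walk (walk)
    import Text.Regex.TDFA ((=~))
    import Text.Regex.TDFA.Text () -- RegexLike/RegexContext instances for Text

    -- The site-wide dictionary: (regexp, URL) pairs. In the real thing each
    -- pattern is wrapped in punctuation/whitespace delimiters so " BigGAN's",
    -- " BigGAN", " GAN." etc. behave as expected.
    rewrites :: [(T.Text, T.Text)]
    rewrites = [ ("BigGAN",   "https://arxiv.org/abs/1809.11096")
               , ("CIFAR-10", "https://en.wikipedia.org/wiki/CIFAR-10") ]

    linkAuto :: Pandoc -> Pandoc
    linkAuto = walk (concatMap (applyAll rewrites) . mergeStrs)

    -- Squash adjacent Str/Space nodes into one long Str so that multi-word
    -- phrases can be matched at all.
    mergeStrs :: [Inline] -> [Inline]
    mergeStrs (Str a : Space : xs) = mergeStrs (Str (a <> " ") : xs)
    mergeStrs (Str a : Str b : xs) = mergeStrs (Str (a <> b) : xs)
    mergeStrs (x : xs)             = x : mergeStrs xs
    mergeStrs []                   = []

    -- Try every rewrite against an Inline, accumulating the splits.
    applyAll :: [(T.Text, T.Text)] -> Inline -> [Inline]
    applyAll rs i = foldr (\r acc -> concatMap (applyOne r) acc) [i] rs

    -- Break a Str apart at the first match of one regexp and splice in a Link.
    -- (The real code enforces the link-once rule properly.)
    applyOne :: (T.Text, T.Text) -> Inline -> [Inline]
    applyOne (pat, url) (Str s) =
      case s =~ pat :: (T.Text, T.Text, T.Text) of
        (_, "", _)           -> [Str s]                       -- no match
        (before, hit, after) -> filter (/= Str "")
                                  [ Str before
                                  , Link nullAttr [Str hit] (url, "")
                                  , Str after ]
    applyOne _ x = [x]

The real code has considerably more bookkeeping, but that is the core of it.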
The straightforward approach of matching every regexp against every Str proves to be infeasibly slow: it would lead to ~17h compile-times for gwern.net, getting worse with every definition & page. Matching a single 'master' regexp instead, built by concatenating every regexp with '|' alternation so that most nodes can be skipped quickly, runs into a nasty blowup in RAM: with even a few hundred patterns, 75GB of RAM is inadequate. (Apparently most regexp engines have some sort of quadratic term in the *total* regexp length, even though that appears totally unnecessary in a case like this and the engine is just doing a bad job of optimizing the compiled regexp; a 'regexp trie' would probably solve this, but the only such Haskell library bitrotted long ago.) On top of the memory problem, it is also still quite slow.

To optimize it, I preprocess each document to try to throw away as many regexps as possible. First, the document is queried for its Links: if a rewrite's URL is already linked in the document, then that regexp can be skipped. Second, the document is compiled to its 'plain text' version (with very long lines), showing only user-visible text, and every remaining regexp is run over that; any regexp which doesn't trigger at least one (possibly false) positive is thrown away as well. It is far faster to run a regexp once over an entire document than to run it over every substring. So most documents, lacking any possible hits, cost only R regexp scans over the plain-text version as a single big Text string, and then no further work needs to be done. Regexps are quite fast, so R scans is cheap. (It's Strs × R which is the problem.) This brought it into the realm of usability for me.

If I need further efficiency, because R keeps increasing to the point where even R scans is too expensive, I can retry the master-regexp approach, perhaps in blocks of 20 or 50 regexps (whatever avoids the blowup), and divide-and-conquer within a block to find out which of the member regexps actually triggered the hit and should be kept.

-- gwern
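P.S. For concreteness, the pre-filtering pass is roughly the following (again a simplified sketch, not the real code; names are made up and error handling is elided):

    {-# LANGUAGE OverloadedStrings #-}
    module LinkAutoFilterSketch where

    import Data.Either (fromRight)
    import qualified Data.Set as S
    import qualified Data.Text as T
    import Text.Pandoc (WriterOptions(..), def, runPure, writePlain)
    import Text.Pandoc.Definition (Inline(..), Pandoc)
    import Text.Pandoc.Walk (query)
    import Text.Regex.TDFA ((=~))
    import Text.Regex.TDFA.Text ()

    -- Keep only the rewrites that (1) point at a URL not already linked in the
    -- document and (2) get at least one (possibly false-positive) hit when run
    -- once over the user-visible plain text. Everything else is discarded
    -- before any per-Str work happens.
    filterRewrites :: [(T.Text, T.Text)] -> Pandoc -> [(T.Text, T.Text)]
    filterRewrites rs doc = filter keep rs
      where
        existing = query linkURL doc           -- URLs already present as Links
        linkURL (Link _ _ (u, _)) = S.singleton u
        linkURL _                 = S.empty
        plain = fromRight "" $ runPure $       -- 'plain text' with long lines
                  writePlain def{writerColumns = 9999} doc
        keep (pat, url) = url `S.notMember` existing
                          && (plain =~ pat :: Bool)

For a document with no possible hits, filterRewrites returns [] after its R whole-document scans, and the expensive Str-level rewriting never runs at all.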