From: John MacFarlane
Newsgroups: gmane.text.pandoc
Subject: Re: LinkAuto.hs: automatically turning regexp-matching strings into Links
Date: Mon, 28 Jun 2021 10:06:59 -0700
To: Gwern Branwen, pandoc-discuss

Instead of squashing strings and using a string-based regex, you
could try using Text.Regex.Applicative (regex-applicative), which
has a type

    data RE s a

so you could define regexes of type

    RE Inline Inline

or

    RE Inline MyLinkStructure

or whatever. You could create a parser to take standard regex
expressions and convert them to these.

Gwern Branwen writes:

> LinkAuto.hs is a Pandoc library I am prototyping:
> https://www.gwern.net/static/build/LinkAuto.hs
>
> It lets you define a regexp and a corresponding URL, and matching text
> in a Pandoc document will be turned into that text but as a Link to
> the URL. It is intended for annotating technical jargon, terms, proper
> names, phrases, etc. Because it is automated, the dictionary can be
> updated at any time and all documents will be updated, without any
> manual annotation at all (as required by all competitors I am aware
> of).
> A link is inserted only if it is not already present in a document,
> and after insertion, subsequent instances are left unlinked (although
> this can easily be changed).
> It appears to be *mostly* correct, barring a few odd corner cases like
> definitions being inserted inside Header elements, breaking the HTML.
> It is also not *too* slow. With 528 regexp rewrites defined, my site
> compilation is maybe only 2-3x slower.
> Attached is a screenshot of a popup in which all links have been added
> automatically by LinkAuto.hs.
>
> I wrote this to help define technical jargon on gwern.net in a
> site-wide consistent way. This includes the thousands of
> auto-generated abstracts from Arxiv, Biorxiv, etc. I particularly
> wanted to define all the machine learning terms - it is just too
> difficult to define *every* term by hand, looking up URLs each time,
> when an Arxiv abstract might mention a dozen of them ("We benchmark X,
> Y, Z on dataset A with metrics B, C, D, finding another instance of
> the E law...").
> No one is going to annotate those by hand, not even with some
> search-and-replace support, and certainly will not be able to go back
> and annotate all past examples, or add a new definition and annotate
> all past examples of the new one, not with thousands of pages and
> references! So, it needs to be automated, site-wide, not require
> manual annotations beyond the definition itself, and allow new
> definitions to be added at any time.
>
> This, unfortunately, implies parsing all of the raw Strs, with the
> further logical choice of using regexps. (I'm not familiar with parser
> combinators, and from what I've seen of them, they would be quite
> verbose in this application.)
>
> So the basic idea is to take a Pandoc AST, squash together all of the
> Str/Space nodes (because if you leave the Space nodes in, how do you
> match multi-word phrases?)
> to produce long Str runs, which you can run
> each of a list of regexps over, break apart when there is a match,
> substitute in, reconstitute, and continue matching regexps.
> The links themselves are then handled as normal links by the rest of
> the site infrastructure, receiving popups, being smallcaps-formatted
> as necessary, etc.
> There are many small details to get right here: for example, you need
> to delimit regexps by punctuation & whitespace, so " BigGAN's",
> " BigGAN", and " GAN." all get rewritten as the user would expect them
> to.
>
> The straightforward approach of matching every regexp against every
> Str proves to be infeasibly slow (it would lead to ~17h compile-times
> for gwern.net, getting worse with every definition & page). Matching a
> single 'master' regexp, to try to skip doing more processing of most
> nodes, by using '|' alternation to concatenate each regexp, runs into
> a nasty exponential explosion in RAM - with even a few hundred, 75GB
> RAM is inadequate. (Apparently most regexp engines have some sort of
> quadratic term in the *total* regexp length, even though that appears
> totally unnecessary in a case like this and the regexp engine is just
> doing a bad job optimizing the compiled regexp; a 'regexp trie' would
> solve this, probably, but the only such library in Haskell bitrotted
> long ago.) It is also still quite slow.
>
> To optimize it, I preprocess each document to try to throw away as
> many regexps as possible.
> First, the document is queried for its Links; if a Link in the regexp
> rewrites is already in the document, then that regexp can be skipped.
> Second, the document is compiled to the 'plain text' version (with
> very long lines), showing only user-visible text, and the regexps are
> run on that, and any regexp which doesn't trigger a (possibly false)
> positive is thrown away as well. It is faster to run a regexp once on
> an entire document than to run it on every substring.
> So for most documents, lacking any possible hits, they get R regexp
> scans through the plain text version as a single big Text string, and
> then no further work needs to be done. Regexps are quite fast, so R
> scans are cheap. (It's Strs^R which is the problem.) This brought it
> into the realm of usability for me.
> If I need further efficiency, because R keeps increasing to the point
> where R scans are too expensive, I can retry the master regexp
> approach, perhaps in blocks of 20 or 50 regexps (whatever avoids the
> exponential blowup), and divide-and-conquer to find out which of the
> member regexps actually triggered the hit and should be kept.
>
> --
> gwern
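
The two-stage prefilter described in the quoted message (skip rules
whose URL is already linked in the document, then skip rules with no
hit in the plain-text rendering) can be sketched roughly as below. This
is only an illustration under stated assumptions, not LinkAuto.hs
itself: the Rule type and every function name here are hypothetical,
and a literal substring test (Data.List.isInfixOf) stands in for the
real regexp match so that the sketch needs nothing beyond base.

```haskell
import Data.List (isInfixOf)

-- | Hypothetical rule type (not taken from LinkAuto.hs): a literal
-- phrase standing in for a regexp, plus the URL a match should link to.
data Rule = Rule
  { rulePattern :: String
  , ruleUrl     :: String
  } deriving (Eq, Show)

-- | Stage 1: drop rules whose URL is already the target of some Link in
-- the document. The first argument is the list of URLs already linked.
dropAlreadyLinked :: [String] -> [Rule] -> [Rule]
dropAlreadyLinked existingUrls =
  filter ((`notElem` existingUrls) . ruleUrl)

-- | Stage 2: drop rules whose pattern never occurs in the plain-text
-- rendering of the document. This is one scan per rule over the whole
-- text, instead of one scan per rule per Str node.
dropNonMatching :: String -> [Rule] -> [Rule]
dropNonMatching plainText =
  filter ((`isInfixOf` plainText) . rulePattern)

-- | Rules surviving both stages. Only these would go on to the
-- expensive per-Str rewriting pass.
candidateRules :: [String] -> String -> [Rule] -> [Rule]
candidateRules existingUrls plainText =
  dropNonMatching plainText . dropAlreadyLinked existingUrls
```

For example, given rules for "BigGAN", "GPT-3" and "ImageNet", a
document that already links the GPT-3 rule's URL and whose plain text
mentions only BigGAN and GPT-3 keeps just the BigGAN rule. A document
with no hits at all keeps no rules, so the per-Str pass can be skipped
entirely for it.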