From: Gwern Branwen
Newsgroups: gmane.text.pandoc
To: pandoc-discuss
Subject: Auto-smallcaps filter
Date: Wed, 19 Feb 2020 15:14:13 -0500

I wrote a plugin for my gwern.net Hakyll script (https://www.gwern.net/hakyll.hs) which was slightly tricky to write, and so might be of interest.

Bringhurst & other typographers recommend setting acronyms/initials of 3 or more capital letters in small caps, because in full capitals they look too big and dominate the page (eg Bringhurst 2004, _Elements_ pg47; cf https://en.wikipedia.org/wiki/Small_caps#Uses http://theworldsgreatestbook.com/book-design-part-5/ http://webtypography.net/3.2.2 ).

This can be done by hand in Pandoc by using the span syntax like `[ABC]{.smallcaps}`, but that quickly grows tedious.
It can also be done reasonably easily with a query-replace regexp, eg in Emacs `(query-replace-regexp "\\([^>]\\)\\(\\\".*?\\\"\\)" "\\1\\2" nil begin end)`, but that still must be done manually: while almost all uses in regular text can be smallcaps-fied, a blind regexp will wreck a ton of things like URLs & tooltips, code blocks, etc.

However, if we walk a Pandoc AST and check for acronyms/initials only inside a `Str`, where they *can't* be part of a `Link` or `CodeBlock`, then, looking over gwern.net ASTs, they seem to always be safe to replace with `SmallCaps` elements.

Unfortunately, we can't use the regular `Inline -> Inline` replacement pattern, because a single `Str` has to be split into several elements: the text before the acronym, a `SmallCaps [Inline]` wrapping the acronym, and the text after it. So the function has to be `[Inline] -> [Inline]`, changing the length of the list as it goes.

So we instead walk the Pandoc AST, use a regexp to split on 3+ capital letters, wrap the matched text in `SmallCaps`, recurse on the remainder, and return the concatenated results.

`bottomUp` is slower than `walk` but appears to be necessary here for greedy generation; `walk` will do only *some* substitutions, which has something to do with its tree traversal method, I think? (Regardless, `smallcapsfy` doesn't seem to add *too* much overhead.)

The final code:

    import Text.Pandoc
    import Text.Regex.Posix ((=~))

    smallcapsfy :: [Inline] -> [Inline]
    smallcapsfy ((Str []):[]) = []
    -- why `::String` on the regexp pattern? need to specify it, otherwise
    -- hakyll.hs's OverloadedStrings makes it ambiguous & a type error
    smallcapsfy xs@(Str a : x) =
      let (before,matched,after) = a =~ ("[A-Z][A-Z][A-Z]+"::String) :: (String,String,String)
      in if matched==""
         then xs -- no acronym in `a`: return the list unchanged
         else [Str before, SmallCaps [Str matched]] ++ smallcapsfy [Str after] ++ smallcapsfy x
    smallcapsfy xs = xs

Regexp examples:

    "BigGAN" =~ "[A-Z][A-Z][A-Z]+" :: (String,String,String)
    ~> ("Big","GAN","")
    "BigGANNN BigGAN" =~ "[A-Z][A-Z][A-Z]+" :: (String,String,String)
    ~> ("Big","GANNN"," BigGAN")
    "NSFW BigGAN" =~ "[A-Z][A-Z][A-Z]+" :: (String,String,String)
    ~> ("","NSFW"," BigGAN")
    "BigGANNN BigGAN" =~ "[A-Z][A-Z][A-Z]" :: (String,String,String)
    ~> ("Big","GAN","NN BigGAN")
    "biggan means big" =~ "[A-Z][A-Z][A-Z]" :: (String,String,String)
    ~> ("biggan means big","","")

Function examples:

    smallcapsfy [Str "BigGAN"]
    ~> [Str "Big",SmallCaps [Str "GAN"]]
    smallcapsfy [Str "BigGANNN means big"]
    ~> [Str "Big",SmallCaps [Str "GANNN"],Str " means big"]
    smallcapsfy [Str "biggan means big"]
    ~> [Str "biggan means big"]

Whole-document examples:

    bottomUp smallcapsfy [Str "bigGAN means", Emph [Str "BIG"]]
    ~> [Str "big",SmallCaps [Str "GAN"],Str " means",Emph [Str "",SmallCaps [Str "BIG"]]]

-- gwern
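
A possible explanation for the `walk` behaviour, offered as a guess rather than a diagnosis: `walk` calls an `[Inline] -> [Inline]` function once per inline list, and `smallcapsfy` hands the list back untouched as soon as its first element has no match, so later elements are only reached if the traversal also visits every tail of the list, which `bottomUp` does. The sketch below recurses over the tail itself, so one call covers a whole list. It has not been run against Pandoc: `Inline` is a cut-down stand-in for the real type, `splitAcronym` and `smallcapsfyAll` are hypothetical names, and the regexp is swapped for a hand-written scan so that only `base` is needed. Unlike the original, it also skips empty `Str ""` pieces.

```haskell
import Data.Char (isAsciiUpper)

-- Stand-in for the few Pandoc constructors used here (an assumption for
-- illustration; the real Text.Pandoc.Definition.Inline has many more).
data Inline = Str String | Space | SmallCaps [Inline]
  deriving (Eq, Show)

-- Split a string at its first run of 3+ ASCII capitals, mimicking the triple
-- that `=~ "[A-Z][A-Z][A-Z]+"` returns: Just (before, matched, after),
-- or Nothing when there is no such run.
splitAcronym :: String -> Maybe (String, String, String)
splitAcronym s =
  case break isAsciiUpper s of
    (_, []) -> Nothing                        -- no capital letters left
    (pre, rest) ->
      let (caps, after) = span isAsciiUpper rest
      in if length caps >= 3
           then Just (pre, caps, after)
           else do                            -- run too short: keep scanning
             (pre', caps', after') <- splitAcronym after
             Just (pre ++ caps ++ pre', caps', after')

-- Variant of `smallcapsfy` that always continues into the rest of the list,
-- whether or not the current element matched or is a `Str` at all.
smallcapsfyAll :: [Inline] -> [Inline]
smallcapsfyAll [] = []
smallcapsfyAll (Str a : rest) =
  case splitAcronym a of
    Nothing -> Str a : smallcapsfyAll rest
    Just (before, matched, after) ->
      nonEmpty before ++ [SmallCaps [Str matched]]
        ++ smallcapsfyAll (nonEmpty after ++ rest)
  where
    nonEmpty str = [Str str | not (null str)]  -- drop empty `Str ""` pieces
smallcapsfyAll (other : rest) = other : smallcapsfyAll rest
```

For instance, `smallcapsfyAll [Str "foo", Space, Str "BigGAN"]` reaches the last element even though the first has no acronym. With the real Pandoc types, the same shape of function would presumably be passed to `walk`, but that is the untested part.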