public inbox archive for pandoc-discuss@googlegroups.com
* Auto-smallcaps filter
@ 2020-02-19 20:14 Gwern Branwen
From: Gwern Branwen @ 2020-02-19 20:14 UTC (permalink / raw)
  To: pandoc-discuss

I wrote a plugin for my gwern.net Hakyll script
(https://www.gwern.net/hakyll.hs) which was slightly tricky, and so
might be of interest.

Bringhurst & other typographers recommend using small-caps for
acronyms/initials of 3 or more capital letters because with full
capitals, they look too big and dominate the page (eg Bringhurst 2004,
_Elements_ pg47; cf https://en.wikipedia.org/wiki/Small_caps#Uses
http://theworldsgreatestbook.com/book-design-part-5/
http://webtypography.net/3.2.2 )

This can be done by hand in Pandoc by using the span syntax like
`[ABC]{.smallcaps}`, but that quickly grows tedious. It can also be done
reasonably easily with a query-replace regexp, eg in Emacs
`(query-replace-regexp "\\([A-Z][A-Z][A-Z]+\\)" "[\\1]{.smallcaps}"
nil begin end)`, but each match still must be confirmed manually: while
almost all uses in regular text can be smallcaps-fied, a blind regexp
will wreck a ton of things like URLs & tooltips, code blocks, etc.

However, if we walk a Pandoc AST and check for acronyms/initials only
inside a `Str`, which *can't* be a `Link` target or a `CodeBlock`,
then, looking over gwern.net ASTs, they seem to always be safe to
wrap in `SmallCaps` elements. Unfortunately, we can't use the
regular `Inline -> Inline` replacement pattern, because one `Str` has
to be split into several elements (the text before, a `SmallCaps
[Inline]`, and the text after): we are really doing `Inline ->
[Inline]`, which changes the length of the list.

So we instead walk the Pandoc AST, use a regexp to split on 3+ capital
letters, `SmallCaps` the matched text, recurse on the remainder, and
return the concatenated results.

`bottomUp` is slower than `walk` but appears to be necessary here to
catch every match; `walk` will do only *some* substitutions, presumably
because it applies the function once per inline list and `smallcapsfy`
only inspects the head of the list it is given, whereas `bottomUp`
also visits every tail of the list. (Regardless, `smallcapsfy` doesn't
seem to add *too* much overhead.)

The final code:

    import Text.Pandoc
    import Text.Regex.Posix ((=~))

    smallcapsfy :: [Inline] -> [Inline]
    smallcapsfy ((Str []):[]) = []
    -- why `::String` on the regexp pattern? need to specify it, otherwise
    -- hakyll.hs OverloadedStrings makes it ambiguous & a type error
    smallcapsfy xs@(Str a : x) =
      let (before,matched,after) =
            a =~ ("[A-Z][A-Z][A-Z]+"::String) :: (String,String,String)
      in if matched==""
         then xs -- no acronym in `a`; the tail `x` is left to `bottomUp`
         else [Str before, SmallCaps [Str matched]]
              ++ smallcapsfy [Str after] ++ smallcapsfy x
    smallcapsfy xs = xs

Regexp examples:

    "BigGAN" =~ "[A-Z][A-Z][A-Z]+" :: (String,String,String)
    ~> ("Big","GAN","")
    "BigGANNN BigGAN" =~ "[A-Z][A-Z][A-Z]+" :: (String,String,String)
    ~> ("Big","GANNN"," BigGAN")
    "NSFW BigGAN" =~ "[A-Z][A-Z][A-Z]+" :: (String,String,String)
    ~> ("","NSFW"," BigGAN")
    "BigGANNN BigGAN" =~ "[A-Z][A-Z][A-Z]" :: (String,String,String)
    ~> ("Big","GAN","NN BigGAN")
    "biggan means big" =~ "[A-Z][A-Z][A-Z]" :: (String,String,String)
    ~> ("biggan means big","","")

Function examples:

    smallcapsfy [Str "BigGAN"]
    ~> [Str "Big",SmallCaps [Str "GAN"]]
    smallcapsfy [Str "BigGANNN means big"]
    ~> [Str "Big",SmallCaps [Str "GANNN"],Str " means big"]
    smallcapsfy [Str "biggan means big"]
    ~> [Str "biggan means big"]

Whole-document examples:

    bottomUp smallcapsfy [Str "bigGAN means", Emph [Str "BIG"]]
    ~> [Str "big",SmallCaps [Str "GAN"],Str " means",
        Emph [Str "",SmallCaps [Str "BIG"]]]

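A rough sketch of why the traversal might matter, under the assumption
(not confirmed in this thread) that `walk` hands an `[Inline] ->
[Inline]` function each inline list once, as a whole, while `bottomUp`,
being a generic traversal, also reaches every tail of the list. The
`Inline` type below is a stand-in rather than pandoc's, and
`splitAcronym`, `headOnly`, and `everyTail` are hypothetical names;
`splitAcronym` is a hand-written substitute for the regexp so that only
`base` is needed.

```haskell
import Data.Char (isAsciiUpper)

-- Stand-in for the two pandoc constructors needed here (an assumption,
-- so that the sketch runs without pandoc-types).
data Inline = Str String | SmallCaps [Inline]
  deriving (Eq, Show)

-- Hand-written substitute for `=~ "[A-Z][A-Z][A-Z]+"`: returns
-- (before, matched, after) for the first run of 3+ ASCII capitals.
splitAcronym :: String -> (String, String, String)
splitAcronym = go ""
  where
    go acc [] = (reverse acc, "", "")
    go acc s@(c:cs)
      | length run >= 3 = (reverse acc, run, rest)
      | null run        = go (c : acc) cs
      | otherwise       = go (reverse run ++ acc) rest
      where (run, rest) = span isAsciiUpper s

-- Mirrors the head-only `smallcapsfy` from the post: on a miss it
-- returns the list untouched, including its unprocessed tail.
headOnly :: [Inline] -> [Inline]
headOnly [Str ""] = []
headOnly xs@(Str a : x) =
  case splitAcronym a of
    (_, "", _)               -> xs
    (before, matched, after) ->
      [Str before, SmallCaps [Str matched]]
        ++ headOnly [Str after] ++ headOnly x
headOnly xs = xs

-- Apply a list function to every tail of a flat list, innermost first;
-- roughly what a generic bottom-up traversal does to a cons list.
everyTail :: ([Inline] -> [Inline]) -> [Inline] -> [Inline]
everyTail f []       = f []
everyTail f (x : xs) = f (x : everyTail f xs)
```

With these definitions, `headOnly [Str "see the ", Str "FAQ"]` returns
its input unchanged, because the head contains no acronym, whereas
`everyTail headOnly` on the same list also reaches the second `Str` and
wraps `FAQ`.
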
-- 
gwern



* Re: Auto-smallcaps filter
@ 2020-02-20 22:58   ` John MacFarlane
From: John MacFarlane @ 2020-02-20 22:58 UTC (permalink / raw)
  To: Gwern Branwen, pandoc-discuss


You could use this idiom instead of bottomUp:

    walk (concatMap go)

Where 'go' is Inline -> [Inline], 'concatMap go' is
[Inline] -> [Inline], so 'walk (concatMap go)' applies it to every
inline list in the document.  This should perform better than
bottomUp.
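
A minimal sketch of that idiom, assuming a stand-in `Inline` type and a
toy `walkInlines` in place of pandoc's real `walk` (both hypothetical,
so that the example needs only `base`); `splitAcronym` is a hand-written
substitute for the regexp in the original post.

```haskell
import Data.Char (isAsciiUpper)

-- Stand-in for the pandoc constructors used in this thread (assumption).
data Inline = Str String | SmallCaps [Inline] | Emph [Inline]
  deriving (Eq, Show)

-- Toy counterpart of `walk` for an `[Inline] -> [Inline]` function:
-- rewrite the children of each element first, then the list itself.
walkInlines :: ([Inline] -> [Inline]) -> [Inline] -> [Inline]
walkInlines f = f . map descend
  where
    descend (Emph xs)      = Emph (walkInlines f xs)
    descend (SmallCaps xs) = SmallCaps (walkInlines f xs)
    descend x              = x

-- (before, matched, after) for the first run of 3+ ASCII capitals.
splitAcronym :: String -> (String, String, String)
splitAcronym = loop ""
  where
    loop acc [] = (reverse acc, "", "")
    loop acc s@(c:cs)
      | length run >= 3 = (reverse acc, run, rest)
      | null run        = loop (c : acc) cs
      | otherwise       = loop (reverse run ++ acc) rest
      where (run, rest) = span isAsciiUpper s

-- Per-element rewrite: one `Str` may become several inlines.
go :: Inline -> [Inline]
go (Str "") = []
go x@(Str a) =
  let (before, matched, after) = splitAcronym a
  in if null matched
       then [x]
       else [Str before, SmallCaps [Str matched]] ++ go (Str after)
go x = [x]
```

Because `concatMap go` looks at every element of the list rather than
only its head, a single pass per list is enough: in this sketch,
`walkInlines (concatMap go) [Str "see the ", Str "FAQ"]` wraps `FAQ`
even though the first `Str` contains no acronym.
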





* Re: Auto-smallcaps filter
@ 2020-02-21  4:12       ` Gwern Branwen
From: Gwern Branwen @ 2020-02-21  4:12 UTC (permalink / raw)
  To: pandoc-discuss

That seems to work, thanks.

-- 
gwern
https://www.gwern.net



* Re: Auto-smallcaps filter
@ 2020-04-07 15:12           ` Gwern Branwen
From: Gwern Branwen @ 2020-04-07 15:12 UTC (permalink / raw)
  To: pandoc-discuss

To update this: for HTML output, this code is broken, because using
smallcaps on a capital letter is a null-op and the code doesn't
transform the smallcapsfied phrases into lowercase. We need to set a
new CSS class, lowercase it, and then smallcaps it as usual.

We *could* just rewrite the capitals to lowercase with `map toLower`
etc, but then that breaks copy-paste: the underlying text for
'Big[GAN]{.smallcaps}' is now 'Big[gan]{.smallcaps}', which copies as
'Biggan'. So instead of using native SmallCaps AST elements, we create
a new CSS class *just* for auto-detected all-caps, 'smallcaps-auto',
separate from the pre-existing standard Pandoc 'smallcaps' class; we
annotate capitals with that new class in a Span rather than SmallCaps,
and then in CSS, we do `span.smallcaps-auto { font-feature-settings:
'smcp'; text-transform: lowercase; }`: smallcaps is enabled for this
class, but we also lowercase everything, thereby forcing the intended
smallcaps appearance while ensuring that copy-paste produces 'BigGAN'
(as written) instead of 'Biggan'.
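
The two options can be compared in a small sketch. The `Inline` type
here is a stand-in for pandoc's `Str` and `Span` (attributes reduced to
a list of class names), and `render`, `lowercased`, and `tagged` are
hypothetical helpers for illustration only, not pandoc functions.

```haskell
import Data.Char (toLower)

-- Stand-in for pandoc's `Str` and `Span` (assumption: the attribute
-- triple is reduced to a list of class names).
data Inline = Str String | Span [String] [Inline]
  deriving (Eq, Show)

-- Toy HTML renderer, only for this sketch (no escaping).
render :: Inline -> String
render (Str s) = s
render (Span classes xs) =
  "<span class=\"" ++ unwords classes ++ "\">"
    ++ concatMap render xs ++ "</span>"

-- Option 1: lowercase the text itself; the original casing is lost.
lowercased :: String -> Inline
lowercased acronym = Span ["smallcaps"] [Str (map toLower acronym)]

-- Option 2: keep the text as written, tag it with the new class, and
-- leave the lowercasing to the CSS rule for `span.smallcaps-auto`.
tagged :: String -> Inline
tagged acronym = Span ["smallcaps-auto"] [Str acronym]
```

In this sketch `render (lowercased "GAN")` carries `gan` in the markup,
while `render (tagged "GAN")` still carries `GAN`, so the lowercasing
in the second case exists only in the stylesheet.
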

Aside from the new CSS declaration specified above, `smallcapsfy` needs
to emit a Span rather than SmallCaps, as follows:

    -- per John's suggestion, apply with `walk smallcapsfy`
    smallcapsfy :: [Inline] -> [Inline]
    smallcapsfy = concatMap go
      where
        go :: Inline -> [Inline]
        go (Str []) = []
        go x@(Str a) =
          let (before,matched,after) =
                a =~ ("[A-Z][A-Z][A-Z]+"::String) :: (String,String,String)
          in if matched==""
             then [x] -- no acronym anywhere in x
             else [Str before,
                   Span ("", ["smallcaps-auto"], []) [Str matched]]
                  ++ go (Str after)
        go x = [x]

-- 
gwern
https://www.gwern.net

