public inbox archive for pandoc-discuss@googlegroups.com
* Proposal: add Code markup to autolinked bare URLs
@ 2021-11-26 23:56 Gwern Branwen
From: Gwern Branwen @ 2021-11-26 23:56 UTC (permalink / raw)
  To: pandoc-discuss

Right now, Pandoc renders bare autolinked URLs like
`<https://www.foo.com>` as if one had written
`[https://www.foo.com](https://www.foo.com)`, i.e. a `Link _ [Str
"$URL"] ("$URL","")`. I propose that Pandoc change this behavior to
render it like `[`https://www.foo.com`](https://www.foo.com)`, i.e. a
`Link _ [Code _ "$URL"] ("$URL","")`. Putting raw URLs into code
markup would display them in a better-looking fashion and reduce
errors.

The current behavior of rendering URLs as normal roman font strings is
weird. Roman font is meant for ordinary natural language. It is
designed to be read as English, where spelling is not too important
because of context, typesetting & hyphens are important,
variable-sized glyphs look nice, and everything is
human-interpretable. Computer code has its own requirements, which is
why monospace fonts exist and we typeset code in monospace fonts:
because every single letter and punctuation mark is absolutely vital,
and a single error can lead to failure or bugs, monospace
programming-oriented fonts prize uniformity, careful 'literal'
distinguishing of glyphs like O/0 is necessary (because there may be
no way to tell from context, code being so arbitrary),
linebreaks/hyphenations are used little if at all etc. Because we
distinguish semantically & visually source code from regular language,
our tools & CSS can do things like avoid linebreaking code fragments,
or when they do, be sure to not insert hyphens that would mislead the
reader or copypaste.

The natural way, if we weren't using autolinks, to write a URL out
would simply be... ``https://www.foo.com``. Because it's code, and
always has been. URLs are computer code, not natural language. They
are a formal language which can be syntactically checked, with many
strict requirements. Their contents, never all that human readable in
the first place (dating all the way back to Unixisms like '~/' or the
bizarre little/then-big-endian inversion of ordering in TLD+paths),
are increasingly opaque blobs of IDs and query arguments which could
not even in theory be read like natural language. Mixing up a O/0 will
break a URL even more surely than it'd break source code, and
typesetting URLs as normal roman text leads to URLs being line-broken
with hyphens---forcing the user to guess if there was a hyphen in the
original, or if that's added for linebreaking. With English text, you
know the spelling and rarely does it confuse one, but with source code
(was that a minus sign?) or URLs... (loads of websites use hyphens as
separators)? For that matter, perhaps the user has copy-pasted a URL
with an EN DASH or SOFT HYPHEN or god knows what other kind of Unicode
dash. (This happens all the time. I fix on a literally daily basis
incoming requests to gwern.net for URLs which have screwed up a dash
of some sort. The diversity is depressing. This is especially bad in
paper bibliographies, where the URL is almost always split across a
line, so not only does it look bizarre to try to read 'Journal of
Journal Journaling https://x.com.de/&-$&-
^%&^%-page?stuff#morestuff', you get the indignity of trying to figure
out what the actual URL was for every. single. instance.)
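
To make the dash problem concrete, here is a minimal sketch of the kind
of cleanup a server or script might apply to an incoming URL. It is an
illustration only, not the fix actually used on gwern.net: the table
`SUSPECT_DASHES` and the function `normalize_url_dashes` are
hypothetical names, and mapping every dash lookalike to an ASCII
hyphen is a heuristic that can guess wrong.

```python
# Hypothetical lookup table: Unicode dash lookalikes that tend to sneak
# into copy-pasted URLs, mapped to what was most likely intended.
SUSPECT_DASHES = {
    "\u00ad": "",   # SOFT HYPHEN: invisible line-break hint, so drop it
    "\u2010": "-",  # HYPHEN
    "\u2011": "-",  # NON-BREAKING HYPHEN
    "\u2012": "-",  # FIGURE DASH
    "\u2013": "-",  # EN DASH
    "\u2014": "-",  # EM DASH
    "\u2212": "-",  # MINUS SIGN
}

# Build the translation table once; str.translate applies it per character.
_DASH_TABLE = str.maketrans(SUSPECT_DASHES)


def normalize_url_dashes(url: str) -> str:
    """Replace dash lookalikes with ASCII '-' and remove soft hyphens."""
    return url.translate(_DASH_TABLE)
```

Note that this cannot recover a hyphen that a typesetter inserted at a
line break: once the URL has been hyphenated as roman text, the
original is ambiguous, which is the argument for not typesetting it
that way in the first place.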

If the autolinked URL was rendered as a Code inline and thus processed
& displayed like other Code fragments, it would make raw URLs easier
to skim, easier to read literally should that be necessary, avoid
error-inducing linebreaks, and just overall encode the semantics
better & be more consistent.
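
As a rough sketch of what the transformation looks like, the following
walks Pandoc's JSON AST and rewrites autolinks so that their text is a
Code inline. It assumes the JSON layout Pandoc emits with `-t json`
(a Link is `{"t": "Link", "c": [attr, inlines, target]}` and a Code is
`{"t": "Code", "c": [attr, text]}`), and it assumes autolinks can be
recognized by the "uri" class on the Link. The function name
`codify_autolinks` is made up for this example; this is a user-side
filter, not the change to Pandoc itself.

```python
def codify_autolinks(node):
    """Recursively rewrite autolink Links so their text is a Code inline."""
    if isinstance(node, list):
        return [codify_autolinks(child) for child in node]
    if isinstance(node, dict):
        if node.get("t") == "Link":
            attr, inlines, target = node["c"]
            classes = attr[1]  # attr is [identifier, classes, key-value pairs]
            # Assumption: an autolink carries the "uri" class and has
            # exactly one Str as its link text.
            is_autolink = (
                "uri" in classes
                and len(inlines) == 1
                and inlines[0].get("t") == "Str"
            )
            if is_autolink:
                empty_attr = ["", [], []]
                code = {"t": "Code", "c": [empty_attr, inlines[0]["c"]]}
                return {"t": "Link", "c": [attr, [code], target]}
        # Any other element: keep it, but keep walking its children.
        return {key: codify_autolinks(value) for key, value in node.items()}
    return node  # strings, numbers, None
```

To try it, one could load the output of `pandoc -t json` with
`json.load`, pass the result through `codify_autolinks`, and feed the
dumped JSON back to `pandoc -f json`. Links written out by hand have
no "uri" class and are left alone.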

Drawbacks:

- implementation: minimal. This can be done at the AST level when
autolinks are processed, so I assume it might be as easy as a 1-liner
and the implementation difficulty nil.
- compatibility with downstream formats: they all support Code
highlighting/markup or have appropriate fallbacks to handle code
already, so offhand I don't know of any problems there. If they don't
handle Code, presumably they fall back to plain text, in which case
there is no visible change.
- compatibility with API users: API users have to be walking the AST
and handling nested elements anyway (if they would fail on a Link with
a Code[Str] in it, then what if the user *had* written it out as
`[`foo.com`](foo.com)`? They were broken to begin with.), so nothing
is broken by this change that was not already broken.
- compatibility with other tools: downstream users may be using hacks
or weak tools that do not generalize properly to walking the AST. I
don't think there is much or any of this. For example, in the HTML
case, a CSS user would be targeting the autolinks CSS class. If they
do that to set code highlighting etc, then that is merely redundant.
And then they can remove the autolink-specific rules and simplify
their code by simply treating autolink URLs like any inline code
(which they are).

    If anyone would be broken by this proposed change, please speak up
now, but at least as far as I know, there is nothing.
- "it changed, I don't like it": despite the pragmatic arguments for
treating URLs like code, people may disagree on the esthetics and
think URLs *should* be typeset as if they were natural English text
and like anything else, or that Pandoc shouldn't change anything in
the interests of stability.

    This is probably the most serious counterargument. It is true that
it is common for the defaults in many systems, not just Pandoc, to not
monospace/code-ify bare URLs, but I don't think that is any ringing
endorsement - merely laziness. I find that when you typeset URLs as
code, you quickly get used to it, and eventually find roman URLs to be
just (cr/l)aziness. Like a student submitting their programming
homework in MS Word.

-- 
gwern
https://www.gwern.net



* Re: Proposal: add Code markup to autolinked bare URLs
@ 2021-11-27  1:57   ` John MacFarlane
From: John MacFarlane @ 2021-11-27  1:57 UTC (permalink / raw)
  To: Gwern Branwen, pandoc-discuss


We used to do this, actually.  I think we stopped because it
became standard not to do this.  I think one substantive
argument for not using a monospaced font is that it takes
more width and makes it more likely that the URL won't fit
on the line.  But mostly we're just conforming to what have
become standard practices.

Fortunately, it's easy enough to style autolinks as you like,
since we give them a special "uri" class.  This way we can
support your preferences, and also those of people who
don't want URLs monospaced. (They'd have a harder time
if these were in code tags.)

