public inbox archive for pandoc-discuss@googlegroups.com
* pandoc as a linkchecker?
@ 2020-09-12 19:12 Joseph Reagle
       [not found] ` <f87a3346-3243-0cd4-a101-107e5ffe4902-T1oY19WcHSwdnm+yROfE0A@public.gmane.org>
  0 siblings, 1 reply; 7+ messages in thread
From: Joseph Reagle @ 2020-09-12 19:12 UTC (permalink / raw)
  To: pandoc-discuss

It's time to check which links in my syllabi are broken, and I'm again cursing under my breath that there's no multi-format linkchecker out there that can report line numbers. Then I thought, what about my favorite tool!?

Pandoc already chases links for `self-contained`, so I suspect this wouldn't be hard. Bonus: could it report the line of a markdown file where a broken link is?



* Re: pandoc as a linkchecker?
       [not found] ` <f87a3346-3243-0cd4-a101-107e5ffe4902-T1oY19WcHSwdnm+yROfE0A@public.gmane.org>
@ 2020-09-12 19:35   ` Gwern Branwen
  2020-09-12 19:38   ` Daniel Staal
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 7+ messages in thread
From: Gwern Branwen @ 2020-09-12 19:35 UTC (permalink / raw)
  To: pandoc-discuss

Which kinds of links? Pandoc may chase some links in order to inline
them ("linked scripts, stylesheets, images, and videos"), but that's
not most links, and trying to check hyperlinks in full generality is
quite complex and difficult (look at the complexity of what I use for
dead-link finding, https://github.com/linkchecker/linkchecker ), and
not possible in some cases.*

* Consider relative or absolute links on a website: if I link to
'/About' on a gwern.net page, that is a valid link when deployed, but
it will break for every linkchecking tool which doesn't assume that
that is relative to 'https://www.gwern.net'. How does Pandoc know
that?
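
A minimal Python sketch of that point, assuming the checker is told the
deployment base URL explicitly (the BASE_URL constant and the resolve helper
are hypothetical names for illustration; nothing in Pandoc supplies them):

```python
from urllib.parse import urljoin

# Assumption: the site's deployment root, supplied by the user.
# A link checker cannot infer this from the Markdown source alone.
BASE_URL = "https://www.gwern.net"


def resolve(link, base=BASE_URL):
    """Resolve a relative or site-absolute link against the base URL.

    Links that are already absolute URLs are returned unchanged.
    """
    return urljoin(base, link)
```

With the base known, '/About' becomes checkable as a full URL; without it,
the checker can only guess.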

-- 
gwern



* Re: pandoc as a linkchecker?
       [not found] ` <f87a3346-3243-0cd4-a101-107e5ffe4902-T1oY19WcHSwdnm+yROfE0A@public.gmane.org>
  2020-09-12 19:35   ` Gwern Branwen
@ 2020-09-12 19:38   ` Daniel Staal
  2020-09-12 20:19   ` Albert Krewinkel
  2020-09-12 20:31   ` BPJ
  3 siblings, 0 replies; 7+ messages in thread
From: Daniel Staal @ 2020-09-12 19:38 UTC (permalink / raw)
  To: pandoc-discuss

On 9/12/20 3:12 PM, Joseph Reagle wrote:
> Pandoc already chases links for `self-contained`, so I suspect this wouldn't be hard. Bonus: could it report the line of a markdown file where a broken link is?

I suspect the problem would be with Pandoc's definition of 'line' and
'file'.  Pandoc, like many Unix tools, only appears to read files as a
convenience to the user; it is really reading a stream of lines.
You can pass it multiple files, and it treats them all as one big
stream of lines.

So I suspect Pandoc has no real idea of line numbers or files when 
working on things - it just knows that it saw an error in the current 
line being read.

(As a further complication: If you're reading markdown or similar, what 
is a 'line'?  Do you mean anything that ends with a newline, or do you 
mean any contiguous block of text that could be written as one line? 
That is: Does wrapping the text alter the number of lines in the file?)

Daniel T. Staal

-- 
---------------------------------------------------------------
This email copyright the author.  Unless otherwise noted, you
are expressly allowed to retransmit, quote, or otherwise use
the contents for non-commercial purposes.  This copyright will
expire 5 years after the author's death, or in 30 years,
whichever is longer, unless such a period is in excess of
local copyright law.
---------------------------------------------------------------



* Re: pandoc as a linkchecker?
       [not found] ` <f87a3346-3243-0cd4-a101-107e5ffe4902-T1oY19WcHSwdnm+yROfE0A@public.gmane.org>
  2020-09-12 19:35   ` Gwern Branwen
  2020-09-12 19:38   ` Daniel Staal
@ 2020-09-12 20:19   ` Albert Krewinkel
  2020-09-12 20:31   ` BPJ
  3 siblings, 0 replies; 7+ messages in thread
From: Albert Krewinkel @ 2020-09-12 20:19 UTC (permalink / raw)
  To: pandoc-discuss


Joseph Reagle writes:

> It's time to check which links in my syllabi are broken, and I'm again
> cursing under my breath that there's no multi-format linkchecker out
> there that can report line numbers. Then I thought, what about my
> favorite tool!?

Well, here's an anchor-checking Lua filter which will tell you when a
link points to a nonexistent anchor. It should not be too hard to extend
it to check external links as well. You won't get line numbers, though.

    -- Identifiers seen anywhere in the document.
    local identifiers = {}

    -- First pass: remember the identifier of every element that has one.
    function collect_ids (x)
      if x.identifier and x.identifier ~= '' then
        identifiers[x.identifier] = true
      end
    end

    -- Second pass: warn on stderr about links to undefined anchors.
    function check_link (link)
      -- check internal links
      if link.target:sub(1,1) == '#' then
        local target_exists = identifiers[link.target:sub(2)]
        if not target_exists then
          io.stderr:write(
            table.concat {'Invalid target: ', link.target,
              ' (link text is "', pandoc.utils.stringify(link), '")\n'
            }
          )
        end
      end
    end

    -- Two filter passes, run in order: collect all ids, then check links.
    return {
      {Block = collect_ids, Inline = collect_ids},
      {Link = check_link}
    }
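
One way the external-link extension might look, sketched here in Python as
a standalone script rather than as part of the Lua filter above. The names
http_status and find_broken are hypothetical, and the list of URLs is assumed
to have been collected elsewhere. The network call is injectable so the logic
can be exercised without network access:

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the HTTP status of a HEAD request, or None if unreachable."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError, ValueError):
        return None  # DNS failure, refused connection, malformed URL, ...


def find_broken(urls, fetch=http_status):
    """Return (url, status) pairs for external URLs that look broken."""
    broken = []
    for url in urls:
        if not url.startswith(("http://", "https://")):
            continue  # internal anchors and other schemes are skipped
        status = fetch(url)
        if status is None or status >= 400:
            broken.append((url, status))
    return broken
```

Some servers answer HEAD differently from GET, so results from a check like
this are better treated as hints than as verdicts.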


--
Albert Krewinkel
GPG: 8eed e3e2 e8c5 6f18 81fe  e836 388d c0b2 1f63 1124



* Re: pandoc as a linkchecker?
       [not found] ` <f87a3346-3243-0cd4-a101-107e5ffe4902-T1oY19WcHSwdnm+yROfE0A@public.gmane.org>
                     ` (2 preceding siblings ...)
  2020-09-12 20:19   ` Albert Krewinkel
@ 2020-09-12 20:31   ` BPJ
       [not found]     ` <CADAJKhCpmA-g_LPufFmZxSY2dVJzYGw_S8vvsPrK2YQoHpRNNQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  3 siblings, 1 reply; 7+ messages in thread
From: BPJ @ 2020-09-12 20:31 UTC (permalink / raw)
  To: pandoc-discuss


There are tools which check links in HTML, so one option would be to
convert your Markdown files to HTML and then check the links in there.
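
A rough Python sketch of the second half of that approach, assuming the HTML
has already been produced (for example with `pandoc syllabus.md -t html`,
where syllabus.md is a placeholder file name). LinkCollector and
extract_links are hypothetical names; the parser is the standard library's
html.parser:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


def extract_links(html):
    """Return all link targets found in an HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links
```

The resulting list can then be handed to whatever link checker is in use.
Line numbers in the Markdown source are lost at this point, since only the
generated HTML is inspected.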

-- 
Better --help|less than helpless

On Sat, 12 Sep 2020 at 21:12, Joseph Reagle <joseph.2011-T1oY19WcHSwdnm+yROfE0A@public.gmane.org> wrote:

> It's time to check which links in my syllabi are broken, and I'm again
> cursing under my breath that there's no multi-format linkchecker out there
> that can report line numbers. Then I thought, what about my favorite tool!?
>
> Pandoc already chases links for `self-contained`, so I suspect this
> wouldn't be hard. Bonus: could it report the line of a markdown file where
> a broken link is?


* Re: pandoc as a linkchecker?
       [not found]     ` <CADAJKhCpmA-g_LPufFmZxSY2dVJzYGw_S8vvsPrK2YQoHpRNNQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2020-09-14 13:19       ` Joseph Reagle
       [not found]         ` <c5259326-1317-e43a-6416-25922630b25e-T1oY19WcHSwdnm+yROfE0A@public.gmane.org>
  0 siblings, 1 reply; 7+ messages in thread
From: Joseph Reagle @ 2020-09-14 13:19 UTC (permalink / raw)
  To: pandoc-discuss

On 9/12/20 4:31 PM, BPJ wrote:
> There are tools which check links in HTML, so one option would be to convert your Markdown files to HTML and then check the links in there.

Yes, that's what I did for a while. This semester, I'm using `pytest-check-links` which can operate on markdown files, though it internally converts to HTML first, and so would not be able to provide line numbers [1]. However, I thought it might be possible to provide line numbers eventually given John's comment in #4565, "The commonmark parser in commonmark-hs, which I'll be integrating into pandoc, already has complete source position information." [2] (I could be completely misunderstanding what this means.)

[1]: https://github.com/jupyterlab/pytest-check-links/issues/23
[2]: https://github.com/jgm/pandoc/issues/4565




* Re: pandoc as a linkchecker?
       [not found]         ` <c5259326-1317-e43a-6416-25922630b25e-T1oY19WcHSwdnm+yROfE0A@public.gmane.org>
@ 2020-09-14 13:23           ` Gwern Branwen
  0 siblings, 0 replies; 7+ messages in thread
From: Gwern Branwen @ 2020-09-14 13:23 UTC (permalink / raw)
  To: pandoc-discuss

You could just work around it by scripting. Use a linkchecker to get
broken URLs, and then search each broken URL in the original file.
Even grep has `--line-number`.
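
The same workaround sketched in Python, assuming the list of broken URLs has
already come out of a link checker (find_line_numbers is a hypothetical
helper; the matching is plain substring search, like a fixed-string grep):

```python
def find_line_numbers(text, broken_urls):
    """Map each broken URL to the 1-based line numbers where it appears."""
    hits = {url: [] for url in broken_urls}
    for lineno, line in enumerate(text.splitlines(), start=1):
        for url in broken_urls:
            if url in line:
                hits[url].append(lineno)
    return hits
```

For reference-style links this reports the line of the link definition, not
the place where the link is used, and a URL that is a prefix of a longer URL
will match both.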

-- 
gwern
https://www.gwern.net

