public inbox archive for pandoc-discuss@googlegroups.com
From: "christi...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org" <christian.kolen-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
To: pandoc-discuss <pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
Subject: Re: anchor-checker.php: a simple script to check HTML self-link validity
Date: Mon, 24 May 2021 14:04:50 -0700 (PDT)
Message-ID: <dbba54c1-ea6b-4308-b90f-ba17010f8b6dn@googlegroups.com>
In-Reply-To: <CAMwO0gzbmuPPqd=Q_Qhwv5168eOP0KUvZNOALVLgD90qO85_mA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>


You may find this useful: https://github.com/filiph/linkcheck

On Monday, May 24, 2021 at 7:56:48 AM UTC-7 Gwern Branwen wrote:

> When writing long complicated Pandoc documents over years, it is easy
> for section or anchor self-links to drift and become invalid.
> Pandoc/hakyll will not warn you about broken links, the links do not
> break in any visible way or affect validation, and it's surprisingly
> difficult to get any linkchecking tool to warn you about them
> specifically. (The W3C linkchecking tool has no option to check only
> anchors and insists on checking all external links, while
> 'linkchecker' has a plugin for this which I am told is broken, and it
> also has no way to avoid checking all external links. This renders
> both tools infeasible for regularly checking a website the size of
> gwern.net, where a linkchecker run may take several days to finish.)
> This is despite the task being quite simple: get all hrefs
> starting with '#', check that each has a corresponding ID, and print
> out any that are missing.
>
> dbohdan wrote a PHP script for me to check gwern.net pages, which has
> worked well and exposed at least 20 erroneous anchors I've fixed, and
> which is fast enough to include in the site sync script so new errors
> will be picked up immediately. This may be useful to other
> Pandoc/hakyll users.
>
> Source: 
> https://github.com/gwern/gwern.net/blob/master/build/anchor-checker.php
>
> #! /usr/bin/env php
> <?php
> // Check anchors in HTML files. Only checks anchors local to each document.
> // Anchors prefixed with a filename are ignored even if they refer to the
> // same file. Anchors with no element with the corresponding fragment ID
> // are written to stderr prefixed with the filename.
> //
> // Usage: anchor-checker.php [FILE]...
> //
> // To the extent possible under law, D. Bohdan has waived all copyright and
> // related or neighboring rights to this work.
> //
> // Date: 2021-05-24.
> // Requirements: PHP 7.x with the standard DOM module.
>
> error_reporting(E_ALL);
>
> function main($files) {
>     $exit_code = 0;
>
>     foreach ($files as $file) {
>         $bad_anchors = check_file($file);
>         // Report each bad anchor as "FILE<TAB>ANCHOR" on stderr.
>         foreach ($bad_anchors as $a) {
>             fprintf(STDERR, "%s\t%s\n", $file, $a);
>             $exit_code = 1;
>         }
>     }
>
>     exit($exit_code);
> }
>
> function check_file($file) {
>     $html = file_get_contents($file);
>     // An ugly hack to get around missing HTML5 support tripping
>     // up the parser.
>     $html = preg_replace("/<wbr>/", "", $html);
>
>     // Nothing to check in an empty or whitespace-only file.
>     if (preg_match("/^\s*$/", $html)) return [];
>
>     $dom = new DOMDocument();
>     $dom->loadHTML($html, LIBXML_NOERROR | LIBXML_NOWARNING);
>
>     return check_document($dom);
> }
>
> function check_document($dom) {
>     // Collect every id in the document, keyed as "#id" so that it can
>     // be compared directly against href values.
>     $ids = (new DOMXPath($dom))->query("//@id");
>     $id_set = array();
>
>     foreach ($ids as $id) {
>         $id_set["#" . $id->value] = true;
>     }
>
>     $bad_anchors = array();
>     $hrefs = (new DOMXPath($dom))->query("//a/@href");
>     foreach ($hrefs as $href) {
>         $value = trim($href->value);
>
>         // Skip anything that is not a same-document fragment link.
>         if (substr($value, 0, 1) !== "#") continue;
>
>         if (!array_key_exists($value, $id_set)) {
>             $bad_anchors[] = $value;
>         }
>     }
>
>     return $bad_anchors;
> }
>
> main(array_slice($argv, 1));
>
> This can be used as a post-compilation check like
> `static/build/anchor-checker.php ./_site/"$HTML"` or what have you.
>
> -- 
> gwern
> https://www.gwern.net
>

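A rough sketch of how the post-compilation check quoted above might be wrapped to cover a whole output directory rather than a single file. The function name `check_site_anchors`, the `CHECKER` variable, and the `*.html` filter are assumptions for illustration and are not part of the original script or the gwern.net build; sites that publish pages without an `.html` extension would need a different filter.

```shell
# Path to the checker; the default is taken from the example invocation
# above and can be overridden from the environment (assumption).
CHECKER="${CHECKER:-static/build/anchor-checker.php}"

# Run the checker over every *.html file under a site directory
# (defaults to ./_site). Hypothetical helper, not part of the script.
check_site_anchors() {
    site_dir="${1:-./_site}"
    # "-exec ... {} +" passes many files to each checker invocation;
    # find exits non-zero if any of those invocations does.
    find "$site_dir" -type f -name '*.html' -exec "$CHECKER" {} +
}
```

After sourcing this in a sync script, something like `check_site_anchors ./_site || echo "broken anchors found" >&2` would surface failures without having to list files by hand.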