You may find this useful: https://github.com/filiph/linkcheck

On Monday, May 24, 2021 at 7:56:48 AM UTC-7 Gwern Branwen wrote:

> When writing long complicated Pandoc documents over years, it is easy
> for section or anchor self-links to drift and become invalid.
> Pandoc/hakyll will not warn you about broken links; they do not
> break in any visible way or affect validation, and it's surprisingly
> difficult to get any linkchecking tool to warn you about them
> specifically. (The W3C linkchecking tool has no option to check only
> anchors and insists on checking all external links, while
> 'linkchecker' has a plugin for this which I am told is broken, and
> also no way to avoid checking all external links - this renders both
> tools completely infeasible for regularly checking a website the size
> of gwern.net, where a linkchecker run may take several days to finish.)
> This is despite being really quite a simple task: get all hrefs
> starting with '#', check that an ID corresponds, print out any that
> are missing.
>
> dbohdan wrote a PHP script for me to check gwern.net pages, which has
> worked well and exposed at least 20 erroneous anchors I've fixed, and
> which is fast enough to include in the site sync script so new errors
> will be picked up immediately. This may be useful to other
> Pandoc/hakyll users.
>
> Source:
> https://github.com/gwern/gwern.net/blob/master/build/anchor-checker.php
>
> #! /usr/bin/env php
> <?php
> // Check anchors in HTML files. Only checks anchors local to each document.
> // Anchors prefixed with a filename are ignored even if they refer to the
> // same file. Anchors with no element with the corresponding fragment ID
> // are written to stderr prefixed with the filename.
> //
> // Usage: anchor-checker.php [FILE]...
> //
> // To the extent possible under law, D. Bohdan has waived all copyright and
> // related or neighboring rights to this work.
> //
> // Date: 2021-05-24.
> // Requirements: PHP 7.x with the standard DOM module.
>
> error_reporting(E_ALL);
>
> function main($files) {
>     $exit_code = 0;
>
>     foreach ($files as $file) {
>         $bad_anchors = check_file($file);
>         foreach ($bad_anchors as $a) {
>             fprintf(STDERR, "%s\t%s\n", $file, $a);
>             $exit_code = 1;
>         }
>     }
>
>     exit($exit_code);
> }
>
> function check_file($file) {
>     $html = file_get_contents($file);
>     // An ugly hack to get around missing HTML5 support tripping up the parser.
>     $html = preg_replace("//", "", $html);
>
>     if (preg_match("/^\s*$/", $html)) return [];
>
>     $dom = new DOMDocument();
>     $dom->loadHTML($html, LIBXML_NOERROR | LIBXML_NOWARNING);
>
>     return check_document($dom);
> }
>
> function check_document($dom) {
>     $ids = (new DOMXpath($dom))->query("//@id");
>     $id_set = array();
>
>     foreach ($ids as $id) {
>         $id_set["#" . $id->value] = true;
>     }
>
>     $bad_anchors = array();
>     $hrefs = (new DOMXpath($dom))->query("//a/@href");
>     foreach ($hrefs as $href) {
>         $value = trim($href->value);
>
>         if (substr($value, 0, 1) !== "#") continue;
>
>         if (!array_key_exists($value, $id_set)) {
>             $bad_anchors[] = $value;
>         }
>     }
>
>     return $bad_anchors;
> }
>
> main(array_slice($argv, 1));
>
> This can be used as a post-compilation check like
> `static/build/anchor-checker.php ./_site/"$HTML"` or what have you.
>
> --
> gwern
> https://www.gwern.net
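
For the post-compilation check mentioned at the end of the quoted message, a minimal shell sketch might look like the following. It is not part of gwern.net's build: the helper name `check_anchors` is hypothetical, and it assumes the built pages end in `.html` (adjust the `-name` pattern if yours have no extension). It relies only on what the quoted script itself shows: the checker accepts any number of file arguments, writes bad anchors to stderr, and exits nonzero if it found any.

```shell
#!/bin/sh
# Sketch only: run an anchor checker over every *.html file under a directory.
# check_anchors is a hypothetical helper name, not an existing tool.
#
# Usage: check_anchors SITE_DIR CHECKER [CHECKER_ARGS...]
check_anchors() {
    site_dir=$1
    shift
    # "{} +" passes many files to each checker invocation; find returns a
    # nonzero status if any invocation did, so the caller can stop on failure.
    find "$site_dir" -type f -name '*.html' -exec "$@" {} +
}
```

In a sync script this could be called right after the site is built, for example `check_anchors ./_site static/build/anchor-checker.php || exit 1` (both paths are assumptions), so that a broken anchor stops the run before anything is uploaded. Passing the checker as trailing arguments also allows `check_anchors ./_site php anchor-checker.php` when the script is not marked executable.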