public inbox archive for pandoc-discuss@googlegroups.com
* Extracting hyperlinks with Pandoc API
@ 2010-11-30 20:15 Gwern Branwen
From: Gwern Branwen @ 2010-11-30 20:15 UTC (permalink / raw)
  To: pandoc-discuss

In order to feed my archiver tool
(http://hackage.haskell.org/package/archiver), I've written a small
tool to extract external links from markdown files. This might be
useful for other purposes (checking whether they exist, for example).
It's a short program:

import System.Environment (getArgs)
import Text.Pandoc (defaultParserState, processWithM, readMarkdown,
                    Inline(Link), Pandoc)

main :: IO ()
main = getArgs >>= mapM readFile >>= mapM_ analyzePage

analyzePage :: String -> IO Pandoc
analyzePage x = processWithM printLinks (readMarkdown defaultParserState x)

-- Print each link's target URL; return the inline unchanged (rather than
-- undefined) so the traversal rebuilds a well-formed document.
printLinks :: Inline -> IO Inline
printLinks l@(Link _ (x, _)) = putStrLn x >> return l
printLinks x                 = return x
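
Assuming the program is saved as link-extractor.hs (the file name is
arbitrary), compiling it with GHC yields the link-extractor binary invoked
below:

$ ghc --make link-extractor.hs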

You would use it something like this:

$ find . -name "*.page" -exec link-extractor {} \; | sort | grep -v '\!' >> ~/.urls.txt

(The sort is there for aesthetics, and the grep filters out the interwiki
links, whose targets contain a '!'. My archiver program then runs on the
accumulated ~/.urls.txt.)

-- 
gwern
http://www.gwern.net



* Re: Extracting hyperlinks with Pandoc API
From: John MacFarlane @ 2010-11-30 21:05 UTC (permalink / raw)
  To: pandoc-discuss

You could also use queryWith for this (untested, but
should give the basic idea):

qLink :: Inline -> [String]
qLink (Link _ (x,_)) = [x]
qLink _              = []

findLinks :: Pandoc -> [String]
findLinks = sort . queryWith qLink
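
Spelled out as a complete program in the same style as Gwern's (still
untested; it assumes queryWith is exported from Text.Pandoc alongside
processWithM):

import Data.List (sort)
import System.Environment (getArgs)
import Text.Pandoc (defaultParserState, queryWith, readMarkdown,
                    Inline(Link), Pandoc)

qLink :: Inline -> [String]
qLink (Link _ (x,_)) = [x]
qLink _              = []

findLinks :: Pandoc -> [String]
findLinks = sort . queryWith qLink

-- Parse each file named on the command line and print its link targets,
-- sorted; unlike processWithM, no rebuilt document is produced.
main :: IO ()
main = getArgs >>= mapM readFile >>=
       mapM_ (mapM_ putStrLn . findLinks . readMarkdown defaultParserState)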

John



