* Extracting hyperlinks with Pandoc API
@ 2010-11-30 20:15 Gwern Branwen
From: Gwern Branwen @ 2010-11-30 20:15 UTC
To: pandoc-discuss
In order to feed my archiver tool
(http://hackage.haskell.org/package/archiver), I've written a small
utility that extracts external links from Markdown files. It might be
useful for other purposes as well (checking whether the links still
resolve, for example). It's a short program:
import System.Environment (getArgs)
import Text.Pandoc (defaultParserState, processWithM, readMarkdown,
                    Inline(Link), Pandoc)

main :: IO ()
main = getArgs >>= mapM readFile >>= mapM_ analyzePage

analyzePage :: String -> IO Pandoc
analyzePage x = processWithM printLinks (readMarkdown defaultParserState x)

-- Print the target URL of every Link; return each inline unchanged.
printLinks :: Inline -> IO Inline
printLinks l@(Link _ (x, _)) = putStrLn x >> return l
printLinks x                 = return x
You would use it something like this:
$ find . -name "*.page" -exec link-extractor {} \; | sort | grep -v '\!' >> ~/.urls.txt
(The sort is there for aesthetics, and the grep is there to filter out
interwiki links, whose targets start with '!'. My archiver program then
runs on the resulting file.)
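The sort/grep stage can be tried in isolation with fabricated extractor
output (the URLs and the interwiki target below are made up for the
demonstration; link-extractor itself isn't needed):

```shell
# Simulated link-extractor output: one interwiki link, two ordinary URLs.
# The grep drops the '!'-prefixed interwiki target; sort orders the rest.
printf '%s\n' 'http://b.example/page' '!Wikipedia "Haskell"' 'http://a.example/' \
  | sort | grep -v '!'
```

This prints only the two ordinary URLs, in sorted order; the backslash
before '!' in the original command is just shell-quoting caution and is
not needed inside single quotes.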
--
gwern
http://www.gwern.net
* Re: Extracting hyperlinks with Pandoc API
@ 2010-11-30 21:05 John MacFarlane
From: John MacFarlane @ 2010-11-30 21:05 UTC
To: pandoc-discuss
You could also use queryWith for this (untested, but
should give the basic idea):
import Data.List (sort)
-- sort and queryWith both need importing; queryWith is re-exported
-- through Text.Pandoc.
import Text.Pandoc (queryWith, Inline(Link), Pandoc)

qLink :: Inline -> [String]
qLink (Link _ (x, _)) = [x]
qLink _               = []

findLinks :: Pandoc -> [String]
findLinks = sort . queryWith qLink
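queryWith walks the entire document and combines the query function's
results monoidally. The real traversal lives in pandoc; as a
self-contained sketch of the same pattern, here is a hand-rolled walker
over a toy Inline type (the type and URLs are invented for illustration,
so this runs without pandoc installed):

```haskell
import Data.List (sort)

-- Toy stand-in for pandoc's Inline; Link carries a (url, title) target,
-- mirroring the (x, _) pattern matched in the thread's code.
data Inline = Str String
            | Emph [Inline]
            | Link [Inline] (String, String)
            deriving Show

-- Collect the query's results from every node, concatenating them,
-- the way queryWith folds a monoidal answer over the whole tree.
query :: (Inline -> [a]) -> [Inline] -> [a]
query f = concatMap go
  where go i = f i ++ case i of
          Emph xs   -> query f xs
          Link xs _ -> query f xs
          _         -> []

qLink :: Inline -> [String]
qLink (Link _ (u, _)) = [u]
qLink _               = []

findLinks :: [Inline] -> [String]
findLinks = sort . query qLink

main :: IO ()
main = mapM_ putStrLn $ findLinks
  [ Str "see "
  , Link [Str "b"] ("http://b.example", "")
  , Emph [Link [Str "a"] ("http://a.example", "")]
  ]
```

With pandoc itself no per-constructor cases are needed: queryWith uses a
generic (Data.Generics-style) traversal, so qLink alone is enough.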
John