How I copy HTML from browser to markdown by using pandoc

public inbox archive for pandoc-discuss@googlegroups.com

* How I copy HTML from browser to markdown by using pandoc
@ 2017-05-04 14:32 support1-ZohPw8X7yHTQT0dZR+AlfA
       [not found] ` <20170504143246.GB23510-vvHXCvOI15V+RnA8QueWCFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 7+ messages in thread
From: support1-ZohPw8X7yHTQT0dZR+AlfA @ 2017-05-04 14:32 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

I wish to share how I copy HTML snippets from web pages and convert
them directly into markdown files on the hard disk.

I often write text on third-party websites, and there is often text
authored by other people that is permitted to be copied and
distributed.

Yet, who wants all that HTML?

First I searched for a Firefox extension that copies HTML as
markdown, but the only one I found is
https://addons.mozilla.org/en-us/firefox/addon/copy-as-markdown/ and
it does not copy all the HTML.

My search on DuckDuckGo
https://duckduckgo.com/html?q=copy+html+as+markdown&t=gnu turned up
this website that converts HTML to markdown:
https://puppypaste.com/

And then I found the answer on
https://unix.stackexchange.com/questions/78395/save-html-from-clipboard-as-markdown-text
which points directly to this pandoc command:

xclip -o -selection clipboard -t text/html | pandoc -r html -w markdown
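
Not every application offers its selection as HTML, so it may help to
ask xclip for the list of available targets first. The following is
only a sketch: the function names has_html_target and
selection_to_markdown are made up for illustration, and the fallback to
plain text is an assumption about what one would want.

```shell
#!/bin/bash
# Sketch only: the function names are assumptions, not part of xclip.

# Succeed if the given X selection currently offers a text/html target.
has_html_target() {
    local selection="${1:-clipboard}"
    xclip -o -selection "$selection" -t TARGETS 2>/dev/null |
        grep -qx 'text/html'
}

# Print the selection as markdown, or as plain text when no HTML is offered.
selection_to_markdown() {
    local selection="${1:-clipboard}"
    if has_html_target "$selection"; then
        xclip -o -selection "$selection" -t text/html |
            pandoc -r html -w markdown
    else
        xclip -o -selection "$selection"
    fi
}
```

Calling selection_to_markdown primary would read the highlighted text
instead of the clipboard.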

I work mostly from the keyboard, as I am a user of the tiling window
manager StumpWM https://stumpwm.github.io/ which gives me good control
over windows.

So I have bound a key in the window manager to run a command that
converts the HTML data in the X clipboard to markdown files (this
will not work on Windows).

The configuration looks like this:

(define-key *root-map* (kbd "M") "exec save-html-as-markdown")

This means that when I press C-t followed by an uppercase M, the
program save-html-as-markdown is run in the background.

That small program is a piece of Lisp code that defines the
directory where the HTML snippets, converted to markdown, are to
be saved, and runs the pandoc command.

In this example I am using CLISP http://www.clisp.org as the Lisp
implementation, but it can easily be adapted to any programming
language that can save the output of the pandoc command to a file. I
save it into files named after the date and time.

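#!/home/data1/protected/bin/lisp

;; Assumes UIOP and Alexandria are already loaded (for example by the
;; wrapper named in the shebang line).

;; Return the current local time as a string like "2017-05-04-16:58:27".
(defun timestamp-filename ()
  (multiple-value-bind
        (second minute hour date month year day-of-week dst-p tz)
      (get-decoded-time)
    (declare (ignore day-of-week dst-p tz))
    (format nil "~d-~2,'0d-~2,'0d-~2,'0d:~2,'0d:~2,'0d"
            year month date hour minute second)))

;; Directory where the converted snippets are saved (note the trailing slash).
(defparameter *html-to-markdown-dir* "/home/data1/protected/Documents/HTML-Markdown/")

;; Convert the HTML in the primary X selection, save it, and open the file.
(let* ((filename (concatenate 'string (timestamp-filename) ".md"))
       (markdown (uiop:run-program "xclip -t text/html -selection primary -out | pandoc -r html -w commonmark" :output :string))
       (output (concatenate 'string *html-to-markdown-dir* filename)))
  (alexandria:write-string-into-file markdown output)
  (uiop:run-program (concatenate 'string "emacs-client-x " output)))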

I am sure somebody can write a much shorter and simpler script in
Bash, Python or a similar language that gives the same result.

It could be as simple as:

#!/bin/bash
# Name the file after the current ISO 8601 timestamp.
FILE="$(/bin/date -Iseconds).md"
xclip -t text/html -selection primary -out | pandoc -r html -w commonmark > "$FILE"
emacs "$FILE"

In my version, after the program runs, the GNU Emacs editor opens the
file that was just saved, for example
/home/data1/protected/Documents/HTML-Markdown/2017-05-04-16:58:27.md
so I may modify the file and also make sure that it
does exist.
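
A slightly more defensive Bash variant of the same idea might look
like the sketch below. It is an illustration, not the program I run:
the function names snippet_path and save_html_as_markdown and the
default directory are assumptions. It saves into a fixed directory,
uses the same date-and-time file name pattern as the Lisp version, and
refuses to create a file when the selection offers no HTML.

```shell
#!/bin/bash
# Sketch only: the function names and the default directory are assumptions.

# Build a path such as DIR/2017-05-04-16:58:27.md from the current time.
snippet_path() {
    local dir="${1%/}"
    printf '%s/%s.md\n' "$dir" "$(date '+%Y-%m-%d-%H:%M:%S')"
}

# Convert HTML from the primary X selection to CommonMark and save it.
# Prints the path of the new file; fails if there was nothing to save.
save_html_as_markdown() {
    local dir="${1:-$HOME/Documents/HTML-Markdown}"
    local html out
    html=$(xclip -t text/html -selection primary -out) || return 1
    [ -n "$html" ] || return 1
    mkdir -p "$dir" || return 1
    out=$(snippet_path "$dir")
    if printf '%s\n' "$html" | pandoc -r html -w commonmark > "$out"; then
        printf '%s\n' "$out"
    else
        rm -f "$out"
        return 1
    fi
}
```

It could then be used as: file=$(save_html_as_markdown) && emacs "$file"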

This way, whenever I write something on someone's blog or find HTML
content that I wish to reuse, I simply mark it, copy it and press C-t M,
which creates the markdown file on the disk.

Other window managers can do the same if they allow keyboard
customization.

Jean

* How I copy HTML from browser to markdown by using pandoc
       [not found] ` <20170504143246.GB23510-vvHXCvOI15V+RnA8QueWCFaTQe2KTcn/@public.gmane.org>
@ 2017-05-05  9:53   ` Kolen Cheung
       [not found]     ` <bb322584-959f-4fbe-aceb-1d87128487dd-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 7+ messages in thread
From: Kolen Cheung @ 2017-05-05  9:53 UTC (permalink / raw)
  To: pandoc-discuss

you might try http://heckyesmarkdown.com

* Re: How I copy HTML from browser to markdown by using pandoc
       [not found]     ` <bb322584-959f-4fbe-aceb-1d87128487dd-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2017-05-05 10:00       ` RCDRUN
       [not found]         ` <20170505100011.GA11734-vvHXCvOI15V+RnA8QueWCFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 7+ messages in thread
From: RCDRUN @ 2017-05-05 10:00 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

On Fri, May 05, 2017 at 02:53:21AM -0700, Kolen Cheung wrote:
> you might try http://heckyesmarkdown.com

Oh, you want me to spend time copying and pasting URLs into some website?
:-)

My short article is about HTML snippets, not full pages. Otherwise
it becomes tiresome to edit the Markdown.

Imagine you have 200 pages, like I collected yesterday; then the whole
process of

1. copy URL
2. paste URL
3. copy markdown
4. open editor
5. paste markdown
6. edit markdown as it is whole page
7. save file

is much longer than

1. copy HTML
2. press a key and the file is saved automatically
3. rarely edit anything

Jean

* Re: How I copy HTML from browser to markdown by using pandoc
       [not found]         ` <20170505100011.GA11734-vvHXCvOI15V+RnA8QueWCFaTQe2KTcn/@public.gmane.org>
@ 2017-05-06  0:55           ` Kolen Cheung
       [not found]             ` <58c1d809-a2c3-4a1e-a053-6b1f55da4576-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 7+ messages in thread
From: Kolen Cheung @ 2017-05-06  0:55 UTC (permalink / raw)
  To: pandoc-discuss

It is also scriptable. I made a script that uses it as a "web clipper" that converts all URLs into markdown pages.

But yes, you need snippets only, and I didn't explore that possibility since I haven't needed it yet. I would guess that it should be possible.

* Re: How I copy HTML from browser to markdown by using pandoc
       [not found]             ` <58c1d809-a2c3-4a1e-a053-6b1f55da4576-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2017-05-06  0:59               ` Kolen Cheung
       [not found]                 ` <5977ca6e-4cda-4548-a82a-c5fdabaa9368-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  2017-05-06  8:01               ` support1-ZohPw8X7yHTQT0dZR+AlfA
  1 sibling, 1 reply; 7+ messages in thread
From: Kolen Cheung @ 2017-05-06  0:59 UTC (permalink / raw)
  To: pandoc-discuss

I guess I need to explain why markdownify might be an alternative worth trying. In my experience web pages usually have a lot of noise (header, sidebar, ads, etc.) that pandoc will render in full. Markdownify uses another tool to strip that out and clean things up.

But again, you are talking about snippets, so that might not be an issue for you. Others who are interested in what you're saying might also be interested in that.

* Re: How I copy HTML from browser to markdown by using pandoc
       [not found]                 ` <5977ca6e-4cda-4548-a82a-c5fdabaa9368-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2017-05-06  8:03                   ` support1-ZohPw8X7yHTQT0dZR+AlfA
  0 siblings, 0 replies; 7+ messages in thread
From: support1-ZohPw8X7yHTQT0dZR+AlfA @ 2017-05-06  8:03 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

On Fri, May 05, 2017 at 05:59:49PM -0700, Kolen Cheung wrote:
> I guess I need to explain why markdownify might be an
> alternative worth trying. In my experience web pages usually
> have a lot of noise (header, sidebar, ads, etc.) that pandoc will
> render in full. Markdownify uses another tool to strip that out
> and clean things up.
> 
> But again, you are talking about snippets, so that might not be an issue for you. Others who are interested in what you're saying might also be interested in that.

Which is that other tool? Is it a command-line tool?

If an application is web based, it is not good for me to send data to
a third party and receive data back. Who wants to rely on websites?!

Self-run commands do it best.

Jean


* Re: How I copy HTML from browser to markdown by using pandoc
       [not found]             ` <58c1d809-a2c3-4a1e-a053-6b1f55da4576-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  2017-05-06  0:59               ` Kolen Cheung
@ 2017-05-06  8:01               ` support1-ZohPw8X7yHTQT0dZR+AlfA
  1 sibling, 0 replies; 7+ messages in thread
From: support1-ZohPw8X7yHTQT0dZR+AlfA @ 2017-05-06  8:01 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

On Fri, May 05, 2017 at 05:55:25PM -0700, Kolen Cheung wrote:
> It is also scriptable. I made a script that uses it as a "web clipper" that converts all URLs into markdown pages.
> 
> But yes, you need snippets only, and I didn't explore that possibility since I haven't needed it yet. I would guess that it should be possible.

Is it this one?

https://github.com/Elephant418/Markdownify

Currently I have no need for such software. In case I need to strip or
improve HTML, there is http://www.html-tidy.org

A command like

elinks -source 1  http://localhost | tidy -bare -clean | pandoc -f html -t commonmark

may work fine for my needs. It fetches the HTML by using the elinks
browser: http://elinks.cz/community.html

Then, by using wget -Emk http://example.com it is possible to download
a whole website with links changed to local links and, with some bash
script, HTML Tidy and pandoc, to process all files of the website into
markdown.
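
Such a bash script could be sketched roughly as follows. This is an
untested idea rather than something I use: the function names
markdown_name and convert_tree are made up, it assumes the site was
mirrored into a local directory, and it assumes file names contain no
newlines.

```shell
#!/bin/bash
# Sketch only: function names and directory layout are assumptions.

# Map site/page.html (or .htm) to site/page.md.
markdown_name() {
    printf '%s.md\n' "${1%.*}"
}

# Convert every HTML file below a directory, writing the .md next to it.
convert_tree() {
    local root="$1" html
    find "$root" -type f \( -name '*.html' -o -name '*.htm' \) -print |
    while IFS= read -r html; do
        # tidy exits non-zero on mere warnings, so its status is ignored.
        tidy -quiet -bare -clean "$html" 2>/dev/null |
            pandoc -f html -t commonmark -o "$(markdown_name "$html")"
    done
}
```

After wget -Emk http://example.com this would be run as
convert_tree example.com. Links inside the generated markdown would
still point to the .html files.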

I am sure this type of text filtering and processing may well be used
to convert dynamic websites into static websites, such as
converting a WordPress website to a static website.

I can even think of automatically creating a Makefile or loading the
files into a database, along with the category slugs and file names.

Jean
