* How I copy HTML from browser to markdown by using pandoc
From: support1-ZohPw8X7yHTQT0dZR+AlfA @ 2017-05-04 14:32 UTC
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

I wish to share how I copy HTML snippets from web pages and convert them directly into markdown files on the hard disk. I often write text on third-party websites, and often there is text that was authored by other people and may be freely copied and distributed. Yet, who wants all that HTML?

First I searched for a Firefox extension that can copy HTML as markdown, but all I found was this one: https://addons.mozilla.org/en-us/firefox/addon/copy-as-markdown/ and it does not copy all the HTML.

My search on DuckDuckGo https://duckduckgo.com/html?q=copy+html+as+markdown&t=gnu turned up this website that converts HTML to markdown: https://puppypaste.com/

And then I found the answer on https://unix.stackexchange.com/questions/78395/save-html-from-clipboard-as-markdown-text which refers directly to pandoc's command:

    xclip -o -selection clipboard -t text/html | pandoc -r html -w markdown

I work mostly from the keyboard, as I am a user of the tiling window manager StumpWM https://stumpwm.github.io/ which gives me good control over windows. So I have bound one key in the window manager to run a command that converts the HTML data in the X clipboard to a markdown file (this is not going to work on Windows).

The configuration looks like this:

    (define-key *root-map* (kbd "M") "exec save-html-as-markdown")

This means that when I press the keys C-t followed by uppercase M, the program save-html-as-markdown is run in the background.

The small program that runs in the background is a piece of Lisp code. It defines the directory where the HTML snippets, converted to markdown, are saved, and it runs the pandoc command. In this example I am using CLISP as the Lisp implementation http://www.clisp.org but it can easily be adapted to any programming language that can save the output of the pandoc command to a file. The files are named after the date and time.

    #!/home/data1/protected/bin/lisp

    ;; Build a file name stem such as 2017-05-04-16:58:27 from the local time.
    (defun timestamp-filename ()
      (multiple-value-bind (second minute hour date month year)
          (get-decoded-time)
        (format nil "~d-~2,'0d-~2,'0d-~2,'0d:~2,'0d:~2,'0d"
                year month date hour minute second)))

    ;; Directory where the converted snippets are saved.
    (defparameter *html-to-markdown-dir*
      "/home/data1/protected/Documents/HTML-Markdown/")

    (let* ((filename (concatenate 'string (timestamp-filename) ".md"))
           ;; Read the HTML in the primary selection and convert it with pandoc.
           (markdown (uiop:run-program
                      "xclip -t text/html -selection primary -out | pandoc -r html -w commonmark"
                      :output :string))
           (output (concatenate 'string *html-to-markdown-dir* filename)))
      (alexandria:write-string-into-file markdown output)
      ;; Open the saved file in the editor.
      (uiop:run-program (concatenate 'string "emacs-client-x " output)))

I am sure somebody can write a much shorter script in Bash, Python or something similar that gives the same result. It could be as simple as:

    #!/bin/bash
    FILE="$(/bin/date -Iseconds).md"
    xclip -t text/html -selection primary -out | pandoc -r html -w commonmark > "$FILE"
    emacs "$FILE"

In my version, after the program runs, GNU Emacs opens the file that was saved, for example /home/data1/protected/Documents/HTML-Markdown/2017-05-04-16:58:27.md, so I can modify it and also make sure that the file exists.

This way, for anything that I write on someone's blog, or any HTML content that I wish to reuse, I simply mark, copy and press the keys C-t M, which creates the markdown file on the disk.

Other window managers can do the same if they allow keyboard customization.

Jean
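For readers who prefer Python, the same idea might be sketched roughly as follows. This is only an illustration under assumptions, not the script from the message above: the output directory, the function names (timestamp_filename, selection_to_markdown, save_snippet, main) and the choice to open plain emacs are made up for the example. The xclip and pandoc invocations mirror the ones quoted in the message, with pandoc's equivalent -f/-t spelling of -r/-w.

```python
#!/usr/bin/env python3
"""Save the HTML in the X primary selection as a CommonMark file (sketch)."""
import subprocess
from datetime import datetime
from pathlib import Path

# Assumption: a directory chosen for this example; change it to taste.
OUTPUT_DIR = Path.home() / "Documents" / "HTML-Markdown"


def timestamp_filename(now):
    """Return a name such as 2017-05-04-16:58:27.md for the given datetime."""
    return now.strftime("%Y-%m-%d-%H:%M:%S") + ".md"


def selection_to_markdown():
    """Read text/html from the primary selection and convert it with pandoc."""
    html = subprocess.run(
        ["xclip", "-o", "-selection", "primary", "-t", "text/html"],
        check=True, capture_output=True,
    ).stdout
    converted = subprocess.run(
        ["pandoc", "-f", "html", "-t", "commonmark"],
        input=html, check=True, capture_output=True,
    ).stdout
    return converted.decode("utf-8")


def save_snippet(markdown, directory, now):
    """Write the markdown into directory under a timestamped name."""
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / timestamp_filename(now)
    target.write_text(markdown, encoding="utf-8")
    return target


def main():
    """Convert the current selection, save it, and open it in the editor."""
    path = save_snippet(selection_to_markdown(), OUTPUT_DIR, datetime.now())
    subprocess.run(["emacs", str(path)], check=False)

# When installed as the program bound to the key, end the file with: main()
```

Keeping the file naming and saving apart from the xclip and pandoc calls means those parts can be tried out on a machine without an X session.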
* How I copy HTML from browser to markdown by using pandoc
From: Kolen Cheung @ 2017-05-05 9:53 UTC
To: pandoc-discuss

You might try http://heckyesmarkdown.com
* Re: How I copy HTML from browser to markdown by using pandoc
From: RCDRUN @ 2017-05-05 10:00 UTC
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

On Fri, May 05, 2017 at 02:53:21AM -0700, Kolen Cheung wrote:
> you might try http://heckyesmarkdown.com

Oh, you want me to spend time copy-pasting URLs into some website? :-)

My short article is about HTML snippets, not full pages. Otherwise it becomes tiresome to edit the Markdown.

Imagine you have 200 pages, like I collected yesterday. Then the whole process of

1. copy URL
2. paste URL
3. copy markdown
4. open editor
5. paste markdown
6. edit markdown, as it is the whole page
7. save file

is much longer than

1. copy HTML
2. press key and the file is saved automatically
3. rarely edit anything

Jean
* Re: How I copy HTML from browser to markdown by using pandoc
From: Kolen Cheung @ 2017-05-06 0:55 UTC
To: pandoc-discuss

It is also scriptable. I made a script that uses it as a "web clipper" that converts all URLs into markdown pages.

But yes, you need snippets only, and I didn't explore that possibility since I haven't needed it yet. But I would guess that it should be possible.
* Re: How I copy HTML from browser to markdown by using pandoc
From: Kolen Cheung @ 2017-05-06 0:59 UTC
To: pandoc-discuss

I guess I should explain more why markdownify might be an alternative worth trying. In my experience, webpages usually contain a lot of noise (header, sidebar, ads, etc.), and pandoc will render all of it. Markdownify uses another tool to strip that out and clean things up.

But again, you are talking about snippets, so that might not be an issue for you. Others who are interested in what you're saying might also be interested in that.
* Re: How I copy HTML from browser to markdown by using pandoc
From: support1-ZohPw8X7yHTQT0dZR+AlfA @ 2017-05-06 8:03 UTC
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

On Fri, May 05, 2017 at 05:59:49PM -0700, Kolen Cheung wrote:
> I guess I might need to explain more why markdownify might be an
> alternative one wanted to try. From my experience webpages usually
> has a lot of noises where pandoc will fully render them (header,
> sidebar, ads, etc.). Markdownify has used another tool to strip out
> and clean things up.
>
> But again you are talking about snippets, so that might not be an issue to you. But others who are interested in what you're saying might also interested in that.

Which is that other tool? Is it a command-line tool?

If the application is web based, it is not good for me: I would have to send data to a third party and receive data back. Who wants to rely on websites?!

Self-run commands do it best.

Jean
* Re: How I copy HTML from browser to markdown by using pandoc
From: support1-ZohPw8X7yHTQT0dZR+AlfA @ 2017-05-06 8:01 UTC
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

On Fri, May 05, 2017 at 05:55:25PM -0700, Kolen Cheung wrote:
> It is also scriptable. I made a script that uses that as "web clipper" that converts all urls into markdown pages.
>
> But yes you need snippet only, and I didn't explore that possibility since I didn't need that yet. But I would guess that it should be possible.

Is it this one? https://github.com/Elephant418/Markdownify

Currently I have no need for such software. In case I need to strip or improve HTML, there is http://www.html-tidy.org and a command like

    elinks -source 1 http://localhost | tidy -bare -clean | pandoc -f html -t commonmark

may work fine for my needs. It fetches the HTML by using the elinks browser: http://elinks.cz/community.html

Then, by using

    wget -Emk http://example.com

it is possible to download a whole website with its links changed to local links and, with some Bash script, HTML Tidy and pandoc, to process all files of the website into markdown.

I am sure this type of text filtering and processing can be put to good use converting dynamic websites into static ones, for example converting a WordPress website to a static website. I can even think of automatically creating a Makefile, or of loading the files into a database along with the category slugs and file names.

Jean
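The batch step described above (run every downloaded page through HTML Tidy and pandoc) might be sketched in Python roughly like this. It is a sketch under assumptions, not a tested migration tool: the function names (markdown_path, html_to_markdown, convert_tree) and the directory layout are invented for the example, and it relies on wget -E having given the saved pages an .html extension. The tidy and pandoc options are the ones from the message, plus tidy's -quiet.

```python
import subprocess
from pathlib import Path


def markdown_path(html_file, source_root, target_root):
    """Map source_root/a/b.html to target_root/a/b.md."""
    return (target_root / html_file.relative_to(source_root)).with_suffix(".md")


def html_to_markdown(html):
    """Clean HTML bytes with HTML Tidy, then convert them with pandoc."""
    tidied = subprocess.run(
        ["tidy", "-quiet", "-bare", "-clean"],
        input=html, capture_output=True,
    )
    # tidy exits with 1 when it only printed warnings and with 2 on errors.
    if tidied.returncode > 1:
        raise RuntimeError(tidied.stderr.decode("utf-8", "replace"))
    return subprocess.run(
        ["pandoc", "-f", "html", "-t", "commonmark"],
        input=tidied.stdout, check=True, capture_output=True,
    ).stdout


def convert_tree(source_root, target_root, convert=html_to_markdown):
    """Convert every .html file below source_root, mirroring the layout."""
    written = []
    for html_file in sorted(source_root.rglob("*.html")):
        target = markdown_path(html_file, source_root, target_root)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(convert(html_file.read_bytes()))
        written.append(target)
    return written
```

After a run of wget -Emk http://example.com, a call such as convert_tree(Path("example.com"), Path("markdown")) would be the hypothetical entry point. The convert parameter exists only so that the directory walk can be tried without tidy or pandoc installed.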