* Converting HTML to markdown while leaving YAML frontmatter intact
@ 2017-12-02 15:40 Gabriel Birke
[not found] ` <ff13d84d-d30d-4bc4-8eb6-3a47f3190ece-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
0 siblings, 1 reply; 9+ messages in thread
From: Gabriel Birke @ 2017-12-02 15:40 UTC (permalink / raw)
To: pandoc-discuss
I have a static site generator that allows content to be either HTML or
Markdown, as long as it has a "YAML Frontmatter" section, separated by
three dashes from the rest of the content. HTML example:
---
title: My blog post
tags: [blog, important, internal]
---
<p>First paragraph of the post</p>
Now I'd like to use pandoc to convert some older HTML posts to markdown for
easier editing. However, the frontmatter gets mangled in the process: The
dashes get escaped and everything is put on one line. Is there some easy
way to preserve the front matter? Or do I have to resort to shell
scripting, splitting frontmatter from content, processing the content with
pandoc and then joining the pandoc output with the frontmatter?
Cheers,
Gabriel
--
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To post to this group, send email to pandoc-discuss-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/ff13d84d-d30d-4bc4-8eb6-3a47f3190ece%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
* Re: Converting HTML to markdown while leaving YAML frontmatter intact
[not found] ` <ff13d84d-d30d-4bc4-8eb6-3a47f3190ece-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2017-12-02 16:21 ` 'Jason Seeley' via pandoc-discuss
[not found] ` <912cdd88-2a94-43a0-a70f-815b948dbbd1-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
2017-12-02 18:38 ` John MacFarlane
` (2 subsequent siblings)
3 siblings, 1 reply; 9+ messages in thread
From: 'Jason Seeley' via pandoc-discuss @ 2017-12-02 16:21 UTC (permalink / raw)
To: pandoc-discuss
Try setting the input format to markdown (-f markdown, or possibly -f
markdown+raw_html). Standard markdown allows raw HTML in its content, so it
should pick up the front matter and still read the rest just fine.
-Jason
* Re: Converting HTML to markdown while leaving YAML frontmatter intact
[not found] ` <912cdd88-2a94-43a0-a70f-815b948dbbd1-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2017-12-02 16:42 ` Gabriel Birke
0 siblings, 0 replies; 9+ messages in thread
From: Gabriel Birke @ 2017-12-02 16:42 UTC (permalink / raw)
To: pandoc-discuss
Nope, using markdown as the input format does not convert the HTML to
markdown; it leaves the HTML intact and removes the frontmatter.
On Saturday, December 2, 2017 at 17:21:58 UTC+1, Jason Seeley wrote:
>
> Try setting the input format to markdown (-f markdown, or possibly -f
> markdown+raw_html). Standard markdown allows raw HTML in its content, so it
> should pick up the front matter and still read the rest just fine.
>
> -Jason
* Re: Converting HTML to markdown while leaving YAML frontmatter intact
[not found] ` <ff13d84d-d30d-4bc4-8eb6-3a47f3190ece-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
2017-12-02 16:21 ` 'Jason Seeley' via pandoc-discuss
@ 2017-12-02 18:38 ` John MacFarlane
2017-12-04 11:23 ` BP Jonsson
2017-12-04 17:17 ` John O'Regan
3 siblings, 0 replies; 9+ messages in thread
From: John MacFarlane @ 2017-12-02 18:38 UTC (permalink / raw)
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw
YAML metadata is a markdown extension for pandoc. So, if
you have YAML front matter in an HTML file, you'll need to
strip it off before passing to pandoc.
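A minimal sketch of that strip-convert-rejoin approach, in Python rather than shell. The function names are made up for illustration, and it assumes the front matter starts on line 1 with `---`, ends at the next `---` line, and that `pandoc` is on the PATH:

```python
import subprocess


def split_front_matter(text):
    """Split text into (front_matter, body).

    Assumes the front matter opens with `---` on the first line and closes
    at the next line consisting only of `---`. If either marker is missing,
    the front matter is empty and the whole text is returned as the body.
    """
    lines = text.splitlines(keepends=True)
    if not lines or lines[0].rstrip() != "---":
        return "", text
    for index in range(1, len(lines)):
        if lines[index].rstrip() == "---":
            front = "".join(lines[: index + 1])
            body = "".join(lines[index + 1 :])
            return front, body
    # Opening marker without a closing one: treat everything as body.
    return "", text


def html_to_markdown(html):
    """Convert an HTML fragment to markdown by piping it through pandoc."""
    result = subprocess.run(
        ["pandoc", "--from", "html", "--to", "markdown"],
        input=html,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def convert_post(text):
    """Convert only the body and re-attach the front matter unchanged."""
    front, body = split_front_matter(text)
    markdown = html_to_markdown(body)
    return front + "\n" + markdown if front else markdown
```

Only the body ever reaches pandoc, so the dashes in the front matter are never seen by the HTML reader and cannot be escaped or reflowed.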
* Re: Converting HTML to markdown while leaving YAML frontmatter intact
[not found] ` <ff13d84d-d30d-4bc4-8eb6-3a47f3190ece-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
2017-12-02 16:21 ` 'Jason Seeley' via pandoc-discuss
2017-12-02 18:38 ` John MacFarlane
@ 2017-12-04 11:23 ` BP Jonsson
2017-12-04 17:17 ` John O'Regan
3 siblings, 0 replies; 9+ messages in thread
From: BP Jonsson @ 2017-12-04 11:23 UTC (permalink / raw)
To: Gabriel Birke, pandoc-discuss
On systems with make and perl installed (Unix, Linux, Mac...) this makefile does the trick:
``````makefile
# Converts HTML files with a YAML header starting on the first line and
# ending on the first other line with a `---` marker to Markdown, while
# copying the YAML header to the Markdown file.
# USAGE:
#
# make [HTMLSRC='foo.html bar.html'] [PDCOPTS='<extra pandoc args>']
#
# HTMLSRC defaults to globbing all *.html files
HTMLSRC ?= $(wildcard *.html)
MDTARGETS := $(HTMLSRC:%.html=%.md)
.PHONY: html2md
html2md: $(MDTARGETS)
# The first invocation of perl loops over the source file and
# prints out the YAML header and a blank line then stops,
# its output being redirected to the target file.
#
# The second invocation of perl also loops over the source file but
# skips the YAML header printing out the rest, which
# is piped through pandoc for conversion to Markdown, which is then
# appended to the target file.
$(MDTARGETS): %.md: %.html
	@perl -ne'last unless 1 ... /^---\s*$$/; print $$_; END {print "\n"}' <$< >$@
	@perl -ne'print unless 1 ... /^---\s*$$/' <$< | pandoc -r html -w markdown $(PDCOPTS) >>$@
``````
Beware of mail programs that convert tabs to spaces! The whitespace at the start of each of the last two lines must be a single tab.
IMPORTANT: The YAML block in the *.html files must start on line 1 for this to work, since your YAML blocks both start and end with `---`.
There is a reason why the start and end markers differ in compliant YAML: if the block had ended in `...`, a range of
`1 ... /^\.\.\.\s*$/` would have covered every line from the first one up to the `...`, including the `---` at the top of the YAML and any lines before it.
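To make the line-1 requirement concrete, here is a rough Python emulation of how I understand the `1 ... /pattern/` range to behave. The function name and sample inputs are made up for illustration, and this is a sketch of the semantics, not a replacement for the perl one-liners:

```python
import re


def header_line_count(lines, end_marker=r"^---\s*$"):
    """Count the lines that a `1 ... /end_marker/` range would cover.

    The range switches on at line 1 whatever that line contains, and the
    end marker is only tested from line 2 onwards, so a `---` on line 1
    does not close the range immediately.
    """
    for number, line in enumerate(lines, start=1):
        if number > 1 and re.search(end_marker, line):
            return number  # the closing line is part of the range
    return len(lines)  # no end marker: the range runs to the end of input


good = ["---", "title: My blog post", "---", "<p>First paragraph of the post</p>"]
late = ["<!-- note -->", "---", "title: My blog post", "---", "<p>First paragraph of the post</p>"]

print(header_line_count(good))  # 3: the whole YAML block
print(header_line_count(late))  # 2: stops at the opening ---, YAML leaks into the body
```

In the second case the range closes at the opening `---`, so the YAML itself would be piped through pandoc along with the HTML.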
PS: To preserve metadata when converting with `--to markdown`, use the `--standalone`/`-s` option,
i.e. PDCOPTS='--standalone'. If you do that you will get a metadata block from pandoc as well,
and if both the original YAML block and the HTML contain a title, for example, you will get
a duplicate. However, the original YAML block will win, as it comes before the one pandoc generated.
/bpj
* Re: Converting HTML to markdown while leaving YAML frontmatter intact
[not found] ` <ff13d84d-d30d-4bc4-8eb6-3a47f3190ece-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
` (2 preceding siblings ...)
2017-12-04 11:23 ` BP Jonsson
@ 2017-12-04 17:17 ` John O'Regan
[not found] ` <5d2adaaf-c7f5-41b5-a161-ba8fd65e32d3-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
3 siblings, 1 reply; 9+ messages in thread
From: John O'Regan @ 2017-12-04 17:17 UTC (permalink / raw)
To: pandoc-discuss
Hi Gabriel,
For HTML files, why not wrap the YAML in an HTML comment?
John
* Re: Converting HTML to markdown while leaving YAML frontmatter intact
[not found] ` <5d2adaaf-c7f5-41b5-a161-ba8fd65e32d3-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2017-12-05 12:10 ` John O'Regan
[not found] ` <CALwjjMiOR=Mqnn9Ruh+KH-XYBbUBT7cB4XjNgBaxo+QRA2fOBQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
0 siblings, 1 reply; 9+ messages in thread
From: John O'Regan @ 2017-12-05 12:10 UTC (permalink / raw)
To: pandoc-discuss
PS: Please tell me more about your static site generator, Gabriel.
Can I download it? Is it on GitHub?
* Re: Converting HTML to markdown while leaving YAML frontmatter intact
[not found] ` <CALwjjMiOR=Mqnn9Ruh+KH-XYBbUBT7cB4XjNgBaxo+QRA2fOBQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-12-05 19:15 ` mb21
[not found] ` <14dfcbde-85d8-4c11-a970-0c653622fc20-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
0 siblings, 1 reply; 9+ messages in thread
From: mb21 @ 2017-12-05 19:15 UTC (permalink / raw)
To: pandoc-discuss
@Gabriel you could try pretending that it's markdown with raw HTML mixed
in: run `pandoc -f markdown` on it.
@John O'Regan, try https://jekyllrb.com
On Tuesday, December 5, 2017 at 1:11:01 PM UTC+1, John O'Regan wrote:
>
> PS: Please tell me more about your static site generator, Gabriel.
> Can I download it? Is it on GitHub?
>
* Re: Converting HTML to markdown while leaving YAML frontmatter intact
[not found] ` <14dfcbde-85d8-4c11-a970-0c653622fc20-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2017-12-05 22:07 ` John O'Regan
0 siblings, 0 replies; 9+ messages in thread
From: John O'Regan @ 2017-12-05 22:07 UTC (permalink / raw)
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw
Hi Mauro,
I've heard of Jekyll, thanks! :) I'm interested in static site/blog
generators in general. I was curious to see Gabriel's approach. I'm
trying to write a WordPress.com blogging client for Windows users. I
want it to be a stand-alone script that can run on Windows XP upwards.
Not as straightforward as I previously thought!
John