public inbox archive for pandoc-discuss@googlegroups.com
* Converting HTML to markdown while leaving YAML frontmatter intact
@ 2017-12-02 15:40 Gabriel Birke
       [not found] ` <ff13d84d-d30d-4bc4-8eb6-3a47f3190ece-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: Gabriel Birke @ 2017-12-02 15:40 UTC (permalink / raw)
  To: pandoc-discuss


I have a static site generator that allows content to be either HTML or 
Markdown, as long as it has a "YAML frontmatter" section, separated by 
three dashes from the rest of the content. HTML example:

---
title: My blog post
tags: [blog, important, internal]
---
<p>First paragraph of the post</p>


Now I'd like to use pandoc to convert some older HTML posts to markdown for 
easier editing. However, the frontmatter gets mangled in the process: The 
dashes get escaped and everything is put on one line. Is there some easy 
way to preserve the front matter? Or do I have to resort to shell 
scripting, splitting frontmatter from content, processing the content with 
pandoc and then joining the pandoc output with the frontmatter?

Cheers,

Gabriel

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To post to this group, send email to pandoc-discuss-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/ff13d84d-d30d-4bc4-8eb6-3a47f3190ece%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Converting HTML to markdown while leaving YAML frontmatter intact
       [not found] ` <ff13d84d-d30d-4bc4-8eb6-3a47f3190ece-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2017-12-02 16:21   ` 'Jason Seeley' via pandoc-discuss
       [not found]     ` <912cdd88-2a94-43a0-a70f-815b948dbbd1-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  2017-12-02 18:38   ` John MacFarlane
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 9+ messages in thread
From: 'Jason Seeley' via pandoc-discuss @ 2017-12-02 16:21 UTC (permalink / raw)
  To: pandoc-discuss


Try setting the input format to markdown (-f markdown, or possibly -f 
markdown+raw_html). Standard markdown allows raw HTML in its content, so it 
should pick up the front matter and still read the rest just fine.

-Jason



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Converting HTML to markdown while leaving YAML frontmatter intact
       [not found]     ` <912cdd88-2a94-43a0-a70f-815b948dbbd1-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2017-12-02 16:42       ` Gabriel Birke
  0 siblings, 0 replies; 9+ messages in thread
From: Gabriel Birke @ 2017-12-02 16:42 UTC (permalink / raw)
  To: pandoc-discuss


Nope, using markdown as the input format does not convert the HTML to markdown; 
it leaves the HTML intact and removes the frontmatter.

On Saturday, 2 December 2017 at 17:21:58 UTC+1, Jason Seeley wrote:
>
> Try setting the input format to markdown (-f markdown, or possibly -f 
> markdown+raw_html). Standard markdown allows raw HTML in its content, so it 
> should pick up the front matter and still read the rest just fine.
>
> -Jason
>


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Converting HTML to markdown while leaving YAML frontmatter intact
       [not found] ` <ff13d84d-d30d-4bc4-8eb6-3a47f3190ece-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  2017-12-02 16:21   ` 'Jason Seeley' via pandoc-discuss
@ 2017-12-02 18:38   ` John MacFarlane
  2017-12-04 11:23   ` BP Jonsson
  2017-12-04 17:17   ` John O'Regan
  3 siblings, 0 replies; 9+ messages in thread
From: John MacFarlane @ 2017-12-02 18:38 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

YAML metadata is a markdown extension for pandoc.  So, if
you have YAML front matter in an HTML file, you'll need to
strip it off before passing to pandoc.
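
The strip-then-convert approach described above can be sketched in Python. This is only a minimal illustration and not something posted in the thread: the function names `split_front_matter`, `html_to_markdown`, and `convert_post` are made up here. The sketch assumes the front matter starts on line 1 and ends at the next `---` line, and that `pandoc` is on the PATH.

```python
import subprocess
from typing import Tuple


def split_front_matter(text: str) -> Tuple[str, str]:
    """Split a document into (front_matter, body).

    Assumption: the front matter opens with a `---` line on line 1 and
    closes at the next `---` line. Both delimiter lines stay in the header.
    If no complete block is found, the header is empty and the body is the
    whole text.
    """
    lines = text.splitlines(keepends=True)
    if not lines or lines[0].rstrip() != "---":
        return "", text
    for i in range(1, len(lines)):
        if lines[i].rstrip() == "---":
            return "".join(lines[: i + 1]), "".join(lines[i + 1:])
    return "", text  # opening marker without a closing one


def html_to_markdown(html: str) -> str:
    """Pipe HTML through `pandoc -f html -t markdown`."""
    result = subprocess.run(
        ["pandoc", "-f", "html", "-t", "markdown"],
        input=html,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def convert_post(text: str) -> str:
    """Convert only the HTML body and re-attach the untouched front matter."""
    front, body = split_front_matter(text)
    markdown = html_to_markdown(body)
    if not front:
        return markdown
    # Exactly one blank line between the header and the converted body.
    return front.rstrip("\n") + "\n\n" + markdown
```

With a hypothetical file `post.html`, `convert_post(open("post.html").read())` would return the original header followed by pandoc's Markdown for the body.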



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Converting HTML to markdown while leaving YAML frontmatter intact
       [not found] ` <ff13d84d-d30d-4bc4-8eb6-3a47f3190ece-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  2017-12-02 16:21   ` 'Jason Seeley' via pandoc-discuss
  2017-12-02 18:38   ` John MacFarlane
@ 2017-12-04 11:23   ` BP Jonsson
  2017-12-04 17:17   ` John O'Regan
  3 siblings, 0 replies; 9+ messages in thread
From: BP Jonsson @ 2017-12-04 11:23 UTC (permalink / raw)
  To: Gabriel Birke, pandoc-discuss

On 2017-12-02 at 16:40, Gabriel Birke wrote:
> I have a static site generator that allows content to be either HTML or
> Markdown, as long as it has a "YAML Frontmatter", section separated with
> three dashes from the rest of the content. HTML Example:
> 
> ---
> title: My blog post
> tags: [blog, important, internal]
> ---
> <p>First paragraph of the post</p>
> 
> 
> Now I'd like to use pandoc to convert some older HTML posts to markdown for
> easier editing. However, the frontmatter gets mangled in the process: The
> dashes get escaped and everything is put on one line. Is there some easy
> way to preserve the front matter? Or do I have to resort to shell
> scripting, splitting frontmatter from content, processing the content with
> pandoc and then joining the pandoc output with the frontmatter?
> 
> Cheers,
> 
> Gabriel
> 

On systems with make and perl installed (Unix, Linux, Mac...) this makefile does the trick:

``````makefile
# Converts HTML files with a YAML header starting on the first line and 
# ending on the first other line with a `---` marker to Markdown, while
# copying the YAML header to the Markdown file.

# USAGE:
#
#     make [HTMLSRC='foo.html bar.html'] [PDCOPTS='<extra pandoc args>']
#
# HTMLSRC defaults to globbing all *.html files

HTMLSRC ?= $(wildcard *.html)
MDTARGETS   := $(HTMLSRC:%.html=%.md)

.PHONY: html2md

html2md: $(MDTARGETS)

# The first invocation of perl loops over the source file and 
# prints out the YAML header and a blank line then stops, 
# its output being redirected to the target file.
# 
# The second invocation of perl also loops over the source file but
# skips the YAML header printing out the rest, which
# is piped through pandoc for conversion to Markdown, which is then
# appended to the target file.

$(MDTARGETS): %.md: %.html
	@perl -ne'last unless 1 ... /^---\s*$$/; print $$_; END {print "\n"}' <$< >$@
	@perl -ne'print unless 1 ... /^---\s*$$/' <$< | pandoc -r html -w markdown $(PDCOPTS) >>$@
``````

Beware of mail programs which convert tabs to spaces! The blanks at the start of the last two lines must be a single tab.

IMPORTANT: The YAML block in the *.html files must start on line 1 for this to work, since your YAML blocks both start and end with ---.
There is a reason why the start and end markers differ in compliant YAML: if the block had ended with ... then a range like
`1 ... /^\.\.\.\s*$/` would have included every line from the first line through the ..., including the --- at the top of the YAML and any lines before it.
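
For readers who do not use Perl, the behaviour of the `1 ... /^---\s*$/` range in the makefile can be imitated in Python. This is an illustrative sketch only: `flag_header_lines` is a hypothetical helper and not part of the makefile. It marks each line as inside or outside the range and shows why a file whose header does not start on line 1 loses leading content to the header.

```python
import re
from typing import Iterable, Iterator, Tuple

# Same end-marker pattern as the Perl one-liners in the makefile.
END_MARKER = re.compile(r"^---\s*$")


def flag_header_lines(lines: Iterable[str]) -> Iterator[Tuple[bool, str]]:
    """Yield (in_header, line) pairs.

    The range opens unconditionally on line 1 and closes on the next line
    that matches the end marker. As with Perl's three-dot range, the end
    marker is not tested on the line that opens the range.
    """
    in_range = False
    for number, line in enumerate(lines, start=1):
        if in_range:
            yield True, line
            if END_MARKER.match(line):
                in_range = False  # closing marker is the last header line
        elif number == 1:
            in_range = True  # the range can only ever open on line 1
            yield True, line
        else:
            yield False, line


if __name__ == "__main__":
    good = ["---\n", "title: My blog post\n", "---\n", "<p>Body</p>\n"]
    bad = ["<p>Intro</p>\n", "---\n", "<p>Body</p>\n"]
    print([flag for flag, _ in flag_header_lines(good)])
    print([flag for flag, _ in flag_header_lines(bad)])
```

In the second list the HTML line before the `---` is flagged as header, which is the failure the warning above is about.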

PS: To preserve metadata when converting to Markdown (--to markdown), use the --standalone/-s option,
i.e. PDCOPTS='--standalone'. If you do that, you will get a metadata block from pandoc as well,
and if both the original YAML block and the HTML contain a title, for example, you will get
a duplicate. However, the original YAML block will win, as it stands before the one pandoc generated.

/bpj


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Converting HTML to markdown while leaving YAML frontmatter intact
       [not found] ` <ff13d84d-d30d-4bc4-8eb6-3a47f3190ece-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
                     ` (2 preceding siblings ...)
  2017-12-04 11:23   ` BP Jonsson
@ 2017-12-04 17:17   ` John O'Regan
       [not found]     ` <5d2adaaf-c7f5-41b5-a161-ba8fd65e32d3-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  3 siblings, 1 reply; 9+ messages in thread
From: John O'Regan @ 2017-12-04 17:17 UTC (permalink / raw)
  To: pandoc-discuss


Hi Gabriel,

For HTML files, why not wrap the yaml in an HTML comment?
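
As a rough illustration of this suggestion, the Python sketch below wraps a leading `---` block in an HTML comment. The helper name `comment_out_front_matter` is hypothetical, and the sketch assumes the block starts on line 1 and ends at the next `---` line. Whether pandoc and the site generator then treat the commented block as desired is not confirmed anywhere in this thread.

```python
def comment_out_front_matter(text: str) -> str:
    """Wrap a leading `---` ... `---` block in an HTML comment.

    The text is returned unchanged if it does not start with a complete
    front matter block.
    """
    lines = text.splitlines(keepends=True)
    if not lines or lines[0].rstrip() != "---":
        return text
    for i in range(1, len(lines)):
        if lines[i].rstrip() == "---":
            header = "".join(lines[: i + 1])
            if not header.endswith("\n"):
                header += "\n"  # closing marker was the last line of the file
            rest = "".join(lines[i + 1:])
            return "<!--\n" + header + "-->\n" + rest
    return text  # opening marker without a closing one
```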

John


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Converting HTML to markdown while leaving YAML frontmatter intact
       [not found]     ` <5d2adaaf-c7f5-41b5-a161-ba8fd65e32d3-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2017-12-05 12:10       ` John O'Regan
       [not found]         ` <CALwjjMiOR=Mqnn9Ruh+KH-XYBbUBT7cB4XjNgBaxo+QRA2fOBQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: John O'Regan @ 2017-12-05 12:10 UTC (permalink / raw)
  To: pandoc-discuss

PS: Please tell me more about your static site generator, Gabriel.
Can I download it?  Is it on GitHub?


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Converting HTML to markdown while leaving YAML frontmatter intact
       [not found]         ` <CALwjjMiOR=Mqnn9Ruh+KH-XYBbUBT7cB4XjNgBaxo+QRA2fOBQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-12-05 19:15           ` mb21
       [not found]             ` <14dfcbde-85d8-4c11-a970-0c653622fc20-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: mb21 @ 2017-12-05 19:15 UTC (permalink / raw)
  To: pandoc-discuss


@Gabriel you could try pretending that it's markdown with raw HTML mixed 
in: run `pandoc -f markdown` on it.

@John O'Regan, try https://jekyllrb.com

On Tuesday, December 5, 2017 at 1:11:01 PM UTC+1, John O'Regan wrote:
>
> PS: Please tell me more about your static site generator, Gabriel. 
> Can I download it?  Is it on GitHub? 
>


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Converting HTML to markdown while leaving YAML frontmatter intact
       [not found]             ` <14dfcbde-85d8-4c11-a970-0c653622fc20-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2017-12-05 22:07               ` John O'Regan
  0 siblings, 0 replies; 9+ messages in thread
From: John O'Regan @ 2017-12-05 22:07 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

Hi Mauro,

I've heard of Jekyll, thanks! :) I'm interested in static site/blog
generators in general.  I was curious to see Gabriel's approach.  I'm
trying to write a WordPress.com blogging client for Windows users.  I
want it to be a stand-alone script that can run on Windows XP upwards.
Not as straightforward as I previously thought!

John


^ permalink raw reply	[flat|nested] 9+ messages in thread
