public inbox archive for pandoc-discuss@googlegroups.com
* HTML to Markdown and &gt;/&lt; entities
@ 2017-07-28 10:14 Benjamin Ullrich
       [not found] ` <6e17b71f-4ce2-4e88-9bc3-fbdbb4d74365-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 4+ messages in thread
From: Benjamin Ullrich @ 2017-07-28 10:14 UTC (permalink / raw)
  To: pandoc-discuss



I'm new to pandoc and am currently experimenting with converting HTML to Markdown. For this I have created the following sample input text:


text&nbsp;&nbsp;<strong>strong</strong> <em>emphasize</em> &gt; &amp; &lt; &copy; text <a href="http://domain.test">Link Title</a>

My expectation for the result would be: 

text  **strong** *emphasize* > & < © text [Link Title](http://domain.test)


Instead, running pandoc like: 

pandoc -f html -t markdown_strict --normalize --wrap none /tmp/pandoc_input 

where /tmp/pandoc_input contains the above mentioned text produces the following output: 


text  **strong** *emphasize* &gt; & &lt; © text [Link Title](http://domain.test)

Everything looks fine except that &gt; and &lt; are not converted as I would expect.

My current system: 

Ubuntu 16.04 LTS 
pandoc 1.16.0.2 

I also tried with pandoc 1.19.1 from Ubuntu Artful, same problem. 

I also tried with pandoc 1.12.0.2 on Ubuntu 14.04 LTS. There, &gt; and &lt; are converted but preceded by a \, which is also not what I would expect:


text  **bold** *italic* \> & \< © text [Link Title](http://domain.test)


Am I just missing something here? Should it work as I expect or is my expectation wrong? Any input on what could be the problem or how to circumvent the issue would be appreciated. 


Regards, 
Benjamin


-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To post to this group, send email to pandoc-discuss-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/6e17b71f-4ce2-4e88-9bc3-fbdbb4d74365%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


* Re: HTML to Markdown and &gt;/&lt; entities
       [not found] ` <6e17b71f-4ce2-4e88-9bc3-fbdbb4d74365-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2017-07-28 11:53   ` Ivan Lazar Miljenovic
       [not found]     ` <CA+u6gbzgquNuuGXLdsCtcdo+JhS2BchV9zQbyJNmaRVa7_UsVQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 4+ messages in thread
From: Ivan Lazar Miljenovic @ 2017-07-28 11:53 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

On 28 July 2017 at 20:14, Benjamin Ullrich <bulirich-S0/GAf8tV78@public.gmane.org> wrote:
> I'm new to pandoc and currently experimenting to convert from HTML to
> Markdown. For this I have created the following sample input text:
>
> text&nbsp;&nbsp;<strong>strong</strong> <em>emphasize</em> &gt; &amp; &lt;
> &copy; text <a href="http://domain.test">Link Title</a>
>
> My expectation for the result would be:
>
> text  **strong** *emphasize* > & < © text [Link Title](http://domain.test)
>
> Instead, running pandoc like:
>
> pandoc -f html -t markdown_strict --normalize --wrap none /tmp/pandoc_input
>
> where /tmp/pandoc_input contains the above mentioned text produces the
> following output:
>
> text  **strong** *emphasize* &gt; & &lt; © text [Link
> Title](http://domain.test)
>
> All looks fine except &gt; and &lt; are not converted as I would expect.
>
> My current system:
>
> Ubuntu 16.04 LTS
> pandoc 1.16.0.2
>
> I also tried with pandoc 1.19.1 from Ubuntu Artful, same problem.
>
> I also tried with pandoc 1.12.0.2 on Ubuntu 14.04 LTS. There &gt; and &lt;
> are converted but preceded with a \ which is also not what I would expect:
>
> text  **bold** *italic* \> & \< © text [Link Title](http://domain.test)
>
>
> Am I just missing something here? Should it work as I expect or is my
> expectation wrong? Any input on what could be the problem or how to
> circumvent the issue would be appreciated.

Markdown allows inline HTML, so a literal > or < does need to be
escaped; how the Markdown writer escapes them seems to have changed
between versions, though.

>
> Regards,
> Benjamin



-- 
Ivan Lazar Miljenovic
Ivan.Miljenovic-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
http://IvanMiljenovic.wordpress.com




* Re: HTML to Markdown and &gt;/&lt; entities
       [not found]     ` <CA+u6gbzgquNuuGXLdsCtcdo+JhS2BchV9zQbyJNmaRVa7_UsVQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-07-30 20:42       ` John MacFarlane
       [not found]         ` <20170730204232.GF11715-9Rnp8PDaXcadBw3G0RLmbRFnWt+6NQIA@public.gmane.org>
  0 siblings, 1 reply; 4+ messages in thread
From: John MacFarlane @ 2017-07-30 20:42 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

We need to escape `<` and `>` in general, since these
can have special meanings in Markdown (autolinks,
raw HTML).

We could use either `\<` or `&lt;`; we used the latter
because the former doesn't work with the original
Markdown.pl and derivatives (such as showdown,
Python-Markdown, etc.).

I think it would make sense, though, to use `\<`
and `\>` when `all_symbols_escapable` is set.
I'm going to implement that change.
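
The rule described above can be sketched roughly as follows. This is an illustration in Python, not pandoc's actual Haskell writer code; the function name `escape_angle_brackets` and its boolean parameter are assumptions made up for this example.

```python
def escape_angle_brackets(text: str, all_symbols_escapable: bool) -> str:
    """Escape literal < and > so Markdown does not read them as markup.

    Hypothetical helper for illustration only; not part of pandoc.
    """
    if all_symbols_escapable:
        # Backslash escapes are understood when the extension is enabled.
        table = {"<": "\\<", ">": "\\>"}
    else:
        # Entities are the portable choice for Markdown.pl-style processors.
        table = {"<": "&lt;", ">": "&gt;"}
    return "".join(table.get(ch, ch) for ch in text)
```

With the flag off this yields the entity form seen in the markdown_strict output earlier in the thread; with it on, it yields the backslash form.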



* Re: HTML to Markdown and &gt;/&lt; entities
       [not found]         ` <20170730204232.GF11715-9Rnp8PDaXcadBw3G0RLmbRFnWt+6NQIA@public.gmane.org>
@ 2017-08-01 22:10           ` Melroch
  0 siblings, 0 replies; 4+ messages in thread
From: Melroch @ 2017-08-01 22:10 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw


On 30 Jul 2017 at 22:43, "John MacFarlane" <jgm-TVLZxgkOlNX2fBVCVOL8/A@public.gmane.org> wrote:

> We need to escape `<` and `>` in general, since these
> can have special meanings in Markdown (autolinks,
> raw HTML).
>
> We could use either `\<` or `&lt;`; we used the latter
> because the former doesn't work with the original
> Markdown.pl and derivatives (such as showdown,
> Python-Markdown, etc.).
>
> I think it would make sense, though, to use `\<`
> and `\>` when `all_symbols_escapable` is set.
> I'm going to implement that change.


I'm glad to hear that. I've been post-filtering the generated Markdown
to achieve that for some time.

Thanks!

/bpj
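
A post-filter of the kind mentioned above might look something like the following. This is only a sketch: the name `postfilter` is hypothetical, and it assumes that code spans use single backticks and that the document contains no fenced code blocks, which a real filter would have to handle.

```python
import re

# Map the entities pandoc emitted to their backslash-escaped equivalents.
ENTITY_TO_ESCAPE = {"&lt;": "\\<", "&gt;": "\\>"}


def postfilter(markdown: str) -> str:
    """Rewrite &lt; and &gt; as backslash escapes outside inline code spans.

    Hypothetical helper for illustration; assumes single-backtick code spans.
    """
    # The capturing group keeps the code spans in the result at odd indices.
    parts = re.split(r"(`[^`]*`)", markdown)
    for i in range(0, len(parts), 2):
        parts[i] = re.sub(
            r"&(?:lt|gt);",
            lambda m: ENTITY_TO_ESCAPE[m.group(0)],
            parts[i],
        )
    return "".join(parts)
```

Such a function could be applied to pandoc's output before writing it to disk. Code spans are left untouched because entities inside them are literal text.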





