public inbox archive for pandoc-discuss@googlegroups.com
* name attribute lost in <a> by -parse-raw html reader, since pandoc 1.16
@ 2016-11-11 23:57 john.r.rose-QHcLZuEGTsvQT0dZR+AlfA
       [not found] ` <d1a2140b-7c27-4dc5-9d00-6d5a601d0dfa-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 4+ messages in thread
From: john.r.rose-QHcLZuEGTsvQT0dZR+AlfA @ 2016-11-11 23:57 UTC (permalink / raw)
  To: pandoc-discuss


I am parsing legacy HTML files that make heavy use of <a name=>.
I'm using --parse-raw to preserve as much HTML structure as possible.
(The result translates to a mix of pandoc-markdown and HTML fragments.)

Pandoc apparently mistreats <a name=> even in --parse-raw mode.  This feels
like a bug.

Here's the behavior:

$ echo '<a name="anchor"/>hello world' | /usr/local/bin/pandoc-1.15.2 --parse-raw -f html -t html
<a name="anchor"></a>hello world

$ echo '<a name="anchor"/>hello world' | /usr/local/bin/pandoc-1.16.0.2 --parse-raw -f html -t html
<a href=""></a>hello world

(And so on, up to the present 1.18.)

If I use <a id="anchor"> instead, as a workaround, I get either
<span id="anchor"> or <a href="" id="anchor">, depending on release.
Those are OK, but they also fall short of the mark.

In any case, because I am working with legacy files, it would be unpleasant
to preprocess every <a name=> attribute into an <a id=> attribute.
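
For reference, the preprocessing step mentioned above could look roughly like
the following. This is only a minimal sketch, assuming Python 3 is available;
the helper name_to_id is hypothetical and not part of pandoc. It uses regular
expressions rather than a real HTML parser, so it would miss unusual markup
(for example a ">" inside an attribute value).

```python
import re

# Hypothetical helper patterns for illustration; none of this is pandoc API.
A_TAG = re.compile(r"<a\b[^>]*>", re.IGNORECASE)          # an opening <a ...> tag
NAME_ATTR = re.compile(r"(\s)name(\s*=)", re.IGNORECASE)  # a name= attribute
ID_ATTR = re.compile(r"\sid\s*=", re.IGNORECASE)          # an existing id= attribute


def name_to_id(html):
    """Rename name= to id= in <a> opening tags that do not already have id=."""

    def fix(match):
        tag = match.group(0)
        if ID_ATTR.search(tag):
            return tag  # leave tags that already carry an id untouched
        return NAME_ATTR.sub(r"\1id\2", tag, count=1)

    return A_TAG.sub(fix, html)


if __name__ == "__main__":
    print(name_to_id('<a name="anchor"/>hello world'))
```

The output of name_to_id could then be piped into pandoc, which is exactly the
extra step I would prefer to avoid.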

It would be pleasant for me if the pandoc authors agreed that the present
behavior is a bug, and made --parse-raw preserve the name attributes.

Comments?  Am I missing a workaround or a theory of operation?

Thanks.


-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To post to this group, send email to pandoc-discuss-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/d1a2140b-7c27-4dc5-9d00-6d5a601d0dfa%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


* Re: name attribute lost in <a> by -parse-raw html reader, since pandoc 1.16
       [not found] ` <d1a2140b-7c27-4dc5-9d00-6d5a601d0dfa-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2016-11-12  0:34   ` John MacFarlane
       [not found]     ` <20161112003422.GH77829-jF64zX8BO0/xZR0Txf6TOv112/MQ1Lpv@public.gmane.org>
  0 siblings, 1 reply; 4+ messages in thread
From: John MacFarlane @ 2016-11-12  0:34 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

With pandoc 1.18:

% echo '<a name="anchor"/>hello world' | pandoc
<p><a name="anchor"/>hello world</p>

(Same thing with or without --parse-raw.) Is this what you expect?




* Re: name attribute lost in <a> by -parse-raw html reader, since pandoc 1.16
       [not found]     ` <20161112003422.GH77829-jF64zX8BO0/xZR0Txf6TOv112/MQ1Lpv@public.gmane.org>
@ 2016-11-12  1:28       ` john.r.rose-QHcLZuEGTsvQT0dZR+AlfA
       [not found]         ` <c5632684-9f96-4433-9061-fe960be21238-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 4+ messages in thread
From: john.r.rose-QHcLZuEGTsvQT0dZR+AlfA @ 2016-11-12  1:28 UTC (permalink / raw)
  To: pandoc-discuss



On Friday, November 11, 2016 at 4:34:25 PM UTC-8, John MacFarlane wrote:
>
> With pandoc 1.18: 
>
> % echo '<a name="anchor"/>hello world' | pandoc 
> <p><a name="anchor"/>hello world</p> 
>
> (Same thing with or without --parse-raw.) Is this what you expect? 
>
>
Yes.  But then:
% echo '<a name="anchor"/>hello world' | /usr/local/bin/pandoc-1.18 -f html
<a href=""></a>hello world

The "-f html" does something surprising here. 


* Re: name attribute lost in <a> by -parse-raw html reader, since pandoc 1.16
       [not found]         ` <c5632684-9f96-4433-9061-fe960be21238-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2016-11-12  1:54           ` john.r.rose-QHcLZuEGTsvQT0dZR+AlfA
  0 siblings, 0 replies; 4+ messages in thread
From: john.r.rose-QHcLZuEGTsvQT0dZR+AlfA @ 2016-11-12  1:54 UTC (permalink / raw)
  To: pandoc-discuss


More info:

% echo '<a bozo="clown"/>hello world' | /usr/local/bin/pandoc-1.18 -f markdown -t html
<p><a bozo="clown"/>hello world</p>

So markdown input copies through HTML fragments without trying to 
understand them. But:

% echo '<a bozo="clown"/>hello world' | /usr/local/bin/pandoc-1.18 -f html -t html
<a href=""></a>hello world

The HTML reader apparently normalizes away attributes of an <a> tag such as
"name=" (and "bozo=" here).

I filed https://github.com/jgm/pandoc/issues/3226
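
To check a whole batch of legacy files for this kind of loss, something like
the following could compare the attributes on <a> tags before and after
conversion. It is a rough sketch, assuming Python 3 and its standard
html.parser module; the names AnchorAttrs, anchor_attrs and lost_anchor_attrs
are made up for illustration, and pairing the tags by position assumes the
conversion keeps the <a> tags in the same order.

```python
from html.parser import HTMLParser


class AnchorAttrs(HTMLParser):
    """Collect the attribute names seen on each <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <a name="x"/> also end up here, because
        # the default handle_startendtag calls handle_starttag.
        if tag == "a":
            self.seen.append({name for name, _ in attrs})


def anchor_attrs(html):
    """Return one set of attribute names per <a> tag found in html."""
    parser = AnchorAttrs()
    parser.feed(html)
    parser.close()
    return parser.seen


def lost_anchor_attrs(before, after):
    """Pair <a> tags by position and list attribute names missing afterwards."""
    pairs = zip(anchor_attrs(before), anchor_attrs(after))
    return [sorted(b - a) for b, a in pairs]


if __name__ == "__main__":
    # Input and output strings copied from the -f html example above.
    print(lost_anchor_attrs('<a bozo="clown"/>hello world',
                            '<a href=""></a>hello world'))
```

Here "before" is the HTML fed to pandoc and "after" is what it wrote back; an
empty list for a tag means nothing was dropped from it.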

