reading html, <h1 class="title"> header ignored

public inbox archive for pandoc-discuss@googlegroups.com

* reading html, <h1 class="title"> header ignored
@ 2019-08-27  0:07 Mikhail Ramendik
       [not found] ` <8a9e115c-2983-47d7-a7df-82af5d73822c-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 5+ messages in thread
From: Mikhail Ramendik @ 2019-08-27  0:07 UTC (permalink / raw)
  To: pandoc-discuss

Hello,

I am converting an HTML file to ODT. (The problem is with the reader, not 
the writer, as it also reproduces when converting to MediaWiki.)

My HTML generator uses the <h1 class="title"> markup for chapter titles. 
These titles end up entirely missing from pandoc's output.

If I do a replace in the file so the tag looks like <h1 class="meow"> 
instead and then convert, the titles are in place.

How can I make pandoc process chapter titles that are marked up with <h1 
class="title"> and include them in the output? Or do I need to create a 
bug/issue somewhere?

$ pandoc --version 
pandoc 2.1.2 
Compiled with pandoc-types 1.17.3.1, texmath 0.10.1.2, skylighting 0.6

(Installed from Fedora 29 repository).

Yours, Mikhail Ramendik

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/8a9e115c-2983-47d7-a7df-82af5d73822c%40googlegroups.com.

* Re: reading html, <h1 class="title"> header ignored
       [not found] ` <8a9e115c-2983-47d7-a7df-82af5d73822c-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2019-08-27 16:33   ` John MacFarlane
       [not found]     ` <m2mufuefgc.fsf-pgq/RBwaQ+zq8tPRBa0AtqxOck334EZe@public.gmane.org>
  0 siblings, 1 reply; 5+ messages in thread
From: John MacFarlane @ 2019-08-27 16:33 UTC (permalink / raw)
  To: Mikhail Ramendik, pandoc-discuss

This is because pandoc uses <h1 class="title"> for metadata
titles when rendering HTML.  So, for better round-trip consistency,
we parse these as metadata when reading HTML.

Every once in a while someone runs into this issue, using HTML
created elsewhere that uses the same class.

It would probably have been better, in retrospect, to use a
class like "pandoc-title".  I'd be reluctant to change that now,
though, since it would affect lots of customized templates.

One possibility would be to change pandoc's HTML reader so that
<h1 class="title"> is normally parsed as a regular level-1
heading, UNLESS <meta name="generator" content="pandoc"> is
present in the head section.  That would allow nice round tripping
from pandoc but not get in the way of other HTML-producers.

However, it may be that pandoc's current behavior is actually
better in many cases, even when processing HTML produced by
other sources.  So it's quite possible that making this change
would lead to a surge of complaints. (Comments welcome on this.)

Another, probably better approach would be to parse
<h1 class="title"> as a metadata title when pandoc is run
with --standalone, but not when pandoc is run in fragment mode.
(Currently, in fragment mode, the h1 just disappears, since
no metadata is created.)  Feel free to add an issue to the
tracker suggesting this (and, comments welcome from anyone).

A workaround for you would be to preprocess the input, or
run in --standalone mode and use a lua filter that extracts
the metadata title and inserts a level 1 header with its content
at the beginning of the document.
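
A minimal sketch of the preprocessing route, in Python rather than a
Lua filter. The function name rename_title_class and the replacement
class "chapter-title" are made up for illustration; the thread only
confirms that renaming the class to "meow" keeps the headings, so any
other name is assumed to behave the same way.

```python
import re

# Opening <h1 ...> tag, split into tag name, attributes, and closing bracket.
H1_TAG = re.compile(r"(<h1\b)([^>]*)(>)", re.IGNORECASE)
# A quoted class attribute (not data-class or similar) inside that tag.
CLASS_ATTR = re.compile(r"""(?<![\w-])(class\s*=\s*)(["'])(.*?)\2""", re.IGNORECASE)


def rename_title_class(html, new_class="chapter-title"):
    """Rename the 'title' class on <h1> tags; other tags and classes are kept.

    new_class is a hypothetical name, not something pandoc knows about.
    """

    def fix_classes(match):
        prefix, quote, value = match.groups()
        # Replace only the exact token "title", leaving e.g. "subtitle" alone.
        classes = [new_class if c == "title" else c for c in value.split()]
        return prefix + quote + " ".join(classes) + quote

    def fix_tag(match):
        start, attrs, end = match.groups()
        return start + CLASS_ATTR.sub(fix_classes, attrs) + end

    return H1_TAG.sub(fix_tag, html)
```

The rewritten HTML can then be passed to pandoc as usual. This is a
regex-based sketch and does not handle unquoted attribute values.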

* Re: reading html, <h1 class="title"> header ignored
       [not found]     ` <m2mufuefgc.fsf-pgq/RBwaQ+zq8tPRBa0AtqxOck334EZe@public.gmane.org>
@ 2019-08-27 22:54       ` Mikhail Ramendik
       [not found]         ` <684df614-496b-455f-aa2d-e602b19c96b0-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 5+ messages in thread
From: Mikhail Ramendik @ 2019-08-27 22:54 UTC (permalink / raw)
  To: pandoc-discuss


Hello, 

Thank you very much for your response!

On Tuesday, August 27, 2019 at 5:33:24 PM UTC+1, John MacFarlane wrote:
>
>
> One possibility would be to change pandoc's HTML reader so that 
> <h1 class="title"> is normally parsed as a regular level-1 
> heading, UNLESS <meta name="generator" content="pandoc"> is 
> present in the head section.  That would allow nice round tripping 
> from pandoc but not get in the way of other HTML-producers. 
>


> However, it may be that pandoc's current behavior is actually 
> better in many cases, even when processing HTML produced by 
> other sources.  So it's quite possible that making this change 
> would lead to a surge of complaints. (Comments welcome on this.) 
>

I would suggest that this behaviour become the default, BUT you add a 
command line option to invoke the present behaviour.

So:

- with <meta name="generator" content="pandoc">, process <h1 class="title"> as metadata
- with --title-metadata (or similar), process <h1 class="title"> as metadata
- otherwise process <h1 class="title"> as a header
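
The three rules above can be sketched in Python as follows. This is
only an illustration of the proposed decision, not pandoc's code: the
names GeneratorFinder and h1_title_role are invented, the
title_metadata argument stands in for the suggested (not existing)
--title-metadata option, and the generator tag is assumed to have the
form <meta name="generator" content="pandoc">.

```python
from html.parser import HTMLParser


class GeneratorFinder(HTMLParser):
    """Records the content of <meta name="generator" ...>, if present."""

    def __init__(self):
        super().__init__()
        self.generator = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        if tag == "meta" and name == "generator":
            self.generator = attributes.get("content") or ""


def h1_title_role(html, title_metadata=False):
    """Return "metadata" or "header" for <h1 class="title"> in this input.

    title_metadata models the hypothetical --title-metadata option.
    """
    finder = GeneratorFinder()
    finder.feed(html)
    finder.close()
    from_pandoc = (
        finder.generator is not None
        and finder.generator.lower().startswith("pandoc")
    )
    return "metadata" if (from_pandoc or title_metadata) else "header"
```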
 

>
> Another, probably better approach would be to parse 
> <h1 class="title"> as a metadata title when pandoc is run 
> with --standalone, but not when pandoc is run in fragment mode.


But I want to get a complete ODT document as output. Don't I need to use 
--standalone? If I do then this fix would do nothing for me.
 

>
> A workaround for you would be to preprocess the input, or 
> run in --standalone mode and use a lua filter that extracts 
> the metadata title and inserts a level 1 header with its content 
> at the beginning of the document. 
>

Preprocessing the input with a mere search and replace, changing 
class="title" to class="meow", is a simple approach that works. But it is a 
mandatory extra step.

 Yours, Mikhail Ramendik 

* Re: reading html, <h1 class="title"> header ignored
       [not found]         ` <684df614-496b-455f-aa2d-e602b19c96b0-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2019-08-28  0:53           ` John MacFarlane
       [not found]             ` <yh480kk1ayxg7w.fsf-pgq/RBwaQ+zq8tPRBa0AtqxOck334EZe@public.gmane.org>
  0 siblings, 1 reply; 5+ messages in thread
From: John MacFarlane @ 2019-08-28  0:53 UTC (permalink / raw)
  To: Mikhail Ramendik, pandoc-discuss


I take back one thing I said.  The <h1 class="title"> just
gets skipped; it doesn't get parsed into a metadata title.
If it did, it would appear as a title heading in your ODT
document and I think you'd have no complaints.

So maybe this can be solved by changing the HTML reader to
insert the contents of <h1 class="title"> into the title
metadata, if it hasn't already been populated by a
<title> tag in the header (I assume your document lacks one?)
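
A rough Python sketch of that fallback, under stated assumptions: it
is not the HTML reader's actual code, the names TitleCollector and
metadata_title are invented, and only the first <h1 class="title">
is considered.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collects the text of <title> and of the first <h1 class="title">."""

    def __init__(self):
        super().__init__()
        self.title_tag = ""
        self.h1_title = ""
        self._target = None    # name of the field currently receiving text
        self._h1_done = False  # True once the first title heading has closed

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "title":
            self._target = "title_tag"
        elif tag == "h1" and "title" in classes and not self._h1_done:
            self._target = "h1_title"

    def handle_endtag(self, tag):
        if tag == "h1" and self._target == "h1_title":
            self._h1_done = True
        if tag in ("title", "h1"):
            self._target = None

    def handle_data(self, data):
        if self._target:
            setattr(self, self._target, getattr(self, self._target) + data)


def metadata_title(html):
    """Prefer <title>; fall back to <h1 class="title">; else None."""
    collector = TitleCollector()
    collector.feed(html)
    collector.close()
    return collector.title_tag.strip() or collector.h1_title.strip() or None
```

With several chapter headings marked up this way, only the first one
would be picked up by this sketch.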

* Re: reading html, <h1 class="title"> header ignored
       [not found]             ` <yh480kk1ayxg7w.fsf-pgq/RBwaQ+zq8tPRBa0AtqxOck334EZe@public.gmane.org>
@ 2019-08-28  1:11               ` Mikhail Ramendik
  0 siblings, 0 replies; 5+ messages in thread
From: Mikhail Ramendik @ 2019-08-28  1:11 UTC (permalink / raw)
  To: John MacFarlane; +Cc: pandoc-discuss

On Wed, 28 Aug 2019 at 01:54, John MacFarlane <jgm-TVLZxgkOlNX2fBVCVOL8/A@public.gmane.org> wrote:

> So maybe this can be solved by changing the HTML reader to
> insert the contents of <h1 class="title"> into the title
> metadata, if it hasn't already been populated by a
> <title> tag in the header (I assume your document lacks one?)

Well maybe - I can't judge until I try it.

-- 
Yours, Mikhail Ramendik

Unless explicitly stated, all opinions in my mail are my own and do
not reflect the views of any organization

