HTML attributes not being stripped off

public inbox archive for pandoc-discuss@googlegroups.com

* HTML attributes not being stripped off
@ 2012-11-11 11:19 Pablo Rodríguez
       [not found] ` <509F89B3.4070403-S0/GAf8tV78@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: Pablo Rodríguez @ 2012-11-11 11:19 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

Hi John,

I'm using pandoc mainly to generate ePub files.

I first used Textile as the source language, but pandoc doesn't fully
implement it, and Textile itself has issues with multi-paragraph elements.

HTML seems to be a much better source language for pandoc, although I
have to give up footnotes. There is no way to have it all.

But pandoc strips almost all attributes from HTML elements.

A minimal sample:

<ol start="2" style="list-style-type:lower-latin;">
<li><p>Well there is no other way to tag <em lang="la">lingua
latina</em>.</p></li>
<li><p>Or even classes or ids.</p></li>
</ol>

Would it be possible to add an option that keeps the attributes of HTML
elements instead of stripping them?

BTW, when converting from HTML to HTML, at least the id, class and lang
attributes shouldn't be stripped by default.

Many thanks for your help,

Pablo
-- 
http://www.ousia.tk

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: HTML attributes not being stripped off
       [not found] ` <509F89B3.4070403-S0/GAf8tV78@public.gmane.org>
@ 2012-11-11 22:36   ` John MacFarlane
       [not found]     ` <20121111223615.GE4399-9Rnp8PDaXcZ2EAH53EmH34tHsfhOvSUSZkel5v8DVj8@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: John MacFarlane @ 2012-11-11 22:36 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

You've got to remember that pandoc converts the input format to an
internal representation of the document (the 'Pandoc' structure), and
then converts that to the output format.

This internal representation (see
http://hackage.haskell.org/packages/archive/pandoc-types/1.9.1/doc/html/Text-Pandoc-Definition.html)
is much less expressive than HTML, and doesn't have a place for the
attributes you want.  That's why they are lost on HTML -> HTML
translation.
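
To make the explanation above concrete, here is a toy model in Python. It
is not pandoc's code: the names Para, read_p and write_html are made up for
illustration. The intermediate Para type has no field for attributes, so
whatever the reader sees on <p> cannot reach the writer.

```python
from dataclasses import dataclass


@dataclass
class Para:
    """Toy intermediate node: it carries text only, no attributes."""
    text: str


def read_p(attributes: dict, text: str) -> Para:
    # The reader sees the attributes but has nowhere to store them.
    return Para(text)


def write_html(node: Para) -> str:
    # The writer can only emit what the intermediate node carries.
    return f"<p>{node.text}</p>"


node = read_p({"lang": "la", "class": "quote"}, "lingua latina")
html = write_html(node)
print(html)
```

In this sketch the loss happens in the reader, not the writer, which is why
an HTML -> HTML round trip drops the attributes as well.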


-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To post to this group, send email to pandoc-discuss-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To unsubscribe from this group, send email to pandoc-discuss+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.




^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: HTML attributes not being stripped off
       [not found]     ` <20121111223615.GE4399-9Rnp8PDaXcZ2EAH53EmH34tHsfhOvSUSZkel5v8DVj8@public.gmane.org>
@ 2012-11-12 19:14       ` Pablo Rodríguez
       [not found]         ` <50A14A92.9060301-S0/GAf8tV78@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: Pablo Rodríguez @ 2012-11-12 19:14 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

Thank you for your explanation, John.

I'm afraid I cannot code, and pandoc's internal representation of the
document (sorry if the naming isn't accurate, but this is really Greek
to me [χαλεπὰ τὰ καλά, I agree :-)]) is beyond my extremely limited
understanding of these matters.

I can understand (and I hope I'm not wrong) that pandoc cannot be as
flexible as HTML and this is on purpose. (This might be problematic for
some uses, as the Spanish law gazette uses no headings, but it
distinguishes the different <p> with different classes.)

Not focusing specifically on HTML, I think that pandoc should make it
possible to assign a unique identifier, a class and a language to any
element, text span or division.

I know that this is related to a couple of messages I sent yesterday.
Sorry for repeating myself, but these are basic features for writing
documents.

In terms of the documentation, this would mean adding the type Attr to
every constructor of data Block and data Inline, and to TableCell.

Although the language could be defined as a key-value pair in type Attr, I
think it is clearer to define a new, specific language attribute.
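
For readers unfamiliar with the type: in pandoc-types, Attr is a triple of
an identifier, a list of classes and a list of key-value pairs. The sketch
below is a Python illustration of the first option mentioned above, keeping
the language as an ordinary key-value pair. The class and function names are
hypothetical and not part of pandoc's API.

```python
from typing import NamedTuple, Optional


class Attr(NamedTuple):
    """Same shape as pandoc's Attr: identifier, classes, key-value pairs."""
    identifier: str
    classes: list
    keyvals: list  # list of (key, value) string pairs


def language_of(attr: Attr) -> Optional[str]:
    # Look up the language when it is stored as a plain key-value pair.
    return dict(attr.keyvals).get("lang")


attr = Attr("first", ["test"], [("lang", "la")])
print(language_of(attr))
```

A dedicated language field would replace the dictionary lookup with direct
field access, at the cost of changing the type for every consumer.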

Is there anything wrong with this approach?

Many thanks for your help,



Pablo


-- 
http://www.ousia.tk





^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: HTML attributes not being stripped off
       [not found]         ` <50A14A92.9060301-S0/GAf8tV78@public.gmane.org>
@ 2022-06-27  9:42           ` 'guenael Muller' via pandoc-discuss
       [not found]             ` <33fcfdbf-3edc-4145-a7f0-325bfd42698fn-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: 'guenael Muller' via pandoc-discuss @ 2022-06-27  9:42 UTC (permalink / raw)
  To: pandoc-discuss



Hello all,

Bumping this thread from the future (10 years later!).

I have a similar issue now while experimenting with pandoc and WeasyPrint,
using pandoc for the templating.
What seems a bit curious is that some HTML attributes are stripped and
some are not.

Example:

<p style="xxx" lang="ar" dir="rtl" class="test" id="first">Just some text</p>
<div style="xxx" lang="ar" dir="rtl" class="test" id="second">
<p>Just some text</p>
</div>

gives:

<p>Just some text</p>
<div id="second" class="test" style="xxx" lang="ar" dir="rtl">
<p>Just some text</p>
</div>

I don't understand why HTML attributes are stripped from p but not from
div. Understanding this may help me decide whether it's worth adapting my
HTML to this behavior.

Thanks for your answers,

Guenael

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/33fcfdbf-3edc-4145-a7f0-325bfd42698fn%40googlegroups.com.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: HTML attributes not being stripped off
       [not found]             ` <33fcfdbf-3edc-4145-a7f0-325bfd42698fn-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2022-06-27  9:47               ` Albert Krewinkel
  2022-06-27  9:55               ` Sukil Etxenike arizaleta
  1 sibling, 0 replies; 9+ messages in thread
From: Albert Krewinkel @ 2022-06-27  9:47 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

Hi,

"'guenael Muller' via pandoc-discuss" <pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org> writes:

> I have a similar issue now with playing with pandoc and weasyprint
> using pandoc for the templating.
>
> [...]
>
> I don't get the logic behind the fact that HTML attribute are stripped
> for p and not for div. Understanding it may help me make a decision
> if it's worth to adapt my HTML depending on this behavior.

If you are using pandoc just for templating and as a wrapper for
weasyprint, then you may want to move to a different templating engine
like m4 (here's a nice introduction:
https://ericscrivner.me/2018/07/m4-as-a-basic-templating-system/)


-- 
Albert Krewinkel
GPG: 8eed e3e2 e8c5 6f18 81fe  e836 388d c0b2 1f63 1124


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: HTML attributes not being stripped off
       [not found]             ` <33fcfdbf-3edc-4145-a7f0-325bfd42698fn-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  2022-06-27  9:47               ` Albert Krewinkel
@ 2022-06-27  9:55               ` Sukil Etxenike arizaleta
       [not found]                 ` <87174047-ad9b-b702-4a08-eaa3c00c511d-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  1 sibling, 1 reply; 9+ messages in thread
From: Sukil Etxenike arizaleta @ 2022-06-27  9:55 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw


Hi,

What are you trying to do? Pandoc's Markdown (and so pandoc itself)
supports attributes on code blocks (pre or code, I don't remember), divs
and spans (and, I believe, headers too). I think paragraphs can't have
attributes.
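
One way to adapt the HTML before it reaches pandoc, given that divs keep
their attributes and paragraphs do not, is to move the attributes of each
<p> onto a wrapping <div>. The following is only a sketch using Python's
standard html.parser; the class and function names are made up, and
comments and doctype declarations are not re-emitted. Whether pandoc then
preserves every attribute on the div may depend on the pandoc version.

```python
from html import escape
from html.parser import HTMLParser


class ParagraphWrapper(HTMLParser):
    """Re-emit HTML, moving attributes of <p> onto a wrapping <div>."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.wrapped = []  # one flag per open <p>: was it wrapped?

    @staticmethod
    def _fmt(attrs):
        # Serialize (name, value) pairs; value is None for bare attributes.
        return "".join(
            f" {k}" if v is None else f' {k}="{escape(v, quote=True)}"'
            for k, v in attrs
        )

    def handle_starttag(self, tag, attrs):
        if tag == "p" and attrs:
            self.out.append(f"<div{self._fmt(attrs)}><p>")
            self.wrapped.append(True)
            return
        if tag == "p":
            self.wrapped.append(False)
        self.out.append(f"<{tag}{self._fmt(attrs)}>")

    def handle_startendtag(self, tag, attrs):
        self.out.append(f"<{tag}{self._fmt(attrs)}/>")

    def handle_endtag(self, tag):
        if tag == "p" and self.wrapped and self.wrapped.pop():
            self.out.append("</p></div>")
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def wrap_paragraphs(html: str) -> str:
    parser = ParagraphWrapper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


sample = '<p lang="ar" dir="rtl" class="test" id="first">Just some text</p>'
print(wrap_paragraphs(sample))
```

Paragraphs without attributes pass through unchanged, so the preprocessing
only touches the elements that would otherwise lose information.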

Cheers,

Sukil




^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: HTML attributes not being stripped off
       [not found]                 ` <87174047-ad9b-b702-4a08-eaa3c00c511d-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2022-06-27 10:17                   ` 'guenael Muller' via pandoc-discuss
       [not found]                     ` <e1b7f6d6-56c7-469e-b2f1-082718e2cbb2n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: 'guenael Muller' via pandoc-discuss @ 2022-06-27 10:17 UTC (permalink / raw)
  To: pandoc-discuss



Hi, 

The idea is to be able to convert both HTML files (generated by a rich
text editor) and Markdown files (or files in a similar markup language)
through the same pipeline to PDFs with a similar style. Using a different
templating engine somewhere in the pipeline means more complexity, so I'm
considering using pandoc's templating if the HTML result is okay.

Thanks.




^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: HTML attributes not being stripped off
       [not found]                     ` <e1b7f6d6-56c7-469e-b2f1-082718e2cbb2n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2022-06-27 11:37                       ` Albert Krewinkel
       [not found]                         ` <87r13abaeb.fsf-9EawChwDxG8hFhg+JK9F0w@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: Albert Krewinkel @ 2022-06-27 11:37 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw


"'guenael Muller' via pandoc-discuss" <pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org> writes:

> The idea there, is to be able to convert both html (generated by a rich
> text editor)  and markdown (or other similar markup language) file
> through a similar pipeline to a pdf with similar style. Using a
> different templating engine somewhere in the pipeline mean more
> complexity, so i'm considering the idea of using pandoc templating if
> the html result is okay.

OK, I see. How about the following approach then: use a custom reader
that passes the input through as raw HTML if any of the files have an
`.html` extension, but otherwise treats the input as Markdown.

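``` lua
-- Custom reader: pass everything through as raw HTML if any input file
-- has an `.html` extension; otherwise parse the input as Markdown.
function Reader (sources, opts)
  local raw_html = false
  for _, source in ipairs(sources) do
    -- '%.' is a literal dot; this pattern matches names ending in `.html`.
    if source.name:match '%.htm[l]$' then
      raw_html = true
    end
  end
  if raw_html then
    -- Wrap each source's text in a raw HTML block so it is left untouched.
    local blocks = sources:map(
      function (source)
        return pandoc.RawBlock('html', tostring(source))
      end
    )
    return pandoc.Pandoc(blocks)
  else
    return pandoc.read(sources, 'markdown', opts)
  end
end
```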

See also <https://pandoc.org/custom-readers.html>.

-- 
Albert Krewinkel
GPG: 8eed e3e2 e8c5 6f18 81fe  e836 388d c0b2 1f63 1124


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: HTML attributes not being stripped off
       [not found]                         ` <87r13abaeb.fsf-9EawChwDxG8hFhg+JK9F0w@public.gmane.org>
@ 2022-06-27 12:14                           ` Albert Krewinkel
  0 siblings, 0 replies; 9+ messages in thread
From: Albert Krewinkel @ 2022-06-27 12:14 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw


Albert Krewinkel <albert+pandoc-9EawChwDxG8hFhg+JK9F0w@public.gmane.org> writes:

> "'guenael Muller' via pandoc-discuss" <pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org> writes:
>
>> The idea there, is to be able to convert both html (generated by a rich
>> text editor)  and markdown (or other similar markup language) file
>> through a similar pipeline to a pdf with similar style. Using a
>> different templating engine somewhere in the pipeline mean more
>> complexity, so i'm considering the idea of using pandoc templating if
>> the html result is okay.
>
> OK, I see. How about the following approach then: use a custom reader
> that passes the input through as raw HTML if any of the files have an
> `.html` extension, but otherwise treats the input as Markdown.

Here is a shorter version that requires a current development version of
pandoc, but allows mixing .html and Markdown files in the same command:

``` lua
function Reader (sources, opts)
  local doc = pandoc.Pandoc{}
  for _, source in ipairs(sources) do
    doc = doc ..
      (source.name:match '%.htm[l]?$'
       and pandoc.Pandoc{pandoc.RawBlock('html', tostring(source))}
       or pandoc.read(source, 'markdown', opts))
  end
  return doc
end
```
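
The per-file decision made by this reader can be illustrated outside pandoc.
The following Python sketch mirrors only the filename check: the Lua pattern
'%.htm[l]?$' corresponds to the regular expression used below. The function
name and the labels it returns are made up for illustration and are not
pandoc format names.

```python
import re

# Python counterpart of the Lua pattern '%.htm[l]?$' (ends in .htm or .html).
HTML_NAME = re.compile(r"\.html?$")


def reader_for(name: str) -> str:
    """Return which treatment a source file would get in this sketch."""
    return "raw-html" if HTML_NAME.search(name) else "markdown"


plan = [(name, reader_for(name)) for name in ["intro.md", "body.html", "old.htm"]]
print(plan)
```

Because the check is anchored at the end of the name, a file such as
notes.html.md is still treated as Markdown.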

-- 
Albert Krewinkel
GPG: 8eed e3e2 e8c5 6f18 81fe  e836 388d c0b2 1f63 1124


^ permalink raw reply	[flat|nested] 9+ messages in thread
