From: 'guenael Muller' via pandoc-discuss
Newsgroups: gmane.text.pandoc
Subject: Re: HTML attributes not being stripped off
Date: Mon, 27 Jun 2022 03:17:31 -0700 (PDT)
To: pandoc-discuss

Hi,

The idea is to convert both HTML (generated by a rich text editor) and Markdown (or another similar markup language) files through a similar pipeline to a PDF with a similar style. Using a different templating engine somewhere in the pipeline means more complexity, so I'm considering using pandoc templating if the HTML result is okay.

Thanks.

On Monday, 27 June 2022 at 11:55:12 UTC+2, suki...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org wrote:

Hi,

What are you trying to do? Pandoc's markdown (and so Pandoc itself) supports codeblocks (pre or code, I don't remember), divs and spans with attributes (and I believe headers too). I think paragraphs can't have attributes.

Cheers,

Sukil

On 27/06/2022 at 11:42, 'guenael Muller' via pandoc-discuss wrote:

Hello all,

Up from the future (10 years!)

I now have a similar issue while experimenting with pandoc and WeasyPrint, using pandoc for the templating.
What seems a bit curious is that some HTML attributes are stripped off and some are not.

Example:

<p style="xxx" lang="ar" dir="rtl" class="test" id="first">Just some text</p>
<div style="xxx" lang="ar" dir="rtl" class="test" id="second">
<p>Just some text</p>
</div>

gives:

<p>Just some text</p>
<div id="second" class="test" style="xxx" lang="ar" dir="rtl">
<p>Just some text</p>
</div>

I don't understand the logic behind HTML attributes being stripped for p but not for div. Understanding it may help me decide
whether it is worth adapting my HTML to this behavior.

Thanks for your answers,

Guenael
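
One way to adapt the HTML to the behavior shown in the example above is to move the attributes of each `<p>` onto a wrapping `<div>` before handing the file to pandoc, since the `div` kept its attributes in that output. The following is a minimal sketch using only Python's standard library. `ParagraphWrapper` and `wrap_paragraphs` are hypothetical names, the parser assumes every `<p>` has an explicit `</p>`, and comments and doctype declarations are not re-emitted. Whether a wrapper `div` suits the stylesheet used with WeasyPrint is left open.

```python
from html import escape
from html.parser import HTMLParser


class ParagraphWrapper(HTMLParser):
    """Re-emit HTML, moving the attributes of <p> onto a wrapping <div>."""

    def __init__(self):
        # Keep character references untouched so they can be re-emitted as-is.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.wrapped = []  # one flag per open <p>: was a <div> added?

    @staticmethod
    def _open(tag, attrs):
        # Attribute values arrive unescaped, so escape them again on output.
        rendered = "".join(
            f" {name}" if value is None else f' {name}="{escape(value)}"'
            for name, value in attrs
        )
        return f"<{tag}{rendered}>"

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.wrapped.append(bool(attrs))
            if attrs:
                self.out.append(self._open("div", attrs))
            self.out.append("<p>")
        else:
            self.out.append(self._open(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br /> are passed through unchanged.
        self.out.append(self._open(tag, attrs)[:-1] + " />")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")
        if tag == "p" and self.wrapped and self.wrapped.pop():
            self.out.append("</div>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def wrap_paragraphs(html_text):
    """Return html_text with every attributed <p> wrapped in a <div>."""
    parser = ParagraphWrapper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


print(wrap_paragraphs('<p lang="ar" dir="rtl">Just some text</p>'))
```

Paragraphs without attributes pass through unchanged, so the function can be applied to a whole document before conversion.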

On Monday, 12 November 2012 at 20:14:26 UTC+1, ousia wrote:
Thank you for your explanation, John.

I'm afraid I cannot code and pandoc's internal representation of the
document (sorry if the naming isn't accurate, but this is really Greek
to me [χαλεπὰ τὰ καλά, I agree :-)]) is beyond my extremely limited
understanding on these matters.

I can understand (and I hope I'm not wrong) that pandoc cannot be as
flexible as HTML and this is on purpose. (This might be problematic for
some uses, as the Spanish law gazette uses no headings, but it
distinguishes the different <p> with different classes.)

Not focusing specifically on HTML, I think that pandoc should make it possible to
uniquely identify, assign a class to, and set the language of any element,
desired text span or division.

I know that this is related to a couple of messages I sent yesterday.
Sorry for repeating myself, but these are basic features for writing documents.

From the documentation perspective, this would mean applying the type Attr
to every constructor of data Block and Inline, and to data TableCell.

Although language could be defined as a key-value pair in type Attr, I
think it is clearer to define a new, specific language attribute.

Is there anything wrong with this approach?

Many thanks for your help,



Pablo
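
For reference, pandoc's `Attr` is a triple of an identifier, a list of classes, and a list of key-value pairs. The sketch below renders that shape in Python to show the first option mentioned above, storing the language as a `lang` key-value pair. The Python field names and the helpers `get_lang` and `with_lang` are illustrative choices, not pandoc's API. The alternative of a dedicated language field would amount to a fourth member, which is hypothetical here.

```python
from typing import NamedTuple


class Attr(NamedTuple):
    """Python rendering of the shape of pandoc's Attr triple."""
    identifier: str = ""
    classes: tuple = ()
    keyvals: tuple = ()  # ordered (key, value) pairs


def get_lang(attr):
    """Return the value of the first 'lang' pair, or None if absent."""
    for key, value in attr.keyvals:
        if key == "lang":
            return value
    return None


def with_lang(attr, lang):
    """Return a copy of attr whose only 'lang' pair is the given one."""
    others = tuple((k, v) for k, v in attr.keyvals if k != "lang")
    return attr._replace(keyvals=others + (("lang", lang),))


latin = with_lang(Attr(identifier="first", classes=("test",)), "la")
print(latin)
print(get_lang(latin))
```

With this layout the language travels with any node that has an `Attr`, but a node type without an `Attr` slot still has nowhere to keep it.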

On 11/11/12 23:36, John MacFarlane wrote:
> You've got to remember that pandoc converts the input format to an
> internal representation of the document (the 'Pandoc' structure), and
> then converts that to the output format.
>
> This internal representation (see
> http://hackage.haskell.org/packages/archive/pandoc-types/1.9.1/doc/html/Text-Pandoc-Definition.html)
> is much less expressive than HTML, and doesn't have a place for the
> attributes you want. That's why they are lost on HTML -> HTML
> translation.
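
The loss described here can be illustrated with a toy model. The sketch below is not pandoc's code: `Para`, `Div`, `read_element`, and `write_html` are simplified, hypothetical stand-ins for an intermediate representation in which only some node types have a slot for attributes. It mirrors the 2022 example earlier in this thread, where `div` keeps its attributes and `p` does not.

```python
from dataclasses import dataclass, field
from html import escape


@dataclass
class Para:
    """Paragraph node: text only, with no slot for attributes."""
    text: str


@dataclass
class Div:
    """Container node: carries attributes alongside its children."""
    attrs: dict
    children: list = field(default_factory=list)


def read_element(tag, attrs, content):
    """Toy 'reader': map a parsed HTML element onto the model."""
    if tag == "p":
        # Para has nowhere to store attrs, so they are dropped here.
        return Para(content)
    if tag == "div":
        return Div(dict(attrs), content)
    raise ValueError(f"unsupported tag: {tag}")


def write_html(node):
    """Toy 'writer': it can only emit what the model retained."""
    if isinstance(node, Para):
        return f"<p>{escape(node.text)}</p>"
    rendered = "".join(f' {k}="{escape(v)}"' for k, v in node.attrs.items())
    inner = "".join(write_html(child) for child in node.children)
    return f"<div{rendered}>{inner}</div>"


para = read_element("p", {"lang": "ar", "id": "first"}, "Just some text")
div = read_element("div", {"lang": "ar", "id": "second"}, [para])
print(write_html(para))
print(write_html(div))
```

The writer is not at fault in this model: the attributes are already gone once the reader has built the `Para` node.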
>
> +++ Pablo Rodríguez [Nov 11 12 12:19 ]:
>> Hi John,
>>
>> I'm using pandoc mainly to generate ePub files.
>>
>> I used textile first as source language, but it isn't fully implemented
>> by pandoc and textile itself has issues with multiparagraph elements.
>>
>> It seems HTML is probably a much better option for pandoc as source
>> language, although I have to forget footnotes. There is no way to have
>> it all.
>>
>> But pandoc strips almost all attributes from HTML elements.
>>
>> A minimal sample:
>>
>> <ol start="2" style="list-style-type:lower-latin;">
>> <li><p>Well there is no other way to tag <em lang="la">lingua
>> latina</em>.</p>
>> <li><p>Or even classes or ids.</p>.</li>
>> </ol>
>>
>> Would it be possible that there is an option that doesn't strip off
>> attributes from HTML code?
>>
>> BTW, when converting from HTML to another HTML code, at least id, class
>> and lang attributes shouldn't be stripped off by default.
>>
>> Many thanks for your help,
>>
>>
>> Pablo
>> --
>> http://www.ousia.tk
>>
>> --
>> You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
>> To post to this group, send email to pandoc-...-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org.
>> To unsubscribe from this group, send email to pandoc-discus...-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org.
>> For more options, visit https://groups.google.com/groups/opt_out.
>>
>>
>

--
http://www.ousia.tk

--
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org.
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/e1b7f6d6-56c7-469e-b2f1-082718e2cbb2n%40googlegroups.com.