From: "'guenael Muller' via pandoc-discuss"
Newsgroups: gmane.text.pandoc
Subject: Re: HTML attributes not being stripped off
Date: Mon, 27 Jun 2022 02:42:55 -0700 (PDT)
Message-ID: <33fcfdbf-3edc-4145-a7f0-325bfd42698fn@googlegroups.com>
References: <509F89B3.4070403@web.de> <20121111223615.GE4399@Johns-MacBook-Air-2.local> <50A14A92.9060301@web.de>
To: pandoc-discuss

Hello all,

Up from the future (10 years!)

I have a similar issue now while playing with pandoc and weasyprint, using pandoc for the templating.
What seems a bit curious is that some HTML attributes are stripped off and some are not.

Example:

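<p […]>Just some text</p>
<div style="xxx" lang="ar" dir="rtl" class="test" id="second">
<p>Just some text</p>
</div>

gives:

<p>Just some text</p>
<div id="second" class="test" style="xxx" lang="ar" dir="rtl">
<p>Just some text</p>
</div>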
I don't get the logic behind the fact that HTML attributes are stripped for
p and not for div. Understanding it may help me decide whether it's worth
adapting my HTML to this behavior.

Thanks for your answers,
Guenael

On Monday, 12 November 2012 at 20:14:26 UTC+1, ousia wrote:

> Thank you for your explanation, John.
>
> I'm afraid I cannot code and pandoc's internal representation of the
> document (sorry if the naming isn't accurate, but this is really Greek
> to me [χαλεπὰ τὰ καλά, I agree :-)]) is beyond my extremely limited
> understanding on these matters.
>
> I can understand (and I hope I'm not wrong) that pandoc cannot be as
> flexible as HTML and this is on purpose. (This might be problematic for
> some uses, as the Spanish law gazette uses no headings, but it
> distinguishes the different […] with different classes.)
>
> Not focusing specifically on HTML, I think that pandoc should allow to
> uniquely identify, add to a class and set the language to any element,
> desired text span or division.
>
> I know that this is related to a couple of messages I sent yesterday.
> Sorry for repeating myself, but these are basic features to write
> documents.
>
> From the documentation perspective, it would be to apply the type Attr
> to any constructor from data Block and Inline. And to data TableCell.
>
> Although language could be defined as a key-value pair in type Attr, I
> think is clearer to define a new specific language attribute.
>
> Is there anything wrong with this approach?
>
> Many thanks for your help,
>
>
> Pablo
>
> On 11/11/12 23:36, John MacFarlane wrote:
> > You've got to remember that pandoc converts the input format to an
> > internal representation of the document (the 'Pandoc' structure), and
> > then converts that to the output format.
> >
> > This internal representation (see
> > http://hackage.haskell.org/packages/archive/pandoc-types/1.9.1/doc/html/Text-Pandoc-Definition.html)
> > is much less expressive than HTML, and doesn't have a place for the
> > attributes you want. That's why they are lost on HTML -> HTML
> > translation.
> >
> > +++ Pablo Rodríguez [Nov 11 12 12:19 ]:
> >> Hi John,
> >>
> >> I'm using pandoc mainly to generate ePub files.
> >>
> >> I used textile first as source language, but it isn't fully implemented
> >> by pandoc and textile itself has issues with multiparagraph elements.
> >>
> >> It seems HTML is probably a much better option for pandoc as source
> >> language, although I have to forget footnotes. There is no way to have
> >> it all.
> >>
> >> But pandoc strips almost all attributes from HTML elements.
> >>
> >> A minimal sample:
> >>
> >> <ol start="2" style="list-style-type:lower-latin;">
> >> <li><p>Well there is no other way to tag <em lang="la">lingua
> >> latina</em>.</p>
> >> <li><p>Or even classes or ids.</p>.</li>
> >> </ol>
> >>
> >> Would it be possible that there is an option that doesn't strip off
> >> attributes from HTML code?
> >>
> >> BTW, when converting from HTML to another HTML code, at least id, class
> >> and lang attributes shouldn't be stripped off by default.
> >>
> >> Many thanks for your help,
> >>
> >>
> >> Pablo
> >> --
> >> http://www.ousia.tk
> >
> --
> http://www.ousia.tk

--
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/33fcfdbf-3edc-4145-a7f0-325bfd42698fn%40googlegroups.com.
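One way to read the asymmetry reported in this message, offered as a sketch rather than a description of pandoc's actual code: in pandoc's document model a paragraph block (Para) holds only its content, while a division block (Div) also carries an attribute triple of identifier, classes, and key-value pairs. A reader therefore has nowhere to put the attributes of a p element, but can keep those of a div. The Python classes below are simplified stand-ins for those two block types; the functions read_element and write_html and the attribute values given to the paragraph are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Attr:
    """Same shape as pandoc's Attr: identifier, classes, key-value pairs."""
    identifier: str = ""
    classes: list = field(default_factory=list)
    keyvals: list = field(default_factory=list)


@dataclass
class Para:
    """Simplified paragraph block: content only, no slot for attributes."""
    text: str


@dataclass
class Div:
    """Simplified division block: an Attr plus child blocks."""
    attr: Attr
    blocks: list


def to_attr(html_attrs):
    """Split HTML attributes into identifier, classes, and the remaining pairs."""
    rest = [(k, v) for k, v in html_attrs.items() if k not in ("id", "class")]
    return Attr(html_attrs.get("id", ""), html_attrs.get("class", "").split(), rest)


def read_element(tag, html_attrs, content):
    """Hypothetical reader step: build a block from one parsed HTML element."""
    if tag == "p":
        # Para has nowhere to store html_attrs, so they are dropped here.
        return Para(content)
    if tag == "div":
        return Div(to_attr(html_attrs), content)
    raise ValueError(f"unsupported tag: {tag}")


def write_html(block):
    """Hypothetical writer step: render a block back to HTML."""
    if isinstance(block, Para):
        return f"<p>{block.text}</p>"
    a = block.attr
    pairs = []
    if a.identifier:
        pairs.append(("id", a.identifier))
    if a.classes:
        pairs.append(("class", " ".join(a.classes)))
    pairs.extend(a.keyvals)
    attrs = "".join(f' {k}="{v}"' for k, v in pairs)
    inner = "\n".join(write_html(b) for b in block.blocks)
    return f"<div{attrs}>\n{inner}\n</div>"


# Illustrative attribute set, applied to both a paragraph and a division.
attrs = {"style": "xxx", "lang": "ar", "dir": "rtl", "class": "test", "id": "second"}
para = read_element("p", attrs, "Just some text")
div = read_element("div", attrs, [read_element("p", {}, "Just some text")])

print(write_html(para))
print(write_html(div))
```

Under these assumptions the paragraph comes back bare and the division keeps id, class, style, lang and dir, which matches the output shown in the message. If that reading is right, wrapping a paragraph in a div is one way to carry its attributes through the conversion.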