From: Sébastien Boisgérault
Newsgroups: gmane.text.pandoc
To: pandoc-discuss
Subject: Re: Pandoc Document Model in Python
Date: Thu, 23 Dec 2021 02:56:01 -0800 (PST)

Hi Joseph,

On Wednesday, 22 December 2021 at 19:05:31 UTC+1, Joseph wrote:
> I finally had a chance to read through the documentation: wow, impressive! And by that I mean not only the library but the documentation itself. It's rare to see such comprehensive and clear documentation of a new project. For folks who don't already read Haskell, it's a great way to learn about Pandoc.

Thank you for the kind words!
 
> BTW: It's probably too late to change, but I wonder if you should've given it a novel name? I wonder if it'll be easy for folks to find? (I'm not sure why it's labeled on GitHub as a predominantly JavaScript project?)

I think that many people for whom this library could be useful would search for "Pandoc" and "Python" in a search engine; for me, that returns "Scripting with Pandoc" (which is totally appropriate), this project, and then pypandoc (whose scope is different: it is a wrapper for the command-line tool and does not expose the document model). So I guess it's easy to find at this stage; easier than panflute, for example (which I can't see in the results), despite the fact that panflute is more mature and obviously a totally appropriate result.

I guess that the JavaScript component detected by GitHub is the collection of pandoc type hierarchies (one per version), which is stored as a JSON file:

    https://github.com/boisgera/pandoc/blob/master/src/pandoc/pandoc-types.js
 

> Also, I wonder if there will ever be a higher-level way of searching/transforming markdown in Python? Panflute is a bit higher-level and more Python-idiomatic, and your examples [1] are fantastic, but I crave the intuitive XML-based selectors (e.g., eTree, BeautifulSoup, and CSS). Your API, like most, requires me to be familiar with the pandoc AST to do anything (e.g., meta is the first -- `doc[0]` -- item in the document structure).

Yes, you're 100% right:

  - The library is low-level (at this stage), so you're "expected" to build your own helpers on top of it if you want a higher-level API.
    There is a finder example in the documentation (https://boisgera.github.io/pandoc/cookbook/#finder), but it is not part of the official API (yet).
    After using the low-level API myself for a long time, I still wonder what kind of high-level API I'd like to have that would at least cover my own use cases ...
    I rather dislike selector languages (XPath, CSS selectors, regexps, etc.), but there are other great sources of inspiration (the xml.etree and BeautifulSoup finders,
    the chained queries of RethinkDB, etc.).

  - In the current state, you need to be familiar with the pandoc AST to do anything. I don't know to what extent we can avoid that (in any library) for advanced use cases.
    I tried to flatten the learning curve a bit (you can discover the type hierarchy interactively in the interpreter: https://boisgera.github.io/pandoc/document/#types),
    but I agree that it's likely to be a show-stopper for many people.
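
To make the finder idea from the first point a bit more concrete, here is a minimal sketch. It deliberately does not import the library: `Doc`, `Para`, `Emph` and `Str` are hypothetical stand-ins for a few AST types, and `walk` and `find` are illustrative helpers, not part of any official API. The only point is that one generic traversal plus a "type or predicate" condition already covers many simple queries:

```python
from dataclasses import dataclass, fields, is_dataclass


# Hypothetical stand-ins for a few AST node types (NOT the library's types).
@dataclass
class Str:
    text: str


@dataclass
class Emph:
    inlines: list


@dataclass
class Para:
    inlines: list


@dataclass
class Doc:
    blocks: list


def walk(elt):
    """Yield elt and all of its descendants, depth-first, in document order."""
    yield elt
    if is_dataclass(elt):
        children = [getattr(elt, f.name) for f in fields(elt)]
    elif isinstance(elt, (list, tuple)):
        children = elt
    else:
        children = []  # leaf value (str, int, ...)
    for child in children:
        yield from walk(child)


def find(elt, condition, all_matches=False):
    """Return the first match (or None), or a list of every match.

    `condition` is a type, a tuple of types, or a predicate function.
    """
    if isinstance(condition, (type, tuple)):
        types = condition
        condition = lambda e: isinstance(e, types)
    matches = (e for e in walk(elt) if condition(e))
    return list(matches) if all_matches else next(matches, None)


doc = Doc([Para([Str("Hello"), Emph([Str("world")])])])
first_emph = find(doc, Emph)                  # first emphasized span
all_strs = find(doc, Str, all_matches=True)   # every Str, in document order
long_words = find(doc, lambda e: isinstance(e, Str) and len(e.text) > 10)
```

With the real document model, the hand-written `walk` would presumably be replaced by the library's own tree iteration; the shape of `find` would stay the same. You still need to know that `Emph` and `Str` exist, which is exactly the limitation described in the second point.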

> [1]: https://boisgera.github.io/pandoc/cookbook/

> In the examples below, I exercise the three options for pandoc and Python. I kind of like using pandoc to convert it to HTML, use those selectors, and then convert back if need be... It'd be great if panflute (or pandoc) had high-level selectors.


> ```python
> import pandoc
> import panflute as pf
> from bs4 import BeautifulSoup
>
> # COMMONMARK_SPEC is assumed to hold the markdown source as a string.
>
> # 1. Using pandoc API to print date
> # Requires I remember pandoc data model via list indices
> # No find/select; lots of iteration
>
> doc = pandoc.read(COMMONMARK_SPEC)
> meta = doc[0]  # doc: Pandoc(Meta, [Block])
> meta_dict = meta[0]  # meta: Meta({Text: MetaValue})
> date = meta_dict["date"]
> date_inlines = date[0]  # date: MetaInlines([Inline])
> print("pandoc:" + pandoc.write(date_inlines).strip())
>
> # 2. Using panflute to print date
> # Data-model is a bit more intuitive.
> # No find/select
>
> doc = pf.convert_text(COMMONMARK_SPEC, standalone=True)
> print("panflute:" + doc.get_metadata()["date"])
>
> # 3. Using pandoc + BeautifulSoup to print date
> # Requires me to remember HTML model, but I'm more familiar.
> # Can use BeautifulSoup or CSS selectors
>
> doc = pandoc.read(COMMONMARK_SPEC)
> html = pandoc.write(doc, format="html", options=["--standalone"])
> soup = BeautifulSoup(html, "html5lib")
> date = soup.find("meta", {"name": "dcterms.date"})["content"]
> print("BS native selector:" + date)
> # CSS selector
> date = soup.select("""meta[name="dcterms.date"]""")[0]["content"]  # CSS
> print("BS/CSS selector:" + date)
>
> ```

Very useful example! Thank you for taking the time to do this at this level of detail.

I'll have a deeper look at the BeautifulSoup find/find_all methods to start with and will experiment a bit; I'll report back here.
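
For reference, the core of option 3 (reading the metadata back from the HTML output) does not strictly need a third-party parser. The sketch below uses only the standard library. The `HTML` string is an assumed, hand-written excerpt of what standalone HTML output looks like, not the result of an actual pandoc run, and `MetaCollector` is a name made up for this example:

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect the name/content pairs of every <meta> tag that has both."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag == "meta":
            attrs = dict(attrs)
            if "name" in attrs and "content" in attrs:
                self.meta[attrs["name"]] = attrs["content"]


# Assumed excerpt of standalone HTML output (hand-written for this sketch).
HTML = """<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8" />
  <meta name="dcterms.date" content="2021-12-23" />
  <title>Example</title>
</head>
<body><p>Hello</p></body>
</html>"""

collector = MetaCollector()
collector.feed(HTML)
date = collector.meta.get("dcterms.date")
print("html.parser:", date)
```

This only covers flat lookups like the date; anything resembling real selectors is where BeautifulSoup earns its place.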

Cheers,

SB

P.S.: The metadata is especially hard to use right now, since it is littered with Haskell-like wrapper types (MetaInlines, MetaBlocks, etc.), and 90% of the time you'd just like to have the result as a "regular Python dictionary". I'll also work a bit on this; I have not done it so far because I know that any such conversion will lose some type information (how do you distinguish an empty list of blocks from an empty list of inlines, for example?) and therefore cannot be used reliably for round-tripping. So the metadata handling so far is "correct" but not at all convenient.
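
Here is a small sketch of that trade-off. The `Meta*` classes below are hypothetical stand-ins that only mimic the wrapper types (inlines are simplified to plain strings), and `unwrap` is an illustrative helper, not an existing function of the library:

```python
from dataclasses import dataclass


# Hypothetical stand-ins for the metadata wrapper types (NOT the library's).
@dataclass
class MetaBool:
    value: bool


@dataclass
class MetaInlines:
    value: list


@dataclass
class MetaBlocks:
    value: list


@dataclass
class MetaList:
    value: list


@dataclass
class MetaMap:
    value: dict


def unwrap(meta_value):
    """Strip the Meta* wrappers recursively and return plain Python data."""
    if isinstance(meta_value, MetaMap):
        return {key: unwrap(val) for key, val in meta_value.value.items()}
    if isinstance(meta_value, MetaList):
        return [unwrap(val) for val in meta_value.value]
    # MetaBool, MetaInlines, MetaBlocks: return the payload unchanged.
    return meta_value.value


meta = MetaMap({
    "title": MetaInlines(["Pandoc", " ", "in", " ", "Python"]),
    "draft": MetaBool(False),
    "abstract": MetaBlocks([]),   # empty list of blocks
    "keywords": MetaInlines([]),  # empty list of inlines
})
plain = unwrap(meta)

# The convenient form cannot tell the two empty values apart any more.
ambiguous = plain["abstract"] == plain["keywords"] == []
```

Both `abstract` and `keywords` come out as `[]`, so nothing in `plain` tells you which wrapper to restore when converting back.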
 
