Re: Pandoc Document Model in Python - Sébastien Boisgérault


From: "Sébastien Boisgérault" <sebastien.boisgerault-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
To: pandoc-discuss <pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
Subject: Re: Pandoc Document Model in Python
Date: Thu, 23 Dec 2021 02:56:01 -0800 (PST)
Message-ID: <e45c083b-fff3-46ac-8af5-b416c60f6a97n@googlegroups.com>
In-Reply-To: <c0c49e25-898d-c72c-3303-69005985ea01-T1oY19WcHSwdnm+yROfE0A@public.gmane.org>




Hi Joseph,

On Wednesday, December 22, 2021 at 19:05:31 UTC+1, Joseph wrote:

> I finally had a chance to read through the documentation: wow, impressive! 
> And by that I mean not only the library but the documentation itself. It's 
> rare to see such comprehensive and clear documentation of a new project. 
> For folks who don't already read haskell, it's a great way to learn about 
> Pandoc. 
>

Thank you for the kind words!

> BTW: It's probably too late to change, but I wonder if you should've given 
> it a novel name? I wonder if it'll be easy for folks to find? (I'm not sure 
> on github why it's labeled as a predominantly JavaScript project?) 
>

I think that many people for whom this library could be useful would
search for "Pandoc" and "Python" in a search engine; for me that returns
"Scripting with Pandoc" (which is totally appropriate), this project, and
then pypandoc (whose scope is different: it is a wrapper for the
command line; the document model is not exposed). So I guess it's easy to
find at this stage; easier than panflute, for example (which I can't see in
the results), despite the fact that panflute is more mature and obviously a
totally appropriate result.

I guess that the JavaScript component found by GitHub refers to the pandoc
type hierarchies (one per version), which are stored as a JSON file:

    https://github.com/boisgera/pandoc/blob/master/src/pandoc/pandoc-types.js

>
> Also, I wonder if there will ever be a higher level way of 
> searching/transforming markdown in Python? Panflute is a bit higher-level 
> and more python-idiomatic, and your examples [1] are fantastic, but I crave 
> the intuitive XML-based selectors (e.g., eTree, BeautifulSoup, and CSS). 

> Your API, like most, requires me to be familiar with the pandoc AST to do
> anything (e.g., meta is the first -- `doc[0]` -- item in the document
> structure).
>

Yes, you're 100% right: 

  - The library is low-level (at this stage), so you're "expected" to build
    your own helpers on top of it if you want a higher-level API.
    There is a finder example in the documentation
    (https://boisgera.github.io/pandoc/cookbook/#finder), but it is not part
    of the official high-level API (yet).
    After using the low-level API myself for a long time, I still wonder
    what kind of high-level API I'd like to have that would at least cover
    my own use cases...
    I rather dislike selector languages (XPath, CSS selectors, regexps,
    etc.), but there are other great sources of inspiration (the xml.etree
    and BeautifulSoup finders, the chained queries of RethinkDB, etc.).

  - In the current state, you need to be familiar with the pandoc AST to do
    anything. I don't know to what extent we can avoid that (in any library)
    for advanced use cases.
    I tried to improve the learning curve a bit (you can discover the type
    hierarchy interactively in the interpreter:
    https://boisgera.github.io/pandoc/document/#types),
    but I agree that it's likely to be a show-stopper for many people.
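
A minimal sketch of the kind of finder helper discussed above, in the spirit of the BeautifulSoup find/find_all methods. It does not use the library's API: the node classes (Str, Emph, Para, Doc) and the helpers (iter_tree, find_all, find) are hypothetical stand-ins, chosen only so that the example runs with the standard library alone.

```python
from dataclasses import dataclass, fields, is_dataclass

# Stand-in node types (hypothetical; NOT the real pandoc.types classes).
@dataclass
class Str:
    text: str

@dataclass
class Emph:
    inlines: list

@dataclass
class Para:
    inlines: list

@dataclass
class Doc:
    blocks: list


def iter_tree(node):
    """Yield node and all of its descendants, depth-first, in document order."""
    yield node
    if is_dataclass(node):
        children = [getattr(node, f.name) for f in fields(node)]
    elif isinstance(node, (list, tuple)):
        children = node
    elif isinstance(node, dict):
        children = node.values()
    else:
        children = []  # atoms (str, int, ...) have no children
    for child in children:
        yield from iter_tree(child)


def find_all(root, node_type):
    """Return every node under root (root included) of the given type."""
    return [n for n in iter_tree(root) if isinstance(n, node_type)]


def find(root, node_type):
    """Return the first match in document order, or None."""
    return next((n for n in iter_tree(root) if isinstance(n, node_type)), None)


doc = Doc([Para([Str("Hello"), Emph([Str("world")])]), Para([Str("Bye")])])
words = [s.text for s in find_all(doc, Str)]
first_emph = find(doc, Emph)
```

Matching on Python types rather than on a selector string keeps the helper a plain function call, but the caller still has to know which node types exist in the tree.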

>
> [1]: https://boisgera.github.io/pandoc/cookbook/ 
>
> In the examples below, I exercise the three options for pandoc and python. 
> I kind of like using pandoc to convert it to HTML, use those selectors, and 
> then convert back if need be... It'd be great if panflute (or pandoc) had 
> high-level selectors. 
>
>
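> ```python 
> import pandoc 
> import panflute as pf 
> from bs4 import BeautifulSoup 
>
> # COMMONMARK_SPEC is assumed to be defined elsewhere 
> # (markdown source text with a `date` metadata field) 
>
> # 1. Using pandoc API to print date 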
> # Requires I remember pandoc data model via list indices 
> # No find/select; lots of iteration 
>
> doc = pandoc.read(COMMONMARK_SPEC) 
> meta = doc[0] # doc: Pandoc(Meta, [Block]) 
> meta_dict = meta[0] # meta: Meta({Text: MetaValue}) 
> date = meta_dict["date"] 
> date_inlines = date[0] # date: MetaInlines([Inline]) 
> print("pandoc:" + pandoc.write(date_inlines).strip()) 
>
> # 2. Using panflute to print date 
> # Data-model is a bit more intuitive. 
> # No find/select 
>
> doc = pf.convert_text(COMMONMARK_SPEC, standalone=True) 
> print("panflute:" + doc.get_metadata()["date"]) 
>
> # 3. Using pandoc + BeautifulSoup to print date 
> # Requires me to remember HTML model, but I'm more familiar. 
> # Can use BeautifulSoup or CSS selectors 
>
> doc = pandoc.read(COMMONMARK_SPEC) 
> html = pandoc.write(doc, format="html", options=["--standalone"]) 
> soup = BeautifulSoup(html, "html5lib") 
> date = soup.find("meta", {"name": "dcterms.date"})["content"] 
> print("BS native selector:" + date) 
> # CSS selector 
> date = soup.select("""meta[name="dcterms.date"]""")[0]["content"] # CSS 
> print("BS/CSS selector:" + date) 
>
> ``` 
>

Very useful example! Thank you for taking the time to do this at this
level of detail.

I'll have a deeper look at the BeautifulSoup find/find_all methods to start
with and will experiment a bit; I'll report back here.

Cheers,

SB

P.S.: the metadata is especially hard to use right now, since it is
littered with Haskell-like wrapper types (MetaInlines, MetaBlocks, etc.)
and 90% of the time you'd just like to have the result as a "regular Python
dictionary". I'll also work a bit on this; I have not done it so far
because I know that any such conversion will lose some type info (how do
you distinguish an empty list of blocks from an empty list of inlines, for
example?) and therefore cannot be used reliably for round-tripping. So the
metadata handling so far is "correct" but not at all convenient.
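
The loss of type information mentioned in the P.S. can be illustrated with a small sketch. The wrapper classes below are hypothetical stand-ins that only mimic the shape of a few of pandoc's metadata constructors (they are not the library's classes), and to_plain is an assumed helper name, not an existing function.

```python
from dataclasses import dataclass

# Stand-in wrapper types (hypothetical; they only mimic the shape of
# pandoc's metadata constructors, not the library's real classes).
@dataclass
class MetaString:
    value: str

@dataclass
class MetaInlines:
    inlines: list

@dataclass
class MetaBlocks:
    blocks: list

@dataclass
class MetaList:
    items: list

@dataclass
class MetaMap:
    mapping: dict


def to_plain(value):
    """Recursively strip the Meta* wrappers, keeping only the payload."""
    if isinstance(value, MetaString):
        return value.value
    if isinstance(value, MetaInlines):
        return list(value.inlines)
    if isinstance(value, MetaBlocks):
        return list(value.blocks)
    if isinstance(value, MetaList):
        return [to_plain(item) for item in value.items]
    if isinstance(value, MetaMap):
        return {key: to_plain(item) for key, item in value.mapping.items()}
    raise TypeError(f"not a metadata value: {value!r}")


meta = MetaMap({
    "title": MetaString("Notes"),
    "abstract": MetaBlocks([]),   # empty list of blocks
    "subtitle": MetaInlines([]),  # empty list of inlines
})
plain = to_plain(meta)

# The wrapped values differ, but their plain counterparts are both [].
wrapped_differ = meta.mapping["abstract"] != meta.mapping["subtitle"]
plain_equal = plain["abstract"] == plain["subtitle"]
```

Once both values have become empty lists, nothing in the plain dictionary says which wrapper to restore, so a conversion back to the typed form would have to guess.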
 

