From: "Sébastien Boisgérault" <sebastien.boisgerault-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
To: pandoc-discuss <pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
Subject: Re: Pandoc Document Model in Python
Date: Thu, 23 Dec 2021 02:56:01 -0800 (PST)
Message-ID: <e45c083b-fff3-46ac-8af5-b416c60f6a97n@googlegroups.com>
In-Reply-To: <c0c49e25-898d-c72c-3303-69005985ea01-T1oY19WcHSwdnm+yROfE0A@public.gmane.org>
Hi Joseph,
On Wednesday, December 22, 2021 at 19:05:31 UTC+1, Joseph wrote:
> I finally had a chance to read through the documentation: wow, impressive!
> And by that I mean not only the library but the documentation itself. It's
> rare to see such comprehensive and clear documentation of a new project.
> For folks who don't already read haskell, it's a great way to learn about
> Pandoc.
>
Thank you for the kind words!
> BTW: It's probably too late to change, but I wonder if you should've given
> it a novel name? I wonder if it'll be easy for folks to find? (I'm not sure
> on github why it's labeled as a predominantly JavaScript project?)
>
I think that many people for whom this library could be useful would
search for "Pandoc" and "Python" in a search engine; for me that returns
"Scripting with Pandoc" (which is totally appropriate), this project and
then pypandoc (whose scope is different: it is a wrapper for the
command line; the document model is not exposed). So I guess it's easy to
find at this stage; easier than panflute, for example (which I can't see in
the results), despite the fact that panflute is more mature and obviously a
totally appropriate result.
I guess that the JavaScript component detected by GitHub refers to the
pandoc type hierarchies (one per version), which are stored as JSON in this file:
https://github.com/boisgera/pandoc/blob/master/src/pandoc/pandoc-types.js
>
> Also, I wonder if there will ever be a higher level way of
> searching/transforming markdown in Python? Panflute is a bit higher-level
> and more python-idiomatic, and your examples [1] are fantastic, but I crave
> the intuitive XML-based selectors (e.g., eTree, BeautifulSoup, and CSS).
> Your API, like most, requires me to be familiar with the pandoc AST to do
> anything (e.g., meta is the first -- `doc[0]` -- item in the document
> structure).
>
Yes, you're 100% right:
- The library is low-level (at this stage) and therefore you're
"expected" to build your own helpers on top of it if you want a
higher-level API.
There is a finder example in the documentation
(https://boisgera.github.io/pandoc/cookbook/#finder), but it is not part of
the official high-level API (yet).
After using the low-level API myself for a long time, I still wonder
what kind of high-level API I'd like to have that would at least cover my
own use cases ...
I rather dislike selector languages (XPath, CSS selectors, regexps,
etc.), but there are other great sources of inspiration (xml.etree and
BeautifulSoup finders, chained queries of RethinkDB, etc.).
- In the current state, you need to be familiar with the pandoc AST to do
anything. I don't know to what extent we can avoid that (in any library)
for advanced use cases.
I tried to ease the learning curve a bit (you can discover the type
hierarchy interactively in the interpreter:
https://boisgera.github.io/pandoc/document/#types),
but I agree that it's likely to be a show-stopper for many people.
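To make the "build your own helpers" point concrete, here is a minimal sketch
of a finder that accepts either a type or a predicate. It does not use the
library itself: Str, Emph, Para and Header below are hypothetical stand-ins
(plain named tuples) for the real AST classes, and walk/find_all are
illustrative names, not part of any official API. The cookbook finder linked
above remains the reference.

```python
from typing import NamedTuple

# Hypothetical stand-ins for a few pandoc AST types (NOT the library's classes).
class Str(NamedTuple):
    text: str

class Emph(NamedTuple):
    inlines: list

class Para(NamedTuple):
    inlines: list

class Header(NamedTuple):
    level: int
    inlines: list


def walk(elt):
    """Yield elt, then everything nested inside it, depth-first."""
    yield elt
    if isinstance(elt, dict):
        children = elt.values()
    elif isinstance(elt, (list, tuple)):  # named tuples are tuples too
        children = elt
    else:
        children = ()  # leaf value: str, int, ...
    for child in children:
        yield from walk(child)


def find_all(root, condition):
    """Return the elements under root matching a type or a predicate."""
    if isinstance(condition, type):
        wanted = condition
        condition = lambda elt: isinstance(elt, wanted)
    return [elt for elt in walk(root) if condition(elt)]


blocks = [
    Header(1, [Str("Title")]),
    Para([Str("Hello"), Emph([Str("world")])]),
]
words = [s.text for s in find_all(blocks, Str)]
print(words)  # ['Title', 'Hello', 'world']
```

A predicate covers the cases a bare type cannot express, for example
find_all(blocks, lambda e: isinstance(e, Header) and e.level == 1).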
>
> [1]: https://boisgera.github.io/pandoc/cookbook/
>
> In the examples below, I exercise the three options for pandoc and python.
> I kind of like using pandoc to convert it to HTML, use those selectors, and
> then convert back if need be... It'd be great if panflute (or pandoc) had
> high-level selectors.
>
>
> ```python
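> import pandoc                  # the Python library discussed in this thread
> import panflute as pf
> from bs4 import BeautifulSoup  # option 3 also needs html5lib installed
>
> # COMMONMARK_SPEC is assumed to be a Markdown string with a `date` metadata field.
>
> # 1. Using pandoc API to print date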
> # Requires I remember pandoc data model via list indices
> # No find/select; lots of iteration
>
> doc = pandoc.read(COMMONMARK_SPEC)
> meta = doc[0] # doc: Pandoc(Meta, [Block])
> meta_dict = meta[0] # meta: Meta({Text: MetaValue})
> date = meta_dict["date"]
> date_inlines = date[0] # date: MetaInlines([Inline])
> print("pandoc:" + pandoc.write(date_inlines).strip())
>
> # 2. Using panflute to print date
> # Data-model is a bit more intuitive.
> # No find/select
>
> doc = pf.convert_text(COMMONMARK_SPEC, standalone=True)
> print("panflute:" + doc.get_metadata()["date"])
>
> # 3. Using pandoc + BeautifulSoup to print date
> # Requires me to remember HTML model, but I'm more familiar.
> # Can use BeautifulSoup or CSS selectors
>
> doc = pandoc.read(COMMONMARK_SPEC)
> html = pandoc.write(doc, format="html", options=["--standalone"])
> soup = BeautifulSoup(html, "html5lib")
> date = soup.find("meta", {"name": "dcterms.date"})["content"]
> print("BS native selector:" + date)
> # CSS selector
> date = soup.select("""meta[name="dcterms.date"]""")[0]["content"] # CSS
> print("BS/CSS selector:" + date)
>
> ```
>
Very useful example! Thank you for taking the time to do this at this
level of detail.
I'll have a deeper look at BeautifulSoup's find/find_all methods to start
with and will experiment a bit; I'll report back here.
Cheers,
SB
P.S.: the metadata is especially hard to use right now, since it is
littered with Haskell-like wrapper types (MetaInlines, MetaBlocks, etc.)
and 90% of the time you'd just like to have the result as a "regular Python
dictionary". I'll also work a bit on this; I have not done it so far
because I know that any such conversion will lose some type info (how do
you distinguish an empty list of blocks from an empty list of inlines, for
example?) and therefore cannot be used reliably for round-tripping; so the
metadata handling so far is "correct" but not at all convenient.
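The trade-off can be illustrated with a small sketch. It assumes nothing
about the library's future API: the Meta* classes below are hypothetical
stand-ins (named tuples) that only mirror the shape of pandoc's metadata
wrappers, and to_plain is an illustrative name.

```python
from typing import NamedTuple

# Hypothetical stand-ins mirroring pandoc's metadata wrappers (NOT the real classes).
class Str(NamedTuple):
    text: str

class MetaString(NamedTuple):
    value: str

class MetaBool(NamedTuple):
    value: bool

class MetaInlines(NamedTuple):
    inlines: list

class MetaBlocks(NamedTuple):
    blocks: list

class MetaList(NamedTuple):
    elements: list

class MetaMap(NamedTuple):
    mapping: dict


def to_plain(value):
    """Strip the Meta* wrappers, recursing into lists and maps."""
    if isinstance(value, (MetaString, MetaBool, MetaInlines, MetaBlocks)):
        return value[0]  # the single wrapped payload
    if isinstance(value, MetaList):
        return [to_plain(item) for item in value.elements]
    if isinstance(value, MetaMap):
        return {key: to_plain(item) for key, item in value.mapping.items()}
    raise TypeError(f"not a metadata value: {value!r}")


meta = MetaMap({
    "title": MetaInlines([Str("Spec")]),
    "draft": MetaBool(True),
    "tags": MetaList([MetaString("a"), MetaString("b")]),
})
plain = to_plain(meta)
print(plain["tags"])  # ['a', 'b']

# Lossy: both wrappers collapse to the same empty list, so the original
# wrapper type cannot be recovered when converting back.
print(to_plain(MetaInlines([])) == to_plain(MetaBlocks([])))  # True
```

The result is convenient to read, but the last line shows why it cannot be
used reliably for round-tripping.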