On Wednesday, December 22, 2021 at 19:05:31 UTC+1, Joseph wrote:

I finally had a chance to read through the documentation: wow, impressive! And by that I mean not only the library but the documentation itself. It's rare to see such comprehensive and clear documentation of a new project. For folks who don't already read Haskell, it's a great way to learn about Pandoc.

Thank you for the kind words!

BTW: It's probably too late to change, but I wonder if you should've given it a novel name? I wonder if it'll be easy for folks to find? (I'm not sure on github why it's labeled as a predominantly JavaScript project?)

I think that many people for whom this library could be useful would search for "Pandoc" and "Python" in a search engine; for me that returns "Scripting with Pandoc" (which is totally appropriate), this project, and then pypandoc (whose scope is different: it is a wrapper for the command line, and the document model is not exposed). So I guess it's easy to find at this stage; easier than panflute, for example (which I can't see in the results), despite the fact that panflute is more mature and obviously a totally appropriate result.

I guess that the JavaScript component found by GitHub refers to the pandoc type hierarchies (one per version), which are stored as JSON in this file:

https://github.com/boisgera/pandoc/blob/master/src/pandoc/pandoc-types.js

Also, I wonder if there will ever be a higher level way of searching/transforming markdown in Python? Panflute is a bit higher-level and more python-idiomatic, and your examples [1] are fantastic, but I crave the intuitive XML-based selectors (e.g., eTree, BeautifulSoup, and CSS).

Your API, like most, requires me to be familiar with the pandoc AST to do anything (e.g., meta is the first -- `doc[0]` -- item in the document structure).

Yes, you're 100% right:

- The library is low-level (at this stage) and therefore you're "expected" to build your own helpers on top of it if you want a higher-level API.

There is a finder example in the documentation (https://boisgera.github.io/pandoc/cookbook/#finder), but this is not in the official high-level API (yet).

After using the low-level API myself for a long time, I still wonder what kind of high-level API I'd like to have that would at least cover my own use cases ...

I rather dislike selector languages (XPath, CSS selectors, regexps, etc.), but there are other great sources of inspiration (the xml.etree and BeautifulSoup finders, the chained queries of RethinkDB, etc.).

- In the current state, you need to be familiar with the pandoc AST to do anything. I don't know to which extent we can avoid that (in any library) for advanced use cases.

I tried to ease the learning curve a bit (you can discover the type hierarchy interactively in the interpreter: https://boisgera.github.io/pandoc/document/#types), but I agree that it's likely to be a show-stopper for many people.
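To make the "finder" idea above a bit more concrete, here is a rough sketch of what a BeautifulSoup-style `find` / `find_all` could look like. It is only an illustration: the node classes (`Doc`, `Header`, `Para`, `Emph`, `Str`) are simplified stand-ins defined locally, not the library's actual types, and `walk`, `find` and `find_all` are hypothetical helper names rather than part of any official API.

```python
from dataclasses import dataclass, fields, is_dataclass

# Stand-in node types (NOT the library's own classes): just enough
# structure to mimic a small Pandoc-like document tree.
@dataclass
class Str:
    text: str

@dataclass
class Emph:
    inlines: list

@dataclass
class Para:
    inlines: list

@dataclass
class Header:
    level: int
    inlines: list

@dataclass
class Doc:
    blocks: list


def walk(node):
    """Yield node, then all of its descendants, depth-first in document order."""
    yield node
    if is_dataclass(node):
        children = [getattr(node, f.name) for f in fields(node)]
    elif isinstance(node, (list, tuple)):
        children = node
    elif isinstance(node, dict):
        children = node.values()
    else:
        children = []  # atoms such as str or int have no children
    for child in children:
        yield from walk(child)


def find_all(root, condition):
    """Return every node matching a type, a tuple of types, or a predicate."""
    if isinstance(condition, (type, tuple)):
        wanted = condition
        condition = lambda node: isinstance(node, wanted)
    return [node for node in walk(root) if condition(node)]


def find(root, condition):
    """Return the first matching node, or None if there is no match."""
    matches = find_all(root, condition)
    return matches[0] if matches else None


doc = Doc([
    Header(1, [Str("Title")]),
    Para([Str("Hello"), Emph([Str("world")])]),
])

print([s.text for s in find_all(doc, Str)])
print(find(doc, Emph))
print(find_all(doc, lambda n: isinstance(n, Header) and n.level == 1))
```

Accepting either a type or a predicate keeps the common case short (`find(doc, Emph)`) without introducing a selector language. The user still has to know the node types, so this only softens the AST-familiarity problem; it does not remove it.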

[1]: https://boisgera.github.io/pandoc/cookbook/

In the examples below, I exercise the three options for pandoc and Python. I kind of like using pandoc to convert it to HTML, using those selectors, and then converting back if need be... It'd be great if panflute (or pandoc) had high-level selectors.

```python
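import pandoc
import panflute as pf
from bs4 import BeautifulSoup

# COMMONMARK_SPEC is assumed to hold the Markdown source
# (including a "date" metadata field) as a string.

# 1. Using pandoc API to print date
# Requires I remember pandoc data model via list indices
# No find/select; lots of iteration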

doc = pandoc.read(COMMONMARK_SPEC)
meta = doc[0] # doc: Pandoc(Meta, [Block])
meta_dict = meta[0] # meta: Meta({Text: MetaValue})
date = meta_dict["date"]
date_inlines = date[0] # date: MetaInlines([Inline])
print("pandoc:" + pandoc.write(date_inlines).strip())

# 2. Using panflute to print date
# Data-model is a bit more intuitive.
# No find/select

doc = pf.convert_text(COMMONMARK_SPEC, standalone=True)
print("panflute:" + doc.get_metadata()["date"])

# 3. Using pandoc + BeautifulSoup to print date
# Requires me to remember HTML model, but I'm more familiar.
# Can use BeautifulSoup or CSS selectors

doc = pandoc.read(COMMONMARK_SPEC)
html = pandoc.write(doc, format="html", options=["--standalone"])
soup = BeautifulSoup(html, "html5lib")
date = soup.find("meta", {"name": "dcterms.date"})["content"]
print("BS native selector:" + date)
# CSS selector
date = soup.select("""meta[name="dcterms.date"]""")[0]["content"] # CSS
print("BS/CSS selector:" + date)

```

Very useful example! Thank you for taking the time to do this at this level of detail.

I'll have a deeper look at the BeautifulSoup find/find_all methods to start with and will experiment a bit; I'll report back here.

Cheers,

P.S.: the metadata is especially hard to use right now, since it is littered with Haskell-like wrapper types (MetaInlines, MetaBlocks, etc.) and 90% of the time you'd just like to have the result as a "regular Python dictionary". I'll also work a bit on this; I have not done it so far because I know that any such conversion will lose some type info (how do you distinguish an empty list of blocks from an empty list of inlines, for example?) and therefore cannot be used reliably for round-tripping. So the metadata handling so far is "correct" but not at all convenient.
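
To illustrate the trade-off, here is a minimal sketch of such a "plain dictionary" conversion and of the information it loses. The wrapper classes below are simplified stand-ins defined locally (they borrow the pandoc type names but are not the library's classes), and `to_plain` and `inlines_to_text` are hypothetical helper names, not an existing API.

```python
from dataclasses import dataclass

# Simplified stand-ins for the Haskell-like wrappers (hypothetical, local).
@dataclass
class Str:
    text: str

@dataclass
class Space:
    pass

@dataclass
class Para:
    inlines: list

@dataclass
class MetaString:
    value: str

@dataclass
class MetaBool:
    value: bool

@dataclass
class MetaInlines:
    inlines: list

@dataclass
class MetaBlocks:
    blocks: list

@dataclass
class MetaList:
    items: list

@dataclass
class MetaMap:
    entries: dict


def inlines_to_text(inlines):
    """Flatten a list of Str/Space stand-ins into a plain string."""
    return "".join(" " if isinstance(i, Space) else i.text for i in inlines)


def to_plain(value):
    """Recursively strip the Meta* wrappers, returning plain Python values."""
    if isinstance(value, (MetaString, MetaBool)):
        return value.value
    if isinstance(value, MetaInlines):
        return inlines_to_text(value.inlines)
    if isinstance(value, MetaBlocks):
        return "\n\n".join(inlines_to_text(b.inlines) for b in value.blocks)
    if isinstance(value, MetaList):
        return [to_plain(v) for v in value.items]
    if isinstance(value, MetaMap):
        return {k: to_plain(v) for k, v in value.entries.items()}
    raise TypeError(f"unsupported metadata value: {value!r}")


meta = MetaMap({
    "title": MetaInlines([Str("My"), Space(), Str("Doc")]),
    "date": MetaInlines([Str("2021-12-22")]),
    "draft": MetaBool(True),
    "keywords": MetaList([MetaString("pandoc"), MetaString("python")]),
})

plain = to_plain(meta)
print(plain["date"])

# The conversion is lossy: distinct wrapped values collapse to the same
# plain value, so the original types cannot be recovered afterwards.
print(to_plain(MetaInlines([])) == to_plain(MetaBlocks([])))
print(to_plain(MetaString("x")) == to_plain(MetaInlines([Str("x")])))
```

The last two comparisons both hold, which is exactly the round-tripping problem: once the wrappers are gone, nothing says whether an empty string came from inlines or blocks, so a reverse conversion would have to guess.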