From: Cormac Relf
Newsgroups: gmane.text.pandoc
Subject: Re: Experimental citeproc implementation in Rust
Date: Wed, 12 Dec 2018 01:21:51 -0800 (PST)
Message-ID: <9e7db31a-8244-4ac8-800b-25709cedc240@googlegroups.com>
References: <78b7f42d-7640-45ff-a359-f59355217af8@googlegroups.com>
To: pandoc-discuss

That’s a good point about native Lua modules. I’m looking into a safe API for that over at rlua, but it’s clearly possible in unsafe Rust.


The output formatting architecture so far doesn’t actually use any particular internal format; it’s just a trait (like a Haskell typeclass) with an associated type. So PlainText builds Strings and ignores formatting (and is very fast), but Pandoc builds Vec<Inline>, where Inline comes from the pandoc_types crate. So an unsupported CSL formatting instruction like display="block" would simply be ignored in the Pandoc implementation. A fully-featured format would encode everything in an Html type that knows how to serialize itself at the end. I might change this, given its code size implications for the WebAssembly output, as Rust's monomorphisation means all formatting-dependent functions would be compiled and emitted three times, once per output format, with inlining performed on each.
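
A minimal sketch of the trait-with-associated-type design described above, using only the standard library. PlainText, Pandoc, Inline and Build are names from this message; OutputFormat, Formatting, text, join and render_title are hypothetical, and the Inline enum is a cut-down stand-in rather than the real pandoc_types definition.

```rust
// Hypothetical sketch; this is not the citeproc-rs API.

/// Formatting that a CSL style can request for a run of text.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
pub struct Formatting {
    pub italic: bool,
    pub bold: bool,
}

/// Cut-down stand-in for the `Inline` type of the pandoc_types crate.
#[derive(Clone, Debug, PartialEq)]
pub enum Inline {
    Str(String),
    Emph(Vec<Inline>),
    Strong(Vec<Inline>),
}

/// An output format is a trait with an associated type for what it builds.
pub trait OutputFormat {
    type Build;
    fn text(&self, s: &str, f: Formatting) -> Self::Build;
    fn join(&self, parts: Vec<Self::Build>) -> Self::Build;
}

/// Builds plain Strings and ignores every formatting request.
pub struct PlainText;

impl OutputFormat for PlainText {
    type Build = String;
    fn text(&self, s: &str, _f: Formatting) -> String {
        s.to_owned()
    }
    fn join(&self, parts: Vec<String>) -> String {
        parts.concat()
    }
}

/// Builds a list of pandoc-style inline nodes.
pub struct Pandoc;

impl OutputFormat for Pandoc {
    type Build = Vec<Inline>;
    fn text(&self, s: &str, f: Formatting) -> Vec<Inline> {
        let mut out = vec![Inline::Str(s.to_owned())];
        if f.italic {
            out = vec![Inline::Emph(out)];
        }
        if f.bold {
            out = vec![Inline::Strong(out)];
        }
        out
    }
    fn join(&self, parts: Vec<Vec<Inline>>) -> Vec<Inline> {
        parts.into_iter().flatten().collect()
    }
}

/// Generic rendering code: the compiler emits one copy per output format.
pub fn render_title<O: OutputFormat>(fmt: &O, title: &str) -> O::Build {
    let italic = Formatting { italic: true, bold: false };
    fmt.join(vec![
        fmt.text(title, italic),
        fmt.text(".", Formatting::default()),
    ])
}
```

Calling render_title with &PlainText yields a String, and calling it with &Pandoc yields a Vec<Inline>. Because the function is generic, each call site gets its own compiled copy, which is the code-size cost for WebAssembly mentioned above.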


In the current architecture, you also have *inputs* that are generic over the output format. So a Cite is actually specialised for each input format, such that the locators and affixes are specialised and any deserialization would be to a Cite<Pandoc>, which will read Pandoc::Build = Vec<Inline> into its affixes. This goes through serde's Deserialize trait (with serde_json), which is dead easy; the cost just boils down to keeping it in sync with pandoc's AST. That could be mitigated by doing an incomplete deserialization and leaving unrecognised nodes in serialized form, such that new AST nodes wouldn't cause parse errors. But that's probably more work than the maintenance itself.
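
A rough sketch of both ideas in that paragraph, under stated assumptions: a Cite whose affixes are stored in the output format's Build type, and a decoder that keeps unrecognised nodes in serialized form instead of failing. It avoids serde so that it stays self-contained; decode_inline, pandoc_cite and the Unknown variant are hypothetical, and splitting the JSON into (tag, raw content) pairs is assumed to have been done already by a real JSON library.

```rust
// Hypothetical sketch; not the citeproc-rs or pandoc_types API.

/// Cut-down inline node with a fallback for node types we do not know.
#[derive(Clone, Debug, PartialEq)]
pub enum Inline {
    Str(String),
    Space,
    /// An unrecognised node, kept in its serialized form.
    Unknown { tag: String, raw: String },
}

/// Only the associated type matters for this sketch.
pub trait OutputFormat {
    type Build;
}

pub struct PlainText;
impl OutputFormat for PlainText {
    type Build = String;
}

pub struct Pandoc;
impl OutputFormat for Pandoc {
    type Build = Vec<Inline>;
}

/// A citation whose affixes use the output format's own build type.
pub struct Cite<O: OutputFormat> {
    pub id: String,
    pub prefix: O::Build,
    pub suffix: O::Build,
}

/// Decode one node from its tag and raw content. Known tags become typed
/// nodes; anything else is preserved rather than rejected.
pub fn decode_inline(tag: &str, raw: &str) -> Inline {
    match tag {
        "Str" => Inline::Str(raw.to_owned()),
        "Space" => Inline::Space,
        _ => Inline::Unknown {
            tag: tag.to_owned(),
            raw: raw.to_owned(),
        },
    }
}

/// Build a Cite<Pandoc> whose prefix is decoded from (tag, raw) pairs.
pub fn pandoc_cite(id: &str, prefix: &[(&str, &str)]) -> Cite<Pandoc> {
    Cite {
        id: id.to_owned(),
        prefix: prefix.iter().map(|&(t, r)| decode_inline(t, r)).collect(),
        suffix: Vec::new(),
    }
}
```

With this shape, a node type added to the AST later would land in Unknown and could be passed through untouched, at the price of carrying an extra variant through every match on Inline.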


The BibTeX parsing is a tricky one, though. There’s this for the main syntax, at least. I wouldn’t want to fork out to Pandoc for every single LaTeX text field, but maybe the Lua API’s read would help here. It might be simpler to support both citeproc-js’ micro-HTML and a similarly limited micro-LaTeX with a simple Rust-based parser, but not at the same time. What do people use backslash commands for in BibTeX? Are there names and document titles out there that really need the whole power of LaTeX to render? I might have to think about this some more. Perhaps the answer is a successor to CSL-JSON that accepts arbitrary JSON objects wherever the old one accepts strings.
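
To make the "micro-LaTeX" idea concrete, here is one possible sketch of a very small Rust parser for BibTeX field text. The supported subset (\emph, \textit, \textbf and bare {...} groups) is an assumption chosen for illustration, as are the names Node and parse_micro_latex. Any other backslash command is kept as literal text rather than interpreted.

```rust
// Hypothetical sketch of a limited "micro-LaTeX" parser; std only.

#[derive(Debug, PartialEq)]
pub enum Node {
    Text(String),
    Emph(Vec<Node>),
    Strong(Vec<Node>),
    /// A bare `{...}` group, which BibTeX uses to protect capitalisation.
    Group(Vec<Node>),
}

pub fn parse_micro_latex(input: &str) -> Vec<Node> {
    let chars: Vec<char> = input.chars().collect();
    let mut pos = 0;
    let mut nodes = parse_until_close(&chars, &mut pos);
    // A stray `}` at top level is kept as literal text.
    while pos < chars.len() {
        nodes.push(Node::Text("}".to_owned()));
        pos += 1;
        nodes.extend(parse_until_close(&chars, &mut pos));
    }
    nodes
}

/// Parse nodes until an unmatched `}` or the end of input.
fn parse_until_close(chars: &[char], pos: &mut usize) -> Vec<Node> {
    let mut nodes = Vec::new();
    let mut text = String::new();
    while *pos < chars.len() {
        match chars[*pos] {
            '}' => break,
            '{' => {
                flush(&mut text, &mut nodes);
                nodes.push(Node::Group(parse_group(chars, pos)));
            }
            '\\' => {
                let start = *pos;
                *pos += 1;
                let mut name = String::new();
                while *pos < chars.len() && chars[*pos].is_ascii_alphabetic() {
                    name.push(chars[*pos]);
                    *pos += 1;
                }
                let has_arg = *pos < chars.len() && chars[*pos] == '{';
                match (name.as_str(), has_arg) {
                    ("emph", true) | ("textit", true) => {
                        flush(&mut text, &mut nodes);
                        nodes.push(Node::Emph(parse_group(chars, pos)));
                    }
                    ("textbf", true) => {
                        flush(&mut text, &mut nodes);
                        nodes.push(Node::Strong(parse_group(chars, pos)));
                    }
                    _ => {
                        // Unsupported command: keep its source text verbatim.
                        // A one-symbol command such as `\&` has an empty name,
                        // so take the symbol as well.
                        if *pos == start + 1 && *pos < chars.len() {
                            *pos += 1;
                        }
                        text.extend(&chars[start..*pos]);
                    }
                }
            }
            c => {
                text.push(c);
                *pos += 1;
            }
        }
    }
    flush(&mut text, &mut nodes);
    nodes
}

/// Parse `{...}` starting at the opening brace; consumes the closing one.
fn parse_group(chars: &[char], pos: &mut usize) -> Vec<Node> {
    *pos += 1; // skip '{'
    let inner = parse_until_close(chars, pos);
    if *pos < chars.len() {
        *pos += 1; // skip '}'
    }
    inner
}

/// Move any pending literal text into the node list.
fn flush(text: &mut String, nodes: &mut Vec<Node>) {
    if !text.is_empty() {
        nodes.push(Node::Text(std::mem::take(text)));
    }
}
```

Accents, math and macros with several arguments are deliberately out of scope here; whether a subset this small covers real-world .bib files is exactly the open question raised above.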



On Wednesday, December 12, 2018 at 5:44:35 AM UTC+11, John MacFarlane wrote:

> That's an interesting idea.  pandoc-citeproc is still
> pretty crufty, and it doesn't always behave like
> citeproc-js, so I can see the point of this.
>
> The difficulties are that
>
> - pandoc-citeproc is currently quite tightly
>   integrated with pandoc; it operates on the pandoc
>   AST.  So as you note, that capability would have to
>   be reproduced somehow in citeproc-rs.  I think that
>   the tree-walking work could be given to a lua filter
>   that either called out to citeproc-rs or linked to
>   a version of it.  (I don't think luajit is required
>   for this; one can write lua modules in C, so it
>   should be possible to do it in rust.) But citeproc-rs would
>   still have to be able to handle pandoc JSON. Perhaps
>   that could just be the underlying format it operates
>   on (it would have to replace the current HTML-ish
>   syntax used in citeproc-js, and maybe it would have
>   to be made more expressive).
>
> - One potential problem is that citeproc-rs would need to
>   change, sometimes, when pandoc does.  Currently
>   that's not a problem since I maintain pandoc-citeproc.
>
> - pandoc-citeproc does some things citeproc-js does
>   not do (these are, strictly speaking, extensions to
>   standard citeproc).  For example, author-in-text
>   citations, citation prefixes and suffixes, proper
>   handling of math (that's actually just folded into
>   general pandoc support), movement of punctuation,
>   conversion from bibtex/biblatex and other formats.
>   Note that conversion from bibtex relies on pandoc's
>   latex parser; to reproduce this functionality, you'd
>   have to write a latex parser in rust or somehow call
>   out to pandoc.
>
> Best,
> John
