public inbox archive for pandoc-discuss@googlegroups.com
From: John MACFARLANE <jgm-TVLZxgkOlNX2fBVCVOL8/A@public.gmane.org>
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org
Subject: Re: Using Pandoc for general text processing, e.g. writing a Ctags emitter: source text info in the AST
Date: Wed, 11 Oct 2017 11:20:09 -0700
Message-ID: <20171011182009.GA42638@protagoras>
In-Reply-To: <9c830d97-68ca-4cda-8892-3cad8b2c975d-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>

Unfortunately, I didn't have the forethought to make design
choices early on that would have made this easier.

There's some discussion at https://github.com/jgm/pandoc/issues/684
about adding attributes to all elements of the AST.

But even if this were done, some of the parsers are not
designed in a way that would make it easy to track source
positions exactly.  (For example, a common parsing pattern
is to extract the content of a list item or block quote,
strip off indentation, and parse it -- but here source
positions get lost.)
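
To make the strip-and-reparse problem concrete, here is a minimal
sketch in Python. It is a toy, not Pandoc's code: the functions
find_headings and quote_body and the sample input are hypothetical
and exist only for illustration. It shows that line numbers found
inside the extracted block quote are relative to the stripped
fragment, and that the caller has to carry an offset to recover the
real line. Columns would need the same treatment for the stripped
marker.

```python
# Toy illustration only; none of these names come from Pandoc.

SOURCE = """Intro paragraph.

> quoted text
> # Heading in quote
> more text
"""


def find_headings(text):
    """Return (line_number, title) for each '# ' heading, 1-based lines."""
    return [
        (number, line[2:].strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if line.startswith("# ")
    ]


def quote_body(source):
    """Return (first_line_number, stripped_text) for the first block quote."""
    lines = source.splitlines()
    start = next(i for i, line in enumerate(lines) if line.startswith(">"))
    body = []
    for line in lines[start:]:
        if not line.startswith(">"):
            break
        # Strip the quote marker, as a reader would before reparsing.
        body.append(line[2:] if line.startswith("> ") else line[1:])
    return start + 1, "\n".join(body)


first_line, inner = quote_body(SOURCE)

# Reparsing the stripped fragment gives positions relative to the fragment.
naive = find_headings(inner)

# Recovering the real line requires remembering where the fragment began.
adjusted = [(number + first_line - 1, title) for number, title in naive]

print(naive)
print(adjusted)
```

The heading sits on line 4 of the input, but the naive reparse
reports line 2. A single offset is enough in this toy case; nested
containers with varying indentation would each need their own
bookkeeping.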

I've recently rewritten the LaTeX reader in a way that
allows source positions to be accurately reported. (We
now have an initial tokenization phase, and source positions
are included in the tokens.) The same methods could be used
in the other parsers, but this, like the addition of
attributes, would be quite a big change.
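
The general idea of a tokenization phase that records positions can
be sketched as follows. This is a toy in Python under stated
assumptions, not the actual LaTeX reader (which is written in
Haskell): the Token type, the token grammar, and the
section_positions helper are all hypothetical. Each token carries the
line and column where it started, so a later parsing stage can report
positions without looking at the raw text again.

```python
import re
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    kind: str   # "ctrl", "brace", or "word"
    text: str
    line: int   # 1-based line where the token starts
    col: int    # 1-based column where the token starts


# Deliberately tiny grammar: control sequences, braces, whitespace, words.
TOKEN_RE = re.compile(
    r"(?P<ctrl>\\[A-Za-z]+)"
    r"|(?P<brace>[{}])"
    r"|(?P<newline>\n)"
    r"|(?P<space>[ \t]+)"
    r"|(?P<word>[^\\{}\s]+)"
)


def tokenize(source: str) -> List[Token]:
    """Split source into tokens, recording each token's start position."""
    tokens = []
    line, line_start = 1, 0
    for match in TOKEN_RE.finditer(source):
        kind = match.lastgroup
        if kind == "newline":
            line += 1
            line_start = match.end()
        elif kind != "space":
            col = match.start() - line_start + 1
            tokens.append(Token(kind, match.group(), line, col))
    return tokens


def section_positions(tokens: List[Token]) -> List[Tuple[str, int, int]]:
    """Return (title, line, col) for each section command followed by {...}."""
    found = []
    for i, tok in enumerate(tokens):
        if tok.kind != "ctrl" or tok.text != r"\section":
            continue
        j = i + 1
        if j < len(tokens) and tokens[j].text == "{":
            words = []
            j += 1
            while j < len(tokens) and tokens[j].text != "}":
                words.append(tokens[j].text)
                j += 1
            found.append((" ".join(words), tok.line, tok.col))
    return found


SOURCE = r"""Intro text

\section{First part}
Body
  \section{Second}
"""

tokens = tokenize(SOURCE)
sections = section_positions(tokens)
print(sections)
```

Because the position lives in the token rather than in the parser
state, the section stage never has to count lines itself, which is
the property a tag generator would rely on.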

+++ gw2286-WLbs8XpHrcb2fBVCVOL8/A@public.gmane.org [Oct 11 17 10:35 ]:
>   I'm interested in using Pandoc to write a generic Ctags emitter.
>   However, I'm finding this is difficult because it seems impossible to
>   connect a node to its original source code in the current API(s).
>   Is there any way to access the line number in the source file where a
>   node first appears, from the Haskell API or the JSON formatted output?
>   If not, is this something that would be feasible to track and expose?
>   Or, to stretch the idea a bit, would it be totally crazy to extend
>   Pandoc with the ability to attach this kind of metadata to each node,
>   or perhaps allow a reader to attach *arbitrary* metadata? In the latter
>   case, the structure of the node-level metadata would be a matter of
>   convention, and writers would be free to simply ignore it.
>   I'm asking about both the feasibility of implementation ("would it be
>   possible without having to rewrite huge amounts of code?") and the
>   desirability of implementation ("is this something the Pandoc project
>   is interested in?").
>   For what it's worth, I don't envision this being useful solely for
>   Ctags tag generation, although IMO a "format-agnostic" tag generator
>   for dozens of markup formats that comes "for free" out of a single
>   implementation seems like a good enough prize on its own, so long as
>   you can easily add, say, "linenrStart", "linenrEnd", and/or
>   "verbatimSource" attributes to the reader. You could also use this
>   ability to create Pandoc-based code/text formatters and linters that
>   introspect on the contents of nodes, or automatically inject a "view
>   source for this section on GitHub" link into top-level headings.

