Using Pandoc for general text processing, e.g. writing a Ctags emitter: source text info in the AST

public inbox archive for pandoc-discuss@googlegroups.com

* Using Pandoc for general text processing, e.g. writing a Ctags emitter: source text info in the AST
From: gw2286-WLbs8XpHrcb2fBVCVOL8/A @ 2017-10-11 17:35 UTC
  To: pandoc-discuss

I'm interested in using Pandoc to write a generic Ctags emitter. However, 
I'm finding this is difficult because it seems impossible to connect a node 
to its original source code in the current API(s).

Is there any way to access the line number in the source file where a node 
first appears, from the Haskell API or the JSON formatted output? If not, 
is this something that would be feasible to track and expose?

Or, to stretch the idea a bit, would it be totally crazy to extend Pandoc 
with the ability to attach this kind of metadata to each node, or perhaps 
allow a reader to attach *arbitrary* metadata? In the latter case, the 
structure of the node-level metadata would be a matter of convention, and 
writers would be free to simply ignore it.

I'm asking about both the feasibility of implementation ("would it be 
possible without having to rewrite huge amounts of code?") and the 
desirability of implementation ("is this something the Pandoc project is 
interested in?").

For what it's worth, I don't envision this being useful solely for Ctags 
tag generation, although IMO a "format-agnostic" tag generator for dozens 
of markup formats that comes "for free" out of a single implementation 
seems like a good enough prize on its own. That only works if a reader can 
easily attach attributes such as "linenrStart", "linenrEnd", and/or 
"verbatimSource" to the nodes it produces. You could also use this ability 
to create Pandoc-based code/text formatters and linters that introspect on 
the contents of nodes, or to automatically inject a "view source for this 
section on GitHub" link into top-level headings.
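
As a rough illustration of the "matter of convention" idea, the sketch below stores source positions in the key-value list that Header nodes already carry in the JSON AST. The function names `attach_source_info` and `source_info` are hypothetical, the line numbers are supplied by the caller (nothing here computes them), and this only covers node types that have an attribute slot, which is not all of them.

```python
import copy


def attach_source_info(block, linenr_start, linenr_end):
    """Return a copy of a Header block with source lines stored as key-values.

    Assumption: the block is a Header in pandoc's JSON shape,
    {"t": "Header", "c": [level, [id, classes, keyvals], inlines]}.
    """
    if block["t"] != "Header":
        raise ValueError("this sketch only handles Header blocks")
    tagged = copy.deepcopy(block)
    _level, (_ident, _classes, keyvals), _inlines = tagged["c"]
    # Key-value attributes are strings, so the numbers are stored as text.
    keyvals.append(["linenrStart", str(linenr_start)])
    keyvals.append(["linenrEnd", str(linenr_end)])
    return tagged


def source_info(block):
    """Read the convention back; return (start, end) or None if absent."""
    _level, (_ident, _classes, keyvals), _inlines = block["c"]
    pairs = dict(keyvals)
    if "linenrStart" not in pairs or "linenrEnd" not in pairs:
        return None
    return int(pairs["linenrStart"]), int(pairs["linenrEnd"])


header = {"t": "Header",
          "c": [1, ["intro", [], []], [{"t": "Str", "c": "Intro"}]]}
tagged = attach_source_info(header, 3, 3)
print(source_info(tagged))
```

A consumer that does not know the convention sees two extra key-value pairs and can skip them, which is the "free to simply ignore it" property.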



* Re: Using Pandoc for general text processing, e.g. writing a Ctags emitter: source text info in the AST
From: John MACFARLANE @ 2017-10-11 18:20 UTC
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

Unfortunately, I didn't have the forethought to make design
choices early on that would have made this easier.

There's some discussion at https://github.com/jgm/pandoc/issues/684
about adding attributes to all elements of the AST.

But even if this were done, some of the parsers are not
designed in a way that would make it easy to track source
positions exactly.  (For example, a common parsing pattern
is to extract the content of a list item or block quote,
strip off indentation, and parse it -- but here source
positions get lost.)
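
To make the position loss concrete, here is a toy sketch of that pattern. It is not pandoc's reader code; `find_headings` and `parse_block_quote` are made-up names, and the "parser" only recognises lines that start with "# ". Once the block quote's content is extracted and its markers stripped, the inner parse counts lines from the start of the extracted text, not from the start of the file.

```python
def find_headings(text):
    """Toy parser: return (line_number, title) for each '# ' heading line."""
    found = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if line.startswith("# "):
            found.append((lineno, line[2:]))
    return found


def parse_block_quote(lines):
    """Strip the '> ' marker from each quoted line and re-parse the rest."""
    inner = "\n".join(line[2:] for line in lines)
    # The inner parser has no idea where `inner` sat in the original file.
    return find_headings(inner)


SOURCE = "intro\n\n> quoted\n> # Inner heading\n"
quote_lines = [line for line in SOURCE.splitlines() if line.startswith("> ")]

# In SOURCE the heading is on line 4, but the re-parse reports line 2.
print(parse_block_quote(quote_lines))
```

Carrying an offset into the inner parse would repair this toy case, but columns are shifted as well, and every reader that uses the pattern would need the same bookkeeping.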

I've recently rewritten the LaTeX reader in a way that
allows source positions to be accurately reported. (We
now have an initial tokenization phase, and source positions
are included in the tokens.) The same methods could be used
in the other parsers, but this, like the addition of
attributes, would be quite a big change.
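
The general shape of the tokenize-first approach can be sketched as follows. This is an illustration of the idea only, not the LaTeX reader's actual tokenizer: the `Token` type, the regular expression, and `section_lines` are all assumptions made for the example. Each token records the line and column where it started, so any later parsing stage can still report positions in the original file.

```python
import re
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    kind: str    # "ctrl" (control sequence), "sym" (brace), or "word"
    text: str
    line: int    # 1-based line in the original source
    column: int  # 1-based column in the original source


TOKEN_RE = re.compile(
    r"(?P<ctrl>\\[A-Za-z]+)|(?P<sym>[{}])|(?P<word>[^\\{}\s]+)"
)


def tokenize(source: str) -> List[Token]:
    """Split source into tokens, keeping each token's source position."""
    tokens = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in TOKEN_RE.finditer(line):
            tokens.append(
                Token(match.lastgroup, match.group(), lineno, match.start() + 1)
            )
    return tokens


def section_lines(tokens: List[Token]) -> List[Tuple[int, str]]:
    """Return (line, title) for each \\section{...} found in the tokens."""
    found = []
    for i, tok in enumerate(tokens):
        if tok.kind == "ctrl" and tok.text == "\\section":
            j = i + 1
            if j < len(tokens) and tokens[j].text == "{":
                j += 1
                title = []
                # Collect words up to the closing brace (no nesting handled).
                while j < len(tokens) and tokens[j].text != "}":
                    title.append(tokens[j].text)
                    j += 1
                found.append((tok.line, " ".join(title)))
    return found


SOURCE = "\\section{Intro}\ntext here\n\n\\section{More details}\n"
print(section_lines(tokenize(SOURCE)))
```

Because positions travel with the tokens rather than with the parser's input string, extracting and re-parsing a sub-sequence of tokens does not lose them.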
