public inbox archive for pandoc-discuss@googlegroups.com
 help / color / mirror / Atom feed
* RFC: change in reader types
@ 2021-03-29 18:34 John MacFarlane
       [not found] ` <m2k0ppkdwn.fsf-jF64zX8BO08an7k8zZ43ob9bIa4KchGshsV+eolpW18@public.gmane.org>
  0 siblings, 1 reply; 6+ messages in thread
From: John MacFarlane @ 2021-03-29 18:34 UTC (permalink / raw)
  To: pandoc-discuss


I've added a comment here
https://github.com/jgm/pandoc/issues/5501#issuecomment-809607817
that suggests changing the types of pandoc's readers.  This would
potentially be a breaking change, but it seems to me there's a
way to do it in a largely backwards-compatible way.

Motivation:  currently pandoc's parsers don't know what file a
particular bit of text comes from.  That's because the readers
just take a Text as parameter; the Text is formed by
concatenating input files.  This causes a problem when, e.g. we
process an INCLUDE directive in formats where relative paths are
interpreted relative to the directory containing the file.

To give the parsers information about the file containing each
bit of text, we could change the type of the readers from

PandocMonad m => ReaderOptions -> Text -> m Pandoc

to

PandocMonad m => ReaderOptions -> [(FilePath, Text)] -> m Pandoc

Then we could define a Stream instance for Parsec so that the
source positions are updated with the filename.  (This would work
for those readers that are implemented using parsec; it wouldn't
work for XML formats, where we use xml-conduit, but maybe that's
okay.)  So I don't think we'd need to change much in the code of
the reader modules.

To avoid breaking backwards compatibility, we could define
a typeclass PandocInput with instances including Text and
[(FilePath, Text)].  The reader type would tehn be

(PandocMonad m, PandocInput t) => ReaderOptions -> t -> m Pandoc

Existing code that passes a Text to the readers would then still
work.

We could even define PandocInput instances for ByteString,
Text, Lazy ByteString, Lazy Text, and Handle. This would break
existing code that passed input as a string literal, relying
on the OverloadedStrings extension, but I suspect there is little
code like this.

Anyway, comments welcome!


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: RFC: change in reader types
       [not found] ` <m2k0ppkdwn.fsf-jF64zX8BO08an7k8zZ43ob9bIa4KchGshsV+eolpW18@public.gmane.org>
@ 2021-03-31  8:38   ` Albert Krewinkel
       [not found]     ` <87ft0bsopb.fsf-9EawChwDxG8hFhg+JK9F0w@public.gmane.org>
  2021-03-31 20:56   ` MarLinn
  1 sibling, 1 reply; 6+ messages in thread
From: Albert Krewinkel @ 2021-03-31  8:38 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

John MacFarlane <jgm-TVLZxgkOlNX2fBVCVOL8/A@public.gmane.org> writes:

> To give the parsers information about the file containing each
> bit of text, we could change the type of the readers from
>
> PandocMonad m => ReaderOptions -> Text -> m Pandoc
>
> to
>
> PandocMonad m => ReaderOptions -> [(FilePath, Text)] -> m Pandoc
>
> Then we could define a Stream instance for Parsec so that the
> source positions are updated with the filename.

I'm very much in favor of doing this.

The one thing I worry about is that reader behavior would be a bit more
involved and possibly confusing. E.g., will we continue to always insert
a newline char between files, or can readers handle this differently?

The change would also allow us to let `--file-scope` work on the input
side, which is something that people seem to expect.

> To avoid breaking backwards compatibility, we could define
> a typeclass PandocInput with instances including Text and
> [(FilePath, Text)].  The reader type would tehn be
>
> (PandocMonad m, PandocInput t) => ReaderOptions -> t -> m Pandoc

I see the appeal of using a typeclass for this. My preference would be
to try a PandocInput ADT instead, which I think would be easier to
extend in the future. But now I'm bikeshedding.


--
Albert Krewinkel
GPG: 8eed e3e2 e8c5 6f18 81fe  e836 388d c0b2 1f63 1124


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: RFC: change in reader types
       [not found]     ` <87ft0bsopb.fsf-9EawChwDxG8hFhg+JK9F0w@public.gmane.org>
@ 2021-03-31 15:48       ` John MacFarlane
       [not found]         ` <m2zgyjfho0.fsf-jF64zX8BO08an7k8zZ43ob9bIa4KchGshsV+eolpW18@public.gmane.org>
  0 siblings, 1 reply; 6+ messages in thread
From: John MacFarlane @ 2021-03-31 15:48 UTC (permalink / raw)
  To: Albert Krewinkel, pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw


>> To avoid breaking backwards compatibility, we could define
>> a typeclass PandocInput with instances including Text and
>> [(FilePath, Text)].  The reader type would tehn be
>>
>> (PandocMonad m, PandocInput t) => ReaderOptions -> t -> m Pandoc
>
> I see the appeal of using a typeclass for this. My preference would be
> to try a PandocInput ADT instead, which I think would be easier to
> extend in the future. But now I'm bikeshedding.

Can you say more?  I may not understand what you have in mind.
But a typeclass can also be extended (to new instances) in the
future, and I don't see how an ADT would give us backwards
compatibility.


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: RFC: change in reader types
       [not found]         ` <m2zgyjfho0.fsf-jF64zX8BO08an7k8zZ43ob9bIa4KchGshsV+eolpW18@public.gmane.org>
@ 2021-03-31 20:26           ` Albert Krewinkel
       [not found]             ` <87eefvrrxe.fsf-9EawChwDxG8hFhg+JK9F0w@public.gmane.org>
  0 siblings, 1 reply; 6+ messages in thread
From: Albert Krewinkel @ 2021-03-31 20:26 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

John MacFarlane <jgm-TVLZxgkOlNX2fBVCVOL8/A@public.gmane.org> writes:

>>> To avoid breaking backwards compatibility, we could define
>>> a typeclass PandocInput with instances including Text and
>>> [(FilePath, Text)].  The reader type would tehn be
>>>
>>> (PandocMonad m, PandocInput t) => ReaderOptions -> t -> m Pandoc
>>
>> I see the appeal of using a typeclass for this. My preference would be
>> to try a PandocInput ADT instead, which I think would be easier to
>> extend in the future. But now I'm bikeshedding.
>
> Can you say more?  I may not understand what you have in mind.
> But a typeclass can also be extended (to new instances) in the
> future, and I don't see how an ADT would give us backwards
> compatibility.

I phrased that badly. To explain my thoughts: I tried to look at what
would happen if we wanted to add, say, info on the http request used to
get the contents. It would be easy to just add some new constructor to
an ADT, like in this ad-hoc example:

    data PandocInput = PandocInput [(FilePath, Text)]
                       -- ^ what we start with
                       -- | what's added later
                     | HttpInput Encoding URL Text PandocInput
                     | FileInput (FilePath, Text) PandocInput
                     | NoInput

Adding the other constructors later makes for an ugly type, but probably
wouldn't require changes in depending libraries as they are unlikely to
deconstruct a PandocInput, just construct it. We could also get rid of
the PandocInput constructor and replace it with a pattern synonym. The
downside is that we have to do more changes in the readers.

Of course, a solution that's easy to use and backwards compatible *now*
is admittedly more valuable than some hypothetical future.

To be honest, I think I just hold a not-entirely-reasonable distrust
towards typeclasses, probably because of the headaches they are giving
me in hslua. Also, at first I thought library users would have to adjust
class instances every time we changed the class, but now I remember that
default implementations exist, so it isn't at all a valid objection.

--
Albert Krewinkel
GPG: 8eed e3e2 e8c5 6f18 81fe  e836 388d c0b2 1f63 1124


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: RFC: change in reader types
       [not found] ` <m2k0ppkdwn.fsf-jF64zX8BO08an7k8zZ43ob9bIa4KchGshsV+eolpW18@public.gmane.org>
  2021-03-31  8:38   ` Albert Krewinkel
@ 2021-03-31 20:56   ` MarLinn
  1 sibling, 0 replies; 6+ messages in thread
From: MarLinn @ 2021-03-31 20:56 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

[-- Attachment #1: Type: text/plain, Size: 1763 bytes --]

> PandocMonad m => ReaderOptions -> [(FilePath, Text)] -> m Pandoc


I like this idea, but I have two suggestions for the bikeshed colors:


1. You already mentioned ByteString instances for the PandocInput 
typeclass. How about [(FilePath, ByteString)] as well?


2. May I suggest a record instead of a tuple. For example, the type 
could become (disregarding the new typeclass for clarity)

     PandocMonad m => ReaderOptions -> [PandocInputFile] -> m Pandoc

     data PandocInputFile = PandocInputFile { inputFileName :: FilePath, 
inputFileHandle :: Handle}

Now a Reader author can get a rich API like

inputFileHandle :: PandocInputFile -> Handle inputFileAsByteString :: 
PandocInputFile -> ByteString
	inputFileAsText       ::PandocInputFile -> TextinputFileName :: PandocInputFile -> FilePath 
inputFolder :: PandocInputFile -> FilePath

But especially if the constructor is hidden, this would still allow for 
re-shuffeling under the hood if necessary.

Part of me wants to generalize this further by turning the record back 
into a (free and comonadic I think?) functor, but that may be going too far.


3. Well actually, how about…

     instance Foldable f => PandocInput (f PandocInputFile) where…

Ok, now I'm really shedding my bikes all over your beautiful couch. 
Sorry about that.


-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/ecb48165-93a1-9566-c987-ddd42a7e409d%40gmail.com.

[-- Attachment #2: Type: text/html, Size: 5567 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: RFC: change in reader types
       [not found]             ` <87eefvrrxe.fsf-9EawChwDxG8hFhg+JK9F0w@public.gmane.org>
@ 2021-03-31 23:15               ` John MacFarlane
  0 siblings, 0 replies; 6+ messages in thread
From: John MacFarlane @ 2021-03-31 23:15 UTC (permalink / raw)
  To: Albert Krewinkel, pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw


Good point about the need to handle HTTP input too.
Of course, we could define data types like

data FileInput ...
data HttpInput ...

and just make them all instances of a typeclass (as well as
plain old Text, for backwards compatibility).  In addition to
backwards compatibility, an advantage of this approach over your
method is that adding new types which are instances of the
class would break fewer things than adding a new constructor
on a big data type (which could break pattern matching by making
it non-exhaustive).

The main constraint is that for any of these we'll want it
to be possible to define a parsec Stream instance.


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2021-03-31 23:15 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-29 18:34 RFC: change in reader types John MacFarlane
     [not found] ` <m2k0ppkdwn.fsf-jF64zX8BO08an7k8zZ43ob9bIa4KchGshsV+eolpW18@public.gmane.org>
2021-03-31  8:38   ` Albert Krewinkel
     [not found]     ` <87ft0bsopb.fsf-9EawChwDxG8hFhg+JK9F0w@public.gmane.org>
2021-03-31 15:48       ` John MacFarlane
     [not found]         ` <m2zgyjfho0.fsf-jF64zX8BO08an7k8zZ43ob9bIa4KchGshsV+eolpW18@public.gmane.org>
2021-03-31 20:26           ` Albert Krewinkel
     [not found]             ` <87eefvrrxe.fsf-9EawChwDxG8hFhg+JK9F0w@public.gmane.org>
2021-03-31 23:15               ` John MacFarlane
2021-03-31 20:56   ` MarLinn

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).