How to process chunkedhtml output with Lua

public inbox archive for pandoc-discuss@googlegroups.com

* How to process chunkedhtml output with Lua
From: ChrisD @ 2023-01-21 13:42 UTC
  To: pandoc-discuss

The thread "Lua filter to process chunkedhtml output" has gotten kind of sidetracked with a discussion of the logging module and how data is represented. That is fine: it is good info, and I appreciate the improvements to logging.

But I'd like to get back to the question of how to process chunked html output into other formats with Lua. I don't understand what data is available, when it's available, and what data structures are used.

1. Lua filter: I'm looking for the list of all files that will exist in the output folder, and the table of contents (essentially the data in sitemap.json). Is that data even available at the time a filter runs? If so, how do I access it?

2. Lua custom writer: If it can't be done in a filter, can it be done using a custom writer? Where would I find the relevant data?

3. Post-processing with Lua: Pandoc can now be run as a Lua interpreter. If neither (1) nor (2) is possible, I'm thinking I could run pandoc normally to produce a chunked html output folder, and then run pandoc again with a Lua script that finds all the files in the output folder and reads sitemap.json. This is the same approach as doing the post-processing in some other language, except that no additional tools have to be installed. Is there anything that would prevent this approach?

Thanks,

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/35211aad-9b34-1c74-b25f-c2c3777da632%40intielectronics.com.


* Re: How to process chunkedhtml output with Lua
From: John MacFarlane @ 2023-01-21 22:23 UTC
  To: pandoc-discuss



> On Jan 21, 2023, at 5:42 AM, ChrisD wrote:
> 1. Lua filter: I'm looking for the list of all files that will exist in the output folder, and the table of contents (essentially the data in sitemap.json). Is that data even available at the time a filter runs? If so, how do I access it?

No, because the document gets broken up in the ChunkedHTML writer, and the filter runs before the writer.

However, in Lua (via the pandoc.structure module) you have access to the function pandoc will use to split up the document, so you can split it up yourself. You should then have the data, as long as the parameters you use for splitting are the same as the ones the writer will use.



* Re: How to process chunkedhtml output with Lua
From: ChrisD @ 2023-01-25 17:17 UTC
  To: pandoc-discuss



Thanks. I'm making some progress with this. A couple more questions:

1) pandoc.structure.split_into_chunks takes an opts parameter that has a path_template value. Is there a way to get the path_template that will be used by the chunked html writer?

2) The pandoc.structure.table_of_contents function returns a BulletList with the toc entries, but they are unnumbered even when --number-sections is true. I am calling
     pandoc.structure.table_of_contents(chunkeddoc, PANDOC_WRITER_OPTIONS)
where chunkeddoc is the output of split_into_chunks.  I have verified that PANDOC_WRITER_OPTIONS.number_sections = true. Am I missing something? Is this a bug?

3) Is there a simple way to get a list of files (including the image files) that will be included in the chunked html output folder? Maybe I can generate this from the ChunkedDoc, but it's going to take some parsing.

Thanks,


* Re: How to process chunkedhtml output with Lua
From: John MacFarlane @ 2023-01-26  0:50 UTC
  To: pandoc-discuss



> Thanks. I'm making some progress with this. A couple more questions:
> 
> 1) pandoc.structure.split_into_chunks takes an opts parameter that has a path_template value. Is there a way to get the path_template that will be used by the chunked html writer?

It is "%s-%i.html"
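
For illustration, here is a minimal sketch of a filter that does the split itself with that template and collects the chunk paths. The value `chunk_level = 1` is an assumption about the writer's default split level, and the helper name `chunk_paths` is made up for this example. The writer may still add or rename files (an index page, for instance), so the result should be treated as approximate.

```lua
-- Return the output path of every chunk, using the same splitting
-- function that the chunked HTML writer relies on.
function chunk_paths(doc)
  local chunked = pandoc.structure.split_into_chunks(doc, {
    path_template = '%s-%i.html',  -- template quoted in this thread
    chunk_level = 1,               -- assumption: default split level
    number_sections = PANDOC_WRITER_OPTIONS.number_sections,
  })
  local paths = {}
  for _, chunk in ipairs(chunked.chunks) do
    paths[#paths + 1] = chunk.path
  end
  return paths
end

-- Filter entry point: report the paths on stderr and leave the
-- document itself unchanged.
function Pandoc(doc)
  for _, path in ipairs(chunk_paths(doc)) do
    io.stderr:write(path, '\n')
  end
  return doc
end
```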

> 2) The pandoc.structure.table_of_contents function returns a BulletList with the toc entries, but they are unnumbered even when --number-sections is true. I am calling
>     pandoc.structure.table_of_contents(chunkeddoc, PANDOC_WRITER_OPTIONS)
> where chunkeddoc is the output of split_into_chunks.  I have verified that PANDOC_WRITER_OPTIONS.number_sections = true. Am I missing something? Is this a bug?

Yes, we produce an unnumbered list. The numbers will be part of the content of the list items. (This is because the numbering scheme might not match what would be generated for an ordered list.)  In HTML you’d want to use some CSS to suppress the bullet.
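
A small sketch of one way to prepare for that from Lua: wrap the generated list in a Div carrying a class that a stylesheet can target. The class name `plain-toc` and the helper name `toc_block` are invented for this example, and `chunked` is assumed to be the result of split_into_chunks.

```lua
-- Build the table of contents and wrap it in a Div so that a
-- stylesheet can select it. The class name is a hypothetical choice.
function toc_block(chunked)
  local toc = pandoc.structure.table_of_contents(chunked, PANDOC_WRITER_OPTIONS)
  return pandoc.Div({ toc }, pandoc.Attr('', { 'plain-toc' }))
end
```

A stylesheet rule such as `.plain-toc ul { list-style: none; }` would then hide the bullets while keeping the section numbers that are part of the item text.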

> 3) Is there a simple way to get a list of files (including the image files) that will be included in the chunked html output folder? Maybe I can generate this from the ChunkedDoc, but it's going to take some parsing.

It should be easy to get the non-image files from the ChunkedDoc.  Then there’s index.json.  Image files, not so easy. 
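
As a rough starting point for the image side, the sketch below walks each chunk and collects the sources that Image elements refer to. This is only an approximation under stated assumptions: it lists what the document references, which need not match the file names that end up in the output folder, and the helper name `image_sources` is hypothetical.

```lua
-- Collect the distinct image sources referenced in a ChunkedDoc.
-- These are the `src` values from the document, not necessarily the
-- names of the files the writer finally emits.
function image_sources(chunked)
  local seen, sources = {}, {}
  for _, chunk in ipairs(chunked.chunks) do
    pandoc.Blocks(chunk.contents):walk({
      Image = function(img)
        if not seen[img.src] then
          seen[img.src] = true
          sources[#sources + 1] = img.src
        end
      end,
    })
  end
  return sources
end
```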


* Re: How to process chunkedhtml output with Lua
From: ChrisD @ 2023-01-26 17:41 UTC
  To: pandoc-discuss

On 1/25/2023 5:50 PM, John MacFarlane wrote:
>> 3) Is there a simple way to get a list of files (including the image files) that will be included in the chunked html output folder? Maybe I can generate this from the ChunkedDoc, but it's going to take some parsing.
> It should be easy to get the non-image files from the ChunkedDoc.  Then there’s index.json.
What is index.json? If you mean sitemap.json, that doesn't exist yet, and it doesn't include the image files.

> Image files, not so easy.
I'm thinking this task may be easier to do as a post-processing step, rather than as a filter. I'll have sitemap.json, and I can generate a list of files from the output folder or zip file.
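
A minimal sketch of such a post-processing script, under several assumptions: each node in sitemap.json is taken to look like `{ section = { path = ... }, subsections = { ... } }`, the pandoc version is taken to be recent enough to provide `pandoc.json`, and the helper name `collect_paths` is made up for this example. The directory listing covers the top level of the folder only.

```lua
-- Collect the distinct HTML file paths named in a decoded sitemap tree.
-- Assumed node layout: { section = { path = ... }, subsections = { ... } }
function collect_paths(node, seen, acc)
  seen, acc = seen or {}, acc or {}
  local path = node.section and node.section.path
  if path then
    path = path:gsub('#.*$', '')  -- drop any fragment identifier
    if path ~= '' and not seen[path] then
      seen[path] = true
      acc[#acc + 1] = path
    end
  end
  for _, child in ipairs(node.subsections or {}) do
    collect_paths(child, seen, acc)
  end
  return acc
end

-- This part only runs under the pandoc Lua interpreter, where the
-- `pandoc` global exists and the output folder is the first argument.
if pandoc and arg and arg[1] then
  local dir = arg[1]
  local fh = assert(io.open(pandoc.path.join({ dir, 'sitemap.json' }), 'r'))
  local sitemap = pandoc.json.decode(fh:read('a'), false)
  fh:close()
  for _, p in ipairs(collect_paths(sitemap)) do
    print('page', p)
  end
  -- Top-level entries only; subfolders are not descended into.
  for _, f in ipairs(pandoc.system.list_directory(dir)) do
    print('file', f)
  end
end
```

It could be run after a normal chunked HTML run, for example as `pandoc lua list-output.lua out`, where `list-output.lua` and `out` are placeholder names.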


* Re: How to process chunkedhtml output with Lua
From: John MacFarlane @ 2023-01-26 21:57 UTC
  To: pandoc-discuss



> On Jan 26, 2023, at 9:41 AM, ChrisD wrote:
> 
> On 1/25/2023 5:50 PM, John MacFarlane wrote:
>>> 3) Is there a simple way to get a list of files (including the image files) that will be included in the chunked html output folder? Maybe I can generate this from the ChunkedDoc, but it's going to take some parsing.
>> It should be easy to get the non-image files from the ChunkedDoc.  Then there’s index.json.
> What is index.json? If you mean sitemap.json, that doesn't exist yet, and it doesn't include the image files.

Yes that’s what I meant.


>> Image files, not so easy.
> I'm thinking this task may be easier to do as a post-processing step, rather than as a filter. I'll have sitemap.json, and I can generate a list of files from the output folder or zip file.

Possibly!



* Re: How to process chunkedhtml output with Lua
From: BPJ @ 2023-01-27 10:42 UTC
  To: pandoc-discuss


On Thu, 26 Jan 2023 at 18:42, ChrisD wrote:

> On 1/25/2023 5:50 PM, John MacFarlane wrote:
> >> 3) Is there a simple way to get a list of files (including the image files) that will be included in the chunked html output folder? Maybe I can generate this from the ChunkedDoc, but it's going to take some parsing.
> > It should be easy to get the non-image files from the ChunkedDoc.  Then there’s index.json.
> What is index.json? If you mean sitemap.json, that doesn't exist yet, and it doesn't include the image files.
>

You can always make two runs and use the sitemap.json produced by the first run
during the second run.



