Needing to write a filter to transform malformed headings

public inbox archive for pandoc-discuss@googlegroups.com

* Needing to write a filter to transform malformed headings
From: Diego Algorta @ 2015-06-18 20:50 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

Hi everyone,

I learned about pandoc this week and it's great. Thanks for such a great
project.

I need to process a bunch of docx files with a specific outline format, but
I'll need to transform the AST a bit to fix some inconsistencies before
generating the final formats (HTML, docx, markdown and PDF). The
inconsistencies involve wrongly set heading styles and hard line breaks.

I laid out the problem in a gist for better formatting. Would you please 
take a look at it?
https://gist.github.com/oboxodo/4948ac8e79aa952df8bc

Most of my experience is in Ruby and JavaScript, so I expect to write my
filter with the Node port, but some guidance in ANY of the supported
languages would be appreciated.

Thanks in advance,
Diego

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To post to this group, send email to pandoc-discuss-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/44ec194d-4a3a-49f3-a802-35e74cab4344%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


* Re: Needing to write a filter to transform malformed headings
From: BP Jonsson @ 2015-06-20  0:52 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

It seems your original data tries to put headings inside list items. I
don't know if you can do that in Word (which I haven't used for years),
but you can't do that in pandoc.

If those 'headings' do have a style in the original docx, you might try
exporting the docx to (X)HTML and processing it with my Perl script at
<https://github.com/bpj/change-html-attrs>, turning those 'headings' into
divs or spans with an appropriate class before importing the modified
HTML into pandoc. There you could use a filter to inject raw LaTeX to
format them appropriately when generating PDF.

Be aware that I've never used that script or its predecessor on (X)HTML
exported by Word, only on XHTML exported by LibreOffice, albeit from docx
and doc originals. I have made an addition to that script, not yet
pushed, that lets it use plugins written in Perl, which expands its
capabilities considerably. I could also write a custom Perl script to
modify all HTML headings inside list items rather easily with the help
of the HTML::Tree module.

The problem is that once you have read the data into pandoc, the
'headings' will just be another phrase with strong emphasis, so it may
be necessary to go via HTML.

* Re: Needing to write a filter to transform malformed headings
From: Diego Algorta @ 2015-06-20 16:48 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

The original docx file just uses bold text in the first line of each
numbered list. The exception is one list (not included in the sample I
shared before) that does have a Heading 2, which pandoc recognizes
perfectly fine.

Pandoc does seem to support headings inside lists. I crafted this example:
https://gist.github.com/oboxodo/cfb773ee0c50fdab132e

There you have an original markdown file and the resulting HTML converted
with pandoc. Incidentally, pandoc produces the exact same markdown file if
I convert back from the HTML.

So right now, my biggest problems are:
1. How to split a Para by the LineBreaks in it into a bunch of Paras with
no LineBreaks.
2. How to detect the first Para in an OrderedList, remove formatting from
it (Strong for example) and convert it into a Heading.

I have the example from https://github.com/mvhenderson/pandoc-filter-node
working and I've made some progress changing other things. What I haven't
managed to do, for example, is convert an Inline element into a Block
element.
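
On the Inline-to-Block question: a Para holds a list of inlines, so putting a
block where an inline is expected yields an invalid AST. One way around it is
to make the change one level up, on the block that contains the inline. The
Python sketch below does that for the second problem. It assumes the JSON
shapes listed in its comments and that list items start with a Para (tight
lists may use Plain instead). The function names and the default level of 2
are hypothetical choices for illustration.

```python
# Sketch only. Assumed JSON AST shapes (not taken from the thread):
#   OrderedList: {"t": "OrderedList", "c": [list_attrs, [item, item, ...]]}
#                where each item is a list of blocks
#   Header:      {"t": "Header", "c": [level, [id, classes, keyvals], inlines]}
#   Strong:      {"t": "Strong", "c": [inlines]}

def strip_strong(inlines):
    """Unwrap top-level Strong inlines, keeping their contents."""
    result = []
    for inline in inlines:
        if inline.get("t") == "Strong":
            result.extend(strip_strong(inline["c"]))
        else:
            result.append(inline)
    return result


def promote_first_para(item, level=2):
    """Replace a list item's first block with a Header if it is a Para."""
    if not item or item[0].get("t") != "Para":
        return item
    header = {
        "t": "Header",
        "c": [level, ["", [], []], strip_strong(item[0]["c"])],
    }
    return [header] + item[1:]


def fix_ordered_list(block, level=2):
    """Apply promote_first_para to every item of an OrderedList block."""
    if block.get("t") != "OrderedList":
        return block
    attrs, items = block["c"]
    fixed = [promote_first_para(item, level) for item in items]
    return {"t": "OrderedList", "c": [attrs, fixed]}
```

The Header is built with an empty identifier and no classes. Other formatting
such as Emph is left in place, since strip_strong only unwraps Strong.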

Any guidance is appreciated.
Diego


-- 
Diego Algorta : ob.oxo.do


* Re: Needing to write a filter to transform malformed headings
From: Matthew Pickering @ 2015-06-20 17:24 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

Hello Diego,

If you are not averse to Haskell then please try this filter.

https://gist.github.com/mpickering/4e3fa135d9e5a908688d

Matt


* Re: Needing to write a filter to transform malformed headings
From: Diego Algorta @ 2015-06-20 21:15 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

/me starts learning Haskell...


-- 
Diego Algorta : ob.oxo.do


* Re: Needing to write a filter to transform malformed headings
From: Diego Algorta @ 2015-06-22 21:14 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw


Replying to myself as a status update...

So I managed to write a few working filters. Baby steps, but at least 
working ones. I'm sure these are a bit brittle though.

Here they are, in case they're useful to anyone else as an example of
JavaScript/Node-based filters:
https://gist.github.com/oboxodo/ddfbd07aa616762e4e01

Thanks,
Diego


* Re: Needing to write a filter to transform malformed headings
From: Diego Algorta @ 2015-06-22 21:18 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw


Matthew, thanks for your filter. I wasn't able to run it as I couldn't get
a Haskell environment properly configured. I confess I might just not have
dedicated enough time to making it work. I did read your filter and learnt
some Haskell on the side, though. So thanks for that!

In any case, I think I made some progress writing my own filters (I posted
them in a separate reply to my original message).

Thank you.

On Saturday, June 20, 2015 at 6:16:03 PM UTC-3, Diego Algorta wrote:
>
> /me starts learning Haskell...
>
> On Sat, Jun 20, 2015 at 2:24 PM, Matthew Pickering <
> matthewtpickering-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>
>> Hello Diego,
>>
>> If you are not averse to Haskell then please try this filter.
>>
>> https://gist.github.com/mpickering/4e3fa135d9e5a908688d
>>
>> Matt
>>
>> On Sat, Jun 20, 2015 at 5:48 PM, Diego Algorta <diego-X9ybnTC+vipBDgjK7y7TUQ@public.gmane.org> wrote:
>> > The original docx file is just using bold text in the first line of each
>> > numbered list. Except in one of them (not included in the sample I 
>> shared
>> > before) where it does have a Heading 2 and pandoc recognizes it 
>> perfectly
>> > fine.
>> >
>> > Pandoc does seem to support headings inside lists. I crafted this 
>> example:
>> > https://gist.github.com/oboxodo/cfb773ee0c50fdab132e
>> >
>> > There you have an original markdown file and a resulting html converted 
>> with
>> > pandoc. Anecdotally, I pandoc produces the exact same markdown file if I
>> > convert back from the html.
>> >
>> > So right now, my biggest problems are:
>> > 1. How to split a Para by the LineBreaks in it into a bunch of Paras 
>> with no
>> > LineBreaks.
>> > 2. How to detect the first Para in an OrderedList, remove formatting 
>> from it
>> > (Strong for example) and convert it into a Heading.
>> >
>> > I have the example from 
>> https://github.com/mvhenderson/pandoc-filter-node
>> > working and I've made some progress changing other stuff. What I haven't
>> > been successful with is converting an Inline elemento into a Block 
>> element
>> > for example.
>> >
>> > Any guiding is appreciated.
>> > Diego
>> >
>> > On Fri, Jun 19, 2015 at 9:52 PM, BP Jonsson <bpjonsson-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> >>
>> >> It seems your original data tries to put headings inside list items. I
>> >> don't know if you can do that in Word (which I haven't used for years),
>> >> but you can't do that in pandoc.
>> >> If those 'headings' do have a style in the original docx you might try to
>> >> export the docx to (X)HTML and process it with my perl script at
>> >> <https://github.com/bpj/change-html-attrs>, turning those 'headings' into
>> >> divs or spans with an appropriate class, before importing the modified HTML
>> >> into pandoc, where you could use a filter to inject raw LaTeX to format them
>> >> appropriately when generating PDF. Be aware that I've never used that script
>> >> or its predecessor on (X)HTML exported by Word, but only on XHTML exported
>> >> by LibreOffice, albeit based on docx and doc originals. I have made a
>> >> not-yet-pushed addition to that script so that it can use plugins written in
>> >> perl, which expands its capabilities considerably. I can also write a custom
>> >> perl script to modify all HTML headings inside list items rather easily with
>> >> the help of the HTML::Tree module. The problem is that once you have read
>> >> the data into pandoc they will just be another phrase with strong emphasis,
>> >> so it may be necessary to go via HTML.
>> >>
>> >> Den 18 jun 2015 22:50 skrev "Diego Algorta" <diego-X9ybnTC+vipBDgjK7y7TUQ@public.gmane.org>:
>> >>>
>> >>> Hi everyone,
>> >>>
>> >>> I learned about pandoc this week and it's great. thanks for such a great
>> >>> project.
>> >>>
>> >>> I need to process a bunch of docx files with a specific outline format,
>> >>> but I'll need to transform the AST a bit to fix some inconsistencies before
>> >>> generating the final formats (HTML, docx, markdown and PDF). Inconsistencies
>> >>> are regarding wrong setting of heading styles and hard line breaks.
>> >>>
>> >>> I laid out the problem in a gist for better formatting. Would you please
>> >>> take a look at it?
>> >>> https://gist.github.com/oboxodo/4948ac8e79aa952df8bc
>> >>>
>> >>> Most of my experience is in ruby and js so I expect to write my filter
>> >>> with the node port, but some guidance in ANY of the supported languages
>> >>> would be appreciated.
>> >>>
>> >>> Thanks in advance,
>> >>> Diego
>> >>>
>> >>
>> >
>> >
>> >
>> >
>> > --
>> > Diego Algorta : ob.oxo.do
>> >
>>
>>
>
>
>
> -- 
> Diego Algorta : ob.oxo.do
>  


[-- Attachment #1.2: Type: text/html, Size: 19963 bytes --]


end of thread, other threads:[~2015-06-22 21:18 UTC | newest]

Thread overview: 7+ messages
2015-06-18 20:50 Needing to write a filter to transform malformed headings Diego Algorta
     [not found] ` <44ec194d-4a3a-49f3-a802-35e74cab4344-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
2015-06-20  0:52   ` BP Jonsson
     [not found]     ` <CAFC_yuRHBgsw2UQ8mVqT5jXyJEmWC4Yv8=jhHhbDw3_imsF+jA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-06-20 16:48       ` Diego Algorta
     [not found]         ` <CAPfjgv+r9rbRSg2k_G-Ff--4NtG-MmUoY4N=MdrRifLkeqN97Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-06-20 17:24           ` Matthew Pickering
     [not found]             ` <CALuQ0m91pdYSv89xC6OsMQo+a=Yj1zthWyuqv__ApPqE0ci8ng-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-06-20 21:15               ` Diego Algorta
     [not found]                 ` <CAPfjgvJ7sKpKNy2BScsxgD_hZ2ASCn2s7+-q=dHagZD0b5h7PQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-06-22 21:18                   ` Diego Algorta
2015-06-22 21:14   ` Diego Algorta
