public inbox archive for pandoc-discuss@googlegroups.com
* RTF to Markdown questions
From: Kris Wilk @ 2022-04-27 18:02 UTC (permalink / raw)
  To: pandoc-discuss


Sorry if anyone gets this twice, had to correct my formatting...

I'm trying to use pandoc (for the first time) to convert some RTF files to
Markdown. My goal is to extract the text with **bold** and *italics*
preserved and no other formatting.

Simply converting with "pandoc in.rtf -o out.md" produces a markdown file 
that's not quite what I need. For instance, here's a line from the output:

**[Scientific Name]{.underline}: ***Aplysia parvula *Morch, 1863

FIRST and foremost, pandoc tries to preserve the underlined text, which I
don't want. Can this be disabled? I've tried the "bracketed_spans" and
"native_spans" extensions, but pandoc still keeps the underline, writing it as:

**<u>Scientific Name</u>: ***Aplysia parvula *Morch, 1863

SECOND, at least in VSCode's Markdown preview, the bold and emphasis are
not rendered correctly, I guess because the delimiters touch each other or
have spaces inside them (or both?). It displays correctly if it's:

**Scientific Name:** *Aplysia parvula* Morch, 1863

I realize that the text in the RTF might have the bold/italic tagged 
weirdly but is there a way to deal with this or am I just stuck? I have 
about 500 such files to process, so I'm looking for automated methods.

Thanks in advance for any help you can provide!

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/aecd40a2-09db-4e1b-96ad-752973375e0cn%40googlegroups.com.


* Re: RTF to Markdown questions
From: John MacFarlane @ 2022-04-27 18:28 UTC (permalink / raw)
  To: Kris Wilk, pandoc-discuss


The issue with bold is probably because the RTF file includes
some spaces inside the boldface emphasis.  That is depressingly
common in word processing documents, and we have code in the docx
reader, if I recall, that handles it by converting

<b>helloSPACE</b>
to
<b>hello</b>SPACE

We could port this over to the RTF reader, I think -- can you
put up an issue on the tracker so we don't forget?

The other issue can be handled using a simple Lua filter.
Save it as ununderline.lua and use -L ununderline.lua on
the command line:

-- Replace each Underline element with its contents.
function Underline(el)
  return el.content
end

You could probably handle the spacing issue with a more complex
Lua filter, as well.
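
A minimal sketch of such a spacing filter follows. It is not the docx
reader's code; it assumes the stray whitespace arrives as separate Space
or SoftBreak elements at the edges of Strong and Emph, and the names
trim-emph.lua, is_blank and hoist_spaces are made up for illustration.

```lua
-- trim-emph.lua (hypothetical name): move whitespace found at the edges of
-- Strong/Emph outside the element, so that "**Name: **" can be written
-- as "**Name:** ".
-- Assumption: the reader emits edge whitespace as Space/SoftBreak elements.

local function is_blank(inline)
  return inline.t == 'Space' or inline.t == 'SoftBreak'
end

local function hoist_spaces(el)
  local content = el.content
  local before, after = {}, {}
  -- Peel whitespace off the front, preserving its order.
  while #content > 0 and is_blank(content[1]) do
    table.insert(before, table.remove(content, 1))
  end
  -- Peel whitespace off the back, preserving its order.
  while #content > 0 and is_blank(content[#content]) do
    table.insert(after, 1, table.remove(content))
  end
  if #before == 0 and #after == 0 then
    return nil  -- nothing to move; pandoc keeps the element unchanged
  end
  local result = {}
  for _, inline in ipairs(before) do table.insert(result, inline) end
  -- Keep the element only if something other than whitespace remains.
  if #content > 0 then
    el.content = content
    table.insert(result, el)
  end
  for _, inline in ipairs(after) do table.insert(result, inline) end
  return result
end

Strong = hoist_spaces
Emph = hoist_spaces
```

Filters can be chained, for example
pandoc in.rtf -L ununderline.lua -L trim-emph.lua -o out.md
and they run in the order given.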


* Re: RTF to Markdown questions
From: Kris Wilk @ 2022-04-27 18:38 UTC (permalink / raw)
  To: pandoc-discuss


Thanks, the script to strip the underlines is exactly what I needed. As for 
the spaces in the bolds/italics, I'll add an issue for that. Obviously, 
this is not pandoc's fault but if you have a workaround that can be ported 
to the RTF reader at some point, that would be super.

In the meantime, I guess I'll investigate doing the job myself in Lua. Even
if it takes me a couple of days to figure out, it'll be faster than
processing every file manually!

Kris


* Re: RTF to Markdown questions
From: Kris Wilk @ 2022-04-27 19:06 UTC (permalink / raw)
  To: pandoc-discuss


Follow-up question: I just spotted another piece of markup that has been
preserved, which I don't recognize and don't want (I only want
bold/italics). Here's a snippet of the RTF converted to Markdown:

...radially outwards. [Eyes]{#_Hlk64921583} black, set in a pale...

I'm guessing that's some kind of RTF attribute/tag? I presume it can be 
stripped as easily as the underlines but I have no idea what it is in the 
first place. I suspect the answer is found somewhere in

https://pandoc.org/lua-filters.html

but I'm not sure where to look.

Kris


* Re: RTF to Markdown questions
From: John MacFarlane @ 2022-04-27 20:23 UTC (permalink / raw)
  To: Kris Wilk, pandoc-discuss


In pandoc's AST, that is a Span with an id.
So your filter will need to match on Span instead of Underline,
but is otherwise the same as the last one.
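
For concreteness, the Span version could look like the sketch below. The
file name unspan.lua is only a suggestion. Note that it unwraps every
Span, not just those carrying an id such as _Hlk64921583.

```lua
-- unspan.lua (hypothetical name): replace each Span with its contents,
-- which drops the identifier, classes and attributes attached to it.
function Span(el)
  return el.content
end
```

It is used the same way as the underline filter: -L unspan.lua.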


* Re: RTF to Markdown questions
From: Bastien DUMONT @ 2022-04-27 20:56 UTC (permalink / raw)
  To: pandoc-discuss

Is there no cross-reference to that id in your RTF file? If there is, you may want to keep it.


* Re: RTF to Markdown questions
From: Kris Wilk @ 2022-04-27 23:51 UTC (permalink / raw)
  To: pandoc-discuss


In this case, no, any such references/markup other than bold/italic are 
spurious and must be removed. The aim is to reduce the content to plain 
text with the simplest markup only, for ingestion into a database that will 
be used by a web app.

Kris

On Wednesday, April 27, 2022 at 4:56:22 PM UTC-4 Bastien Dumont wrote:

> Don't you have any cross-reference to it in your RTF file? If you have, 
> you may want to keep it.
>
> Le Wednesday 27 April 2022 à 01:23:47PM, John MacFarlane a écrit :
> > 
> > In pandoc's AST, that is a Span with an id.
> > So your filter will need to match on Span instead of underline,
> > but otherwise same as the last one.
> > 
> > Kris Wilk <kr...-AwXHIjbJCMCw5LPnMra/2Q@public.gmane.org> writes:
> > 
> > > Follow-up question: I just spotted another markup that has been 
> preserved 
> > > that I don't recognize (and don't want...only bold/italics). Here's a 
> > > snippet of converted RTF to markdown:
> > >
> > > ...radially outwards. [Eyes]{#_Hlk64921583} black, set in a pale...
> > >
> > > I'm guessing that's some kind of RTF attribute/tag? I presume it can 
> be 
> > > stripped as easily as the underlines but I have no idea what it is in 
> the 
> > > first place. I suspect the answer is found somewhere in
> > >
> > > https://pandoc.org/lua-filters.html
> > >
> > > but I'm not sure where to look.
> > >
> > > Kris
> > >
> > > On Wednesday, April 27, 2022 at 2:38:07 PM UTC-4 Kris Wilk wrote:
> > >
> > >> Thanks, the script to strip the underlines is exactly what I needed. 
> As 
> > >> for the spaces in the bolds/italics, I'll add an issue for that. 
> Obviously, 
> > >> this is not pandoc's fault but if you have a workaround that can be 
> ported 
> > >> to the RTF reader at some point, that would be super.
> > >>
> > >> In the meantime, I guess I'll investigate doing the job myself in 
> Lua. 
> > >> Even if it takes me a couple of days to figure out it'll be faster 
> than 
> > >> processing every file manually!
> > >>
> > >> Kris
> > >>
> > >> On Wednesday, April 27, 2022 at 2:28:20 PM UTC-4 John MacFarlane 
> wrote:
> > >>
> > >>>
> > >>> The issue with bold is probably because the RTF file includes
> > >>> some spaces inside the boldface emphasis. That is depressingly
> > >>> common in word processing documents, and we have code in the docx
> > >>> reader, if I recall, that handles it by converting
> > >>>
> > >>> <b>helloSPACE</b>
> > >>> to
> > >>> <b>hello</b>SPACE
> > >>>
> > >>> We could port this over to the RTF reader, I think -- can you
> > >>> put up an issue on the tracker so we don't forget?
> > >>>
> > >>> The other issue can be handled using a simple Lua filter.
> > >>> Save it as ununderline.lua and use -L ununderline.lua on
> > >>> the command line:
> > >>>
> > >>> function Underline(el)
> > >>> return el.content
> > >>> end
> > >>>
> > >>> You could probably handle the spacing issue with a more complex
> > >>> Lua filter, as well.
> > >>>
> > >>> Kris Wilk <kr...-AwXHIjbJCMCw5LPnMra/2Q@public.gmane.org> writes:
> > >>>
> > >>> > Sorry if anyone gets this twice, had to correct my formatting...
> > >>> >
> > >>> > I'm trying to use pandoc (for the first time) to convert some RTF 
> files 
> > >>> to 
> > >>> > markdown. My goal is to extract the text with ***bold*** and 
> > >>> **italics** 
> > >>> > preserved and no other formatting.
> > >>> >
> > >>> > Simply converting with "pandoc in.rtf -o out.md" produces a 
> markdown 
> > >>> file 
> > >>> > that's not quite what I need. For instance, here's a line from the 
> > >>> output:
> > >>> >
> > >>> > **[Scientific Name]{.underline}: ***Aplysia parvula *Morch, 1863
> > >>> >
> > >>> > FIRST and foremost, pandoc tries to preserve the underlined text, 
> which 
> > >>> I 
> > >>> > don't want. Can this be disabled? I've tried the "bracketed_spans" 
> and "
> > >>> > native_spans" extensions but this still processes the underlines 
> as:
> > >>> >
> > >>> > **<u>Scientific Name</u>: ***Aplysia parvula *Morch, 1863
> > >>> >
> > >>> > SECOND, at least when I view this in VSCode's markdown preview, 
> the 
> > >>> bold 
> > >>> > and emphasis are not presented correctly, I guess because they 
> touch 
> > >>> each 
> > >>> > other or have spaces (or both?)? It displays correctly if it's:
> > >>> >
> > >>> > **Scientific Name:** *Aplysia parvula* Morch, 1863
> > >>> >
> > >>> > I realize that the text in the RTF might have the bold/italic 
> tagged 
> > >>> > weirdly but is there a way to deal with this or am I just stuck? I 
> have 
> > >>> > about 500 such files to process, so I'm looking for automated 
> methods.
> > >>> >
> > >>> > Thanks in advance for any help you can provide!
> > >>> >
> > >>> > -- 
> > >>> > You received this message because you are subscribed to the Google 
> > >>> Groups "pandoc-discuss" group.
> > >>> > To unsubscribe from this group and stop receiving emails from it, 
> send 
> > >>> an email to pandoc-discus...-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
> > >>> > To view this discussion on the web visit 
> > >>> 
> https://groups.google.com/d/msgid/pandoc-discuss/aecd40a2-09db-4e1b-96ad-752973375e0cn%40googlegroups.com
> > >>> .
> > >>>
> > >>
> > >
> > > -- 
> > > You received this message because you are subscribed to the Google 
> Groups "pandoc-discuss" group.
> > > To unsubscribe from this group and stop receiving emails from it, send 
> an email to pandoc-discus...-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
> > > To view this discussion on the web visit 
> https://groups.google.com/d/msgid/pandoc-discuss/dec1524b-c96f-4090-be4c-6a2509879d59n%40googlegroups.com
> .
> > 
> > -- 
> > You received this message because you are subscribed to the Google 
> Groups "pandoc-discuss" group.
> > To unsubscribe from this group and stop receiving emails from it, send 
> an email to pandoc-discus...-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
> > To view this discussion on the web visit 
> https://groups.google.com/d/msgid/pandoc-discuss/yh480kbkwmtgvg.fsf%40johnmacfarlane.net
> .
>



* Re: RTF to Markdown questions
       [not found]                 ` <yh480kbkwmtgvg.fsf-pgq/RBwaQ+zq8tPRBa0AtqxOck334EZe@public.gmane.org>
  2022-04-27 20:56                   ` Bastien DUMONT
@ 2022-04-27 23:51                   ` Kris Wilk
  1 sibling, 0 replies; 9+ messages in thread
From: Kris Wilk @ 2022-04-27 23:51 UTC (permalink / raw)
  To: pandoc-discuss


[-- Attachment #1.1: Type: text/plain, Size: 5373 bytes --]

Thanks, that did the trick!

Kris
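
For readers skimming the thread, the combined filter implied by the replies quoted below can be sketched as follows. This is a reconstruction, not the file actually used: the filename strip-markup.lua is made up for illustration, and the sketch assumes every Underline and Span should be unwrapped, not only the ones shown in the thread.

```lua
-- strip-markup.lua (hypothetical filename)

-- Replace each Underline element with its contents,
-- so the text is kept but the underline is dropped.
function Underline(el)
  return el.content
end

-- Replace each Span with its contents. This discards the span's
-- attributes, e.g. the {#_Hlk64921583} identifier seen above.
-- Assumption: no Span in the input carries markup worth keeping.
function Span(el)
  return el.content
end
```

It would be run the same way as the earlier filter, e.g. "pandoc in.rtf -L strip-markup.lua -o out.md". Bold and italics are Strong and Emph elements in the AST, so they should pass through untouched.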

On Wednesday, April 27, 2022 at 4:23:55 PM UTC-4 John MacFarlane wrote:

>
> In pandoc's AST, that is a Span with an id.
> So your filter will need to match on Span instead of underline,
> but otherwise same as the last one.
>
> Kris Wilk <kr...-AwXHIjbJCMCw5LPnMra/2Q@public.gmane.org> writes:
>
> > Follow-up question: I just spotted another markup that has been preserved
> > that I don't recognize (and don't want...only bold/italics). Here's a
> > snippet of converted RTF to markdown:
> >
> > ...radially outwards. [Eyes]{#_Hlk64921583} black, set in a pale...
> >
> > I'm guessing that's some kind of RTF attribute/tag? I presume it can be
> > stripped as easily as the underlines but I have no idea what it is in the
> > first place. I suspect the answer is found somewhere in
> >
> > https://pandoc.org/lua-filters.html
> >
> > but I'm not sure where to look.
> >
> > Kris
> >
> > On Wednesday, April 27, 2022 at 2:38:07 PM UTC-4 Kris Wilk wrote:
> >
> >> Thanks, the script to strip the underlines is exactly what I needed. As
> >> for the spaces in the bolds/italics, I'll add an issue for that. Obviously,
> >> this is not pandoc's fault but if you have a workaround that can be ported
> >> to the RTF reader at some point, that would be super.
> >>
> >> In the meantime, I guess I'll investigate doing the job myself in Lua. 
> >> Even if it takes me a couple of days to figure out it'll be faster than 
> >> processing every file manually!
> >>
> >> Kris
> >>
> >> On Wednesday, April 27, 2022 at 2:28:20 PM UTC-4 John MacFarlane wrote:
> >>
> >>>
> >>> The issue with bold is probably because the RTF file includes
> >>> some spaces inside the boldface emphasis. That is depressingly
> >>> common in word processing documents, and we have code in the docx
> >>> reader, if I recall, that handles it by converting
> >>>
> >>> <b>helloSPACE</b>
> >>> to
> >>> <b>hello</b>SPACE
> >>>
> >>> We could port this over to the RTF reader, I think -- can you
> >>> put up an issue on the tracker so we don't forget?
> >>>
> >>> The other issue can be handled using a simple Lua filter.
> >>> Save it as ununderline.lua and use -L ununderline.lua on
> >>> the command line:
> >>>
> >>> function Underline(el)
> >>>   return el.content
> >>> end
> >>>
> >>> You could probably handle the spacing issue with a more complex
> >>> Lua filter, as well.



* Re: RTF to Markdown questions
       [not found]     ` <m27d7aqt35.fsf-jF64zX8BO0+FqBokazbCQ6OPv3vYUT2dxr7GGTnW70NeoWH0uzbU5w@public.gmane.org>
  2022-04-27 18:38       ` Kris Wilk
@ 2022-04-28  0:53       ` Kris Wilk
  1 sibling, 0 replies; 9+ messages in thread
From: Kris Wilk @ 2022-04-28  0:53 UTC (permalink / raw)
  To: pandoc-discuss


[-- Attachment #1.1: Type: text/plain, Size: 982 bytes --]

On Wednesday, April 27, 2022 at 2:28:20 PM UTC-4 John MacFarlane wrote:

> ...we have code in the docx reader, if I recall, that handles [spaces 
> inside boldface markup]...
>

John, I just wanted to thank you for mentioning the above tidbit, because 
it gave me a workaround...I first converted all my RTFs to DOCX, then to MD 
(using your other tips to strip the remaining cruft). Fixed the lousy 
markup/space issues perfectly.

I'll still put in an issue to suggest porting this to the RTF reader, but I 
was able to get the job done. 👍

Kris
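
For anyone who cannot take the DOCX detour, the "more complex Lua filter" John mentioned for the spacing issue might look roughly like the sketch below. It assumes the RTF reader emits the stray whitespace as separate Space elements at the edges of Strong and Emph (not verified against the reader's actual output), and hoist_spaces is a made-up helper name. It is not a port of the docx reader's code.

```lua
-- Move leading/trailing Space elements out of an emphasis element,
-- turning e.g. Emph[Str, Space] into Emph[Str] followed by Space.
local function hoist_spaces(el)
  local content = el.content
  local before, after = {}, {}

  -- Collect Space elements from the front of the content.
  while #content > 0 and content[1].t == "Space" do
    table.insert(before, table.remove(content, 1))
  end
  -- Collect Space elements from the end of the content.
  while #content > 0 and content[#content].t == "Space" do
    table.insert(after, table.remove(content))
  end

  -- Nothing to move: returning nil leaves the element unchanged.
  if #before == 0 and #after == 0 then
    return nil
  end

  -- Rebuild: leading spaces, the trimmed element, trailing spaces.
  local result = before
  if #content > 0 then
    el.content = content
    table.insert(result, el)
  end
  for _, sp in ipairs(after) do
    table.insert(result, sp)
  end
  return result
end

-- Apply the same treatment to bold and italic elements.
Strong = hoist_spaces
Emph = hoist_spaces
```

An element that contained nothing but spaces is dropped and only the spaces are kept. If the whitespace turns out to live inside a Str element instead, the filter would need to trim the text of the first and last Str as well.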



Thread overview: 9+ messages
2022-04-27 18:02 RTF to Markdown questions Kris Wilk
     [not found] ` <aecd40a2-09db-4e1b-96ad-752973375e0cn-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
2022-04-27 18:28   ` John MacFarlane
     [not found]     ` <m27d7aqt35.fsf-jF64zX8BO0+FqBokazbCQ6OPv3vYUT2dxr7GGTnW70NeoWH0uzbU5w@public.gmane.org>
2022-04-27 18:38       ` Kris Wilk
     [not found]         ` <bf4044b0-6746-4720-942f-53303a5cb296n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
2022-04-27 19:06           ` Kris Wilk
     [not found]             ` <dec1524b-c96f-4090-be4c-6a2509879d59n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
2022-04-27 20:23               ` John MacFarlane
     [not found]                 ` <yh480kbkwmtgvg.fsf-pgq/RBwaQ+zq8tPRBa0AtqxOck334EZe@public.gmane.org>
2022-04-27 20:56                   ` Bastien DUMONT
2022-04-27 23:51                     ` Kris Wilk
2022-04-27 23:51                   ` Kris Wilk
2022-04-28  0:53       ` Kris Wilk
