Follow-up question: I just spotted another piece of markup that has been preserved, one that I don't recognize (and don't want; I only need bold/italics). Here's a snippet of the RTF converted to markdown:

...radially outwards. [Eyes]{#_Hlk64921583} black, set in a pale...

I'm guessing that's some kind of RTF attribute/tag? I presume it can be stripped as easily as the underlines but I have no idea what it is in the first place. I suspect the answer is found somewhere in

https://pandoc.org/lua-filters.html

but I'm not sure where to look.

Kris

On Wednesday, April 27, 2022 at 2:38:07 PM UTC-4 Kris Wilk wrote:
Thanks, the script to strip the underlines is exactly what I needed. As for the spaces in the bolds/italics, I'll add an issue for that. Obviously, this is not pandoc's fault but if you have a workaround that can be ported to the RTF reader at some point, that would be super.

In the meantime, I guess I'll investigate doing the job myself in Lua. Even if it takes me a couple of days to figure out, it'll be faster than processing every file manually!

Kris

On Wednesday, April 27, 2022 at 2:28:20 PM UTC-4 John MacFarlane wrote:

The issue with bold is probably because the RTF file includes
some spaces inside the boldface emphasis. That is depressingly
common in word processing documents, and we have code in the docx
reader, if I recall, that handles it by converting

<b>helloSPACE</b>
to
<b>hello</b>SPACE

We could port this over to the RTF reader, I think -- can you
put up an issue on the tracker so we don't forget?

The other issue can be handled using a simple Lua filter.
Save it as ununderline.lua and use -L ununderline.lua on
the command line:

-- Replace each Underline element with its contents, dropping the underline.
function Underline(el)
  return el.content
end

You could probably handle the spacing issue with a more complex
Lua filter, as well.
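
As a rough sketch of what such a filter might look like (not tested against your files): the one below moves Space elements at the start or end of a Strong or Emph out of the emphasis, mirroring the helloSPACE example above. The helper names trim_spaces and hoist_spaces and the file name fixspaces.lua are made up for illustration, and it assumes the stray whitespace shows up as Space elements at the edges of the emphasized content.

```lua
-- fixspaces.lua (hypothetical name): hoist leading/trailing spaces
-- out of Strong and Emph elements.

-- Remove Space elements from both ends of an inline list (in place)
-- and return them as two separate lists.
local function trim_spaces(content)
  local leading, trailing = {}, {}
  while #content > 0 and content[1].t == "Space" do
    table.insert(leading, table.remove(content, 1))
  end
  while #content > 0 and content[#content].t == "Space" do
    table.insert(trailing, 1, table.remove(content))
  end
  return leading, trailing
end

-- Return the removed leading spaces, the trimmed element (if anything
-- is left in it), and the removed trailing spaces, in that order.
local function hoist_spaces(el)
  local content = el.content
  local leading, trailing = trim_spaces(content)
  local result = {}
  for _, s in ipairs(leading) do
    result[#result + 1] = s
  end
  if #content > 0 then
    el.content = content
    result[#result + 1] = el
  end
  for _, s in ipairs(trailing) do
    result[#result + 1] = s
  end
  return result
end

-- Apply the same treatment to bold and italic.
Strong = hoist_spaces
Emph = hoist_spaces
```

Saved as, say, fixspaces.lua, it could be combined with the first filter on the command line: pandoc in.rtf -L ununderline.lua -L fixspaces.lua -o out.md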

Kris Wilk writes:

> Sorry if anyone gets this twice, had to correct my formatting...
>
> I'm trying to use pandoc (for the first time) to convert some RTF files to
> markdown. My goal is to extract the text with **bold** and *italics*
> preserved and no other formatting.
>
> Simply converting with "pandoc in.rtf -o out.md" produces a markdown file
> that's not quite what I need. For instance, here's a line from the output:
>
> **[Scientific Name]{.underline}: ***Aplysia parvula *Morch, 1863
>
> FIRST and foremost, pandoc tries to preserve the underlined text, which I
> don't want. Can this be disabled? I've tried the "bracketed_spans" and
> "native_spans" extensions but this still processes the underlines as:
>
> **<u>Scientific Name</u>: ***Aplysia parvula *Morch, 1863
>
> SECOND, at least when I view this in VSCode's markdown preview, the bold
> and emphasis are not presented correctly, I guess because they touch each
> other or have spaces (or both?)? It displays correctly if it's:
>
> **Scientific Name:** *Aplysia parvula* Morch, 1863
>
> I realize that the text in the RTF might have the bold/italic tagged
> weirdly but is there a way to deal with this or am I just stuck? I have
> about 500 such files to process, so I'm looking for automated methods.
>
> Thanks in advance for any help you can provide!
>
> --
> You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discus...-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org.
> To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/aecd40a2-09db-4e1b-96ad-752973375e0cn%40googlegroups.com.
