public inbox archive for pandoc-discuss@googlegroups.com
From: John MacFarlane <jgm-TVLZxgkOlNX2fBVCVOL8/A@public.gmane.org>
To: Kris Wilk <kris-AwXHIjbJCMCw5LPnMra/2Q@public.gmane.org>,
	pandoc-discuss
	<pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
Subject: Re: RTF to Markdown questions
Date: Wed, 27 Apr 2022 11:28:14 -0700
Message-ID: <m27d7aqt35.fsf@MacBook-Pro-2.hsd1.ca.comcast.net>
In-Reply-To: <aecd40a2-09db-4e1b-96ad-752973375e0cn-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>


The issue with bold probably arises because the RTF file includes
some spaces inside the boldface emphasis.  That is depressingly
common in word processing documents, and we have code in the docx
reader, if I recall, that handles it by converting

<b>helloSPACE</b>
to
<b>hello</b>SPACE

We could port this over to the RTF reader, I think -- can you
put up an issue on the tracker so we don't forget?

The other issue (the unwanted underlining) can be handled using a
simple Lua filter.  Save it as ununderline.lua and use
-L ununderline.lua on the command line:

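-- Replace each Underline element with its contents,
-- which drops the underline but keeps the text.
function Underline(el)
  return el.content
end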

You could probably handle the spacing issue with a more complex
Lua filter, as well.
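
A minimal sketch of what such a filter might look like, assuming the
RTF reader emits the stray whitespace as Space (or SoftBreak) elements
at the start or end of the Strong and Emph content.  The helper names
(is_space, strip_edges, hoist_spaces) are made up for illustration,
and this is not the code the docx reader uses:

```lua
-- True for inline elements that are pure whitespace.
local function is_space(el)
  return el.t == "Space" or el.t == "SoftBreak"
end

-- Remove whitespace elements from both ends of `inlines` (in place)
-- and return them as two lists: leading, trailing.
local function strip_edges(inlines)
  local leading, trailing = {}, {}
  while #inlines > 0 and is_space(inlines[1]) do
    table.insert(leading, table.remove(inlines, 1))
  end
  while #inlines > 0 and is_space(inlines[#inlines]) do
    table.insert(trailing, 1, table.remove(inlines))
  end
  return leading, trailing
end

-- Replace the element with: leading spaces, the trimmed element,
-- trailing spaces.  An element left with no content is dropped.
local function hoist_spaces(el)
  local content = el.content
  local leading, trailing = strip_edges(content)
  el.content = content  -- assign back in case `content` was a copy
  local result = leading
  if #content > 0 then
    table.insert(result, el)
  end
  for _, sp in ipairs(trailing) do
    table.insert(result, sp)
  end
  return result
end

-- Apply the same treatment to bold and italic text.
Strong = hoist_spaces
Emph = hoist_spaces
```

This could go in the same file as the Underline function above, or in
a file of its own passed with a second -L option.  It only looks at
the outer edges of each Strong or Emph, so whitespace wrapped in some
other element would be left where it is.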

Kris Wilk <kris-AwXHIjbJCMCw5LPnMra/2Q@public.gmane.org> writes:

> Sorry if anyone gets this twice, had to correct my formatting...
>
> I'm trying to use pandoc (for the first time) to convert some RTF files to 
> markdown. My goal is to extract the text with ***bold*** and **italics** 
> preserved and no other formatting.
>
> Simply converting with "pandoc in.rtf -o out.md" produces a markdown file 
> that's not quite what I need. For instance, here's a line from the output:
>
> **[Scientific Name]{.underline}: ***Aplysia parvula *Morch, 1863
>
> FIRST and foremost, pandoc tries to preserve the underlined text, which I 
> don't want. Can this be disabled? I've tried the "bracketed_spans" and
> "native_spans" extensions but this still processes the underlines as:
>
> **<u>Scientific Name</u>: ***Aplysia parvula *Morch, 1863
>
> SECOND, at least when I view this in VSCode's markdown preview, the bold 
> and emphasis are not presented correctly, I guess because they touch each 
> other or have spaces (or both?)? It displays correctly if it's:
>
> **Scientific Name:** *Aplysia parvula* Morch, 1863
>
> I realize that the text in the RTF might have the bold/italic tagged 
> weirdly but is there a way to deal with this or am I just stuck? I have 
> about 500 such files to process, so I'm looking for automated methods.
>
> Thanks in advance for any help you can provide!

