From mboxrd@z Thu Jan  1 00:00:00 1970
X-Msuck: nntp://news.gmane.io/gmane.text.pandoc/30492
Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail
From: Kris Wilk <kris-AwXHIjbJCMCw5LPnMra/2Q@public.gmane.org>
Newsgroups: gmane.text.pandoc
Subject: Re: RTF to Markdown questions
Date: Wed, 27 Apr 2022 11:38:07 -0700 (PDT)
Message-ID: <bf4044b0-6746-4720-942f-53303a5cb296n@googlegroups.com>
References: <aecd40a2-09db-4e1b-96ad-752973375e0cn@googlegroups.com>
 <m27d7aqt35.fsf@MacBook-Pro-2.hsd1.ca.comcast.net>
Reply-To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org
Mime-Version: 1.0
Content-Type: multipart/mixed; 
	boundary="----=_Part_772_419294079.1651084687424"
Injection-Info: ciao.gmane.io; posting-host="blaine.gmane.org:116.202.254.214";
	logging-data="40394"; mail-complaints-to="usenet@ciao.gmane.io"
To: pandoc-discuss <pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
Original-X-From: pandoc-discuss+bncBCC5D7WV5UIRBEE3U2JQMGQEUHNYANQ-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org Wed Apr 27 20:38:12 2022
Return-path: <pandoc-discuss+bncBCC5D7WV5UIRBEE3U2JQMGQEUHNYANQ-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
Envelope-to: gtp-pandoc-discuss@m.gmane-mx.org
Original-Received: from mail-oo1-f56.google.com ([209.85.161.56])
	by ciao.gmane.io with esmtps (TLS1.3:ECDHE_RSA_AES_128_GCM_SHA256:128)
	(Exim 4.92)
	(envelope-from <pandoc-discuss+bncBCC5D7WV5UIRBEE3U2JQMGQEUHNYANQ-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>)
	id 1njmYF-000ALk-Sx
	for gtp-pandoc-discuss@m.gmane-mx.org; Wed, 27 Apr 2022 20:38:11 +0200
Original-Received: by mail-oo1-f56.google.com with SMTP id m9-20020a4abc89000000b0035e964b0813sf1387527oop.16
        for <gtp-pandoc-discuss@m.gmane-mx.org>; Wed, 27 Apr 2022 11:38:11 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=googlegroups.com; s=20210112;
        h=sender:date:from:to:message-id:in-reply-to:references:subject
         :mime-version:x-original-sender:reply-to:precedence:mailing-list
         :list-id:list-post:list-help:list-archive:list-subscribe
         :list-unsubscribe;
        bh=F9FXhz8QpQ0JCPgNciq25aVKyayba/GkuM5X0ch6r6M=;
        b=Ks+VegI4r3zIyF1eeX0txwzvoSn1MSgwGSWgNxrBmFStdoxIlxCLhzIzQ1ptzVXMMP
         t78wM23KDtGiUiWI1Khv6Z+kWIyRf3CqoznruZeOUXt5dDkxskqz7dzTbTQXdIW6OBuy
         84fd4LHt4y5mG+maS4uhob2XIjC7XBzSOSBZo054DqZ4TssUphtus/MM6bCnV68zSrIS
         vg4KguwGznL50lCjRo7vuXgGrBAYp7B8wNMOdbRI5+GvHXVmWtD5mMKXUIBcVv+TuSKH
         nhu1vR3ZpZ3TcSqUF2oyAVN01b6D9+JS2Yg0sq+s5qL7N6x/jVf2eC0Y5av6tpzr8ARr
         z5MA==
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=reefnet.ca; s=google;
        h=date:from:to:message-id:in-reply-to:references:subject:mime-version
         :x-original-sender:reply-to:precedence:mailing-list:list-id
         :list-post:list-help:list-archive:list-subscribe:list-unsubscribe;
        bh=F9FXhz8QpQ0JCPgNciq25aVKyayba/GkuM5X0ch6r6M=;
        b=UDINnGALEEsOUpVQXEZkMSABmOJEigCZErugLcu7SLKt8iTsbdusnpv3wbF1b2d/Xa
         K0266+rgmgGNhAd0gwli7ocESx97CpdDB5OokrUeg5cHvjKs9TieMW1PVaxT2wQ7BVIF
         rcF3MQfJUO8rQtDM0kRdBS7E5FNaDGrb6fwic=
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20210112;
        h=sender:x-gm-message-state:date:from:to:message-id:in-reply-to
         :references:subject:mime-version:x-original-sender:reply-to
         :precedence:mailing-list:list-id:x-spam-checked-in-group:list-post
         :list-help:list-archive:list-subscribe:list-unsubscribe;
        bh=F9FXhz8QpQ0JCPgNciq25aVKyayba/GkuM5X0ch6r6M=;
        b=vaOU6ELZV5phTA6WsL9WhCzUtNbRwbm7mpAWFS7RBnc07n+Kh6aTXHqGbxCA9RIwtb
         d4S8ICtPTZxb5BGjowZFl87WJ8Nr8vAgQfZbrRN2Fz2QK+h8PWToHyrhOJP+z4fpBrNe
         jByj/7YWOgOHk93iO1MQGh4Ukg2TLzijKFUOluI/lSBOnRHSnhal/0EC5fbcsSc6J1tB
         3Gsg3IQTDsmZY+gff98adHB9ozJVnGno4wfPhSmnRzaZuyB2O14miW+M8TihmSseYsV2
         KzszgVLiAXf/8Gt5euJe+zzeyyoFYYGo5XEzFLBi/pnldsf4tDo0lL4jzy9nSUrIC7eV
         VWjg==
Original-Sender: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org
X-Gm-Message-State: AOAM531sM141CrC8Q2IpP6+zQLw6zULMkDnyZk3KNQT+/sVfe/jJAHAN
	I2nPxUHC5JCODa2BBxnbdNc=
X-Google-Smtp-Source: ABdhPJwU5reMmeh7CS3SPp49no5NhnD3E7eLhxOaO0HAc1cSQMKCdyrkBdpjoJKWN4D8bLCrwbD+Mw==
X-Received: by 2002:a05:6870:b28d:b0:e5:c7d7:4184 with SMTP id c13-20020a056870b28d00b000e5c7d74184mr16373692oao.30.1651084690710;
        Wed, 27 Apr 2022 11:38:10 -0700 (PDT)
X-BeenThere: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org
Original-Received: by 2002:a05:6830:4a6:b0:603:2255:3d83 with SMTP id
 l6-20020a05683004a600b0060322553d83ls3842774otd.7.gmail; Wed, 27 Apr 2022
 11:38:08 -0700 (PDT)
X-Received: by 2002:a9d:20e2:0:b0:5c9:2edb:af8e with SMTP id x89-20020a9d20e2000000b005c92edbaf8emr10854424ota.325.1651084688080;
        Wed, 27 Apr 2022 11:38:08 -0700 (PDT)
In-Reply-To: <m27d7aqt35.fsf-jF64zX8BO0+FqBokazbCQ6OPv3vYUT2dxr7GGTnW70NeoWH0uzbU5w@public.gmane.org>
X-Original-Sender: kris-AwXHIjbJCMCw5LPnMra/2Q@public.gmane.org
Precedence: list
Mailing-list: list pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org; contact pandoc-discuss+owners-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org
List-ID: <pandoc-discuss.googlegroups.com>
X-Google-Group-Id: 1007024079513
List-Post: <https://groups.google.com/group/pandoc-discuss/post>, <mailto:pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
List-Help: <https://groups.google.com/support/>, <mailto:pandoc-discuss+help-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
List-Archive: <https://groups.google.com/group/pandoc-discuss>
List-Subscribe: <https://groups.google.com/group/pandoc-discuss/subscribe>, <mailto:pandoc-discuss+subscribe-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
List-Unsubscribe: <mailto:googlegroups-manage+1007024079513+unsubscribe-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>,
 <https://groups.google.com/group/pandoc-discuss/subscribe>
Xref: news.gmane.io gmane.text.pandoc:30492
Archived-At: <http://permalink.gmane.org/gmane.text.pandoc/30492>

------=_Part_772_419294079.1651084687424
Content-Type: multipart/alternative; 
	boundary="----=_Part_773_2011062584.1651084687424"

------=_Part_773_2011062584.1651084687424
Content-Type: text/plain; charset="UTF-8"

Thanks, the script to strip the underlines is exactly what I needed. As for 
the spaces in the bold/italic runs, I'll open an issue for that. Obviously, 
this is not pandoc's fault, but if you have a workaround that can be ported 
to the RTF reader at some point, that would be super.

In the meantime, I guess I'll investigate doing the job myself in Lua. Even 
if it takes me a couple of days to figure out, it'll be faster than 
processing every file manually!
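
A minimal sketch of what such a filter might look like, untested against
the real RTF files. It assumes the only problem is Space (or SoftBreak)
elements sitting at the start or end of a bold or italic run; the helper
names is_space and hoist_spaces are made up for illustration:

```lua
-- Move leading/trailing spaces out of Strong and Emph elements, so that
--   Strong[hello, Space]  becomes  Strong[hello], Space

-- True if the inline is a Space or SoftBreak (nil-safe).
local function is_space(inline)
  return inline ~= nil and (inline.t == "Space" or inline.t == "SoftBreak")
end

local function hoist_spaces(el)
  local content = el.content
  local before, after = {}, {}

  -- Peel spaces off the front of the run.
  while is_space(content[1]) do
    table.insert(before, table.remove(content, 1))
  end
  -- Peel spaces off the end of the run, keeping their original order.
  while is_space(content[#content]) do
    table.insert(after, 1, table.remove(content))
  end
  el.content = content

  -- Rebuild the sequence: spaces, then the trimmed element, then spaces.
  local result = {}
  for _, s in ipairs(before) do table.insert(result, s) end
  if #content > 0 then table.insert(result, el) end  -- drop empty runs
  for _, s in ipairs(after) do table.insert(result, s) end
  return result
end

-- Pandoc calls these for every bold and italic element it walks over.
Strong = hoist_spaces
Emph = hoist_spaces
```

Saved under a name like trim-emphasis.lua (again, a made-up name), it
could be passed after the underline filter, e.g.
"pandoc in.rtf -o out.md -L ununderline.lua -L trim-emphasis.lua".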

Kris

On Wednesday, April 27, 2022 at 2:28:20 PM UTC-4 John MacFarlane wrote:

>
> The issue with bold is probably because the RTF file includes
> some spaces inside the boldface emphasis. That is depressingly
> common in word processing documents, and we have code in the docx
> reader, if I recall, that handles it by converting
>
> <b>helloSPACE</b>
> to
> <b>hello</b>SPACE
>
> We could port this over to the RTF reader, I think -- can you
> put up an issue on the tracker so we don't forget?
>
> The other issue can be handled using a simple Lua filter.
> Save it as ununderline.lua and use -L ununderline.lua on
> the command line:
>
> -- Replace each Underline element with its contents (drops the underline).
> function Underline(el)
>   return el.content
> end
>
> You could probably handle the spacing issue with a more complex
> Lua filter, as well.
>
> Kris Wilk <kr...-AwXHIjbJCMCw5LPnMra/2Q@public.gmane.org> writes:
>
> > Sorry if anyone gets this twice, had to correct my formatting...
> >
> > I'm trying to use pandoc (for the first time) to convert some RTF files
> > to markdown. My goal is to extract the text with ***bold*** and
> > **italics** preserved and no other formatting.
> >
> > Simply converting with "pandoc in.rtf -o out.md" produces a markdown
> > file that's not quite what I need. For instance, here's a line from the
> > output:
> >
> > **[Scientific Name]{.underline}: ***Aplysia parvula *Morch, 1863
> >
> > FIRST and foremost, pandoc tries to preserve the underlined text, which
> > I don't want. Can this be disabled? I've tried the "bracketed_spans" and
> > "native_spans" extensions but this still processes the underlines as:
> >
> > **<u>Scientific Name</u>: ***Aplysia parvula *Morch, 1863
> >
> > SECOND, at least when I view this in VSCode's markdown preview, the bold
> > and emphasis are not presented correctly, I guess because they touch
> > each other or have spaces (or both?)? It displays correctly if it's:
> >
> > **Scientific Name:** *Aplysia parvula* Morch, 1863
> >
> > I realize that the text in the RTF might have the bold/italic tagged 
> > weirdly but is there a way to deal with this or am I just stuck? I have 
> > about 500 such files to process, so I'm looking for automated methods.
> >
> > Thanks in advance for any help you can provide!

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/bf4044b0-6746-4720-942f-53303a5cb296n%40googlegroups.com.

------=_Part_773_2011062584.1651084687424--

------=_Part_772_419294079.1651084687424--