From: Kris Wilk
Newsgroups: gmane.text.pandoc
To: pandoc-discuss
Subject: Re: RTF to Markdown questions
Date: Wed, 27 Apr 2022 11:38:07 -0700 (PDT)

Thanks, the script to strip the underlines is exactly what I needed. As for the spaces in the bolds/italics, I'll add an issue for that. Obviously, this is not pandoc's fault but if you have a workaround that can be ported to the RTF reader at some point, that would be super.

In the meantime, I guess I'll investigate doing the job myself in Lua. Even if it takes me a couple of days to figure out, it'll be faster than processing every file manually!

Kris

On Wednesday, April 27, 2022 at 2:28:20 PM UTC-4 John MacFarlane wrote:

The issue with bold is probably because the RTF file includes
some spaces inside the boldface emphasis. That is depressingly
common in word processing documents, and we have code in the docx
reader, if I recall, that handles it by converting

<b>helloSPACE</b>
to
<b>hello</b>SPACE

We could port this over to the RTF reader, I think -- can you
put up an issue on the tracker so we don't forget?

The other issue can be handled using a simple Lua filter.
Save it as ununderline.lua and use -L ununderline.lua on
the command line:

-- ununderline.lua: replace each Underline element with its contents
function Underline(el)
  return el.content
end

You could probably handle the spacing issue with a more complex
Lua filter, as well.
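
A minimal sketch of what such a spacing filter might look like, assuming the stray whitespace arrives as pandoc Space elements at the very start or end of a Strong or Emph element (if the reader leaves the space inside a Str instead, this will not catch it). The helper name hoist_spaces and the file name fix-spaces.lua are made up for illustration, and this is not the code from the docx reader:

```lua
-- fix-spaces.lua (hypothetical name): move leading/trailing Space
-- elements out of Strong and Emph so the emphasis markers hug the text.
local function hoist_spaces(el)
  local content = el.content
  local before, after = {}, {}

  -- Collect Space elements from the front of the content.
  while #content > 0 and content[1].t == "Space" do
    table.insert(before, table.remove(content, 1))
  end
  -- Collect Space elements from the back of the content.
  while #content > 0 and content[#content].t == "Space" do
    table.insert(after, 1, table.remove(content))
  end

  -- Nothing to move: returning nil leaves the element unchanged.
  if #before == 0 and #after == 0 then
    return nil
  end

  -- Rebuild as: leading spaces, the trimmed element, trailing spaces.
  local result = before
  if #content > 0 then
    el.content = content
    table.insert(result, el)
  end
  for _, sp in ipairs(after) do
    table.insert(result, sp)
  end
  return result
end

Strong = hoist_spaces
Emph = hoist_spaces
```

Saved as fix-spaces.lua, it could be combined with the underline filter by passing both on the command line, e.g. pandoc in.rtf -L ununderline.lua -L fix-spaces.lua -o out.md. An element that contains nothing but spaces is dropped and only the spaces are kept.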

Kris Wilk <kr...@reefnet.ca> writes:

> Sorry if anyone gets this twice, had to correct my formatting...
>
> I'm trying to use pandoc (for the first time) to convert some RTF files to
> markdown. My goal is to extract the text with ***bold*** and **italics**
> preserved and no other formatting.
>
> Simply converting with "pandoc in.rtf -o out.md" produces a markdown file
> that's not quite what I need. For instance, here's a line from the output:
>
> **[Scientific Name]{.underline}: ***Aplysia parvula *Morch, 1863
>
> FIRST and foremost, pandoc tries to preserve the underlined text, which I
> don't want. Can this be disabled? I've tried the "bracketed_spans" and
> "native_spans" extensions but this still processes the underlines as:
>
> **<u>Scientific Name</u>: ***Aplysia parvula *Morch, 1863
>
> SECOND, at least when I view this in VSCode's markdown preview, the bold
> and emphasis are not presented correctly, I guess because they touch each
> other or have spaces (or both?)? It displays correctly if it's:
>
> **Scientific Name:** *Aplysia parvula* Morch, 1863
>
> I realize that the text in the RTF might have the bold/italic tagged
> weirdly but is there a way to deal with this or am I just stuck? I have
> about 500 such files to process, so I'm looking for automated methods.
>
> Thanks in advance for any help you can provide!

--
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org.
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/bf4044b0-6746-4720-942f-53303a5cb296n%40googlegroups.com.