From mboxrd@z Thu Jan 1 00:00:00 1970 X-Msuck: nntp://news.gmane.io/gmane.text.pandoc/30493 Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail From: Kris Wilk Newsgroups: gmane.text.pandoc Subject: Re: RTF to Markdown questions Date: Wed, 27 Apr 2022 12:06:07 -0700 (PDT) Message-ID: References: Reply-To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="----=_Part_38_1841329272.1651086367444" Injection-Info: ciao.gmane.io; posting-host="blaine.gmane.org:116.202.254.214"; logging-data="27888"; mail-complaints-to="usenet@ciao.gmane.io" To: pandoc-discuss Original-X-From: pandoc-discuss+bncBCC5D7WV5UIRBIFIU2JQMGQEMPVHCJY-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org Wed Apr 27 21:06:11 2022 Return-path: Envelope-to: gtp-pandoc-discuss@m.gmane-mx.org Original-Received: from mail-oo1-f63.google.com ([209.85.161.63]) by ciao.gmane.io with esmtps (TLS1.3:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.92) (envelope-from ) id 1njmzL-00071I-Ay for gtp-pandoc-discuss@m.gmane-mx.org; Wed, 27 Apr 2022 21:06:11 +0200 Original-Received: by mail-oo1-f63.google.com with SMTP id c10-20020a4ac30a000000b0035e9b79e259sf1044550ooq.15 for ; Wed, 27 Apr 2022 12:06:11 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=googlegroups.com; s=20210112; h=sender:date:from:to:message-id:in-reply-to:references:subject :mime-version:x-original-sender:reply-to:precedence:mailing-list :list-id:list-post:list-help:list-archive:list-subscribe :list-unsubscribe; bh=U5gL5sJ+GdWwpJ8N8wVXhV3uslx3Xl7znetR3T/8JVE=; b=GuxSOL+JrgB/b3LDhLV0hA5KfEWTPYx3L2eUlauzKJPmTtwxzyhXN3GI8exIT0E8/p TNNmihTXnr3ITk+fuIYkFczJ1kMt4Onzsz+VOVUGe7f7XFmYozlgmlWjc8dslqowAwq3 WUX/zztZdA8WilxLmyR8bRHhwoHX9ACKhxb74yVFu7XFEd7ycM8q5dwigw/o1N5r0G+v 0H2NxDzmA3WNJDXSgTgHtzLIuvYZkU5PsmbrYR48J/QUSgQ+T6DApK43VhY1el3FIRS/ CyLPtqStd4Xd9p3wObjUngX75o6Ul5XKKDVf9erL3eJz4gZM35d0+bt4Zj1agG7aySva dwPA== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=reefnet.ca; 
s=google; h=date:from:to:message-id:in-reply-to:references:subject:mime-version :x-original-sender:reply-to:precedence:mailing-list:list-id :list-post:list-help:list-archive:list-subscribe:list-unsubscribe; bh=U5gL5sJ+GdWwpJ8N8wVXhV3uslx3Xl7znetR3T/8JVE=; b=U34dXimrrKjsXybNkiH4hFhd3cDPvEh9SNNU63RDaikiT7UVXfP7n29mvtLTTaCiim QZxz8Imy14dExtgT4BNR55Gfyzyzdv06vc4ArYPr04xTJZyLRbhRx2u/TJdBsnobU6wo 7lbAswU6IdEnwiyZqZU3cJDU+2hpphe13Murw= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=sender:x-gm-message-state:date:from:to:message-id:in-reply-to :references:subject:mime-version:x-original-sender:reply-to :precedence:mailing-list:list-id:x-spam-checked-in-group:list-post :list-help:list-archive:list-subscribe:list-unsubscribe; bh=U5gL5sJ+GdWwpJ8N8wVXhV3uslx3Xl7znetR3T/8JVE=; b=TPGx2hb8amg2U7xlxSQRykAAeQWeFv/DYyuVFTRQXLCPeon7VTclSeoIavBtBtO2Qv 112X15tqtBnHjooMIjzNln7w9Om9sb46wxfVfeIbk8stVG8G6pD+YLwZeJgqFXEFQl7M /DT1OY0nXWLLUWtFbDsmQXkrJaQAcyOPXTaM9f4eMiUDMXPZ8iE0BnbsxdR3We/RfcHt bNpfjlBmItckt02KxGOpCX69jl0GtdmjMmUVFP2uXeNBVe5tLq/+BF6h4hOL6nmZ10dS yrAUqAYn4t1KFEV6HeqZpS3dvRsRD/TuHdyPDH4XpU0xbMdtBNS9WZryOusW4Wlidq0K dU6g== Original-Sender: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org X-Gm-Message-State: AOAM533ZeczlHjHapij+Jvm3uncVliv5rT1TZtvm0jx+QxPRdTfElGVy 4bJuXGOxU+shdXhdDOtQJ1U= X-Google-Smtp-Source: ABdhPJztNGhIBLPqET9k8/bvMr+EUaxsN7M9H+TANQQP36piZRktVc1hJqEjzBZjVs9qi838TlFQNg== X-Received: by 2002:a05:6870:c393:b0:e2:ae03:70ff with SMTP id g19-20020a056870c39300b000e2ae0370ffmr11423391oao.231.1651086370192; Wed, 27 Apr 2022 12:06:10 -0700 (PDT) X-BeenThere: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org Original-Received: by 2002:a05:6870:c145:b0:e6:6ee9:6279 with SMTP id g5-20020a056870c14500b000e66ee96279ls6559932oad.1.gmail; Wed, 27 Apr 2022 12:06:08 -0700 (PDT) X-Received: by 2002:a05:6870:41ca:b0:e9:c84:987a with SMTP id z10-20020a05687041ca00b000e90c84987amr10348412oac.149.1651086368096; Wed, 27 Apr 2022 
12:06:08 -0700 (PDT)
In-Reply-To:
X-Original-Sender: kris-AwXHIjbJCMCw5LPnMra/2Q@public.gmane.org
Precedence: list
Mailing-list: list pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org; contact pandoc-discuss+owners-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org
List-ID:
X-Google-Group-Id: 1007024079513
List-Post: , List-Help: , List-Archive: , List-Unsubscribe: ,
Xref: news.gmane.io gmane.text.pandoc:30493
Archived-At:

------=_Part_38_1841329272.1651086367444
Content-Type: multipart/alternative; boundary="----=_Part_39_1886742796.1651086367444"

------=_Part_39_1886742796.1651086367444
Content-Type: text/plain; charset="UTF-8"

Follow-up question: I just spotted another markup that has been preserved
that I don't recognize (and don't want...only bold/italics). Here's a
snippet of converted RTF to markdown:

...radially outwards. [Eyes]{#_Hlk64921583} black, set in a pale...

I'm guessing that's some kind of RTF attribute/tag? I presume it can be
stripped as easily as the underlines but I have no idea what it is in the
first place. I suspect the answer is found somewhere in
https://pandoc.org/lua-filters.html but I'm not sure where to look.

Kris

On Wednesday, April 27, 2022 at 2:38:07 PM UTC-4 Kris Wilk wrote:

> Thanks, the script to strip the underlines is exactly what I needed. As
> for the spaces in the bolds/italics, I'll add an issue for that. Obviously,
> this is not pandoc's fault but if you have a workaround that can be ported
> to the RTF reader at some point, that would be super.
>
> In the meantime, I guess I'll investigate doing the job myself in Lua.
> Even if it takes me a couple of days to figure out it'll be faster than
> processing every file manually!
>
> Kris
>
> On Wednesday, April 27, 2022 at 2:28:20 PM UTC-4 John MacFarlane wrote:
>
>>
>> The issue with bold is probably because the RTF file includes
>> some spaces inside the boldface emphasis. That is depressingly
>> common in word processing documents, and we have code in the docx
>> reader, if I recall, that handles it by converting
>>
>> <b>helloSPACE</b>
>> to
>> <b>hello</b>SPACE
>>
>> We could port this over to the RTF reader, I think -- can you
>> put up an issue on the tracker so we don't forget?
>>
>> The other issue can be handled using a simple Lua filter.
>> Save it as ununderline.lua and use -L ununderline.lua on
>> the command line:
>>
>> function Underline(el)
>>   return el.content
>> end
>>
>> You could probably handle the spacing issue with a more complex
>> Lua filter, as well.
>>
>> Kris Wilk writes:
>>
>> > Sorry if anyone gets this twice, had to correct my formatting...
>> >
>> > I'm trying to use pandoc (for the first time) to convert some RTF files to
>> > markdown. My goal is to extract the text with ***bold*** and **italics**
>> > preserved and no other formatting.
>> >
>> > Simply converting with "pandoc in.rtf -o out.md" produces a markdown file
>> > that's not quite what I need. For instance, here's a line from the output:
>> >
>> > **[Scientific Name]{.underline}: ***Aplysia parvula *Morch, 1863
>> >
>> > FIRST and foremost, pandoc tries to preserve the underlined text, which I
>> > don't want. Can this be disabled? I've tried the "bracketed_spans" and
>> > "native_spans" extensions but this still processes the underlines as:
>> >
>> > **<u>Scientific Name</u>: ***Aplysia parvula *Morch, 1863
>> >
>> > SECOND, at least when I view this in VSCode's markdown preview, the bold
>> > and emphasis are not presented correctly, I guess because they touch each
>> > other or have spaces (or both?)? It displays correctly if it's:
>> >
>> > **Scientific Name:** *Aplysia parvula* Morch, 1863
>> >
>> > I realize that the text in the RTF might have the bold/italic tagged
>> > weirdly but is there a way to deal with this or am I just stuck? I have
>> > about 500 such files to process, so I'm looking for automated methods.
>> >
>> > Thanks in advance for any help you can provide!
>> >
>> > --
>> > You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
>> > To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discus...-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
>> > To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/aecd40a2-09db-4e1b-96ad-752973375e0cn%40googlegroups.com.
>> >

--
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/dec1524b-c96f-4090-be4c-6a2509879d59n%40googlegroups.com.

------=_Part_39_1886742796.1651086367444
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Follow-up question: I just spotted another markup that has been preserved that I don't recognize (and don't want...only bold/italics). Here's a snippet of converted RTF to markdown:

...radially outwards. [Eyes]{#_Hlk64921583} black, set in a pale...

I'm guessing that's some kind of RTF attribute/tag? I presume it can be stripped as easily as the underlines but I have no idea what it is in the first place. I suspect the answer is found somewhere in https://pandoc.org/lua-filters.html but I'm not sure where to look.

Kris
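
A minimal sketch of a filter for the `[Eyes]{#_Hlk64921583}` markup, assuming it is a pandoc Span whose identifier was carried over from a bookmark in the RTF source (names beginning with `_Hlk` look like the hidden bookmarks that Word generates). The filename `unbookmark.lua` and the `_Hlk` prefix test are illustrative assumptions, not anything pandoc prescribes:

```lua
-- unbookmark.lua (hypothetical filename)
-- Unwrap spans whose identifier looks like an auto-generated bookmark
-- (assumption: such identifiers start with "_Hlk"); keep all other spans.
function Span(el)
  if el.identifier:match('^_Hlk') then
    -- Returning the content replaces the span with its inner inlines.
    return el.content
  end
  -- Returning nothing leaves the element unchanged.
end
```

It would be applied the same way as the underline filter, for example `pandoc in.rtf -L unbookmark.lua -o out.md`. To drop every span regardless of its identifier, the function body can be reduced to `return el.content`.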

On Wednesday, April 27, 2022 at 2:38:07 PM UTC-4 Kris Wilk wrote:
Thanks, the script to strip the underlines is exactly what I needed. As for the spaces in the bolds/italics, I'll add an issue for that. Obviously, this is not pandoc's fault but if you have a workaround that can be ported to the RTF reader at some point, that would be super.

In the meantime, I guess I'll investigate doing the job myself in Lua. Even if it takes me a couple of days to figure out it'll be faster than processing every file manually!

Kris

On Wednesday, April 27, 2022 at 2:28:20 PM UTC-4 John MacFarlane wrote:

The issue with bold is probably because the RTF file includes
some spaces inside the boldface emphasis. That is depressingly
common in word processing documents, and we have code in the docx
reader, if I recall, that handles it by converting

<b>helloSPACE</b>
to
<b>hello</b>SPACE

We could port this over to the RTF reader, I think -- can you
put up an issue on the tracker so we don't forget?

The other issue can be handled using a simple Lua filter.
Save it as ununderline.lua and use -L ununderline.lua on
the command line:

-- Replace each Underline element with its contents, dropping the underline.
function Underline(el)
  return el.content
end

You could probably handle the spacing issue with a more complex
Lua filter, as well.
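
One possible shape for such a filter, as a sketch only: it moves leading and trailing Space elements out of Strong and Emph so that the emphasis delimiters sit directly against the text. The filename `trimspaces.lua` and the helper name `hoist_spaces` are made up for illustration, and the sketch assumes the stray whitespace appears as Space elements (not as SoftBreak elements or as spaces embedded in a Str):

```lua
-- trimspaces.lua (hypothetical filename)
-- Move leading/trailing Space elements from inside an emphasis element
-- to just outside it, e.g. Strong["Name:", Space] -> Strong["Name:"], Space.
local function hoist_spaces(el)
  local content = el.content
  local before, after = {}, {}
  -- Collect spaces at the start of the content.
  while #content > 0 and content[1].t == 'Space' do
    table.insert(before, table.remove(content, 1))
  end
  -- Collect spaces at the end of the content.
  while #content > 0 and content[#content].t == 'Space' do
    table.insert(after, 1, table.remove(content))
  end
  -- Nothing to move: leave the element unchanged.
  if #before == 0 and #after == 0 then
    return nil
  end
  local result = before
  -- Keep the emphasis element only if it still has content.
  if #content > 0 then
    el.content = content
    table.insert(result, el)
  end
  for _, space in ipairs(after) do
    table.insert(result, space)
  end
  return result
end

Strong = hoist_spaces
Emph = hoist_spaces
```

Combined with the underline filter (`-L ununderline.lua -L trimspaces.lua`), this should bring a line like the "Scientific Name" example closer to the desired form, provided the document tree really does hold the spaces as the last elements inside the bold and italic runs.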

Kris Wilk <kr...-AwXHIjbJCMCw5LPnMra/2Q@public.gmane.org> writes:

> Sorry if anyone gets this twice, had to correct my formatting...
>
> I'm trying to use pandoc (for the first time) to convert some RTF files to
> markdown. My goal is to extract the text with ***bold*** and **italics**
> preserved and no other formatting.
>
> Simply converting with "pandoc in.rtf -o out.md" produces a markdown file
> that's not quite what I need. For instance, here's a line from the output:
>
> **[Scientific Name]{.underline}: ***Aplysia parvula *Morch, 1863
>
> FIRST and foremost, pandoc tries to preserve the underlined text, which I
> don't want. Can this be disabled? I've tried the "bracketed_spans" and
> "native_spans" extensions but this still processes the underlines as:
>
> **<u>Scientific Name</u>: ***Aplysia parvula *Morch, 1863
>
> SECOND, at least when I view this in VSCode's markdown preview, the bold
> and emphasis are not presented correctly, I guess because they touch each
> other or have spaces (or both?)? It displays correctly if it's:
>
> **Scientific Name:** *Aplysia parvula* Morch, 1863
>
> I realize that the text in the RTF might have the bold/italic tagged
> weirdly but is there a way to deal with this or am I just stuck? I have
> about 500 such files to process, so I'm looking for automated methods.
>
> Thanks in advance for any help you can provide!
>
> --
> You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discus...-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org.
> To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/aecd40a2-09db-4e1b-96ad-752973375e0cn%40googlegroups.com.

--
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org.
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/dec1524b-c96f-4090-be4c-6a2509879d59n%40googlegroups.com.
------=_Part_39_1886742796.1651086367444-- ------=_Part_38_1841329272.1651086367444--