From mboxrd@z Thu Jan 1 00:00:00 1970
X-Msuck: nntp://news.gmane.io/gmane.text.pandoc/30497
Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail
From: Kris Wilk
Newsgroups: gmane.text.pandoc
Subject: Re: RTF to Markdown questions
Date: Wed, 27 Apr 2022 16:51:56 -0700 (PDT)
Reply-To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="----=_Part_140_292931822.1651103516969"
To: pandoc-discuss
X-Original-Sender: kris-AwXHIjbJCMCw5LPnMra/2Q@public.gmane.org
Precedence: list
Mailing-list: list pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org; contact pandoc-discuss+owners-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org
Xref: news.gmane.io gmane.text.pandoc:30497

------=_Part_140_292931822.1651103516969
Content-Type: multipart/alternative; boundary="----=_Part_141_1010690326.1651103516969"

------=_Part_141_1010690326.1651103516969
Content-Type: text/plain; charset="UTF-8"

Thanks, that did the trick!

Kris

On Wednesday, April 27, 2022 at 4:23:55 PM UTC-4 John MacFarlane wrote:

>
> In pandoc's AST, that is a Span with an id.
> So your filter will need to match on Span instead of underline,
> but otherwise same as the last one.
>
> Kris Wilk writes:
>
> > Follow-up question: I just spotted another markup that has been preserved
> > that I don't recognize (and don't want...only bold/italics). Here's a
> > snippet of converted RTF to markdown:
> >
> > ...radially outwards. [Eyes]{#_Hlk64921583} black, set in a pale...
> >
> > I'm guessing that's some kind of RTF attribute/tag? I presume it can be
> > stripped as easily as the underlines but I have no idea what it is in the
> > first place. I suspect the answer is found somewhere in
> >
> > https://pandoc.org/lua-filters.html
> >
> > but I'm not sure where to look.
> >
> > Kris
> >
> > On Wednesday, April 27, 2022 at 2:38:07 PM UTC-4 Kris Wilk wrote:
> >
> >> Thanks, the script to strip the underlines is exactly what I needed. As
> >> for the spaces in the bolds/italics, I'll add an issue for that. Obviously,
> >> this is not pandoc's fault but if you have a workaround that can be ported
> >> to the RTF reader at some point, that would be super.
> >>
> >> In the meantime, I guess I'll investigate doing the job myself in Lua.
> >> Even if it takes me a couple of days to figure out, it'll be faster than
> >> processing every file manually!
> >>
> >> Kris
> >>
> >> On Wednesday, April 27, 2022 at 2:28:20 PM UTC-4 John MacFarlane wrote:
> >>
> >>>
> >>> The issue with bold is probably because the RTF file includes
> >>> some spaces inside the boldface emphasis. That is depressingly
> >>> common in word processing documents, and we have code in the docx
> >>> reader, if I recall, that handles it by converting
> >>>
> >>> <b>helloSPACE</b>
> >>> to
> >>> <b>hello</b>SPACE
> >>>
> >>> We could port this over to the RTF reader, I think -- can you
> >>> put up an issue on the tracker so we don't forget?
> >>>
> >>> The other issue can be handled using a simple Lua filter.
> >>> Save it as ununderline.lua and use -L ununderline.lua on
> >>> the command line:
> >>>
> >>> -- Replace each Underline element with its contents.
> >>> function Underline(el)
> >>>   return el.content
> >>> end
> >>>
> >>> You could probably handle the spacing issue with a more complex
> >>> Lua filter, as well.
> >>>
> >>> Kris Wilk writes:
> >>>
> >>> > Sorry if anyone gets this twice, had to correct my formatting...
> >>> >
> >>> > I'm trying to use pandoc (for the first time) to convert some RTF files to
> >>> > markdown. My goal is to extract the text with ***bold*** and **italics**
> >>> > preserved and no other formatting.
> >>> >
> >>> > Simply converting with "pandoc in.rtf -o out.md" produces a markdown file
> >>> > that's not quite what I need. For instance, here's a line from the output:
> >>> >
> >>> > **[Scientific Name]{.underline}: ***Aplysia parvula *Morch, 1863
> >>> >
> >>> > FIRST and foremost, pandoc tries to preserve the underlined text, which I
> >>> > don't want. Can this be disabled?
> >>> > I've tried the "bracketed_spans" and "native_spans" extensions
> >>> > but this still processes the underlines as:
> >>> >
> >>> > **<u>Scientific Name</u>: ***Aplysia parvula *Morch, 1863
> >>> >
> >>> > SECOND, at least when I view this in VSCode's markdown preview, the bold
> >>> > and emphasis are not presented correctly, I guess because they touch each
> >>> > other or have spaces (or both?)? It displays correctly if it's:
> >>> >
> >>> > **Scientific Name:** *Aplysia parvula* Morch, 1863
> >>> >
> >>> > I realize that the text in the RTF might have the bold/italic tagged
> >>> > weirdly but is there a way to deal with this or am I just stuck? I have
> >>> > about 500 such files to process, so I'm looking for automated methods.
> >>> >
> >>> > Thanks in advance for any help you can provide!
> >>> >
> >>> > --
> >>> > You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
> >>> > To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discus...-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
> >>> > To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/aecd40a2-09db-4e1b-96ad-752973375e0cn%40googlegroups.com.
> >>>
> >>
> >
> > --
> > You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
> > To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discus...-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
> > To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/dec1524b-c96f-4090-be4c-6a2509879d59n%40googlegroups.com.
>

--
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/d529f882-3339-4536-8021-eacc3fafc457n%40googlegroups.com.
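
John's description of the Span case ("match on Span instead of underline, but otherwise same as the last one") could be sketched as the filter below. The file name unspan.lua is an assumption made for illustration and does not appear in the thread. Note that this version unwraps every Span it meets, not only those carrying an id such as #_Hlk64921583, which may or may not be what you want for other documents.

```lua
-- unspan.lua (hypothetical file name, not from the thread)
-- Replace each Span element with its contents, which drops the
-- span's id, classes and attributes while keeping the text.
function Span(el)
  return el.content
end
```

Assuming this file sits next to ununderline.lua, a command along the lines of "pandoc in.rtf -o out.md -L ununderline.lua -L unspan.lua" should run both filters in the order given.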
------=_Part_141_1010690326.1651103516969--
------=_Part_140_292931822.1651103516969--
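
The spacing issue raised in the thread (spaces sitting inside a bold or italic run) might be handled along the following lines. This is only a sketch of the kind of "more complex Lua filter" John alludes to; it is not his code and not the docx reader's logic. The helper name hoist_spaces is made up for illustration, and the sketch assumes the stray whitespace arrives as Space elements at the very start or end of a Strong or Emph.

```lua
-- Sketch: move leading and trailing Space elements out of Strong/Emph.
-- hoist_spaces is a hypothetical helper name, not part of pandoc.
local function hoist_spaces(el)
  local content = el.content
  local before, after = {}, {}

  -- Peel Space elements off the front of the content list.
  while #content > 0 and content[1].t == "Space" do
    before[#before + 1] = table.remove(content, 1)
  end

  -- Peel Space elements off the end of the content list.
  while #content > 0 and content[#content].t == "Space" do
    table.insert(after, 1, table.remove(content))
  end

  -- Nothing moved: returning nil leaves the element unchanged.
  if #before == 0 and #after == 0 then
    return nil
  end

  -- Assign the trimmed list back in case `content` was a copy.
  el.content = content

  -- Rebuild as: leading spaces, the trimmed element, trailing spaces.
  local result = {}
  for _, s in ipairs(before) do result[#result + 1] = s end
  if #content > 0 then result[#result + 1] = el end
  for _, s in ipairs(after) do result[#result + 1] = s end
  return result
end

function Strong(el) return hoist_spaces(el) end
function Emph(el) return hoist_spaces(el) end
```

Saved under a name of your choosing and passed with -L, the intent is to turn output like "**Scientific Name: ***Aplysia parvula *" into the "**Scientific Name:** *Aplysia parvula*" form Kris was after. It has not been checked against the actual RTF files, and whitespace that is embedded in a Str element or nested deeper would need extra handling.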