From mboxrd@z Thu Jan 1 00:00:00 1970
From: John MacFarlane
Newsgroups: gmane.text.pandoc
Subject: Re: RTF to Markdown questions
Date: Wed, 27 Apr 2022 13:23:47 -0700
To: Kris Wilk, pandoc-discuss
Content-Type: text/plain; charset="UTF-8"

In pandoc's AST, that is a Span with an id. So your filter will
need to match on Span instead of underline, but otherwise same
as the last one.

Kris Wilk writes:

> Follow-up question: I just spotted another markup that has been preserved
> that I don't recognize (and don't want...only bold/italics). Here's a
> snippet of converted RTF to markdown:
>
> ...radially outwards. [Eyes]{#_Hlk64921583} black, set in a pale...
>
> I'm guessing that's some kind of RTF attribute/tag? I presume it can be
> stripped as easily as the underlines but I have no idea what it is in the
> first place. I suspect the answer is found somewhere in
>
> https://pandoc.org/lua-filters.html
>
> but I'm not sure where to look.
>
> Kris
>
> On Wednesday, April 27, 2022 at 2:38:07 PM UTC-4 Kris Wilk wrote:
>
>> Thanks, the script to strip the underlines is exactly what I needed. As
>> for the spaces in the bolds/italics, I'll add an issue for that. Obviously,
>> this is not pandoc's fault but if you have a workaround that can be ported
>> to the RTF reader at some point, that would be super.
>>
>> In the meantime, I guess I'll investigate doing the job myself in Lua.
>> Even if it takes me a couple of days to figure out it'll be faster than
>> processing every file manually!
>>
>> Kris
>>
>> On Wednesday, April 27, 2022 at 2:28:20 PM UTC-4 John MacFarlane wrote:
>>
>>>
>>> The issue with bold is probably because the RTF file includes
>>> some spaces inside the boldface emphasis.
>>> That is depressingly
>>> common in word processing documents, and we have code in the docx
>>> reader, if I recall, that handles it by converting
>>>
>>> <b>helloSPACE</b>
>>> to
>>> <b>hello</b>SPACE
>>>
>>> We could port this over to the RTF reader, I think -- can you
>>> put up an issue on the tracker so we don't forget?
>>>
>>> The other issue can be handled using a simple Lua filter.
>>> Save it as ununderline.lua and use -L ununderline.lua on
>>> the command line:
>>>
>>> function Underline(el)
>>>   return el.content
>>> end
>>>
>>> You could probably handle the spacing issue with a more complex
>>> Lua filter, as well.
>>>
>>> Kris Wilk writes:
>>>
>>> > Sorry if anyone gets this twice, had to correct my formatting...
>>> >
>>> > I'm trying to use pandoc (for the first time) to convert some RTF files to
>>> > markdown. My goal is to extract the text with ***bold*** and **italics**
>>> > preserved and no other formatting.
>>> >
>>> > Simply converting with "pandoc in.rtf -o out.md" produces a markdown file
>>> > that's not quite what I need. For instance, here's a line from the output:
>>> >
>>> > **[Scientific Name]{.underline}: ***Aplysia parvula *Morch, 1863
>>> >
>>> > FIRST and foremost, pandoc tries to preserve the underlined text, which I
>>> > don't want. Can this be disabled? I've tried the "bracketed_spans" and
>>> > "native_spans" extensions but this still processes the underlines as:
>>> >
>>> > **Scientific Name: ***Aplysia parvula *Morch, 1863
>>> >
>>> > SECOND, at least when I view this in VSCode's markdown preview, the bold
>>> > and emphasis are not presented correctly, I guess because they touch each
>>> > other or have spaces (or both?)? It displays correctly if it's:
>>> >
>>> > **Scientific Name:** *Aplysia parvula* Morch, 1863
>>> >
>>> > I realize that the text in the RTF might have the bold/italic tagged
>>> > weirdly but is there a way to deal with this or am I just stuck?
>>> > I have
>>> > about 500 such files to process, so I'm looking for automated methods.
>>> >
>>> > Thanks in advance for any help you can provide!
>>> >
>>> > --
>>> > You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
>>> > To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discus...-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
>>> > To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/pandoc-discuss/aecd40a2-09db-4e1b-96ad-752973375e0cn%40googlegroups.com.
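
The thread describes two filters without spelling them out: one that matches on Span the same way the quoted Underline filter does, and a "more complex" one for the spacing issue. What follows is only a sketch under stated assumptions, not the docx reader's code and not a filter posted in the thread. It assumes every Span and Underline in these files is unwanted, and it only handles a single trailing Space inside bold or italic text. The helper name move_trailing_space and the file name cleanup.lua are made up for illustration.

```lua
-- cleanup.lua (hypothetical file name)

-- Unwrap every Span, keeping only the inlines inside it. This drops
-- attributes such as the id in [Eyes]{#_Hlk64921583}.
-- Assumption: no Span in the input carries anything worth keeping.
function Span(el)
  return el.content
end

-- Same idea as the Underline filter quoted in the thread, repeated here
-- so that a single file covers both cases.
function Underline(el)
  return el.content
end

-- Hypothetical helper: if bold or italic text ends with a Space, move that
-- Space to just after the element. Leading spaces and other whitespace
-- elements (e.g. SoftBreak) are deliberately not handled in this sketch.
local function move_trailing_space(el)
  local content = el.content
  local n = #content
  if n > 1 and content[n].t == "Space" then
    content:remove(n)
    -- Assign the list back rather than relying on in-place mutation.
    el.content = content
    -- Returning a list of inlines replaces the element with that list.
    return { el, pandoc.Space() }
  end
  -- Returning nil leaves the element unchanged.
  return nil
end

Strong = move_trailing_space
Emph = move_trailing_space
```

It would be used like the earlier filter, for example pandoc in.rtf -L cleanup.lua -o out.md. Whether the result matches the desired "**Scientific Name:** *Aplysia parvula* Morch, 1863" depends on how the RTF reader builds the AST for a given file, so it is worth checking a few files with -t native before running it over all 500.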