From mboxrd@z Thu Jan 1 00:00:00 1970
From: Bastien DUMONT
Newsgroups: gmane.text.pandoc
Subject: Re: RTF to Markdown questions
Date: Wed, 27 Apr 2022 20:56:15 +0000
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org

Do you have any cross-references to it in your RTF file? If so, you may want to keep the identifier.

On Wednesday 27 April 2022 at 01:23:47PM, John MacFarlane wrote:
>
> In pandoc's AST, that is a Span with an id.
> So your filter will need to match on Span instead of underline,
> but otherwise same as the last one.
>
> Kris Wilk writes:
>
> > Follow-up question: I just spotted another markup that has been preserved
> > that I don't recognize (and don't want...only bold/italics). Here's a
> > snippet of converted RTF to markdown:
> >
> > ...radially outwards. [Eyes]{#_Hlk64921583} black, set in a pale...
> >
> > I'm guessing that's some kind of RTF attribute/tag? I presume it can be
> > stripped as easily as the underlines but I have no idea what it is in the
> > first place. I suspect the answer is found somewhere in
> >
> > https://pandoc.org/lua-filters.html
> >
> > but I'm not sure where to look.
> >
> > Kris
> >
> > On Wednesday, April 27, 2022 at 2:38:07 PM UTC-4 Kris Wilk wrote:
> >
> >> Thanks, the script to strip the underlines is exactly what I needed. As
> >> for the spaces in the bolds/italics, I'll add an issue for that. Obviously,
> >> this is not pandoc's fault but if you have a workaround that can be ported
> >> to the RTF reader at some point, that would be super.
> >>
> >> In the meantime, I guess I'll investigate doing the job myself in Lua.
> >> Even if it takes me a couple of days to figure out it'll be faster than
> >> processing every file manually!
> >>
> >> Kris
> >>
> >> On Wednesday, April 27, 2022 at 2:28:20 PM UTC-4 John MacFarlane wrote:
> >>
> >>>
> >>> The issue with bold is probably because the RTF file includes
> >>> some spaces inside the boldface emphasis. That is depressingly
> >>> common in word processing documents, and we have code in the docx
> >>> reader, if I recall, that handles it by converting
> >>>
> >>> <b>helloSPACE</b>
> >>> to
> >>> <b>hello</b>SPACE
> >>>
> >>> We could port this over to the RTF reader, I think -- can you
> >>> put up an issue on the tracker so we don't forget?
> >>>
> >>> The other issue can be handled using a simple Lua filter.
> >>> Save it as ununderline.lua and use -L ununderline.lua on
> >>> the command line:
> >>>
> >>> function Underline(el)
> >>>   return el.content
> >>> end
> >>>
> >>> You could probably handle the spacing issue with a more complex
> >>> Lua filter, as well.
> >>>
> >>> Kris Wilk writes:
> >>>
> >>> > Sorry if anyone gets this twice, had to correct my formatting...
> >>> >
> >>> > I'm trying to use pandoc (for the first time) to convert some RTF files to
> >>> > markdown. My goal is to extract the text with ***bold*** and **italics**
> >>> > preserved and no other formatting.
> >>> >
> >>> > Simply converting with "pandoc in.rtf -o out.md" produces a markdown file
> >>> > that's not quite what I need. For instance, here's a line from the output:
> >>> >
> >>> > **[Scientific Name]{.underline}: ***Aplysia parvula *Morch, 1863
> >>> >
> >>> > FIRST and foremost, pandoc tries to preserve the underlined text, which I
> >>> > don't want. Can this be disabled?
> >>> > I've tried the "bracketed_spans" and
> >>> > "native_spans" extensions but this still processes the underlines as:
> >>> >
> >>> > **Scientific Name: ***Aplysia parvula *Morch, 1863
> >>> >
> >>> > SECOND, at least when I view this in VSCode's markdown preview, the bold
> >>> > and emphasis are not presented correctly, I guess because they touch each
> >>> > other or have spaces (or both?)? It displays correctly if it's:
> >>> >
> >>> > **Scientific Name:** *Aplysia parvula* Morch, 1863
> >>> >
> >>> > I realize that the text in the RTF might have the bold/italic tagged
> >>> > weirdly but is there a way to deal with this or am I just stuck? I have
> >>> > about 500 such files to process, so I'm looking for automated methods.
> >>> >
> >>> > Thanks in advance for any help you can provide!
> >>> >
> >>> > --
> >>> > You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
> >>> > To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discus...-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
> >>> > To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/aecd40a2-09db-4e1b-96ad-752973375e0cn%40googlegroups.com.
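
To make the cross-reference caveat concrete, here is a minimal sketch of a filter that unwraps spans carrying only an id (such as `[Eyes]{#_Hlk64921583}`) while leaving alone any span that an internal link points at. It rests on two assumptions that may not hold for every file: pandoc 2.17 or later (for the `walk` method), and cross-references arriving in the AST as links whose target is `#identifier`. The helper names `anchor_id` and `is_unused_anchor` are made up for this example.

```lua
-- Sketch only. Assumes pandoc >= 2.17 (for the `walk` method) and that any
-- cross-reference reaches the AST as a Link whose target is "#<identifier>".

-- Identifier that an internal link target points at, or nil for other targets.
local function anchor_id(target)
  return target:match('^#(.+)$')
end

-- True when a span carries an identifier, has no classes, and is not
-- referenced by any internal link recorded in `targets`.
-- Key-value attributes are not inspected and are dropped on unwrapping.
local function is_unused_anchor(identifier, class_count, targets)
  return identifier ~= '' and class_count == 0 and not targets[identifier]
end

function Pandoc(doc)
  -- First pass: record every identifier that some link points at.
  local targets = {}
  doc:walk {
    Link = function(link)
      local id = anchor_id(link.target)
      if id then targets[id] = true end
    end
  }
  -- Second pass: replace unreferenced id-only spans with their contents.
  return doc:walk {
    Span = function(span)
      if is_unused_anchor(span.identifier, #span.classes, targets) then
        return span.content
      end
    end
  }
end
```

Saved under a name of your choice (say `strip-anchors.lua`, a hypothetical file name), it would be run like the underline filter: `pandoc in.rtf -L strip-anchors.lua -o out.md`. If none of the files contain cross-references, the shorter `function Span(el) return el.content end` suggested in the quoted reply does the same job.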