* RTF to Markdown questions
From: Kris Wilk @ 2022-04-27 18:02 UTC
To: pandoc-discuss

Sorry if anyone gets this twice, had to correct my formatting...

I'm trying to use pandoc (for the first time) to convert some RTF files to markdown. My goal is to extract the text with **bold** and *italics* preserved and no other formatting.

Simply converting with "pandoc in.rtf -o out.md" produces a markdown file that's not quite what I need. For instance, here's a line from the output:

    **[Scientific Name]{.underline}: ***Aplysia parvula *Morch, 1863

FIRST and foremost, pandoc tries to preserve the underlined text, which I don't want. Can this be disabled? I've tried the "bracketed_spans" and "native_spans" extensions but this still processes the underlines as:

    **<u>Scientific Name</u>: ***Aplysia parvula *Morch, 1863

SECOND, at least when I view this in VSCode's markdown preview, the bold and emphasis are not presented correctly, I guess because they touch each other or have spaces (or both?)? It displays correctly if it's:

    **Scientific Name:** *Aplysia parvula* Morch, 1863

I realize that the text in the RTF might have the bold/italic tagged weirdly but is there a way to deal with this or am I just stuck? I have about 500 such files to process, so I'm looking for automated methods.

Thanks in advance for any help you can provide!

--
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/aecd40a2-09db-4e1b-96ad-752973375e0cn%40googlegroups.com.

* Re: RTF to Markdown questions
From: John MacFarlane @ 2022-04-27 18:28 UTC
To: Kris Wilk, pandoc-discuss

The issue with bold is probably because the RTF file includes some spaces inside the boldface emphasis. That is depressingly common in word processing documents, and we have code in the docx reader, if I recall, that handles it by converting

    <b>helloSPACE</b>

to

    <b>hello</b>SPACE

We could port this over to the RTF reader, I think -- can you put up an issue on the tracker so we don't forget?

The other issue can be handled using a simple Lua filter. Save it as ununderline.lua and use -L ununderline.lua on the command line:

    function Underline(el)
      return el.content
    end

You could probably handle the spacing issue with a more complex Lua filter, as well.

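A minimal sketch of what the "more complex Lua filter" for the spacing issue might look like. It assumes the RTF reader emits the stray blanks as separate Space elements at the start or end of a Strong or Emph element, which has not been verified against the files in question. The file name trim-emphasis.lua and the helper hoist_spaces are made up for illustration, and this is not the docx reader's actual logic:

```lua
-- trim-emphasis.lua (hypothetical name): move Space elements that sit at the
-- edges of a Strong or Emph element to just outside that element.

local function hoist_spaces(el)
  local before, after = {}, {}
  local content = el.content

  -- Peel leading Space elements off the front.
  while #content > 0 and content[1].tag == "Space" do
    before[#before + 1] = table.remove(content, 1)
  end
  -- Peel trailing Space elements off the end.
  while #content > 0 and content[#content].tag == "Space" do
    table.insert(after, 1, table.remove(content))
  end

  if #before == 0 and #after == 0 then
    return nil -- nothing to move; nil leaves the element unchanged
  end

  -- Write the trimmed list back in case el.content handed us a copy.
  el.content = content

  -- Result: leading spaces, the trimmed element (if non-empty), trailing spaces.
  local result = before
  if #content > 0 then
    result[#result + 1] = el
  end
  for _, sp in ipairs(after) do
    result[#result + 1] = sp
  end
  return result
end

-- pandoc picks up global functions named after element types.
Strong = hoist_spaces
Emph = hoist_spaces
```

Saved under that name, it would be run like the filter above, for example "pandoc in.rtf -L ununderline.lua -L trim-emphasis.lua -o out.md". Blanks stored inside a Str element rather than as separate Space elements would not be touched.
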
* Re: RTF to Markdown questions
From: Kris Wilk @ 2022-04-27 18:38 UTC
To: pandoc-discuss

Thanks, the script to strip the underlines is exactly what I needed. As for the spaces in the bolds/italics, I'll add an issue for that. Obviously, this is not pandoc's fault but if you have a workaround that can be ported to the RTF reader at some point, that would be super.

In the meantime, I guess I'll investigate doing the job myself in Lua. Even if it takes me a couple of days to figure out it'll be faster than processing every file manually!

Kris

* Re: RTF to Markdown questions
From: Kris Wilk @ 2022-04-27 19:06 UTC
To: pandoc-discuss

Follow-up question: I just spotted another piece of markup that has been preserved that I don't recognize (and don't want...only bold/italics). Here's a snippet of converted RTF to markdown:

    ...radially outwards. [Eyes]{#_Hlk64921583} black, set in a pale...

I'm guessing that's some kind of RTF attribute/tag? I presume it can be stripped as easily as the underlines but I have no idea what it is in the first place. I suspect the answer is found somewhere in

https://pandoc.org/lua-filters.html

but I'm not sure where to look.

Kris

* Re: RTF to Markdown questions
From: John MacFarlane @ 2022-04-27 20:23 UTC
To: Kris Wilk, pandoc-discuss

In pandoc's AST, that is a Span with an id. So your filter will need to match on Span instead of Underline, but is otherwise the same as the last one.

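Following that description, the Span version might look like the sketch below; unspan.lua is a hypothetical file name. Note that it unwraps every Span, including ones carrying an id, class, or attribute that something else might refer to, so it only suits a case where all span markup is unwanted:

```lua
-- unspan.lua (hypothetical name): replace each Span with its contents,
-- dropping the id and any other attributes attached to it.
function Span(el)
  return el.content
end
```

It can be combined with the earlier filter on one command line, for example "pandoc in.rtf -L ununderline.lua -L unspan.lua -o out.md".
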
* Re: RTF to Markdown questions
From: Bastien DUMONT @ 2022-04-27 20:56 UTC
To: pandoc-discuss

Do you have any cross-references to it in your RTF file? If so, you may want to keep it.

On Wednesday 27 April 2022 at 01:23:47PM, John MacFarlane wrote:
> In pandoc's AST, that is a Span with an id.

* Re: RTF to Markdown questions
From: Kris Wilk @ 2022-04-27 23:51 UTC
To: pandoc-discuss

In this case, no, any such references/markup other than bold/italic are spurious and must be removed. The aim is to reduce the content to plain text with the simplest markup only, for ingestion into a database that will be used by a web app.

Kris

On Wednesday, April 27, 2022 at 4:56:22 PM UTC-4 Bastien Dumont wrote:
> Do you have any cross-references to it in your RTF file? If so,
> you may want to keep it.

* Re: RTF to Markdown questions
       [not found] ` <yh480kbkwmtgvg.fsf-pgq/RBwaQ+zq8tPRBa0AtqxOck334EZe@public.gmane.org>
  2022-04-27 20:56 ` Bastien DUMONT
@ 2022-04-27 23:51 ` Kris Wilk
  1 sibling, 0 replies; 9+ messages in thread
From: Kris Wilk @ 2022-04-27 23:51 UTC
To: pandoc-discuss

Thanks, that did the trick!

Kris

On Wednesday, April 27, 2022 at 4:23:55 PM UTC-4 John MacFarlane wrote:

> In pandoc's AST, that is a Span with an id.
> So your filter will need to match on Span instead of underline,
> but otherwise same as the last one.
>
> Kris Wilk <kr...-AwXHIjbJCMCw5LPnMra/2Q@public.gmane.org> writes:
>
> > Follow-up question: I just spotted another markup that has been preserved
> > that I don't recognize (and don't want...only bold/italics). Here's a
> > snippet of converted RTF to markdown:
> >
> > ...radially outwards. [Eyes]{#_Hlk64921583} black, set in a pale...
> >
> > I'm guessing that's some kind of RTF attribute/tag? I presume it can be
> > stripped as easily as the underlines but I have no idea what it is in the
> > first place. I suspect the answer is found somewhere in
> >
> > https://pandoc.org/lua-filters.html
> >
> > but I'm not sure where to look.
> >
> > Kris
> >
> > On Wednesday, April 27, 2022 at 2:38:07 PM UTC-4 Kris Wilk wrote:
> >
> >> Thanks, the script to strip the underlines is exactly what I needed. As
> >> for the spaces in the bolds/italics, I'll add an issue for that. Obviously,
> >> this is not pandoc's fault but if you have a workaround that can be ported
> >> to the RTF reader at some point, that would be super.
> >>
> >> In the meantime, I guess I'll investigate doing the job myself in Lua.
> >> Even if it takes me a couple of days to figure out it'll be faster than
> >> processing every file manually!
> >>
> >> Kris
> >>
> >> On Wednesday, April 27, 2022 at 2:28:20 PM UTC-4 John MacFarlane wrote:
> >>
> >>> The issue with bold is probably because the RTF file includes
> >>> some spaces inside the boldface emphasis. That is depressingly
> >>> common in word processing documents, and we have code in the docx
> >>> reader, if I recall, that handles it by converting
> >>>
> >>> <b>helloSPACE</b>
> >>> to
> >>> <b>hello</b>SPACE
> >>>
> >>> We could port this over to the RTF reader, I think -- can you
> >>> put up an issue on the tracker so we don't forget?
> >>>
> >>> The other issue can be handled using a simple Lua filter.
> >>> Save it as ununderline.lua and use -L ununderline.lua on
> >>> the command line:
> >>>
> >>> function Underline(el)
> >>>   return el.content
> >>> end
> >>>
> >>> You could probably handle the spacing issue with a more complex
> >>> Lua filter, as well.
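
For readers skimming the archive, the two filters John describes in the quoted replies can probably live in one file. This is only a sketch: the file name strip-spans.lua is made up for illustration, and the Span function is simply "the same as the last one" applied to Span, as John suggests. Note that it unwraps every Span, not just the ones carrying an id like #_Hlk64921583, which seems to be what is wanted here since only bold and italics should survive.

```lua
-- strip-spans.lua (hypothetical file name)
-- Replace each Underline element with its contents, dropping the underline.
function Underline(el)
  return el.content
end

-- Replace each Span element with its contents. This drops ids, classes
-- and attributes from every span in the document.
function Span(el)
  return el.content
end
```

It would be used the same way as ununderline.lua above, for example: pandoc in.rtf -L strip-spans.lua -o out.md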
* Re: RTF to Markdown questions
       [not found] ` <m27d7aqt35.fsf-jF64zX8BO0+FqBokazbCQ6OPv3vYUT2dxr7GGTnW70NeoWH0uzbU5w@public.gmane.org>
  2022-04-27 18:38 ` Kris Wilk
@ 2022-04-28  0:53 ` Kris Wilk
  1 sibling, 0 replies; 9+ messages in thread
From: Kris Wilk @ 2022-04-28 0:53 UTC
To: pandoc-discuss

On Wednesday, April 27, 2022 at 2:28:20 PM UTC-4 John MacFarlane wrote:

> ...we have code in the docx reader, if I recall, that handles [spaces
> inside boldface markup]...

John, I just wanted to thank you for mentioning the above tidbit, because
it gave me a workaround... I first converted all my RTFs to DOCX, then to MD
(using your other tips to strip the remaining cruft). Fixed the lousy
markup/space issues perfectly.

I'll still put in an issue to suggest porting this to the RTF reader, but I
was able to get the job done. 👍

Kris
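
For anyone who would rather skip the DOCX detour, the "more complex Lua filter" John alludes to earlier in the thread might look roughly like the following. This is an untested-against-real-RTF sketch, not what the docx reader actually does: the helper name hoist_spaces is invented here, and it only moves plain Space elements at the very start or end of bold/italic text, so other whitespace (soft breaks, non-breaking spaces) would need extra handling.

```lua
-- Move leading and trailing Space elements out of Strong and Emph,
-- so that <b>helloSPACE</b> becomes <b>hello</b>SPACE.
-- hoist_spaces is a hypothetical helper name, not part of pandoc.
local function hoist_spaces(el)
  local content = el.content
  local before, after = {}, {}
  -- Peel Space elements off the front, keeping their order.
  while #content > 0 and content[1].t == "Space" do
    table.insert(before, table.remove(content, 1))
  end
  -- Peel Space elements off the back, keeping their order.
  while #content > 0 and content[#content].t == "Space" do
    table.insert(after, 1, table.remove(content))
  end
  if #before == 0 and #after == 0 then
    return nil -- nothing to move; pandoc keeps the element unchanged
  end
  local result = before
  if #content > 0 then
    el.content = content -- write the trimmed list back to the element
    table.insert(result, el)
  end
  for _, space in ipairs(after) do
    table.insert(result, space)
  end
  return result
end

Strong = hoist_spaces
Emph = hoist_spaces
```

If the bold or italic element contained nothing but spaces, the sketch drops the empty wrapper and keeps only the spaces. It can be passed with -L alongside the underline filter.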
end of thread, other threads:[~2022-04-28  0:53 UTC | newest]

Thread overview: 9+ messages
2022-04-27 18:02 RTF to Markdown questions Kris Wilk
2022-04-27 18:28 ` John MacFarlane
2022-04-27 18:38 ` Kris Wilk
2022-04-27 19:06 ` Kris Wilk
2022-04-27 20:23 ` John MacFarlane
2022-04-27 20:56 ` Bastien DUMONT
2022-04-27 23:51 ` Kris Wilk
2022-04-27 23:51 ` Kris Wilk
2022-04-28  0:53 ` Kris Wilk