On 2019-08-31 14:31, Joseph Reagle wrote:
> I'm converting Word docxs from a bunch of people to markdown.
>
> I want fairly simple markdown: headings, links, footnotes, italics, and bold. Which option could I use to avoid custom Word styles and every paragraph being in [...]{dir="ltr"}? (I think this has to do with lettering direction? All docs are in English, but people might have Word configured differently...)

Actually, those `dir="ltr"` attributes shouldn't be there unless the default text direction of the document and/or the paragraph is rtl (right-to-left): if the text direction of a text element agrees with the document-wide or paragraph-wide default, it should simply be ignored.

I generated a docx document with some paragraphs and test spans decorated with an explicit `dir=ltr` and then converted it back to Markdown. The `dir=ltr` didn't appear in the roundtripped Markdown, apparently because ltr was the default anyway. When I changed all instances of "ltr" to "rtl" in the original Markdown, the attributes came through both in the roundtripped Markdown and in the docx when opened in LibreOffice.

When I changed the default paragraph style to rtl and converted that to Markdown, the rtl attributes on whole paragraphs and character spans disappeared, which was what I expected. OTOH the last character of every paragraph got wrapped in a span with `dir="rtl"`, which would seem to be a bug!

Next I used the docx with the default paragraph style set to rtl as `--reference-doc` when creating another docx from Markdown where some text and one paragraph were explicitly marked as ltr. The result was a docx where everything was rtl; my marking one paragraph as ltr had no effect. I then marked that same paragraph as ltr in LibreOffice, making sure that the default style was rtl. When I converted that to Markdown, the paragraph explicitly marked as ltr looked normal, but in the paragraph with the default rtl the last character was again wrapped in an rtl span.

So it seems that your source docx has some paragraph styles set to rtl, with paragraphs manually marked as ltr, which for some reason results in some text being wrapped in ltr spans. I can't reproduce this using LibreOffice, though, which only supports setting writing direction at the paragraph level. It would also seem that the handling of writing direction in Pandoc's docx reader and writer is somewhat buggy.

However, here is a Lua filter which I have tested on converting those docxs with some rtl spans to Markdown. It simply strips any `dir` or `custom-style` attributes from any elements and then replaces any div or span elements which have (or now have) no attributes with their content, and it seems to work. If this doesn't work for you or isn't enough (e.g. you need to strip other attributes), please let me know.