public inbox archive for pandoc-discuss@googlegroups.com
From: Benct Philip Jonsson <melroch-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org,
	joseph.2011-T1oY19WcHSwdnm+yROfE0A@public.gmane.org
Subject: Re: docs->md: avoiding custom styles and []{dir="ltr"}
Date: Sun, 1 Sep 2019 18:19:04 +0200
Message-ID: <cd34ed71-f1c2-8927-ce16-e099dad1ae0a@gmail.com>
In-Reply-To: <4b18be06-800b-88b4-7883-208fe7a6bcb8-T1oY19WcHSwdnm+yROfE0A@public.gmane.org>

[-- Attachment #1: Type: text/plain, Size: 3276 bytes --]


On 2019-08-31 14:31, Joseph Reagle wrote:
> I'm converting Word docxs from a bunch of people to markdown.
>
> I want fairly simple markdown: headings, links, footnotes, italics,
> and bold. Which option could I use to avoid custom Word styles and every
> paragraph being in [...]{dir="ltr"}? (I think this has to do with
> lettering direction? All docs are in English, but people might have Word
> configured differently...)

Actually, those `dir="ltr"` attributes shouldn't be there unless the
default text direction of the document and/or the paragraph is rtl
(right-to-left): when the direction of a text element agrees with the
document-wide or paragraph-wide default, it should simply be ignored.
I generated a docx document with some paragraphs and text spans marked
with an explicit `dir="ltr"` and then converted it back to Markdown. The
`dir="ltr"` didn't appear in the roundtripped Markdown, apparently
because ltr was the default anyway. When I changed all instances of
"ltr" to "rtl" in the original Markdown, the attributes came through
both in the roundtripped Markdown and in the docx when opened in
LibreOffice.
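
A minimal sketch of such a roundtrip, for anyone who wants to repeat the experiment. The test file used for the message is not shown, so the file names and the Markdown content here are assumptions; only standard pandoc options are used, and the result may differ between pandoc versions.

```shell
# Write a small Markdown test file with explicit direction attributes.
# (roundtrip-test.md and the other file names are hypothetical.)
cat > roundtrip-test.md <<'EOF'
::: {dir="ltr"}
A paragraph inside a div marked left-to-right.
:::

A paragraph with a [marked span]{dir="ltr"} in it.
EOF

# Markdown -> docx -> Markdown
pandoc roundtrip-test.md -o roundtrip-test.docx
pandoc roundtrip-test.docx -t markdown -o roundtrip-back.md

# List any dir attributes that survived the roundtrip
grep -n 'dir=' roundtrip-back.md || echo "no dir attributes found"
```

Replacing both occurrences of `ltr` with `rtl` in the test file gives the second variant described above.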

When I changed the default paragraph style to rtl and converted that to
Markdown, the rtl attributes on whole paragraphs and character spans
disappeared, which was what I expected. OTOH the last character of every
paragraph got wrapped in a span with `dir="rtl"`, which would seem to be
a bug!

Next I used the docx with the default paragraph style set to rtl as
`--reference-doc` when creating another docx from Markdown in which some
text and one paragraph were explicitly marked as ltr. The result was a
docx where everything was rtl; marking one paragraph as ltr had no
effect. I then marked that same paragraph as ltr in LibreOffice, making
sure that the default style was still rtl. When I converted that to
Markdown, the paragraph explicitly marked as ltr looked normal, but in
the paragraph with the default rtl the last character was again wrapped
in an rtl span.

So it seems that your source docx has some paragraph styles set to rtl, 
with paragraphs manually marked as ltr, which for some reason results in 
some text being wrapped in ltr spans, although I can't reproduce this 
using LibreOffice, which only supports setting writing direction at the 
paragraph level.

It would also seem that handling of writing direction in Pandoc's docx 
reader and writer is somewhat buggy.

However, here is a Lua filter which I have tested by converting those
docxs with some rtl spans to Markdown. It simply strips any `dir` or
`custom-style` attributes from the elements it visits, and then replaces
any div or span element which is left with no identifier, classes or
attributes with its content. It seems to work. If this doesn't work for
you or isn't enough (e.g. you need to strip other attributes), please
let me know.
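
A sketch of how the filter might be applied to a whole directory of files, assuming the attachment has been saved as `no-dir-custom-style.lua` in the current directory. The `convert_docx` helper and the output file naming are hypothetical; `--from`, `--to`, `--lua-filter` and `--output` are standard pandoc options.

```shell
# Path to the attached filter (assumed to be saved in the current directory).
filter="no-dir-custom-style.lua"

# Hypothetical helper: $1 = input .docx, $2 = output .md
convert_docx() {
    pandoc --from=docx --to=markdown \
        --lua-filter="$filter" \
        --output="$2" "$1"
}

# Convert every .docx in the current directory to a .md file of the same name.
for f in *.docx; do
    [ -e "$f" ] || continue   # skip when the glob matches nothing
    convert_docx "$f" "${f%.docx}.md"
done
```

Keeping the pandoc call in one function makes it easy to add further options later without touching the loop.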


[-- Attachment #2: no-dir-custom-style.lua --]
[-- Type: text/x-lua, Size: 660 bytes --]

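-- Strip `dir` and `custom-style` attributes from an element. A Div or Span
-- left with no identifier, classes, or attributes is replaced by its content.
local function no_dir_attribute (elem)
    elem.attributes.dir = nil
    elem.attributes['custom-style'] = nil
    local is_wrapper = elem.t == 'Div' or elem.t == 'Span'
    if is_wrapper
        and elem.identifier == ""
        and #elem.classes == 0
        and #elem.attributes == 0
    then
        -- Returning the content list splices it in place of the wrapper.
        return elem.content
    end
    return elem
end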

return {
    {
        CodeBlock = no_dir_attribute,
        Div = no_dir_attribute,
        Header = no_dir_attribute,
        Code = no_dir_attribute,
        Image = no_dir_attribute,
        Link = no_dir_attribute,
        Span = no_dir_attribute,
    }
}

