public inbox archive for pandoc-discuss@googlegroups.com
* docs->md: avoiding custom styles and []{dir="ltr"}
@ 2019-08-31 12:31 Joseph Reagle
       [not found] ` <4b18be06-800b-88b4-7883-208fe7a6bcb8-T1oY19WcHSwdnm+yROfE0A@public.gmane.org>
  0 siblings, 1 reply; 6+ messages in thread
From: Joseph Reagle @ 2019-08-31 12:31 UTC (permalink / raw)
  To: pandoc-discuss

I'm converting Word docxs from a bunch of people to markdown. 

I want fairly simple markdown: headings, links, footnotes, italics, and bold. Which option could I use to avoid custom Word styles and every paragraph being in [...]{dir="ltr"}? (I think this has to do with lettering direction? All docs are in English, but people might have Word configured differently...)


-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/4b18be06-800b-88b4-7883-208fe7a6bcb8%40reagle.org.



* Re: docs->md: avoiding custom styles and []{dir="ltr"}
       [not found] ` <4b18be06-800b-88b4-7883-208fe7a6bcb8-T1oY19WcHSwdnm+yROfE0A@public.gmane.org>
@ 2019-08-31 16:49   ` John MacFarlane
  2019-09-01 16:19   ` Benct Philip Jonsson
  1 sibling, 0 replies; 6+ messages in thread
From: John MacFarlane @ 2019-08-31 16:49 UTC (permalink / raw)
  To: Joseph Reagle, pandoc-discuss


You can disable native_spans and bracketed_spans.
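
A minimal sketch of how that might look on the command line, assuming the hypothetical file names `input.docx` and `output.md`. `native_spans` and `bracketed_spans` are extensions of pandoc's Markdown format, so they are switched off on the output format with `-`:

```shell
# input.docx and output.md are placeholder names, not taken from the thread.
pandoc input.docx -f docx -t markdown-native_spans-bracketed_spans -o output.md
```

Depending on the pandoc version, spans may still be written as raw HTML `<span>` tags while the `raw_html` extension is enabled. Adding `-raw_html` to the format string is a further thing to try; that is an assumption, not something confirmed in this thread.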

If you're getting a lot of dir="ltr", it is because your Word
documents have these 'dir' annotations on every paragraph.
I'm not sure why they would if they're English documents.

I'm not sure under what circumstances Word includes these,
but we might consider being less aggressive about adding
them. They should only be necessary when the document language
is a RTL language.

I'd welcome feedback from others who have been getting these
ltr spans.

Joseph Reagle <joseph.2011-T1oY19WcHSwdnm+yROfE0A@public.gmane.org> writes:

> I'm converting Word docxs from a bunch of people to markdown. 
>
> I want fairly simple markdown: headings, links, footnotes, italics, and bold. Which option could I use to avoid custom Word styles and every paragraph being in [...]{dir="ltr"}? (I think this has to do with lettering direction? All docs are in English, but people might have Word configured differently...)



* Re: docs->md: avoiding custom styles and []{dir="ltr"}
       [not found] ` <4b18be06-800b-88b4-7883-208fe7a6bcb8-T1oY19WcHSwdnm+yROfE0A@public.gmane.org>
  2019-08-31 16:49   ` John MacFarlane
@ 2019-09-01 16:19   ` Benct Philip Jonsson
       [not found]     ` <cd34ed71-f1c2-8927-ce16-e099dad1ae0a-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  1 sibling, 1 reply; 6+ messages in thread
From: Benct Philip Jonsson @ 2019-09-01 16:19 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw,
	joseph.2011-T1oY19WcHSwdnm+yROfE0A

[-- Attachment #1: Type: text/plain, Size: 3276 bytes --]


On 2019-08-31 14:31, Joseph Reagle wrote:
> I'm converting Word docxs from a bunch of people to markdown.
>
> I want fairly simple markdown: headings, links, footnotes, italics,
> and bold. Which option could I use to avoid custom Word styles and every
> paragraph being in [...]{dir="ltr"}? (I think this has to do with
> lettering direction? All docs are in English, but people might have Word
> configured differently...)

Actually, those `dir="ltr"` attributes shouldn't be there unless the 
default text direction of the document and/or the paragraph is rtl 
(right-to-left): if the text direction of a text element agrees with 
the document-wide or paragraph-wide one, it should just be ignored.  I 
generated a docx document with some paragraphs and test spans decorated 
with an explicit `dir="ltr"` and then converted it back to Markdown. 
The `dir="ltr"` didn't appear in the roundtripped Markdown, apparently 
because ltr was the default anyway.  When I changed all instances of 
"ltr" to "rtl" in the original Markdown, the attributes came through 
both in the roundtripped Markdown and in the docx when opened in 
LibreOffice.
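
A roundtrip test along these lines could be sketched as follows. The file names and sample text are hypothetical; only the `dir` attribute syntax comes from the thread:

```shell
# roundtrip.md is a throwaway test file with one span and one div marked rtl.
cat > roundtrip.md <<'EOF'
A paragraph with [a marked span]{dir="rtl"} in it.

::: {dir="rtl"}
A whole paragraph marked with a direction.
:::
EOF

# Markdown -> docx, then docx -> Markdown printed to stdout.
pandoc roundtrip.md -o roundtrip.docx
pandoc roundtrip.docx -t markdown
```

Comparing the final output with `roundtrip.md` shows which `dir` attributes survive the trip.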

When I changed the default paragraph style to rtl and converted that to 
Markdown, the rtl attributes on whole paragraphs and character spans 
disappeared, which was what I expected. On the other hand, the last 
character of every paragraph got wrapped in a span with `dir="rtl"`, 
which would seem to be a bug!

Next I used the docx with the default paragraph style set to rtl as 
--reference-doc when creating another docx from Markdown in which some 
text and one paragraph were explicitly marked as ltr. The result was a 
docx where everything was rtl; marking one paragraph as ltr had no 
effect.  I then marked that same paragraph as ltr in LibreOffice, making 
sure that the default style was rtl.  When I converted that to Markdown, 
the paragraph explicitly marked as ltr looked normal, but in the 
paragraph with the default rtl the last character was again wrapped in 
an rtl span.

So it seems that your source docx has some paragraph styles set to rtl, 
with paragraphs manually marked as ltr, and that this for some reason 
results in some text being wrapped in ltr spans. I can't reproduce this 
using LibreOffice, though, since it only supports setting writing 
direction at the paragraph level.

It would also seem that handling of writing direction in Pandoc's docx 
reader and writer is somewhat buggy.

However, here is a Lua filter which I have tested by converting those 
docx files with some rtl spans to Markdown.  It simply strips any `dir` 
or `custom-style` attributes from any element and then replaces any div 
or span element that no longer has any attributes with its content, and 
it seems to work.  If this doesn't work for you or isn't enough (e.g. 
you need to strip other attributes), please let me know.


[-- Attachment #2: no-dir-custom-style.lua --]
[-- Type: text/x-lua, Size: 660 bytes --]

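-- Strip the `dir` and `custom-style` attributes from an element.
-- A Div or Span left with no identifier, classes, or attributes is
-- replaced by its content.
local function no_dir_attribute (elem)
    elem.attributes.dir = nil
    elem.attributes['custom-style'] = nil
    local is_wrapper = 'Div' == elem.t or 'Span' == elem.t
    if is_wrapper
        and "" == elem.identifier
        and 0 == #elem.classes
        and 0 == #elem.attributes
    then
        -- Returning the content list splices it in place of the wrapper.
        return elem.content
    end
    return elem
end

-- Apply the function to each of the element types listed below.
return {
    {
        CodeBlock = no_dir_attribute,
        Div = no_dir_attribute,
        Header = no_dir_attribute,
        Code = no_dir_attribute,
        Image = no_dir_attribute,
        Link = no_dir_attribute,
        Span = no_dir_attribute,
    }
}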
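
A sketch of how the attached filter might be applied, assuming it is saved as `no-dir-custom-style.lua` in the current directory; `input.docx` and `output.md` are placeholder names:

```shell
# Read the docx, run the Lua filter on the parsed document, write Markdown.
pandoc input.docx --lua-filter=no-dir-custom-style.lua -t markdown -o output.md
```

Because the filter runs on pandoc's internal document before the Markdown writer sees it, it does not depend on which span extensions are enabled for the output format.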


* Re: docs->md: avoiding custom styles and []{dir="ltr"}
       [not found]     ` <cd34ed71-f1c2-8927-ce16-e099dad1ae0a-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2019-09-02 13:41       ` Joseph Reagle
  2019-09-02 14:36       ` Joseph Reagle
  1 sibling, 0 replies; 6+ messages in thread
From: Joseph Reagle @ 2019-09-02 13:41 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

[-- Attachment #1: Type: text/plain, Size: 521 bytes --]


Hello Benct, thanks for your response! I didn't want to send the file to the whole list, but I've attached it here.


[-- Attachment #2: 1-02-benjakob-harrison.docx --]
[-- Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document, Size: 43257 bytes --]


* Re: docs->md: avoiding custom styles and []{dir="ltr"}
       [not found]     ` <cd34ed71-f1c2-8927-ce16-e099dad1ae0a-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  2019-09-02 13:41       ` Joseph Reagle
@ 2019-09-02 14:36       ` Joseph Reagle
       [not found]         ` <bf682166-7915-7d4d-cbe6-62e502a9bf11-T1oY19WcHSwdnm+yROfE0A@public.gmane.org>
  1 sibling, 1 reply; 6+ messages in thread
From: Joseph Reagle @ 2019-09-02 14:36 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw


On 9/1/19 12:19 PM, Benct Philip Jonsson wrote:
> Actually those `dir="ltr"` attributes shouldn't be there unless the
> document's and/or the paragraph's default text direction

I sent the docx file to Benct off list. That one was from someone who typically writes in Hebrew but the prose was English. The other examples came from non-native English speakers as well.



* Re: docs->md: avoiding custom styles and []{dir="ltr"}
       [not found]         ` <bf682166-7915-7d4d-cbe6-62e502a9bf11-T1oY19WcHSwdnm+yROfE0A@public.gmane.org>
@ 2019-09-02 17:26           ` John MacFarlane
  0 siblings, 0 replies; 6+ messages in thread
From: John MacFarlane @ 2019-09-02 17:26 UTC (permalink / raw)
  To: Joseph Reagle, pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw


See the new issue

https://github.com/jgm/pandoc/issues/5723

Joseph Reagle <joseph.2011-T1oY19WcHSwdnm+yROfE0A@public.gmane.org> writes:

> On 9/1/19 12:19 PM, Benct Philip Jonsson wrote:
>> Actually those `dir="ltr"` attributes shouldn't be there unless the
>> document's and/or the paragraph's default text direction
>
> I sent the docx file to Benct off list. That one was from someone who typically writes in Hebrew but the prose was English. The other examples came from non-native English speakers as well.

