* Re: docs->md: avoiding custom styles and []{dir="ltr"}
[not found] ` <4b18be06-800b-88b4-7883-208fe7a6bcb8-T1oY19WcHSwdnm+yROfE0A@public.gmane.org>
@ 2019-08-31 16:49 ` John MacFarlane
2019-09-01 16:19 ` Benct Philip Jonsson
1 sibling, 0 replies; 6+ messages in thread
From: John MacFarlane @ 2019-08-31 16:49 UTC (permalink / raw)
To: Joseph Reagle, pandoc-discuss
You can disable native_spans and bracketed_spans.
If you're getting a lot of dir="ltr", it is because your Word
documents have these 'dir' annotations on every paragraph.
I'm not sure why they would if they're English documents.
I'm not sure under what circumstances Word includes these,
but we might consider being less aggressive about adding
them. They should only be necessary when the document language
is a RTL language.
I'd welcome feedback from others who have been getting these
ltr spans.
Joseph Reagle <joseph.2011-T1oY19WcHSwdnm+yROfE0A@public.gmane.org> writes:
> I'm converting Word docxs from a bunch of people to markdown.
>
> I want fairly simple markdown: headings, links, footnotes, italics, and bold. Which option could I use to avoid custom Word styles and every paragraph being in [...]{dir="ltr"}? (I think this has to do with lettering direction? All docs are in English, but people might have Word configured differently...)
>
>
> --
> You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
> To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/4b18be06-800b-88b4-7883-208fe7a6bcb8%40reagle.org.
* Re: docs->md: avoiding custom styles and []{dir="ltr"}
[not found] ` <4b18be06-800b-88b4-7883-208fe7a6bcb8-T1oY19WcHSwdnm+yROfE0A@public.gmane.org>
2019-08-31 16:49 ` John MacFarlane
@ 2019-09-01 16:19 ` Benct Philip Jonsson
[not found] ` <cd34ed71-f1c2-8927-ce16-e099dad1ae0a-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
1 sibling, 1 reply; 6+ messages in thread
From: Benct Philip Jonsson @ 2019-09-01 16:19 UTC (permalink / raw)
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw,
joseph.2011-T1oY19WcHSwdnm+yROfE0A
[-- Attachment #1: Type: text/plain, Size: 3276 bytes --]
On 2019-08-31 14:31, Joseph Reagle wrote:
> I'm converting Word docxs from a bunch of people to markdown.
>
> I want fairly simple markdown: headings, links, footnotes, italics,
> and bold. Which option could I use to avoid custom Word styles and every
> paragraph being in [...]{dir="ltr"}? (I think this has to do with
> lettering direction? All docs are in English, but people might have Word
> configured differently...)
>
>
Actually those `dir="ltr"` attributes shouldn't be there unless the
default text direction of the document and/or the paragraph is rtl
(right-to-left): if the direction of a text element agrees with the
document-wide or paragraph-wide default, it should just be ignored.
I generated a docx document with some paragraphs and test spans
decorated with an explicit `dir="ltr"` and then converted it back to
Markdown, and the `dir="ltr"` didn't appear in the roundtripped
Markdown, apparently because ltr was the default anyway. When I changed
all instances of "ltr" to "rtl" in the original Markdown, the attributes
came through both in the roundtripped Markdown and in the docx when
opened in LibreOffice.
When I changed the default paragraph style to rtl and converted that to
Markdown the rtl attributes on whole paragraphs and character spans
disappeared, which was what I expected. OTOH the last character of every
paragraph got wrapped in a span with `dir="rtl"` which would seem to be
a bug!
Next I used the docx with the default paragraph style set to rtl as
--reference-doc when creating another docx from Markdown in which some
text and one paragraph were explicitly marked as ltr. The result was a
docx where everything was rtl; marking one paragraph as ltr had no
effect. I then marked that same paragraph as ltr in LibreOffice, making
sure that the default style was rtl. When I converted that to Markdown,
the paragraph explicitly marked as ltr looked normal, but in the
paragraph with the default rtl the last character was again wrapped in
an rtl span.
So it seems that your source docx has some paragraph styles set to rtl,
with paragraphs manually marked as ltr, which for some reason results in
some text being wrapped in ltr spans, although I can't reproduce this
using LibreOffice, which only supports setting writing direction at the
paragraph level.
It would also seem that handling of writing direction in Pandoc's docx
reader and writer is somewhat buggy.
However here is a Lua filter which I have tested on converting those
docxs with some rtl spans to Markdown. It simply strips any `dir` or
`custom-style` attributes from any elements and then also replaces any
div or span elements which (no longer) have any attributes with their
content, and it seems to work. If this doesn't work for you or isn't
enough --- e.g. you need to strip other attributes --- please let me know.
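Applying the attached filter could look like this. The sketch assumes the attachment has been saved as `no-dir-custom-style.lua` in the current directory; `report.docx` and `report.md` are placeholder names.

```shell
filter=no-dir-custom-style.lua   # the attachment, saved locally
input=report.docx                # placeholder input file
output=report.md                 # placeholder output file

if [ -f "$filter" ] && [ -f "$input" ]; then
  # Run the docx -> Markdown conversion through the Lua filter.
  pandoc "$input" -f docx -t markdown --lua-filter="$filter" -o "$output"
else
  echo "expected $filter and $input in the current directory" >&2
fi
```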
[-- Attachment #2: no-dir-custom-style.lua --]
[-- Type: text/x-lua, Size: 660 bytes --]
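-- Strip `dir` and `custom-style` attributes from an element, and
-- unwrap Divs and Spans that are left with no identifier, classes,
-- or attributes.
local function no_dir_attribute (elem)
  elem.attributes.dir = nil
  elem.attributes['custom-style'] = nil
  if 'Div' == elem.t or 'Span' == elem.t then
    if "" == elem.identifier then
      if 0 == #elem.classes then
        if 0 == #elem.attributes then
          -- Nothing is left on the wrapper: replace it with its content.
          return elem.content
        end
      end
    end
  end
  return elem
end

-- Apply the function to every element type handled by this filter.
return {
  {
    CodeBlock = no_dir_attribute,
    Div = no_dir_attribute,
    Header = no_dir_attribute,
    Code = no_dir_attribute,
    Image = no_dir_attribute,
    Link = no_dir_attribute,
    Span = no_dir_attribute,
  }
}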