* docs->md: avoiding custom styles and []{dir="ltr"}
@ 2019-08-31 12:31 Joseph Reagle
[not found] ` <4b18be06-800b-88b4-7883-208fe7a6bcb8-T1oY19WcHSwdnm+yROfE0A@public.gmane.org>
0 siblings, 1 reply; 6+ messages in thread
From: Joseph Reagle @ 2019-08-31 12:31 UTC (permalink / raw)
To: pandoc-discuss
I'm converting Word docxs from a bunch of people to markdown.
I want fairly simple markdown: headings, links, footnotes, italics, and bold. Which option could I use to avoid custom Word styles and every paragraph being in [...]{dir="ltr"}? (I think this has to do with lettering direction? All docs are in English, but people might have Word configured differently...)
--
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/4b18be06-800b-88b4-7883-208fe7a6bcb8%40reagle.org.
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: docs->md: avoiding custom styles and []{dir="ltr"}
[not found] ` <4b18be06-800b-88b4-7883-208fe7a6bcb8-T1oY19WcHSwdnm+yROfE0A@public.gmane.org>
@ 2019-08-31 16:49 ` John MacFarlane
2019-09-01 16:19 ` Benct Philip Jonsson
1 sibling, 0 replies; 6+ messages in thread
From: John MacFarlane @ 2019-08-31 16:49 UTC (permalink / raw)
To: Joseph Reagle, pandoc-discuss
You can disable native_spans and bracketed_spans.
If you're getting a lot of dir="ltr", it is because your Word
documents have these 'dir' annotations on every paragraph.
I'm not sure why they would if they're English documents.
I'm not sure under what circumstances Word includes these,
but we might consider being less aggressive about adding
them. They should only be necessary when the document language
is a RTL language.
I'd welcome feedback from others who have been getting these
ltr spans.
Joseph Reagle <joseph.2011-T1oY19WcHSwdnm+yROfE0A@public.gmane.org> writes:
> I'm converting Word docxs from a bunch of people to markdown.
>
> I want fairly simple markdown: headings, links, footnotes, italics, and bold. Which option could I use to avoid custom Word styles and every paragraph being in [...]{dir="ltr"}? (I think this has to do with lettering direction? All docs are in English, but people might have Word configured differently...)
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: docs->md: avoiding custom styles and []{dir="ltr"}
[not found] ` <4b18be06-800b-88b4-7883-208fe7a6bcb8-T1oY19WcHSwdnm+yROfE0A@public.gmane.org>
2019-08-31 16:49 ` John MacFarlane
@ 2019-09-01 16:19 ` Benct Philip Jonsson
[not found] ` <cd34ed71-f1c2-8927-ce16-e099dad1ae0a-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
1 sibling, 1 reply; 6+ messages in thread
From: Benct Philip Jonsson @ 2019-09-01 16:19 UTC (permalink / raw)
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw,
joseph.2011-T1oY19WcHSwdnm+yROfE0A
[-- Attachment #1: Type: text/plain, Size: 3276 bytes --]
On 2019-08-31 14:31, Joseph Reagle wrote:
> I'm converting Word docxs from a bunch of people to markdown.
>
> I want fairly simple markdown: headings, links, footnotes, italics,
> and bold. Which option could I use to avoid custom Word styles and every
> paragraph being in [...]{dir="ltr"}? (I think this has to do with
> lettering direction? All docs are in English, but people might have Word
> configured differently...)
Actually those `dir="ltr"` attributes shouldn't be there unless the
default text direction of the document and/or the paragraph is rtl
(right-to-left): if the direction of a text element agrees with the
document-wide or paragraph-wide default, it should simply be ignored.
I generated a docx document with some paragraphs and text spans
decorated with an explicit `dir=ltr` and then converted it back to
Markdown. The `dir=ltr` didn't appear in the roundtripped Markdown,
apparently because ltr was the default anyway. When I changed all
instances of "ltr" to "rtl" in the original Markdown, the attributes
came through both in the roundtripped Markdown and in the docx when
opened in LibreOffice.
When I changed the default paragraph style to rtl and converted that to
Markdown the rtl attributes on whole paragraphs and character spans
disappeared, which was what I expected. OTOH the last character of every
paragraph got wrapped in a span with `dir="rtl"` which would seem to be
a bug!
Next I used the docx with the default paragraph style set to rtl as
--reference-doc when creating another docx from Markdown in which some
text and one paragraph were explicitly marked as ltr. The result was a
docx where everything was rtl; my marking one paragraph as ltr had no
effect. I then marked that same paragraph as ltr in LibreOffice, making
sure that the default style was rtl. When I converted that to Markdown,
the paragraph explicitly marked as ltr looked normal, but in the
paragraph with the default rtl the last character was again wrapped in
an rtl span.
So it seems that your source docx has some paragraph styles set to rtl,
with paragraphs manually marked as ltr, which for some reason results in
some text being wrapped in ltr spans, although I can't reproduce this
using LibreOffice, which only supports setting writing direction at the
paragraph level.
It would also seem that handling of writing direction in Pandoc's docx
reader and writer is somewhat buggy.
However, here is a Lua filter which I have tested by converting those
docx files with some rtl spans to Markdown. It simply strips any `dir`
or `custom-style` attributes from the elements it visits and then
replaces any div or span element which is left with no identifier,
classes, or attributes with its content, and it seems to work. If this
doesn't work for you or isn't enough (e.g. you need to strip other
attributes), please let me know.
[-- Attachment #2: no-dir-custom-style.lua --]
[-- Type: text/x-lua, Size: 660 bytes --]
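-- Strip `dir` and `custom-style` attributes from an element. A Div or
-- Span left with no identifier, classes, or attributes is replaced by
-- its content.
local function no_dir_attribute (elem)
  elem.attributes.dir = nil
  elem.attributes['custom-style'] = nil
  if 'Div' == elem.t or 'Span' == elem.t then
    if "" == elem.identifier then
      if 0 == #elem.classes then
        if 0 == #elem.attributes then
          -- Nothing left on the wrapper: splice its content in place.
          return elem.content
        end
      end
    end
  end
  return elem
end

-- Apply the function to each element type handled by this filter.
return {
  {
    CodeBlock = no_dir_attribute,
    Div = no_dir_attribute,
    Header = no_dir_attribute,
    Code = no_dir_attribute,
    Image = no_dir_attribute,
    Link = no_dir_attribute,
    Span = no_dir_attribute,
  }
}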
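
A possible variant of the filter above, offered only as a sketch: it removes `dir` only when the value is `ltr`, on the assumption that an explicit `rtl` may be meaningful and worth keeping, and it takes the attributes that are always stripped from a list so that others can be added. The names `ALWAYS_STRIP`, `is_empty`, and `strip_redundant` are illustrative and do not come from the thread. It has not been tested against the docx files discussed here.

```lua
-- Sketch of a variant filter; all names below are illustrative.

-- Attributes removed unconditionally. Extend this list as needed.
local ALWAYS_STRIP = { 'custom-style' }

-- True when an attribute collection has no entries. Iterating with
-- pairs avoids relying on the length operator.
local function is_empty (attributes)
  for _ in pairs(attributes) do
    return false
  end
  return true
end

local function strip_redundant (elem)
  for _, name in ipairs(ALWAYS_STRIP) do
    elem.attributes[name] = nil
  end
  -- Assumption: only `ltr` is redundant; an explicit `rtl` is kept.
  if elem.attributes.dir == 'ltr' then
    elem.attributes.dir = nil
  end
  local is_wrapper = elem.t == 'Div' or elem.t == 'Span'
  if is_wrapper
      and elem.identifier == ''
      and #elem.classes == 0
      and is_empty(elem.attributes) then
    -- Bare wrapper: replace it with its content.
    return elem.content
  end
  return elem
end

-- When a filter script returns nothing, pandoc collects global
-- functions named after element types.
Div = strip_redundant
Span = strip_redundant
Header = strip_redundant
```

Saved under a hypothetical name such as `strip-redundant.lua`, it would be passed to pandoc with `--lua-filter=strip-redundant.lua`, in the same way as the attached filter.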
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: docs->md: avoiding custom styles and []{dir="ltr"}
[not found] ` <cd34ed71-f1c2-8927-ce16-e099dad1ae0a-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2019-09-02 13:41 ` Joseph Reagle
2019-09-02 14:36 ` Joseph Reagle
1 sibling, 0 replies; 6+ messages in thread
From: Joseph Reagle @ 2019-09-02 13:41 UTC (permalink / raw)
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw
[-- Attachment #1: Type: text/plain, Size: 521 bytes --]
Hello Benct, thanks for your response! I didn't want to send the file to the whole list, but I've attached it here.
[-- Attachment #2: 1-02-benjakob-harrison.docx --]
[-- Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document, Size: 43257 bytes --]
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: docs->md: avoiding custom styles and []{dir="ltr"}
[not found] ` <cd34ed71-f1c2-8927-ce16-e099dad1ae0a-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2019-09-02 13:41 ` Joseph Reagle
@ 2019-09-02 14:36 ` Joseph Reagle
[not found] ` <bf682166-7915-7d4d-cbe6-62e502a9bf11-T1oY19WcHSwdnm+yROfE0A@public.gmane.org>
1 sibling, 1 reply; 6+ messages in thread
From: Joseph Reagle @ 2019-09-02 14:36 UTC (permalink / raw)
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw
On 9/1/19 12:19 PM, Benct Philip Jonsson wrote:
> Actually those `dir="ltr"` attributes shouldn't be there unless the
> document's and/or the paragraph's default text direction
I sent the docx file to Benct off list. That one was from someone who typically writes in Hebrew but the prose was English. The other examples came from non-native English speakers as well.
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: docs->md: avoiding custom styles and []{dir="ltr"}
[not found] ` <bf682166-7915-7d4d-cbe6-62e502a9bf11-T1oY19WcHSwdnm+yROfE0A@public.gmane.org>
@ 2019-09-02 17:26 ` John MacFarlane
0 siblings, 0 replies; 6+ messages in thread
From: John MacFarlane @ 2019-09-02 17:26 UTC (permalink / raw)
To: Joseph Reagle, pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw
See the new issue
https://github.com/jgm/pandoc/issues/5723
Joseph Reagle <joseph.2011-T1oY19WcHSwdnm+yROfE0A@public.gmane.org> writes:
> On 9/1/19 12:19 PM, Benct Philip Jonsson wrote:
>> Actually those `dir="ltr"` attributes shouldn't be there unless the
>> document's and/or the paragraph's default text direction
>
> I sent the docx file to Benct off list. That one was from someone who typically writes in Hebrew but the prose was English. The other examples came from non-native English speakers as well.
^ permalink raw reply [flat|nested] 6+ messages in thread