From mboxrd@z Thu Jan 1 00:00:00 1970
From: Benct Philip Jonsson
Newsgroups: gmane.text.pandoc
Subject: Re: docs->md: avoiding custom styles and []{dir="ltr"}
Date: Sun, 1 Sep 2019 18:19:04 +0200
References: <4b18be06-800b-88b4-7883-208fe7a6bcb8@reagle.org>
In-Reply-To: <4b18be06-800b-88b4-7883-208fe7a6bcb8-T1oY19WcHSwdnm+yROfE0A@public.gmane.org>
Reply-To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org
To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org, joseph.2011-T1oY19WcHSwdnm+yROfE0A@public.gmane.org
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.8.0
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="------------0199E7C7C4B69F7981CD8DF8"
Content-Language: en-US

This is a multi-part message in MIME format.
--------------0199E7C7C4B69F7981CD8DF8
Content-Type: text/plain; charset="UTF-8"; format=flowed

On 2019-08-31 14:31, Joseph Reagle wrote:
> I'm converting Word docxs from a bunch of people to markdown.
>
> I want fairly simple markdown: headings, links, footnotes, italics, and bold. Which option could I use to avoid custom Word styles and every paragraph being in [...]{dir="ltr"}? (I think this has to do with lettering direction? All docs are in English, but people might have Word configured differently...)

Actually, those `dir="ltr"` attributes shouldn't be there unless the default text direction of the document and/or the paragraph is rtl (right-to-left): if the direction of a text element agrees with the document-wide or paragraph-wide default, it should simply be ignored.

I generated a docx document with some paragraphs and test spans decorated with an explicit `dir=ltr` and then converted it back to Markdown. The `dir=ltr` didn't appear in the roundtripped Markdown, apparently because ltr was the default anyway.

When I changed all instances of "ltr" to "rtl" in the original Markdown, the attributes came through both in the roundtripped Markdown and in the docx when opened in LibreOffice.

When I changed the default paragraph style to rtl and converted that docx to Markdown, the rtl attributes on whole paragraphs and character spans disappeared, which was what I expected. On the other hand, the last character of every paragraph got wrapped in a span with `dir="rtl"`, which would seem to be a bug!

Next I used the docx with the default paragraph style set to rtl as --reference-doc when creating another docx from Markdown in which some text and one paragraph were explicitly marked as ltr. The result was a docx where everything was rtl; marking one paragraph as ltr had no effect.

I then marked that same paragraph as ltr in LibreOffice, making sure that the default style was rtl. When I converted that to Markdown, the paragraph explicitly marked as ltr looked normal, but in the paragraph with the default rtl the last character was again wrapped in an rtl span.

So it seems that your source docx has some paragraph styles set to rtl, with paragraphs manually marked as ltr, which for some reason results in some text being wrapped in ltr spans. I can't reproduce this with LibreOffice, though, since it only supports setting the writing direction at the paragraph level. It would also seem that the handling of writing direction in Pandoc's docx reader and writer is somewhat buggy.

However, attached is a Lua filter which I have tested by converting those docxs with some rtl spans to Markdown. It simply strips any `dir` or `custom-style` attributes from all elements, and then replaces any div or span element which has no identifier, classes or attributes left with its content. It seems to work.

If this doesn't work for you or isn't enough (e.g. you need to strip other attributes), please let me know.
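
For illustration, here is a minimal sketch of the same stripping logic run on plain Lua tables, outside pandoc. The mock element tables and the name `strip_dir_and_style` are made up for this example. It checks for leftover attributes with `next()` rather than `#`, because the mocks are ordinary key/value tables; real pandoc elements may behave differently.

```lua
-- Plain-Lua sketch; no pandoc required. The tables passed in below are
-- hypothetical stand-ins for pandoc elements, not real pandoc objects.

-- Strip `dir` and `custom-style`; unwrap a Div/Span that has nothing left.
local function strip_dir_and_style(elem)
  elem.attributes.dir = nil
  elem.attributes['custom-style'] = nil
  local is_wrapper = elem.t == 'Div' or elem.t == 'Span'
  if is_wrapper
     and elem.identifier == ''
     and #elem.classes == 0
     and next(elem.attributes) == nil then
    return elem.content   -- the wrapper is replaced by its content
  end
  return elem
end

-- A span carrying only dir="ltr" is unwrapped to its content.
local bare = strip_dir_and_style{
  t = 'Span', identifier = '', classes = {},
  attributes = { dir = 'ltr' }, content = { 'some text' },
}

-- A span that also has a class keeps its wrapper but loses `dir`.
local classed = strip_dir_and_style{
  t = 'Span', identifier = '', classes = { 'smallcaps' },
  attributes = { dir = 'ltr' }, content = { 'more text' },
}

print(bare[1])                             -- some text
print(classed.t, classed.attributes.dir)   -- Span    nil
```

With real documents, the attached filter would instead be passed to pandoc through its `--lua-filter` option.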
--------------0199E7C7C4B69F7981CD8DF8
Content-Type: text/x-lua; name="no-dir-custom-style.lua"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="no-dir-custom-style.lua"

-- Remove `dir` and `custom-style` attributes from an element.
-- A Div or Span left without identifier, classes or attributes
-- is replaced by its content.
local function no_dir_attribute (elem)
  elem.attributes.dir = nil
  elem.attributes['custom-style'] = nil
  if 'Div' == elem.t or 'Span' == elem.t then
    if "" == elem.identifier
       and 0 == #elem.classes
       and 0 == #elem.attributes then
      return elem.content
    end
  end
  return elem
end

-- Apply the function to every element type that carries attributes.
return {
  {
    CodeBlock = no_dir_attribute,
    Div = no_dir_attribute,
    Header = no_dir_attribute,
    Code = no_dir_attribute,
    Image = no_dir_attribute,
    Link = no_dir_attribute,
    Span = no_dir_attribute,
  }
}

--------------0199E7C7C4B69F7981CD8DF8--