From mboxrd@z Thu Jan 1 00:00:00 1970 X-Msuck: nntp://news.gmane.io/gmane.text.pandoc/20268 Path: news.gmane.org!.POSTED!not-for-mail From: pandoc-only-Mmb7MZpHnFY@public.gmane.org Newsgroups: gmane.text.pandoc Subject: Re: docx (Word) reader and complex fields Date: Mon, 28 May 2018 17:09:41 -0700 (PDT) Message-ID: <98b3a76c-6c31-412d-bf10-10218975e66a@googlegroups.com> References: <7fcad0d6-d3d4-4239-8976-ac51ce85d475@googlegroups.com> Reply-To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org NNTP-Posting-Host: blaine.gmane.org Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="----=_Part_23329_293685910.1527552581277" X-Trace: blaine.gmane.org 1527552459 14455 195.159.176.226 (29 May 2018 00:07:39 GMT) X-Complaints-To: usenet@blaine.gmane.org NNTP-Posting-Date: Tue, 29 May 2018 00:07:39 +0000 (UTC) To: pandoc-discuss Original-X-From: pandoc-discuss+bncBDDNZVXG5MJRBSNUWLMAKGQEIQGX2KY-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org Tue May 29 02:07:35 2018 Return-path: Envelope-to: gtp-pandoc-discuss@m.gmane.org Original-Received: from mail-ot0-f192.google.com ([74.125.82.192]) by blaine.gmane.org with esmtp (Exim 4.84_2) (envelope-from ) id 1fNSAp-0003fd-Fq for gtp-pandoc-discuss@m.gmane.org; Tue, 29 May 2018 02:07:35 +0200 Original-Received: by mail-ot0-f192.google.com with SMTP id l3-v6sf8861984otk.4 for ; Mon, 28 May 2018 17:09:47 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=googlegroups.com; s=20161025; h=sender:date:from:to:message-id:in-reply-to:references:subject :mime-version:x-original-sender:reply-to:precedence:mailing-list :list-id:list-post:list-help:list-archive:list-subscribe :list-unsubscribe; bh=kUDohfHhjtrDr76llTJFooFR3aD1Bj0wv3mtE6znPGA=; b=PbpZ9Q9fawyDva8nC7i+stfpnXgd5oDRAgmxO2VBOlXho9D4MXtRenYgQoHwAO5Cqr XsD1ShOH4RssfcU34wQs8CHm4RNXbswENZey+Zj+Nftg3BJ4TvSgNYAIy8qF3F08qCS9 RH6Glfrg6MtTfE+qvZwRODABQ8Czx+UTu6YvtXao23MPFPV00rNAMJSXx6uRS/5Teozl jgtj5m7WgMgyZbSsJM8kmWe19xTQ8gEyL2HPB6G9ChtUVabuRcaNpbBsPxu7lZDVmud3 
+rgSalY4tfbwNfKxViEHFntvbGgOdpnLtdOO4PzAY0eDTEukQD1J1qbmbCkM4CDHIXZk g0GA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=sender:x-gm-message-state:date:from:to:message-id:in-reply-to :references:subject:mime-version:x-original-sender:reply-to :precedence:mailing-list:list-id:x-spam-checked-in-group:list-post :list-help:list-archive:list-subscribe:list-unsubscribe; bh=kUDohfHhjtrDr76llTJFooFR3aD1Bj0wv3mtE6znPGA=; b=tsI0f3epCPpDu9kopUJ6s3I+4R4+xRLDfP7t/VYICPCnU2KIZjBsqBfhlR5FKwxQx5 8DdV3ViFNjnoIZ91mNbcyXAr1raAFCi7afX3V1hVnABAAea0pKPZYEEj5677jRVuAOik FxdXfEnH6CpgnyxAYkFT+VDE1BAmEJcTMVrwA9oOlA0GavCp6m3RcFFQjdEPCReAJjgp kHdoWRdNRNGEwFqzLFMUElfwYOq5oKB10hI3rtMUcEWe4zSgfkYvvw7I7iLjxYGG+Ra4 uWn1xdZIS6/PcKwJUY2OEI1busoQ86fhsFKUwCfQNn8HL5HuQlGLdMtRcCG3c+bfxX3C Ohxw== Original-Sender: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org X-Gm-Message-State: ALKqPwcXb+hFYm2d1CBOmiDMftWra7RmpJ2JdGJN4dRRPKPupLTr7x8u aAm/kyl9eHhaJrdpzmFm694= X-Google-Smtp-Source: ADUXVKIeIpWQ/hNKQ38gJT97z7ld5kFC1DaB+PAycizo5NYK34NeYGs/kPdO/2GXSY4b4dV4rmLRQA== X-Received: by 2002:a9d:4503:: with SMTP id w3-v6mr1134772ote.7.1527552586194; Mon, 28 May 2018 17:09:46 -0700 (PDT) X-BeenThere: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org Original-Received: by 2002:a9d:25c7:: with SMTP id q65-v6ls18065476ota.2.gmail; Mon, 28 May 2018 17:09:43 -0700 (PDT) X-Received: by 2002:a9d:5887:: with SMTP id x7-v6mr146917otg.14.1527552581797; Mon, 28 May 2018 17:09:41 -0700 (PDT) In-Reply-To: <7fcad0d6-d3d4-4239-8976-ac51ce85d475-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org> X-Original-Sender: pandoc-only-Mmb7MZpHnFY@public.gmane.org Precedence: list Mailing-list: list pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org; contact pandoc-discuss+owners-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org List-ID: X-Spam-Checked-In-Group: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org X-Google-Group-Id: 1007024079513 List-Post: , List-Help: , List-Archive: , 
List-Unsubscribe: ,
Xref: news.gmane.org gmane.text.pandoc:20268
Archived-At:

------=_Part_23329_293685910.1527552581277
Content-Type: multipart/alternative; boundary="----=_Part_23330_19366164.1527552581278"

------=_Part_23330_19366164.1527552581278
Content-Type: text/plain; charset="UTF-8"

Turning this into a standalone test case was a worthwhile exercise. It turns out that my "other things" triggered this bug; without them, the output looks fine, so it is arguably not a bug.

The other things were the multi-paragraph complex fields (https://groups.google.com/forum/#!topic/pandoc-discuss/ItyWCB9HgKI), which may be a special-interest topic, since my only example of them is a Citavi bibliography.

On Monday, May 28, 2018 at 5:22:42 PM UTC+2, Francesco Occhipinti wrote:
>
> Probably this deserves to be turned into an issue and related pull request
> if you didn't already. Ideally the pull request would include its unit test
>
> On Thursday, May 24, 2018 at 11:06:34 PM UTC+2, pando...-Mmb7MZpHnFY@public.gmane.org wrote:
>>
>> Text/Pandoc/Readers/Docx/Parse.hs handles complex fields as if the
>> "separate" field char and subsequent runs were obligatory, but they are
>> optional.
>>
>> While working on other things, I came across this snippet of docx XML
>> (simplified here):
>>
>> <w:r><w:fldChar w:fldCharType="begin"/></w:r>
>> <w:r>
>>   <w:instrText xml:space="preserve"> SEQ CHAPTER \h \r 1</w:instrText>
>> </w:r>
>> <w:r><w:fldChar w:fldCharType="end"/></w:r>
>>
>> ECMA-376-1:2016, 17.16.2 (p. 1165), however, marks these parts as
>> optional.
>>
>> Parsing this, the FldCharState went from Closed -begin-> Open
>> -instr-> FieldInfo and then ignored the end. Only when a later field
>> contained a separator would it continue to -separate-> CharContent -end->
>> Closed.
>>
>> As a fix, I added a case when end is encountered in the FieldInfo state:
>> Fixed a bug in complex field handling: separator fields and rendition
>> runs are optional.
>>
>> ---
>>  src/Text/Pandoc/Readers/Docx/Parse.hs | 5 ++++-
>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/src/Text/Pandoc/Readers/Docx/Parse.hs b/src/Text/Pandoc/Readers/Docx/Parse.hs
>> index 221260f42..b5226a95a 100644
>> --- a/src/Text/Pandoc/Readers/Docx/Parse.hs
>> +++ b/src/Text/Pandoc/Readers/Docx/Parse.hs
>> @@ -830,9 +830,12 @@ elemToParPart ns element
>>          FldCharClosed | fldCharType == "begin" -> do
>>            modify $ \st -> st {stateFldCharState = FldCharOpen}
>>            return NullParPart
>> -        FldCharFieldInfo info | fldCharType == "separate" -> do
>> +        FldCharFieldInfo info | fldCharType == "separate" -> do -- optional separator before rendition
>>            modify $ \st -> st {stateFldCharState = FldCharContent info []}
>>            return NullParPart
>> +        FldCharFieldInfo info | fldCharType == "end" -> do -- direct end, without rendition
>> +          modify $ \st -> st {stateFldCharState = FldCharClosed}
>> +          return $ Field info []
>>          FldCharContent info runs | fldCharType == "end" -> do -- fxg: End in same par
>>            modify $ \st -> st {stateFldCharState = FldCharClosed}
>>            return $ Field info $ reverse runs
>> -- 
>> 2.11.0
>>
>> I tested that with the current pandoc git master HEAD, versioned 2.2.1
>>
>

--
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To post to this group, send email to pandoc-discuss-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/98b3a76c-6c31-412d-bf10-10218975e66a%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

------=_Part_23330_19366164.1527552581278
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Turning this into a standalone test case was a worthwhile exercise. It turns out that my "other things" triggered this bug; without them, the output looks fine, so it is arguably not a bug.

The other things were the multi-paragraph complex fields (https://groups.google.com/forum/#!topic/pandoc-discuss/ItyWCB9HgKI), which may be a special-interest topic, since my only example of them is a Citavi bibliography.

On Monday, May 28, 2018 at 5:22:42 PM UTC+2, Francesco Occhipinti wrote:
Probably this deserves to be turned into an issue and related pull request if you didn't already. Ideally the pull request would include its unit test

On Thursday, May 24, 2018 at 11:06:34 PM UTC+2, pando...-Mmb7MZpHnFY@public.gmane.org wrote:
Text/Pandoc/Readers/Docx/Parse.hs handles complex fields as if the "separate" field char and subsequent runs were obligatory, but they are optional.

While working on other things, I came across this snippet of docx XML (simplified here):
<w:r><w:fldChar w:fldCharType="begin"/></w:r>
<w:r>
  <w:instrText xml:space="preserve"> SEQ CHAPTER \h \r 1</w:instrText>
</w:r>
<w:r><w:fldChar w:fldCharType="end"/></w:r>

ECMA-376-1:2016, 17.16.2 (p. 1165), however, marks these parts as optional.

Parsing this, the FldCharState went from Closed -begin-> Open -instr-> FieldInfo and then ignored the end. Only when a later field contained a separator would it continue to -separate-> CharContent -end-> Closed.

As a fix, I added a case when end is encountered in the FieldInfo state:

Fixed a bug in complex field handling: separator fields and rendition runs are optional.

---
 src/Text/Pandoc/Readers/Docx/Parse.hs | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/src/Text/Pandoc/Readers/Docx/Parse.hs b/src/Text/Pandoc/Readers/Docx/Parse.hs
index 221260f42..b5226a95a 100644
--- a/src/Text/Pandoc/Readers/Docx/Parse.hs
+++ b/src/Text/Pandoc/Readers/Docx/Parse.hs
@@ -830,9 +830,12 @@ elemToParPart ns element
         FldCharClosed | fldCharType == "begin" -> do
           modify $ \st -> st {stateFldCharState = FldCharOpen}
           return NullParPart
-        FldCharFieldInfo info | fldCharType == "separate" -> do
+        FldCharFieldInfo info | fldCharType == "separate" -> do -- optional separator before rendition
           modify $ \st -> st {stateFldCharState = FldCharContent info []}
           return NullParPart
+        FldCharFieldInfo info | fldCharType == "end" -> do -- direct end, without rendition
+          modify $ \st -> st {stateFldCharState = FldCharClosed}
+          return $ Field info []
         FldCharContent info runs | fldCharType == "end" -> do -- fxg: End in same par
           modify $ \st -> st {stateFldCharState = FldCharClosed}
           return $ Field info $ reverse runs
-- 
2.11.0

I tested that with the current pandoc git master HEAD, versioned 2.2.1
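
For anyone who wants to look at the transitions described above in isolation, here is a minimal sketch in Haskell. It is not pandoc's code. The Event type, the step and runFields functions, the standalone Field type, and the use of plain strings for instruction text and runs are all assumptions made for illustration; only the four FldCharState constructor names are taken from the patch. The real reader does this inside elemToParPart with a State monad.

```haskell
-- Hypothetical, simplified model of the complex-field state machine.
-- Only the FldCharState constructor names come from the patch; everything
-- else here is made up for illustration.

-- The field-related things the reader can meet inside a paragraph.
data Event
  = Begin          -- <w:fldChar w:fldCharType="begin"/>
  | Instr String   -- <w:instrText>...</w:instrText>
  | Separate       -- <w:fldChar w:fldCharType="separate"/> (optional)
  | Run String     -- a rendition run between "separate" and "end"
  | End            -- <w:fldChar w:fldCharType="end"/>
  deriving (Eq, Show)

data FldCharState
  = FldCharClosed
  | FldCharOpen
  | FldCharFieldInfo String
  | FldCharContent String [String]   -- runs are collected in reverse order
  deriving (Eq, Show)

-- A completed field: instruction text plus its rendition runs (maybe none).
data Field = Field String [String]
  deriving (Eq, Show)

-- One transition: the new state, plus a field if this event completed one.
step :: FldCharState -> Event -> (FldCharState, Maybe Field)
step FldCharClosed         Begin     = (FldCharOpen, Nothing)
step FldCharOpen           (Instr i) = (FldCharFieldInfo i, Nothing)
step (FldCharFieldInfo i)  Separate  = (FldCharContent i [], Nothing)
-- The case the patch adds: "end" directly after the instruction text.
step (FldCharFieldInfo i)  End       = (FldCharClosed, Just (Field i []))
step (FldCharContent i rs) (Run r)   = (FldCharContent i (r : rs), Nothing)
step (FldCharContent i rs) End       = (FldCharClosed, Just (Field i (reverse rs)))
-- Any other combination leaves the state unchanged in this sketch.
step st                    _         = (st, Nothing)

-- Feed a list of events through the machine and collect completed fields.
runFields :: [Event] -> (FldCharState, [Field])
runFields = go FldCharClosed
  where
    go st []       = (st, [])
    go st (e : es) =
      let (st', out)    = step st e
          (final, rest) = go st' es
      in (final, maybe rest (: rest) out)
```

With the added case, runFields [Begin, Instr " SEQ CHAPTER \\h \\r 1", End] ends in FldCharClosed with a single field that has no runs. If that one equation is deleted, the same input stays stuck in FldCharFieldInfo, which is the behaviour described above.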

------=_Part_23330_19366164.1527552581278--

------=_Part_23329_293685910.1527552581277--