From mboxrd@z Thu Jan 1 00:00:00 1970 X-Msuck: nntp://news.gmane.io/gmane.text.pandoc/20258 Path: news.gmane.org!.POSTED!not-for-mail From: Francesco Occhipinti Newsgroups: gmane.text.pandoc Subject: Re: docx (Word) reader and complex fields Date: Mon, 28 May 2018 08:22:42 -0700 (PDT) Message-ID: <7fcad0d6-d3d4-4239-8976-ac51ce85d475@googlegroups.com> References: Reply-To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org NNTP-Posting-Host: blaine.gmane.org Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="----=_Part_10230_2109135885.1527520962359" X-Trace: blaine.gmane.org 1527520838 11855 195.159.176.226 (28 May 2018 15:20:38 GMT) X-Complaints-To: usenet@blaine.gmane.org NNTP-Posting-Date: Mon, 28 May 2018 15:20:38 +0000 (UTC) To: pandoc-discuss Original-X-From: pandoc-discuss+bncBCBO7I5D4QHRBQ55WDMAKGQE57NB7RI-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org Mon May 28 17:20:34 2018 Return-path: Envelope-to: gtp-pandoc-discuss@m.gmane.org Original-Received: from mail-oi0-f64.google.com ([209.85.218.64]) by blaine.gmane.org with esmtp (Exim 4.84_2) (envelope-from ) id 1fNJwn-0002z2-EQ for gtp-pandoc-discuss@m.gmane.org; Mon, 28 May 2018 17:20:33 +0200 Original-Received: by mail-oi0-f64.google.com with SMTP id u63-v6sf2125562oia.8 for ; Mon, 28 May 2018 08:22:45 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=googlegroups.com; s=20161025; h=sender:date:from:to:message-id:in-reply-to:references:subject :mime-version:x-original-sender:reply-to:precedence:mailing-list :list-id:list-post:list-help:list-archive:list-subscribe :list-unsubscribe; bh=/4ojOlx08PfSbpvm2ZKYJVY+9E7PfPjRWJ5GoxfgplA=; b=N4Dae+YPexWKPQNmbHBy9jBNEJEe6Rc52AlwT244xdYudWAnNREqeSE47MdeWaEGL8 jC2vJBxZAX8wGRZFHlvzKpJiAykD25p0u+JWYIR77ke4gzyNq+QRMI9AYzyajDTmGJRd 95uq7IWdqSSjwPvqHylygooClqoRTU3LX+geABcaUNolnuiM2BulZoL5TzOiEPrm2/9m pUnoiamZMHsxBrVCi7FOgcVPFnU64qfcbsHzTYeH1RbM3O6UP/+BRmXfSYkLb7SpORno NibA7OwaQbO0qvt9+/EmlcB7Se3F/B8bvjkwlxb/+jxmOtpbLOtSdI/Kny2GNFaql0uR t0mw== 
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=date:from:to:message-id:in-reply-to:references:subject:mime-version :x-original-sender:reply-to:precedence:mailing-list:list-id :list-post:list-help:list-archive:list-subscribe:list-unsubscribe; bh=/4ojOlx08PfSbpvm2ZKYJVY+9E7PfPjRWJ5GoxfgplA=; b=nkFKaE/MazcyJ0v+HGCVR4hUOlaofLuzDNhZeED3e9+6NAevo358VoKbUim5hVg5Ch rmTgj5WojoNbE6GCKyDJdbyi0mVKuk2drrCVjYQDFeaX8QWCVSAsRwFGy9GUIkapFAOy jcD9/aYYDoD72z+DLaQH31wsR9fRzZaGdIadg4Z4s0qMfI+F4fFjD7ueFU+7WPcGiI+n FbQqqaJYxzpTaGcpAXAEvJAGkZ3nZnEwFzJoFotaIHUtt7/hanv7KSjRcd9VEd+Od9/y rTf319tKoD4m/qn4m914fBE4v0x1+lV/sGNCjsVexFQ9lAEhAvlwyKHHVoOzOSH9Rpof ZnJw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=sender:x-gm-message-state:date:from:to:message-id:in-reply-to :references:subject:mime-version:x-original-sender:reply-to :precedence:mailing-list:list-id:x-spam-checked-in-group:list-post :list-help:list-archive:list-subscribe:list-unsubscribe; bh=/4ojOlx08PfSbpvm2ZKYJVY+9E7PfPjRWJ5GoxfgplA=; b=UeI0MTqYtZ4Hb0ChbO/5EdIawvKVc+llZPoEuTE2x8BFGOqe9BIyEHxKe0a/u1NO62 beXKlWFQbBWsO9CINGNQNABGln0+rM0jHpa4+jNHy9L5JudLv+i2jdOSZkHU3rGP/CkX e4faoR24C/6DCjndejMP8YNOd2bc1gk4e/l+cOfWwgiBtJNexY+fvOXyMmupowDU74sq Ksf0vE6xsLTscEYMGE/Y0OdHFIQeibesSZWrTCaq5s28KdxnSccSApvc1FisqS0UzE+E cbIptLNrRhCZeP3o5Jatau5qtA+e2+gJS3u7GSteA+/Vs5X8Do5YBMzFDUhCyrl/7ipM veWQ== Original-Sender: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org X-Gm-Message-State: ALKqPweGf3bZYH3ZVnp+5X84N1FTEzdd2Spf9oTjdPS88jjxJGe8m9js IyV39GTwz2yHBnlFCPNjIZk= X-Google-Smtp-Source: ADUXVKJOG/UhtcbznQ+kGKnRDpSuLL+UGtKSdlOVj8HA7L1lp9HPAVwsa3wVvrImJaqZ6p0Y/rcrMA== X-Received: by 2002:a9d:5608:: with SMTP id e8-v6mr1068919oti.5.1527520964242; Mon, 28 May 2018 08:22:44 -0700 (PDT) X-BeenThere: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org Original-Received: by 2002:a9d:25c7:: with SMTP id q65-v6ls17606831ota.2.gmail; Mon, 28 May 2018 08:22:43 -0700 (PDT) 
X-Received: by 2002:a9d:5c8d:: with SMTP id a13-v6mr1088395oti.0.1527520963050; Mon, 28 May 2018 08:22:43 -0700 (PDT) In-Reply-To: X-Original-Sender: f.occhipinti-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org Precedence: list Mailing-list: list pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org; contact pandoc-discuss+owners-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org List-ID: X-Spam-Checked-In-Group: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org X-Google-Group-Id: 1007024079513 List-Post: , List-Help: , List-Archive: , List-Unsubscribe: , Xref: news.gmane.org gmane.text.pandoc:20258 Archived-At:

------=_Part_10230_2109135885.1527520962359
Content-Type: multipart/alternative; boundary="----=_Part_10231_503444359.1527520962359"

------=_Part_10231_503444359.1527520962359
Content-Type: text/plain; charset="UTF-8"

This probably deserves to be turned into an issue and a related pull request, if you haven't done that already. Ideally, the pull request would include a unit test.

On Thursday, May 24, 2018 at 11:06:34 PM UTC+2, pando...-Mmb7MZpHnFY@public.gmane.org wrote:
>
> Text/Pandoc/Readers/Docx/Parse.hs handles complex fields as if the
> "separate" field char and subsequent runs were obligatory, but they are
> optional.
>
> While working on other things, I came across this snippet of docx XML
> (simplified here):
>
> <w:r><w:fldChar w:fldCharType="begin"/></w:r>
> <w:r>
>   <w:instrText xml:space="preserve"> SEQ CHAPTER \h \r 1</w:instrText>
> </w:r>
> <w:r><w:fldChar w:fldCharType="end"/></w:r>
>
> ECMA-376-1:2016, 17.16.2 (p. 1165), however, marks these parts as optional.
>
> Parsing this made the FldCharState transition from Closed -begin-> Open
> -instr-> FieldInfo and then ignore the end. Only when a later field
> contained a separator would it continue to -separate-> Content -end->
> Closed.
>
> As a fix, I added a case for when an end is encountered in the FieldInfo
> state:
>
> Fixed a bug in complex field handling: separator fields and rendition runs
> are optional.
>
> ---
>  src/Text/Pandoc/Readers/Docx/Parse.hs | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/src/Text/Pandoc/Readers/Docx/Parse.hs b/src/Text/Pandoc/Readers/Docx/Parse.hs
> index 221260f42..b5226a95a 100644
> --- a/src/Text/Pandoc/Readers/Docx/Parse.hs
> +++ b/src/Text/Pandoc/Readers/Docx/Parse.hs
> @@ -830,9 +830,12 @@ elemToParPart ns element
>          FldCharClosed | fldCharType == "begin" -> do
>            modify $ \st -> st {stateFldCharState = FldCharOpen}
>            return NullParPart
> -        FldCharFieldInfo info | fldCharType == "separate" -> do
> +        FldCharFieldInfo info | fldCharType == "separate" -> do -- optional separator before rendition
>            modify $ \st -> st {stateFldCharState = FldCharContent info []}
>            return NullParPart
> +        FldCharFieldInfo info | fldCharType == "end" -> do -- direct end, without rendition
> +          modify $ \st -> st {stateFldCharState = FldCharClosed}
> +          return $ Field info []
>          FldCharContent info runs | fldCharType == "end" -> do -- fxg: End in same par
>            modify $ \st -> st {stateFldCharState = FldCharClosed}
>            return $ Field info $ reverse runs
> -- 
> 2.11.0
>
> I tested that against the current pandoc git master HEAD (version 2.2.1).
>

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To post to this group, send email to pandoc-discuss-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/7fcad0d6-d3d4-4239-8976-ac51ce85d475%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

------=_Part_10231_503444359.1527520962359
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
This probably deserves to be turned into an issue and a related pull request, if you haven't done that already. Ideally, the pull request would include a unit test.

On Thursday, May 24, 2018 at 11:06:34 PM UTC+2, pando...-Mmb7MZpHnFY@public.gmane.org wrote:
Text/Pandoc/Readers/Docx/Parse.hs handles complex fields as if the "separate" field char and subsequent runs were obligatory, but they are optional.

While working on other things, I came across this snippet of docx XML (simplified here):
<w:r><w:fldChar w:fldCharType="begin"/></w:r>
<w:r>
  <w:instrText xml:space="preserve"> SEQ CHAPTER \h \r 1</w:instrText>
</w:r>
<w:r><w:fldChar w:fldCharType="end"/></w:r>

ECMA-376-1:2016, 17.16.2 (p. 1165), however, marks these parts as optional.

Parsing this made the FldCharState transition from Closed -begin-> Open -instr-> FieldInfo and then ignore the end. Only when a later field contained a separator would it continue to -separate-> Content -end-> Closed.
As a fix, I added a case for when an end is encountered in the FieldInfo state:
Fixed a bug in complex field handling: separator fields and rendition runs are optional.

---
 src/Text/Pandoc/Readers/Docx/Parse.hs | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/src/Text/Pandoc/Readers/Docx/Parse.hs b/src/Text/Pandoc/Readers/Docx/Parse.hs
index 221260f42..b5226a95a 100644
--- a/src/Text/Pandoc/Readers/Docx/Parse.hs
+++ b/src/Text/Pandoc/Readers/Docx/Parse.hs
@@ -830,9 +830,12 @@ elemToParPart ns element
         FldCharClosed | fldCharType == "begin" -> do
           modify $ \st -> st {stateFldCharState = FldCharOpen}
           return NullParPart
-        FldCharFieldInfo info | fldCharType == "separate" -> do
+        FldCharFieldInfo info | fldCharType == "separate" -> do -- optional separator before rendition
           modify $ \st -> st {stateFldCharState = FldCharContent info []}
           return NullParPart
+        FldCharFieldInfo info | fldCharType == "end" -> do -- direct end, without rendition
+          modify $ \st -> st {stateFldCharState = FldCharClosed}
+          return $ Field info []
         FldCharContent info runs | fldCharType == "end" -> do -- fxg: End in same par
           modify $ \st -> st {stateFldCharState = FldCharClosed}
           return $ Field info $ reverse runs
-- 
2.11.0

I tested that against the current pandoc git master HEAD (version 2.2.1).
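
For readers who want to try the state transitions described above outside of pandoc, here is a minimal standalone sketch. It is not pandoc's code: Event, FieldState, step and parseFields are hypothetical, simplified stand-ins for the reader's FldCharState handling, and unlike the real reader this sketch silently ignores unexpected events instead of reporting an error. The marked case corresponds to the one the patch adds.

```haskell
import Data.List (mapAccumL)
import Data.Maybe (catMaybes)

-- Simplified events seen while walking the runs of a paragraph
-- (hypothetical type, not part of pandoc).
data Event
  = Begin          -- <w:fldChar w:fldCharType="begin"/>
  | Instr String   -- <w:instrText>...</w:instrText>
  | Separate       -- <w:fldChar w:fldCharType="separate"/>
  | Run String     -- a run belonging to the field's rendition
  | End            -- <w:fldChar w:fldCharType="end"/>
  deriving (Eq, Show)

-- Simplified counterpart of the four field states named in the post.
data FieldState
  = Closed
  | Open
  | FieldInfo String
  | Content String [String]  -- rendition runs, accumulated in reverse
  deriving (Eq, Show)

-- A completed field: its instruction text and its rendition runs.
data Field = Field String [String]
  deriving (Eq, Show)

-- One transition of the state machine, possibly emitting a finished field.
step :: FieldState -> Event -> (FieldState, Maybe Field)
step Closed         Begin     = (Open, Nothing)
step Open           (Instr i) = (FieldInfo i, Nothing)
step (FieldInfo i)  Separate  = (Content i [], Nothing)
step (FieldInfo i)  End       = (Closed, Just (Field i []))  -- the added case
step (Content i rs) (Run r)   = (Content i (r : rs), Nothing)
step (Content i rs) End       = (Closed, Just (Field i (reverse rs)))
step st             _         = (st, Nothing)  -- ignored in this sketch

-- Run the machine over a list of events and collect the finished fields.
parseFields :: [Event] -> [Field]
parseFields = catMaybes . snd . mapAccumL step Closed
```

With the marked case present, parseFields [Begin, Instr "PAGE", End] evaluates to [Field "PAGE" []]. Without it, the machine would stay in FieldInfo after the End event and emit nothing, which matches the behaviour described above.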

------=_Part_10231_503444359.1527520962359-- ------=_Part_10230_2109135885.1527520962359--