public inbox archive for pandoc-discuss@googlegroups.com
From: Francesco Occhipinti <f.occhipinti-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
To: pandoc-discuss <pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
Subject: Re: docx (Word) reader and complex fields
Date: Mon, 28 May 2018 08:22:42 -0700 (PDT)
Message-ID: <7fcad0d6-d3d4-4239-8976-ac51ce85d475@googlegroups.com>
In-Reply-To: <fe8c2cd9-1aca-430a-92de-c20fa268a223-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>



This probably deserves an issue and a related pull request, if you haven't
opened them already. Ideally the pull request would include a unit test.

On Thursday, May 24, 2018 at 11:06:34 PM UTC+2, pando...-Mmb7MZpHnFY@public.gmane.org wrote:
>
> src/Text/Pandoc/Readers/Docx/Parse.hs handles complex fields as if the
> "separate" field char and the subsequent runs were obligatory, but they are
> optional.
>
> While working on other things, I came across this snippet of docx XML 
> (simplified here):
> <w:r><w:fldChar w:fldCharType="begin"/></w:r>
> <w:r>
>   <w:instrText xml:space="preserve"> SEQ CHAPTER \h \r 1</w:instrText>
> </w:r>
> <w:r><w:fldChar w:fldCharType="end"/></w:r>
>
> ECMA-376-1:2016, 17.16.2 (p. 1165), however, marks the "separate" field
> char and the runs that follow it as optional.
>
> Parsing this made the FldCharState transition from Closed -begin-> Open
> -instr-> FieldInfo and then ignore the end. Only when a later field
> contained a separator would it continue with -separate-> CharContent -end->
> Closed.
>
> As a fix, I added a case for an end encountered in the FieldInfo state:
> Fixed a bug in complex field handling: separator fields and rendition runs 
> are optional.
>
> ---
>  src/Text/Pandoc/Readers/Docx/Parse.hs | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/src/Text/Pandoc/Readers/Docx/Parse.hs b/src/Text/Pandoc/Readers/Docx/Parse.hs
> index 221260f42..b5226a95a 100644
> --- a/src/Text/Pandoc/Readers/Docx/Parse.hs
> +++ b/src/Text/Pandoc/Readers/Docx/Parse.hs
> @@ -830,9 +830,12 @@ elemToParPart ns element
>          FldCharClosed | fldCharType == "begin" -> do
>            modify $ \st -> st {stateFldCharState = FldCharOpen}
>            return NullParPart
> -        FldCharFieldInfo info | fldCharType == "separate" -> do
> +        FldCharFieldInfo info | fldCharType == "separate" -> do -- optional separator before rendition
>            modify $ \st -> st {stateFldCharState = FldCharContent info []}
>            return NullParPart
> +        FldCharFieldInfo info | fldCharType == "end" -> do -- direct end, without rendition
> +          modify $ \st -> st {stateFldCharState = FldCharClosed}
> +          return $ Field info []
>          FldCharContent info runs | fldCharType == "end" -> do -- fxg: End in same par
>            modify $ \st -> st {stateFldCharState = FldCharClosed}
>            return $ Field info $ reverse runs
> -- 
> 2.11.0
>
> I tested this against the current pandoc git master HEAD, version 2.2.1.
>
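
For readers who want to see the transitions described above in isolation,
here is a minimal, self-contained sketch of the field-char state machine
with the extra "end" case included. It is not the code from Parse.hs: the
Event and Part types, the step and runEvents functions, and the use of
plain strings for field info and runs are all simplifications assumed for
illustration. Only the FldChar* state names are taken from the patch.

```haskell
import Data.List (mapAccumL)
import Data.Maybe (catMaybes)

-- Hypothetical, simplified input events (not pandoc's XML types).
data Event
  = FldChar String  -- a w:fldChar element, carrying its w:fldCharType
  | Instr String    -- the text of a w:instrText element
  | Run String      -- the text of an ordinary run
  deriving (Eq, Show)

-- State names follow the patch; the payloads are simplified to strings.
data FldCharState
  = FldCharClosed
  | FldCharOpen
  | FldCharFieldInfo String
  | FldCharContent String [String]
  deriving (Eq, Show)

-- Hypothetical, simplified output parts.
data Part
  = Field String [String]
  | PlainRun String
  deriving (Eq, Show)

-- One transition: new state plus an optional emitted part.
step :: FldCharState -> Event -> (FldCharState, Maybe Part)
step FldCharClosed (FldChar "begin") = (FldCharOpen, Nothing)
step FldCharOpen (Instr info) = (FldCharFieldInfo info, Nothing)
-- Optional separator before the rendition runs.
step (FldCharFieldInfo info) (FldChar "separate") =
  (FldCharContent info [], Nothing)
-- Direct end without a rendition: the case the patch adds.
step (FldCharFieldInfo info) (FldChar "end") =
  (FldCharClosed, Just (Field info []))
-- Rendition runs are accumulated in reverse order.
step (FldCharContent info runs) (Run t) =
  (FldCharContent info (t : runs), Nothing)
step (FldCharContent info runs) (FldChar "end") =
  (FldCharClosed, Just (Field info (reverse runs)))
-- Outside any field, runs pass through unchanged.
step FldCharClosed (Run t) = (FldCharClosed, Just (PlainRun t))
-- Anything else is ignored in this sketch.
step st _ = (st, Nothing)

-- Fold the transitions over an event list, starting from Closed.
runEvents :: [Event] -> (FldCharState, [Part])
runEvents evs =
  let (st, parts) = mapAccumL step FldCharClosed evs
  in (st, catMaybes parts)
```

In this model, feeding begin, instr, end for the SEQ field from the snippet
ends in FldCharClosed and emits a Field with no runs. If the
FldCharFieldInfo/"end" clause is deleted, the same input stays stuck in
FldCharFieldInfo, which mirrors the behaviour reported above.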
