public inbox archive for pandoc-discuss@googlegroups.com
* docx (Word) reader and complex fields
@ 2018-05-24 21:06 pandoc-only-Mmb7MZpHnFY
       [not found] ` <fe8c2cd9-1aca-430a-92de-c20fa268a223-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 3+ messages in thread
From: pandoc-only-Mmb7MZpHnFY @ 2018-05-24 21:06 UTC (permalink / raw)
  To: pandoc-discuss


Text/Pandoc/Readers/Docx/Parse.hs handles complex fields as if the 
"separate" field char and the runs following it were obligatory, but they 
are optional.

While working on other things, I came across this snippet of docx XML 
(simplified here):
<w:r><w:fldChar w:fldCharType="begin"/></w:r>
<w:r>
  <w:instrText xml:space="preserve"> SEQ CHAPTER \h \r 1</w:instrText>
</w:r>
<w:r><w:fldChar w:fldCharType="end"/></w:r>

ECMA-376-1:2016, 17.16.2 (p. 1165) marks the separator and the runs 
following it as optional.

Parsing this made the FldCharState go from Closed -begin-> Open 
-instr-> FieldInfo and then ignore the end.  Only when a later field 
contained a separator would it continue with -separate-> Content -end-> 
Closed.

As a fix, I added a case when end is encountered in the FieldInfo state:
Fixed a bug in complex field handling: separator fields and rendition runs 
are optional.

---
 src/Text/Pandoc/Readers/Docx/Parse.hs | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/src/Text/Pandoc/Readers/Docx/Parse.hs b/src/Text/Pandoc/Readers/Docx/Parse.hs
index 221260f42..b5226a95a 100644
--- a/src/Text/Pandoc/Readers/Docx/Parse.hs
+++ b/src/Text/Pandoc/Readers/Docx/Parse.hs
@@ -830,9 +830,12 @@ elemToParPart ns element
         FldCharClosed | fldCharType == "begin" -> do
           modify $ \st -> st {stateFldCharState = FldCharOpen}
           return NullParPart
-        FldCharFieldInfo info | fldCharType == "separate" -> do
+        FldCharFieldInfo info | fldCharType == "separate" -> do -- optional separator before rendition
           modify $ \st -> st {stateFldCharState = FldCharContent info []}
           return NullParPart
+        FldCharFieldInfo info | fldCharType == "end" -> do -- direct end, without rendition
+          modify $ \st -> st {stateFldCharState = FldCharClosed}
+          return $ Field info []
         FldCharContent info runs | fldCharType == "end" -> do -- fxg: End in same par
           modify $ \st -> st {stateFldCharState = FldCharClosed}
           return $ Field info $ reverse runs
-- 
2.11.0

I tested this against the current pandoc git master HEAD, versioned 2.2.1.
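
The state transitions described above can be modelled in isolation.  The 
following is only a minimal sketch, not pandoc's actual code: the Event 
type, step, runEvents, sample and the second ("PAGE") field are 
hypothetical names made up for illustration, and the sketch assumes that 
any event without a matching transition is silently ignored, which is a 
simplification of what the real reader does.  The Bool argument switches 
the "direct end" case from the patch on or off.

```haskell
import Data.List (mapAccumL)
import Data.Maybe (catMaybes)

-- Hypothetical, simplified stand-ins for the reader's internal types.
data Event = Begin | Instr String | Separate | Run String | End
  deriving (Eq, Show)

data FldCharState
  = FldCharClosed
  | FldCharOpen
  | FldCharFieldInfo String
  | FldCharContent String [String]
  deriving (Eq, Show)

data Field = Field String [String]
  deriving (Eq, Show)

-- One transition. The first argument says whether an "end" that
-- directly follows the instruction text is handled (the patched case).
step :: Bool -> FldCharState -> Event -> (FldCharState, Maybe Field)
step _ FldCharClosed Begin = (FldCharOpen, Nothing)
step _ FldCharOpen (Instr i) = (FldCharFieldInfo i, Nothing)
step _ (FldCharFieldInfo i) Separate = (FldCharContent i [], Nothing)
step True (FldCharFieldInfo i) End = (FldCharClosed, Just (Field i []))
step _ (FldCharContent i rs) (Run r) =
  (FldCharContent i (r : rs), Nothing)
step _ (FldCharContent i rs) End =
  (FldCharClosed, Just (Field i (reverse rs)))
-- Assumption of this sketch: anything else is ignored.
step _ st _ = (st, Nothing)

-- Feed a list of events through the machine, starting from Closed,
-- and collect the final state plus every field that was completed.
runEvents :: Bool -> [Event] -> (FldCharState, [Field])
runEvents handleDirectEnd evs =
  let (final, outs) = mapAccumL (step handleDirectEnd) FldCharClosed evs
  in (final, catMaybes outs)

-- The field from the snippet (no separator), followed by a made-up
-- second field that does have a separator and a result run.
sample :: [Event]
sample =
  [ Begin, Instr "SEQ CHAPTER \\h \\r 1", End
  , Begin, Instr "PAGE", Separate, Run "3", End
  ]

-- runEvents False sample
--   == (FldCharClosed, [Field "SEQ CHAPTER \\h \\r 1" ["3"]])
-- runEvents True sample
--   == (FldCharClosed, [Field "SEQ CHAPTER \\h \\r 1" [], Field "PAGE" ["3"]])
```

In this model, without the extra case the first field stays open until 
the separator of the later field arrives and then swallows that field's 
result run; with the extra case each field is closed on its own end.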

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To post to this group, send email to pandoc-discuss-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/fe8c2cd9-1aca-430a-92de-c20fa268a223%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


* Re: docx (Word) reader and complex fields
       [not found] ` <fe8c2cd9-1aca-430a-92de-c20fa268a223-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2018-05-28 15:22   ` Francesco Occhipinti
       [not found]     ` <7fcad0d6-d3d4-4239-8976-ac51ce85d475-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 3+ messages in thread
From: Francesco Occhipinti @ 2018-05-28 15:22 UTC (permalink / raw)
  To: pandoc-discuss


This probably deserves to be turned into an issue and a related pull 
request, if you haven't done that already. Ideally the pull request would 
include a unit test.



* Re: docx (Word) reader and complex fields
       [not found]     ` <7fcad0d6-d3d4-4239-8976-ac51ce85d475-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2018-05-29  0:09       ` pandoc-only-Mmb7MZpHnFY
  0 siblings, 0 replies; 3+ messages in thread
From: pandoc-only-Mmb7MZpHnFY @ 2018-05-29  0:09 UTC (permalink / raw)
  To: pandoc-discuss


Turning this into a standalone test case was a worthwhile exercise.  It 
turns out that my "other things" were what triggered this bug; without them 
the output looks fine, so it is arguably not a bug.
The other things were the multi-paragraph complex fields 
(https://groups.google.com/forum/#!topic/pandoc-discuss/ItyWCB9HgKI), which 
may be a special-interest topic, since my only example of them is a Citavi 
bibliography.

