public inbox archive for pandoc-discuss@googlegroups.com
From: Milan Bracke <milan.bracke-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
To: pandoc-discuss <pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
Subject: docx parsing bug: nested fldChar fields are interpreted incorrectly
Date: Mon, 14 Jun 2021 00:17:13 -0700 (PDT)
Message-ID: <a4a592f3-414e-488f-be2a-0f7fd1e0cd21n@googlegroups.com>


[-- Attachment #1.1: Type: text/plain, Size: 1752 bytes --]

For those unfamiliar with fldChar fields, this comment from the docx
parsing code (Parse.hs, starting on line 825) explains them:

  fldChar fields work by first having a <w:fldChar fldCharType="begin">
  in a run, then a run with <w:instrText>, then a
  <w:fldChar fldCharType="separate"> run, then the content runs, and
  finally a <w:fldChar fldCharType="end"> run. For example (omissions
  and my comments in brackets):

      <w:r>
        [...]
        <w:fldChar w:fldCharType="begin"/>
      </w:r>
      <w:r>
        [...]
        <w:instrText xml:space="preserve"> HYPERLINK [hyperlink url] </w:instrText>
      </w:r>
      <w:r>
        [...]
        <w:fldChar w:fldCharType="separate"/>
      </w:r>
      <w:r w:rsidRPr=[...]>
        [...]
        <w:t>Foundations of Analysis, 2nd Edition</w:t>
      </w:r>
      <w:r>
        [...]
        <w:fldChar w:fldCharType="end"/>
      </w:r>

The current way of parsing fldChar fields doesn't take into account that
they can be nested, so the end of a nested fldChar field is interpreted
as the end of the surrounding one. This can, for example, lead to a
hyperlink that ends too soon. See the attached example for a docx that
demonstrates this.

I propose to fix this by turning the fldChar state into a stack, so that a 
field can be started and ended inside other fields. I will include this in 
my pull request for PAGEREF fields that I announced here a while ago, since 
they are related.
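
To make the stack idea concrete, here is a minimal sketch in Python. It is
not pandoc's code and not the planned patch. The event names, the Field
class, parse_fields, and the example.org URL are all hypothetical, chosen
only to illustrate how a stack keeps a nested "end" from closing the outer
field.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical event model: each run is reduced to (kind, value), where kind is
# "begin", "instr", "separate", "text", or "end". This is an illustration only.
Event = Tuple[str, Optional[str]]


@dataclass
class Field:
    instr: str = ""            # accumulated <w:instrText> content
    in_result: bool = False    # True once the "separate" fldChar was seen
    text: List[str] = field(default_factory=list)  # visible result text


def parse_fields(events: List[Event]) -> List[Tuple[str, str]]:
    """Return (instruction, result text) for each completed field, innermost first."""
    stack: List[Field] = []
    done: List[Tuple[str, str]] = []
    for kind, value in events:
        if kind == "begin":
            stack.append(Field())          # open a new, possibly nested, field
        elif kind == "instr" and stack:
            stack[-1].instr += value or ""
        elif kind == "separate" and stack:
            stack[-1].in_result = True
        elif kind == "text":
            # Visible text belongs to every enclosing field that is
            # already in its result part.
            for f in stack:
                if f.in_result:
                    f.text.append(value or "")
        elif kind == "end" and stack:
            f = stack.pop()                # closes only the innermost field
            done.append((f.instr.strip(), "".join(f.text)))
    return done


# A hyperlink whose result contains a nested PAGEREF field.
events: List[Event] = [
    ("begin", None),
    ("instr", " HYPERLINK https://example.org "),
    ("separate", None),
    ("text", "Section 1, page "),
    ("begin", None),
    ("instr", " PAGEREF sec1 "),
    ("separate", None),
    ("text", "7"),
    ("end", None),
    ("text", " (more)"),
    ("end", None),
]

fields = parse_fields(events)
```

With these events, the first "end" pops only the PAGEREF field, so the
hyperlink still covers " (more)". A parser with a single field state would
instead close the hyperlink at that first "end" and drop the trailing text
from the link.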

[-- Attachment #2: instrText_hyperlink.docx --]
[-- Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document, Size: 14112 bytes --]

