From: Milan Bracke <milan.bracke-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
To: pandoc-discuss <pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
Subject: Re: docx parsing bug: nested fldChar fields are interpreted incorrectly
Date: Wed, 16 Jun 2021 02:33:49 -0700 (PDT)
Message-ID: <9bdb337d-fa68-4c66-8f5c-d4fa81547953n@googlegroups.com>
In-Reply-To: <2f5489af-f5a9-4ea4-9155-9f85c4808756n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
I can't fix this without at least some feedback. It's a complex issue and
the fix will take some time, so I need to know that my proposed solution
seems sound and would be accepted if implemented correctly.
On Tuesday, June 15, 2021 at 8:38:30 AM UTC+2 Milan Bracke wrote:
> I've encountered a new problem. A fldChar field can span multiple
> paragraphs, but it doesn't have to start at the beginning of the first one.
> Because of this, a field that spans multiple paragraphs ends up merging
> those paragraphs.
> I don't think there is a way to represent this exactly in the pandoc model.
> So my current solution is to emit separate fields, each carrying the same
> field info, one in each paragraph. This can at least make the hyperlink
> fields work, and I think it will also work for the other fields we might
> add in the future (I've checked the list).
> What do you think about this?
>
> On Monday, June 14, 2021 at 9:17:13 AM UTC+2 Milan Bracke wrote:
>
>> For those who don't know fldChar fields, this comment from the docx
>> parsing code (Parse.hs, starting on line 825) explains it:
>>
>> fldChar fields work by first having a <w:fldChar fldCharType="begin"> in
>> a run, then a run with <w:instrText>, then a <w:fldChar
>> fldCharType="separate"> run, then the content runs, and finally a
>> <w:fldChar fldCharType="end"> run. For example (omissions and my comments
>> in brackets):
>>
>>   <w:r>
>>     [...]
>>     <w:fldChar w:fldCharType="begin"/>
>>   </w:r>
>>   <w:r>
>>     [...]
>>     <w:instrText xml:space="preserve"> HYPERLINK [hyperlink url] </w:instrText>
>>   </w:r>
>>   <w:r>
>>     [...]
>>     <w:fldChar w:fldCharType="separate"/>
>>   </w:r>
>>   <w:r w:rsidRPr=[...]>
>>     [...]
>>     <w:t>Foundations of Analysis, 2nd Edition</w:t>
>>   </w:r>
>>   <w:r>
>>     [...]
>>     <w:fldChar w:fldCharType="end"/>
>>   </w:r>
>>
>> The current way of parsing fldChar fields doesn't take into account that
>> they can be nested, so the end of a nested fldChar field is interpreted
>> as the end of the surrounding one. This can, for example, lead to a
>> hyperlink that ends too soon. See the attached example for a docx that
>> demonstrates this.
>>
>> I propose to fix this by turning the fldChar state into a stack, so that
>> a field can be started and ended inside other fields. I will include this
>> in my pull request for PAGEREF fields that I announced here a while ago,
>> since they are related.
>>
>
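
To make the two proposals in the quoted messages concrete (a stack for
nested fields, and repeating the field info in every paragraph a field
spans), here is a minimal sketch. It is a toy model in Python, not pandoc's
Haskell code: the token kinds and the names Field, Frame and
parse_paragraphs are hypothetical and only mirror the begin / instrText /
separate / end sequence described in the comment above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union

# Hypothetical token model: (kind, payload), where kind is one of
# "begin", "instr", "separate", "text", "end".
Token = Tuple[str, str]


@dataclass
class Field:
    instr: str                                   # e.g. "HYPERLINK http://..."
    children: List["Node"] = field(default_factory=list)


Node = Union[str, Field]


@dataclass
class Frame:
    """One open field on the stack."""
    instr: str = ""
    in_content: bool = False                     # True once "separate" was seen
    children: List[Node] = field(default_factory=list)


def parse_paragraphs(paragraphs: List[List[Token]]) -> List[List[Node]]:
    stack: List[Frame] = []                      # open fields, innermost last
    result: List[List[Node]] = []
    for tokens in paragraphs:
        # Re-open fields left pending by the previous paragraph, keeping
        # the same field info but starting with empty content.
        stack = [Frame(f.instr, f.in_content) for f in stack]
        top_level: List[Node] = []
        for kind, payload in tokens:
            # New nodes go into the innermost open field, if there is one.
            target = stack[-1].children if stack else top_level
            if kind == "begin":
                stack.append(Frame())
            elif kind == "instr" and stack and not stack[-1].in_content:
                stack[-1].instr += payload
            elif kind == "separate" and stack:
                stack[-1].in_content = True
            elif kind == "end" and stack:
                done = stack.pop()               # closes only the innermost field
                parent = stack[-1].children if stack else top_level
                parent.append(Field(done.instr.strip(), done.children))
            elif kind == "text":
                target.append(payload)
        # Paragraph boundary: emit the partial content of still-open fields
        # into this paragraph, innermost first, without popping the stack.
        for depth in range(len(stack) - 1, -1, -1):
            frame = stack[depth]
            parent = stack[depth - 1].children if depth > 0 else top_level
            if frame.children:
                parent.append(Field(frame.instr.strip(), frame.children))
        result.append(top_level)
    return result
```

With a single state variable instead of the stack, the "end" of an inner
PAGEREF field would close the surrounding HYPERLINK as well. With the stack,
each "end" pops only the innermost frame. A field that is still open at a
paragraph boundary produces one Field per paragraph, each with the same
instr, so the paragraphs stay separate.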