public inbox archive for pandoc-discuss@googlegroups.com
* docx parsing bug: nested fldChar fields are interpreted incorrectly
@ 2021-06-14  7:17     ` Milan Bracke
       [not found]       ` <a4a592f3-414e-488f-be2a-0f7fd1e0cd21n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Milan Bracke @ 2021-06-14  7:17 UTC (permalink / raw)
  To: pandoc-discuss


[-- Attachment #1.1: Type: text/plain, Size: 1752 bytes --]

For those who don't know fldChar fields, this comment from the docx parse
code (parse.hs, starting on line 825) explains it:

fldChar fields work by first
having a <w:fldChar fldCharType="begin"> in a run, then a run with
<w:instrText>, then a <w:fldChar fldCharType="separate"> run, then the
content runs, and finally a <w:fldChar fldCharType="end"> run. For
example (omissions and my comments in brackets):

<w:r>
  [...]
  <w:fldChar w:fldCharType="begin"/>
</w:r>
<w:r>
  [...]
  <w:instrText xml:space="preserve"> HYPERLINK [hyperlink url] </w:instrText>
</w:r>
<w:r>
  [...]
  <w:fldChar w:fldCharType="separate"/>
</w:r>
<w:r w:rsidRPr=[...]>
  [...]
  <w:t>Foundations of Analysis, 2nd Edition</w:t>
</w:r>
<w:r>
  [...]
  <w:fldChar w:fldCharType="end"/>
</w:r>

The current way of parsing fldChar fields doesn't take into account that
they can be nested. So the end of the nested fldChar field will be
interpreted as the end of the surrounding one. This could, for example, lead
to a hyperlink that ends too soon. See the attached example for a docx that
demonstrates this.

I propose to fix this by turning the fldChar state into a stack, so that a 
field can be started and ended inside other fields. I will include this in 
my pull request for PAGEREF fields that I announced here a while ago, since 
they are related.
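
To make the stack idea concrete, here is a minimal sketch. It is written in
Python rather than pandoc's Haskell and is not taken from the docx reader;
the event names ("begin", "instr", "separate", "end", "text"), the Field
class and parse_fields are all hypothetical, chosen only for illustration.
It assumes each run has already been reduced to one (kind, value) event.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Field:
    """A parsed field: its instruction text plus its (possibly nested) content."""
    instr: str = ""
    content: List[Union[str, "Field"]] = field(default_factory=list)
    seen_separate: bool = False  # True once the "separate" marker was read


def parse_fields(events):
    """Turn a flat list of (kind, value) run events into a nested structure.

    kind is one of "begin", "instr", "separate", "end" or "text".
    """
    root: List[Union[str, Field]] = []
    stack: List[Field] = []  # innermost open field is stack[-1]

    def emit(item):
        # Content goes to the innermost open field, or to the top level.
        (stack[-1].content if stack else root).append(item)

    for kind, value in events:
        if kind == "begin":
            stack.append(Field())
        elif kind == "instr" and stack and not stack[-1].seen_separate:
            stack[-1].instr += value
        elif kind == "separate" and stack:
            stack[-1].seen_separate = True
        elif kind == "end" and stack:
            finished = stack.pop()  # closes only the innermost field
            emit(finished)
        elif kind == "text":
            # Text between "begin" and "separate" is not field content.
            if not stack or stack[-1].seen_separate:
                emit(value)
    return root
```

With a single toggle, the first "end" marker would close the outer field.
With the stack, an "end" marker only pops the innermost open field, so the
outer hyperlink keeps collecting content until its own "end" arrives. The
sketch silently drops fields that are never closed; a real parser would
need to decide what to do with them.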

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/a4a592f3-414e-488f-be2a-0f7fd1e0cd21n%40googlegroups.com.

[-- Attachment #1.2: Type: text/html, Size: 2256 bytes --]

[-- Attachment #2: instrText_hyperlink.docx --]
[-- Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document, Size: 14112 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: docx parsing bug: nested fldChar fields are interpreted incorrectly
       [not found]       ` <a4a592f3-414e-488f-be2a-0f7fd1e0cd21n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2021-06-15  6:38         ` Milan Bracke
       [not found]           ` <2f5489af-f5a9-4ea4-9155-9f85c4808756n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Milan Bracke @ 2021-06-15  6:38 UTC (permalink / raw)
  To: pandoc-discuss


[-- Attachment #1.1: Type: text/plain, Size: 2477 bytes --]

I've encountered a new problem. A fldChar field can span multiple
paragraphs, but it doesn't have to start at the beginning of the first one.
Because of this, parsing a field that crosses paragraphs as a single unit
would merge those paragraphs.
I don't think there is a way to represent this exactly in the pandoc model,
so my current solution is to have separate fields with the same field
info in each of the paragraphs. This at least makes hyperlink fields work,
and I think it will also work for the other fields we might add in
the future (I've checked the list).
What do you think about this?
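
As a rough illustration of this splitting approach, here is a minimal
sketch. It is written in Python rather than pandoc's Haskell and is not the
actual implementation; the function name and the (kind, value) events, with
kind being one of "begin", "instr", "separate", "end" or "text", are
hypothetical. It assumes each paragraph has already been reduced to a list
of such events, one per run.

```python
def split_fields_by_paragraph(paragraphs):
    """Make every paragraph's field markers self-contained.

    `paragraphs` is a list of event lists. Fields still open at a paragraph
    boundary are closed there and reopened, with the same instruction text,
    at the start of the next paragraph.
    """
    open_fields = []  # each entry: {"instr": str, "separated": bool}
    result = []
    for events in paragraphs:
        out = []
        # Reopen fields carried over from the previous paragraph,
        # outermost first, so that nesting is preserved.
        for f in open_fields:
            out.append(("begin", None))
            out.append(("instr", f["instr"]))
            if f["separated"]:
                out.append(("separate", None))
        for kind, value in events:
            if kind == "begin":
                open_fields.append({"instr": "", "separated": False})
            elif kind == "instr" and open_fields and not open_fields[-1]["separated"]:
                open_fields[-1]["instr"] += value
            elif kind == "separate" and open_fields:
                open_fields[-1]["separated"] = True
            elif kind == "end":
                if not open_fields:
                    continue  # stray end marker: drop it
                open_fields.pop()
            out.append((kind, value))
        # Close whatever is still open so this paragraph is well-formed.
        out.extend(("end", None) for _ in open_fields)
        result.append(out)
    return result
```

Each output paragraph then contains only complete fields, so the paragraphs
stay separate while a hyperlink that started in one paragraph still covers
its continuation in the next.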

[-- Attachment #1.2: Type: text/html, Size: 3264 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: docx parsing bug: nested fldChar fields are interpreted incorrectly
       [not found]           ` <2f5489af-f5a9-4ea4-9155-9f85c4808756n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2021-06-16  9:33             ` Milan Bracke
       [not found]               ` <9bdb337d-fa68-4c66-8f5c-d4fa81547953n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Milan Bracke @ 2021-06-16  9:33 UTC (permalink / raw)
  To: pandoc-discuss


[-- Attachment #1.1: Type: text/plain, Size: 2838 bytes --]

I can't fix this without at least some feedback. It's a complex issue and
the fix will take some time, so I need to at least know that my proposed
solution seems good and would be accepted if implemented correctly.

[-- Attachment #1.2: Type: text/html, Size: 3793 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: docx parsing bug: nested fldChar fields are interpreted incorrectly
       [not found]               ` <9bdb337d-fa68-4c66-8f5c-d4fa81547953n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2021-06-16 14:21                 ` Jesse Rosenthal
       [not found]                   ` <DM6PR01MB4650D9807C8EE5F33B3D5278C90F9-cCDycDV5LeFzw2JhukMBoF5F3wEVaoLpobIHt/V7iKVBDgjK7y7TUQ@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Jesse Rosenthal @ 2021-06-16 14:21 UTC (permalink / raw)
  To: pandoc-discuss

Hi Milan,

I wrote the original fldChar code (and that comment) and I figured it would have to evolve as further requirements became necessary. If nesting is a requirement, a stack instead of a toggle seems appropriate.

As far as crossing paragraphs goes -- your approach seems right (and similar to how we've dealt with similar issues like comments crossing paragraphs in docx parsing).

I'd be happy to take a look and offer comments/feedback on your code. Just make sure to ping me (@jkr) on your PRs.

Best,
Jesse

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: docx parsing bug: nested fldChar fields are interpreted incorrectly
       [not found]                   ` <DM6PR01MB4650D9807C8EE5F33B3D5278C90F9-cCDycDV5LeFzw2JhukMBoF5F3wEVaoLpobIHt/V7iKVBDgjK7y7TUQ@public.gmane.org>
@ 2021-06-17  6:42                     ` Milan Bracke
       [not found]                       ` <a8b2c5dd-ca1d-494c-86ca-e4ad90544c5an-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Milan Bracke @ 2021-06-17  6:42 UTC (permalink / raw)
  To: pandoc-discuss


[-- Attachment #1.1: Type: text/plain, Size: 5239 bytes --]

Hi Jesse,

Thanks for the feedback. I'll ping you when making the PR. Most of my code
seems to work so far, but I still have some trouble with the fact that the
fields now need to contain ParParts instead of Runs. It's harder to match
all the cases and treat them correctly. I'll try some more and let you know
how it goes.

Best,
Milan

[-- Attachment #1.2: Type: text/html, Size: 7987 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: docx parsing bug: nested fldChar fields are interpreted incorrectly
       [not found]                       ` <a8b2c5dd-ca1d-494c-86ca-e4ad90544c5an-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2021-07-15 14:43                         ` Milan Bracke
       [not found]                           ` <24273fbf-2ce9-4c26-886b-50d504cb7b05n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Milan Bracke @ 2021-07-15 14:43 UTC (permalink / raw)
  To: pandoc-discuss


[-- Attachment #1.1: Type: text/plain, Size: 5775 bytes --]

Hi all,

I've had this pull request open for more than 3 weeks now:
https://github.com/jgm/pandoc/pull/7401
Is there a reason it's not getting any reaction? I'd be happy to improve or
explain it. If I've done something wrong, I'd like to know, so I can fix it.

Best,
Milan

[-- Attachment #1.2: Type: text/html, Size: 8579 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: docx parsing bug: nested fldChar fields are interpreted incorrectly
       [not found]                           ` <24273fbf-2ce9-4c26-886b-50d504cb7b05n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2021-07-15 14:46                             ` 'Jesse Rosenthal' via pandoc-discuss
       [not found]                               ` <BL3PR01MB7100D5D9E5DC4898EC9E1221C9129-58w3sbR9x38r+1VU5cxmT15F3wEVaoLpobIHt/V7iKVBDgjK7y7TUQ@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: 'Jesse Rosenthal' via pandoc-discuss @ 2021-07-15 14:46 UTC (permalink / raw)
  To: pandoc-discuss

Hi Milan,

Thanks for the heads up. Honestly just summer craziness: visiting family for the first time in almost two years, shuttling the kids around. Life stuff. I'll take a look at it ASAP.

Best,
Jesse

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: docx parsing bug: nested fldChar fields are interpreted incorrectly
       [not found]                               ` <BL3PR01MB7100D5D9E5DC4898EC9E1221C9129-58w3sbR9x38r+1VU5cxmT15F3wEVaoLpobIHt/V7iKVBDgjK7y7TUQ@public.gmane.org>
@ 2021-10-14  9:33                                 ` Milan Bracke
       [not found]                                   ` <50bcbdc6-8d4b-49c1-badb-f35fb968112dn-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Milan Bracke @ 2021-10-14  9:33 UTC (permalink / raw)
  To: pandoc-discuss


[-- Attachment #1.1: Type: text/plain, Size: 8665 bytes --]

Hi Jesse,

I hope you had a good summer. Do you have time to look at my pull request 
in the coming weeks?
I'm now using a fork of Pandoc to have this fix and I have to rebase every 
time something useful is done
in the main repo, so I would really like to have this fix merged.

Best,
Milan

On Thursday, July 15, 2021 at 4:47:19 PM UTC+2 Jesse Rosenthal wrote:

> Hi Milan,
>
> Thanks for the heads up. Honestly just summer craziness: visiting family 
> for the first time in almost two years, shuttling the kids around. Life 
> stuff. I'll take a look at it ASAP.
>
> Best,
> Jesse
>
> ________________________________________
> From: pandoc-...-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org <pandoc-...-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org> on behalf 
> of Milan Bracke <milan....-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> Sent: Thursday, July 15, 2021 10:43 AM
> To: pandoc-discuss
> Subject: Re: docx parsing bug: nested fldChar fields are interpreted 
> incorrectly
>
> Hi all,
>
> I've had this pull request open for more than 3 weeks now: 
> https://github.com/jgm/pandoc/pull/7401
> Is there a reason it's not getting any reaction? I'd be happy to improve 
> or explain it. If I've done something
> wrong, I'd like to know, so I can fix it.
>
> Best,
> Milan
>
> On Thursday, June 17, 2021 at 8:42:48 AM UTC+2 Milan Bracke wrote:
> Hi Jesse,
>
> Thanks for the feedback. I'll ping you when making the PR. Most of my code 
> seems to work so far, but I still
> have some trouble with the fact that the fields now need to contain 
> ParParts instead of Runs. It's harder to
> match all the cases and treat them correctly. I'll try some more and let 
> you know how it goes.
>
> Best,
> Milan
> On Wednesday, June 16, 2021 at 4:21:05 PM UTC+2 Jesse Rosenthal wrote:
> Hi Milan,
>
> I wrote the original fldChar code (and that comment) and I figured it 
> would have to evolve as further requirements became necessary. If nesting 
> is a requirement, a stack instead of a toggle seems appropriate.
>
> As far as crossing paragraphs goes -- your approach seems right (and 
> similar to how we've dealt with similar issues like comments crossing 
> paragraphs in docx parsing).
>
> I'd be happy to take a look and offer comments/feedback on your code. Just 
> make sure to ping me (@jkr) on your PRs.
>
> Best,
> Jesse
>
> ________________________________________
> From: pandoc-...-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org <pandoc-...-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org> on behalf 
> of Milan Bracke <milan....-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> Sent: Wednesday, June 16, 2021 5:33 AM
> To: pandoc-discuss
> Subject: Re: docx parsing bug: nested fldChar fields are interpreted 
> incorrectly
>
> I can't fix this without at least some feedback. It's a complex issue and 
> the fix will take some time, so I need to at least know that my proposed 
> solution
> seems good and would be accepted if implemented correctly.
>
> On Tuesday, June 15, 2021 at 8:38:30 AM UTC+2 Milan Bracke wrote:
> I've encountered a new problem. A fldChar field can span multiple 
> paragraphs, but it doesn't have to start at the beginning of the first one.
> Because of this, a field across multiple paragraphs will merge those 
> paragraphs.
> There is no way to represent this exactly in the pandoc model I think. So 
> my current solution is to have different fields with the same field
> info in the different paragraphs. This can at least make the hyperlink 
> fields work and I think it will work for the other fields we might add in
> the future as well (I've checked the list).
> What do you think about this?
>
> On Monday, June 14, 2021 at 9:17:13 AM UTC+2 Milan Bracke wrote:
> For those who don't know fldChar fields, this comment from the docx parse 
> code (parse.hs, starting on line 825) explains it:
>
> fldChar fields work by first
> having a <w:fldChar fldCharType="begin"> in a run, then a run with
> <w:instrText>, then a <w:fldChar fldCharType="separate"> run, then the
> content runs, and finally a <w:fldChar fldCharType="end"> run. For
> example (omissions and my comments in brackets):
>
> <w:r>
> [...]
> <w:fldChar w:fldCharType="begin"/>
> </w:r>
> <w:r>
> [...]
> <w:instrText xml:space="preserve"> HYPERLINK [hyperlink url] </w:instrText>
> </w:r>
> <w:r>
> [...]
> <w:fldChar w:fldCharType="separate"/>
> </w:r>
> <w:r w:rsidRPr=[...]>
> [...]
> <w:t>Foundations of Analysis, 2nd Edition</w:t>
> </w:r>
> <w:r>
> [...]
> <w:fldChar w:fldCharType="end"/>
> </w:r>
>
> The current way of parsing fldChar fields doesn't take into account that 
> they can be nested. So the end of the nested fldChar field will be 
> interpreted as the end of the surrounding one. This could for example lead 
> to a hyperlink that ends too soon. See attached example for a docx that 
> demonstrates this.
>
> I propose to fix this by turning the fldChar state into a stack, so that a 
> field can be started and ended inside other fields. I will include this in 
> my pull request for PAGEREF fields that I announced here a while ago, since 
> they are related.
>

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/50bcbdc6-8d4b-49c1-badb-f35fb968112dn%40googlegroups.com.

[-- Attachment #1.2: Type: text/html, Size: 16193 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: docx parsing bug: nested fldChar fields are interpreted incorrectly
       [not found]                                   ` <50bcbdc6-8d4b-49c1-badb-f35fb968112dn-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2021-10-17 11:02                                     ` 'Jesse Rosenthal' via pandoc-discuss
       [not found]                                       ` <BL3PR01MB7100280CEDC0AA33F788D8C9C9BB9-58w3sbR9x38r+1VU5cxmT15F3wEVaoLpobIHt/V7iKVBDgjK7y7TUQ@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: 'Jesse Rosenthal' via pandoc-discuss @ 2021-10-17 11:02 UTC (permalink / raw)
  To: pandoc-discuss

Dear Milan,

Just commented on github. This looks good to me. I apologize for the long wait here, and for taking so long to turn my attention to this.

Thanks for making this work, and for sharing it with everyone else. Sorry for standing in the way of that process going more smoothly.

Best,
Jesse

________________________________________
From: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org <pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org> on behalf of Milan Bracke <milan.bracke-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Sent: Thursday, October 14, 2021 5:33 AM
To: pandoc-discuss
Subject: Re: docx parsing bug: nested fldChar fields are interpreted incorrectly

Hi Jesse,

I hope you had a good summer. Do you have time to look at my pull request in the coming weeks?
I'm now using a fork of Pandoc to have this fix and I have to rebase every time something useful is done
in the main repo, so I would really like to have this fix merged.

Best,
Milan

On Thursday, July 15, 2021 at 4:47:19 PM UTC+2 Jesse Rosenthal wrote:
Hi Milan,

Thanks for the heads up. Honestly just summer craziness: visiting family for the first time in almost two years, shuttling the kids around. Life stuff. I'll take a look at it ASAP.

Best,
Jesse

________________________________________
From: pandoc-...-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org <pandoc-...-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org> on behalf of Milan Bracke <milan....-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Sent: Thursday, July 15, 2021 10:43 AM
To: pandoc-discuss
Subject: Re: docx parsing bug: nested fldChar fields are interpreted incorrectly

Hi all,

I've had this pull request open for more than 3 weeks now: https://github.com/jgm/pandoc/pull/7401
Is there a reason it's not getting any reaction? I'd be happy to improve or explain it. If I've done something
wrong, I'd like to know, so I can fix it.

Best,
Milan

On Thursday, June 17, 2021 at 8:42:48 AM UTC+2 Milan Bracke wrote:
Hi Jesse,

Thanks for the feedback. I'll ping you when making the PR. Most of my code seems to work so far, but I still
have some trouble with the fact that the fields now need to contain ParParts instead of Runs. It's harder to
match all the cases and treat them correctly. I'll try some more and let you know how it goes.

Best,
Milan
On Wednesday, June 16, 2021 at 4:21:05 PM UTC+2 Jesse Rosenthal wrote:
Hi Milan,

I wrote the original fldChar code (and that comment) and I figured it would have to evolve as further requirements became necessary. If nesting is a requirement, a stack instead of a toggle seems appropriate.

As far as crossing paragraphs goes -- your approach seems right (and similar to how we've dealt with similar issues like comments crossing paragraphs in docx parsing).

I'd be happy to take a look and offer comments/feedback on your code. Just make sure to ping me (@jkr) on your PRs.

Best,
Jesse

________________________________________
From: pandoc-...-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org <pandoc-...-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org> on behalf of Milan Bracke <milan....-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Sent: Wednesday, June 16, 2021 5:33 AM
To: pandoc-discuss
Subject: Re: docx parsing bug: nested fldChar fields are interpreted incorrectly

I can't fix this without at least some feedback. It's a complex issue and the fix will take some time, so I need to at least know that my proposed solution
seems good and would be accepted if implemented correctly.

On Tuesday, June 15, 2021 at 8:38:30 AM UTC+2 Milan Bracke wrote:
I've encountered a new problem. A fldChar field can span multiple paragraphs, but it doesn't have to start at the beginning of the first one.
Because of this, a field across multiple paragraphs will merge those paragraphs.
There is no way to represent this exactly in the pandoc model I think. So my current solution is to have different fields with the same field
info in the different paragraphs. This can at least make the hyperlink fields work and I think it will work for the other fields we might add in
the future as well (I've checked the list).
What do you think about this?
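
The "different fields with the same field info in the different paragraphs" approach can be sketched as a rebalancing pass over paragraphs. This is only an illustration under assumed, simplified types: Tok, openAfter and splitAcross are hypothetical names, not pandoc's, and a field is reduced to a begin marker carrying its instruction text, an end marker, and text runs.

```haskell
-- Hypothetical, simplified model: each paragraph is a list of tokens.
-- None of these names are taken from pandoc's Docx reader.
data Tok
  = FBegin String  -- start of a field, carrying its instruction text
  | FEnd           -- end of the innermost open field
  | T String       -- ordinary text run
  deriving (Eq, Show)

-- Given the fields open before a paragraph (innermost first),
-- return the fields still open after it (innermost first).
openAfter :: [String] -> [Tok] -> [String]
openAfter = foldl step
  where
    step st (FBegin i) = i : st
    step (_:st) FEnd   = st
    step st _          = st  -- text, or a stray end with nothing open

-- Rewrite every paragraph so that each field is opened and closed
-- inside it: fields carried over from the previous paragraph are
-- reopened with the same instruction, and fields still open at the
-- end of the paragraph are closed there.
splitAcross :: [[Tok]] -> [[Tok]]
splitAcross = go []
  where
    go _ [] = []
    go carried (p:ps) =
      let after = openAfter carried p
          -- reopen outermost first, so nesting order is preserved
          p'    = map FBegin (reverse carried) ++ p ++ map (const FEnd) after
      in p' : go after ps
```

For a hyperlink that starts in the middle of one paragraph and ends in the next, both output paragraphs contain their own balanced HYPERLINK field with the same instruction, so the paragraphs no longer have to be merged.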

On Monday, June 14, 2021 at 9:17:13 AM UTC+2 Milan Bracke wrote:
For those who don't know fldChar fields, this comment from the docx parse code (parse.hs, starting on line 825) explains it:

fldChar fields work by first
having a <w:fldChar fldCharType="begin"> in a run, then a run with
<w:instrText>, then a <w:fldChar fldCharType="separate"> run, then the
content runs, and finally a <w:fldChar fldCharType="end"> run. For
example (omissions and my comments in brackets):

<w:r>
[...]
<w:fldChar w:fldCharType="begin"/>
</w:r>
<w:r>
[...]
<w:instrText xml:space="preserve"> HYPERLINK [hyperlink url] </w:instrText>
</w:r>
<w:r>
[...]
<w:fldChar w:fldCharType="separate"/>
</w:r>
<w:r w:rsidRPr=[...]>
[...]
<w:t>Foundations of Analysis, 2nd Edition</w:t>
</w:r>
<w:r>
[...]
<w:fldChar w:fldCharType="end"/>
</w:r>

The current way of parsing fldChar fields doesn't take into account that they can be nested. So the end of the nested fldChar field will be interpreted as the end of the surrounding one. This could, for example, lead to a hyperlink that ends too soon. See attached example for a docx that demonstrates this.

I propose to fix this by turning the fldChar state into a stack, so that a field can be started and ended inside other fields. I will include this in my pull request for PAGEREF fields that I announced here a while ago, since they are related.

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/BL3PR01MB7100280CEDC0AA33F788D8C9C9BB9%40BL3PR01MB7100.prod.exchangelabs.com.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: docx parsing bug: nested fldChar fields are interpreted incorrectly
       [not found]                                       ` <BL3PR01MB7100280CEDC0AA33F788D8C9C9BB9-58w3sbR9x38r+1VU5cxmT15F3wEVaoLpobIHt/V7iKVBDgjK7y7TUQ@public.gmane.org>
@ 2021-10-18  7:07                                         ` Milan Bracke
  0 siblings, 0 replies; 10+ messages in thread
From: Milan Bracke @ 2021-10-18  7:07 UTC (permalink / raw)
  To: pandoc-discuss


[-- Attachment #1.1: Type: text/plain, Size: 14038 bytes --]

No worries. Thanks for taking a look. I responded to your question in the 
pull request.

On Sunday, October 17, 2021 at 1:02:22 PM UTC+2 Jesse Rosenthal wrote:

> Dear Milan,
>
> Just commented on github. This looks good to me. I apologize for the long 
> wait here, and for taking so long to turn my attention to this.
>
> Thanks for making this work, and for sharing it with everyone else. Sorry 
> for standing in the way of that process going more smoothly.
>
> Best,
> Jesse
>
> ________________________________________
> From: pandoc-...-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org <pandoc-...-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org> on behalf 
> of Milan Bracke <milan....-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> Sent: Thursday, October 14, 2021 5:33 AM
> To: pandoc-discuss
> Subject: Re: docx parsing bug: nested fldChar fields are interpreted 
> incorrectly
>
> Hi Jesse,
>
> I hope you had a good summer. Do you have time to look at my pull request 
> in the coming weeks?
> I'm now using a fork of Pandoc to have this fix and I have to rebase every 
> time something useful is done
> in the main repo, so I would really like to have this fix merged.
>
> Best,
> Milan
>
> On Thursday, July 15, 2021 at 4:47:19 PM UTC+2 Jesse Rosenthal wrote:
> Hi Milan,
>
> Thanks for the heads up. Honestly just summer craziness: visiting family 
> for the first time in almost two years, shuttling the kids around. Life 
> stuff. I'll take a look at it ASAP.
>
> Best,
> Jesse
>
> ________________________________________
> From: pandoc-...-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org <pandoc-...-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org> on behalf 
> of Milan Bracke <milan....-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> Sent: Thursday, July 15, 2021 10:43 AM
> To: pandoc-discuss
> Subject: Re: docx parsing bug: nested fldChar fields are interpreted 
> incorrectly
>
> Hi all,
>
> I've had this pull request open for more than 3 weeks now: 
> https://github.com/jgm/pandoc/pull/7401
> Is there a reason it's not getting any reaction? I'd be happy to improve 
> or explain it. If I've done something
> wrong, I'd like to know, so I can fix it.
>
> Best,
> Milan
>
> On Thursday, June 17, 2021 at 8:42:48 AM UTC+2 Milan Bracke wrote:
> Hi Jesse,
>
> Thanks for the feedback. I'll ping you when making the PR. Most of my code 
> seems to work so far, but I still
> have some trouble with the fact that the fields now need to contain 
> ParParts instead of Runs. It's harder to
> match all the cases and treat them correctly. I'll try some more and let 
> you know how it goes.
>
> Best,
> Milan
> On Wednesday, June 16, 2021 at 4:21:05 PM UTC+2 Jesse Rosenthal wrote:
> Hi Milan,
>
> I wrote the original fldChar code (and that comment) and I figured it 
> would have to evolve as further requirements became necessary. If nesting 
> is a requirement, a stack instead of a toggle seems appropriate.
>
> As far as crossing paragraphs goes -- your approach seems right (and 
> similar to how we've dealt with similar issues like comments crossing 
> paragraphs in docx parsing).
>
> I'd be happy to take a look and offer comments/feedback on your code. Just 
> make sure to ping me (@jkr) on your PRs.
>
> Best,
> Jesse
>
> ________________________________________
> From: pandoc-...-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org <pandoc-...-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org> on behalf 
> of Milan Bracke <milan....-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> Sent: Wednesday, June 16, 2021 5:33 AM
> To: pandoc-discuss
> Subject: Re: docx parsing bug: nested fldChar fields are interpreted 
> incorrectly
>
> I can't fix this without at least some feedback. It's a complex issue and 
> the fix will take some time, so I need to at least know that my proposed 
> solution
> seems good and would be accepted if implemented correctly.
>
> On Tuesday, June 15, 2021 at 8:38:30 AM UTC+2 Milan Bracke wrote:
> I've encountered a new problem. A fldChar field can span multiple 
> paragraphs, but it doesn't have to start at the beginning of the first one.
> Because of this, a field across multiple paragraphs will merge those 
> paragraphs.
> There is no way to represent this exactly in the pandoc model I think. So 
> my current solution is to have different fields with the same field
> info in the different paragraphs. This can at least make the hyperlink 
> fields work and I think it will work for the other fields we might add in
> the future as well (I've checked the list).
> What do you think about this?
>
> On Monday, June 14, 2021 at 9:17:13 AM UTC+2 Milan Bracke wrote:
> For those who don't know fldChar fields, this comment from the docx parse 
> code (parse.hs, starting on line 825) explains it:
>
> fldChar fields work by first
> having a <w:fldChar fldCharType="begin"> in a run, then a run with
> <w:instrText>, then a <w:fldChar fldCharType="separate"> run, then the
> content runs, and finally a <w:fldChar fldCharType="end"> run. For
> example (omissions and my comments in brackets):
>
> <w:r>
> [...]
> <w:fldChar w:fldCharType="begin"/>
> </w:r>
> <w:r>
> [...]
> <w:instrText xml:space="preserve"> HYPERLINK [hyperlink url] </w:instrText>
> </w:r>
> <w:r>
> [...]
> <w:fldChar w:fldCharType="separate"/>
> </w:r>
> <w:r w:rsidRPr=[...]>
> [...]
> <w:t>Foundations of Analysis, 2nd Edition</w:t>
> </w:r>
> <w:r>
> [...]
> <w:fldChar w:fldCharType="end"/>
> </w:r>
>
> The current way of parsing fldChar fields doesn't take into account that 
> they can be nested. So the end of the nested fldChar field will be 
> interpreted as the end of the surrounding one. This could for example lead 
> to a hyperlink that ends too soon. See attached example for a docx that 
> demonstrates this.
>
> I propose to fix this by turning the fldChar state into a stack, so that a 
> field can be started and ended inside other fields. I will include this in 
> my pull request for PAGEREF fields that I announced here a while ago, since 
> they are related.
>

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/d537ca16-e31c-4b90-ab6d-6a5e80200618n%40googlegroups.com.

[-- Attachment #1.2: Type: text/html, Size: 32119 bytes --]

end of thread, other threads:[~2021-10-18  7:07 UTC | newest]

Thread overview: 10+ messages
     [not found] <AQHXeYfzXOAwmGiNTUe7VHoqgBZUyqtEHQvVgI6tBACABM9T0w==>
     [not found] ` <AQHXeYfzXOAwmGiNTUe7VHoqgBZUyqtEHQvV>
     [not found]   ` <AQHXYO2J8TKZPxOOWkqee4W1dsjoP6sUoEAAgAHDUYCAAE9Mnw==>
2021-06-14  7:17     ` docx parsing bug: nested fldChar fields are interpreted incorrectly Milan Bracke
     [not found]       ` <a4a592f3-414e-488f-be2a-0f7fd1e0cd21n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
2021-06-15  6:38         ` Milan Bracke
     [not found]           ` <2f5489af-f5a9-4ea4-9155-9f85c4808756n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
2021-06-16  9:33             ` Milan Bracke
     [not found]               ` <9bdb337d-fa68-4c66-8f5c-d4fa81547953n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
2021-06-16 14:21                 ` Jesse Rosenthal
     [not found]                   ` <DM6PR01MB4650D9807C8EE5F33B3D5278C90F9-cCDycDV5LeFzw2JhukMBoF5F3wEVaoLpobIHt/V7iKVBDgjK7y7TUQ@public.gmane.org>
2021-06-17  6:42                     ` Milan Bracke
     [not found]                       ` <a8b2c5dd-ca1d-494c-86ca-e4ad90544c5an-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
2021-07-15 14:43                         ` Milan Bracke
     [not found]                           ` <24273fbf-2ce9-4c26-886b-50d504cb7b05n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
2021-07-15 14:46                             ` 'Jesse Rosenthal' via pandoc-discuss
     [not found]                               ` <BL3PR01MB7100D5D9E5DC4898EC9E1221C9129-58w3sbR9x38r+1VU5cxmT15F3wEVaoLpobIHt/V7iKVBDgjK7y7TUQ@public.gmane.org>
2021-10-14  9:33                                 ` Milan Bracke
     [not found]                                   ` <50bcbdc6-8d4b-49c1-badb-f35fb968112dn-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
2021-10-17 11:02                                     ` 'Jesse Rosenthal' via pandoc-discuss
     [not found]                                       ` <BL3PR01MB7100280CEDC0AA33F788D8C9C9BB9-58w3sbR9x38r+1VU5cxmT15F3wEVaoLpobIHt/V7iKVBDgjK7y7TUQ@public.gmane.org>
2021-10-18  7:07                                         ` Milan Bracke