public inbox archive for pandoc-discuss@googlegroups.com
From: Milan Bracke <milan.bracke-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
To: pandoc-discuss <pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
Subject: Re: docx parsing bug: nested fldChar fields are interpreted incorrectly
Date: Mon, 18 Oct 2021 00:07:44 -0700 (PDT)
Message-ID: <d537ca16-e31c-4b90-ab6d-6a5e80200618n@googlegroups.com>
In-Reply-To: <BL3PR01MB7100280CEDC0AA33F788D8C9C9BB9-58w3sbR9x38r+1VU5cxmT15F3wEVaoLpobIHt/V7iKVBDgjK7y7TUQ@public.gmane.org>


No worries. Thanks for taking a look. I responded to your question in the 
pull request.

On Sunday, October 17, 2021 at 1:02:22 PM UTC+2 Jesse Rosenthal wrote:

> Dear Milan,
>
> Just commented on github. This looks good to me. I apologize for the long 
> wait here, and for taking so long to turn my attention to this.
>
> Thanks for making this work, and for sharing it with everyone else. Sorry 
> to stand in the way of that process being a bit smoother.
>
> Best,
> Jesse
>
> ________________________________________
> From: pandoc-...-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org <pandoc-...-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org> on behalf 
> of Milan Bracke <milan....-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> Sent: Thursday, October 14, 2021 5:33 AM
> To: pandoc-discuss
> Subject: Re: docx parsing bug: nested fldChar fields are interpreted 
> incorrectly
>
> Hi Jesse,
>
> I hope you had a good summer. Do you have time to look at my pull request 
> in the coming weeks?
> I'm now using a fork of Pandoc to have this fix and I have to rebase every 
> time something useful is done
> in the main repo, so I would really like to have this fix merged.
>
> Best,
> Milan
>
> On Thursday, July 15, 2021 at 4:47:19 PM UTC+2 Jesse Rosenthal wrote:
> Hi Milan,
>
> Thanks for the heads up. Honestly just summer craziness: visiting family 
> for the first time in almost two years, shuttling the kids around. Life 
> stuff. I'll take a look at it ASAP.
>
> Best,
> Jesse
>
> ________________________________________
> From: pandoc-...-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org <pandoc-...-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org> on behalf 
> of Milan Bracke <milan....-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> Sent: Thursday, July 15, 2021 10:43 AM
> To: pandoc-discuss
> Subject: Re: docx parsing bug: nested fldChar fields are interpreted 
> incorrectly
>
> Hi all,
>
> I've had this pull request open for more than 3 weeks now: 
> https://github.com/jgm/pandoc/pull/7401
> Is there a reason it's not getting any reaction? I'd be happy to improve 
> or explain it. If I've done something
> wrong, I'd like to know, so I can fix it.
>
> Best,
> Milan
>
> On Thursday, June 17, 2021 at 8:42:48 AM UTC+2 Milan Bracke wrote:
> Hi Jesse,
>
> Thanks for the feedback. I'll ping you when making the PR. Most of my code 
> seems to work so far, but I still
> have some trouble with the fact that the fields now need to contain 
> ParParts instead of Runs. It's harder to
> match all the cases and treat them correctly. I'll try some more and let 
> you know how it goes.
>
> Best,
> Milan
> On Wednesday, June 16, 2021 at 4:21:05 PM UTC+2 Jesse Rosenthal wrote:
> Hi Milan,
>
> I wrote the original fldChar code (and that comment) and I figured it 
> would have to evolve as further requirements became necessary. If nesting 
> is a requirement, a stack instead of a toggle seems appropriate.
>
> As far as crossing paragraphs goes -- your approach seems right (and 
> similar to how we've dealt with similar issues like comments crossing 
> paragraphs in docx parsing).
>
> I'd be happy to take a look and offer comments/feedback on your code. Just 
> make sure to ping me (@jkr) on your PRs.
>
> Best,
> Jesse
>
> ________________________________________
> From: pandoc-...-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org <pandoc-...-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org> on behalf 
> of Milan Bracke <milan....-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> Sent: Wednesday, June 16, 2021 5:33 AM
> To: pandoc-discuss
> Subject: Re: docx parsing bug: nested fldChar fields are interpreted 
> incorrectly
>
> I can't fix this without at least some feedback. It's a complex issue and 
> the fix will take some time, so I need to at least know that my proposed 
> solution seems good and would be accepted if implemented correctly.
>
> On Tuesday, June 15, 2021 at 8:38:30 AM UTC+2 Milan Bracke wrote:
> I've encountered a new problem. A fldChar field can span multiple 
> paragraphs, and it doesn't have to start at the beginning of the first one.
> Because of this, treating a field that crosses paragraphs as a single 
> element would merge those paragraphs.
> I don't think there is a way to represent this exactly in the pandoc model, so 
> my current solution is to create a separate field in each paragraph, all with 
> the same field info. This at least makes the hyperlink fields work, and I think 
> it will also work for the other fields we might add in the future (I've 
> checked the list).
> What do you think about this?
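
A minimal sketch of the per-paragraph splitting described above, written as a simplified Python model rather than pandoc's actual Haskell reader. The event tuples (`"begin"`, `"instr"`, `"separate"`, `"text"`, `"end"`) and the function name `split_field_by_paragraph` are hypothetical, and nesting is deliberately left out to keep the example short.

```python
def split_field_by_paragraph(paragraphs):
    """Emit one field per paragraph for a field that crosses paragraphs.

    paragraphs: list of paragraphs, each a list of (kind, payload) events.
    Returns one list per paragraph whose items are either plain strings
    or (instruction, [result strings]) tuples.
    Assumption: fields are not nested in this sketch.
    """
    out = []
    instr = None       # instruction of the open field; None means no field is open
    in_result = False  # True once "separate" was seen for the open field
    for events in paragraphs:
        items = []     # output for this paragraph
        buf = []       # result text of the open field inside this paragraph
        for kind, payload in events:
            if kind == "begin":
                instr, in_result = "", False
            elif kind == "instr" and instr is not None:
                instr += payload
            elif kind == "separate" and instr is not None:
                in_result = True
            elif kind == "text":
                if instr is None:
                    items.append(payload)
                elif in_result:
                    buf.append(payload)
            elif kind == "end" and instr is not None:
                items.append((instr.strip(), buf))
                instr, in_result, buf = None, False, []
        # The field is still open at the paragraph boundary: emit a copy
        # here and keep the state so the next paragraph continues it.
        if instr is not None and buf:
            items.append((instr.strip(), buf))
        out.append(items)
    return out


paragraphs = [
    [("text", "Intro "), ("begin", None),
     ("instr", " HYPERLINK https://example.com "), ("separate", None),
     ("text", "first part")],
    [("text", "second part"), ("end", None), ("text", " outro")],
]
result = split_field_by_paragraph(paragraphs)
```

Here both paragraphs end up with their own hyperlink field carrying the same instruction, so the paragraph break survives.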
>
> On Monday, June 14, 2021 at 9:17:13 AM UTC+2 Milan Bracke wrote:
> For those who don't know fldChar fields, this comment from the docx parse 
> code (Parse.hs, starting on line 825) explains it:
>
> fldChar fields work by first
> having a <w:fldChar fldCharType="begin"> in a run, then a run with
> <w:instrText>, then a <w:fldChar fldCharType="separate"> run, then the
> content runs, and finally a <w:fldChar fldCharType="end"> run. For
> example (omissions and my comments in brackets):
>
> <w:r>
> [...]
> <w:fldChar w:fldCharType="begin"/>
> </w:r>
> <w:r>
> [...]
> <w:instrText xml:space="preserve"> HYPERLINK [hyperlink url] </w:instrText>
> </w:r>
> <w:r>
> [...]
> <w:fldChar w:fldCharType="separate"/>
> </w:r>
> <w:r w:rsidRPr=[...]>
> [...]
> <w:t>Foundations of Analysis, 2nd Edition</w:t>
> </w:r>
> <w:r>
> [...]
> <w:fldChar w:fldCharType="end"/>
> </w:r>
>
> The current way of parsing fldChar fields doesn't take into account that 
> they can be nested. So the end of the nested fldChar field will be 
> interpreted as the end of the surrounding one. This could, for example, lead 
> to a hyperlink that ends too soon. See the attached example for a docx that 
> demonstrates this.
>
> I propose to fix this by turning the fldChar state into a stack, so that a 
> field can be started and ended inside other fields. I will include this in 
> my pull request for PAGEREF fields that I announced here a while ago, since 
> they are related.
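
The stack idea can be illustrated with a small model. This is only a sketch in Python under stated assumptions, not the code from the pull request: the `Field` class, the event tuples, and `parse_fields` are hypothetical names, and nested fields are assumed to sit in the result part of their parent (after `separate`).

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Field:
    instr: str = ""                                   # accumulated instrText
    content: List[Union[str, "Field"]] = field(default_factory=list)
    in_result: bool = False                           # True once "separate" was seen


def parse_fields(events):
    """Turn (kind, payload) events into text and nested Field nodes."""
    top = []     # items outside any field
    stack = []   # currently open fields, innermost last
    for kind, payload in events:
        if kind == "begin":
            stack.append(Field())
        elif kind == "instr":
            if stack:
                stack[-1].instr += payload
        elif kind == "separate":
            if stack:
                stack[-1].in_result = True
        elif kind == "text":
            if not stack:
                top.append(payload)
            elif stack[-1].in_result:
                stack[-1].content.append(payload)
            # text before "separate" is ignored in this sketch
        elif kind == "end":
            if not stack:
                continue                  # unmatched end: ignore
            done = stack.pop()            # closes only the innermost field
            done.instr = done.instr.strip()
            if not stack:
                top.append(done)
            elif stack[-1].in_result:
                stack[-1].content.append(done)
            # a field nested in the parent's instruction part is dropped here
    return top                            # fields never closed are dropped


events = [
    ("begin", None), ("instr", " HYPERLINK https://example.com "),
    ("separate", None), ("text", "See page "),
    ("begin", None), ("instr", " PAGEREF _Ref1 "), ("separate", None),
    ("text", "12"), ("end", None),
    ("text", " for details"), ("end", None),
    ("text", " tail"),
]
result = parse_fields(events)
```

With a single on/off toggle, the first `end` would close the hyperlink right after "12". With the stack, it closes only the inner PAGEREF field, so " for details" still belongs to the hyperlink.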
>
> --
> You received this message because you are subscribed to the Google Groups 
> "pandoc-discuss" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to pandoc-discus...-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org.
> To view this discussion on the web visit 
> https://groups.google.com/d/msgid/pandoc-discuss/9bdb337d-fa68-4c66-8f5c-d4fa81547953n%40googlegroups.com.
>

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/d537ca16-e31c-4b90-ab6d-6a5e80200618n%40googlegroups.com.
