* docx parsing bug: nested fldChar fields are interpreted incorrectly
@ 2021-06-14  7:17 ` Milan Bracke
       [not found] ` <a4a592f3-414e-488f-be2a-0f7fd1e0cd21n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Milan Bracke @ 2021-06-14 7:17 UTC (permalink / raw)
  To: pandoc-discuss

[-- Attachment #1.1: Type: text/plain, Size: 1752 bytes --]

For those who don't know fldChar fields, this comment from the docx parse
code (Parse.hs, starting on line 825) explains it:

    fldChar fields work by first having a <w:fldChar fldCharType="begin">
    in a run, then a run with <w:instrText>, then a
    <w:fldChar fldCharType="separate"> run, then the content runs, and
    finally a <w:fldChar fldCharType="end"> run. For example (omissions
    and my comments in brackets):

    <w:r>
      [...]
      <w:fldChar w:fldCharType="begin"/>
    </w:r>
    <w:r>
      [...]
      <w:instrText xml:space="preserve"> HYPERLINK [hyperlink url] </w:instrText>
    </w:r>
    <w:r>
      [...]
      <w:fldChar w:fldCharType="separate"/>
    </w:r>
    <w:r w:rsidRPr=[...]>
      [...]
      <w:t>Foundations of Analysis, 2nd Edition</w:t>
    </w:r>
    <w:r>
      [...]
      <w:fldChar w:fldCharType="end"/>
    </w:r>

The current way of parsing fldChar fields doesn't take into account that
they can be nested, so the end of a nested fldChar field is interpreted as
the end of the surrounding one. This could, for example, lead to a
hyperlink that ends too soon. See the attached example for a docx that
demonstrates this.

I propose to fix this by turning the fldChar state into a stack, so that a
field can be started and ended inside other fields. I will include this in
my pull request for PAGEREF fields that I announced here a while ago, since
they are related.

--
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/a4a592f3-414e-488f-be2a-0f7fd1e0cd21n%40googlegroups.com.

[-- Attachment #1.2: Type: text/html, Size: 2256 bytes --]

[-- Attachment #2: instrText_hyperlink.docx --]
[-- Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document, Size: 14112 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread
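The stack idea proposed in the message above can be illustrated with a small sketch. This is not pandoc's code (the docx reader is written in Haskell); the event names, the `Field` class and the `parse_fields` function are hypothetical and exist only to show why a stack closes the innermost field first, where a single on/off toggle would close the outer one.

```python
from dataclasses import dataclass, field


@dataclass
class Field:
    """Hypothetical field node: instruction text plus result content."""
    instr: str = ""
    content: list = field(default_factory=list)
    in_result: bool = False  # True once the "separate" marker was seen


def parse_fields(events):
    """Nest a flat list of (kind, value) events into Field objects.

    Assumed event kinds: "begin", "instr", "separate", "end", "text".
    """
    top = []    # items that are not inside any field
    stack = []  # currently open fields, innermost last

    def emit(item):
        # Attach to the innermost open field, or to the top level.
        target = stack[-1].content if stack else top
        target.append(item)

    for kind, value in events:
        if kind == "begin":
            stack.append(Field())
        elif kind == "instr":
            if stack and not stack[-1].in_result:
                stack[-1].instr += value
        elif kind == "separate":
            if stack:
                stack[-1].in_result = True
        elif kind == "end":
            if stack:
                closed = stack.pop()  # closes only the innermost field
                emit(closed)
        elif kind == "text":
            emit(value)
    return top


# A HYPERLINK field whose result contains a nested PAGEREF field.
events = [
    ("begin", None), ("instr", " HYPERLINK https://example.com "),
    ("separate", None), ("text", "See page "),
    ("begin", None), ("instr", " PAGEREF _Toc1 \\h "),
    ("separate", None), ("text", "3"), ("end", None),
    ("text", " for details"), ("end", None),
    ("text", " after"),
]
result = parse_fields(events)
```

In this sketch the text " for details" stays inside the outer HYPERLINK field, because the first "end" event only pops the nested PAGEREF field. With a single toggle, that first "end" would have closed the hyperlink too early, which is the bug described above.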
[parent not found: <a4a592f3-414e-488f-be2a-0f7fd1e0cd21n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>]
* Re: docx parsing bug: nested fldChar fields are interpreted incorrectly
       [not found] ` <a4a592f3-414e-488f-be2a-0f7fd1e0cd21n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2021-06-15  6:38   ` Milan Bracke
       [not found]     ` <2f5489af-f5a9-4ea4-9155-9f85c4808756n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Milan Bracke @ 2021-06-15 6:38 UTC (permalink / raw)
  To: pandoc-discuss

I've encountered a new problem. A fldChar field can span multiple
paragraphs, but it doesn't have to start at the beginning of the first one.
Because of this, a field that spans multiple paragraphs will merge those
paragraphs.

I don't think there is a way to represent this exactly in the pandoc model,
so my current solution is to create separate fields with the same field
info in each of the paragraphs. This at least makes hyperlink fields work,
and I think it will also work for the other fields we might add in the
future (I've checked the list).

What do you think about this?
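The per-paragraph workaround described in this message can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the pull request: the event kinds ("begin" carrying the field info, "end", "text") and the function name `split_fields_per_paragraph` are hypothetical.

```python
def split_fields_per_paragraph(paragraphs):
    """Rewrite events so that no field crosses a paragraph boundary.

    Each paragraph is a list of (kind, value) events. Assumed kinds:
    "begin" (value is the field info), "end", and "text".
    """
    open_fields = []  # field infos still open when a paragraph ends
    result = []
    for events in paragraphs:
        # Re-open every field carried over from the previous paragraph,
        # outermost first, reusing the same field info.
        out = [("begin", info) for info in open_fields]
        for kind, value in events:
            if kind == "begin":
                open_fields.append(value)
            elif kind == "end":
                if not open_fields:
                    continue  # stray end marker: drop it
                open_fields.pop()
            out.append((kind, value))
        # Close whatever is still open at the end of this paragraph.
        out.extend(("end", None) for _ in open_fields)
        result.append(out)
    return result


# A hyperlink that starts mid-paragraph and ends in the next paragraph.
paragraphs = [
    [("text", "Intro "), ("begin", "HYPERLINK https://example.com"),
     ("text", "first part")],
    [("text", "second part"), ("end", None), ("text", " tail")],
]
split = split_fields_per_paragraph(paragraphs)
```

Each output paragraph then contains only balanced fields: the first one gets an extra "end" after "first part", and the second one starts with a new "begin" that carries the same field info, so both halves can become working links without merging the paragraphs.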
[parent not found: <2f5489af-f5a9-4ea4-9155-9f85c4808756n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>]
* Re: docx parsing bug: nested fldChar fields are interpreted incorrectly
       [not found] ` <2f5489af-f5a9-4ea4-9155-9f85c4808756n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2021-06-16  9:33   ` Milan Bracke
       [not found]     ` <9bdb337d-fa68-4c66-8f5c-d4fa81547953n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Milan Bracke @ 2021-06-16 9:33 UTC (permalink / raw)
  To: pandoc-discuss

I can't fix this without at least some feedback. It's a complex issue and
the fix will take some time, so I need to at least know that my proposed
solution seems good and would be accepted if implemented correctly.
[parent not found: <9bdb337d-fa68-4c66-8f5c-d4fa81547953n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>]
* Re: docx parsing bug: nested fldChar fields are interpreted incorrectly
       [not found] ` <9bdb337d-fa68-4c66-8f5c-d4fa81547953n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2021-06-16 14:21   ` Jesse Rosenthal
       [not found]     ` <DM6PR01MB4650D9807C8EE5F33B3D5278C90F9-cCDycDV5LeFzw2JhukMBoF5F3wEVaoLpobIHt/V7iKVBDgjK7y7TUQ@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Jesse Rosenthal @ 2021-06-16 14:21 UTC (permalink / raw)
  To: pandoc-discuss

Hi Milan,

I wrote the original fldChar code (and that comment) and I figured it would
have to evolve as further requirements became necessary. If nesting is a
requirement, a stack instead of a toggle seems appropriate.

As far as crossing paragraphs goes -- your approach seems right (and
similar to how we've dealt with similar issues like comments crossing
paragraphs in docx parsing).

I'd be happy to take a look and offer comments/feedback on your code. Just
make sure to ping me (@jkr) on your PRs.

Best,
Jesse
[parent not found: <DM6PR01MB4650D9807C8EE5F33B3D5278C90F9-cCDycDV5LeFzw2JhukMBoF5F3wEVaoLpobIHt/V7iKVBDgjK7y7TUQ@public.gmane.org>]
* Re: docx parsing bug: nested fldChar fields are interpreted incorrectly
       [not found] ` <DM6PR01MB4650D9807C8EE5F33B3D5278C90F9-cCDycDV5LeFzw2JhukMBoF5F3wEVaoLpobIHt/V7iKVBDgjK7y7TUQ@public.gmane.org>
@ 2021-06-17  6:42   ` Milan Bracke
       [not found]     ` <a8b2c5dd-ca1d-494c-86ca-e4ad90544c5an-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Milan Bracke @ 2021-06-17 6:42 UTC (permalink / raw)
  To: pandoc-discuss

Hi Jesse,

Thanks for the feedback. I'll ping you when making the PR. Most of my code
seems to work so far, but I still have some trouble with the fact that the
fields now need to contain ParParts instead of Runs. It's harder to match
all the cases and treat them correctly. I'll try some more and let you know
how it goes.

Best,
Milan
[parent not found: <a8b2c5dd-ca1d-494c-86ca-e4ad90544c5an-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>]
* Re: docx parsing bug: nested fldChar fields are interpreted incorrectly
       [not found] ` <a8b2c5dd-ca1d-494c-86ca-e4ad90544c5an-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
@ 2021-07-15 14:43   ` Milan Bracke
       [not found]     ` <24273fbf-2ce9-4c26-886b-50d504cb7b05n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Milan Bracke @ 2021-07-15 14:43 UTC (permalink / raw)
  To: pandoc-discuss

Hi all,

I've had this pull request open for more than 3 weeks now:
https://github.com/jgm/pandoc/pull/7401

Is there a reason it's not getting any reaction? I'd be happy to improve or
explain it. If I've done something wrong, I'd like to know, so I can fix
it.

Best,
Milan
[parent not found: <24273fbf-2ce9-4c26-886b-50d504cb7b05n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>]
* Re: docx parsing bug: nested fldChar fields are interpreted incorrectly [not found] ` <24273fbf-2ce9-4c26-886b-50d504cb7b05n-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org> @ 2021-07-15 14:46 ` 'Jesse Rosenthal' via pandoc-discuss [not found] ` <BL3PR01MB7100D5D9E5DC4898EC9E1221C9129-58w3sbR9x38r+1VU5cxmT15F3wEVaoLpobIHt/V7iKVBDgjK7y7TUQ@public.gmane.org> 0 siblings, 1 reply; 10+ messages in thread From: 'Jesse Rosenthal' via pandoc-discuss @ 2021-07-15 14:46 UTC (permalink / raw) To: pandoc-discuss Hi Milan, Thanks for the heads up. Honestly just summer craziness: visiting family for the first time in almost two years, shuttling the kids around. Life stuff. I'll take a look at it ASAP. Best, Jesse ________________________________________ From: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org <pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org> on behalf of Milan Bracke <milan.bracke-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> Sent: Thursday, July 15, 2021 10:43 AM To: pandoc-discuss Subject: Re: docx parsing bug: nested fldChar fields are interpreted incorrectly Hi all, I've had this pull request open for more than 3 weeks now: https://github.com/jgm/pandoc/pull/7401 Is there a reason it's not getting any reaction? I'd be happy to improve or explain it. If I've done something wrong, I'd like to know, so I can fix it. Best, Milan On Thursday, June 17, 2021 at 8:42:48 AM UTC+2 Milan Bracke wrote: Hi Jesse, Thanks for the feedback. I'll ping you when making the PR. Most of my code seems to work so far, but I still have some trouble with the fact that the fields now need to contain ParParts instead of Runs. It's harder to match all the cases and treat them correctly. I'll try some more and let you know how it goes. Best, Milan On Wednesday, June 16, 2021 at 4:21:05 PM UTC+2 Jesse Rosenthal wrote: Hi Milan, I wrote the original fldChar code (and that comment) and I figured it would have to evolve as further requirements became necessary. 
If nesting is a requirement, a stack instead of a toggle seems appropriate. As far as crossing paragraphs goes -- your approach seems right (and similar to how we've dealt with similar issues like comments crossing paragraphs in docx parsing). I'd be happy to take a look and offer comments/feedback on your code. Just make sure to ping me (@jkr) on your PRs. Best, Jesse ________________________________________ From: pandoc-...-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org <pandoc-...-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org> on behalf of Milan Bracke <milan....-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> Sent: Wednesday, June 16, 2021 5:33 AM To: pandoc-discuss Subject: Re: docx parsing bug: nested fldChar fields are interpreted incorrectly I can't fix this without at least some feedback. It's a complex issue and the fix will take some time, so I need to at least know that my proposed solution seems good and would be accepted if implemented correctly. On Tuesday, June 15, 2021 at 8:38:30 AM UTC+2 Milan Bracke wrote: I've encountered a new problem. A fldChar field can span multiple paragraphs, but it doesn't have to start at the beginning of the first one. Because of this, a field across multiple paragraphs will merge those paragraphs. There is no way to represent this exactly in the pandoc model I think. So my current solution is to have different fields with the same field info in the different paragraphs. This can at least make the hyperlink fields work and I think it will work for the other fields we might add in the future as well (I've checked the list). What do you think about this ? 
On Monday, June 14, 2021 at 9:17:13 AM UTC+2 Milan Bracke wrote: For those who don't know fldChar fields, this comment from the docx parse code (parse.hs, starting on line 825) explains it: fldChar fields work by first having a <w:fldChar fldCharType="begin"> in a run, then a run with <w:instrText>, then a <w:fldChar fldCharType="separate"> run, then the content runs, and finally a <w:fldChar fldCharType="end"> run. For example (omissions and my comments in brackets): <w:r> [...] <w:fldChar w:fldCharType="begin"/> </w:r> <w:r> [...] <w:instrText xml:space="preserve"> HYPERLINK [hyperlink url] </w:instrText> </w:r> <w:r> [...] <w:fldChar w:fldCharType="separate"/> </w:r> <w:r w:rsidRPr=[...]> [...] <w:t>Foundations of Analysis, 2nd Edition</w:t> </w:r> <w:r> [...] <w:fldChar w:fldCharType="end"/> </w:r> The current way of parsing fldChar fields doesn't take into account that they can be nested. So the end of the nested flcChar field will be interpreted as the end of the surrounding one. This could for example lead to a hyperlink that ends too soon. See attached example for a docx that demonstrates this. I propose to fix this by turning the fldChar state into a stack, so that a field can be started and ended inside other fields. I will include this in my pull request for PAGEREF fields that I announced here a while ago, since they are related. -- You received this message because you are subscribed to the Google Groups "pandoc-discuss" group. To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discus...-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org<mailto:pandoc-discus...@googlegroups.com>. 
^ permalink raw reply [flat|nested] 10+ messages in thread
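A minimal sketch of the stack idea proposed in the quoted message above, written in Python rather than pandoc's Haskell and not taken from the pull request. The function name `parse_fields`, the node shape `("field", instruction, children)`, and the sample paragraph (including the bookmark name `bm1` and the example.org URL) are assumptions made for illustration. Each `begin` pushes a frame, `separate` switches the innermost frame from instruction to result, and `end` pops only the innermost frame, so a nested field can no longer close the surrounding one.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def parse_fields(paragraph_xml):
    """Return a nested list of text strings and ("field", instr, children) nodes."""
    root = ET.fromstring(paragraph_xml.strip())
    top = []    # content that is outside every field
    stack = []  # open fields, innermost last

    def emit(node):
        # Attach a finished node to the innermost open field, or to the top level.
        if not stack:
            top.append(node)
        elif stack[-1]["in_result"]:
            stack[-1]["children"].append(node)
        # Content inside an instruction part is dropped in this sketch.

    for run in root.iter(f"{{{W}}}r"):
        for child in run:
            tag = child.tag.rsplit("}", 1)[-1]
            if tag == "fldChar":
                kind = child.get(f"{{{W}}}fldCharType")
                if kind == "begin":
                    stack.append({"instr": "", "in_result": False, "children": []})
                elif kind == "separate" and stack:
                    stack[-1]["in_result"] = True
                elif kind == "end" and stack:
                    done = stack.pop()  # closes the innermost field only
                    emit(("field", done["instr"].strip(), done["children"]))
            elif tag == "instrText" and stack and not stack[-1]["in_result"]:
                stack[-1]["instr"] += child.text or ""
            elif tag == "t":
                emit(child.text or "")
    return top


# Hypothetical paragraph: a HYPERLINK field whose result contains a PAGEREF field.
SAMPLE = f"""
<w:p xmlns:w="{W}">
  <w:r><w:fldChar w:fldCharType="begin"/></w:r>
  <w:r><w:instrText xml:space="preserve"> HYPERLINK "https://example.org" </w:instrText></w:r>
  <w:r><w:fldChar w:fldCharType="separate"/></w:r>
  <w:r><w:t>see page </w:t></w:r>
  <w:r><w:fldChar w:fldCharType="begin"/></w:r>
  <w:r><w:instrText> PAGEREF bm1 </w:instrText></w:r>
  <w:r><w:fldChar w:fldCharType="separate"/></w:r>
  <w:r><w:t>7</w:t></w:r>
  <w:r><w:fldChar w:fldCharType="end"/></w:r>
  <w:r><w:t> for details</w:t></w:r>
  <w:r><w:fldChar w:fldCharType="end"/></w:r>
</w:p>
"""

if __name__ == "__main__":
    print(parse_fields(SAMPLE))
```

With a single on/off toggle, the first `end` in this sample would close the hyperlink after "7" and leave " for details" outside the link; with the stack, the trailing text stays inside the outer field.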
[parent not found: <BL3PR01MB7100D5D9E5DC4898EC9E1221C9129-58w3sbR9x38r+1VU5cxmT15F3wEVaoLpobIHt/V7iKVBDgjK7y7TUQ@public.gmane.org>]
* Re: docx parsing bug: nested fldChar fields are interpreted incorrectly [not found] ` <BL3PR01MB7100D5D9E5DC4898EC9E1221C9129-58w3sbR9x38r+1VU5cxmT15F3wEVaoLpobIHt/V7iKVBDgjK7y7TUQ@public.gmane.org> @ 2021-10-14 9:33 ` Milan Bracke [not found] ` <50bcbdc6-8d4b-49c1-badb-f35fb968112dn-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org> 0 siblings, 1 reply; 10+ messages in thread
From: Milan Bracke @ 2021-10-14 9:33 UTC (permalink / raw)
To: pandoc-discuss

[-- Attachment #1.1: Type: text/plain, Size: 8665 bytes --]

Hi Jesse,

I hope you had a good summer. Do you have time to look at my pull request in the coming weeks? I'm now using a fork of Pandoc to have this fix and I have to rebase every time something useful is done in the main repo, so I would really like to have this fix merged.

Best,
Milan

On Thursday, July 15, 2021 at 4:47:19 PM UTC+2 Jesse Rosenthal wrote:

> Hi Milan,
>
> Thanks for the heads up. Honestly just summer craziness: visiting family
> for the first time in almost two years, shuttling the kids around. Life
> stuff. I'll take a look at it ASAP.
>
> Best,
> Jesse
>
> ________________________________________
> From: pandoc-...-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org <pandoc-...-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org> on behalf
> of Milan Bracke <milan....-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> Sent: Thursday, July 15, 2021 10:43 AM
> To: pandoc-discuss
> Subject: Re: docx parsing bug: nested fldChar fields are interpreted
> incorrectly
>
> Hi all,
>
> I've had this pull request open for more than 3 weeks now:
> https://github.com/jgm/pandoc/pull/7401
> Is there a reason it's not getting any reaction? I'd be happy to improve
> or explain it. If I've done something wrong, I'd like to know, so I can
> fix it.
>
> Best,
> Milan
>
> On Thursday, June 17, 2021 at 8:42:48 AM UTC+2 Milan Bracke wrote:
> Hi Jesse,
>
> Thanks for the feedback. I'll ping you when making the PR. Most of my code
> seems to work so far, but I still have some trouble with the fact that the
> fields now need to contain ParParts instead of Runs. It's harder to match
> all the cases and treat them correctly. I'll try some more and let you
> know how it goes.
>
> Best,
> Milan
>
> On Wednesday, June 16, 2021 at 4:21:05 PM UTC+2 Jesse Rosenthal wrote:
> Hi Milan,
>
> I wrote the original fldChar code (and that comment) and I figured it
> would have to evolve as further requirements became necessary. If nesting
> is a requirement, a stack instead of a toggle seems appropriate.
>
> As far as crossing paragraphs goes -- your approach seems right (and
> similar to how we've dealt with similar issues like comments crossing
> paragraphs in docx parsing).
>
> I'd be happy to take a look and offer comments/feedback on your code. Just
> make sure to ping me (@jkr) on your PRs.
>
> Best,
> Jesse
>
> ________________________________________
> From: pandoc-...-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org <pandoc-...-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org> on behalf
> of Milan Bracke <milan....-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> Sent: Wednesday, June 16, 2021 5:33 AM
> To: pandoc-discuss
> Subject: Re: docx parsing bug: nested fldChar fields are interpreted
> incorrectly
>
> I can't fix this without at least some feedback. It's a complex issue and
> the fix will take some time, so I need to at least know that my proposed
> solution seems good and would be accepted if implemented correctly.
>
> On Tuesday, June 15, 2021 at 8:38:30 AM UTC+2 Milan Bracke wrote:
> I've encountered a new problem. A fldChar field can span multiple
> paragraphs, but it doesn't have to start at the beginning of the first one.
> Because of this, a field across multiple paragraphs will merge those
> paragraphs.
> There is no way to represent this exactly in the pandoc model I think. So
> my current solution is to have different fields with the same field
> info in the different paragraphs. This can at least make the hyperlink
> fields work and I think it will work for the other fields we might add in
> the future as well (I've checked the list).
> What do you think about this?

[-- Attachment #1.2: Type: text/html, Size: 16193 bytes --]

^ permalink raw reply [flat|nested] 10+ messages in thread
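The approach from the June 15 message quoted above (a separate field with the same field info in each paragraph the field touches) could look roughly like the following. This is a Python sketch under simplifying assumptions, not the code from the pull request: tokens are assumed to be already resolved to `("begin", info)`, `("end",)`, or plain text, and the name `split_fields_by_paragraph` is hypothetical.

```python
def split_fields_by_paragraph(paragraphs):
    """Split fields that cross paragraph boundaries into one field per paragraph.

    paragraphs: list of token lists; a token is a text string,
    ("begin", info), or ("end",).
    Returns, for each paragraph, a nested list of text strings and
    ("field", info, children) nodes.
    """
    open_infos = []  # info of every field still open, outermost first
    result = []
    for tokens in paragraphs:
        top = []
        frames = [top]  # frames[-1] is the list currently receiving content
        # Re-open every field carried over from the previous paragraph,
        # reusing the same field info.
        for info in open_infos:
            children = []
            frames[-1].append(("field", info, children))
            frames.append(children)
        for tok in tokens:
            if isinstance(tok, str):
                frames[-1].append(tok)
            elif tok[0] == "begin":
                children = []
                frames[-1].append(("field", tok[1], children))
                frames.append(children)
                open_infos.append(tok[1])
            elif tok[0] == "end" and open_infos:
                frames.pop()
                open_infos.pop()
        # Fields still open here simply stop at the paragraph boundary;
        # open_infos remembers them for the next paragraph.
        result.append(top)
    return result


if __name__ == "__main__":
    paras = [
        ["Intro ", ("begin", "HYPERLINK u"), "first part"],
        ["second part", ("end",), " tail"],
    ]
    print(split_fields_by_paragraph(paras))
```

The paragraphs stay separate and each one gets its own hyperlink node, which matches the trade-off described in the message: the result is not an exact representation of one field spanning two paragraphs, but the link covers the same text.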
[parent not found: <50bcbdc6-8d4b-49c1-badb-f35fb968112dn-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>]
* Re: docx parsing bug: nested fldChar fields are interpreted incorrectly [not found] ` <50bcbdc6-8d4b-49c1-badb-f35fb968112dn-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org> @ 2021-10-17 11:02 ` 'Jesse Rosenthal' via pandoc-discuss [not found] ` <BL3PR01MB7100280CEDC0AA33F788D8C9C9BB9-58w3sbR9x38r+1VU5cxmT15F3wEVaoLpobIHt/V7iKVBDgjK7y7TUQ@public.gmane.org> 0 siblings, 1 reply; 10+ messages in thread
From: 'Jesse Rosenthal' via pandoc-discuss @ 2021-10-17 11:02 UTC (permalink / raw)
To: pandoc-discuss

Dear Milan,

Just commented on github. This looks good to me. I apologize for the long wait here, and for taking so long to turn my attention to this.

Thanks for making this work, and for sharing it with everyone else. Sorry to stand in the way of that process being a bit smoother.

Best,
Jesse
^ permalink raw reply [flat|nested] 10+ messages in thread
[parent not found: <BL3PR01MB7100280CEDC0AA33F788D8C9C9BB9-58w3sbR9x38r+1VU5cxmT15F3wEVaoLpobIHt/V7iKVBDgjK7y7TUQ@public.gmane.org>]
* Re: docx parsing bug: nested fldChar fields are interpreted incorrectly [not found] ` <BL3PR01MB7100280CEDC0AA33F788D8C9C9BB9-58w3sbR9x38r+1VU5cxmT15F3wEVaoLpobIHt/V7iKVBDgjK7y7TUQ@public.gmane.org> @ 2021-10-18 7:07 ` Milan Bracke 0 siblings, 0 replies; 10+ messages in thread
From: Milan Bracke @ 2021-10-18 7:07 UTC (permalink / raw)
To: pandoc-discuss

[-- Attachment #1.1: Type: text/plain, Size: 14038 bytes --]

No worries. Thanks for taking a look. I responded to your question in the pull request.
https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgroups.google.com%2Fd%2Fmsgid%2Fpandoc-discuss%2F9bdb337d-fa68-4c66-8f5c-d4fa81547953n%2540googlegroups.com%3Futm_medium%3Demail%26utm_source%3Dfooter&data=04%7C01%7Cjrosenthal%40jhu.edu%7C48a0aa14005e4819a4c008d98ef5b5f0%7C9fa4f438b1e6473b803f86f8aedf0dec%7C0%7C0%7C637698008257412739%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=yEe8Bn%2BylkLhfB9skaPVxsxI0ngxAQKWjX6C04jlfWw%3D&reserved=0 > >>>. > > -- > You received this message because you are subscribed to the Google Groups > "pandoc-discuss" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to pandoc-discus...-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org<mailto: > pandoc-discus...-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>. > To view this discussion on the web visit > https://groups.google.com/d/msgid/pandoc-discuss/24273fbf-2ce9-4c26-886b-50d504cb7b05n%40googlegroups.com > < > https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgroups.google.com%2Fd%2Fmsgid%2Fpandoc-discuss%2F24273fbf-2ce9-4c26-886b-50d504cb7b05n%2540googlegroups.com&data=04%7C01%7Cjrosenthal%40jhu.edu%7C48a0aa14005e4819a4c008d98ef5b5f0%7C9fa4f438b1e6473b803f86f8aedf0dec%7C0%7C0%7C637698008257412739%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=QuoRTqPLxb%2FS0L5JVpkdfc2Qg14lS%2FKS7Yvtzwm4cfg%3D&reserved=0 > >< > 
https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgroups.google.com%2Fd%2Fmsgid%2Fpandoc-discuss%2F24273fbf-2ce9-4c26-886b-50d504cb7b05n%2540googlegroups.com%3Futm_medium%3Demail%26utm_source%3Dfooter&data=04%7C01%7Cjrosenthal%40jhu.edu%7Ca0b9fef0818d4fa2010808d9479efa08%7C9fa4f438b1e6473b803f86f8aedf0dec%7C0%7C0%7C637619570850852858%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=%2BFQ5SkpbfzgZ7yLWGIi9uTHBuMaN9nBZzA%2Ffzwt4XnU%3D&reserved=0 > < > https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgroups.google.com%2Fd%2Fmsgid%2Fpandoc-discuss%2F24273fbf-2ce9-4c26-886b-50d504cb7b05n%2540googlegroups.com%3Futm_medium%3Demail%26utm_source%3Dfooter&data=04%7C01%7Cjrosenthal%40jhu.edu%7C48a0aa14005e4819a4c008d98ef5b5f0%7C9fa4f438b1e6473b803f86f8aedf0dec%7C0%7C0%7C637698008257422689%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=eY8ikgAGkbiXevRpp4lXXfUT71j0tATVP%2BQuJfd%2BGE8%3D&reserved=0 > >>. > > -- > You received this message because you are subscribed to the Google Groups > "pandoc-discuss" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to pandoc-discus...-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org<mailto: > pandoc-discus...-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>. 
> To view this discussion on the web visit > https://groups.google.com/d/msgid/pandoc-discuss/50bcbdc6-8d4b-49c1-badb-f35fb968112dn%40googlegroups.com > < > https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgroups.google.com%2Fd%2Fmsgid%2Fpandoc-discuss%2F50bcbdc6-8d4b-49c1-badb-f35fb968112dn%2540googlegroups.com%3Futm_medium%3Demail%26utm_source%3Dfooter&data=04%7C01%7Cjrosenthal%40jhu.edu%7C48a0aa14005e4819a4c008d98ef5b5f0%7C9fa4f438b1e6473b803f86f8aedf0dec%7C0%7C0%7C637698008257432648%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=VihSAzBEbJslRknO%2FoJs32DAyA39iUnoN2alFLB2mfU%3D&reserved=0 > >. > -- You received this message because you are subscribed to the Google Groups "pandoc-discuss" group. To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/d537ca16-e31c-4b90-ab6d-6a5e80200618n%40googlegroups.com. [-- Attachment #1.2: Type: text/html, Size: 32119 bytes --] ^ permalink raw reply [flat|nested] 10+ messages in thread
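The stack idea proposed in the quoted message can be illustrated with a minimal sketch. This is not pandoc's actual implementation (which is in Haskell); the event tuples, the `Field` class, and the `parse_fields` and `render` helpers are all hypothetical names chosen for illustration, and the Markdown-style link output is only an assumption to make the result visible.

```python
from dataclasses import dataclass, field


@dataclass
class Field:
    """State of one open fldChar field (hypothetical helper type)."""
    instr: str = ""            # accumulated <w:instrText> content
    in_result: bool = False    # True once the "separate" fldChar was seen
    text: list = field(default_factory=list)  # content runs of the field


def render(f):
    """Render a finished field; only HYPERLINK is treated specially here."""
    body = "".join(f.text)
    parts = f.instr.split()
    if len(parts) >= 2 and parts[0] == "HYPERLINK":
        return "[{}]({})".format(body, parts[1].strip('"'))
    return body  # other fields (e.g. PAGEREF) just contribute their text


def parse_fields(events):
    """events: (kind, value) pairs, kind in begin/instr/separate/text/end."""
    stack, out = [], []
    for kind, value in events:
        if kind == "begin":
            stack.append(Field())          # a nested field gets its own state
        elif kind == "instr" and stack:
            stack[-1].instr += value
        elif kind == "separate" and stack:
            stack[-1].in_result = True
        elif kind == "text":
            if not stack:
                out.append(value)
            elif stack[-1].in_result:
                stack[-1].text.append(value)
        elif kind == "end" and stack:
            rendered = render(stack.pop())  # closes only the innermost field
            if not stack:
                out.append(rendered)
            elif stack[-1].in_result:
                stack[-1].text.append(rendered)
    return "".join(out)


events = [
    ("begin", None),
    ("instr", ' HYPERLINK "https://example.com" '),
    ("separate", None),
    ("text", "See page "),
    ("begin", None),
    ("instr", " PAGEREF _Toc1 \\h "),
    ("separate", None),
    ("text", "7"),
    ("end", None),
    ("text", " of the book"),
    ("end", None),
]
print(parse_fields(events))
```

With a single shared field state, the inner "end" would close the hyperlink after "7"; with the stack, the inner "end" only pops the PAGEREF field and the hyperlink keeps collecting text until its own "end".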
end of thread, other threads:[~2021-10-18 7:07 UTC | newest]

Thread overview: 10+ messages
2021-06-14  7:17 docx parsing bug: nested fldChar fields are interpreted incorrectly Milan Bracke
2021-06-15  6:38 ` Milan Bracke
2021-06-16  9:33   ` Milan Bracke
2021-06-16 14:21     ` Jesse Rosenthal
2021-06-17  6:42       ` Milan Bracke
2021-07-15 14:43         ` Milan Bracke
2021-07-15 14:46           ` 'Jesse Rosenthal' via pandoc-discuss
2021-10-14  9:33             ` Milan Bracke
2021-10-17 11:02               ` 'Jesse Rosenthal' via pandoc-discuss
2021-10-18  7:07                 ` Milan Bracke