public inbox archive for pandoc-discuss@googlegroups.com
From: Milan Bracke <milan.bracke-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
To: pandoc-discuss <pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
Subject: Re: docx parsing bug: nested fldChar fields are interpreted incorrectly
Date: Wed, 16 Jun 2021 23:42:48 -0700 (PDT)
Message-ID: <a8b2c5dd-ca1d-494c-86ca-e4ad90544c5an@googlegroups.com>
In-Reply-To: <DM6PR01MB4650D9807C8EE5F33B3D5278C90F9-cCDycDV5LeFzw2JhukMBoF5F3wEVaoLpobIHt/V7iKVBDgjK7y7TUQ@public.gmane.org>

Hi Jesse,

Thanks for the feedback. I'll ping you when I open the PR. Most of my code
seems to work so far, but I'm still having some trouble with the fact that
fields now need to contain ParParts instead of Runs, which makes it harder
to match all the cases and treat them correctly. I'll keep trying and let
you know how it goes.

Best,
Milan
On Wednesday, June 16, 2021 at 4:21:05 PM UTC+2 Jesse Rosenthal wrote:

> Hi Milan,
>
> I wrote the original fldChar code (and that comment) and I figured it 
> would have to evolve as further requirements became necessary. If nesting 
> is a requirement, a stack instead of a toggle seems appropriate.
>
> As far as crossing paragraphs goes -- your approach seems right (and 
> similar to how we've dealt with similar issues like comments crossing 
> paragraphs in docx parsing).
>
> I'd be happy to take a look and offer comments/feedback on your code. Just 
> make sure to ping me (@jkr) on your PRs.
>
> Best,
> Jesse
>
> ________________________________________
> From: pandoc-...-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org <pandoc-...-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org> on behalf 
> of Milan Bracke <milan....-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> Sent: Wednesday, June 16, 2021 5:33 AM
> To: pandoc-discuss
> Subject: Re: docx parsing bug: nested fldChar fields are interpreted 
> incorrectly
>
> I can't fix this without at least some feedback. It's a complex issue and 
> the fix will take some time, so I need to know at least that my proposed 
> solution seems good and would be accepted if implemented correctly.
>
> On Tuesday, June 15, 2021 at 8:38:30 AM UTC+2 Milan Bracke wrote:
> I've encountered a new problem. A fldChar field can span multiple 
> paragraphs, and it doesn't have to start at the beginning of the first one.
> Because of this, parsing a field that spans multiple paragraphs as one unit 
> merges those paragraphs.
> I don't think there is a way to represent this exactly in the pandoc model, 
> so my current solution is to put separate fields with the same field info 
> in each of the paragraphs. This at least makes hyperlink fields work, and I 
> think it will also work for the other fields we might add in the future 
> (I've checked the list).
> What do you think about this?
>
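To make the per-paragraph idea concrete, here is a minimal sketch in Python (pandoc itself is written in Haskell, and none of this is pandoc's actual code). It assumes the runs have already been flattened into hypothetical `(kind, payload)` events, ignores nesting, and uses a placeholder URL. Field state is carried across paragraph boundaries, and each paragraph gets its own piece tagged with the same instruction text:

```python
def split_field_across_paragraphs(paragraphs):
    """paragraphs: list of event lists. Returns one item list per paragraph.

    Items are plain strings, or (instr, text) pairs for field content.
    Hypothetical helper for illustration only; nesting is not handled.
    """
    out = []
    instr = None        # instruction text of the currently open field, if any
    in_result = False   # True once the "separate" marker has been seen
    for events in paragraphs:
        items = []
        for kind, payload in events:
            if kind == "begin":
                instr, in_result = "", False
            elif kind == "instr" and instr is not None:
                instr += payload
            elif kind == "separate" and instr is not None:
                in_result = True
            elif kind == "end":
                instr, in_result = None, False
            elif kind == "text":
                if instr is None:
                    items.append(payload)
                elif in_result:
                    # Each paragraph gets its own piece with the same instr.
                    items.append((instr, payload))
        out.append(items)   # the open field, if any, continues in the next paragraph
    return out


LINK = 'HYPERLINK "https://example.org/"'   # placeholder instruction
doc = [
    [("text", "Intro "), ("begin", None), ("instr", LINK),
     ("separate", None), ("text", "first part")],
    [("text", "second part"), ("end", None), ("text", " tail")],
]
pieces = split_field_across_paragraphs(doc)
```

Here the field starts in the middle of the first paragraph and ends in the middle of the second. Both paragraphs survive as separate lists, and each contains one piece carrying the same `LINK` instruction.
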
> On Monday, June 14, 2021 at 9:17:13 AM UTC+2 Milan Bracke wrote:
> For those who don't know fldChar fields, this comment from the docx parse 
> code (parse.hs, starting on line 825) explains it:
>
> fldChar fields work by first
> having a <w:fldChar fldCharType="begin"> in a run, then a run with
> <w:instrText>, then a <w:fldChar fldCharType="separate"> run, then the
> content runs, and finally a <w:fldChar fldCharType="end"> run. For
> example (omissions and my comments in brackets):
>
> <w:r>
> [...]
> <w:fldChar w:fldCharType="begin"/>
> </w:r>
> <w:r>
> [...]
> <w:instrText xml:space="preserve"> HYPERLINK [hyperlink url] </w:instrText>
> </w:r>
> <w:r>
> [...]
> <w:fldChar w:fldCharType="separate"/>
> </w:r>
> <w:r w:rsidRPr=[...]>
> [...]
> <w:t>Foundations of Analysis, 2nd Edition</w:t>
> </w:r>
> <w:r>
> [...]
> <w:fldChar w:fldCharType="end"/>
> </w:r>
>
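As a rough illustration of the run sequence the comment describes, the sketch below walks such a paragraph with Python's standard `xml.etree.ElementTree`. It is a toy under stated assumptions, not pandoc's Haskell parser: the `[...]` omissions are dropped, the URL is a placeholder, `read_simple_field` is a hypothetical name, and only one non-nested field is handled.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace bound to the "w" prefix in docx files.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

SAMPLE = f"""
<w:p xmlns:w="{W}">
  <w:r><w:fldChar w:fldCharType="begin"/></w:r>
  <w:r><w:instrText xml:space="preserve"> HYPERLINK "https://example.org/" </w:instrText></w:r>
  <w:r><w:fldChar w:fldCharType="separate"/></w:r>
  <w:r><w:t>Foundations of Analysis, 2nd Edition</w:t></w:r>
  <w:r><w:fldChar w:fldCharType="end"/></w:r>
</w:p>
"""


def read_simple_field(paragraph_xml):
    """Return (instruction, displayed_text) for one non-nested field."""
    state = "outside"   # outside -> instr -> result -> outside
    instr, shown = [], []
    root = ET.fromstring(paragraph_xml.strip())
    for run in root.iter(f"{{{W}}}r"):
        for child in run:
            if child.tag == f"{{{W}}}fldChar":
                kind = child.get(f"{{{W}}}fldCharType")
                state = {"begin": "instr",
                         "separate": "result",
                         "end": "outside"}[kind]
            elif child.tag == f"{{{W}}}instrText" and state == "instr":
                instr.append(child.text or "")
            elif child.tag == f"{{{W}}}t" and state == "result":
                shown.append(child.text or "")
    return "".join(instr).strip(), "".join(shown)


instruction, text = read_simple_field(SAMPLE)
```

The single `state` variable plays the role of the toggle: any `end` marker resets it to `outside`, whichever field that marker belongs to.
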
> The current way of parsing fldChar fields doesn't take into account that 
> they can be nested, so the end of a nested fldChar field is interpreted as 
> the end of the surrounding one. This could, for example, lead to a 
> hyperlink that ends too soon. See the attached example for a docx that 
> demonstrates this.
>
> I propose to fix this by turning the fldChar state into a stack, so that a 
> field can be started and ended inside other fields. I will include this in 
> my pull request for PAGEREF fields that I announced here a while ago, since 
> they are related.
>
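The stack idea can be sketched roughly as follows, in Python rather than pandoc's Haskell. The `Field` class, the `(kind, payload)` event format, and the instruction strings are all hypothetical and only serve the illustration. A closed field is attached to its parent wherever it occurs, which is a simplification.

```python
from dataclasses import dataclass, field


@dataclass
class Field:
    instr: str = ""
    in_result: bool = False                       # True after "separate"
    content: list = field(default_factory=list)   # str or nested Field


def parse_fields(events):
    """Turn a flat event list into nested Field values using a stack."""
    top = []     # items outside any field
    stack = []   # currently open fields, innermost last
    for kind, payload in events:
        if kind == "begin":
            stack.append(Field())
        elif kind == "instr" and stack:
            stack[-1].instr += payload
        elif kind == "separate" and stack:
            stack[-1].in_result = True
        elif kind == "end" and stack:
            done = stack.pop()
            # A closed field becomes content of its parent, if there is one.
            (stack[-1].content if stack else top).append(done)
        elif kind == "text":
            if not stack:
                top.append(payload)
            elif stack[-1].in_result:
                stack[-1].content.append(payload)
    return top


events = [
    ("text", "See "),
    ("begin", None), ("instr", 'HYPERLINK "https://example.org/"'),
    ("separate", None), ("text", "page "),
    ("begin", None), ("instr", "PAGEREF sec1"),
    ("separate", None), ("text", "12"), ("end", None),
    ("text", " of the book"), ("end", None),
    ("text", "."),
]
tree = parse_fields(events)
```

With a single toggle, the inner `end` would close the hyperlink and " of the book" would fall outside it. With the stack, the inner `end` only pops the PAGEREF field, so the trailing text stays inside the hyperlink.
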
