From: Milan Bracke
Newsgroups: gmane.text.pandoc
Subject: Re: docx parsing bug: nested fldChar fields are interpreted incorrectly
Date: Wed, 16 Jun 2021 23:42:48 -0700 (PDT)
To: pandoc-discuss

Hi Jesse,

Thanks for the feedback. I'll ping you when making the PR. Most of my code seems to work so far, but I still
have some trouble with the fact that the fields now need to contain ParParts instead of Runs. It's harder to
match all the cases and treat them correctly. I'll try some more and let you know how it goes.

Best,
Milan
On Wednesday, June 16, 2021 at 4:21:05 PM UTC+2 Jesse Rosenthal wrote:
Hi Milan,

I wrote the original fldChar code (and that comment) and I figured it would have to evolve as further requirements became necessary. If nesting is a requirement, a stack instead of a toggle seems appropriate.

As far as crossing paragraphs goes -- your approach seems right (and similar to how we've dealt with similar issues like comments crossing paragraphs in docx parsing).

I'd be happy to take a look and offer comments/feedback on your code. Just make sure to ping me (@jkr) on your PRs.

Best,
Jesse

________________________________________
From: pandoc-...@googlegroups.com <pandoc-...@googlegroups.com> on behalf of Milan Bracke <milan....-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Sent: Wednesday, June 16, 2021 5:33 AM
To: pandoc-discuss
Subject: Re: docx parsing bug: nested fldChar fields are interpreted incorrectly

I can't fix this without at least some feedback. It's a complex issue and the fix will take some time, so I need to at least know that my proposed solution
seems good and would be accepted if implemented correctly.

On Tuesday, June 15, 2021 at 8:38:30 AM UTC+2 Milan Bracke wrote:
I've encountered a new problem. A fldChar field can span multiple paragraphs, but it doesn't have to start at the beginning of the first one.
Because of this, a field that spans multiple paragraphs ends up merging those paragraphs.
As far as I can tell, there is no way to represent this exactly in the pandoc model. So my current solution is to have different fields with the same field
info in the different paragraphs. This can at least make the hyperlink fields work, and I think it will work for the other fields we might add in
the future as well (I've checked the list).
What do you think about this?
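
The per-paragraph splitting described above can be sketched roughly as follows. This is only an illustration in Python, not pandoc's Haskell code: the token shape (`"begin"`, `"text"`, `"end"`), the function name and the example URL are all hypothetical, and fields are assumed not to nest. Each paragraph keeps its own pieces, and pieces inside the field carry the same field info on both sides of the paragraph break.

```python
def split_field_over_paragraphs(paragraphs):
    """Tag each text piece with the instruction of the field it sits in.

    `paragraphs` is a list of paragraphs; each paragraph is a list of
    tokens: ("begin", instr), ("text", s) or ("end", None). This token
    shape is hypothetical. Fields are assumed not to nest here.
    """
    current = None  # instruction of the open field, carried across paragraphs
    result = []
    for para in paragraphs:
        pieces = []
        for kind, payload in para:
            if kind == "begin":
                current = payload
            elif kind == "end":
                current = None
            else:  # "text"
                pieces.append((current, payload))
        result.append(pieces)  # one output paragraph per input paragraph
    return result


# A hyperlink field that starts mid-paragraph and ends in the next one.
paras = [
    [("text", "Intro "), ("begin", "HYPERLINK https://example.com"), ("text", "first part")],
    [("text", "second part"), ("end", None), ("text", " tail")],
]
out = split_field_over_paragraphs(paras)
```

The two paragraphs stay separate, and both "first part" and "second part" are tagged with the same hyperlink instruction, so each could become its own link.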

On Monday, June 14, 2021 at 9:17:13 AM UTC+2 Milan Bracke wrote:
For those who don't know fldChar fields, this comment from the docx parse code (parse.hs, starting on line 825) explains it:

fldChar fields work by first
having a <w:fldChar fldCharType="begin"> in a run, then a run with
<w:instrText>, then a <w:fldChar fldCharType="separate"> run, then the
content runs, and finally a <w:fldChar fldCharType="end"> run. For
example (omissions and my comments in brackets):

<w:r>
  [...]
  <w:fldChar w:fldCharType="begin"/>
</w:r>
<w:r>
  [...]
  <w:instrText xml:space="preserve"> HYPERLINK [hyperlink url] </w:instrText>
</w:r>
<w:r>
  [...]
  <w:fldChar w:fldCharType="separate"/>
</w:r>
<w:r w:rsidRPr=[...]>
  [...]
  <w:t>Foundations of Analysis, 2nd Edition</w:t>
</w:r>
<w:r>
  [...]
  <w:fldChar w:fldCharType="end"/>
</w:r>

The current way of parsing fldChar fields doesn't take into account that they can be nested. So the end of the nested fldChar field will be interpreted as the end of the surrounding one. This could, for example, lead to a hyperlink that ends too soon. See the attached example for a docx that demonstrates this.

I propose to fix this by turning the fldChar state into a stack, so that a field can be started and ended inside other fields. I will include this in my pull request for PAGEREF fields that I announced here a while ago, since they are related.
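
As a rough illustration of the stack idea, here is a minimal sketch in Python rather than pandoc's Haskell. The event kinds (`"begin"`, `"instr"`, `"separate"`, `"text"`, `"end"`), the `Field` class and `parse_fields` are hypothetical simplifications of the runs shown in the XML above, and the sketch assumes a nested field simply becomes part of its parent's content.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Field:
    """One fldChar field: its instruction text and its displayed content."""
    instr: str = ""
    content: List[Union[str, "Field"]] = field(default_factory=list)
    in_content: bool = False  # becomes True once "separate" has been seen


def parse_fields(events):
    """Build nested fields from (kind, payload) events using a stack."""
    top: List[Union[str, Field]] = []  # items outside any field
    stack: List[Field] = []            # currently open fields, innermost last
    for kind, payload in events:
        if kind == "begin":
            stack.append(Field())
        elif kind == "instr":
            if stack and not stack[-1].in_content:
                stack[-1].instr += payload
        elif kind == "separate":
            if stack:
                stack[-1].in_content = True
        elif kind == "text":
            (stack[-1].content if stack else top).append(payload)
        elif kind == "end":
            if stack:
                done = stack.pop()
                # A finished field belongs to its parent, if one is still open.
                (stack[-1].content if stack else top).append(done)
    return top


# A PAGEREF field nested inside the content of a HYPERLINK field.
events = [
    ("begin", None), ("instr", ' HYPERLINK "https://example.com" '), ("separate", None),
    ("text", "see page "),
    ("begin", None), ("instr", " PAGEREF _Toc1 "), ("separate", None),
    ("text", "12"), ("end", None),
    ("text", " for details"),
    ("end", None),
]
result = parse_fields(events)
```

With a single toggle, the inner "end" would close the hyperlink and " for details" would fall outside it. With the stack, the inner "end" only pops the PAGEREF field, so the trailing text stays inside the hyperlink.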

--
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org.
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/a8b2c5dd-ca1d-494c-86ca-e4ad90544c5an%40googlegroups.com.