From: Milan Bracke
Newsgroups: gmane.text.pandoc
Subject: Re: docx parsing bug: nested fldChar fields are interpreted incorrectly
Date: Thu, 15 Jul 2021 07:43:52 -0700 (PDT)
Message-ID: <24273fbf-2ce9-4c26-886b-50d504cb7b05n@googlegroups.com>
References: <2f5489af-f5a9-4ea4-9155-9f85c4808756n@googlegroups.com> <9bdb337d-fa68-4c66-8f5c-d4fa81547953n@googlegroups.com>
To: pandoc-discuss

Hi all,

I've had this pull request open for more than 3 weeks now: https://github.com/jgm/pandoc/pull/7401
Is there a reason it's not getting any reaction? I'd be happy to improve or explain it. If I've done something
wrong, I'd like to know, so I can fix it.

Best,
Milan

On Thursday, June 17, 2021 at 8:42:48 AM UTC+2 Milan Bracke wrote:
Hi Jesse,

Thanks for the feedback. I'll ping you when making the PR. Most of my code seems to work so far, but I still
have some trouble with the fact that the fields now need to contain ParParts instead of Runs. It's harder to
match all the cases and treat them correctly. I'll try some more and let you know how it goes.

Best,
Milan
On Wednesday, June 16, 2021 at 4:21:05 PM UTC+2 Jesse Rosenthal wrote:
Hi Milan,

I wrote the original fldChar code (and that comment) and I figured it would have to evolve as further requirements became necessary. If nesting is a requirement, a stack instead of a toggle seems appropriate.

As far as crossing paragraphs goes -- your approach seems right (and similar to how we've dealt with similar issues like comments crossing paragraphs in docx parsing).

I'd be happy to take a look and offer comments/feedback on your code. Just make sure to ping me (@jkr) on your PRs.

Best,
Jesse

________________________________________
From: pandoc-...-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org <pandoc-...-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org> on behalf of Milan Bracke <milan....-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Sent: Wednesday, June 16, 2021 5:33 AM
To: pandoc-discuss
Subject: Re: docx parsing bug: nested fldChar fields are interpreted incorrectly

I can't fix this without at least some feedback. It's a complex issue and the fix will take some time, so I need to at least know that my proposed solution
seems good and would be accepted if implemented correctly.

On Tuesday, June 15, 2021 at 8:38:30 AM UTC+2 Milan Bracke wrote:
I've encountered a new problem. A fldChar field can span multiple paragraphs, but it doesn't have to start at the beginning of the first one.
Because of this, a field across multiple paragraphs will merge those paragraphs.
There is no way to represent this exactly in the pandoc model I think. So my current solution is to have different fields with the same field
info in the different paragraphs. This can at least make the hyperlink fields work and I think it will work for the other fields we might add in
the future as well (I've checked the list).
What do you think about this?
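
A rough illustration of the "same field info in each paragraph" idea, written in Python rather than pandoc's Haskell. It is only a sketch under assumptions: the event names ("begin", "instr", "separate", "end", "text"), the function name and the output shape are all hypothetical and not part of pandoc, and it handles a single level of fields only.

```python
def split_field_by_paragraph(paragraphs):
    """Split a field that crosses paragraph boundaries into one piece per paragraph.

    paragraphs: list of paragraphs, each a list of (kind, text) events.
    Returns one list per paragraph whose items are plain strings or
    (instr, [strings]) pairs. A field still open at a paragraph boundary is
    closed there and reopened, with the same instr, in the next paragraph.
    """
    out = []
    instr = None       # instruction text of the open field, if any
    in_result = False  # True between "separate" and "end"
    for events in paragraphs:
        para = []
        # Field content collected inside this paragraph only.
        piece = [] if in_result else None
        for kind, text in events:
            if kind == "begin":
                instr, in_result = "", False
            elif kind == "instr" and instr is not None and not in_result:
                instr += text
            elif kind == "separate" and instr is not None:
                in_result, piece = True, []
            elif kind == "end":
                if piece:
                    para.append((instr, piece))
                instr, in_result, piece = None, False, None
            elif kind == "text":
                (piece if piece is not None else para).append(text)
        if piece:  # the field continues into the next paragraph
            para.append((instr, piece))
        out.append(para)
    return out


# Hypothetical input: a hyperlink field that starts mid-paragraph
# and ends in the following paragraph.
paragraphs = [
    [("text", "Intro "), ("begin", ""),
     ("instr", ' HYPERLINK "https://example.com" '),
     ("separate", ""), ("text", "first part")],
    [("text", "second part"), ("end", ""), ("text", " outro")],
]
split = split_field_by_paragraph(paragraphs)
```

Each paragraph keeps its own boundaries, and both pieces carry the same instruction text, so a hyperlink would simply be emitted twice with the same target.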

On Monday, June 14, 2021 at 9:17:13 AM UTC+2 Milan Bracke wrote:
For those who don't know fldChar fields, this comment from the docx parse code (parse.hs, starting on line 825) explains it:

fldChar fields work by first
having a <w:fldChar fldCharType="begin"> in a run, then a run with
<w:instrText>, then a <w:fldChar fldCharType="separate"> run, then the
content runs, and finally a <w:fldChar fldCharType="end"> run. For
example (omissions and my comments in brackets):

<w:r>
  [...]
  <w:fldChar w:fldCharType="begin"/>
</w:r>
<w:r>
  [...]
  <w:instrText xml:space="preserve"> HYPERLINK [hyperlink url] </w:instrText>
</w:r>
<w:r>
  [...]
  <w:fldChar w:fldCharType="separate"/>
</w:r>
<w:r w:rsidRPr=[...]>
  [...]
  <w:t>Foundations of Analysis, 2nd Edition</w:t>
</w:r>
<w:r>
  [...]
  <w:fldChar w:fldCharType="end"/>
</w:r>

The current way of parsing fldChar fields doesn't take into account that they can be nested. So the end of the nested fldChar field will be interpreted as the end of the surrounding one. This could for example lead to a hyperlink that ends too soon. See attached example for a docx that demonstrates this.

I propose to fix this by turning the fldChar state into a stack, so that a field can be started and ended inside other fields. I will include this in my pull request for PAGEREF fields that I announced here a while ago, since they are related.
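
To make the stack idea concrete, here is a small sketch in Python (not pandoc's Haskell, and not the code from any pull request). The runs are assumed to have been reduced to (kind, text) events already; the event names, the Field class and parse_runs are hypothetical names chosen for the illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Field:
    """One fldChar field: its instruction text and its displayed content."""
    instr: str = ""
    content: list = field(default_factory=list)  # strings or nested Field
    in_result: bool = False  # True once the "separate" marker was seen


def parse_runs(events):
    """Turn a flat sequence of run events into a tree of nested fields.

    events: iterable of (kind, text) pairs, where kind is one of
    "begin", "instr", "separate", "end" or "text".
    """
    top = []    # content outside any field
    stack = []  # currently open fields, innermost last
    for kind, text in events:
        if kind == "begin":
            stack.append(Field())
        elif kind == "instr":
            if stack and not stack[-1].in_result:
                stack[-1].instr += text
        elif kind == "separate":
            if stack:
                stack[-1].in_result = True
        elif kind == "end":
            if stack:
                # "end" closes only the innermost open field.
                done = stack.pop()
                (stack[-1].content if stack else top).append(done)
        else:  # "text"
            (stack[-1].content if stack else top).append(text)
    return top


# Hypothetical input: a PAGEREF field nested inside a HYPERLINK field.
events = [
    ("begin", ""), ("instr", ' HYPERLINK "https://example.com" '),
    ("separate", ""), ("text", "See page "),
    ("begin", ""), ("instr", " PAGEREF ref1 "), ("separate", ""),
    ("text", "12"), ("end", ""),
    ("text", " for details"), ("end", ""),
    ("text", " tail"),
]
tree = parse_runs(events)
```

With a single toggle, the first "end" would close the hyperlink right after "12". With the stack, that "end" pops only the inner PAGEREF field, and " for details" still lands inside the hyperlink.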

--
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org.
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/24273fbf-2ce9-4c26-886b-50d504cb7b05n%40googlegroups.com.