From: Milan Bracke
Newsgroups: gmane.text.pandoc
To: pandoc-discuss
Subject: Re: docx parsing bug: nested fldChar fields are interpreted incorrectly
Date: Mon, 14 Jun 2021 23:38:30 -0700 (PDT)
Message-ID: <2f5489af-f5a9-4ea4-9155-9f85c4808756n@googlegroups.com>

I've encountered a new problem. A fldChar field can span multiple paragraphs, but it doesn't have to start at the beginning of the first one.
Because of this, treating a field that spans multiple paragraphs as a single unit would merge those paragraphs.
I don't think there is a way to represent this exactly in the pandoc model. So my current solution is to have separate fields with the same field
info in each of the paragraphs. This at least makes the hyperlink fields work, and I think it will also work for the other fields we might add in
the future (I've checked the list).
What do you think about this?
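
A minimal sketch of the per-paragraph split described above, written in Python rather than pandoc's Haskell. The token model, the `Field` class and the function name are hypothetical and only serve the illustration; nested fields are ignored here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union

# Hypothetical token model (not pandoc's types): a paragraph is a list of
# (kind, value) pairs; kind is "begin", "instr", "separate", "end" or "text".
Token = Tuple[str, str]


@dataclass
class Field:
    info: str           # field instruction, e.g. "HYPERLINK http://example.com"
    content: List[str]  # the part of the field text that lies in one paragraph


Inline = Union[str, Field]


def split_field_per_paragraph(paragraphs: List[List[Token]]) -> List[List[Inline]]:
    """Emit one Field per paragraph for a field that crosses paragraph breaks."""
    result: List[List[Inline]] = []
    info: Optional[str] = None  # instruction of the open field, if any
    in_content = False          # True once the "separate" marker was seen
    for para in paragraphs:
        inlines: List[Inline] = []
        # A field still open from the previous paragraph continues here as a
        # new Field that carries the same field info.
        current: Optional[Field] = (
            Field(info, []) if in_content and info is not None else None
        )
        for kind, value in para:
            if kind == "begin":
                info, in_content = "", False
            elif kind == "instr" and info is not None and not in_content:
                info += value
            elif kind == "separate" and info is not None:
                info = info.strip()
                in_content = True
                current = Field(info, [])
            elif kind == "end":
                if current is not None and current.content:
                    inlines.append(current)
                info, in_content, current = None, False, None
            elif kind == "text":
                if current is not None:
                    current.content.append(value)
                else:
                    inlines.append(value)
        # The paragraph ends while the field is still open: close this part.
        if current is not None and current.content:
            inlines.append(current)
        result.append(inlines)
    return result


paragraphs = [
    [("text", "See "), ("begin", ""), ("instr", " HYPERLINK http://example.com "),
     ("separate", ""), ("text", "first part")],
    [("text", "second part"), ("end", ""), ("text", " and more.")],
]
for inlines in split_field_per_paragraph(paragraphs):
    print(inlines)
```

Each paragraph keeps its own boundaries, and both parts of the link carry the same instruction, so a hyperlink still resolves in either paragraph.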

On Monday, June 14, 2021 at 9:17:13 AM UTC+2 Milan Bracke wrote:
For those who don't know fldChar fields, this comment from the docx parse code (parse.hs, starting on line 825) explains it:

fldChar fields work by first
having a <w:fldChar fldCharType="begin"> in a run, then a run with
<w:instrText>, then a <w:fldChar fldCharType="separate"> run, then the
content runs, and finally a <w:fldChar fldCharType="end"> run. For
example (omissions and my comments in brackets):

<w:r>
  [...]
  <w:fldChar w:fldCharType="begin"/>
</w:r>
<w:r>
  [...]
  <w:instrText xml:space="preserve"> HYPERLINK [hyperlink url] </w:instrText>
</w:r>
<w:r>
  [...]
  <w:fldChar w:fldCharType="separate"/>
</w:r>
<w:r w:rsidRPr=[...]>
  [...]
  <w:t>Foundations of Analysis, 2nd Edition</w:t>
</w:r>
<w:r>
  [...]
  <w:fldChar w:fldCharType="end"/>
</w:r>
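
The run sequence above can be read with a small state machine. The following is only a sketch in Python using the standard library's ElementTree, not pandoc's Haskell code: the sample paragraph is a cleaned-up variant of the example with a placeholder URL, and `read_single_field` is a hypothetical name. It handles exactly one non-nested field.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, bound to the "w" prefix in docx files.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Cleaned-up variant of the example: the "[...]" omissions are dropped and
# the URL is a placeholder.
SAMPLE = f"""
<w:p xmlns:w="{W}">
  <w:r><w:fldChar w:fldCharType="begin"/></w:r>
  <w:r><w:instrText xml:space="preserve"> HYPERLINK http://example.com </w:instrText></w:r>
  <w:r><w:fldChar w:fldCharType="separate"/></w:r>
  <w:r><w:t>Foundations of Analysis, 2nd Edition</w:t></w:r>
  <w:r><w:fldChar w:fldCharType="end"/></w:r>
</w:p>
"""


def read_single_field(paragraph_xml: str):
    """Return (instruction, displayed text) of one non-nested fldChar field."""
    state = "outside"  # outside -> instr -> content -> outside
    instruction, content = [], []
    root = ET.fromstring(paragraph_xml.strip())
    for run in root.iter(f"{{{W}}}r"):
        for child in run:
            if child.tag == f"{{{W}}}fldChar":
                kind = child.get(f"{{{W}}}fldCharType")
                if kind == "begin":
                    state = "instr"
                elif kind == "separate":
                    state = "content"
                elif kind == "end":
                    state = "outside"
            elif child.tag == f"{{{W}}}instrText" and state == "instr":
                instruction.append(child.text or "")
            elif child.tag == f"{{{W}}}t" and state == "content":
                content.append(child.text or "")
    return "".join(instruction).strip(), "".join(content)


print(read_single_field(SAMPLE))
```

Because there is a single `state` variable, the first "end" marker always closes the field, whichever field it belongs to.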

The current way of parsing fldChar fields doesn't take into account that they can be nested. So the end of the nested fldChar field will be interpreted as the end of the surrounding one. This could, for example, lead to a hyperlink that ends too soon. See attached example for a docx that demonstrates this.

I propose to fix this by turning the fldChar state into a stack, so that a field can be started and ended inside other fields. I will include this in my pull request for PAGEREF fields that I announced here a while ago, since they are related.
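
The stack idea can be sketched as follows, in Python rather than pandoc's Haskell. The token model, the `Field` class and `parse_nested_fields` are hypothetical names, and a field nested inside the instruction part of another field is simply dropped in this sketch.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union

# Hypothetical token model (not pandoc's types): (kind, value) pairs with
# kind in "begin", "instr", "separate", "end", "text".
Token = Tuple[str, str]


@dataclass
class Field:
    info: str = ""
    content: List[Union[str, "Field"]] = field(default_factory=list)


@dataclass
class _Open:
    node: Field
    in_content: bool = False  # False while still reading the instruction


def parse_nested_fields(tokens: List[Token]) -> List[Union[str, Field]]:
    top: List[Union[str, Field]] = []
    stack: List[_Open] = []  # innermost open field is stack[-1]

    def emit(item: Union[str, Field]) -> None:
        if not stack:
            top.append(item)
        elif stack[-1].in_content:
            stack[-1].node.content.append(item)
        # else: inside an instruction part; dropped in this sketch

    for kind, value in tokens:
        if kind == "begin":
            stack.append(_Open(Field()))
        elif kind == "instr" and stack and not stack[-1].in_content:
            stack[-1].node.info += value
        elif kind == "separate" and stack:
            stack[-1].in_content = True
        elif kind == "end" and stack:
            # Only the innermost field is closed; outer fields stay open.
            done = stack.pop().node
            done.info = done.info.strip()
            emit(done)
        elif kind == "text":
            emit(value)
    return top


tokens = [
    ("begin", ""), ("instr", " HYPERLINK http://example.com "), ("separate", ""),
    ("text", "Section 1, page "),
    ("begin", ""), ("instr", " PAGEREF sec1 "), ("separate", ""),
    ("text", "7"), ("end", ""),
    ("text", " (details)"), ("end", ""),
    ("text", " tail"),
]
print(parse_nested_fields(tokens))
```

With a single state instead of a stack, the first "end" would close the hyperlink and " (details)" would fall outside the link; here it stays inside, and only " tail" ends up after the field.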
