From: Milan Bracke
Newsgroups: gmane.text.pandoc
Subject: Re: docx parsing bug: nested fldChar fields are interpreted incorrectly
Date: Wed, 16 Jun 2021 02:33:49 -0700 (PDT)
Message-ID: <9bdb337d-fa68-4c66-8f5c-d4fa81547953n@googlegroups.com>
References: <2f5489af-f5a9-4ea4-9155-9f85c4808756n@googlegroups.com>
To: pandoc-discuss

I can't fix this without at least some feedback. It's a complex issue and the fix will take some time, so I need to at least know that my proposed solution
seems good and would be accepted if implemented correctly.

On Tuesday, June 15, 2021 at 8:38:30 AM UTC+2 Milan Bracke wrote:
I've encountered a new problem. A fldChar field can span multiple paragraphs, but it doesn't have to start at the beginning of the first one.
Because of this, a field across multiple paragraphs will merge those paragraphs.
I don't think there is a way to represent this exactly in the pandoc model, so my current solution is to have different fields with the same field
info in the different paragraphs. This at least makes the hyperlink fields work, and I think it will also work for the other fields we might add in
the future (I've checked the list).
What do you think about this?
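
As an illustration of the per-paragraph idea described above, here is a minimal Python sketch. It is not pandoc's code (the docx reader is written in Haskell), and it assumes a simplified model in which each paragraph is a list of hypothetical `(kind, value)` events and fields are not nested. The function name and the placeholder URL are made up for the example.

```python
def split_field_per_paragraph(paragraphs):
    """Give each paragraph its own copy of a field that spans several of them.

    `paragraphs` is a list of paragraphs, each a list of (kind, value) events.
    Returns, per paragraph, a list of (field_instruction_or_None, text) pieces.
    Assumption: fields are not nested in this simplified model.
    """
    current = None      # instruction of the field whose content we are inside
    pending = ""        # instruction text collected between begin and separate
    collecting = False
    result = []
    for events in paragraphs:
        pieces = []
        for kind, value in events:
            if kind == "begin":
                collecting, pending = True, ""
            elif kind == "instr" and collecting:
                pending += value
            elif kind == "separate":
                collecting, current = False, pending.strip()
            elif kind == "end":
                collecting, current = False, None
            elif kind == "text":
                pieces.append((current, value))
        # An open field is not closed here: it carries over, so the next
        # paragraph gets its own piece with the same field info.
        result.append(pieces)
    return result


LINK = 'HYPERLINK "https://example.com"'   # placeholder instruction
PARAGRAPHS = [
    [("text", "Intro, then "), ("begin", None), ("instr", f" {LINK} "),
     ("separate", None), ("text", "a link that starts here")],
    [("text", "and ends here"), ("end", None), ("text", " with a tail.")],
]

for pieces in split_field_per_paragraph(PARAGRAPHS):
    print(pieces)
```

The paragraphs stay separate, and the text on both sides of the paragraph break carries the same field instruction, so a hyperlink could be emitted once per paragraph.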

On Monday, June 14, 2021 at 9:17:13 AM UTC+2 Milan Bracke wrote:
For those who don't know fldChar fields, this comment from the docx parse code (parse.hs, starting on line 825) explains it:
fldChar fields work by first
having a <w:fldChar fldCharType="begin"> in a run, then a run with
<w:instrText>, then a <w:fldChar fldCharType="separate"> run, then the
content runs, and finally a <w:fldChar fldCharType="end"> run. For
example (omissions and my comments in brackets):

<w:r>
  [...]
  <w:fldChar w:fldCharType="begin"/>
</w:r>
<w:r>
  [...]
  <w:instrText xml:space="preserve"> HYPERLINK [hyperlink url] </w:instrText>
</w:r>
<w:r>
  [...]
  <w:fldChar w:fldCharType="separate"/>
</w:r>
<w:r w:rsidRPr=[...]>
  [...]
  <w:t>Foundations of Analysis, 2nd Edition</w:t>
</w:r>
<w:r>
  [...]
  <w:fldChar w:fldCharType="end"/>
</w:r>
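
To make the run sequence concrete, the following Python sketch reads one non-nested field from a paragraph using the standard library's `xml.etree.ElementTree`. It is only an illustration and not how pandoc does it. The sample document fills the bracketed omissions with assumed values: the `w:p` wrapper, the namespace declaration, and the URL are placeholders, and `read_simple_field` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, bound to the "w" prefix in docx files.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Assumed sample mirroring the example above; the URL is a placeholder.
SAMPLE = f"""<w:p xmlns:w="{W}">
  <w:r><w:fldChar w:fldCharType="begin"/></w:r>
  <w:r><w:instrText xml:space="preserve"> HYPERLINK "https://example.com" </w:instrText></w:r>
  <w:r><w:fldChar w:fldCharType="separate"/></w:r>
  <w:r><w:t>Foundations of Analysis, 2nd Edition</w:t></w:r>
  <w:r><w:fldChar w:fldCharType="end"/></w:r>
</w:p>
"""


def read_simple_field(paragraph_xml):
    """Return (instruction, display_text) for one non-nested field."""
    instruction, text, phase = "", "", None
    for run in ET.fromstring(paragraph_xml).iter(f"{{{W}}}r"):
        fld = run.find(f"{{{W}}}fldChar")
        if fld is not None:
            # "begin": instruction runs follow; "separate": content runs
            # follow; "end": the field is finished.
            phase = fld.get(f"{{{W}}}fldCharType")
            continue
        if phase == "begin":
            instruction += "".join(
                t.text or "" for t in run.iter(f"{{{W}}}instrText"))
        elif phase == "separate":
            text += "".join(t.text or "" for t in run.iter(f"{{{W}}}t"))
    return instruction.strip(), text


print(read_simple_field(SAMPLE))
```

Because this sketch keeps a single `phase` variable, it only handles one field at a time; nested fields need more state.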

The current way of parsing fldChar fields doesn't take into account that they can be nested, so the end of a nested fldChar field is interpreted as the end of the surrounding one. This could, for example, lead to a hyperlink that ends too soon. See the attached example for a docx that demonstrates this.
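
The failure can be modelled in a few lines of Python. This is a simplified model of the behaviour described, not pandoc's actual reader: runs are reduced to hypothetical `(kind, value)` events, and the parser keeps one flag for "inside field content". The event list and the PAGEREF bookmark name are invented for the example.

```python
def link_text_flat(events):
    """Single-state parser: any 'end' closes the field, nested or not."""
    in_content, linked, unlinked = False, [], []
    for kind, value in events:
        if kind == "separate":
            in_content = True
        elif kind == "end":
            in_content = False      # also fires for the inner field's end
        elif kind == "text":
            (linked if in_content else unlinked).append(value)
    return "".join(linked), "".join(unlinked)


# A HYPERLINK field whose content contains a nested PAGEREF field.
NESTED = [
    ("begin", None), ("instr", ' HYPERLINK "https://example.com" '),
    ("separate", None), ("text", "see page "),
    ("begin", None), ("instr", " PAGEREF _Ref1 "), ("separate", None),
    ("text", "12"), ("end", None),
    ("text", " for details"), ("end", None),
]

print(link_text_flat(NESTED))
```

Here the inner `end` resets the flag, so the trailing " for details" falls outside the link even though it belongs to the outer field's content.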

I propose to fix this by turning the fldChar state into a stack, so that a field can be started and ended inside other fields. I will include this in my pull request for PAGEREF fields that I announced here a while ago, since they are related.
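
A rough Python sketch of the stack idea follows. It is an illustration under assumptions, not the proposed Haskell change: runs are modelled as hypothetical `(kind, value)` events, and the event list, the PAGEREF bookmark name, and the URL are placeholders.

```python
def link_text_stacked(events):
    """Stack-based parser: each 'end' closes only the innermost open field."""
    stack, linked, unlinked = [], [], []
    for kind, value in events:
        if kind == "begin":
            stack.append({"instr": "", "in_content": False})
        elif kind == "instr" and stack:
            stack[-1]["instr"] += value
        elif kind == "separate" and stack:
            stack[-1]["in_content"] = True
        elif kind == "end" and stack:
            stack.pop()             # the enclosing field stays open
        elif kind == "text":
            # Text is linked if any open field on the stack is a HYPERLINK
            # whose content section has started.
            inside_link = any(
                f["in_content"] and f["instr"].strip().startswith("HYPERLINK")
                for f in stack)
            (linked if inside_link else unlinked).append(value)
    return "".join(linked), "".join(unlinked)


# A HYPERLINK field whose content contains a nested PAGEREF field.
NESTED_FIELDS = [
    ("begin", None), ("instr", ' HYPERLINK "https://example.com" '),
    ("separate", None), ("text", "see page "),
    ("begin", None), ("instr", " PAGEREF _Ref1 "), ("separate", None),
    ("text", "12"), ("end", None),
    ("text", " for details"), ("end", None),
]

print(link_text_stacked(NESTED_FIELDS))
```

With one stack entry per open field, the inner `end` pops only the PAGEREF entry, and the hyperlink covers all of its content.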
