From: Melroch
Newsgroups: gmane.text.pandoc
Subject: Re: Writing custom filter in python to remove non-breaking spaces
Date: Fri, 4 Aug 2017 20:58:53 +0200

I think the OP might actually want to replace literal or entity nbspaces with regular spaces. There may well be good reasons to want to do that.

I think the OP might be helped by a prefilter written in Python. The following tries to skip any delimited code or fenced code blocks, so that nbsp entities inside code are not replaced. It may be thrown off by things like `\~~~strikeout~~`, but those are unlikely in practice.

````python
# Python 2 script: reads bytes from stdin and decodes them as UTF-8
import sys
import re

inp = sys.stdin.read()
txt = inp.decode('utf-8')
pat = u"""(?isxu)
# Match delimited code or code block
(?P<code>
    (?P<backtick> \`{1,} ) .*? (?P=backtick)
|
    (?P<tilde> \~{3,} ) .*? (?P=tilde)
)
|
# Match any form of nbsp
( \& (?: nbsp|[#]160|[#]xa0) \; | \u00a0 )
"""

# keep code unchanged and replace nbsp with an ordinary space
def rep(m):
    return m.group(1) if m.group(1) else u"\u0020"

print re.sub(pat, rep, txt).encode('utf-8')
````
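
The script above is Python 2 (`print` statement, `decode` on the input string). For anyone on Python 3, a minimal sketch of the same idea might look like the following. The names `PATTERN` and `replace_nbsp` are hypothetical, introduced here only for illustration, and the sketch has the same limitation as the original: it only recognises backtick- and tilde-delimited code.

```python
import re

# Verbose-mode pattern. The named group "code" captures code that must be
# left alone; the other alternative matches an nbsp in any of its spellings.
# PATTERN and replace_nbsp are hypothetical names, not from the original post.
PATTERN = re.compile(
    r"""
    (?P<code>
        (?P<backtick> `+ ) .*? (?P=backtick)    # backtick-delimited code
      |
        (?P<tilde> ~{3,} ) .*? (?P=tilde)       # tilde-fenced code block
    )
    |
    (?: & (?: nbsp | [#]160 | [#]xa0 ) ; | \u00a0 )   # entity or literal nbsp
    """,
    re.IGNORECASE | re.DOTALL | re.VERBOSE,
)


def replace_nbsp(text):
    """Return text with every nbsp outside code replaced by a plain space."""
    def rep(match):
        code = match.group("code")
        # Keep matched code as it is; anything else that matched is an nbsp.
        return code if code is not None else " "
    return PATTERN.sub(rep, text)
```

To use it as a prefilter, something like `sys.stdout.write(replace_nbsp(sys.stdin.read()))` (after `import sys`) should do, assuming stdin and stdout are UTF-8. The output can then be piped into pandoc.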

On Fri, 4 Aug 2017 at 00:30, Kolen Cheung <christian.kolen-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> I actually want to know why you would want to remove that in the first place? It seems only if the source has bugs you would want to do that.

> --
> You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe@googlegroups.com.
> To post to this group, send email to pandoc-discuss@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/f0bb6fae-6104-4efc-840d-34fd19b02840%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

--
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/CADAJKhBEZ8-BdJTQRJE4M2nettrGKhf1xYzqBYs%3Dpe%3D_DAodpA%40mail.gmail.com.