From: BP Jonsson
Newsgroups: gmane.text.pandoc
Subject: Re: filter to break urls in HTML
Date: Sun, 21 Dec 2014 01:29:44 +0100
Message-ID: <54961478.30103@gmail.com>
References: <5491EDED.7030000@web.de>

On 2014-12-17 21:56, Pablo Rodríguez wrote:
> I would need a filter that parses the following url:
>
>
>
> in HTML as:
>
> http://​
> www​.​link​.​com​#​a​=
> ​b​.​php​?​what

Are you not afraid that someone who copy-pastes that URL will get angry at you? :-)

Actually I wrote such a filter in Perl just for kicks, since I realized that the main action could be compacted into a single substitution:

    $_->{c}[0][0]{c} =~ s{ (^.+?://) | (?= [^-:a-z0-9%] ) | (?<= [-:] ) }{ $1 || "\x{200b}" }egix;

I figured that I would like to have breaks after hyphens and colons but before other punctuation, except for percent-encoded characters/bytes, where I wouldn't want any breaks at all; hence the three-way alternation in the search pattern.

BTW Matthew, isn't `nb = '\x8203'` in your Haskell version a mistake? Codepoint 8203 *hex* is U+8203 CJK UNIFIED IDEOGRAPH-8203, while 8203 *decimal* is U+200B ZERO WIDTH SPACE! In Haskell the `\x` escape is hexadecimal, so the intended character would be `'\8203'` or `'\x200b'`.

/bpj
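
For readers who don't speak Perl, here is a rough Python rendering of the substitution described above. It is only a sketch that works on a plain URL string, not a pandoc filter, and the names `ZWSP`, `BREAK_RE` and `insert_breaks` are made up for this illustration. The three alternatives mirror the Perl pattern: keep the scheme untouched, break before punctuation other than hyphen, colon and percent sign, and break after hyphen or colon.

```python
import re

ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE

# Hypothetical Python counterpart of the Perl s{...}{...}egix pattern.
BREAK_RE = re.compile(
    r"""
    (^.+?://)             # 1. the scheme, captured so it is put back unchanged
  | (?= [^-:a-z0-9%] )    # 2. zero-width: just before any other punctuation
  | (?<= [-:] )           # 3. zero-width: just after a hyphen or colon
    """,
    re.IGNORECASE | re.VERBOSE,
)


def insert_breaks(url: str) -> str:
    """Return url with zero-width spaces at the permitted break points."""
    # group(1) is set only when the scheme matched; every other match is
    # empty and is replaced by a zero-width space.
    return BREAK_RE.sub(lambda m: m.group(1) or ZWSP, url)


if __name__ == "__main__":
    broken = insert_breaks("http://www.link.com#a=b.php?what")
    # Show the otherwise invisible break points as "|".
    print(broken.replace(ZWSP, "|"))
```

Because `%` and the hex digits after it are excluded from the second alternative, a percent-encoded sequence such as `%20` stays in one piece.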
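
The hex versus decimal mix-up is easy to check with any Unicode database. A small Python sketch, where the helper name `describe` is invented for the example:

```python
import unicodedata


def describe(codepoint: int) -> str:
    """Format a code point as 'U+XXXX NAME' using the Unicode database."""
    return f"U+{codepoint:04X} {unicodedata.name(chr(codepoint))}"


if __name__ == "__main__":
    print(describe(0x8203))  # 8203 read as hexadecimal
    print(describe(8203))    # 8203 read as decimal, i.e. 0x200B
```

The first call names a CJK ideograph and the second names the zero-width space, which is the difference between writing the number after `\x` and writing it as a decimal escape.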