From: BP Jonsson
Newsgroups: gmane.text.pandoc
Subject: Re: ultra clean conversion from html to org mode
Date: Thu, 31 Aug 2017 07:29:15 +0200
References: <877exkx2zw.fsf@mat.ucm.es>

Piping pandoc's output through a simple text filter built on Perl's range operator (`..`) with line matches should do the trick: it skips every line on and between the begin/end HTML markers.

````perl
#!/usr/bin/env perl

use 5.010001;
use strict;
use warnings;
use utf8;
use open qw[ :utf8 :std ];

while (<>) {    # read STDIN (or files named as arguments) line by line
    if ( /^\#\+BEGIN_HTML/ .. /^\#\+END_HTML/ ) {
        # skip lines on/between the fences
    }
    else {
        print $_;
    }
}

__END__
````

On 30 Aug 2017 at 21:36, "Uwe Brauer" wrote:

>
> Hi
>
> I just converted
> https://www.theguardian.com/politics/2017/aug/30/may-to-press-japan-on-its-eu-trade-deal-in-hopes-of-a-model-for-uk
> to org mode using pandoc (pandoc 1.19 in Kubuntu 14.04)
>
> But the file contains lines such as
>
> #+BEGIN_HTML
>   <div itemprop="publisher" itemtype="https://schema.org/Organization">
> #+END_HTML
>
> #+BEGIN_HTML
>   <div itemprop="logo" itemtype="https://schema.org/ImageObject">
> #+END_HTML
>
> Couldn't they just be ignored in the conversion process, since they don't
> provide much help for org mode?
>
> Thanks
>
> Uwe Brauer
>
> --
> You received this message because you are subscribed to the Google Groups
> "pandoc-discuss" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to pandoc-discuss+unsubscribe@googlegroups.com.
> To post to this group, send email to pandoc-discuss@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/pandoc-discuss/877exkx2zw.fsf%40mat.ucm.es.
> For more options, visit https://groups.google.com/d/optout.

--
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/CAFC_yuTLQ3aj-j6CHgQekOPt-1oZgxp52QWR6X2rRjXunB7K5A%40mail.gmail.com.