public inbox archive for pandoc-discuss@googlegroups.com
* omit HTML block in html --> org conversion
From: Ista Zahn @ 2016-05-20 15:30 UTC
  To: pandoc-discuss


Hello,

I'm using pandoc to convert HTML email to Org mode for viewing in Emacs.
This works well, except that <div> tags are not converted and are instead
wrapped in #+BEGIN_HTML ... #+END_HTML blocks. Is there any way I can
prevent that from happening? I'd really like those div tags to be
omitted from the converted Org file.

Here is a brief example.

Input:

echo '<html>

 <head>  
 <meta http-equiv="Content-Type" content="text/html; charset=Windows-1252"> 
 
 </head>  
 <body lang="EN-US" link="blue" vlink="purple">  
 <div class="WordSection1">  
   <div  
   style="width:100%;padding:24px 0 16px 
0;background-color:#f5f5f5;text-align:center">  
 </div>  
 <p style="background:white"><i><span 
style="font-size:11.0pt;font-family:"Calibri","sans-serif"">You are 
receiving this email because you took professional training and/or required 
 
training in the past year.</span></i><o:p></o:p></p>

 </body>  
 </html>  
' | pandoc -f html -t org 

Result:

#+BEGIN_HTML 
 <div class="WordSection1"> 
#+END_HTML 
 
#+BEGIN_HTML 
 <div 
 style="width:100%;padding:24px 0 16px 
0;background-color:#f5f5f5;text-align:center"> 
#+END_HTML 
 
#+BEGIN_HTML 
 </div> 
#+END_HTML 
 
/You are receiving this email because you took professional training 
and/or required training in the past year./ 
 
#+BEGIN_HTML 
 </div> 
#+END_HTML


Desired result:
/You are receiving this email because you took professional training 
and/or required training in the past year./ 



Thanks for any suggestions.

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To post to this group, send email to pandoc-discuss-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/43181b14-9c6d-402c-bed5-ba790f6ee5cb%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


* Re: omit HTML block in html --> org conversion
From: John MACFARLANE @ 2016-05-20 17:47 UTC
  To: pandoc-discuss

You could use a filter to remove the divs.

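% cat nodivs.hs
-- Pandoc JSON filter: replace each Div with the blocks it contains.
import Text.Pandoc.JSON

main :: IO ()
main = toJSONFilter nodivs
  where
    nodivs :: Block -> [Block]
    nodivs (Div _ bs) = bs   -- unwrap the div, keep its contents
    nodivs b          = [b]  -- leave every other block unchanged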

% ghc --make nodivs

% pandoc --filter ./nodivs -t org
<div class="hi">
ok
</div>
ok

If you don't have ghc installed, you could write the
filter in Python instead, using the pandocfilters or
panflute library.
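
As a rough illustration of the Python route, here is a minimal sketch
that uses only the standard library instead of pandocfilters or
panflute, so it does not show either library's API. It assumes that
pandoc's JSON output represents a Div as
{"t": "Div", "c": [attributes, [blocks]]}. The file name nodivs.py and
the function name strip_divs are made up for this example.

```python
#!/usr/bin/env python3
"""nodivs.py (hypothetical name): unwrap every Div in a pandoc JSON AST."""
import json
import sys


def strip_divs(node):
    """Return a copy of node in which each Div is replaced by its child blocks."""
    if isinstance(node, list):
        out = []
        for item in node:
            if isinstance(item, dict) and item.get("t") == "Div":
                # Assumed layout: a Div's "c" field is [attributes, [child blocks]].
                # Splice the (recursively cleaned) children into the parent list.
                out.extend(strip_divs(item["c"][1]))
            else:
                out.append(strip_divs(item))
        return out
    if isinstance(node, dict):
        return {key: strip_divs(value) for key, value in node.items()}
    return node  # strings, numbers, booleans, None


def main():
    raw = sys.stdin.read()
    if raw.strip():  # do nothing when no document is piped in
        json.dump(strip_divs(json.loads(raw)), sys.stdout)


if __name__ == "__main__":
    main()
```

Assuming the script is saved as nodivs.py and made executable, it
should slot in where the compiled Haskell filter does:

% pandoc -f html -t org --filter ./nodivs.py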


* Re: omit HTML block in html --> org conversion
From: Ista Zahn @ 2016-05-20 20:27 UTC
  To: pandoc-discuss


Wonderful, thank you!