public inbox archive for pandoc-discuss@googlegroups.com
* Simplifying pandoc's HTML output even more
@ 2017-02-11 21:14 Marc Haber
       [not found] ` <20170211211439.GD2488-MEsB+WDYHc7QKvwJT6wXshvVK+yQ3ZXh@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: Marc Haber @ 2017-02-11 21:14 UTC (permalink / raw)
  To: pandoc-discuss

Hi,

I am using pandoc to generate simple HTML from markdown. Simple HTML
is required because the German tax authority wants footnotes and
explanation in a rather limited subset of XHTML.

For example, here is a test markdown input:
  Right     Left     Center     Default
-------   ------   --------     -------
12        12       12           12
123       123      123          123
1         1        1            1

This creates the following HTML:
<table>
<thead>
<tr class="header">
<th align="right">Right</th>
<th align="right">Left</th>
<th align="right">Center</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="right">12</td>
<td align="right">12</td>
<td align="right">12</td>
<td>12</td>
</tr>
<tr class="even">
<td align="right">123</td>
<td align="right">123</td>
<td align="right">123</td>
<td>123</td>
</tr>
<tr class="odd">
<td align="right">1</td>
<td align="right">1</td>
<td align="right">1</td>
<td>1</td>
</tr>
</tbody>
</table>

This HTML does not pass the tax authority's validation because of the
thead and tbody elements and the class attribute on the tr tags.

Can I make pandoc omit those tags and attributes, or do I need to do
post-processing of the generated HTML?

Greetings
Marc


-- 
-----------------------------------------------------------------------------
Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Simplifying pandoc's HTML output even more
       [not found] ` <20170211211439.GD2488-MEsB+WDYHc7QKvwJT6wXshvVK+yQ3ZXh@public.gmane.org>
@ 2017-02-11 22:00   ` BP Jonsson
       [not found]     ` <CAFC_yuRk+EGMRy6Bw0p2u6EiTwSHVwr589MZPm_+da3hZudUiQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2017-02-11 22:25   ` John MacFarlane
  1 sibling, 1 reply; 8+ messages in thread
From: BP Jonsson @ 2017-02-11 22:00 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

[-- Attachment #1: Type: text/plain, Size: 4024 bytes --]

If you make a list of everything you want changed/removed in terms of HTML
I'll try to write an HTML filter.

/bpj

-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To post to this group, send email to pandoc-discuss-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/CAFC_yuRk%2BEGMRy6Bw0p2u6EiTwSHVwr589MZPm_%2Bda3hZudUiQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

[-- Attachment #2: Type: text/html, Size: 7338 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Simplifying pandoc's HTML output even more
       [not found] ` <20170211211439.GD2488-MEsB+WDYHc7QKvwJT6wXshvVK+yQ3ZXh@public.gmane.org>
  2017-02-11 22:00   ` BP Jonsson
@ 2017-02-11 22:25   ` John MacFarlane
  1 sibling, 0 replies; 8+ messages in thread
From: John MacFarlane @ 2017-02-11 22:25 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

+++ Marc Haber [Feb 11 17 22:14 ]:
>Can I make pandoc omit those tags and attributes, or do I need to do
>post-processing of the generated HTML?

No, you need to do post-processing.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Simplifying pandoc's HTML output even more
       [not found]     ` <CAFC_yuRk+EGMRy6Bw0p2u6EiTwSHVwr589MZPm_+da3hZudUiQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-02-12  7:10       ` Marc Haber
       [not found]         ` <20170212071012.GI2488-MEsB+WDYHc7QKvwJT6wXshvVK+yQ3ZXh@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: Marc Haber @ 2017-02-12  7:10 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

On Sat, Feb 11, 2017 at 10:00:04PM +0000, BP Jonsson wrote:
> If you make a list of everything you want changed/removed in terms of HTML
> I'll try to write an HTML filter.

So pandoc's output is not configurable in this regard? There are no
run-time changeable templates being used?

Greetings
Marc

-- 
-----------------------------------------------------------------------------
Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Simplifying pandoc's HTML output even more
       [not found]         ` <20170212071012.GI2488-MEsB+WDYHc7QKvwJT6wXshvVK+yQ3ZXh@public.gmane.org>
@ 2017-02-13 12:33           ` BP Jonsson
       [not found]             ` <735216ab-2350-d8a5-d582-10d82d7a8d61-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  2017-02-13 15:27           ` John MacFarlane
  1 sibling, 1 reply; 8+ messages in thread
From: BP Jonsson @ 2017-02-13 12:33 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

[-- Attachment #1: Type: text/plain, Size: 1376 bytes --]

On 2017-02-12 at 08:10, Marc Haber wrote:
> So pandoc's output is not configurable in this regard? There are no
> run-time changeable templates being used?

No, you have to post-process. When I need to do that I usually
write an HTML filter based on [Mojo::DOM][], like the one attached,
which should do the trick for you if thead, tbody and tr.class
are the only issues. You need [perl][][^1] and the Mojo::DOM
[modules][] installed, but that should be a piece of cake. Then
pipe pandoc's output through the HTML filter:

     $ pandoc input.md | perl strip-table-parts.pl > output.html

[Mojo::DOM]: https://metacpan.org/pod/Mojo::DOM
[perl]:      https://www.perl.org/get.html
[modules]:   http://www.cpan.org/modules/INSTALL.html

[^1]:   I recommend Strawberry Perl if you are on Windows.

/bpj


[-- Attachment #2: strip-table-parts.pl --]
[-- Type: text/x-perl, Size: 749 bytes --]

#!/usr/bin/env perl

# strip thead, tbody tags and tr classes from HTML (as produced by pandoc)

use utf8;
use strict;
use warnings;
use warnings  qw(FATAL utf8);
use open      qw(:std :utf8);

use Mojo::DOM;

sub trim {
    my($string) = @_;
    $string =~ s/\A\s+//;   # remove leading whitespace
    $string =~ s/\s+\z//;   # remove trailing whitespace
    return $string;
}

my $html = do { local $/; <>; }; # slurp STDIN

my $dom = Mojo::DOM->new($html);

my $stripped = $dom->find('thead, tbody');

for my $elem ( @$stripped ) {
    # trim content so no empty lines inside table
    $elem->replace( trim $elem->content );
}

my $trs = $dom->find('tr');

for my $tr ( @$trs ) {
    delete $tr->attr->{class};
}

print $dom->to_string;

__END__
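
For readers without Perl at hand, the same post-processing idea can be sketched with Python's standard library. This is not the script attached above and not part of pandoc; it is a minimal sketch that assumes the input looks like the table HTML quoted earlier in the thread. The names `TableSimplifier` and `simplify_tables` are hypothetical, chosen only for illustration.

```python
from html import escape
from html.parser import HTMLParser


class TableSimplifier(HTMLParser):
    """Re-emit HTML, dropping <thead>/<tbody> tags and the class on <tr>."""

    DROPPED_TAGS = {"thead", "tbody"}

    def __init__(self):
        # convert_charrefs=False passes entities such as &amp; through as-is.
        super().__init__(convert_charrefs=False)
        self.out = []
        self._drop_newline = False

    def _emit(self, text):
        self._drop_newline = False
        self.out.append(text)

    def _render(self, tag, attrs, closing=">"):
        if tag == "tr":
            attrs = [(k, v) for k, v in attrs if k != "class"]
        # The parser unescapes attribute values, so escape them again here.
        rendered = "".join(
            f" {k}" if v is None else f' {k}="{escape(v, quote=True)}"'
            for k, v in attrs
        )
        return f"<{tag}{rendered}{closing}"

    def handle_starttag(self, tag, attrs):
        if tag in self.DROPPED_TAGS:
            self._drop_newline = True  # avoid leaving an empty line behind
        else:
            self._emit(self._render(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        if tag not in self.DROPPED_TAGS:
            self._emit(self._render(tag, attrs, " />"))

    def handle_endtag(self, tag):
        if tag in self.DROPPED_TAGS:
            self._drop_newline = True
        else:
            self._emit(f"</{tag}>")

    def handle_data(self, data):
        if self._drop_newline and data.startswith("\n"):
            data = data[1:]
        self._emit(data)

    def handle_entityref(self, name):
        self._emit(f"&{name};")

    def handle_charref(self, name):
        self._emit(f"&#{name};")

    def handle_comment(self, data):
        self._emit(f"<!--{data}-->")

    def handle_decl(self, decl):
        self._emit(f"<!{decl}>")


def simplify_tables(html):
    """Return html without thead/tbody tags and without class on tr."""
    parser = TableSimplifier()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Reading `sys.stdin` and writing the result of `simplify_tables` to `sys.stdout` would let such a script sit in the same pipe as the Perl filter. Unlike a DOM-based approach, this sketch re-serializes tags as it sees them, so attribute quoting and tag case may be normalized in the output.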

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Simplifying pandoc's HTML output even more
       [not found]         ` <20170212071012.GI2488-MEsB+WDYHc7QKvwJT6wXshvVK+yQ3ZXh@public.gmane.org>
  2017-02-13 12:33           ` BP Jonsson
@ 2017-02-13 15:27           ` John MacFarlane
       [not found]             ` <20170213152705.GB67285-l/d5Ua9yGnxXsXJlQylH7w@public.gmane.org>
  1 sibling, 1 reply; 8+ messages in thread
From: John MacFarlane @ 2017-02-13 15:27 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

Another option is to use a custom Lua writer (see
the manual).

The example included with pandoc (data/sample.lua)
generates HTML, so it would be easy to modify this
slightly for your needs.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Simplifying pandoc's HTML output even more
       [not found]             ` <20170213152705.GB67285-l/d5Ua9yGnxXsXJlQylH7w@public.gmane.org>
@ 2017-02-16 11:45               ` Marc Haber
  0 siblings, 0 replies; 8+ messages in thread
From: Marc Haber @ 2017-02-16 11:45 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

On Mon, Feb 13, 2017 at 04:27:05PM +0100, John MacFarlane wrote:
> Another option is to use a custom lua writer (see
> the manual).
> 
> The example included with pandoc (data/sample.lua)
> generates HTML, so it would be easy to modify this
> slightly for your needs.

Perfect. Thanks!

Greetings
Marc

-- 
-----------------------------------------------------------------------------
Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Simplifying pandoc's HTML output even more
       [not found]             ` <735216ab-2350-d8a5-d582-10d82d7a8d61-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2017-02-16 11:46               ` Marc Haber
  0 siblings, 0 replies; 8+ messages in thread
From: Marc Haber @ 2017-02-16 11:46 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

On Mon, Feb 13, 2017 at 01:33:30PM +0100, BP Jonsson wrote:
> No, you have to post-process. When I need to do that I usually
> write an HTML filter based on [Mojo::DOM][], like the one attached,
> which should do the trick for you if thead, tbody and tr.class
> are the only issues. You need [perl][][^1] and the Mojo::DOM
> [modules][] installed, but that should be a piece of cake. Then
> pipe pandoc's output through the HTML filter:
> 
>     $ pandoc input.md | perl strip-table-parts.pl > output.html

Thanks for the code and the insight into Mojo::DOM, but writing a
custom lua filter was easier for me. I'll keep Mojo::DOM in mind though.

Greetings
Marc

-- 
-----------------------------------------------------------------------------
Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2017-02-16 11:46 UTC | newest]
