public inbox archive for pandoc-discuss@googlegroups.com
* Attribute-less Markdown from web page Html
@ 2023-04-26  0:32 Oliver
       [not found] ` <8AD0B607-B556-48CC-83AA-7D0BACD3B8BE-WPTjrydoUPgeaOpM6FAJmQkbCANdLtlA@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: Oliver @ 2023-04-26  0:32 UTC (permalink / raw)
  To: pandoc-discuss

Hi all

I'm trying to use Pandoc to convert web pages to Markdown without all the class clutter, such as `{.underline}`.

So I tried

```
pandoc -f html -t markdown-raw_html-native_divs-native_spans-fenced_divs-header_attributes-auto_identifiers-inline_code_attributes-link_attributes-raw_attribute-simple_tables-multiline_tables-grid_tables page.html
```

and it works reasonably well, but I still get a bit of class clutter like

```
{.v-visible-sr .js-screen-reader-info}
```

or attributes like

```
{title="sometext"}
```

Both appear after links.

How can I suppress these?

I really only want the text and the (image) links.

Any help much appreciated!


-- 
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/8AD0B607-B556-48CC-83AA-7D0BACD3B8BE%40halloleo.hailmail.net.


* Re: Attribute-less Markdown from web page Html
       [not found] ` <8AD0B607-B556-48CC-83AA-7D0BACD3B8BE-WPTjrydoUPgeaOpM6FAJmQkbCANdLtlA@public.gmane.org>
@ 2023-04-26  5:39   ` John MacFarlane
       [not found]     ` <1E55C9DB-9A8C-4064-9927-2EC8B70076A0-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: John MacFarlane @ 2023-04-26  5:39 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

Turning off `link_attributes` should do it, but it looks like you tried that.

I'd have to look at an example of the input that produces this with these settings.

If you don't need fancy features, you could also try `-t commonmark` or `-t markdown_strict`.
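
For anyone following along, here is a minimal sketch of those two invocations as shell functions. The function names and `page.html` are placeholders, not anything pandoc defines; only the `-f` and `-t` values come from this thread.

```shell
# Hypothetical wrappers around the two writers suggested above.
# With no file argument, pandoc reads the HTML from standard input.
html_to_commonmark() {
  pandoc -f html -t commonmark "$@"
}

html_to_strict() {
  pandoc -f html -t markdown_strict "$@"
}
```

Usage would be something like `html_to_commonmark page.html > page.md`. Both formats have `raw_html` enabled by default, so elements they cannot express may still come through as raw HTML; appending `-raw_html` to the format name may be needed to drop them.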

* Re: Attribute-less Markdown from web page Html
       [not found]     ` <1E55C9DB-9A8C-4064-9927-2EC8B70076A0-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2023-04-26 23:59       ` Oliver
       [not found]         ` <CA3F14A6-3BC2-47D2-9FDC-ED464D6CAF49-WPTjrydoUPgeaOpM6FAJmQkbCANdLtlA@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: Oliver @ 2023-04-26 23:59 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

Thanks John

`-t markdown_strict-raw_html` does the trick for me!

One thing is odd with markdown_strict, though: the text of list items is indented to the next 4-space column:

`-   list text`

Can I somehow tell the markdown_strict writer to use only _one_ space here?

`- list text`

Anyway, a thousand thanks for Pandoc!



* Re: Attribute-less Markdown from web page Html
       [not found]         ` <CA3F14A6-3BC2-47D2-9FDC-ED464D6CAF49-WPTjrydoUPgeaOpM6FAJmQkbCANdLtlA@public.gmane.org>
@ 2023-04-28 13:56           ` Ian Cornelius
  2023-04-28 16:49           ` John MacFarlane
  1 sibling, 0 replies; 8+ messages in thread
From: Ian Cornelius @ 2023-04-28 13:56 UTC (permalink / raw)
  To: pandoc-discuss


To adjust indentation of lists, I would pipe pandoc's output to sed, e.g.

```
pandoc input.html -t markdown_strict-raw_html | sed 's/   / /'
```
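
One caveat with the unanchored pattern: `s/   / /` rewrites the first run of three spaces on every line, so it would also change the indentation of nested items and spacing inside ordinary text. A slightly more conservative sketch anchors on the bullet marker. `tidy_list_markers` is a made-up name, and the sketch assumes bullets (`-`, `*` or `+`) followed by exactly three spaces, as in the `-   list text` output reported in this thread. It knows nothing about code blocks or ordered lists.

```shell
# Hypothetical helper, not a pandoc feature: reduce the three spaces
# after a bullet marker to one, leaving any leading indentation and
# all other lines untouched.
tidy_list_markers() {
  sed -E 's/^([[:space:]]*[-*+]) {3}/\1 /'
}
```

It would be used as a filter, e.g. `pandoc input.html -t markdown_strict-raw_html | tidy_list_markers`.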

* Re: Attribute-less Markdown from web page Html
       [not found]         ` <CA3F14A6-3BC2-47D2-9FDC-ED464D6CAF49-WPTjrydoUPgeaOpM6FAJmQkbCANdLtlA@public.gmane.org>
  2023-04-28 13:56           ` Ian Cornelius
@ 2023-04-28 16:49           ` John MacFarlane
       [not found]             ` <C0C55D35-D675-4B6E-8D1A-CACEF8F738D1-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  1 sibling, 1 reply; 8+ messages in thread
From: John MacFarlane @ 2023-04-28 16:49 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

You would get different indentation with `-t commonmark`. markdown_strict follows the '4-space rule'.


* Re: Attribute-less Markdown from web page Html
       [not found]             ` <C0C55D35-D675-4B6E-8D1A-CACEF8F738D1-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2023-04-28 22:14               ` Oliver
       [not found]                 ` <0B417699-0C93-4DBC-9B09-8C36B6F39B0F-WPTjrydoUPgeaOpM6FAJmQkbCANdLtlA@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: Oliver @ 2023-04-28 22:14 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

Cool. But the commonmark writer emits these strange empty link reference labels (when the link text itself is the link label):

    [Pandoc Manual][]

      [Pandoc Manual]: https://pandoc.org/MANUAL.html

Is there a way to switch this off, i.e. to get just:

    [Pandoc Manual]



* Re: Attribute-less Markdown from web page Html
       [not found]                 ` <0B417699-0C93-4DBC-9B09-8C36B6F39B0F-WPTjrydoUPgeaOpM6FAJmQkbCANdLtlA@public.gmane.org>
@ 2023-04-29  5:04                   ` John MacFarlane
       [not found]                     ` <A3814551-04F1-4FD2-B6CF-7B89E39D840A-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: John MacFarlane @ 2023-04-29  5:04 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

I think that's just an oversight. I just pushed a change to the commonmark writer so it will use the shortcut forms.
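
Until a pandoc build with that change is installed, a text-level workaround is possible. This is a rough sketch, assuming the only thing to change is the empty `[]` that follows a bracketed label. `collapse_empty_refs` is a hypothetical helper name, not a pandoc option.

```shell
# Hypothetical post-processing filter: turn collapsed reference links
# such as [Pandoc Manual][] into the shortcut form [Pandoc Manual].
# Reference definitions ("[label]: url") contain no "[]" and pass through.
collapse_empty_refs() {
  sed -E 's/(\[[^][]+\])\[\]/\1/g'
}
```

The filter is purely textual, so it would also rewrite a literal `[label][]` inside a code block, and a shortcut link followed directly by `(` or `[` may be parsed differently by a Markdown reader.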


* Re: Attribute-less Markdown from web page Html
       [not found]                     ` <A3814551-04F1-4FD2-B6CF-7B89E39D840A-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2023-04-29 14:01                       ` Oliver
  0 siblings, 0 replies; 8+ messages in thread
From: Oliver @ 2023-04-29 14:01 UTC (permalink / raw)
  To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw

Hey, thanks for this! Will try it out.

