From mboxrd@z Thu Jan 1 00:00:00 1970
X-Msuck: nntp://news.gmane.io/gmane.text.pandoc/32500
Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail
From: Ian Cornelius
Newsgroups: gmane.text.pandoc
Subject: Re: Attribute-less Markdown from web page Html
Date: Fri, 28 Apr 2023 06:56:19 -0700 (PDT)
Message-ID: <66967a25-b28c-4d2f-aac2-6811b397d145n@googlegroups.com>
References: <8AD0B607-B556-48CC-83AA-7D0BACD3B8BE@halloleo.hailmail.net> <1E55C9DB-9A8C-4064-9927-2EC8B70076A0@gmail.com>
Reply-To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="----=_Part_280_1077256482.1682690179704"
To: pandoc-discuss
Precedence: list
Xref: news.gmane.io gmane.text.pandoc:32500

------=_Part_280_1077256482.1682690179704
Content-Type: multipart/alternative; boundary="----=_Part_281_1185875686.1682690179704"

------=_Part_281_1185875686.1682690179704
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

To adjust indentation of lists, I would pipe pandoc's output to sed, e.g.

```
pandoc input.html -t markdown_strict-raw_html | sed 's/   / /'
```

On Wednesday, April 26, 2023 at 6:59:31 PM UTC-5 Oliver wrote:

> Thanks John
>
> `-t markdown_strict-raw_html` does the trick for me!
>
> One thing though with markdown_strict is odd: The text of lists is
> indented to the next 4-space column:
>
> `-   list text`
>
> Can I somehow tell the markdown_strict writer to use only _one_ space here:
>
> `- list text`
>
> Anyway, a thousand thanks for Pandoc!
>
>
> On 26 Apr 2023, at 15:39, John MacFarlane wrote:
>
> > Turning off -link_attributes should do it, but looks like you tried that.
> >
> > I'd have to look at an example of the input that produces this with
> > these settings.
> >
> > If you don't need fancy features, you could also try `-t commonmark` or
> > `-t markdown_strict`.
> >
> >> On Apr 25, 2023, at 5:32 PM, Oliver wrote:
> >>
> >> Hi all
> >>
> >> I am trying to use Pandoc to convert web pages to Markdown without all the
> >> class clutter like `{.underline}`, etc.
> >>
> >> So I try
> >>
> >> ```
> >> pandoc -f html -t markdown-raw_html-native_divs-native_spans-fenced_divs-header_attributes-auto_identifiers-inline_code_attributes-link_attributes-raw_attribute-simple_tables-multiline_tables-grid_tables page.html
> >> ```
> >>
> >> and it works reasonably well, but I still get a bit of class clutter
> >> like
> >>
> >> ```
> >> {.v-visible-sr .js-screen-reader-info}
> >> ```
> >>
> >> or attributes like
> >>
> >> ```
> >> {title="sometext"}
> >> ```
> >>
> >> , both after links
> >>
> >> How can I suppress these?
> >>
> >> I really want only the text and (image) links.
> >>
> >> Any help much appreciated!
> >>
> >>
> >> --
> >> You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
> >> To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discus...-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
> >> To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/8AD0B607-B556-48CC-83AA-7D0BACD3B8BE%40halloleo.hailmail.net.

--
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/66967a25-b28c-4d2f-aac2-6811b397d145n%40googlegroups.com.

------=_Part_281_1185875686.1682690179704
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

To adjust indentation of lists, I would pipe pandoc's output to sed, e.g.
```
pandoc input.html -t markdown_strict-raw_html | sed 's/   / /'
```
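
Note that the sed one-liner above rewrites the first matching run of spaces on every line, wherever it occurs, so it can also alter lines that are not list items. A slightly more conservative sketch follows, assuming the goal is only to tighten the gap after bullet and numbered list markers. `tighten_list_markers` is a hypothetical helper name, not a pandoc feature, and the pattern would still match list-like lines inside indented code blocks.

```shell
# Hypothetical helper (not part of pandoc): collapse the padding that
# follows a list marker to a single space. Leading indentation is kept,
# so nested items stay nested.
tighten_list_markers() {
  sed -E 's/^([[:space:]]*([-*+]|[0-9]+\.))[[:space:]]{2,}/\1 /'
}

# Example use, assuming pandoc is installed and input.html exists:
#   pandoc input.html -t markdown_strict-raw_html | tighten_list_markers
```

Anchoring the match to the start of the line and to a list marker leaves ordinary text that happens to contain several spaces alone.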
On Wednesday, April 26, 2023 at 6:59:31 PM UTC-5 Oliver wrote:
Thanks John

`-t markdown_strict-raw_html` does the trick for me!

One thing though with markdown_strict is odd: The text of lists is indented to the next 4-space column:

`-   list text`

Can I somehow tell the markdown_strict writer to use only _one_ space here:

`- list text`

Anyway, a thousand thanks for Pandoc!


On 26 Apr 2023, at 15:39, John MacFarlane wrote:

> Turning off -link_attributes should do it, but looks like you tried that.
>
> I'd have to look at an example of the input that produces this with these settings.
>
> If you don't need fancy features, you could also try `-t commonmark` or `-t markdown_strict`.
>
>> On Apr 25, 2023, at 5:32 PM, Oliver <ne...-WPTjrydoUPgeaOpM6FAJmQkbCANdLtlA@public.gmane.org> wrote:
>>
>> Hi all
>>
>> I am trying to use Pandoc to convert web pages to Markdown without all the class clutter like `{.underline}`, etc.
>>
>> So I try
>>
>> ```
>> pandoc -f html -t markdown-raw_html-native_divs-native_spans-fenced_divs-header_attributes-auto_identifiers-inline_code_attributes-link_attributes-raw_attribute-simple_tables-multiline_tables-grid_tables page.html
>> ```
>>
>> and it works reasonably well, but I still get a bit of class clutter like
>>
>> ```
>> {.v-visible-sr .js-screen-reader-info}
>> ```
>>
>> or attributes like
>>
>> ```
>> {title="sometext"}
>> ```
>>
>> , both after links
>>
>> How can I suppress these?
>>
>> I really want only the text and (image) links.
>>
>> Any help much appreciated!
>>
>>
>> --
>> You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
>> To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discus...-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org.
>> To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/8AD0B607-B556-48CC-83AA-7D0BACD3B8BE%40halloleo.hailmail.net.
>

--
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org.
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/66967a25-b28c-4d2f-aac2-6811b397d145n%40googlegroups.com.
------=_Part_281_1185875686.1682690179704--
------=_Part_280_1077256482.1682690179704--