From: Gwern Branwen
Newsgroups: gmane.text.pandoc
Subject: Re: Removing parts of the document with [walk] in Haskell
Date: Sat, 4 Sep 2021 15:26:51 -0400
References: <915213fc-e4c4-480b-a6a2-fd3420777ddan@googlegroups.com>
Reply-To: pandoc-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org
To: pandoc-discuss

You might find my own struggles with similar issues interesting:

https://groups.google.com/g/pandoc-discuss/c/jgb-Q0F2p1Y/m/1DogIEAtAQAJ
https://github.com/gwern/gwern.net/blob/master/build/Typography.hs

A few things I'll note: your `isThe` only works if you assume that all strings have been fully split into Str/Space/Str sequences. But it's entirely valid to have a Str containing spaces, like `Str " the "`; your rewrite would fail there.

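To make that failure concrete, here is a minimal sketch. I am assuming your filter matches a standalone `Str "the"`; `dropTheNaive` and the two sample inputs are names made up for illustration, not your actual code. It needs the pandoc-types package.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Text.Pandoc.Definition (Block (..), Inline (..))
import Text.Pandoc.Walk (walk)

-- Hypothetical stand-in for a filter that only recognizes a whole-word Str.
dropTheNaive :: [Inline] -> [Inline]
dropTheNaive = filter (/= Str "the")

-- Fully split input: "the" sits in its own Str.
splitInput :: [Inline]
splitInput = [Str "see", Space, Str "the", Space, Str "cat"]

-- Equally valid input: a single Str that carries the spaces itself.
mergedInput :: [Inline]
mergedInput = [Str "see the cat"]

main :: IO ()
main = do
  -- Split form: the standalone Str "the" is removed (both Spaces remain).
  print (walk dropTheNaive (Para splitInput))
  -- Merged form: no element equals Str "the", so nothing changes.
  print (walk dropTheNaive (Para mergedInput))
```

The second paragraph comes back untouched even though it visibly contains the word, which is the case an element-level match cannot see.
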
So in your example, you could do something like `isThe (Str s) = if T.isInfixOf "the" s then Str (T.replace "the" "" s) else Str s`, with a catch-all `isThe x = x` so the function is total, and then simply `removeThe = walk isThe`. (Note that a plain substring replace also strips "the" from inside longer words such as "other", so match on word boundaries if that matters. I don't know if it meaningfully differs in performance from filtering, but if it does, it's probably faster. It may still be slower than your wrong version.)

> The same question actually applies to any transformation which "changes the number of elements":

This is a general problem I've beaten my head against many times. If you do the concat trick, you start running into issues with exactly how `walk` operates, so it'll sometimes repeat operations or omit substitutions you expected to happen, and bottomUp/topDown sometimes cause issues of their own, like different omissions or just infinite loops (always fun).

The best approach I've found is to *not* change the number of elements, and instead keep the type `Inline -> Inline` by using 'the Span trick': `Span` is the *only* Inline which can wrap multiple Inlines without meaningfully affecting them. (If you wrap them in Para or Div, it's a Block; if you wrap them in Emph, they're italicized; if you wrap them in Str, you need to compile down to a flat string, deleting all formatting; etc.)

This is how I do auto-smallcaps, inserting wordbreak characters after slashes, automatically turning matching text into hyperlinks, etc: I walk the tree, check Strs for matches of a regexp or character, and either return the Str unaltered (so still an Inline) or return a Span (still an Inline) whose `[Inline]` payload can be anything, like `[Str "foo", Span ("", ["smallcaps-auto"], []) [Str "SMALLCAPS"], Str "bar"]` or `[Link nullAttr [Str "GPU"] ("https://en.wikipedia.org/wiki/Graphics_processing_unit","")]`... The walk typechecks, the substitution happens consistently as you expect without infinite loops or missing or redundant substitutions, and it isn't too hard to understand.

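A minimal sketch of the Span trick, using the hyperlink case. This is not the code from my Typography.hs: `linkify`, `keyword`, and `target` are hypothetical names for illustration, it matches one fixed keyword rather than a regexp, and it assumes pandoc-types with `Text`-based `Str`.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T
import Text.Pandoc.Definition
import Text.Pandoc.Walk (walk)

-- Hypothetical keyword and link target, used only for this illustration.
keyword :: T.Text
keyword = "GPU"

target :: T.Text
target = "https://en.wikipedia.org/wiki/Graphics_processing_unit"

-- Inline -> Inline: a Str without a match falls through unchanged; a Str
-- with matches becomes one Span whose payload holds the split-up pieces.
linkify :: Inline -> Inline
linkify (Str s)
  | keyword `T.isInfixOf` s = Span nullAttr (pieces s)
  where
    -- Split the text at each occurrence of the keyword.
    pieces t = case T.breakOn keyword t of
      (before, rest)
        | T.null rest -> str before
        | otherwise ->
            str before
              ++ [Link nullAttr [Str keyword] (target, "")]
              ++ pieces (T.drop (T.length keyword) rest)
    -- Avoid emitting empty Str elements.
    str t = [Str t | not (T.null t)]
linkify x = x

main :: IO ()
main = do
  print (walk linkify (Para [Str "a GPU b"]))
  print (walk linkify (Para [Str "cat"]))
```

Because `walk` applies an `Inline -> Inline` function to a node after its children and does not revisit the result, the `Str "GPU"` placed inside the new Link is not rewritten again. The sketch does not skip text that is already inside a Link, so a real version would want a guard for that.
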
The drawback is that Span wrappers aren't great in terms of efficiency, but correctness is more important than speed, and you can always postprocess the HTML to remove the wrappers that aren't doing anything. (You can substitute in more powerful rewrites by going to raw HTML, but it gets very difficult when you work with `RawInline`, because now you have to be very careful about the opacity of the Text blob inside RawInline, about ordering, and about accidentally clobbering your HTML literals or messing them up. Try to stick to Span rewrites if at all possible.)

-- 
gwern
https://www.gwern.net