Date: Wed, 27 Jul 2022 23:58:23 +0100
From: Adam Thompson
To: Karl Dahlke
Cc: edbrowse-dev@edbrowse.org
Subject: Re: html scanner

On Wed, Jul 20, 2022 at 04:45:59PM -0400, Karl Dahlke wrote:
> Eventually we reach a tipping point.
>
> tidy is not maintained, and projects that aren't maintained are soon not distributed.
>
> 1. People will have to build tidy from source, (once it is no longer packaged), for as long as the source remains on line.
> 2. Building it is a pain since you have to use cmake.
> 3. there are bugs in tidy we can't fix, and can't work around. At least one is an infinite loop so this is no longer a trivial matter.
> 4. It is yet another dependency. The fewer dependencies the better.

It's a shame that tidy's got into such a state but yeah, we can't depend on
unmaintained code, that's not sustainable.

> With this in mind, I finally said, oh fuck it, it's time to write our own.
> An html scanner isn't trivial, but it's not terribly hard either,
> it's not like a js engine, which is, for us, impossible!
> So I've spent three days on it, and it's pretty dog gone close to done.
> html-tags.c
> Just three days, why didn't we do this sooner?

Probably because we had our own before and switched from it. Fortunately
we've learned from that and this would appear to be the result which sounds
like sustainable progress.

> And it's only a little more code than the code we used to interface to tidy.
> No kidding - for the same amount of code we can roll our own.

It'll also allow us to remain current with the ever-changing world of the
web, at least in terms of new html tags hopefully.

> So here's how to use it.
> There is a temporary edbrowse toggle command
> tidy
> So you can use tidy or not, and even compare the outputs.
> Our users guide is almost 500 lines long when rendered, and it comes out the same either way, that's pretty good.
> jsrt also comes out the same, though there are some issues when trying to use it.
> 4 of the tests in acid3 fail using my scanner.
> So sure there are still issues, but this is clearly the way to go.

Agreed, will start battle-testing.

> I'd like to have this working solid, maybe in a month, then divest from tidy, then cut version 3.8.3
> We will, at that time, update our installation procedures.

Sounds like a solid plan.

> So if you dare, type in tidy, then browse around like usual, and see if things blow up, or look wrong, etc.
> If you're not sure, revert back to tidy and browse and compare.

Sounds fun.

Cheers,
Adam.