ntg-context - mailing list for ConTeXt users
From: Taco Hoekwater via ntg-context <ntg-context@ntg.nl>
To: mailing list for ConTeXt users <ntg-context@ntg.nl>
Cc: Taco Hoekwater <taco@bittext.nl>, Pablo Rodriguez <oinos@gmx.es>
Subject: Re: ignore not closed tags in XML input
Date: Mon, 16 May 2022 20:13:34 +0200
Message-ID: <40642198-7105-4F8C-8897-C85F59B37D73@bittext.nl>
In-Reply-To: <a678ec61-7d15-49b3-9621-3bf4c0111817@gmx.es>



> On 16 May 2022, at 18:50, Pablo Rodriguez via ntg-context <ntg-context@ntg.nl> wrote:
> 
> On 5/16/22 17:30, Hans van der Meer via ntg-context wrote:
>> Can't you use an editor with grep, searching for something like the
>> pattern <meta.*^/>?
> 
> Many thanks for your reply, dr. van der Meer.
> 
> If I want to typeset the whole book
> (https://seumasjeltzz.github.io/LinguaeGraecaePerSeIllustrata/), I will
> have to download and sanitize over 20 HTML files.

Which can be done with a couple of command lines. The xmllint tool usually does a
good job of cleaning up dodgy HTML input:

  xmllint --html --xmlout crappy.html > nice.xml

(As good as can be expected from a program, anyway).

> It is really a pity that ConTeXt cannot totally ignore any given XML elements.

This statement is a little unfair: the problem is exactly that your input is NOT proper XML.

If it were proper XML, ConTeXt would not have problems with it. ConTeXt explicitly has
the capability to handle XML files, which your input simply is not. In fact, it is
sloppy HTML-esque data that modern web browsers happen to be able to handle more or less
correctly. It is not valid HTML either, because valid HTML has to be valid SGML, which your
input clearly is not.

That said, tools like xmllint exist for this stuff. Just write a small batch driver file in
some scripting language ((power)shell, Lua, Python, Perl, etc.) to preprocess the HTML
stuff into clean XML, and you should be fine.
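
As a rough illustration of such a batch driver, here is a minimal Python sketch. It
assumes xmllint is on the PATH and that the downloaded pages sit in one directory;
the function names, the directory arguments and the choice to skip files for which
xmllint prints nothing are all assumptions made for this example, not anything
ConTeXt or xmllint prescribes.

```python
import subprocess
import sys
from pathlib import Path


def xmllint_command(source):
    """Build the xmllint call from above: parse as HTML, emit XML on stdout."""
    return ["xmllint", "--html", "--xmlout", str(source)]


def convert_directory(src_dir, out_dir, run=subprocess.run):
    """Convert every *.html file in src_dir into an .xml file in out_dir.

    `run` is injectable so the loop can be exercised without xmllint installed.
    Returns the list of files written, in sorted order.
    """
    src_dir, out_dir = Path(src_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for source in sorted(src_dir.glob("*.html")):
        target = out_dir / (source.stem + ".xml")
        # Capture stdout (the repaired document) and stderr (parser complaints).
        result = run(xmllint_command(source), capture_output=True, check=False)
        if not result.stdout:
            # Assumption for this sketch: no output means nothing usable.
            print(f"skipped {source.name}: no output", file=sys.stderr)
            continue
        target.write_bytes(result.stdout)
        written.append(target)
    return written


if __name__ == "__main__":
    if len(sys.argv) != 3:
        sys.exit("usage: html2xml.py HTML_DIR XML_DIR")
    for path in convert_directory(sys.argv[1], sys.argv[2]):
        print(path)
```

Saved under a hypothetical name such as html2xml.py, it would be run as
"python html2xml.py html xml", after which the files in xml/ can be given to ConTeXt.
The result is only as good as xmllint's repair, so a quick look at the output is
still worthwhile.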

Taco

-- 
Taco Hoekwater              E: taco@bittext.nl
genderfluid (all pronouns)



___________________________________________________________________________________
If your question is of interest to others as well, please add an entry to the Wiki!

maillist : ntg-context@ntg.nl / http://www.ntg.nl/mailman/listinfo/ntg-context
webpage  : http://www.pragma-ade.nl / http://context.aanhet.net
archive  : https://bitbucket.org/phg/context-mirror/commits/
wiki     : http://contextgarden.net
___________________________________________________________________________________

