ntg-context - mailing list for ConTeXt users
From: Pablo Rodriguez via ntg-context <ntg-context@ntg.nl>
To: Taco Hoekwater via ntg-context <ntg-context@ntg.nl>
Cc: Pablo Rodriguez <oinos@gmx.es>
Subject: Re: ignore not closed tags in XML input
Date: Tue, 17 May 2022 18:36:32 +0200
Message-ID: <c5f83d3e-c0c0-7255-fc9e-541829796908@gmx.es>
In-Reply-To: <40642198-7105-4F8C-8897-C85F59B37D73@bittext.nl>

On 5/16/22 20:13, Taco Hoekwater via ntg-context wrote:
>> On 16 May 2022, at 18:50, Pablo Rodriguez via ntg-context <ntg-context@ntg.nl> wrote:
>> [...]
>> If I want to typeset the whole book
>> (https://seumasjeltzz.github.io/LinguaeGraecaePerSeIllustrata/), I will
>> have to download and sanitize over 20 HTML files.
>
> Which can be done with a couple of command lines. Xmllint usually does a good
> job of cleaning up dodgy html input:
>
>   xmllint --html --xmlout <crappy.html> > <nice.xml>

Many thanks for your reply, Taco.

Since I have to download the site recursively (with "wget -r"), I hope I
can find a way to pipe everything through in a single invocation.
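
A minimal sketch of how that could look in a small driver script, assuming
the site has already been fetched with "wget -r" into a local directory and
that xmllint is on the PATH. The function names and the choice to write each
result next to its source with an .xml suffix are illustrative assumptions,
not something from this thread; the xmllint options are the ones quoted
above plus --output.

```python
import subprocess
from pathlib import Path


def xmllint_jobs(root):
    """Pair every .html file below root with an xmllint command line."""
    jobs = []
    for html in sorted(Path(root).rglob("*.html")):
        target = html.with_suffix(".xml")
        # --html: parse lenient HTML; --xmlout: serialize the result as XML;
        # --output: write the result to the named file instead of stdout.
        argv = ["xmllint", "--html", "--xmlout",
                "--output", str(target), str(html)]
        jobs.append((argv, target))
    return jobs


def sanitize_tree(root):
    """Run xmllint on each HTML file; return the XML files that now exist."""
    written = []
    for argv, target in xmllint_jobs(root):
        # check=False: one problematic file should not abort the whole run.
        subprocess.run(argv, check=False)
        if target.exists():
            written.append(target)
    return written
```

Calling sanitize_tree("seumasjeltzz.github.io") would then process the whole
download in one go, assuming wget stored the pages in a directory named
after the host.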

>> It is really a pity that ConTeXt cannot totally ignore any given XML elements.
>
> This statement is a little unfair: the problem is exactly that your input is NOT proper XML.

My apologies. I really think ConTeXt rocks.

I wanted to write an introduction on how to typeset XML sources with
ConTeXt (at least in Spanish).

One of the main issues I face is finding examples.

It seemed natural to me to use texts published as HTML. But that turned
out to be way trickier than I first thought.

Such texts could be eye candy for potentially interested people. But if
one has to add a web crawler plus an XML sanitizer to the dependencies,
this makes it way harder (even for myself).

> If it was proper XML, ConTeXt would not have problems with it. ConTeXt explicitly has
> the capability to handle XML files, which your input simply is not. In fact, it is
> sloppy HTML-esque data that modern webbrowsers happen to be able to handle more or less
> correctly. It is not valid HTML either, because valid HTML has to be valid SGML, which your
> input clearly is not.

I agree my input isn’t proper XML, but it is valid SGML. One of the main
differences between the two is that SGML allows unclosed tags.

This is why cases such as this one are corner cases:
https://validator.w3.org/nu/?doc=https%3A%2F%2Fseumasjeltzz.github.io%2FLinguaeGraecaePerSeIllustrata%2F.

Since I considered this a corner case, I thought that a command such as
\xmlignore{#1}{head/(meta|link)} would make sense.
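
To make the corner case concrete: HTML permits void elements such as meta
and link without a closing tag, and a strict XML parser rejects exactly
that. A minimal sketch using only Python's standard library; the sample
markup and the helper name are made up for illustration and are not taken
from the site above.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# HTML style: <meta> is never closed, so </head> does not match it.
unclosed = '<head><meta charset="utf-8"><title>t</title></head>'
# XML style: the same element written as a self-closing tag.
closed = '<head><meta charset="utf-8"/><title>t</title></head>'

print(is_well_formed(unclosed))  # False
print(is_well_formed(closed))    # True
```

This is the step that a sanitizer such as xmllint takes care of: it rewrites
the first form into the second before ConTeXt ever sees the file.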

> That said, Tools like xmllint exist for this stuff. Just write a small batch driver file in
> some scripting language ((power)shell, lua, python, perl, etc.) to preprocess the HTML
> stuff into clean XML, and you should be fine.

Many thanks again for your reply.

Maybe all XML handling is way more complex than I originally thought.

Many thanks for your help,

Pablo
