9front - general discussion about 9front
* [9front] htmlfmt anchor corner cases
@ 2020-12-20  9:29 umbraticus
  2020-12-20 22:03 ` cinap_lenrek
  0 siblings, 1 reply; 10+ messages in thread
From: umbraticus @ 2020-12-20  9:29 UTC
  To: 9front

With the -u flag, htmlfmt doesn't print relative links starting with a
slash correctly:

; echo '<a href=/blah.html>blah</a>' |
	htmlfmt -u http://site.dom/some/deep/path/index.html
blah [http://site.dom/some/deep/path/blah.html]
(should be http://site.dom/blah.html)
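
For readers following along, here is a minimal sketch of the expected behaviour in portable C (plain char strings, not Plan 9 C). A rooted href should replace everything after the host part of the base URL, which is the same idea the patch in this message applies by cutting the base at the first slash after the "://". The name `rooted_join` is hypothetical and does not exist in htmlfmt.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, not part of htmlfmt: join a rooted href ("/x")
 * onto the scheme://host part of base.
 * Returns a malloc'd string, or NULL if the input does not fit that shape.
 */
char *
rooted_join(const char *base, const char *href)
{
	const char *p, *end;
	char *out;
	size_t n;

	if(base == NULL || href == NULL || href[0] != '/')
		return NULL;
	p = strstr(base, "://");
	if(p == NULL)
		return NULL;
	end = strchr(p + 3, '/');	/* first slash after the host */
	if(end == NULL)
		end = base + strlen(base);	/* base has no path at all */
	n = (size_t)(end - base);
	out = malloc(n + strlen(href) + 1);
	if(out == NULL)
		return NULL;
	memcpy(out, base, n);
	strcpy(out + n, href);	/* href already starts with '/' */
	return out;
}
```

Called with the base and href from the transcript above, this yields http://site.dom/blah.html. The caller frees the result.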

It also drops the space after “image” in the following example, since
the relative link starts with punctuation:

; echo '<img src=../blah.jpg>' | htmlfmt -a
[image../blah.jpg]

This messes up my elaborate rc + plumber webshit environment.  To
address this one I just decided to go for {imgpath} instead of [image
imgpath] (also fixes the issue just illustrated by piping this
paragraph through fmt...)  Patch below.

umbraticus

diff -r 1ae20c21a286 sys/src/cmd/htmlfmt/html.c
--- a/sys/src/cmd/htmlfmt/html.c	Sat Dec 19 19:15:02 2020 +0100
+++ b/sys/src/cmd/htmlfmt/html.c	Sun Dec 20 21:56:35 2020 +1300
@@ -170,6 +170,10 @@
 		if(base[strlen(base)-1]!='/' && (href==nil || href[0]!='/'))
 			result = eappend(result, "/", "");
 		free(base);
+		if(href!=nil && href[0]=='/'
+		&& (base = strchr(result, ':')) != nil
+		&& (base = strchr(base+3, '/')) != nil)
+			*base = '\0';
 	}
 	if(href){
 		if(result)
@@ -226,7 +230,7 @@
 			im = (Iimage*)il;
 			if(im->imsrc){
 				href = fullurl(u, im->imsrc);
-				renderbytes(t, "[image %s]", href);
+				renderbytes(t, "{%s}", href);
 				free(href);
 			}
 			break;


* Re: [9front] htmlfmt anchor corner cases
  2020-12-20  9:29 [9front] htmlfmt anchor corner cases umbraticus
@ 2020-12-20 22:03 ` cinap_lenrek
  2020-12-30  3:47   ` umbraticus
  0 siblings, 1 reply; 10+ messages in thread
From: cinap_lenrek @ 2020-12-20 22:03 UTC
  To: 9front

i do not like this part so much.

+		if(href!=nil && href[0]=='/'
+		&& (base = strchr(result, ':')) != nil
+		&& (base = strchr(base+3, '/')) != nil)
+			*base = '\0';

the issue is htmlfmt's code to combine relative
urls is just wrong. handling urls can be hard.

but adding hacks like these does not solve the
problem sufficiently. there's code in webfs that
might be of help.

maybe we should make a version for libhtml,
which htmlfmt uses.

the other part about {} looks fine.

--
cinap


* Re: [9front] htmlfmt anchor corner cases
  2020-12-20 22:03 ` cinap_lenrek
@ 2020-12-30  3:47   ` umbraticus
  2020-12-31  9:42     ` umbraticus
  0 siblings, 1 reply; 10+ messages in thread
From: umbraticus @ 2020-12-30  3:47 UTC
  To: 9front

> i do not like this part so much.
> 
> +		if(href!=nil && href[0]=='/'
> +		&& (base = strchr(result, ':')) != nil
> +		&& (base = strchr(base+3, '/')) != nil)
> +			*base = '\0';
> 
> the issue is htmlfmt's code to combine relative
> urls is just wrong. handling urls can be hard.
> 
> but adding hacks like these does not solve the
> problem sufficiently. there's code in webfs that
> might be of help.
> 
> maybe we should make a version for libhtml,
> which htmlfmt uses.

Yes, each program that uses libhtml does its own thing:

/sys/src/cmd/htmlfmt/html.c:/^fullurl
/sys/src/cmd/abaco/urls.c:/^urlcombine	quote: /* this is a HACK */
/sys/src/cmd/mothra/url.c:/^fileget
/sys/src/cmd/mothra/url.c:/^webclone ← makes use of webfs

Below is a patch that makes urls absolute when possible during parsing.
I'm not sure if the comment // FOR NOW: leave the url relative.
indicates that this was intended all along...

This would obviate the functions in htmlfmt and abaco but mothra
doesn't even use parsehtml so...  is it even worth it?  I'll have
another go at tidying up htmlfmt and send a separate email.

umbraticus

diff -r f4a5c13bcd43 sys/src/libhtml/build.c
--- a/sys/src/libhtml/build.c	Mon Dec 28 12:24:47 2020 +0100
+++ b/sys/src/libhtml/build.c	Wed Dec 30 16:12:33 2020 +1300
@@ -309,7 +309,6 @@
 static void			pushfontstyle(Pstate* ps, int sty);
 static void			pushjust(Pstate* ps, int j);
 static Item*		textit(Pstate* ps, Rune* s);
-static Rune*		removeallwhite(Rune* s);
 static void			resetdocinfo(Docinfo* d);
 static void			setcurfont(Pstate* ps);
 static void			setcurjust(Pstate* ps);
@@ -425,6 +424,40 @@
 
 static Item *getitems(ItemSource* is, uchar* data, int datalen);
 
+// Return malloced url, given (possibly empty) path and base.
+// A relative path and absolute base are combined; otherwise, path is returned as is.
+// If path is nil and base absolute, base up to final slash is returned.
+// URL strings are not validated any further than checking for proto://
+Rune*
+_fullurl(Rune *path, Rune *base)
+{
+	Rune *r;
+
+	if(path != nil){
+		for(r = path; isalpha(*r); r++)
+			;
+		if(r > path && *r++ == ':' && *r++ == '/' && *r == '/')
+			return _Strdup(path);	/* path is already absolute */
+	}
+	if(base == nil)
+		return _Strdup(path);
+	for(r = base; isalpha(*r); r++)
+		;
+	if(r == base || *r++ != ':' || *r++ != '/' || *r++ != '/')
+		return _Strdup(path);	/* bad base url proto */
+	while(isalnum(*r) || *r == '_' || *r == '@' || *r == '-' || *r == ':' || *r == '.')
+		r++;
+	if(r[-1] == '/' || *r && *r != '/')
+		return _Strdup(path);	/* bad base url hostname */
+	if(*r == '/' && (path == nil || *path != '/'))
+		r = runestrrchr(r, '/');	/* find final slash if path is not rooted */
+	if(path == nil)
+		return runesmprint("%.*S/", (int)(r - base), base);
+	if(*path == '/')
+		path++;
+	return runesmprint("%.*S/%S", (int)(r - base), base, path);
+}
+
 // Parse an html document and create a list of layout items.
 // Allocate and return document info in *pdi.
 // When caller is done with the items, it should call
@@ -439,7 +472,7 @@
 
 	di = newdocinfo();
 	di->src = _Strdup(pagesrc);
-	di->base = _Strdup(pagesrc);
+	di->base = _fullurl(nil, pagesrc);
 	di->mediatype = mtype;
 	di->chset = chset;
 	*pdi = di;
@@ -2923,55 +2956,18 @@
 	return ans;
 }
 
-// Attribute value when value is a URL, possibly relative to base.
-// FOR NOW: leave the url relative.
+// Attribute value when value is a URL.
+// Relative URLs are converted to absolute if a suitable base is given.
 // Caller must free the result (eventually).
 static Rune*
 aurlval(Token* tok, int attid, Rune* dflt, Rune* base)
 {
 	Rune*	ans;
-	Rune*	url;
-
-	USED(base);
-	ans = nil;
-	if(_tokaval(tok, attid, &url, 0) && url != nil)
-		ans = removeallwhite(url);
+
+	_tokaval(tok, attid, &ans, 0);
 	if(ans == nil)
-		ans = _Strdup(dflt);
-	return ans;
-}
-
-// Return copy of s but with all whitespace (even internal) removed.
-// This fixes some buggy URL specification strings.
-static Rune*
-removeallwhite(Rune* s)
-{
-	int	j;
-	int	n;
-	int	i;
-	int	c;
-	Rune*	ans;
-
-	j = 0;
-	n = _Strlen(s);
-	for(i = 0; i < n; i++) {
-		c = s[i];
-		if(c >= 256 || !isspace(c))
-			j++;
-	}
-	if(j < n) {
-		ans = _newstr(j);
-		j = 0;
-		for(i = 0; i < n; i++) {
-			c = s[i];
-			if(c >= 256 || !isspace(c))
-				ans[j++] = c;
-		}
-		ans[j] = 0;
-	}
-	else
-		ans = _Strdup(s);
-	return ans;
+		return _Strdup(dflt);
+	return _fullurl(ans, base);
 }
 
 // Attribute value when mere presence of attr implies value of 1,


* Re: [9front] htmlfmt anchor corner cases
  2020-12-30  3:47   ` umbraticus
@ 2020-12-31  9:42     ` umbraticus
  2021-01-01  4:42       ` umbraticus
  0 siblings, 1 reply; 10+ messages in thread
From: umbraticus @ 2020-12-31  9:42 UTC
  To: 9front

This patch makes the following changes to htmlfmt:

• Print image src like {url} instead of [image url]
• Properly combine rooted paths with base url
• Handle “protocol relative” urls
• Respect <base> tag
• Print document title at top
• Implement footnote mode -f
• Remove unused crap
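
To make the URL cases in the list above concrete, here is a hedged sketch in portable C (char strings instead of Runes) of how an href might be combined with a base URL: absolute URLs and fragments pass through, "protocol relative" hrefs keep only the scheme, rooted hrefs keep scheme and host, and everything else is appended to the base directory. The names `resolve` and `dupstr` are hypothetical; the sketch only approximates the patch's logic and, like the diff as posted, does not collapse ".." segments.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical portable sketch; none of these names come from htmlfmt. */
static char *
dupstr(const char *s)
{
	char *t = malloc(strlen(s) + 1);

	if(t != NULL)
		strcpy(t, s);
	return t;
}

/*
 * Resolve href against baseurl (expected form: scheme://host/dir/file).
 * Returns a malloc'd string, or NULL on bad input or allocation failure.
 */
char *
resolve(const char *baseurl, const char *href)
{
	const char *p, *colon, *rootend, *dirend, *sep;
	size_t n, len;
	char *out;

	if(baseurl == NULL || href == NULL)
		return NULL;

	/* already absolute, or a fragment on the current page: keep as is */
	for(p = href; isalpha((unsigned char)*p); p++)
		;
	if(href[0] == '#' || (p > href && strncmp(p, "://", 3) == 0 && p[3] != '\0'))
		return dupstr(href);

	colon = strstr(baseurl, "://");
	if(colon == NULL || colon == baseurl)
		return NULL;
	rootend = strchr(colon + 3, '/');	/* end of scheme://host */
	if(rootend == NULL)
		rootend = dirend = baseurl + strlen(baseurl);
	else
		dirend = strrchr(rootend, '/');	/* end of directory part */

	sep = "";
	if(href[0] == '/' && href[1] == '/')
		n = (size_t)(colon + 1 - baseurl);	/* protocol relative: keep "scheme:" */
	else if(href[0] == '/')
		n = (size_t)(rootend - baseurl);	/* rooted: keep scheme://host */
	else{
		n = (size_t)(dirend - baseurl);	/* relative: keep directory */
		sep = "/";
	}
	len = n + strlen(sep) + strlen(href) + 1;
	out = malloc(len);
	if(out != NULL)
		snprintf(out, len, "%.*s%s%s", (int)n, baseurl, sep, href);
	return out;
}
```

With base http://site.dom/some/deep/path/index.html, the href /blah.html resolves to http://site.dom/blah.html and //cdn.dom/x.js resolves to http://cdn.dom/x.js (cdn.dom is a made-up host for illustration).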

umbraticus

diff -r b24b6b01d46a sys/src/cmd/htmlfmt/dat.h
--- a/sys/src/cmd/htmlfmt/dat.h	Tue Dec 29 19:38:59 2020 +0000
+++ b/sys/src/cmd/htmlfmt/dat.h	Thu Dec 31 16:08:26 2020 +1300
@@ -3,6 +3,7 @@
 
 enum
 {
+	NONE, INLINE, FOOTNOTES,
 	STACK		= 8192,
 	EVENTSIZE	= 256,
 };
@@ -20,29 +21,15 @@
 	int		outfd;
 	int		type;
 
-	char		*url;
 	Item		*items;
 	Docinfo	*docinfo;
 };
 
-extern	char*	url;
-extern	int		aflag;
+extern	Rune*	baseurl;
+extern	int		links;
 extern	int		width;
 
 extern	char*	loadhtml(int);
-
-extern	char*	readfile(char*, char*, int*);
-extern	void*	emalloc(ulong);
-extern	char*	estrdup(char*);
-extern	char*	estrstrdup(char*, char*);
-extern	char*	egrow(char*, char*, char*);
-extern	char*	eappend(char*, char*, char*);
-extern	void		error(char*, ...);
-
 extern	void		growbytes(Bytes*, char*, long);
-
-extern	void		rendertext(URLwin*, Bytes*);
 extern	void		rerender(URLwin*);
 extern	void		freeurlwin(URLwin*);
-
-#pragma	varargck	argpos	error	1
diff -r b24b6b01d46a sys/src/cmd/htmlfmt/html.c
--- a/sys/src/cmd/htmlfmt/html.c	Tue Dec 29 19:38:59 2020 +0000
+++ b/sys/src/cmd/htmlfmt/html.c	Thu Dec 31 16:08:26 2020 +1300
@@ -7,14 +7,49 @@
 #include <ctype.h>
 #include "dat.h"
 
-char urlexpr[] =
-	"^(https?|ftp|file|gopher|mailto|news|nntp|telnet|wais|prospero)"
-	"://([a-zA-Z0-9_@\\-]+([.:][a-zA-Z0-9_@\\-]+)*)";
-Reprog	*urlprog;
-
 int inword = 0;
 int col = 0;
 int wordi = 0;
+Rune* proto;
+Rune* root;
+Rune* base;
+
+void
+setbaseurls(Rune *url)
+{
+	Rune *r;
+
+	if(url == nil)
+		return;
+	free(proto);
+	free(root);
+	free(base);
+
+	/* just a basic check... */
+	for(r = url; isalpha(*r); r++)
+		;
+	if(r == baseurl || r[0] != ':' || r[1] != '/' || r[2] != '/' || r[3] == 0){
+		fprint(2, "%s: ignoring invalid base url: %S\n", argv0, url);
+		proto = root = base = nil;
+		return;
+	}
+
+	r[1] = 0;
+	proto = runestrdup(url);
+	r[1] = '/';
+	if(r = runestrchr(r + 3, '/')){
+		*r = 0;
+		root = runestrdup(url);
+		*r = '/';
+		r = runestrrchr(r, '/');
+		*r = 0;
+		base = runestrdup(url);
+		*r = '/';
+		return;
+	}
+	base = runestrdup(url);
+	root = runestrdup(url);
+}
 
 char*
 loadhtml(int fd)
@@ -27,7 +62,6 @@
 	u = emalloc(sizeof(URLwin));
 	u->infd = fd;
 	u->outfd = 1;
-	u->url = estrdup(url);
 	u->type = TextHtml;
 
 	b = emalloc(sizeof(Bytes));
@@ -35,24 +69,13 @@
 		growbytes(b, buf, n);
 	if(b->b == nil)
 		return nil;	/* empty file */
-	rendertext(u, b);
+	u->items = parsehtml(b->b, b->n, baseurl, u->type, UTF_8, &u->docinfo);
+	setbaseurls(u->docinfo->base);
+	rerender(u);
 	freeurlwin(u);
 	return nil;
 }
 
-char*
-runetobyte(Rune *r, int n)
-{
-	char *s;
-
-	if(n == 0)
-		return emalloc(1);
-	s = smprint("%.*S", n, r);
-	if(s == nil)
-		error("malloc failed");
-	return s;
-}
-
 int
 closingpunct(char c)
 {
@@ -129,58 +152,23 @@
 	free(r);
 }
 
-char*
-baseurl(char *url)
+void
+renderurl(Bytes *t, Rune *path, char lc, char rc)
 {
-	char *base, *slash;
-	Resub rs[10];
+	Rune *r;
 
-	if(url == nil)
-		return nil;
-	if(urlprog == nil){
-		urlprog = regcomp(urlexpr);
-		if(urlprog == nil)
-			error("can't compile URL regexp");
+	if(path == nil){
+		renderbytes(t, "%cnull_url%c", lc, rc);
+		return;
 	}
-	memset(rs, 0, sizeof rs);
-	if(regexec(urlprog, url, rs, nelem(rs)) == 0)
-		return nil;
-	base = estrdup(url);
-	slash = strrchr(base, '/');
-	if(slash!=nil && slash>=&base[rs[0].ep-rs[0].sp])
-		*slash = '\0';
+	for(r = path; isalpha(*r); r++)
+		;
+	if(base == nil || r[0] == '#' || r > path && r[0] == ':' && r[1] == '/' && r[2] == '/' && r[3])
+		renderbytes(t, "%c%S%c", lc, path, rc);
+	else if(path[0] == '/')
+		renderbytes(t, "%c%S%S%c", lc, path[1] == '/' ? proto : root, path, rc);
 	else
-		base[rs[0].ep-rs[0].sp] = '\0';
-	return base;
-}
-
-char*
-fullurl(URLwin *u, Rune *rhref)
-{
-	char *base, *href, *hrefbase;
-	char *result;
-
-	if(rhref == nil)
-		return estrdup("NULL URL");
-	href = runetobyte(rhref, runestrlen(rhref));
-	hrefbase = baseurl(href);
-	result = nil;
-	if(hrefbase==nil && (base = baseurl(u->url))!=nil){
-		result = estrdup(base);
-		if(base[strlen(base)-1]!='/' && (href==nil || href[0]!='/'))
-			result = eappend(result, "/", "");
-		free(base);
-	}
-	if(href){
-		if(result)
-			result = eappend(result, "", href);
-		else
-			result = estrdup(href);
-	}
-	free(hrefbase);
-	if(result == nil)
-		return estrdup("***unknown***");
-	return result;
+		renderbytes(t, "%c%S/%S%c", lc, base, path, rc);
 }
 
 void
@@ -195,11 +183,12 @@
 	Anchor *a;
 	Table *tab;
 	Tablecell *cell;
-	char *href;
+	int nimg;
 
 	inword = 0;
 	col = 0;
 	wordi = 0;
+	nimg = 1;
 
 	for(il=items; il!=nil; il=il->next){
 		if(il->state & IFbrk)
@@ -221,17 +210,18 @@
 			renderbytes(t, "=======\n");
 			break;
 		case Iimagetag:
-			if(!aflag)
+			if(links == NONE)
 				break;
 			im = (Iimage*)il;
 			if(im->imsrc){
-				href = fullurl(u, im->imsrc);
-				renderbytes(t, "[image %s]", href);
-				free(href);
+				if(links & FOOTNOTES)
+					renderbytes(t, "{%d}", nimg++);
+				else
+					renderurl(t, im->imsrc, '{', '}');
 			}
 			break;
 		case Iformfieldtag:
-			if(aflag)
+			if(links != NONE)
 				renderbytes(t, "[formfield]");
 			break;
 		case Itabletag:
@@ -253,14 +243,15 @@
 				renderbytes(t, " ");
 			break;
 		default:
-			error("unknown item tag %d\n", il->tag);
+			sysfatal("unknown item tag %d\n", il->tag);
 		}
 		if(il->anchorid != 0 && il->anchorid!=curanchor){
 			for(a=u->docinfo->anchors; a!=nil; a=a->next)
-				if(aflag && a->index == il->anchorid){
-					href = fullurl(u, a->href);
-					renderbytes(t, "[%s]", href);
-					free(href);
+				if(links != NONE && a->index == il->anchorid){
+					if(links & FOOTNOTES)
+						renderbytes(t, "[%d]", a->index);
+					else
+						renderurl(t, a->href, '[', ']');
 					break;
 				}
 			curanchor = il->anchorid;
@@ -271,13 +262,55 @@
 }
 
 void
+afootnotes(URLwin *u, Bytes *t){
+	Anchor *x, *y, *z;
+
+	x = u->docinfo->anchors;
+	if(x == nil)
+		return;
+	renderbytes(t, "\n\nlinks:\n");
+
+	/* list needs reversing */
+	for(z = nil; x->next != nil; x = y){
+		y = x->next;
+		x->next = z;
+		z = x;
+	}
+	for(x->next = z; x != nil; x = x->next){
+		renderbytes(t, "[%d]", x->index);
+		renderurl(t, x->href, ' ', '\n');
+	};
+}
+
+void
+imgfootnotes(URLwin *u, Bytes *t){
+	Iimage *i;
+	int n;
+
+	i = u->docinfo->images;
+	if(i == nil)
+		return;
+	renderbytes(t, "\n\nimages:\n");
+	for(n=1; i!=nil; i=i->nextimage){
+		renderbytes(t, "{%d}", n++);
+		renderurl(t, i->imsrc, ' ', '\n');
+	}
+}
+
+void
 rerender(URLwin *u)
 {
 	Bytes *t;
 
 	t = emalloc(sizeof(Bytes));
 
+	if(u->docinfo->doctitle!=nil)
+		renderbytes(t, "%S\n\n", u->docinfo->doctitle);
 	render(u, t, u->items, 0);
+	if(links & FOOTNOTES){
+		afootnotes(u, t);
+		imgfootnotes(u, t);
+	}
 
 	if(t->n)
 		write(u->outfd, (char*)t->b, t->n);
@@ -286,19 +319,6 @@
 }
 
 void
-rendertext(URLwin *u, Bytes *b)
-{
-	Rune *rurl;
-
-	rurl = toStr((uchar*)u->url, strlen(u->url), UTF_8);
-	u->items = parsehtml(b->b, b->n, rurl, u->type, UTF_8, &u->docinfo);
-//	free(rurl);
-
-	rerender(u);
-}
-
-
-void
 freeurlwin(URLwin *u)
 {
 	freeitems(u->items);
diff -r b24b6b01d46a sys/src/cmd/htmlfmt/main.c
--- a/sys/src/cmd/htmlfmt/main.c	Tue Dec 29 19:38:59 2020 +0000
+++ b/sys/src/cmd/htmlfmt/main.c	Thu Dec 31 16:08:26 2020 +1300
@@ -5,8 +5,8 @@
 #include <html.h>
 #include "dat.h"
 
-char *url = "";
-int aflag;
+Rune *baseurl;
+int links;
 int width = 70;
 char *defcharset = "latin1";
 
@@ -53,11 +53,14 @@
 
 	ARGBEGIN{
 	case 'a':
-		aflag++;
+		links |= INLINE;
 		break;
 	case 'c':
 		defcharset = EARGF(usage());
 		break;
+	case 'f':
+		links |= FOOTNOTES;
+		break;
 	case 'l': case 'w':
 		err = EARGF(usage());
 		width = atoi(err);
@@ -65,8 +68,12 @@
 			usage();
 		break;
 	case 'u':
-		url = EARGF(usage());
-		aflag++;
+		err = EARGF(usage());
+		free(baseurl);
+		baseurl = emalloc((utflen(err) + 1) * sizeof(Rune));
+		for(i = 0; *err != '\0'; i++)
+			err += chartorune(baseurl + i, err);
+		links |= INLINE;
 		break;
 	default:
 		usage();
diff -r b24b6b01d46a sys/src/cmd/htmlfmt/util.c
--- a/sys/src/cmd/htmlfmt/util.c	Tue Dec 29 19:38:59 2020 +0000
+++ b/sys/src/cmd/htmlfmt/util.c	Thu Dec 31 16:08:26 2020 +1300
@@ -12,7 +12,7 @@
 
 	p = malloc(n);
 	if(p == nil)
-		error("can't malloc: %r");
+		sysfatal("malloc: %r");
 	memset(p, 0, n);
 	return p;
 }
@@ -22,88 +22,10 @@
 {
 	p = realloc(p, n);
 	if(p == nil)
-		error("can't malloc: %r");
+		sysfatal("realloc: %r");
 	return p;
 }
 
-char*
-estrdup(char *s)
-{
-	char *t;
-
-	t = emalloc(strlen(s)+1);
-	strcpy(t, s);
-	return t;
-}
-
-char*
-estrstrdup(char *s, char *t)
-{
-	long ns, nt;
-	char *u;
-
-	ns = strlen(s);
-	nt = strlen(t);
-	/* use malloc to avoid memset */
-	u = malloc(ns+nt+1);
-	if(u == nil)
-		error("can't malloc: %r");
-	memmove(u, s, ns);
-	memmove(u+ns, t, nt);
-	u[ns+nt] = '\0';
-	return u;
-}
-
-char*
-eappend(char *s, char *sep, char *t)
-{
-	long ns, nsep, nt;
-	char *u;
-
-	if(t == nil)
-		u = estrstrdup(s, sep);
-	else{
-		ns = strlen(s);
-		nsep = strlen(sep);
-		nt = strlen(t);
-		/* use malloc to avoid memset */
-		u = malloc(ns+nsep+nt+1);
-		if(u == nil)
-			error("can't malloc: %r");
-		memmove(u, s, ns);
-		memmove(u+ns, sep, nsep);
-		memmove(u+ns+nsep, t, nt);
-		u[ns+nsep+nt] = '\0';
-	}
-	free(s);
-	return u;
-}
-
-char*
-egrow(char *s, char *sep, char *t)
-{
-	s = eappend(s, sep, t);
-	free(t);
-	return s;
-}
-
-void
-error(char *fmt, ...)
-{
-	va_list arg;
-	char buf[256];
-	Fmt f;
-
-	fmtfdinit(&f, 2, buf, sizeof buf);
-	fmtprint(&f, "Mail: ");
-	va_start(arg, fmt);
-	fmtvprint(&f, fmt, arg);
-	va_end(arg);
-	fmtprint(&f, "\n");
-	fmtfdflush(&f);
-	exits(fmt);
-}
-
 void
 growbytes(Bytes *b, char *s, long ns)
 {
@@ -112,7 +34,7 @@
 		/* use realloc to avoid memset */
 		b->b = realloc(b->b, b->nalloc);
 		if(b->b == nil)
-			error("growbytes: can't realloc: %r");
+			sysfatal("growbytes: can't realloc: %r");
 	}
 	memmove(b->b+b->n, s, ns);
 	b->n += ns;


* Re: [9front] htmlfmt anchor corner cases
  2020-12-31  9:42     ` umbraticus
@ 2021-01-01  4:42       ` umbraticus
  2021-01-01 10:05         ` Steve Simon
                           ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: umbraticus @ 2021-01-01  4:42 UTC
  To: 9front

Updated patch below (nimg should be global so
subelements don't cause duplicate image
footnotes).  Quite enjoying these changes;
footnote mode much easier to read and work with.
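
A small sketch of the numbering problem, assuming (the diff does not show this part) that render calls itself for nested items such as table cells. The `Node` type and both functions are hypothetical stand-ins written in portable C. A counter that is reset at the top of each call restarts at 1 inside every nested list, while a counter shared across calls keeps counting. The patch uses a global for this; the sketch passes a pointer to get the same effect.

```c
#include <stddef.h>

/* Hypothetical item tree, loosely modelled on items nested in table cells. */
typedef struct Node Node;
struct Node {
	int isimage;
	Node *child;	/* nested items, e.g. the contents of a table cell */
	Node *next;
};

/* Counter local to each call: nested lists restart at 1. */
int
walk_local(Node *n, int *out, int k)
{
	int nimg = 1;	/* reset on every (recursive) call */

	for(; n != NULL; n = n->next){
		if(n->isimage)
			out[k++] = nimg++;
		if(n->child != NULL)
			k = walk_local(n->child, out, k);
	}
	return k;
}

/* Counter shared by all calls: numbers stay unique across nesting. */
int
walk_shared(Node *n, int *out, int k, int *nimg)
{
	for(; n != NULL; n = n->next){
		if(n->isimage)
			out[k++] = (*nimg)++;
		if(n->child != NULL)
			k = walk_shared(n->child, out, k, nimg);
	}
	return k;
}
```

For an image, then a table holding one image, then another image, the local version assigns 1, 1, 2 and the shared version assigns 1, 2, 3.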

umbraticus

diff -r b24b6b01d46a sys/src/cmd/htmlfmt/dat.h
--- a/sys/src/cmd/htmlfmt/dat.h	Tue Dec 29 19:38:59 2020 +0000
+++ b/sys/src/cmd/htmlfmt/dat.h	Fri Jan 01 17:35:59 2021 +1300
@@ -3,6 +3,7 @@
 
 enum
 {
+	NONE, INLINE, FOOTNOTES,
 	STACK		= 8192,
 	EVENTSIZE	= 256,
 };
@@ -20,29 +21,15 @@
 	int		outfd;
 	int		type;
 
-	char		*url;
 	Item		*items;
 	Docinfo	*docinfo;
 };
 
-extern	char*	url;
-extern	int		aflag;
+extern	Rune*	baseurl;
+extern	int		links;
 extern	int		width;
 
 extern	char*	loadhtml(int);
-
-extern	char*	readfile(char*, char*, int*);
-extern	void*	emalloc(ulong);
-extern	char*	estrdup(char*);
-extern	char*	estrstrdup(char*, char*);
-extern	char*	egrow(char*, char*, char*);
-extern	char*	eappend(char*, char*, char*);
-extern	void		error(char*, ...);
-
 extern	void		growbytes(Bytes*, char*, long);
-
-extern	void		rendertext(URLwin*, Bytes*);
 extern	void		rerender(URLwin*);
 extern	void		freeurlwin(URLwin*);
-
-#pragma	varargck	argpos	error	1
diff -r b24b6b01d46a sys/src/cmd/htmlfmt/html.c
--- a/sys/src/cmd/htmlfmt/html.c	Tue Dec 29 19:38:59 2020 +0000
+++ b/sys/src/cmd/htmlfmt/html.c	Fri Jan 01 17:35:59 2021 +1300
@@ -7,14 +7,50 @@
 #include <ctype.h>
 #include "dat.h"
 
-char urlexpr[] =
-	"^(https?|ftp|file|gopher|mailto|news|nntp|telnet|wais|prospero)"
-	"://([a-zA-Z0-9_@\\-]+([.:][a-zA-Z0-9_@\\-]+)*)";
-Reprog	*urlprog;
-
 int inword = 0;
 int col = 0;
 int wordi = 0;
+int nimg = 1;
+Rune* proto;
+Rune* root;
+Rune* base;
+
+void
+setbaseurls(Rune *url)
+{
+	Rune *r;
+
+	if(url == nil)
+		return;
+	free(proto);
+	free(root);
+	free(base);
+
+	/* just a basic check... */
+	for(r = url; isalpha(*r); r++)
+		;
+	if(r == baseurl || r[0] != ':' || r[1] != '/' || r[2] != '/' || r[3] == 0){
+		fprint(2, "%s: ignoring invalid base url: %S\n", argv0, url);
+		proto = root = base = nil;
+		return;
+	}
+
+	r[1] = 0;
+	proto = runestrdup(url);
+	r[1] = '/';
+	if(r = runestrchr(r + 3, '/')){
+		*r = 0;
+		root = runestrdup(url);
+		*r = '/';
+		r = runestrrchr(r, '/');
+		*r = 0;
+		base = runestrdup(url);
+		*r = '/';
+		return;
+	}
+	base = runestrdup(url);
+	root = runestrdup(url);
+}
 
 char*
 loadhtml(int fd)
@@ -27,7 +63,6 @@
 	u = emalloc(sizeof(URLwin));
 	u->infd = fd;
 	u->outfd = 1;
-	u->url = estrdup(url);
 	u->type = TextHtml;
 
 	b = emalloc(sizeof(Bytes));
@@ -35,24 +70,13 @@
 		growbytes(b, buf, n);
 	if(b->b == nil)
 		return nil;	/* empty file */
-	rendertext(u, b);
+	u->items = parsehtml(b->b, b->n, baseurl, u->type, UTF_8, &u->docinfo);
+	setbaseurls(u->docinfo->base);
+	rerender(u);
 	freeurlwin(u);
 	return nil;
 }
 
-char*
-runetobyte(Rune *r, int n)
-{
-	char *s;
-
-	if(n == 0)
-		return emalloc(1);
-	s = smprint("%.*S", n, r);
-	if(s == nil)
-		error("malloc failed");
-	return s;
-}
-
 int
 closingpunct(char c)
 {
@@ -129,58 +153,23 @@
 	free(r);
 }
 
-char*
-baseurl(char *url)
+void
+renderurl(Bytes *t, Rune *path, char lc, char rc)
 {
-	char *base, *slash;
-	Resub rs[10];
+	Rune *r;
 
-	if(url == nil)
-		return nil;
-	if(urlprog == nil){
-		urlprog = regcomp(urlexpr);
-		if(urlprog == nil)
-			error("can't compile URL regexp");
+	if(path == nil){
+		renderbytes(t, "%cnull_url%c", lc, rc);
+		return;
 	}
-	memset(rs, 0, sizeof rs);
-	if(regexec(urlprog, url, rs, nelem(rs)) == 0)
-		return nil;
-	base = estrdup(url);
-	slash = strrchr(base, '/');
-	if(slash!=nil && slash>=&base[rs[0].ep-rs[0].sp])
-		*slash = '\0';
+	for(r = path; isalpha(*r); r++)
+		;
+	if(base == nil || r[0] == '#' || r > path && r[0] == ':' && r[1] == '/' && r[2] == '/' && r[3])
+		renderbytes(t, "%c%S%c", lc, path, rc);
+	else if(path[0] == '/')
+		renderbytes(t, "%c%S%S%c", lc, path[1] == '/' ? proto : root, path, rc);
 	else
-		base[rs[0].ep-rs[0].sp] = '\0';
-	return base;
-}
-
-char*
-fullurl(URLwin *u, Rune *rhref)
-{
-	char *base, *href, *hrefbase;
-	char *result;
-
-	if(rhref == nil)
-		return estrdup("NULL URL");
-	href = runetobyte(rhref, runestrlen(rhref));
-	hrefbase = baseurl(href);
-	result = nil;
-	if(hrefbase==nil && (base = baseurl(u->url))!=nil){
-		result = estrdup(base);
-		if(base[strlen(base)-1]!='/' && (href==nil || href[0]!='/'))
-			result = eappend(result, "/", "");
-		free(base);
-	}
-	if(href){
-		if(result)
-			result = eappend(result, "", href);
-		else
-			result = estrdup(href);
-	}
-	free(hrefbase);
-	if(result == nil)
-		return estrdup("***unknown***");
-	return result;
+		renderbytes(t, "%c%S/%S%c", lc, base, path, rc);
 }
 
 void
@@ -195,7 +184,6 @@
 	Anchor *a;
 	Table *tab;
 	Tablecell *cell;
-	char *href;
 
 	inword = 0;
 	col = 0;
@@ -221,17 +209,18 @@
 			renderbytes(t, "=======\n");
 			break;
 		case Iimagetag:
-			if(!aflag)
+			if(links == NONE)
 				break;
 			im = (Iimage*)il;
 			if(im->imsrc){
-				href = fullurl(u, im->imsrc);
-				renderbytes(t, "[image %s]", href);
-				free(href);
+				if(links & FOOTNOTES)
+					renderbytes(t, "{%d}", nimg++);
+				else
+					renderurl(t, im->imsrc, '{', '}');
 			}
 			break;
 		case Iformfieldtag:
-			if(aflag)
+			if(links != NONE)
 				renderbytes(t, "[formfield]");
 			break;
 		case Itabletag:
@@ -253,14 +242,15 @@
 				renderbytes(t, " ");
 			break;
 		default:
-			error("unknown item tag %d\n", il->tag);
+			sysfatal("unknown item tag %d\n", il->tag);
 		}
 		if(il->anchorid != 0 && il->anchorid!=curanchor){
 			for(a=u->docinfo->anchors; a!=nil; a=a->next)
-				if(aflag && a->index == il->anchorid){
-					href = fullurl(u, a->href);
-					renderbytes(t, "[%s]", href);
-					free(href);
+				if(links != NONE && a->index == il->anchorid){
+					if(links & FOOTNOTES)
+						renderbytes(t, "[%d]", a->index);
+					else
+						renderurl(t, a->href, '[', ']');
 					break;
 				}
 			curanchor = il->anchorid;
@@ -271,13 +261,55 @@
 }
 
 void
+afootnotes(URLwin *u, Bytes *t){
+	Anchor *x, *y, *z;
+
+	x = u->docinfo->anchors;
+	if(x == nil)
+		return;
+	renderbytes(t, "\n\nlinks:\n");
+
+	/* list needs reversing */
+	for(z = nil; x->next != nil; x = y){
+		y = x->next;
+		x->next = z;
+		z = x;
+	}
+	for(x->next = z; x != nil; x = x->next){
+		renderbytes(t, "[%d]", x->index);
+		renderurl(t, x->href, ' ', '\n');
+	};
+}
+
+void
+imgfootnotes(URLwin *u, Bytes *t){
+	Iimage *i;
+	int n;
+
+	i = u->docinfo->images;
+	if(i == nil)
+		return;
+	renderbytes(t, "\n\nimages:\n");
+	for(n=1; i!=nil; i=i->nextimage){
+		renderbytes(t, "{%d}", n++);
+		renderurl(t, i->imsrc, ' ', '\n');
+	}
+}
+
+void
 rerender(URLwin *u)
 {
 	Bytes *t;
 
 	t = emalloc(sizeof(Bytes));
 
+	if(u->docinfo->doctitle!=nil)
+		renderbytes(t, "%S\n\n", u->docinfo->doctitle);
 	render(u, t, u->items, 0);
+	if(links & FOOTNOTES){
+		afootnotes(u, t);
+		imgfootnotes(u, t);
+	}
 
 	if(t->n)
 		write(u->outfd, (char*)t->b, t->n);
@@ -286,19 +318,6 @@
 }
 
 void
-rendertext(URLwin *u, Bytes *b)
-{
-	Rune *rurl;
-
-	rurl = toStr((uchar*)u->url, strlen(u->url), UTF_8);
-	u->items = parsehtml(b->b, b->n, rurl, u->type, UTF_8, &u->docinfo);
-//	free(rurl);
-
-	rerender(u);
-}
-
-
-void
 freeurlwin(URLwin *u)
 {
 	freeitems(u->items);
diff -r b24b6b01d46a sys/src/cmd/htmlfmt/main.c
--- a/sys/src/cmd/htmlfmt/main.c	Tue Dec 29 19:38:59 2020 +0000
+++ b/sys/src/cmd/htmlfmt/main.c	Fri Jan 01 17:35:59 2021 +1300
@@ -5,8 +5,8 @@
 #include <html.h>
 #include "dat.h"
 
-char *url = "";
-int aflag;
+Rune *baseurl;
+int links;
 int width = 70;
 char *defcharset = "latin1";
 
@@ -53,11 +53,14 @@
 
 	ARGBEGIN{
 	case 'a':
-		aflag++;
+		links |= INLINE;
 		break;
 	case 'c':
 		defcharset = EARGF(usage());
 		break;
+	case 'f':
+		links |= FOOTNOTES;
+		break;
 	case 'l': case 'w':
 		err = EARGF(usage());
 		width = atoi(err);
@@ -65,8 +68,12 @@
 			usage();
 		break;
 	case 'u':
-		url = EARGF(usage());
-		aflag++;
+		err = EARGF(usage());
+		free(baseurl);
+		baseurl = emalloc((utflen(err) + 1) * sizeof(Rune));
+		for(i = 0; *err != '\0'; i++)
+			err += chartorune(baseurl + i, err);
+		links |= INLINE;
 		break;
 	default:
 		usage();
diff -r b24b6b01d46a sys/src/cmd/htmlfmt/util.c
--- a/sys/src/cmd/htmlfmt/util.c	Tue Dec 29 19:38:59 2020 +0000
+++ b/sys/src/cmd/htmlfmt/util.c	Fri Jan 01 17:35:59 2021 +1300
@@ -12,7 +12,7 @@
 
 	p = malloc(n);
 	if(p == nil)
-		error("can't malloc: %r");
+		sysfatal("malloc: %r");
 	memset(p, 0, n);
 	return p;
 }
@@ -22,88 +22,10 @@
 {
 	p = realloc(p, n);
 	if(p == nil)
-		error("can't malloc: %r");
+		sysfatal("realloc: %r");
 	return p;
 }
 
-char*
-estrdup(char *s)
-{
-	char *t;
-
-	t = emalloc(strlen(s)+1);
-	strcpy(t, s);
-	return t;
-}
-
-char*
-estrstrdup(char *s, char *t)
-{
-	long ns, nt;
-	char *u;
-
-	ns = strlen(s);
-	nt = strlen(t);
-	/* use malloc to avoid memset */
-	u = malloc(ns+nt+1);
-	if(u == nil)
-		error("can't malloc: %r");
-	memmove(u, s, ns);
-	memmove(u+ns, t, nt);
-	u[ns+nt] = '\0';
-	return u;
-}
-
-char*
-eappend(char *s, char *sep, char *t)
-{
-	long ns, nsep, nt;
-	char *u;
-
-	if(t == nil)
-		u = estrstrdup(s, sep);
-	else{
-		ns = strlen(s);
-		nsep = strlen(sep);
-		nt = strlen(t);
-		/* use malloc to avoid memset */
-		u = malloc(ns+nsep+nt+1);
-		if(u == nil)
-			error("can't malloc: %r");
-		memmove(u, s, ns);
-		memmove(u+ns, sep, nsep);
-		memmove(u+ns+nsep, t, nt);
-		u[ns+nsep+nt] = '\0';
-	}
-	free(s);
-	return u;
-}
-
-char*
-egrow(char *s, char *sep, char *t)
-{
-	s = eappend(s, sep, t);
-	free(t);
-	return s;
-}
-
-void
-error(char *fmt, ...)
-{
-	va_list arg;
-	char buf[256];
-	Fmt f;
-
-	fmtfdinit(&f, 2, buf, sizeof buf);
-	fmtprint(&f, "Mail: ");
-	va_start(arg, fmt);
-	fmtvprint(&f, fmt, arg);
-	va_end(arg);
-	fmtprint(&f, "\n");
-	fmtfdflush(&f);
-	exits(fmt);
-}
-
 void
 growbytes(Bytes *b, char *s, long ns)
 {
@@ -112,7 +34,7 @@
 		/* use realloc to avoid memset */
 		b->b = realloc(b->b, b->nalloc);
 		if(b->b == nil)
-			error("growbytes: can't realloc: %r");
+			sysfatal("growbytes: can't realloc: %r");
 	}
 	memmove(b->b+b->n, s, ns);
 	b->n += ns;


* Re: [9front] htmlfmt anchor corner cases
  2021-01-01  4:42       ` umbraticus
@ 2021-01-01 10:05         ` Steve Simon
  2021-01-01 19:26         ` ori
  2021-01-20  2:20         ` ori
  2 siblings, 0 replies; 10+ messages in thread
From: Steve Simon @ 2021-01-01 10:05 UTC
  To: 9front

if you have your head wrapped around htmlfmt...

cinap and i did similar but different changes to htmlfmt some years ago which should be merged. if you would like to take a look my version is here:
   
    quintile.net/pkg/htmlfmt.tbz

the plan for both was to support tables which the labs htmlfmt did not cope with. i also tried to add some support for tbl/troff output so you could print the document.

i have not compared cinap’s work to mine but there may be good ideas in both.

-Steve





* Re: [9front] htmlfmt anchor corner cases
  2021-01-01  4:42       ` umbraticus
  2021-01-01 10:05         ` Steve Simon
@ 2021-01-01 19:26         ` ori
  2021-01-20  2:20         ` ori
  2 siblings, 0 replies; 10+ messages in thread
From: ori @ 2021-01-01 19:26 UTC
  To: 9front

Quoth umbraticus@prosimetrum.com:
> Updated patch below (nimg should be global so
> subelements don't cause duplicate image
> footnotes).  Quite enjoying these changes;
> footnote mode much easier to read and work with.

Yes, I like it quite a bit. And the number of '-'s
in the diff is a nice bonus too:

	% diffstat /mail/fs/mbox/47572/body
	+++ 120
	--- 172

A couple of comments:

> +	/* just a basic check... */
> +	for(r = url; isalpha(*r); r++)
> +		;
> +	if(r == baseurl || r[0] != ':' || r[1] != '/' || r[2] != '/' || r[3] == 0){

runestrncmp? directly indexing here without a length check makes me nervous,
and runestrncmp is more compact.
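
A sketch of the prefix-comparison form, using strncmp on char strings in portable C. The Plan 9 version would presumably use runestrncmp against a Rune string instead; that mapping is an assumption here. The function name `hasproto` is hypothetical. strncmp stops at the terminating NUL, so the check never reads past the end of a short string. Note also that the quoted check compares r against the global baseurl rather than the url argument, which may be unintended; the sketch compares against its own argument.

```c
#include <ctype.h>
#include <string.h>

/*
 * Hypothetical helper: return 1 if s looks like scheme://something,
 * where scheme is one or more letters, else 0.
 */
int
hasproto(const char *s)
{
	const char *p;

	for(p = s; isalpha((unsigned char)*p); p++)
		;
	return p > s && strncmp(p, "://", 3) == 0 && p[3] != '\0';
}
```

Under these rules "http://site.dom" is accepted, while "/blah.html", "://x" and a bare "http://" are rejected.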


> -	memset(rs, 0, sizeof rs);
> -	if(regexec(urlprog, url, rs, nelem(rs)) == 0)
> -		return nil;
> -	base = estrdup(url);
> -	slash = strrchr(base, '/');
> -	if(slash!=nil && slash>=&base[rs[0].ep-rs[0].sp])
> -		*slash = '\0';
> +	for(r = path; isalpha(*r); r++)
> +		;
> +	if(base == nil || r[0] == '#' || r > path && r[0] == ':' && r[1] == '/' && r[2] == '/' && r[3])
> +		renderbytes(t, "%c%S%c", lc, path, rc);
> +	else if(path[0] == '/')
> +		renderbytes(t, "%c%S%S%c", lc, path[1] == '/' ? proto : root, path, rc);

same as above

>  void
> @@ -195,7 +184,6 @@
>  	Anchor *a;
>  	Table *tab;
>  	Tablecell *cell;
> -	char *href;
>  
>  	inword = 0;
>  	col = 0;
> @@ -221,17 +209,18 @@
>  			renderbytes(t, "=======\n");
>  			break;
>  		case Iimagetag:
> -			if(!aflag)
> +			if(links == NONE)
>  				break;
>  			im = (Iimage*)il;
>  			if(im->imsrc){
> -				href = fullurl(u, im->imsrc);
> -				renderbytes(t, "[image %s]", href);
> -				free(href);
> +				if(links & FOOTNOTES)

You're using these as flags, but they don't act
like flags as far as I can tell?

I'd just use '==' here.
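
A short sketch of what the two forms do with the enum as posted (NONE, INLINE, FOOTNOTES, so 0, 1, 2). The helper `linkstyle` is hypothetical and only mirrors the order of the tests in the patch. Since the option parsing ORs the values together, giving both -a (or -u) and -f yields 3, which still passes `links & FOOTNOTES` but would not equal FOOTNOTES. A switch to '==' would therefore also need the option parsing to assign rather than OR.

```c
enum { NONE, INLINE, FOOTNOTES };	/* 0, 1, 2 as in the patch */

/* Hypothetical helper: which style the bit tests in the patch would pick. */
int
linkstyle(int links)
{
	if(links == NONE)
		return NONE;
	if(links & FOOTNOTES)
		return FOOTNOTES;
	return INLINE;
}
```

Here linkstyle(INLINE|FOOTNOTES) returns FOOTNOTES, even though INLINE|FOOTNOTES is 3 and not equal to FOOTNOTES.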


>  void
> +afootnotes(URLwin *u, Bytes *t){
> +	Anchor *x, *y, *z;
> +
> +	x = u->docinfo->anchors;
> +	if(x == nil)
> +		return;
> +	renderbytes(t, "\n\nlinks:\n");
> +
> +	/* list needs reversing */
> +	for(z = nil; x->next != nil; x = y){
> +		y = x->next;
> +		x->next = z;
> +		z = x;
> +	}
> +	for(x->next = z; x != nil; x = x->next){
> +		renderbytes(t, "[%d]", x->index);
> +		renderurl(t, x->href, ' ', '\n');
> +	};
> +}

Not a big deal, but why not just insert in the
right order in libhtml?

eg:

	di->anchors = newanchor(++is->nanchors, name, href, target, di->anchors);

could become:

	newanchor(di, ++is->nanchors, name, href, target);

which would do:

	if(di->tail != nil)
		di->tail->next = a;
	di->tail = a;

Would probably simplify things if we can assume that anchors
are in the right order.
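
A self-contained sketch of the tail-append idea in portable C. The `Doc` and `Anchor` types here are cut-down, hypothetical stand-ins, and the `tail` field is an assumption taken from the suggestion above rather than an existing libhtml field. Compared with the fragment above, the sketch also sets the list head when the list is empty.

```c
#include <stdlib.h>

/* Cut-down, hypothetical stand-ins for the libhtml types. */
typedef struct Anchor Anchor;
struct Anchor {
	int index;
	Anchor *next;
};

typedef struct Doc Doc;
struct Doc {
	Anchor *anchors;	/* head of the list */
	Anchor *tail;	/* last element, NULL when the list is empty */
};

/* Append a new anchor so the list stays in document order. */
Anchor *
appendanchor(Doc *d, int index)
{
	Anchor *a;

	a = calloc(1, sizeof *a);
	if(a == NULL)
		return NULL;
	a->index = index;
	if(d->tail != NULL)
		d->tail->next = a;
	else
		d->anchors = a;	/* first element also becomes the head */
	d->tail = a;
	return a;
}
```

With appends in this form, a footnote printer could walk d->anchors directly and would not need to reverse the list first.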


>  	ARGBEGIN{
>  	case 'a':
> -		aflag++;
> +		links |= INLINE;
>  		break;
>  	case 'c':
>  		defcharset = EARGF(usage());
>  		break;
> +	case 'f':
> +		links |= FOOTNOTES;
> +		break;

Same comment about flags.

>  	case 'l': case 'w':
>  		err = EARGF(usage());
>  		width = atoi(err);
> @@ -65,8 +68,12 @@
>  			usage();


* Re: [9front] htmlfmt anchor corner cases
  2021-01-01  4:42       ` umbraticus
  2021-01-01 10:05         ` Steve Simon
  2021-01-01 19:26         ` ori
@ 2021-01-20  2:20         ` ori
  2021-01-20  2:49           ` Alex Musolino
  2021-01-20  3:17           ` umbraticus
  2 siblings, 2 replies; 10+ messages in thread
From: ori @ 2021-01-20  2:20 UTC
  To: 9front

Quoth umbraticus@prosimetrum.com:
> Updated patch below (nimg should be global so
> subelements don't cause duplicate image
> footnotes).  Quite enjoying these changes;
> footnote mode much easier to read and work with.

Oops -- I've been running this patch since
this was posted, and I completely forgot
about it.

I'm going to do another once-over on it,
but it's been working well for me.



* Re: [9front] htmlfmt anchor corner cases
  2021-01-20  2:20         ` ori
@ 2021-01-20  2:49           ` Alex Musolino
  2021-01-20  3:17           ` umbraticus
  1 sibling, 0 replies; 10+ messages in thread
From: Alex Musolino @ 2021-01-20  2:49 UTC
  To: 9front

I've also been running it without issue.  Thanks!



* Re: [9front] htmlfmt anchor corner cases
  2021-01-20  2:20         ` ori
  2021-01-20  2:49           ` Alex Musolino
@ 2021-01-20  3:17           ` umbraticus
  1 sibling, 0 replies; 10+ messages in thread
From: umbraticus @ 2021-01-20  3:17 UTC
  To: 9front

> Oops -- I've been running this patch since
> this was posted, and I completely forgot
> about it.
> 
> I'm going to do another once-over on it,
> but it's been working well for me.

Thanks for taking a look!
Hang on, though, I've got a new one coming
incorporating some of your suggestions & other stuff.

umbraticus


