mandocdb(8) full re-write

discuss@mandoc.bsd.lv

* mandocdb(8) full re-write
@ 2012-04-04 18:26 Kristaps Dzonsons
From: Kristaps Dzonsons @ 2012-04-04 18:26 UTC
  To: discuss

[-- Attachment #1: Type: text/plain, Size: 4046 bytes --]

Hi,

During AsiaBSDCon, I had the opportunity to take a more serious look at 
mandocdb(8).  As the code was rather complex, I opted to start over 
rather than whittling down.  The results are in the enclosed file, 
summarised as follows:

  (0) Overall code cleanliness.  mandocdb.c gained a lot of features 
real fast.  This re-write let me integrate those systematically.

  (1) Aggressive hashing of strings.
      All strings -- filename components, file suffixes, parsed words, 
and so on -- are hashed (using uthash).  Parsed manpage terms overlay 
the string hash, so after a few files, there are very few allocations at 
all.  This brings us a huge performance improvement: a lot of the last 
version, when profiled with valgrind, was spent allocating and twiddling 
with strings.

  (2) Use of fts(3) instead of ad hoc file walking.
      This makes the code much cleaner and neater.  This also improved 
performance because examining the file path is much easier by looking at 
the hierarchy level.  Again, less string twiddling.

  (3) De-duping/winnowing at the file-scan phase.
      I de-duplicate files by hashing inode/device and tossing dupes.  I 
also throw out non-conforming suffixes (if !use_all) early on, making 
the end list of files to parse much smaller.
      I'm much more picky about what's considered "mandoc source" in 
this version because mandoc(1) lets pretty much anything be parsed, 
defaulting to -man, which led to lots of noise.  Now I require the 
right suffix or directory parts before using mandoc(3).

  (4) Using SQLite instead of Berkeley DB.
      Ok, this is the most controversial.  After talking with some 
OpenBSD and NetBSD folks, nobody could find anything against using 
SQLite.  NetBSD already has it in base, and apparently OpenBSD is moving 
in the same direction.
      Not to worry: it's really easy to plug in another database: the 
database functions (open/close/index/prune) completely contain the 
database routines.  Open/close are run for each manpath, index is run 
for each page, and prune for each page's removal.  Check out that DELETE 
CASCADE.  So easy!

  (5) Input encoding cleanup.
      The last mandocdb was a little fuzzy on encodings.  This time 
around, I store UTF-8 encoded strings directly.  Due to the hashing 
method, I only compute the UTF-8 string (which isn't all that expensive) 
once during the full parse lifetime!  This also makes apropos_db's job 
MUCH easier.
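
To make the string hashing in (1) concrete, here is a minimal sketch of string interning.  It is not the code from the attachment, which uses uthash: the fixed-size chained table, the FNV-1a hash, and the names intern(), hashstr() and NBUCKETS are all assumptions made for illustration.  The point is only that a repeated string costs a lookup rather than an allocation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 1024	/* hypothetical fixed table size */

struct istr {
	struct istr	*next;	/* next entry in the same bucket */
	char		 key[];	/* the interned string itself */
};

static struct istr *buckets[NBUCKETS];

/* 32-bit FNV-1a hash over the bytes of the string. */
static size_t
hashstr(const char *s)
{
	uint32_t	 h = 2166136261u;

	for ( ; '\0' != *s; s++) {
		h ^= (unsigned char)*s;
		h *= 16777619u;
	}
	return h % NBUCKETS;
}

/*
 * Return the single shared copy of "s", allocating only the first
 * time a given string is seen.  Returns NULL on allocation failure.
 */
static const char *
intern(const char *s)
{
	size_t		 h, sz;
	struct istr	*p;

	assert(NULL != s);
	h = hashstr(s);
	sz = strlen(s) + 1;

	/* Already known: hand back the existing copy. */
	for (p = buckets[h]; NULL != p; p = p->next)
		if (0 == strcmp(p->key, s))
			return p->key;

	/* First sighting: allocate once and link into the bucket. */
	if (NULL == (p = malloc(sizeof(struct istr) + sz)))
		return NULL;
	memcpy(p->key, s, sz);
	p->next = buckets[h];
	buckets[h] = p;
	return p->key;
}
```

Calling intern("man1") twice returns the same pointer both times, so callers can keep the pointer around without copying and compare interned strings by address.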

I cherry-picked schwarze@'s fine work with the last mandocdb.c to retain 
its behaviour regarding path sanitising.  There might be some omissions, 
but I think I have them all.
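
As a rough illustration of the path sanitising mentioned above, the sketch below splits a relative path of the form [./]man<section>[/<arch>]/<name>.<section> (or the cat equivalent) into its parts.  It is a simplified stand-in, not the attached filescan(): the names classify(), copypart(), struct pinfo and the fixed buffer sizes are assumptions, and unlike the real code it rejects anything that does not start with man* or cat*.

```c
#include <assert.h>
#include <string.h>

enum pform { PFORM_SRC, PFORM_CAT, PFORM_NONE };

struct pinfo {
	enum pform	 form;		/* cued by the top-level directory */
	char		 dsec[16];	/* section from "manN" or "catN" */
	char		 arch[32];	/* optional architecture directory */
	char		 name[64];	/* file name without its suffix */
	char		 sec[16];	/* file suffix, or empty */
};

/* Copy at most dstsz - 1 bytes of src and NUL-terminate. */
static void
copypart(char *dst, size_t dstsz, const char *src, size_t len)
{
	if (len >= dstsz)
		len = dstsz - 1;
	memcpy(dst, src, len);
	dst[len] = '\0';
}

/* Returns 1 if the path has a recognised layout, 0 otherwise. */
static int
classify(const char *path, struct pinfo *pi)
{
	const char	*rest, *slash, *dot;

	assert(NULL != path && NULL != pi);
	memset(pi, 0, sizeof(*pi));
	pi->form = PFORM_NONE;

	if (0 == strncmp(path, "./", 2))
		path += 2;

	/* First component: must be man<section> or cat<section>. */
	if (NULL == (slash = strchr(path, '/')))
		return 0;
	if (0 == strncmp(path, "man", 3))
		pi->form = PFORM_SRC;
	else if (0 == strncmp(path, "cat", 3))
		pi->form = PFORM_CAT;
	else
		return 0;
	copypart(pi->dsec, sizeof(pi->dsec),
		path + 3, (size_t)(slash - path) - 3);
	rest = slash + 1;

	/* Optional second component: the architecture. */
	if (NULL != (slash = strchr(rest, '/'))) {
		copypart(pi->arch, sizeof(pi->arch),
			rest, (size_t)(slash - rest));
		rest = slash + 1;
		/* Anything deeper is an extraneous directory part. */
		if (NULL != strchr(rest, '/'))
			return 0;
	}

	/* Final component: <name>[.<section>]. */
	if ('\0' == *rest)
		return 0;
	if (NULL == (dot = strrchr(rest, '.'))) {
		copypart(pi->name, sizeof(pi->name), rest, strlen(rest));
		return 1;
	}
	copypart(pi->name, sizeof(pi->name), rest, (size_t)(dot - rest));
	copypart(pi->sec, sizeof(pi->sec), dot + 1, strlen(dot + 1));
	return 1;
}
```

For "./man1/amd64/foo.1" this yields form PFORM_SRC, dsec "1", arch "amd64", name "foo" and sec "1"; a path such as "share/foo.1" is rejected.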

Some behaviour changes and possibilities:

  (1) I'll likely kick out searching by regexp in favour of globbing, 
which is better handled natively in SQLite, but we'll see---it's just a 
matter of search performance (SQLite supports regexp with matches, but 
it's not optimal).

  (2) Obviously, we now only have one database file with two tables. 
mandocdb(8) writes into a temporary file then rename(2)s into the real 
one (unless with -u or -d).  This is much neater and more readable.

  (3) Language and encoding.  I'd like to smartify the directory parse 
to recognise a language (e.g., ru/man1/amd64) alongside the rest.  This 
way, folks can use apropos to search for native-language manuals using 
the UTF-8 methods.

  (4) Full text search.  This will only be a few lines of code as the 
heavy lifting of word hashing is all in place.  I spoke with Jorg and 
Abhinav (NetBSD GSoC folks) about having a "natural-language" CGI in 
mdocml.bsd.lv.  I think it'd be awesome and a good pre-filter for, say, 
retarded misc@ questions ("how do I configure my bridge?").
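
The write-then-rename(2) approach in (2) can be sketched as below.  This is only an illustration under assumptions, not what the attachment does with SQLite: replace_file() and the ".tmp" suffix are hypothetical names, and the payload is a plain byte buffer rather than a database.  The idea is that readers see either the old file or the complete new one, never a half-written file.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Write "len" bytes of "data" into a temporary file next to "path",
 * then rename the temporary file over "path".
 * Returns 1 on success, 0 on failure (the old file is left alone).
 */
static int
replace_file(const char *path, const char *data, size_t len)
{
	char	 tmp[1024];
	FILE	*f;
	int	 n;

	assert(NULL != path && NULL != data);

	/* Hypothetical naming scheme: "<path>.tmp". */
	n = snprintf(tmp, sizeof(tmp), "%s.tmp", path);
	if (n < 0 || (size_t)n >= sizeof(tmp))
		return 0;

	if (NULL == (f = fopen(tmp, "wb")))
		return 0;
	if (len != fwrite(data, 1, len, f)) {
		fclose(f);
		remove(tmp);
		return 0;
	}
	if (0 != fclose(f)) {
		remove(tmp);
		return 0;
	}

	/* Swap the finished file into place. */
	if (0 != rename(tmp, path)) {
		remove(tmp);
		return 0;
	}
	return 1;
}
```

Calling replace_file("mandoc.db", buf, sz) twice leaves only the second contents in mandoc.db and no stray mandoc.db.tmp behind; the file name here is just an example.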

Before committing anything, I'll transcribe apropos_db.c as well, then 
use it for a while "in production".  My plan is to make an OpenBSD 
package out of mdocml's "apropos tools" that install alternatives to the 
regular apropos and friends.  This way I can have fun and find bugs 
without displacing the prior tools.

Thoughts?

Kristaps

[-- Attachment #2: mandocdb.c --]
[-- Type: text/plain, Size: 43912 bytes --]

/*	$Id: mandocdb.c,v 1.46 2012/03/23 06:52:17 kristaps Exp $ */
/*
 * Copyright (c) 2011, 2012 Kristaps Dzonsons <kristaps@bsd.lv>
 * Copyright (c) 2011 Ingo Schwarze <schwarze@openbsd.org>
 *
 * Permission to use, copy, modify, and distribute this software for any
 * purpose with or without fee is hereby granted, provided that the above
 * copyright notice and this permission notice appear in all copies.
 *
 * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
 * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
 * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
 * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
 * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
 * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
 * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
 */
#ifdef HAVE_CONFIG_H
#include "config.h"
#endif

#include <sys/param.h>
#include <sys/stat.h>

#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <fcntl.h>
#include <fts.h>
#include <getopt.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#include <sqlite3.h>
#include <uthash.h>

#include "mdoc.h"
#include "man.h"
#include "mandoc.h"
#include "mandocdb.h"
#include "manpath.h"

/* Post a warning to stderr. */
#define WARNING(_f, _b, _fmt, _args...) \
	do if (warnings) { \
		fprintf(stderr, "%s: ", (_b)); \
		fprintf(stderr, (_fmt), ##_args); \
		if ('\0' != *(_f)) \
			fprintf(stderr, ": %s", (_f)); \
		fprintf(stderr, "\n"); \
	} while (/* CONSTCOND */ 0)
/* Post a "verbose" message to stderr. */
#define	DEBUG(_f, _b, _fmt, _args...) \
	do if (verb) { \
		fprintf(stderr, "%s: ", (_b)); \
		fprintf(stderr, (_fmt), ##_args); \
		fprintf(stderr, ": %s\n", (_f)); \
	} while (/* CONSTCOND */ 0)

enum	op {
	OP_DEFAULT = 0, /* new dbs from dir list or default config */
	OP_CONFFILE, /* new databases from custom config file */
	OP_UPDATE, /* delete/add entries in existing database */
	OP_DELETE, /* delete entries from existing database */
	OP_TEST /* change no databases, report potential problems */
};

enum	form {
	FORM_SRC, /* format is -man or -mdoc */
	FORM_CAT, /* format is cat */
	FORM_NONE /* format is unknown */
};

struct	str {
	char		*key; /* the string itself */
	char		*utf8; /* key in UTF-8 form */
	const struct of *of; /* if set, the owning parse */
	struct str	*next; /* next in owning parse sequence */
	uint64_t	 mask; /* bitmask in sequence */
	UT_hash_handle	 hash_string; /* string hash */
};

struct	id {
	ino_t		 ino; /* inode of file */
	dev_t		 dev; /* device of file */
};

struct	of {
	struct id	 id; /* unique identifier */
	struct of	*next; /* next in ofs */
	enum form	 dform; /* path-cued form */
	enum form	 sform; /* suffix-cued form */
	const char	*file; /* filename rel. to manpath */
	const char	*desc; /* parsed description */
	const char	*sec; /* suffix-cued section (or empty) */
	const char	*dsec; /* path-cued section (or empty) */
	const char	*arch; /* path-cued arch. (or empty) */
	const char	*name; /* name (from filename) (not empty) */
	UT_hash_handle	 hash_ino; /* inode hash */
	UT_hash_handle	 hash_filename; /* filename hash */
};

enum	stmt {
	STMT_DELETE = 0, /* delete manpage */
	STMT_INSERT_DOC, /* insert manpage */
	STMT_INSERT_KEY, /* insert parsed key */
	STMT__MAX
};

typedef	int (*mdoc_fp)(struct of *, const struct mdoc_node *);

struct	mdoc_handler {
	mdoc_fp		 fp; /* optional handler */
	uint64_t	 mask;  /* set unless handler returns 0 */
	int		 flags;  /* for use by pmdoc_node */
#define	MDOCF_CHILD	 0x01  /* automatically index child nodes */
};

static	void	 dbclose(const char *, int);
static	void	 dbindex(struct mchars *, 
			const struct of *, const char *);
static	int	 dbopen(const char *, int);
static	void	 dbprune(const char *);
static	int	 dirscan(size_t, char *[], const char *);
static	int	 dirtreescan(const char *);
static	const char *filecheck(const char *);
static	void	 filescan(const char *, const char *);
static	int	 inocheck(const struct stat *);
static	void	 ofadd(const char *, int, const char *, 
			const char *, const char *, const char *, 
			const char *, const struct stat *st);
static	void	 offree(void);
static	int	 ofmerge(struct mchars *, 
			struct mparse *, const char *);
static	void	 parse_catpage(struct of *, const char *);
static	int	 parse_man(struct of *, 
			const struct man_node *);
static	void	 parse_mdoc(struct of *, const struct mdoc_node *);
static	int	 parse_mdoc_body(struct of *, const struct mdoc_node *);
static	int	 parse_mdoc_head(struct of *, const struct mdoc_node *);
static	int	 parse_mdoc_Fd(struct of *, const struct mdoc_node *);
static	int	 parse_mdoc_Fn(struct of *, const struct mdoc_node *);
static	int	 parse_mdoc_In(struct of *, const struct mdoc_node *);
static	int	 parse_mdoc_Nd(struct of *, const struct mdoc_node *);
static	int	 parse_mdoc_Nm(struct of *, const struct mdoc_node *);
static	int	 parse_mdoc_Sh(struct of *, const struct mdoc_node *);
static	int	 parse_mdoc_St(struct of *, const struct mdoc_node *);
static	int	 parse_mdoc_Xr(struct of *, const struct mdoc_node *);
static	void	 putkey(const struct of *, 
			const char *, uint64_t);
static	void	 putkeys(const struct of *, 
			const char *, int, uint64_t);
static	void	 putmdockey(const struct of *,
			const struct mdoc_node *, uint64_t);
static	char 	*stradd(const char *);
static	char 	*straddbuf(const char *, size_t);
static	size_t	 utf8(unsigned int, char [7]);
static	void	 utf8key(struct mchars *, struct str *);
static	void 	 wordadd(const struct of *, const char *, uint64_t);
static	void 	 wordaddbuf(const struct of *, 
			const char *, size_t, uint64_t);

static	char		*progname;
static	int	 	 use_all; /* use all found files */
static	int		 nodb; /* no database changes */
static	int	  	 verb; /* print what we're doing */
static	int	  	 warnings; /* warn about crap */
static	enum op	  	 op; /* operational mode */
static	struct of	*ofs = NULL; /* vector of files to parse */
static	struct of	*inos = NULL; /* table of inodes in path */
static	struct of	*filenames = NULL; /* table of filenames */
static	struct str	*strings = NULL; /* table of all strings */
static	struct str	*words = NULL; /* list of words in parse */
static	sqlite3		*db = NULL; /* the current database */
static	sqlite3_stmt	*stmts[STMT__MAX]; /* current statements */

static	const struct mdoc_handler mdocs[MDOC_MAX] = {
	{ NULL, 0, 0 },  /* Ap */
	{ NULL, 0, 0 },  /* Dd */
	{ NULL, 0, 0 },  /* Dt */
	{ NULL, 0, 0 },  /* Os */
	{ parse_mdoc_Sh, TYPE_Sh, MDOCF_CHILD }, /* Sh */
	{ parse_mdoc_head, TYPE_Ss, MDOCF_CHILD }, /* Ss */
	{ NULL, 0, 0 },  /* Pp */
	{ NULL, 0, 0 },  /* D1 */
	{ NULL, 0, 0 },  /* Dl */
	{ NULL, 0, 0 },  /* Bd */
	{ NULL, 0, 0 },  /* Ed */
	{ NULL, 0, 0 },  /* Bl */
	{ NULL, 0, 0 },  /* El */
	{ NULL, 0, 0 },  /* It */
	{ NULL, 0, 0 },  /* Ad */
	{ NULL, TYPE_An, MDOCF_CHILD },  /* An */
	{ NULL, TYPE_Ar, MDOCF_CHILD },  /* Ar */
	{ NULL, TYPE_Cd, MDOCF_CHILD },  /* Cd */
	{ NULL, TYPE_Cm, MDOCF_CHILD },  /* Cm */
	{ NULL, TYPE_Dv, MDOCF_CHILD },  /* Dv */
	{ NULL, TYPE_Er, MDOCF_CHILD },  /* Er */
	{ NULL, TYPE_Ev, MDOCF_CHILD },  /* Ev */
	{ NULL, 0, 0 },  /* Ex */
	{ NULL, TYPE_Fa, MDOCF_CHILD },  /* Fa */
	{ parse_mdoc_Fd, TYPE_In, 0 },  /* Fd */
	{ NULL, TYPE_Fl, MDOCF_CHILD },  /* Fl */
	{ parse_mdoc_Fn, 0, 0 },  /* Fn */
	{ NULL, TYPE_Ft, MDOCF_CHILD },  /* Ft */
	{ NULL, TYPE_Ic, MDOCF_CHILD },  /* Ic */
	{ parse_mdoc_In, TYPE_In, MDOCF_CHILD },  /* In */
	{ NULL, TYPE_Li, MDOCF_CHILD },  /* Li */
	{ parse_mdoc_Nd, TYPE_Nd, MDOCF_CHILD },  /* Nd */
	{ parse_mdoc_Nm, TYPE_Nm, MDOCF_CHILD },  /* Nm */
	{ NULL, 0, 0 },  /* Op */
	{ NULL, 0, 0 },  /* Ot */
	{ NULL, TYPE_Pa, MDOCF_CHILD },  /* Pa */
	{ NULL, 0, 0 },  /* Rv */
	{ parse_mdoc_St, TYPE_St, 0 },  /* St */
	{ NULL, TYPE_Va, MDOCF_CHILD },  /* Va */
	{ parse_mdoc_body, TYPE_Va, MDOCF_CHILD },  /* Vt */
	{ parse_mdoc_Xr, TYPE_Xr, 0 },  /* Xr */
	{ NULL, 0, 0 },  /* %A */
	{ NULL, 0, 0 },  /* %B */
	{ NULL, 0, 0 },  /* %D */
	{ NULL, 0, 0 },  /* %I */
	{ NULL, 0, 0 },  /* %J */
	{ NULL, 0, 0 },  /* %N */
	{ NULL, 0, 0 },  /* %O */
	{ NULL, 0, 0 },  /* %P */
	{ NULL, 0, 0 },  /* %R */
	{ NULL, 0, 0 },  /* %T */
	{ NULL, 0, 0 },  /* %V */
	{ NULL, 0, 0 },  /* Ac */
	{ NULL, 0, 0 },  /* Ao */
	{ NULL, 0, 0 },  /* Aq */
	{ NULL, TYPE_At, MDOCF_CHILD },  /* At */
	{ NULL, 0, 0 },  /* Bc */
	{ NULL, 0, 0 },  /* Bf */
	{ NULL, 0, 0 },  /* Bo */
	{ NULL, 0, 0 },  /* Bq */
	{ NULL, TYPE_Bsx, MDOCF_CHILD },  /* Bsx */
	{ NULL, TYPE_Bx, MDOCF_CHILD },  /* Bx */
	{ NULL, 0, 0 },  /* Db */
	{ NULL, 0, 0 },  /* Dc */
	{ NULL, 0, 0 },  /* Do */
	{ NULL, 0, 0 },  /* Dq */
	{ NULL, 0, 0 },  /* Ec */
	{ NULL, 0, 0 },  /* Ef */
	{ NULL, TYPE_Em, MDOCF_CHILD },  /* Em */
	{ NULL, 0, 0 },  /* Eo */
	{ NULL, TYPE_Fx, MDOCF_CHILD },  /* Fx */
	{ NULL, TYPE_Ms, MDOCF_CHILD },  /* Ms */
	{ NULL, 0, 0 },  /* No */
	{ NULL, 0, 0 },  /* Ns */
	{ NULL, TYPE_Nx, MDOCF_CHILD },  /* Nx */
	{ NULL, TYPE_Ox, MDOCF_CHILD },  /* Ox */
	{ NULL, 0, 0 },  /* Pc */
	{ NULL, 0, 0 },  /* Pf */
	{ NULL, 0, 0 },  /* Po */
	{ NULL, 0, 0 },  /* Pq */
	{ NULL, 0, 0 },  /* Qc */
	{ NULL, 0, 0 },  /* Ql */
	{ NULL, 0, 0 },  /* Qo */
	{ NULL, 0, 0 },  /* Qq */
	{ NULL, 0, 0 },  /* Re */
	{ NULL, 0, 0 },  /* Rs */
	{ NULL, 0, 0 },  /* Sc */
	{ NULL, 0, 0 },  /* So */
	{ NULL, 0, 0 },  /* Sq */
	{ NULL, 0, 0 },  /* Sm */
	{ NULL, 0, 0 },  /* Sx */
	{ NULL, TYPE_Sy, MDOCF_CHILD },  /* Sy */
	{ NULL, TYPE_Tn, MDOCF_CHILD },  /* Tn */
	{ NULL, 0, 0 },  /* Ux */
	{ NULL, 0, 0 },  /* Xc */
	{ NULL, 0, 0 },  /* Xo */
	{ parse_mdoc_head, TYPE_Fn, 0 },  /* Fo */
	{ NULL, 0, 0 },  /* Fc */
	{ NULL, 0, 0 },  /* Oo */
	{ NULL, 0, 0 },  /* Oc */
	{ NULL, 0, 0 },  /* Bk */
	{ NULL, 0, 0 },  /* Ek */
	{ NULL, 0, 0 },  /* Bt */
	{ NULL, 0, 0 },  /* Hf */
	{ NULL, 0, 0 },  /* Fr */
	{ NULL, 0, 0 },  /* Ud */
	{ NULL, TYPE_Lb, MDOCF_CHILD },  /* Lb */
	{ NULL, 0, 0 },  /* Lp */
	{ NULL, TYPE_Lk, MDOCF_CHILD },  /* Lk */
	{ NULL, TYPE_Mt, MDOCF_CHILD },  /* Mt */
	{ NULL, 0, 0 },  /* Brq */
	{ NULL, 0, 0 },  /* Bro */
	{ NULL, 0, 0 },  /* Brc */
	{ NULL, 0, 0 },  /* %C */
	{ NULL, 0, 0 },  /* Es */
	{ NULL, 0, 0 },  /* En */
	{ NULL, TYPE_Dx, MDOCF_CHILD },  /* Dx */
	{ NULL, 0, 0 },  /* %Q */
	{ NULL, 0, 0 },  /* br */
	{ NULL, 0, 0 },  /* sp */
	{ NULL, 0, 0 },  /* %U */
	{ NULL, 0, 0 },  /* Ta */
};

int
main(int argc, char *argv[])
{
	int		 ch, rc, i;
	const char	*dir;
	struct str	*keyp, *keypp;
	struct mchars	*mc;
	struct manpaths	 dirs;
	struct mparse	*mp;

	memset(stmts, 0, STMT__MAX * sizeof(sqlite3_stmt *));
	memset(&dirs, 0, sizeof(struct manpaths));

	progname = strrchr(argv[0], '/');
	if (progname == NULL)
		progname = argv[0];
	else
		++progname;

#define	CHECKOP(_op, _ch) do \
	if (OP_DEFAULT != (_op)) { \
		fprintf(stderr, "-%c: Conflicting option\n", (_ch)); \
		goto usage; \
	} while (/*CONSTCOND*/0)

	dir = NULL;
	op = OP_DEFAULT;

	while (-1 != (ch = getopt(argc, argv, "aC:d:ntu:vW")))
		switch (ch) {
		case ('a'):
			use_all = 1;
			break;
		case ('C'):
			CHECKOP(op, ch);
			dir = optarg;
			op = OP_CONFFILE;
			break;
		case ('d'):
			CHECKOP(op, ch);
			dir = optarg;
			op = OP_UPDATE;
			break;
		case ('n'):
			nodb = 1;
			break;
		case ('t'):
			CHECKOP(op, ch);
			dup2(STDOUT_FILENO, STDERR_FILENO);
			op = OP_TEST;
			nodb = use_all = warnings = 1;
			dir = ".";
			break;
		case ('u'):
			CHECKOP(op, ch);
			dir = optarg;
			op = OP_DELETE;
			break;
		case ('v'):
			verb++;
			break;
		case ('W'):
			warnings = 1;
			break;
		default:
			goto usage;
		}

	argc -= optind;
	argv += optind;

	if (OP_CONFFILE == op && argc > 0) {
		fprintf(stderr, "-C: Too many arguments\n");
		goto usage;
	}

	rc = 1;
	mp = mparse_alloc(MPARSE_AUTO, 
		MANDOCLEVEL_FATAL, NULL, NULL);
	mc = mchars_alloc();

	if (OP_UPDATE == op || OP_DELETE == op || OP_TEST == op) {
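		/*
		 * All of these deal with a specific directory.
		 * Jump into that directory then collect files specified
		 * on the command-line.
		 * Files named explicitly are always processed: filescan()
		 * asserts use_all, which -d and -u alone never set.
		 */
		use_all = 1;
		if (0 == (rc = dirscan(argc, argv, dir)))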
			goto out;
		if (0 == (rc = dbopen(dir, 1)))
			goto out;
		if (OP_TEST != op)
			dbprune(dir);
		if (OP_DELETE != op)
			rc = ofmerge(mc, mp, dir);
		else
			dbclose(dir, 1);
	} else {
		/*
		 * If we have arguments, use them as our manpaths.
		 * If we don't, grok from manpath(1) or however else
		 * manpath_parse() wants to do it.
		 */

		if (argc > 0) {
			dirs.paths = mandoc_calloc
				(argc, sizeof(char *));
			dirs.sz = argc;
			for (i = 0; i < argc; i++) 
				dirs.paths[i] = mandoc_strdup(argv[i]);
		} else
			manpath_parse(&dirs, dir, NULL, NULL);

		/*
		 * First scan the tree rooted at a base directory.
		 * Then whack its database (if one exists), parse, and
		 * build up the database.
		 */

		for (i = 0; i < dirs.sz; i++) {
			if (0 == (rc = dirtreescan(dirs.paths[i])))
				goto out;
			remove(MANDOC_DB);
			if (0 == (rc = ofmerge(mc, mp, dirs.paths[i])))
				goto out;
			HASH_CLEAR(hash_ino, inos);
			HASH_CLEAR(hash_filename, filenames);
			offree();
		}
	}
out:
	manpath_free(&dirs);
	mchars_free(mc);
	mparse_free(mp);
	HASH_ITER(hash_string, strings, keyp, keypp) {
		HASH_DELETE(hash_string, strings, keyp);
		if (keyp->key != keyp->utf8)
			free(keyp->utf8);
		free(keyp->key);
		free(keyp);
	}
	HASH_CLEAR(hash_string, strings);
	HASH_CLEAR(hash_ino, inos);
	HASH_CLEAR(hash_filename, filenames);
	offree();
	return(rc ? EXIT_SUCCESS : EXIT_FAILURE);
usage:
	fprintf(stderr, "usage: %s [-anvW] [-C file]\n"
			"       %s [-anvW] dir ...\n"
			"       %s [-nvW] -d dir [file ...]\n"
			"       %s [-nvW] -u dir [file ...]\n"
			"       %s -t file ...\n",
		       progname, progname, progname, 
		       progname, progname);

	return(EXIT_FAILURE);
}

/*
 * Scan a directory tree rooted at "base" for manpages.
 * We use fts(), scanning directory parts along the way for clues to our
 * section and architecture.
 *
 * If use_all has been specified, grok all files.
 * If not, sanitise paths to the following:
 *
 *   [./]man*[/<arch>]/<name>.<section> 
 *   or
 *   [./]cat<section>[/<arch>]/<name>.0
 */
static int
dirtreescan(const char *base)
{
	FTS		*f;
	FTSENT		*ff;
	int		 fd, dform;
	size_t		 sz;
	char		*sec;
	const char	*file, *dsec, *arch, *cp, *name;
	char		 cwd[MAXPATHLEN];
	const char	*argv[2];

	/*
	 * Remember where we started by keeping a fd open to the origin
	 * path component.
	 * This is because we chdir() to relative paths, so we can't
	 * just re-chdir() into the cwd if it's also relative.
	 */

	if (NULL == getcwd(cwd, MAXPATHLEN)) {
		perror(NULL);
		return(0);
	} else if (-1 == (fd = open(cwd, O_RDONLY, 0))) {
		perror(cwd);
		return(0);
	}

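	/*
	 * Sanitise the base directory.
	 * "sz" is the length of the prefix (base plus one separator)
	 * to strip from each fts_path; fts(3) does not add a second
	 * separator if base already ends in one.
	 */

	if (0 == strncmp(base, "./", 2))
		base += 2;
	sz = strlen(base);
	if (0 == sz || '/' != base[sz - 1])
		sz++;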
	argv[0] = base;
	argv[1] = (char *)NULL;

	/*
	 * Walk through all components under the directory, using the
	 * logical descent of files.
	 */

	f = fts_open((char * const *)argv, FTS_LOGICAL, NULL);
	if (NULL == f) {
		perror(base);
		close(fd);
		return(0);
	}

	dsec = arch = NULL;
	dform = FORM_NONE;

	while (NULL != (ff = fts_read(f))) {
		/*
		 * If we're a regular file, add an "of" by using the
		 * stored directory data and handling the filename.
		 * Disallow duplicate (hard-linked) files.
		 */

		if (FTS_F == ff->fts_info) {
			if ( ! use_all && ff->fts_level < 2) {
				WARNING(ff->fts_path + sz, base,
					"Extraneous file");
				continue;
			} else if (inocheck(ff->fts_statp)) {
				WARNING(ff->fts_path + sz, base,
					"Duplicate file");
				continue;
			} 

			cp = ff->fts_name;
			if (NULL != (cp = strrchr(cp, '.'))) {
				if (0 == strcmp(cp + 1, "html")) {
					WARNING(ff->fts_path + sz, 
						base, "Skipping html");
					continue;
				} else if (0 == strcmp(cp + 1, "gz")) {
					WARNING(ff->fts_path + sz, 
						base, "Skipping gz");
					continue;
				} else if (0 == strcmp(cp + 1, "ps")) {
					WARNING(ff->fts_path + sz, 
						base, "Skipping ps");
					continue;
				} else if (0 == strcmp(cp + 1, "pdf")) {
					WARNING(ff->fts_path + sz, 
						base, "Skipping pdf");
					continue;
				}
			}

			file = stradd(ff->fts_path + sz);
			name = stradd(ff->fts_name);
			if (NULL != (sec = strrchr(name, '.')))
				*sec++ = '\0';
			ofadd(base, dform, file, 
				name, dsec, sec, arch, ff->fts_statp);
			continue;
		} else if (FTS_D != ff->fts_info && 
				FTS_DP != ff->fts_info)
			continue;

		switch (ff->fts_level) {
		case (0):
			/* Ignore the root directory. */
			break;
		case (1):
			/*
			 * This might contain manX/ or catX/.
			 * Try to infer this from the name.
			 * If we're not in use_all, enforce it.
			 */
			dsec = NULL;
			dform = FORM_NONE;
			cp = ff->fts_name;
			if (FTS_DP == ff->fts_info)
				break;

			if (0 == strncmp(cp, "man", 3)) {
				dform = FORM_SRC;
				dsec = stradd(cp + 3);
			} else if (0 == strncmp(cp, "cat", 3)) {
				dform = FORM_CAT;
				dsec = stradd(cp + 3);
			}

			if (NULL != dsec || use_all) 
				break;

			WARNING(ff->fts_path + sz, base,
				"Unknown directory part");
			fts_set(f, ff, FTS_SKIP);
			break;
		case (2):
			/*
			 * Possibly our architecture.
			 * If we're descending, keep tabs on it.
			 */
			arch = NULL;
			if (FTS_DP != ff->fts_info && NULL != dsec)
				arch = stradd(ff->fts_name);
			break;
		default:
			if (FTS_DP == ff->fts_info || use_all)
				break;
			WARNING(ff->fts_path + sz, base,
				"Extraneous directory part");
			fts_set(f, ff, FTS_SKIP);
			break;
		}
	}

	fts_close(f);
	if (errno) {
		perror(base);
		close(fd);
		return(0);
	}

	/*
	 * We want to exit in our base directory.
	 * To do so, first return to the original cwd.
	 * Then use chdir() relative to that.
	 */

	if (-1 == fchdir(fd)) {
		perror(cwd);
		close(fd);
		return(0);
	}
	close(fd);
	if (-1 == chdir(base)) {
		perror(base);
		return(0);
	}
	return(1);
}

static int
dirscan(size_t argc, char *argv[], const char *base)
{
	size_t		 i;

	if (-1 == chdir(base)) {
		perror(base);
		return(0);
	}

	for (i = 0; i < argc; i++)
		filescan(argv[i], base);

	return(1);
}

/*
 * Add a file to the file vector.
 * Do not verify that it's a "valid" looking manpage.
 *
 * Then try to infer the manual section, architecture and page name from
 * the path, assuming it looks like
 *
 *   [./]man*[/<arch>]/<name>.<section> 
 *   or
 *   [./]cat<section>[/<arch>]/<name>.0
 *
 * Stuff this information directly into the "of" vector.
 */
static void
filescan(const char *file, const char *base)
{
	const char	*sec, *arch, *name, *dsec, *filep;
	char		*p, *start, *buf;
	int		 dform;
	struct stat	 st;

	assert(use_all);

	if (0 == strncmp(file, "./", 2))
		file += 2;

	if (-1 == stat(file, &st)) {
		WARNING(file, base, "%s", strerror(errno));
		return;
	} else if ( ! S_ISREG(st.st_mode)) {
		WARNING(file, base, "Not a regular file");
		return;
	} else if (inocheck(&st)) {
		WARNING(file, base, "Duplicate file");
		return;
	}

	filep = stradd(file);
	buf = mandoc_strdup(file);
	start = buf;
	sec = arch = name = dsec = NULL;
	dform = FORM_NONE;

	/*
	 * First try to guess our directory structure.
	 * If we find a separator, try to look for man* or cat*.
	 * If we find one of these and what's underneath is a directory,
	 * assume it's an architecture.
	 */

	if (NULL != (p = strchr(start, '/'))) {
		*p++ = '\0';
		if (0 == strncmp(start, "man", 3)) {
			dform = FORM_SRC;
			dsec = start + 3;
		} else if (0 == strncmp(start, "cat", 3)) {
			dform = FORM_CAT;
			dsec = start + 3;
		}

		start = p;
		if (NULL != dsec && NULL != (p = strchr(start, '/'))) {
			*p++ = '\0';
			arch = start;
			start = p;
		} 
	}

	/*
	 * Now check the file suffix.
	 * Suffix of `.0' indicates a catpage, `.1-9' is a manpage.
	 */

	/* Scan backwards without ever stepping before "start". */
	p = strrchr(start, '\0');
	while (p > start && '/' != p[-1] && '.' != p[-1])
		p--;

	if (p > start && '.' == p[-1]) {
		p[-1] = '\0';
		sec = p;
	}

	/*
	 * Now try to parse the name.
	 * Use the filename portion of the path.
	 */

	name = start;
	if (NULL != (p = strrchr(start, '/'))) {
		name = p + 1;
		*p = '\0';
	} 

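	/*
	 * The parsed components point into buf, which is freed below,
	 * so intern them with stradd() before storing them in the "of".
	 */
	ofadd(base, dform, filep, stradd(name),
		NULL != dsec ? stradd(dsec) : NULL,
		NULL != sec ? stradd(sec) : NULL,
		NULL != arch ? stradd(arch) : NULL, &st);
	free(buf);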
}

static const char *
filecheck(const char *name)
{
	struct of	*p;

	HASH_FIND(hash_filename, filenames, name, strlen(name), p);
	return(NULL != p ? p->file : NULL);
}

static int
inocheck(const struct stat *st)
{
	struct id	 id;
	struct of	*p;

	memset(&id, 0, sizeof(struct id));
	id.ino = st->st_ino;
	id.dev = st->st_dev;

	HASH_FIND(hash_ino, inos, &id, sizeof(struct id), p);
	return(NULL != p);
}

static void
ofadd(const char *base, int dform, const char *file, 
		const char *name, const char *dsec, const char *sec, 
		const char *arch, const struct stat *st)
{
	struct of	*of;
	int		 sform;
	size_t		 sz;

	assert(NULL != file);

	if (NULL == name)
		name = "";
	if (NULL == sec)
		sec = "";
	if (NULL == dsec)
		dsec = "";
	if (NULL == arch)
		arch = "";

	sform = FORM_NONE;
	if (NULL != sec && *sec <= '9' && *sec >= '1')
		sform = FORM_SRC;
	else if (NULL != sec && *sec == '0')
		sform = FORM_CAT;

	/* XXX: structure warnings go here */

	of = mandoc_calloc(1, sizeof(struct of));
	of->file = file;
	of->name = name;
	of->sec = sec;
	of->dsec = dsec;
	of->arch = arch;
	of->sform = sform;
	of->dform = dform;
	of->id.ino = st->st_ino;
	of->id.dev = st->st_dev;
	of->next = ofs;
	/* Length of the ".<sec>" suffix to drop from the filename key. */
	sz = '\0' != *of->sec ? strlen(of->sec) + 1 : 0;
	ofs = of;

	/*
	 * Add to unique identifier hash.
	 * Then add to the filename hash, keyed on the path without its
	 * suffix, so that a catpage can later look up its man source.
	 */
	HASH_ADD(hash_ino, inos, id, sizeof(struct id), of);
	HASH_ADD_KEYPTR(hash_filename, filenames, file, strlen(file) - sz, of);
}

static void
offree(void)
{
	struct of	*of;

	while (NULL != (of = ofs)) {
		ofs = of->next;
		free(of);
	}
}

static int
ofmerge(struct mchars *mc, struct mparse *mp, const char *base)
{
	int		 form;
	size_t		 sz;
	struct mdoc	*mdoc;
	struct man	*man;
	char		 buf[MAXPATHLEN];
	char		*bufp;
	const char	*msec, *march, *mtitle, *cp;
	struct of	*of;
	enum mandoclevel lvl;

	if (0 == dbopen(base, 0))
		return(0);

	for (of = ofs; NULL != of; of = of->next) {
		/*
		 * If we're a catpage (as defined by our path), then see
		 * if a manpage exists by the same name (ignoring the
		 * suffix).
		 * If it does, then we want to use it instead of our
		 * own.
		 */

		if ( ! use_all && FORM_CAT == of->dform) {
			sz = strlcpy(buf, of->file, MAXPATHLEN);
			if (sz >= MAXPATHLEN) {
				WARNING(of->file, base, 
					"Filename too long");
				continue;
			}
			bufp = strstr(buf, "cat");
			assert(NULL != bufp);
			memcpy(bufp, "man", 3);
			if (NULL != (bufp = strrchr(buf, '.')))
				*bufp = '\0';
			if (NULL != (cp = filecheck(buf))) {
				WARNING(of->file, base, "Man "
					"source exists: %s", cp);
				continue;
			}
		}

		words = NULL;
		mparse_reset(mp);
		mdoc = NULL;
		man = NULL;
		form = 0;
		msec = of->dsec;
		march = of->arch;
		mtitle = of->name;

		/*
		 * Try interpreting the file as mdoc(7) or man(7)
		 * source code, unless it is already known to be
		 * formatted.  Fall back to formatted mode.
		 */

		if (FORM_SRC == of->dform || FORM_SRC == of->sform) {
			lvl = mparse_readfd(mp, -1, of->file);
			if (lvl < MANDOCLEVEL_FATAL)
				mparse_result(mp, &mdoc, &man);
		} 

		if (NULL != mdoc) {
			form = 1;
			msec = mdoc_meta(mdoc)->msec;
			march = mdoc_meta(mdoc)->arch;
			mtitle = mdoc_meta(mdoc)->title;
		} else if (NULL != man) {
			form = 1;
			msec = man_meta(man)->msec;
			march = "";
			mtitle = man_meta(man)->title;
		} 

		if (NULL == msec) 
			msec = "";
		if (NULL == march) 
			march = "";
		if (NULL == mtitle) 
			mtitle = "";

		/*
		 * Check whether the manual section given in a file
		 * agrees with the directory where the file is located.
		 * Some manuals have suffixes like (3p) on their
		 * section number either inside the file or in the
		 * directory name, some are linked into more than one
		 * section, like encrypt(1) = makekey(8).  Do not skip
		 * manuals for such reasons.
		 */

		if (form && strcasecmp(msec, of->dsec))
			WARNING(of->file, base, "Section %s "
				"manual in %s directory", 
				msec, of->dsec);

		/*
		 * Manual page directories exist for each kernel
		 * architecture as returned by machine(1).
		 * However, many manuals only depend on the
		 * application architecture as returned by arch(1).
		 * For example, some (2/ARM) manuals are shared
		 * across the "armish" and "zaurus" kernel
		 * architectures.
		 * A few manuals are even shared across completely
		 * different architectures, for example fdformat(1)
		 * on amd64, i386, sparc, and sparc64.
		 * Thus, warn about architecture mismatches,
		 * but don't skip manuals for this reason.
		 */

		if (strcasecmp(march, of->arch))
			WARNING(of->file, base, "Architecture %s "
				"manual in %s directory", 
				march, of->arch);

		putkey(of, of->name, TYPE_Nm);

		if (NULL != mdoc) {
			if (NULL != (cp = mdoc_meta(mdoc)->name))
				putkey(of, cp, TYPE_Nm);
			parse_mdoc(of, mdoc_node(mdoc));
		} else if (NULL != man)
			parse_man(of, man_node(man));
		else
			parse_catpage(of, base);

		dbindex(mc, of, base);
	}

	dbclose(base, 0);
	return(1);
}

static void
parse_catpage(struct of *of, const char *base)
{
	FILE		*stream;
	char		*line, *p, *title;
	size_t		 len, plen, titlesz;

	if (NULL == (stream = fopen(of->file, "r"))) {
		WARNING(of->file, base, "%s", strerror(errno));
		return;
	}

	/* Skip to first blank line. */

	while (NULL != (line = fgetln(stream, &len)))
		if ('\n' == *line)
			break;

	/*
	 * Assume the first line that is not indented
	 * is the first section header.  Skip to it.
	 */

	while (NULL != (line = fgetln(stream, &len)))
		if ('\n' != *line && ' ' != *line)
			break;
	
	/*
	 * Read up until the next section into a buffer.
	 * Strip the leading and trailing newline from each read line,
	 * appending a trailing space.
	 * Ignore empty (whitespace-only) lines.
	 */

	titlesz = 0;
	title = NULL;

	while (NULL != (line = fgetln(stream, &len))) {
		if (' ' != *line || '\n' != line[len - 1])
			break;
		while (len > 0 && isspace((unsigned char)*line)) {
			line++;
			len--;
		}
		/* Whitespace-only lines are consumed entirely. */
		if (0 == len)
			continue;
		title = mandoc_realloc(title, titlesz + len);
		memcpy(title + titlesz, line, len);
		titlesz += len;
		title[titlesz - 1] = ' ';
	}

	/*
	 * If no page content can be found, or the input line
	 * is already the next section header, or there is no
	 * trailing newline, give up: warn and leave the page
	 * without a description.
	 */

	if (NULL == title || '\0' == *title) {
		WARNING(of->file, base, "Cannot find NAME section");
		fclose(stream);
		free(title);
		return;
	}

	title = mandoc_realloc(title, titlesz + 1);
	title[titlesz] = '\0';

	/*
	 * Skip to the first dash.
	 * Use the remaining line as the description (no more than 70
	 * bytes).
	 */

	if (NULL != (p = strstr(title, "- "))) {
		for (p += 2; ' ' == *p || '\b' == *p; p++)
			/* Skip to next word. */ ;
	} else {
		WARNING(of->file, base, "No dash in title line");
		p = title;
	}

	plen = strlen(p);

	/* Strip backspace-encoding from line. */

	while (NULL != (line = memchr(p, '\b', plen))) {
		len = line - p;
		if (0 == len) {
			memmove(line, line + 1, plen--);
			continue;
		} 
		memmove(line - 1, line + 1, plen - len);
		plen -= 2;
	}

	of->desc = stradd(p);
	putkey(of, p, TYPE_Nd);
	fclose(stream);
	free(title);
}

static void
putkey(const struct of *of, const char *value, uint64_t type)
{

	wordadd(of, value, type);
}

static void
putkeys(const struct of *of, const char *value, int sz, uint64_t type)
{

	wordaddbuf(of, value, sz, type);
}

static void
putmdockey(const struct of *of, const struct mdoc_node *n, uint64_t m)
{

	for ( ; NULL != n; n = n->next) {
		if (n->child)
			putmdockey(of, n->child, m);
		if (MDOC_TEXT == n->type)
			putkey(of, n->string, m);
	}
}

static int
parse_man(struct of *of, const struct man_node *n)
{
	const struct man_node *head, *body;
	char		*start, *sv, *title;
	char		 byte;
	size_t		 sz, titlesz;

	if (NULL == n)
		return(0);

	/*
	 * We're only searching for one thing: the first text child in
	 * the BODY of a NAME section.  Since we don't keep track of
	 * sections in -man, jump through some hoops to find out whether
	 * we're in the correct section or not.
	 */

	if (MAN_BODY == n->type && MAN_SH == n->tok) {
		body = n;
		assert(body->parent);
		if (NULL != (head = body->parent->head) &&
				1 == head->nchild &&
				NULL != (head = (head->child)) &&
				MAN_TEXT == head->type &&
				0 == strcmp(head->string, "NAME") &&
				NULL != (body = body->child) &&
				MAN_TEXT == body->type) {

			title = NULL;
			titlesz = 0;

			/*
			 * Suck the entire NAME section into memory.
			 * Yes, we might run away.
			 * But too many manuals have big, spread-out
			 * NAME sections over many lines.
			 */

			for ( ; NULL != body; body = body->next) {
				if (MAN_TEXT != body->type)
					break;
				if (0 == (sz = strlen(body->string)))
					continue;
				title = mandoc_realloc
					(title, titlesz + sz + 1);
				memcpy(title + titlesz, body->string, sz);
				titlesz += sz + 1;
				title[titlesz - 1] = ' ';
			}
			if (NULL == title)
				return(1);

			title = mandoc_realloc(title, titlesz + 1);
			title[titlesz] = '\0';

			/* Skip leading space.  */

			sv = title;
			while (isspace((unsigned char)*sv))
				sv++;

			if (0 == (sz = strlen(sv))) {
				free(title);
				return(1);
			}

			/* Erase trailing space. */

			start = &sv[sz - 1];
			while (start > sv && isspace((unsigned char)*start))
				*start-- = '\0';

			if (start == sv) {
				free(title);
				return(1);
			}

			start = sv;

			/* 
			 * Go through a special heuristic dance here.
			 * Conventionally, one or more manual names are
			 * comma-specified prior to a whitespace, then a
			 * dash, then a description.  Try to puzzle out
			 * the name parts here.
			 */

			for ( ;; ) {
				sz = strcspn(start, " ,");
				if ('\0' == start[sz])
					break;

				byte = start[sz];
				start[sz] = '\0';

				putkey(of, start, TYPE_Nm);

				if (' ' == byte) {
					start += sz + 1;
					break;
				}

				assert(',' == byte);
				start += sz + 1;
				while (' ' == *start)
					start++;
			}

			if (sv == start) {
				putkey(of, start, TYPE_Nm);
				free(title);
				return(1);
			}

			while (isspace((unsigned char)*start))
				start++;

			if (0 == strncmp(start, "-", 1))
				start += 1;
			else if (0 == strncmp(start, "\\-\\-", 4))
				start += 4;
			else if (0 == strncmp(start, "\\-", 2))
				start += 2;
			else if (0 == strncmp(start, "\\(en", 4))
				start += 4;
			else if (0 == strncmp(start, "\\(em", 4))
				start += 4;

			while (' ' == *start)
				start++;

			assert(NULL == of->desc);
			of->desc = stradd(start);
			putkey(of, start, TYPE_Nd);
			free(title);
			return(1);
		}
	}

	for (n = n->child; n; n = n->next)
		if (parse_man(of, n))
			return(1);

	return(0);
}

static void
parse_mdoc(struct of *of, const struct mdoc_node *n)
{

	for (n = n->child; NULL != n; n = n->next) {
		switch (n->type) {
		case (MDOC_ELEM):
			/* FALLTHROUGH */
		case (MDOC_BLOCK):
			/* FALLTHROUGH */
		case (MDOC_HEAD):
			/* FALLTHROUGH */
		case (MDOC_BODY):
			/* FALLTHROUGH */
		case (MDOC_TAIL):
			if (NULL != mdocs[n->tok].fp &&
			    0 == (*mdocs[n->tok].fp)(of, n))
				break;

			if (MDOCF_CHILD & mdocs[n->tok].flags)
				putmdockey(of, n->child, mdocs[n->tok].mask);
			break;
		default:
			assert(MDOC_ROOT != n->type);
			continue;
		}
		if (NULL != n->child)
			parse_mdoc(of, n);
	}
}

static int
parse_mdoc_Fd(struct of *of, const struct mdoc_node *n)
{
	const char	*start, *end;
	size_t		 sz;

	if (SEC_SYNOPSIS != n->sec ||
			NULL == (n = n->child) || 
			MDOC_TEXT != n->type)
		return(0);

	/*
	 * Only consider those `Fd' macro fields that begin with an
	 * "inclusion" token (versus, e.g., #define).
	 */

	if (strcmp("#include", n->string))
		return(0);

	if (NULL == (n = n->next) || MDOC_TEXT != n->type)
		return(0);

	/*
	 * Strip away the enclosing angle brackets and make sure we're
	 * not zero-length.
	 */

	start = n->string;
	if ('<' == *start || '"' == *start)
		start++;

	if (0 == (sz = strlen(start)))
		return(0);

	end = &start[(int)sz - 1];
	if ('>' == *end || '"' == *end) {
		/* Nothing left between the delimiters, e.g. "<>". */
		if (end == start)
			return(0);
		end--;
	}

	assert(end >= start);
	putkeys(of, start, end - start + 1, TYPE_In);
	return(1);
}

static int
parse_mdoc_In(struct of *of, const struct mdoc_node *n)
{

	if (NULL == n->child || MDOC_TEXT != n->child->type)
		return(0);

	putkey(of, n->child->string, TYPE_In);
	return(1);
}

static int
parse_mdoc_Fn(struct of *of, const struct mdoc_node *n)
{
	const char	*cp;

	if (NULL == (n = n->child) || MDOC_TEXT != n->type)
		return(0);

	/* 
	 * Parse: .Fn "struct type *name" "char *arg".
	 * First store the type, if there is one.
	 * Then strip away the separating space and the pointer
	 * symbol and store the function name.
	 * Finally, store the arguments. 
	 */

	if (NULL == (cp = strrchr(n->string, ' ')))
		cp = n->string;

	if (n->string < cp)
		putkeys(of, n->string, cp - n->string, TYPE_Ft);

	while (' ' == *cp || '*' == *cp)
		cp++;

	putkey(of, cp, TYPE_Fn);

	for (n = n->next; NULL != n; n = n->next)
		if (MDOC_TEXT == n->type)
			putkey(of, n->string, TYPE_Fa);

	return(0);
}

static int
parse_mdoc_St(struct of *of, const struct mdoc_node *n)
{

	if (NULL == n->child || MDOC_TEXT != n->child->type)
		return(0);

	putkey(of, n->child->string, TYPE_St);
	return(1);
}

static int
parse_mdoc_Xr(struct of *of, const struct mdoc_node *n)
{

	if (NULL == (n = n->child) || MDOC_TEXT != n->type)
		return(0);

	putkey(of, n->string, TYPE_Xr);
	return(1);
}

static int
parse_mdoc_Nd(struct of *of, const struct mdoc_node *n)
{
	size_t		 sz;
	char		*sv, *desc;

	if (MDOC_BODY != n->type)
		return(0);

	/*
	 * Special-case the `Nd' because we need to put the description
	 * into the document table.
	 */

	desc = NULL;
	for (n = n->child; NULL != n; n = n->next) {
		if (MDOC_TEXT == n->type) {
			sz = strlen(n->string) + 1;
			if (NULL != (sv = desc))
				sz += strlen(desc) + 1;
			desc = mandoc_realloc(desc, sz);
			if (NULL != sv)
				strlcat(desc, " ", sz);
			else
				*desc = '\0';
			strlcat(desc, n->string, sz);
		}
		if (NULL != n->child)
			parse_mdoc_Nd(of, n);
	}

	of->desc = NULL != desc ? stradd(desc) : NULL;
	free(desc);
	return(1);
}

static int
parse_mdoc_Nm(struct of *of, const struct mdoc_node *n)
{

	if (SEC_NAME == n->sec)
		return(1);
	else if (SEC_SYNOPSIS != n->sec || MDOC_HEAD != n->type)
		return(0);

	return(1);
}

static int
parse_mdoc_Sh(struct of *of, const struct mdoc_node *n)
{

	return(SEC_CUSTOM == n->sec && MDOC_HEAD == n->type);
}

static int
parse_mdoc_head(struct of *of, const struct mdoc_node *n)
{

	return(MDOC_HEAD == n->type);
}

static int
parse_mdoc_body(struct of *of, const struct mdoc_node *n)
{

	return(MDOC_BODY == n->type);
}

/*
 * See straddbuf().
 */
static char *
stradd(const char *cp)
{

	return(straddbuf(cp, strlen(cp)));
}

/* 
 * See wordaddbuf().
 */
static void
wordadd(const struct of *of, const char *cp, uint64_t mask)
{

	if (0 == cp[0])
		return;
	wordaddbuf(of, cp, strlen(cp), mask);
}

/*
 * This looks up or adds a string to the string table.
 * The string table is a table of all strings encountered during parse
 * or file scan.
 * In using it, we avoid having thousands of (e.g.) "cat1" string
 * allocations for the "of" table.
 * We also have a layer atop the string table for keeping track of words
 * in a parse sequence (see wordaddbuf()).
 */
static char *
straddbuf(const char *cp, size_t sz)
{
	struct str	*s;

	HASH_FIND(hash_string, strings, cp, sz, s);
	if (NULL != s) 
		return(s->key);

	s = mandoc_calloc(1, sizeof(struct str));
	s->key = mandoc_malloc(sz + 1);
	memcpy(s->key, cp, sz);
	s->key[sz] = '\0';
	HASH_ADD_KEYPTR(hash_string, strings, s->key, sz, s);
	return(s->key);
}

/*
 * Add a word to the current parse sequence.
 * Within the hashtable of strings, we maintain a list of strings that
 * are currently indexed.
 * Each of these ("words") has a bitmask modified within the parse.
 * When we finish a parse, we'll dump the list, then remove the head
 * entry -- since the next parse will have a new "of", it can keep track
 * of its entries without conflict.
 */
static void
wordaddbuf(const struct of *of, 
		const char *cp, size_t sz, uint64_t v)
{
	struct str	*s;

	if (0 == sz)
		return;

	HASH_FIND(hash_string, strings, cp, sz, s);
	if (NULL != s && of == s->of) {
		s->mask |= v;
		return;
	} else if (NULL == s) {
		s = mandoc_calloc(1, sizeof(struct str));
		s->key = mandoc_malloc(sz + 1);
		memcpy(s->key, cp, sz);
		s->key[sz] = '\0';
		HASH_ADD_KEYPTR(hash_string, strings, s->key, sz, s);
	}

	s->next = words;
	s->of = of;
	s->mask = v;
	words = s;
}

/*
 * Take a Unicode codepoint and produce its UTF-8 encoding.
 * This isn't the best way to do this, but it works.
 * The magic numbers are from the UTF-8 packaging.
 * They're not as scary as they seem: read the UTF-8 spec for details.
 */
static size_t
utf8(unsigned int cp, char out[7])
{
	size_t		 rc;

	rc = 0;
	if (cp <= 0x0000007F) {
		rc = 1;
		out[0] = (char)cp;
	} else if (cp <= 0x000007FF) {
		rc = 2;
		out[0] = (cp >> 6  & 31) | 192;
		out[1] = (cp       & 63) | 128;
	} else if (cp <= 0x0000FFFF) {
		rc = 3;
		out[0] = (cp >> 12 & 15) | 224;
		out[1] = (cp >> 6  & 63) | 128;
		out[2] = (cp       & 63) | 128;
	} else if (cp <= 0x001FFFFF) {
		rc = 4;
		out[0] = (cp >> 18 & 7) | 240;
		out[1] = (cp >> 12 & 63) | 128;
		out[2] = (cp >> 6  & 63) | 128;
		out[3] = (cp       & 63) | 128;
	} else if (cp <= 0x03FFFFFF) {
		rc = 5;
		out[0] = (cp >> 24 & 3) | 248;
		out[1] = (cp >> 18 & 63) | 128;
		out[2] = (cp >> 12 & 63) | 128;
		out[3] = (cp >> 6  & 63) | 128;
		out[4] = (cp       & 63) | 128;
	} else if (cp <= 0x7FFFFFFF) {
		rc = 6;
		out[0] = (cp >> 30 & 1) | 252;
		out[1] = (cp >> 24 & 63) | 128;
		out[2] = (cp >> 18 & 63) | 128;
		out[3] = (cp >> 12 & 63) | 128;
		out[4] = (cp >> 6  & 63) | 128;
		out[5] = (cp       & 63) | 128;
	} else
		return(0);

	out[rc] = '\0';
	return(rc);
}

/*
 * Store the UTF-8 version of a key, or alias the pointer if the key has
 * no UTF-8 transcription marks in it.
 */
static void
utf8key(struct mchars *mc, struct str *key)
{
	size_t		 sz, bsz, pos;
	char		 utfbuf[7];
	char		*buf;
	const char	*seq, *cpp, *val;
	int		 len, u;
	enum mandoc_esc	 esc;
	char 	 	 res[5];

	res[0] = '\\';
	res[1] = '\t';
	res[2] = ASCII_NBRSP;
	res[3] = ASCII_HYPH;
	res[4] = '\0';

	val = key->key;
	bsz = strlen(val);

	/*
	 * Pre-check: if we have no stop-characters, then alias the
	 * pointer to the key itself and get out of here.
	 */

	if (strcspn(val, res) == bsz) {
		key->utf8 = key->key;
		return;
	} 

	/* Pre-allocate by the length of the input */

	buf = mandoc_malloc(++bsz);
	pos = 0;

	while ('\0' != *val) {
		/*
		 * Halt on the first escape sequence.
		 * This also halts on the end of string, in which case
		 * we just copy, fallthrough, and exit the loop.
		 */
		if ((sz = strcspn(val, res)) > 0) {
			memcpy(&buf[pos], val, sz);
			pos += sz;
			val += sz;
		}

		if (ASCII_HYPH == *val) {
			buf[pos++] = '-';
			val++;
			continue;
		} else if ('\t' == *val || ASCII_NBRSP == *val) {
			buf[pos++] = ' ';
			val++;
			continue;
		} else if ('\\' != *val)
			break;

		/* Read past the slash. */

		val++;
		u = 0;

		/*
		 * Parse the escape sequence and see if it's a
		 * predefined character or special character.
		 */

		esc = mandoc_escape
			((const char **)&val, &seq, &len);
		if (ESCAPE_ERROR == esc)
			break;

		if (ESCAPE_SPECIAL != esc)
			continue;
		if (0 == (u = mchars_spec2cp(mc, seq, len)))
			continue;

		/*
		 * If we have a Unicode codepoint, try to convert that
		 * to a UTF-8 byte string.
		 */

		cpp = utfbuf;
		if (0 == (sz = utf8(u, utfbuf)))
			continue;

		/* Copy the rendered glyph into the stream. */

		sz = strlen(cpp);
		bsz += sz;

		buf = mandoc_realloc(buf, bsz);

		memcpy(&buf[pos], cpp, sz);
		pos += sz;
	}

	buf[pos] = '\0';
	key->utf8 = buf;
}

static void
dbindex(struct mchars *mc, const struct of *of, const char *base)
{
	struct str	*key;
	int64_t		 recno;

	if (nodb)
		return;

	sqlite3_bind_text
		(stmts[STMT_INSERT_DOC], 1, 
		 of->file, -1, SQLITE_STATIC);
	sqlite3_bind_text
		(stmts[STMT_INSERT_DOC], 2, 
		 of->sec, -1, SQLITE_STATIC);
	sqlite3_bind_text
		(stmts[STMT_INSERT_DOC], 3, 
		 of->arch, -1, SQLITE_STATIC);
	sqlite3_bind_text
		(stmts[STMT_INSERT_DOC], 4, 
		 NULL != of->desc ? of->desc : "", 
		 -1, SQLITE_STATIC);
	sqlite3_step(stmts[STMT_INSERT_DOC]);
	DEBUG(of->file, base, "Added to index");
	recno = sqlite3_last_insert_rowid(db);
	sqlite3_reset(stmts[STMT_INSERT_DOC]);

	for (key = words; NULL != key; key = key->next) {
		assert(key->of == of);
		if (NULL == key->utf8)
			utf8key(mc, key);
		sqlite3_bind_int64
			(stmts[STMT_INSERT_KEY], 1, key->mask);
		sqlite3_bind_text
			(stmts[STMT_INSERT_KEY], 2, 
			 key->utf8, -1, SQLITE_STATIC);
		sqlite3_bind_int64
			(stmts[STMT_INSERT_KEY], 3, recno);
		sqlite3_step(stmts[STMT_INSERT_KEY]);
		sqlite3_reset(stmts[STMT_INSERT_KEY]);
	}
}

static void
dbprune(const char *base)
{
	struct of	*of;

	if (nodb)
		return;

	for (of = ofs; NULL != of; of = of->next) {
		sqlite3_bind_text
			(stmts[STMT_DELETE], 1, 
			 of->file, -1, SQLITE_STATIC);
		sqlite3_step(stmts[STMT_DELETE]);
		sqlite3_reset(stmts[STMT_DELETE]);
		DEBUG(of->file, base, "Deleted from index");
	}
}

/*
 * Close an existing database and its prepared statements.
 * If "real" is not set, rename the temporary file into the real one.
 */
static void
dbclose(const char *base, int real)
{
	size_t		 i;
	char		 file[MAXPATHLEN];

	if (nodb)
		return;

	for (i = 0; i < STMT__MAX; i++) {
		sqlite3_finalize(stmts[i]);
		stmts[i] = NULL;
	}

	sqlite3_close(db);
	db = NULL;

	if (real)
		return;

	strlcpy(file, MANDOC_DB, MAXPATHLEN);
	strlcat(file, "~", MAXPATHLEN);
	if (-1 == rename(file, MANDOC_DB))
		perror(MANDOC_DB);
}

/*
 * This is straightforward stuff.
 * Open a database connection to a "temporary" database, then open a set
 * of prepared statements we'll use over and over again.
 * If "real" is set, we use the existing database; if not, we truncate a
 * temporary one.
 * Must be matched by dbclose().
 */
static int
dbopen(const char *base, int real)
{
	char		 file[MAXPATHLEN];
	const char	*sql;
	int		 rc, ofl;
	size_t		 sz;

	if (nodb) 
		return(1);

	sz = strlcpy(file, MANDOC_DB, MAXPATHLEN);
	if ( ! real)
		sz = strlcat(file, "~", MAXPATHLEN);

	if (sz >= MAXPATHLEN) {
		fprintf(stderr, "%s: Path too long\n", file);
		return(0);
	}

	if ( ! real)
		remove(file);

	ofl = SQLITE_OPEN_PRIVATECACHE | SQLITE_OPEN_READWRITE;
	rc = sqlite3_open_v2(file, &db, ofl, NULL);
	if (SQLITE_OK == rc) 
		goto prepare_statements;
	if (SQLITE_CANTOPEN != rc) {
		fprintf(stderr, "%s: %s\n", file, sqlite3_errmsg(db));
		return(0);
	}

	sqlite3_close(db);
	db = NULL;

	if (SQLITE_OK != (rc = sqlite3_open(file, &db))) {
		fprintf(stderr, "%s: %s\n", file, sqlite3_errmsg(db));
		return(0);
	}

	sql = "PRAGMA journal_mode=off;\n"
	      "PRAGMA encoding=\"UTF-8\";\n"
	      "\n"
	      "CREATE TABLE \"docs\" (\n"
	      " \"file\" TEXT NOT NULL,\n"
	      " \"sec\" TEXT NOT NULL,\n"
	      " \"arch\" TEXT NOT NULL,\n"
	      " \"desc\" TEXT NOT NULL,\n"
	      " \"id\" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL\n"
	      ");\n"
	      "\n"
	      "CREATE TABLE \"keys\" (\n"
	      " \"bits\" INTEGER NOT NULL,\n"
	      " \"key\" TEXT NOT NULL,\n"
	      " \"docid\" INTEGER NOT NULL REFERENCES docs(id) ON DELETE CASCADE,\n"
	      " \"id\" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL\n"
	      ");\n";

	if (SQLITE_OK != sqlite3_exec(db, sql, NULL, NULL, NULL)) {
		fprintf(stderr, "%s: %s\n", file, sqlite3_errmsg(db));
		return(0);
	}

prepare_statements:
	/*
	 * SQLite does not enforce foreign keys unless asked to on
	 * each connection; without this, ON DELETE CASCADE is inert
	 * and dbprune() leaves orphaned rows in "keys".
	 */
	sqlite3_exec(db, "PRAGMA foreign_keys=ON", NULL, NULL, NULL);

	sql = "DELETE FROM docs where file=?";
	sqlite3_prepare_v2(db, sql, -1, &stmts[STMT_DELETE], NULL);
	sql = "INSERT INTO docs (file,sec,arch,desc) VALUES (?,?,?,?)";
	sqlite3_prepare_v2(db, sql, -1, &stmts[STMT_INSERT_DOC], NULL);
	sql = "INSERT INTO keys (bits,key,docid) VALUES (?,?,?)";
	sqlite3_prepare_v2(db, sql, -1, &stmts[STMT_INSERT_KEY], NULL);
	return(1);
}

