Hi,

During AsiaBSDCon, I had the opportunity to take a more serious look at 
mandocdb(8).  As the code was rather complex, I opted to start over 
rather than whittling down.  The results are in the enclosed file, 
summarised as follows:

  (0) Overall code cleanliness.  mandocdb.c gained a lot of features 
real fast.  This re-write let me integrate those systematically.

  (1) Aggressive hashing of strings.
      All strings -- filename components, file suffixes, parsed words, 
and so on -- are hashed (using uthash).  Parsed manpage terms overlay 
the string hash, so after a few files, there are very few allocations at 
all.  This brings us a huge performance improvement: when profiled with 
valgrind, the last version spent a lot of its time allocating and 
twiddling strings.
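
To illustrate the idea (this is not the code in the enclosed file, which 
uses uthash), here is a minimal string-interning sketch with a 
hand-rolled chained hash table.  The names intern(), struct str and 
NBUCKETS are made up for the example.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 1024	/* hypothetical fixed table size */

struct str {
	struct str	*next;	/* next entry in the same bucket */
	char		 key[];	/* the interned string itself */
};

static struct str *buckets[NBUCKETS];

/* 32-bit FNV-1a, standing in for uthash's own hash functions. */
static size_t
hash(const char *s)
{
	uint32_t h = 2166136261u;

	while (*s != '\0') {
		h ^= (unsigned char)*s++;
		h *= 16777619u;
	}
	return h % NBUCKETS;
}

/*
 * Return the single shared copy of "s", allocating only the first
 * time the string is seen.  Returns NULL if malloc(3) fails.
 */
const char *
intern(const char *s)
{
	struct str **head = &buckets[hash(s)];
	struct str *p;
	size_t len = strlen(s) + 1;

	for (p = *head; p != NULL; p = p->next)
		if (strcmp(p->key, s) == 0)
			return p->key;	/* seen before: no allocation */

	if ((p = malloc(sizeof(*p) + len)) == NULL)
		return NULL;
	memcpy(p->key, s, len);
	p->next = *head;
	*head = p;
	return p->key;
}
```

Once a word has been interned, later occurrences cost one hash and one 
strcmp(3), and equal strings can be compared by pointer.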

  (2) Use of fts(3) instead of ad hoc file walking.
      This makes the code much cleaner and neater.  This also improved 
performance because examining the file path is much easier by looking at 
the hierarchy level.  Again, less string twiddling.
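
As a rough sketch of what the fts(3) walk looks like (again, not the 
enclosed code; walk() is a name invented here), the depth of each file 
comes for free in fts_level, so there is no need to count slashes:

```c
#include <sys/types.h>
#include <sys/stat.h>
#include <fts.h>
#include <stdio.h>

/*
 * Walk the tree rooted at "root" and print each regular file with its
 * depth.  Returns the number of regular files seen, or -1 on error.
 */
int
walk(const char *root)
{
	char *argv[2];
	FTS *f;
	FTSENT *ff;
	int n = 0;

	argv[0] = (char *)root;	/* fts_open() takes char * const * */
	argv[1] = NULL;

	f = fts_open(argv, FTS_PHYSICAL | FTS_NOCHDIR, NULL);
	if (f == NULL)
		return -1;

	while ((ff = fts_read(f)) != NULL) {
		if (ff->fts_info != FTS_F)
			continue;	/* skip directories, symlinks, errors */
		/*
		 * The root itself is level 0.  With a manpath as root,
		 * level 1 is a file dropped in the manpath itself,
		 * level 2 something like man1/ls.1, and level 3
		 * something like man1/amd64/foo.1.
		 */
		printf("%d %s\n", (int)ff->fts_level, ff->fts_path);
		n++;
	}
	fts_close(f);
	return n;
}
```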

  (3) De-duping/winnowing at the file-scan phase.
      I de-duplicate files by hashing inode/device and tossing dupes.  I 
also throw out non-conforming suffixes (if !use_all) early on, making 
the end list of files to parse much smaller.
      I'm much more picky about what's considered "mandoc source" in 
this version because mandoc(1) lets pretty much anything be parsed, 
defaulting to -man, which led to lots of noise.  Now I require the 
right suffix or directory parts before using mandoc(3).
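
The inode/device check can be sketched like this.  It is a 
simplification under my own assumptions: seen_before() is a made-up 
name, and it keeps a linear list where the real code hashes.

```c
#include <sys/types.h>
#include <sys/stat.h>
#include <stdlib.h>

struct seen {
	struct seen	*next;
	dev_t		 dev;
	ino_t		 ino;
};

static struct seen *seenq;	/* linear list for brevity, not a hash */

/*
 * Returns 1 if "path" is the same underlying file as one already
 * recorded, 0 if it is new (and records it), -1 on error.
 */
int
seen_before(const char *path)
{
	struct stat st;
	struct seen *p;

	if (stat(path, &st) == -1)
		return -1;

	/* Two paths name the same file iff device and inode both match. */
	for (p = seenq; p != NULL; p = p->next)
		if (p->dev == st.st_dev && p->ino == st.st_ino)
			return 1;

	if ((p = malloc(sizeof(*p))) == NULL)
		return -1;
	p->dev = st.st_dev;
	p->ino = st.st_ino;
	p->next = seenq;
	seenq = p;
	return 0;
}
```

Hard links and symlinked copies of the same page collapse to one entry 
this way, whatever their names are.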

  (4) Using SQLite instead of Berkeley DB.
      Ok, this is the most controversial.  After talking with some 
OpenBSD and NetBSD folks, nobody could find anything against using 
SQLite.  NetBSD already has it in base, and apparently OpenBSD is moving 
in the same direction.
      Not to worry: it's really easy to plug in another database, since 
the database functions (open/close/index/prune) completely contain the 
database routines.  Open/close are run for each manpath, index is run 
for each page, and prune for each page's removal.  Check out that ON 
DELETE CASCADE.  So easy!
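
To show what "completely contain" means in practice, here is a 
hypothetical sketch of the four entry points as a table of function 
pointers, with a do-nothing backend plugged in.  The struct dbops 
layout, update() and the null_* functions are invented for this 
example; the enclosed file does not necessarily look like this.

```c
#include <stddef.h>

/* Hypothetical backend interface; the names are made up for this sketch. */
struct dbops {
	int	(*open)(const char *manpath);	/* once per manpath */
	void	(*close)(void);			/* once per manpath */
	int	(*index)(const char *page);	/* once per parsed page */
	int	(*prune)(const char *page);	/* once per removed page */
};

/* A do-nothing backend that only counts calls. */
static int nopen, nclose, nindex, nprune;

static int  null_open(const char *m)  { (void)m; nopen++; return 1; }
static void null_close(void)          { nclose++; }
static int  null_index(const char *p) { (void)p; nindex++; return 1; }
static int  null_prune(const char *p) { (void)p; nprune++; return 1; }

const struct dbops null_db = { null_open, null_close, null_index, null_prune };

/*
 * Update one manpath: drop the stale pages, then index the new ones.
 * Both arrays are NULL-terminated.  Returns 1 on success, 0 on failure.
 */
int
update(const struct dbops *db, const char *manpath,
    const char *const *stale, const char *const *fresh)
{
	int ok = 1;

	if (!db->open(manpath))
		return 0;
	for (; ok && *stale != NULL; stale++)
		ok = db->prune(*stale);
	for (; ok && *fresh != NULL; fresh++)
		ok = db->index(*fresh);
	db->close();
	return ok;
}
```

Swapping SQLite for something else would then mean writing four 
functions and nothing more.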

  (5) Input encoding cleanup.
      The last mandocdb was a little fuzzy on encodings.  This time 
around, I store UTF-8 encoded strings directly.  Due to the hashing 
method, I only compute the UTF-8 string (which isn't all that expensive) 
once during the full parse lifetime!  This also makes apropos_db's job 
MUCH easier.
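
For reference, the encoding step itself is just bit-shuffling.  This is 
a generic UTF-8 encoder written for this mail, not lifted from the 
enclosed file; the function name utf8() is an assumption.

```c
#include <stddef.h>

/*
 * Encode the Unicode code point "cp" as UTF-8 into "buf", which must
 * hold at least 5 bytes.  Returns the number of bytes written (not
 * counting the terminating NUL), or 0 if "cp" is not encodable.
 */
size_t
utf8(unsigned int cp, char *buf)
{
	size_t n = 0;

	if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
		return 0;	/* out of range or a surrogate */

	if (cp < 0x80) {
		buf[n++] = (char)cp;
	} else if (cp < 0x800) {
		buf[n++] = (char)(0xC0 | (cp >> 6));
		buf[n++] = (char)(0x80 | (cp & 0x3F));
	} else if (cp < 0x10000) {
		buf[n++] = (char)(0xE0 | (cp >> 12));
		buf[n++] = (char)(0x80 | ((cp >> 6) & 0x3F));
		buf[n++] = (char)(0x80 | (cp & 0x3F));
	} else {
		buf[n++] = (char)(0xF0 | (cp >> 18));
		buf[n++] = (char)(0x80 | ((cp >> 12) & 0x3F));
		buf[n++] = (char)(0x80 | ((cp >> 6) & 0x3F));
		buf[n++] = (char)(0x80 | (cp & 0x3F));
	}
	buf[n] = '\0';
	return n;
}
```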

I cherry-picked schwarze@'s fine work with the last mandocdb.c to retain 
its behaviour regarding path sanitising.  There might be some omissions, 
but I think I have them all.

Some behaviour changes and possibilities:

  (1) I'll likely kick out searching by regexp in favour of globbing, 
which is better handled natively in SQLite, but we'll see---it's just a 
matter of search performance (SQLite supports regexp with matches, but 
it's not optimal).

  (2) Obviously, we now only have one database file with two tables. 
mandocdb(8) writes into a temporary file then rename(2)s into the real 
one (unless run with -u or -d).  This is much neater and more readable.
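
The write-then-rename pattern, reduced to its bones, looks something 
like the following.  This is only a sketch: replace_file() is a made-up 
name, the caller supplies both file names, and the real thing writes 
through SQLite rather than stdio.

```c
#include <stdio.h>
#include <unistd.h>

/*
 * Write "data" to the file "tmp", then atomically move it over "real".
 * On any failure the temporary file is removed and "real" is left
 * untouched.  Returns 1 on success, 0 on failure.
 */
int
replace_file(const char *tmp, const char *real, const char *data)
{
	FILE *f;

	if ((f = fopen(tmp, "w")) == NULL)
		return 0;
	if (fputs(data, f) == EOF) {
		fclose(f);
		unlink(tmp);
		return 0;
	}
	if (fclose(f) == EOF) {
		unlink(tmp);
		return 0;
	}
	/* rename(2) within one filesystem replaces "real" in one step. */
	if (rename(tmp, real) == -1) {
		unlink(tmp);
		return 0;
	}
	return 1;
}
```

Readers such as apropos see either the old database or the new one, 
never a half-written file.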

  (3) Language and encoding.  I'd like to smartify the directory parse 
to recognise a language (e.g., ru/man1/amd64) alongside the rest.  This 
way, folks can use apropos to search for native-language manuals using 
the UTF-8 methods.

  (4) Full text search.  This will only be a few lines of code as the 
heavy lifting of word hashing is all in place.  I spoke with Jorg and 
Abhinav (NetBSD GSoC folks) about having a "natural-language" CGI in 
mdocml.bsd.lv.  I think it'd be awesome and a good pre-filter for, say, 
recurring misc@ questions ("how do I configure my bridge?").

Before committing anything, I'll transcribe apropos_db.c as well, then 
use it for a while "in production".  My plan is to make an OpenBSD 
package out of mdocml's "apropos tools" that install alternatives to the 
regular apropos and friends.  This way I can have fun and find bugs 
without displacing the prior tools.

Thoughts?

Kristaps