Hi,

During AsiaBSDCon, I had the opportunity to take a more serious look
at mandocdb(8). As the code was rather complex, I opted to start over
rather than whittling down. The results are in the enclosed file,
summarised as follows:

(0) Overall code cleanliness. mandocdb.c gained a lot of features real
fast. This re-write let me integrate those systematically.

(1) Aggressive hashing of strings. All strings (filename components,
file suffixes, parsed words, and so on) are hashed using uthash.
Parsed manpage terms overlay the string hash, so after a few files,
there are very few allocations at all. This brings us a huge
performance improvement: much of the last version's run time, when
profiled with valgrind, went to allocating and twiddling strings.

(2) Use of fts(3) instead of ad hoc file walking. This makes the code
much cleaner and neater. It also improves performance, because
examining the file path is much easier by looking at the hierarchy
level. Again, less string twiddling.

(3) De-duping/winnowing at the file-scan phase. I de-duplicate files
by hashing inode/device and tossing dupes. I also throw out
non-conforming suffixes (if !use_all) early on, making the end list of
files to parse much smaller. I'm much more picky about what's
considered "mandoc source" in this version because mandoc(1) lets
pretty much anything be parsed, defaulting to -man, which led to lots
of noise. Now I require the right suffix or directory parts before
using mandoc(3).

(4) Using SQLite instead of Berkeley DB. Ok, this is the most
controversial. After talking with some OpenBSD and NetBSD folks,
nobody could find anything against using SQLite. NetBSD already has it
in base, and apparently OpenBSD is moving in the same direction. Not
to worry, though: it's really easy to plug in another database, since
the database functions (open/close/index/prune) completely contain the
database routines. Open/close are run for each manpath, index is run
for each page, and prune for each page's removal.
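To illustrate how a cascading delete can keep the prune step down to a single statement, here is a minimal sketch. It is not the schema from the enclosed file: the table names (docs, keys), their columns, and the function names (open_db, index_page, prune_page, count_keys) are all hypothetical. It uses Python's sqlite3 module only for brevity, whereas the real code is C.

```python
import sqlite3

# Hypothetical two-table layout: one row per manual page in "docs",
# and many keyword rows per page in "keys".
SCHEMA = """
CREATE TABLE docs (
    id   INTEGER PRIMARY KEY,
    file TEXT NOT NULL UNIQUE
);
CREATE TABLE keys (
    docid INTEGER NOT NULL REFERENCES docs(id) ON DELETE CASCADE,
    key   TEXT NOT NULL
);
"""

def open_db(path=":memory:"):
    """Open a database and create the (assumed) schema."""
    db = sqlite3.connect(path)
    # SQLite enforces foreign keys, and so cascades, only when this is on.
    db.execute("PRAGMA foreign_keys = ON")
    db.executescript(SCHEMA)
    return db

def index_page(db, file, words):
    """Record one page and its keywords."""
    cur = db.execute("INSERT INTO docs (file) VALUES (?)", (file,))
    db.executemany(
        "INSERT INTO keys (docid, key) VALUES (?, ?)",
        [(cur.lastrowid, w) for w in words],
    )
    db.commit()

def prune_page(db, file):
    """Remove one page; the cascade removes its rows in keys."""
    db.execute("DELETE FROM docs WHERE file = ?", (file,))
    db.commit()

def count_keys(db):
    """Return the total number of keyword rows."""
    return db.execute("SELECT COUNT(*) FROM keys").fetchone()[0]
```

With this layout, pruning a page never has to touch the keyword table explicitly; the foreign-key constraint does it. Note that the PRAGMA has to be issued on every connection, since SQLite leaves foreign-key enforcement off by default.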
Check out that DELETE CASCADE. So easy!

(5) Input encoding cleanup. The last mandocdb was a little fuzzy on
encodings. This time around, I store UTF-8 encoded strings directly.
Due to the hashing method, I only compute the UTF-8 string (which
isn't all that expensive) once during the full parse lifetime! This
also makes apropos_db's job MUCH easier.

I cherry-picked schwarze@'s fine work with the last mandocdb.c to
retain its behaviour regarding path sanitising. There might be some
omissions, but I think I have them all.

Some behaviour changes and possibilities:

(1) I'll likely kick out searching by regexp in favour of globbing,
which is better handled natively in SQLite, but we'll see; it's just a
matter of search performance (SQLite supports regexp with matches, but
it's not optimal).

(2) Obviously, we now only have one database file with two tables.
mandocdb(8) writes into a temporary file, then rename(2)s it into the
real one (unless run with -u or -d). This is much neater and more
readable.

(3) Language and encoding. I'd like to smartify the directory parse to
recognise a language (e.g., ru/man1/amd64) alongside the rest. This
way, folks can use apropos to search for native-language manuals using
the UTF-8 methods.

(4) Full text search. This will only be a few lines of code, as the
heavy lifting of word hashing is all in place. I spoke with Jorg and
Abhinav (NetBSD GSoC folks) about having a "natural-language" CGI in
mdocml.bsd.lv. I think it'd be awesome and a good pre-filter for, say,
common misc@ questions ("how do I configure my bridge?").

Before committing anything, I'll transcribe apropos_db.c as well, then
use it for a while "in production". My plan is to make an OpenBSD
package out of mdocml's "apropos tools" that installs alternatives to
the regular apropos and friends. This way I can have fun and find bugs
without displacing the prior tools.

Thoughts?

Kristaps