From mboxrd@z Thu Jan 1 00:00:00 1970 From: arisawa Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: quoted-printable Message-Id: <5E38D9FC-75B6-40C9-AB64-E210DACE0B4E@ar.aichi-u.ac.jp> Date: Tue, 6 Aug 2013 10:14:36 +0900 To: 9fans@9fans.net Mime-Version: 1.0 (Mac OS X Mail 6.3 \(1503\)) Subject: [9fans] text database Kirara Topicbox-Message-UUID: 6f42e866-ead8-11e9-9d60-3106f5b1d025 Hello 9fans, I have written a text database named Kirara. The following is a brief introduction to Kirara. If you are interested in, get Kirara from: http://plan9.aichi-u.ac.jp/netlib/kirara/ Kenji Arisawa ------------- Kirara ------------- Kirara is a text indexing/retrieval tool for Plan 9. Personal use: index/retrieve local files. Kirara is based on the idea similar to Glimpse. (1) indexing + grep (2) multi-level indexing (a) small space for indexing (b) small update time (c) quick search Note that: small indexing <-> quick search Kirara makes more index -> quick search Glimpse is single-level indexing. ------------- Query Kirara does not support phrase search. The database is index of words, supporting: QE mode (query expression mode) '&', '|', '*' The example: 'snoopy&html' 'snoop*&htm*' RE mode (regular expression mode) '&', RE where RE denotes regular expression. The example: 'sn.*y&h.+l' RE mode is a bit slow. (a few second.) ------------- Words Two or more runes. All words are converted to lower case. In English, words is composed of alphabets. The number of runes is configurable Assumption: Text is composed of space-separated words popular in English and many European Languages, but not in Japanese. ------------- The user's interface Best match with Rio term% kfind snoop G snoop /sys/src/9/ip/ G snoop /sys/src/cmd/spell/ G snoop /sys/src/9/kw/ ... term% G snoop /sys/src/9/ip devip.c:34: Qsnoop, devip.c:95: case Qsnoop: devip.c:98: devdir(c, q, "snoop", qlen(cv->sq), = cv->owner, 0400, dp); ... Note that: two steps 1. find directories 2. find files and the contents Step 2 is actually 'grep'. we can use RE. Two-steps search is not a weekness, but a desirable feature. Because we have so many files that are hit by the query. ------------- The organization My example /n/other/kirara/sysdb target: (/lib /sys/lib /sys/src /sys/man /sys/include /sys/doc /rc) /n/other/kirara/usrdb target: $home/^(bin/rc lib netlib doc adm issues srclib src sources) Indexing target is fully configurable. ------------- Multi-Level Indexing (1) Indexing (top level) word to directory mapping sysdb/index # main index = # used for RE mode sysdb/mindex # meta index (alphabetic index) # used for QE mode sysdb/dind/* # rough index of each directory sysdb/QTDir # map table (QID, mtime, path-to-dir) index # word to dir QID aa 0000000000014f0a aa 000000000001a1e0 aa 000000000001a26e mindex # word to range in index aa 0 126669 ab 126669 491569 ac 491569 1258566 ad 1258566 1852467 ... dind/* # `*' is a directory QID 0000000000014f05 0000000000014f0a 000000000001a1ce usrdb is same. (2) Indexing (directory level) # optional word to file mapping sysdb/find/*/ind.gz # fine index of the directory (gzipped) sysdb/find/*/qtn # map table (QID, mtime, name) where `*' is a directory QID usrdb is same as sysdb. ------------- Experiment (a) hardware GA-H61M-USB3-B3 Intel Pentium G860 (3GHz) DDR3 PC3 4GB (b) software 9front cwfs64x ------------- The performance (compression ratio) target target num_of_dirs indexing sysdb: 556 MB 1790 dirs 49 MB usrdb: 6620 MB 8948 dirs 150 MB compression ratio: 49/556 (sysdb) note: usrdb includes many non-text file. ------------- The performance (retrieval time) system dependent RQ search # kfind foo 0.1 seconds. It is not important to make this time smaller. (sufficiently small) RE search # kfind -r foo a few seconds ------------- The performance (construction/update) (a) Construction time system dependent Initial construction need 10 minutes for sysdb 30 minutes for usrdb (b) Updating time two commands for update mkdb 20 seconds to a few minutes for usrdb depends largely on state of cache mkdb1 (currently only for usrdb) 5 to 15 seconds for usrdb mkdb1 needs event log ------------- Scalability Main factors (a) retrieval time QE search: proportional to number of dirs that include the query RE search: proportional to size of index (b) initial construction time proportional to total data (c) update time mkdb: proportional to number of dirs and the changes mkdb1: proportional to changes and size of index ------------- Used Tools (1) rc (2) grep, sed, awk, sort, diff, gzip, ... (3) some new tools written in C ------------- What Kirara means? Kirara is name of a girl that appeared in a Japanese comic book. (But I have never read the book.) The name is seldom used in real world. =46rom the name we Japanese imagine something glittering. I like the name. ------------- References [1] GLIMPSE: A Tool to Search Through Entire File Systems Udi Manber and Sun Wu (1993) http://webglimpse.net/pubs/glimpse.pdf [2] Glimpse Documentation http://webglimpse.net/gdocs/glimpsehelp.html