From mboxrd@z Thu Jan 1 00:00:00 1970
References: <5E38D9FC-75B6-40C9-AB64-E210DACE0B4E@ar.aichi-u.ac.jp>
In-Reply-To: <5E38D9FC-75B6-40C9-AB64-E210DACE0B4E@ar.aichi-u.ac.jp>
Mime-Version: 1.0 (1.0)
Content-Type: text/plain; charset=us-ascii
Message-Id: <60B43026-4031-45BC-911C-57F00ADD16D3@lsub.org>
Content-Transfer-Encoding: quoted-printable
From: Francisco J Ballesteros
Date: Tue, 6 Aug 2013 09:26:58 +0200
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
Subject: Re: [9fans] text database Kirara
Topicbox-Message-UUID: 6f48c6fa-ead8-11e9-9d60-3106f5b1d025

nice.
btw I think I have another tag tool at contrib. not sure, but it was called tags, and the search tool was called F.
I say this because it used file and per file type tag listing (ms2txt, etc) to index other file types. It might be an idea to add to your nice tool.

On Aug 6, 2013, at 3:14 AM, arisawa wrote:

> Hello 9fans,
>
> I have written a text database named Kirara.
> The following is a brief introduction to Kirara.
> If you are interested, get Kirara from:
> http://plan9.aichi-u.ac.jp/netlib/kirara/
>
> Kenji Arisawa
>
> -------------
>
> Kirara
>
> -------------
>
> Kirara is a text indexing/retrieval tool for Plan 9.
>
> Personal use: index/retrieve local files.
>
> Kirara is based on an idea similar to Glimpse.
>
> (1) indexing + grep
> (2) multi-level indexing
>
> (a) small space for indexing
> (b) small update time
> (c) quick search
>
> Note that:
> small indexing <-> quick search
> Kirara makes more index -> quick search
> Glimpse is single-level indexing.
>
> -------------
>
> Query
>
> Kirara does not support phrase search.
> The database is an index of words,
>
> supporting:
> QE mode (query expression mode)
>     '&', '|', '*'
>     The example:
>     'snoopy&html'
>     'snoop*&htm*'
>
> RE mode (regular expression mode)
>     '&', RE
>     where RE denotes regular expression.
>     The example:
>     'sn.*y&h.+l'
>
> RE mode is a bit slow.
> (a few seconds.)
>
> -------------
>
> Words
>
> Two or more runes.
> All words are converted to lower case.
> In English, words are composed of alphabetic characters.
> The number of runes is configurable.
>
> Assumption:
> Text is composed of space-separated words,
> popular in English and many European languages,
> but not in Japanese.
>
> -------------
>
> The user's interface
>
> Best match with Rio
>     term% kfind snoop
>     G snoop /sys/src/9/ip/
>     G snoop /sys/src/cmd/spell/
>     G snoop /sys/src/9/kw/
>     ...
>     term% G snoop /sys/src/9/ip
>     devip.c:34: Qsnoop,
>     devip.c:95: case Qsnoop:
>     devip.c:98: devdir(c, q, "snoop", qlen(cv->sq), cv->owner, 0400, dp);
>     ...
> Note that: two steps
>     1. find directories
>     2. find files and the contents
> Step 2 is actually 'grep', so we can use RE.
>
> Two-step search is not a weakness but a desirable feature,
> because so many files are hit by the query.
>
> -------------
>
> The organization
>
> My example
>
> /n/other/kirara/sysdb
>     target: (/lib /sys/lib /sys/src /sys/man /sys/include /sys/doc /rc)
>
> /n/other/kirara/usrdb
>     target: $home/^(bin/rc lib netlib doc adm issues srclib src sources)
>
> Indexing target is fully configurable.
>
> -------------
>
> Multi-Level Indexing
>
> (1) Indexing (top level)
> word to directory mapping
>
>     sysdb/index   # main index                    # used for RE mode
>     sysdb/mindex  # meta index (alphabetic index) # used for QE mode
>     sysdb/dind/*  # rough index of each directory
>     sysdb/QTDir   # map table (QID, mtime, path-to-dir)
>
> index # word to dir QID
>     aa 0000000000014f0a
>     aa 000000000001a1e0
>     aa 000000000001a26e
>
> mindex # word to range in index
>     aa 0 126669
>     ab 126669 491569
>     ac 491569 1258566
>     ad 1258566 1852467
>     ...
> dind/* # `*' is a directory QID
>     0000000000014f05
>     0000000000014f0a
>     000000000001a1ce
>
> usrdb is the same.
>
> (2) Indexing (directory level) # optional
> word to file mapping
>
>     sysdb/find/*/ind.gz  # fine index of the directory (gzipped)
>     sysdb/find/*/qtn     # map table (QID, mtime, name)
>     where `*' is a directory QID
>
> usrdb is the same as sysdb.
>
> -------------
>
> Experiment
>
> (a) hardware
>     GA-H61M-USB3-B3
>     Intel Pentium G860 (3GHz)
>     DDR3 PC3 4GB
>
> (b) software
>     9front
>     cwfs64x
>
> -------------
>
> The performance (compression ratio)
>
>     target   target size   num_of_dirs   index size
>     sysdb:   556 MB        1790 dirs     49 MB
>     usrdb:   6620 MB       8948 dirs     150 MB
>
> compression ratio: 49/556, about 9% (sysdb)
> note: usrdb includes many non-text files.
>
> -------------
>
> The performance (retrieval time)
> system dependent
>
> QE search # kfind foo
>
>     0.1 seconds.
>
>     It is not important to make this time smaller.
>     (sufficiently small)
>
> RE search # kfind -r foo
>
>     a few seconds
>
>
> -------------
>
> The performance (construction/update)
>
> (a) Construction time
> system dependent
>
>     Initial construction needs
>     10 minutes for sysdb
>     30 minutes for usrdb
>
> (b) Updating time
> two commands for update
>
>     mkdb
>     20 seconds to a few minutes for usrdb
>     depends largely on state of cache
>
>     mkdb1 (currently only for usrdb)
>     5 to 15 seconds for usrdb
>     mkdb1 needs event log
>
> -------------
>
> Scalability
>
> Main factors
>
> (a) retrieval time
>     QE search: proportional to number of dirs that include the query
>     RE search: proportional to size of index
>
> (b) initial construction time
>     proportional to total data
>
> (c) update time
>     mkdb: proportional to number of dirs and the changes
>     mkdb1: proportional to changes and size of index
>
> -------------
>
> Used Tools
>
> (1) rc
>
> (2) grep, sed, awk, sort, diff, gzip, ...
>
> (3) some new tools written in C
>
> -------------
>
> What does Kirara mean?
>
> Kirara is the name of a girl who appeared in a Japanese comic book.
> (But I have never read the book.)
> The name is seldom used in the real world.
> From the name we Japanese imagine something glittering.
> I like the name.
>
> -------------
>
> References
>
> [1] GLIMPSE: A Tool to Search Through Entire File Systems
>     Udi Manber and Sun Wu (1993)
>     http://webglimpse.net/pubs/glimpse.pdf
> [2] Glimpse Documentation
>     http://webglimpse.net/gdocs/glimpsehelp.html
>
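
For readers who want to see how the top-level lookup in the quoted message might fit together, here is a rough sketch in Python. It is not Kirara's code (Kirara uses rc, the usual text tools, and some C programs). It assumes that each mindex line is "prefix start end", that the prefix is the first two runes of a word, and that start/end are byte offsets into a sorted index file of "word QID" lines. The names load_mindex, lookup and query_and, and the toy data, are made up for illustration.

```python
import io

# Toy data in the layout shown in the quoted message (contents are invented).
INDEX = (
    b"aa 0000000000014f0a\n"
    b"aa 000000000001a1e0\n"
    b"ab 0000000000014f0a\n"
    b"abc 000000000001a26e\n"
)
# Assumed meaning: two-rune prefix, then byte range [start, end) in INDEX.
MINDEX = "aa 0 40\nab 40 81\n"


def load_mindex(text):
    """Parse 'prefix start end' lines into {prefix: (start, end)}."""
    table = {}
    for line in text.splitlines():
        prefix, start, end = line.split()
        table[prefix] = (int(start), int(end))
    return table


def lookup(word, mindex, index_file):
    """Return the set of directory QIDs recorded for one word."""
    word = word.lower()            # words are stored in lower case
    rng = mindex.get(word[:2])     # assumed: keyed by first two runes
    if rng is None:
        return set()
    start, end = rng
    index_file.seek(start)         # read only the slice for this prefix
    chunk = index_file.read(end - start).decode("utf-8")
    qids = set()
    for line in chunk.splitlines():
        w, qid = line.split()
        if w == word:
            qids.add(qid)
    return qids


def query_and(words, mindex, index_file):
    """Evaluate 'w1&w2&...': directories that contain every word."""
    sets = [lookup(w, mindex, index_file) for w in words]
    return set.intersection(*sets) if sets else set()


if __name__ == "__main__":
    mi = load_mindex(MINDEX)
    hits = query_and(["aa", "ab"], mi, io.BytesIO(INDEX))
    for qid in sorted(hits):
        print(qid)
```

The point of the sketch is only that the meta index keeps a QE-mode query from scanning the whole index: each word touches one slice, and '&' becomes a set intersection over directory QIDs. The second step (grep inside the matching directories) is left out.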