From mboxrd@z Thu Jan  1 00:00:00 1970
References: <5E38D9FC-75B6-40C9-AB64-E210DACE0B4E@ar.aichi-u.ac.jp>
In-Reply-To: <5E38D9FC-75B6-40C9-AB64-E210DACE0B4E@ar.aichi-u.ac.jp>
Mime-Version: 1.0 (1.0)
Content-Type: text/plain;
	charset=us-ascii
Message-Id: <60B43026-4031-45BC-911C-57F00ADD16D3@lsub.org>
Content-Transfer-Encoding: quoted-printable
From: Francisco J Ballesteros <nemo@lsub.org>
Date: Tue,  6 Aug 2013 09:26:58 +0200
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
Subject: Re: [9fans] text database Kirara
Topicbox-Message-UUID: 6f48c6fa-ead8-11e9-9d60-3106f5b1d025

nice. btw I think I have another tag tool at contrib. not sure, but it was called tags, and the search tool was called F.

I say this because it used file and per-file-type tag listing (ms2txt, etc.) to index other file types. It might be an idea to add to your nice tool.

On Aug 6, 2013, at 3:14 AM, arisawa <arisawa@ar.aichi-u.ac.jp> wrote:

> Hello 9fans,
>=20
> I have written a text database named Kirara.
> The following is a brief introduction to Kirara.
> If you are interested, get Kirara from:
>  http://plan9.aichi-u.ac.jp/netlib/kirara/
>=20
> Kenji Arisawa
>=20
> -------------
>=20
> Kirara
>=20
> -------------
>=20
> Kirara is a text indexing/retrieval tool for Plan 9.
>=20
> Personal use: index/retrieve local files.
>=20
> Kirara is based on an idea similar to that of Glimpse.
>=20
> (1) indexing + grep
> (2) multi-level indexing
>=20
> (a) small space for indexing
> (b) small update time
> (c) quick search
>=20
> Note that:
> small index   <->   quick search   (a trade-off)
> Kirara builds more index data to make search quick.
> Glimpse uses single-level indexing.
>=20
> -------------
>=20
> Query
>=20
> Kirara does not support phrase search.
> The database is an index of words,
>=20
> supporting:
> QE mode (query expression mode)
>    '&', '|', '*'
> Examples:
>    'snoopy&html'
>    'snoop*&htm*'
>=20
> RE mode (regular expression mode)
>    '&', RE
> where RE denotes regular expression.
> Example:
>    'sn.*y&h.+l'
>=20
> RE mode is a bit slow (a few seconds).
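
A minimal sketch of how QE-mode queries like the ones above might be evaluated, in Python for illustration only. The index layout (a dict from word to a set of directories), the function names, and the precedence ('&' binding tighter than '|') are assumptions made for this example, not Kirara's actual implementation.

```python
import re

def term_to_regex(term):
    """Turn a QE term into a regex; '*' matches any run of characters."""
    return re.compile(".*".join(re.escape(part) for part in term.split("*")))

def eval_qe(query, index):
    """Evaluate a query against a word -> set-of-directories map.

    Assumed grammar (hypothetical): '|' separates alternatives and
    '&' binds tighter than '|'.
    """
    result = set()
    for alternative in query.lower().split("|"):
        dirs = None
        for term in alternative.split("&"):
            regex = term_to_regex(term.strip())
            hits = set()
            for word, word_dirs in index.items():
                if regex.fullmatch(word):
                    hits |= word_dirs
            # '&' keeps only directories that contain every term
            dirs = hits if dirs is None else dirs & hits
        result |= dirs
    return result

# Toy index; the directory names are made up for the example.
index = {
    "snoop": {"/a", "/b"},
    "snoopy": {"/a"},
    "htm": {"/b"},
    "html": {"/a", "/c"},
}
print(sorted(eval_qe("snoopy&html", index)))
print(sorted(eval_qe("snoop*&htm*", index)))
```

Because the index maps words to directories, a hit only says that every term occurs somewhere in the directory, not necessarily in the same file.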
>=20
> -------------
>=20
> Words
>=20
> Two or more runes.
> All words are converted to lower case.
> In English, words are composed of alphabetic characters.
> The number of runes is configurable.
>=20
> Assumption:
> Text is composed of space-separated words.
> This is common in English and many European languages,
> but not in Japanese.
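
The word rules above can be sketched as follows (Python, illustration only). The function name and the `min_runes` parameter are hypothetical, and Kirara's own tokenizer may treat digits or punctuation differently.

```python
import re

def extract_words(text, min_runes=2):
    """Return the set of lower-cased alphabetic words of at least min_runes characters."""
    # [^\W\d_]+ matches runs of letters (word characters minus digits and '_')
    words = re.findall(r"[^\W\d_]+", text.lower())
    return {w for w in words if len(w) >= min_runes}

print(sorted(extract_words("case Qsnoop: devdir(c, q, 42)")))
```

Single letters such as `c` and `q` are dropped by the length rule. Text written without spaces, as in Japanese, would come out of this function as one long "word", which is the limitation the assumption above describes.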
>=20
> -------------
>=20
> The user interface
>=20
> Works best with rio
>    term% kfind snoop
>    G snoop /sys/src/9/ip/
>    G snoop /sys/src/cmd/spell/
>    G snoop /sys/src/9/kw/
>    ...
>    term% G snoop /sys/src/9/ip
>    devip.c:34:    Qsnoop,
>    devip.c:95:    case Qsnoop:
>    devip.c:98:        devdir(c, q, "snoop", qlen(cv->sq), cv->owner, 0400, dp);
>    ...
> Note that there are two steps:
> 1. find directories
> 2. find files and the contents
> Step 2 is actually 'grep', so we can use an RE.
>=20
> Two-step search is not a weakness but a desirable feature,
> because so many files are hit by a query.
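
The two steps can be sketched like this (Python, illustration only). The function names and the dict-based index are hypothetical, G is treated here simply as a grep over one directory, and not descending into subdirectories is an assumption.

```python
import os
import re

def find_dirs(word, index):
    """Step 1: return 'G word dir' command lines for directories that contain word."""
    return ["G %s %s" % (word, d) for d in sorted(index.get(word, ()))]

def grep_dir(pattern, directory):
    """Step 2: grep the regular files directly inside directory (no recursion)."""
    regex = re.compile(pattern)
    hits = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        with open(path, encoding="utf-8", errors="replace") as f:
            for lineno, line in enumerate(f, 1):
                if regex.search(line):
                    hits.append("%s:%d: %s" % (name, lineno, line.strip()))
    return hits
```

Step 1 touches only the index, so it stays cheap however many files match. Step 2 reads real files, but only in the one directory the user picked.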
>=20
> -------------
>=20
> The organization
>=20
> My example
>=20
> /n/other/kirara/sysdb
> target: (/lib /sys/lib /sys/src /sys/man /sys/include /sys/doc /rc)
>=20
> /n/other/kirara/usrdb
> target: $home/^(bin/rc lib netlib doc adm issues srclib src sources)
>=20
> Indexing target is fully configurable.
>=20
> -------------
>=20
> Multi-Level Indexing
>=20
> (1) Indexing (top level)
> word to directory mapping
>=20
> sysdb/index     # main index                     # used for RE mode
> sysdb/mindex    # meta index (alphabetic index)  # used for QE mode
> sysdb/dind/*    # rough index of each directory
> sysdb/QTDir     # map table (QID, mtime, path-to-dir)
>=20
> index        # word to dir QID
>    aa    0000000000014f0a
>    aa    000000000001a1e0
>    aa    000000000001a26e
>=20
> mindex    # word to range in index
>    aa 0 126669
>    ab 126669 491569
>    ac 491569 1258566
>    ad 1258566 1852467
>    ...
> dind/*                # `*' is a directory QID
>    0000000000014f05
>    0000000000014f0a
>    000000000001a1ce
>=20
> usrdb is the same.
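
A sketch of why a meta index helps (Python, illustration only). It assumes the ranges in mindex are byte offsets into a sorted index keyed by two-letter prefix, and that each index line is "word QID". Both are guesses from the listing above, and the function names are hypothetical.

```python
import io

def build_mindex(index_bytes, prefix_len=2):
    """Map each word prefix to the (start, end) byte range it occupies in the sorted index."""
    ranges = {}
    offset = 0
    for line in io.BytesIO(index_bytes):
        prefix = line.split()[0][:prefix_len].decode()
        start, _ = ranges.get(prefix, (offset, offset))
        offset += len(line)
        ranges[prefix] = (start, offset)
    return ranges

def lookup(word, index_file, mindex, prefix_len=2):
    """Read only the byte range for the word's prefix and return the matching QIDs."""
    if word[:prefix_len] not in mindex:
        return []
    start, end = mindex[word[:prefix_len]]
    index_file.seek(start)
    chunk = index_file.read(end - start).decode()
    return [qid for w, qid in (l.split() for l in chunk.splitlines()) if w == word]

# Toy index in the assumed "word QID" format, sorted by word.
index = (b"aa\t0000000000014f0a\n"
         b"aa\t000000000001a1e0\n"
         b"ab\t0000000000014f05\n"
         b"abc\t000000000001a1ce\n"
         b"ac\t0000000000014f0a\n")
mindex = build_mindex(index)
print(mindex)
print(lookup("abc", io.BytesIO(index), mindex))
```

A word lookup then reads one slice of the index rather than the whole file. An arbitrary regular expression cannot be narrowed to a prefix, which would fit RE mode scanning the main index.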
>=20
> (2) Indexing (directory level)    # optional
> word to file mapping
>=20
> sysdb/find/*/ind.gz    # fine index of the directory (gzipped)
> sysdb/find/*/qtn    # map table (QID, mtime, name)
> where `*' is a directory QID
>=20
> usrdb is the same as sysdb.
>=20
> -------------
>=20
> Experiment
>=20
> (a) hardware
> GA-H61M-USB3-B3
> Intel Pentium G860 (3GHz)
> DDR3 PC3 4GB
>=20
> (b) software
> 9front
> cwfs64x
>=20
> -------------
>=20
> The performance (compression ratio)
>=20
> db       target size   num_of_dirs   index size
> sysdb:    556 MB       1790 dirs      49 MB
> usrdb:   6620 MB       8948 dirs     150 MB
>=20
> compression ratio: 49/556 (sysdb)
> note: usrdb includes many non-text files.
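
For reference, the ratios implied by the table work out as follows (a small Python calculation; the helper name is made up for the example).

```python
def index_ratio(index_mb, target_mb):
    """Size of the index as a fraction of the indexed data."""
    return index_mb / target_mb

print("sysdb: %.1f%%" % (100 * index_ratio(49, 556)))    # sysdb: 8.8%
print("usrdb: %.1f%%" % (100 * index_ratio(150, 6620)))  # usrdb: 2.3%
```

The usrdb figure looks better than it is, since the non-text files count toward the 6620 MB but presumably add little to the index.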
>=20
> -------------
>=20
> The performance (retrieval time)
> system dependent
>=20
> QE search    # kfind foo
>=20
> 0.1 seconds.
>=20
> It is not important to make this time smaller
> (it is already sufficiently small).
>=20
> RE search    # kfind -r foo
>=20
> a few seconds
>=20
>=20
> -------------
>=20
> The performance (construction/update)
>=20
> (a) Construction time
> system dependent
>=20
> Initial construction needs
>    10 minutes for sysdb
>    30 minutes for usrdb
>=20
> (b) Updating time
> two commands for update
>=20
> mkdb
>    20 seconds to a few minutes for usrdb
>    depends largely on state of cache
>=20
> mkdb1 (currently only for usrdb)
>    5 to 15 seconds for usrdb
>    mkdb1 needs event log
>=20
> -------------
>=20
> Scalability
>=20
> Main factors
>=20
> (a) retrieval time
> QE search: proportional to number of dirs that include the query
> RE search: proportional to size of index
>=20
> (b) initial construction time
> proportional to total data
>=20
> (c) update time
> mkdb:    proportional to number of dirs and the changes
> mkdb1:    proportional to changes and size of index
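
One way an update pass could scale with the number of directories plus the changes is to compare each directory's current mtime with the one stored in a table like QTDir. The sketch below (Python, illustration only) is an assumption about how such a pass might work, not a description of mkdb. The function name and the path-to-mtime dict are hypothetical.

```python
import os

def changed_dirs(qtdir, root):
    """Return directories under root that are new or whose mtime differs from the stored one.

    qtdir maps directory path -> mtime recorded at the last indexing run.
    """
    changed = []
    for path, _subdirs, _files in os.walk(root):
        mtime = int(os.stat(path).st_mtime)
        if qtdir.get(path) != mtime:
            changed.append(path)
    return sorted(changed)
```

Every directory is visited once, but only the ones reported here would need re-indexing. This only notices changes that alter a directory's mtime. Per-file changes would need a per-file table, such as the (QID, mtime, name) entries described for the directory-level index.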
>=20
> -------------
>=20
> Used Tools
>=20
> (1) rc
>=20
> (2) grep, sed, awk, sort, diff, gzip, ...
>=20
> (3) some new tools written in C
>=20
> -------------
>=20
> What does Kirara mean?
>=20
> Kirara is the name of a girl who appeared in a Japanese comic book.
> (But I have never read the book.)
> The name is seldom used in the real world.
> From the name, we Japanese imagine something glittering.
> I like the name.
>=20
> -------------
>=20
> References
>=20
> [1] GLIMPSE: A Tool to Search Through Entire File Systems
> Udi Manber and Sun Wu (1993)
> http://webglimpse.net/pubs/glimpse.pdf
> [2] Glimpse Documentation
> http://webglimpse.net/gdocs/glimpsehelp.html
>=20