From mboxrd@z Thu Jan 1 00:00:00 1970 MIME-Version: 1.0 In-Reply-To: <5E38D9FC-75B6-40C9-AB64-E210DACE0B4E@ar.aichi-u.ac.jp> References: <5E38D9FC-75B6-40C9-AB64-E210DACE0B4E@ar.aichi-u.ac.jp> Date: Tue, 6 Aug 2013 10:12:25 +0200 Message-ID: From: "Peter A. Cejchan" To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net> Content-Type: multipart/alternative; boundary=e89a8ff2545435a0bd04e342fc17 Subject: Re: [9fans] text database Kirara Topicbox-Message-UUID: 6f5b395c-ead8-11e9-9d60-3106f5b1d025 --e89a8ff2545435a0bd04e342fc17 Content-Type: text/plain; charset=UTF-8 Sounds great! Thank you, Kenji !! ++pac On Tue, Aug 6, 2013 at 3:14 AM, arisawa wrote: > Hello 9fans, > > I have written a text database named Kirara. > The following is a brief introduction to Kirara. > If you are interested in, get Kirara from: > http://plan9.aichi-u.ac.jp/netlib/kirara/ > > Kenji Arisawa > > ------------- > > Kirara > > ------------- > > Kirara is a text indexing/retrieval tool for Plan 9. > > Personal use: index/retrieve local files. > > Kirara is based on the idea similar to Glimpse. > > (1) indexing + grep > (2) multi-level indexing > > (a) small space for indexing > (b) small update time > (c) quick search > > Note that: > small indexing <-> quick search > Kirara makes more index -> quick search > Glimpse is single-level indexing. > > ------------- > > Query > > Kirara does not support phrase search. > The database is index of words, > > supporting: > QE mode (query expression mode) > '&', '|', '*' > The example: > 'snoopy&html' > 'snoop*&htm*' > > RE mode (regular expression mode) > '&', RE > where RE denotes regular expression. > The example: > 'sn.*y&h.+l' > > RE mode is a bit slow. (a few second.) > > ------------- > > Words > > Two or more runes. > All words are converted to lower case. > In English, words is composed of alphabets. > The number of runes is configurable > > Assumption: > Text is composed of space-separated words > popular in English and many European Languages, > but not in Japanese. > > ------------- > > The user's interface > > Best match with Rio > term% kfind snoop > G snoop /sys/src/9/ip/ > G snoop /sys/src/cmd/spell/ > G snoop /sys/src/9/kw/ > ... > term% G snoop /sys/src/9/ip > devip.c:34: Qsnoop, > devip.c:95: case Qsnoop: > devip.c:98: devdir(c, q, "snoop", qlen(cv->sq), > cv->owner, 0400, dp); > ... > Note that: two steps > 1. find directories > 2. find files and the contents > Step 2 is actually 'grep'. we can use RE. > > Two-steps search is not a weekness, but a desirable feature. > Because we have so many files that are hit by the query. > > ------------- > > The organization > > My example > > /n/other/kirara/sysdb > target: (/lib /sys/lib /sys/src /sys/man /sys/include /sys/doc /rc) > > /n/other/kirara/usrdb > target: $home/^(bin/rc lib netlib doc adm issues srclib src sources) > > Indexing target is fully configurable. > > ------------- > > Multi-Level Indexing > > (1) Indexing (top level) > word to directory mapping > > sysdb/index # main index # > used for RE mode > sysdb/mindex # meta index (alphabetic index) # used for QE mode > sysdb/dind/* # rough index of each directory > sysdb/QTDir # map table (QID, mtime, path-to-dir) > > index # word to dir QID > aa 0000000000014f0a > aa 000000000001a1e0 > aa 000000000001a26e > > mindex # word to range in index > aa 0 126669 > ab 126669 491569 > ac 491569 1258566 > ad 1258566 1852467 > ... > dind/* # `*' is a directory QID > 0000000000014f05 > 0000000000014f0a > 000000000001a1ce > > usrdb is same. > > (2) Indexing (directory level) # optional > word to file mapping > > sysdb/find/*/ind.gz # fine index of the directory (gzipped) > sysdb/find/*/qtn # map table (QID, mtime, name) > where `*' is a directory QID > > usrdb is same as sysdb. > > ------------- > > Experiment > > (a) hardware > GA-H61M-USB3-B3 > Intel Pentium G860 (3GHz) > DDR3 PC3 4GB > > (b) software > 9front > cwfs64x > > ------------- > > The performance (compression ratio) > > target target num_of_dirs indexing > sysdb: 556 MB 1790 dirs 49 MB > usrdb: 6620 MB 8948 dirs 150 MB > > compression ratio: 49/556 (sysdb) > note: usrdb includes many non-text file. > > ------------- > > The performance (retrieval time) > system dependent > > RQ search # kfind foo > > 0.1 seconds. > > It is not important to make this time smaller. > (sufficiently small) > > RE search # kfind -r foo > > a few seconds > > > ------------- > > The performance (construction/update) > > (a) Construction time > system dependent > > Initial construction > need > 10 minutes for sysdb > 30 minutes for usrdb > > (b) Updating time > two commands for update > > mkdb > 20 seconds to a few minutes for usrdb > depends largely on state of cache > > mkdb1 (currently only for usrdb) > 5 to 15 seconds for usrdb > mkdb1 needs event log > > ------------- > > Scalability > > Main factors > > (a) retrieval time > QE search: proportional to number of dirs that include the query > RE search: proportional to size of index > > (b) initial construction time > proportional to total data > > (c) update time > mkdb: proportional to number of dirs and the changes > mkdb1: proportional to changes and size of index > > ------------- > > Used Tools > > (1) rc > > (2) grep, sed, awk, sort, diff, gzip, ... > > (3) some new tools written in C > > ------------- > > What Kirara means? > > Kirara is name of a girl that appeared in a Japanese comic book. > (But I have never read the book.) > The name is seldom used in real world. > From the name we Japanese imagine something glittering. > I like the name. > > ------------- > > References > > [1] GLIMPSE: A Tool to Search Through Entire File Systems > Udi Manber and Sun Wu (1993) > http://webglimpse.net/pubs/glimpse.pdf > [2] Glimpse Documentation > http://webglimpse.net/gdocs/glimpsehelp.html > > > --e89a8ff2545435a0bd04e342fc17 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable
Sounds great!
Thank you, Kenji !!
++pac


On Tue, Aug 6, 2013 at 3:14 AM, arisawa <arisawa@ar.aichi-u.a= c.jp> wrote:
Hello 9fans,

I have written a text database named Kirara.
The following is a brief introduction to Kirara.
If you are interested in, get Kirara from:
=C2=A0 http://plan9.aichi-u.ac.jp/netlib/kirara/

Kenji Arisawa

-------------

Kirara

-------------

Kirara is a text indexing/retrieval tool for Plan 9.

Personal use: index/retrieve local files.

Kirara is based on the idea similar to Glimpse.

(1) indexing + grep
(2) multi-level indexing

(a) small space for indexing
(b) small update time
(c) quick search

Note that:
small indexing =C2=A0 <-> =C2=A0 =C2=A0quick search
Kirara makes more index -> quick search
Glimpse is single-level indexing.

-------------

Query

Kirara does not support phrase search.
The database is index of words,

supporting:
QE mode (query expression mode)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 '&', '|', '*'
The example:
=C2=A0 =C2=A0 =C2=A0 =C2=A0 'snoopy&html'
=C2=A0 =C2=A0 =C2=A0 =C2=A0 'snoop*&htm*'

RE mode (regular expression mode)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 '&', RE
where RE denotes regular expression.
The example:
=C2=A0 =C2=A0 =C2=A0 =C2=A0 'sn.*y&h.+l'

RE mode is a bit slow. (a few second.)

-------------

Words

Two or more runes.
All words are converted to lower case.
In English, words is composed of alphabets.
The number of runes is configurable

Assumption:
Text is composed of space-separated words
popular in English and many European Languages,
but not in Japanese.

-------------

The user's interface

Best match with Rio
=C2=A0 =C2=A0 =C2=A0 =C2=A0 term% kfind snoop
=C2=A0 =C2=A0 =C2=A0 =C2=A0 G snoop /sys/src/9/ip/
=C2=A0 =C2=A0 =C2=A0 =C2=A0 G snoop /sys/src/cmd/spell/
=C2=A0 =C2=A0 =C2=A0 =C2=A0 G snoop /sys/src/9/kw/
=C2=A0 =C2=A0 =C2=A0 =C2=A0 ...
=C2=A0 =C2=A0 =C2=A0 =C2=A0 term% G snoop /sys/src/9/ip
=C2=A0 =C2=A0 =C2=A0 =C2=A0 devip.c:34: =C2=A0 =C2=A0 Qsnoop,
=C2=A0 =C2=A0 =C2=A0 =C2=A0 devip.c:95: =C2=A0 =C2=A0 case Qsnoop:
=C2=A0 =C2=A0 =C2=A0 =C2=A0 devip.c:98: =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 devdir(c, q, "snoop", qlen(cv->sq), cv->owner, 0400,= dp);
=C2=A0 =C2=A0 =C2=A0 =C2=A0 ...
Note that: two steps
1. find directories
2. find files and the contents
Step 2 is actually 'grep'. we can use RE.

Two-steps search is not a weekness, but a desirable feature.
Because we have so many files that are hit by the query.

-------------

The organization

My example

/n/other/kirara/sysdb
target: (/lib /sys/lib /sys/src /sys/man /sys/include /sys/doc /rc)

/n/other/kirara/usrdb
target: $home/^(bin/rc lib netlib doc adm issues srclib src sources)

Indexing target is fully configurable.

-------------

Multi-Level Indexing

(1) Indexing (top level)
word to directory mapping

sysdb/index =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 # main index =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0# used for RE mode
sysdb/mindex =C2=A0 =C2=A0# meta index (alphabetic index) # used for QE mod= e
sysdb/dind/* =C2=A0 =C2=A0# rough index of each directory
sysdb/QTDir =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 # map table (QID, mti= me, path-to-dir)

index =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 # word to dir QID
=C2=A0 =C2=A0 =C2=A0 =C2=A0 aa =C2=A0 =C2=A0 =C2=A00000000000014f0a
=C2=A0 =C2=A0 =C2=A0 =C2=A0 aa =C2=A0 =C2=A0 =C2=A0000000000001a1e0
=C2=A0 =C2=A0 =C2=A0 =C2=A0 aa =C2=A0 =C2=A0 =C2=A0000000000001a26e

mindex =C2=A0# word to range in index
=C2=A0 =C2=A0 =C2=A0 =C2=A0 aa 0 126669
=C2=A0 =C2=A0 =C2=A0 =C2=A0 ab 126669 491569
=C2=A0 =C2=A0 =C2=A0 =C2=A0 ac 491569 1258566
=C2=A0 =C2=A0 =C2=A0 =C2=A0 ad 1258566 1852467
=C2=A0 =C2=A0 =C2=A0 =C2=A0 ...
dind/* =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0# `*' is a directory QID
=C2=A0 =C2=A0 =C2=A0 =C2=A0 0000000000014f05
=C2=A0 =C2=A0 =C2=A0 =C2=A0 0000000000014f0a
=C2=A0 =C2=A0 =C2=A0 =C2=A0 000000000001a1ce

usrdb is same.

(2) Indexing (directory level) =C2=A0# optional
word to file mapping

sysdb/find/*/ind.gz =C2=A0 =C2=A0 # fine index of the directory (gzipped) sysdb/find/*/qtn =C2=A0 =C2=A0 =C2=A0 =C2=A0# map table (QID, mtime, name)<= br> where `*' is a directory QID

usrdb is same as sysdb.

-------------

Experiment

(a) hardware
GA-H61M-USB3-B3
Intel Pentium G860 (3GHz)
DDR3 PC3 4GB

(b) software
9front
cwfs64x

-------------

The performance (compression ratio)

target =C2=A0 target =C2=A0num_of_dirs =C2=A0 =C2=A0 =C2=A0indexing
sysdb: =C2=A0 556 MB =C2=A0 =C2=A01790 dirs =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 49 MB
usrdb: =C2=A06620 MB =C2=A0 =C2=A08948 dirs =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0150 MB

compression ratio: 49/556 (sysdb)
note: usrdb includes many non-text file.

-------------

The performance (retrieval time)
system dependent

RQ search =C2=A0 =C2=A0 =C2=A0 # kfind foo

0.1 seconds.

It is not important to make this time smaller.
(sufficiently small)

RE search =C2=A0 =C2=A0 =C2=A0 # kfind -r foo

a few seconds


-------------

The performance (construction/update)

(a) Construction time
system dependent

Initial construction
need
=C2=A0 =C2=A0 =C2=A0 =C2=A0 10 minutes for sysdb
=C2=A0 =C2=A0 =C2=A0 =C2=A0 30 minutes for usrdb

(b) Updating time
two commands for update

mkdb
=C2=A0 =C2=A0 =C2=A0 =C2=A0 20 seconds to a few minutes for usrdb
=C2=A0 =C2=A0 =C2=A0 =C2=A0 depends largely on state of cache

mkdb1 (currently only for usrdb)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 5 to 15 seconds for usrdb
=C2=A0 =C2=A0 =C2=A0 =C2=A0 mkdb1 needs event log

-------------

Scalability

Main factors

(a) retrieval time
QE search: proportional to number of dirs that include the query
RE search: proportional to size of index

(b) initial construction time
proportional to total data

(c) update time
mkdb: =C2=A0 proportional to number of dirs and the changes
mkdb1: =C2=A0proportional to changes and size of index

-------------

Used Tools

(1) rc

(2) grep, sed, awk, sort, diff, gzip, ...

(3) some new tools written in C

-------------

What Kirara means?

Kirara is name of a girl that appeared in a Japanese comic book.
(But I have never read the book.)
The name is seldom used in real world.
>>From the name we Japanese imagine something glittering.
I like the name.

-------------

References

[1] GLIMPSE: A Tool to Search Through Entire File Systems
Udi Manber and Sun Wu (1993)
http:/= /webglimpse.net/pubs/glimpse.pdf
[2] Glimpse Documentation
= http://webglimpse.net/gdocs/glimpsehelp.html



--e89a8ff2545435a0bd04e342fc17--