From: arisawa <arisawa@ar.aichi-u.ac.jp>
To: 9fans@9fans.net
Subject: [9fans] text database Kirara
Date: Tue, 6 Aug 2013 10:14:36 +0900 [thread overview]
Message-ID: <5E38D9FC-75B6-40C9-AB64-E210DACE0B4E@ar.aichi-u.ac.jp> (raw)
Hello 9fans,
I have written a text database named Kirara.
The following is a brief introduction to Kirara.
If you are interested in, get Kirara from:
http://plan9.aichi-u.ac.jp/netlib/kirara/
Kenji Arisawa
-------------
Kirara
-------------
Kirara is a text indexing/retrieval tool for Plan 9.
Personal use: index/retrieve local files.
Kirara is based on the idea similar to Glimpse.
(1) indexing + grep
(2) multi-level indexing
(a) small space for indexing
(b) small update time
(c) quick search
Note that:
small indexing <-> quick search
Kirara makes more index -> quick search
Glimpse is single-level indexing.
-------------
Query
Kirara does not support phrase search.
The database is index of words,
supporting:
QE mode (query expression mode)
'&', '|', '*'
The example:
'snoopy&html'
'snoop*&htm*'
RE mode (regular expression mode)
'&', RE
where RE denotes regular expression.
The example:
'sn.*y&h.+l'
RE mode is a bit slow. (a few second.)
-------------
Words
Two or more runes.
All words are converted to lower case.
In English, words is composed of alphabets.
The number of runes is configurable
Assumption:
Text is composed of space-separated words
popular in English and many European Languages,
but not in Japanese.
-------------
The user's interface
Best match with Rio
term% kfind snoop
G snoop /sys/src/9/ip/
G snoop /sys/src/cmd/spell/
G snoop /sys/src/9/kw/
...
term% G snoop /sys/src/9/ip
devip.c:34: Qsnoop,
devip.c:95: case Qsnoop:
devip.c:98: devdir(c, q, "snoop", qlen(cv->sq), cv->owner, 0400, dp);
...
Note that: two steps
1. find directories
2. find files and the contents
Step 2 is actually 'grep'. we can use RE.
Two-steps search is not a weekness, but a desirable feature.
Because we have so many files that are hit by the query.
-------------
The organization
My example
/n/other/kirara/sysdb
target: (/lib /sys/lib /sys/src /sys/man /sys/include /sys/doc /rc)
/n/other/kirara/usrdb
target: $home/^(bin/rc lib netlib doc adm issues srclib src sources)
Indexing target is fully configurable.
-------------
Multi-Level Indexing
(1) Indexing (top level)
word to directory mapping
sysdb/index # main index # used for RE mode
sysdb/mindex # meta index (alphabetic index) # used for QE mode
sysdb/dind/* # rough index of each directory
sysdb/QTDir # map table (QID, mtime, path-to-dir)
index # word to dir QID
aa 0000000000014f0a
aa 000000000001a1e0
aa 000000000001a26e
mindex # word to range in index
aa 0 126669
ab 126669 491569
ac 491569 1258566
ad 1258566 1852467
...
dind/* # `*' is a directory QID
0000000000014f05
0000000000014f0a
000000000001a1ce
usrdb is same.
(2) Indexing (directory level) # optional
word to file mapping
sysdb/find/*/ind.gz # fine index of the directory (gzipped)
sysdb/find/*/qtn # map table (QID, mtime, name)
where `*' is a directory QID
usrdb is same as sysdb.
-------------
Experiment
(a) hardware
GA-H61M-USB3-B3
Intel Pentium G860 (3GHz)
DDR3 PC3 4GB
(b) software
9front
cwfs64x
-------------
The performance (compression ratio)
target target num_of_dirs indexing
sysdb: 556 MB 1790 dirs 49 MB
usrdb: 6620 MB 8948 dirs 150 MB
compression ratio: 49/556 (sysdb)
note: usrdb includes many non-text file.
-------------
The performance (retrieval time)
system dependent
RQ search # kfind foo
0.1 seconds.
It is not important to make this time smaller.
(sufficiently small)
RE search # kfind -r foo
a few seconds
-------------
The performance (construction/update)
(a) Construction time
system dependent
Initial construction
need
10 minutes for sysdb
30 minutes for usrdb
(b) Updating time
two commands for update
mkdb
20 seconds to a few minutes for usrdb
depends largely on state of cache
mkdb1 (currently only for usrdb)
5 to 15 seconds for usrdb
mkdb1 needs event log
-------------
Scalability
Main factors
(a) retrieval time
QE search: proportional to number of dirs that include the query
RE search: proportional to size of index
(b) initial construction time
proportional to total data
(c) update time
mkdb: proportional to number of dirs and the changes
mkdb1: proportional to changes and size of index
-------------
Used Tools
(1) rc
(2) grep, sed, awk, sort, diff, gzip, ...
(3) some new tools written in C
-------------
What Kirara means?
Kirara is name of a girl that appeared in a Japanese comic book.
(But I have never read the book.)
The name is seldom used in real world.
From the name we Japanese imagine something glittering.
I like the name.
-------------
References
[1] GLIMPSE: A Tool to Search Through Entire File Systems
Udi Manber and Sun Wu (1993)
http://webglimpse.net/pubs/glimpse.pdf
[2] Glimpse Documentation
http://webglimpse.net/gdocs/glimpsehelp.html
next reply other threads:[~2013-08-06 1:14 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-08-06 1:14 arisawa [this message]
2013-08-06 7:26 ` Francisco J Ballesteros
2013-08-06 8:12 ` Peter A. Cejchan
2013-08-06 18:14 ` brz-systemd-dev
2013-08-07 6:31 ` arisawa
2013-08-07 7:22 ` Skip Tavakkolian
2013-08-07 8:17 ` arisawa
2013-08-07 13:32 ` erik quanstrom
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=5E38D9FC-75B6-40C9-AB64-E210DACE0B4E@ar.aichi-u.ac.jp \
--to=arisawa@ar.aichi-u.ac.jp \
--cc=9fans@9fans.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).