From mboxrd@z Thu Jan  1 00:00:00 1970
MIME-Version: 1.0
In-Reply-To: <5E38D9FC-75B6-40C9-AB64-E210DACE0B4E@ar.aichi-u.ac.jp>
References: <5E38D9FC-75B6-40C9-AB64-E210DACE0B4E@ar.aichi-u.ac.jp>
Date: Tue,  6 Aug 2013 10:12:25 +0200
Message-ID: <CAM6ozu72FQGGsN73KR5EqcTKhsYqk0afnJY4fEPx9Wh8WSFZrg@mail.gmail.com>
From: "Peter A. Cejchan" <tyapca7@gmail.com>
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
Content-Type: multipart/alternative; boundary=e89a8ff2545435a0bd04e342fc17
Subject: Re: [9fans] text database Kirara
Topicbox-Message-UUID: 6f5b395c-ead8-11e9-9d60-3106f5b1d025

--e89a8ff2545435a0bd04e342fc17
Content-Type: text/plain; charset=UTF-8

Sounds great!
Thank you, Kenji !!
++pac


On Tue, Aug 6, 2013 at 3:14 AM, arisawa <arisawa@ar.aichi-u.ac.jp> wrote:

> Hello 9fans,
>
> I have written a text database named Kirara.
> The following is a brief introduction to Kirara.
> If you are interested in, get Kirara from:
>   http://plan9.aichi-u.ac.jp/netlib/kirara/
>
> Kenji Arisawa
>
> -------------
>
> Kirara
>
> -------------
>
> Kirara is a text indexing/retrieval tool for Plan 9.
>
> Personal use: index/retrieve local files.
>
> Kirara is based on the idea similar to Glimpse.
>
> (1) indexing + grep
> (2) multi-level indexing
>
> (a) small space for indexing
> (b) small update time
> (c) quick search
>
> Note that:
> small indexing   <->    quick search
> Kirara makes more index -> quick search
> Glimpse is single-level indexing.
>
> -------------
>
> Query
>
> Kirara does not support phrase search.
> The database is index of words,
>
> supporting:
> QE mode (query expression mode)
>         '&', '|', '*'
> The example:
>         'snoopy&html'
>         'snoop*&htm*'
>
> RE mode (regular expression mode)
>         '&', RE
> where RE denotes regular expression.
> The example:
>         'sn.*y&h.+l'
>
> RE mode is a bit slow. (a few second.)
>
> -------------
>
> Words
>
> Two or more runes.
> All words are converted to lower case.
> In English, words is composed of alphabets.
> The number of runes is configurable
>
> Assumption:
> Text is composed of space-separated words
> popular in English and many European Languages,
> but not in Japanese.
>
> -------------
>
> The user's interface
>
> Best match with Rio
>         term% kfind snoop
>         G snoop /sys/src/9/ip/
>         G snoop /sys/src/cmd/spell/
>         G snoop /sys/src/9/kw/
>         ...
>         term% G snoop /sys/src/9/ip
>         devip.c:34:     Qsnoop,
>         devip.c:95:     case Qsnoop:
>         devip.c:98:             devdir(c, q, "snoop", qlen(cv->sq),
> cv->owner, 0400, dp);
>         ...
> Note that: two steps
> 1. find directories
> 2. find files and the contents
> Step 2 is actually 'grep'. we can use RE.
>
> Two-steps search is not a weekness, but a desirable feature.
> Because we have so many files that are hit by the query.
>
> -------------
>
> The organization
>
> My example
>
> /n/other/kirara/sysdb
> target: (/lib /sys/lib /sys/src /sys/man /sys/include /sys/doc /rc)
>
> /n/other/kirara/usrdb
> target: $home/^(bin/rc lib netlib doc adm issues srclib src sources)
>
> Indexing target is fully configurable.
>
> -------------
>
> Multi-Level Indexing
>
> (1) Indexing (top level)
> word to directory mapping
>
> sysdb/index             # main index                                    #
> used for RE mode
> sysdb/mindex    # meta index (alphabetic index) # used for QE mode
> sysdb/dind/*    # rough index of each directory
> sysdb/QTDir             # map table (QID, mtime, path-to-dir)
>
> index           # word to dir QID
>         aa      0000000000014f0a
>         aa      000000000001a1e0
>         aa      000000000001a26e
>
> mindex  # word to range in index
>         aa 0 126669
>         ab 126669 491569
>         ac 491569 1258566
>         ad 1258566 1852467
>         ...
> dind/*                          # `*' is a directory QID
>         0000000000014f05
>         0000000000014f0a
>         000000000001a1ce
>
> usrdb is same.
>
> (2) Indexing (directory level)  # optional
> word to file mapping
>
> sysdb/find/*/ind.gz     # fine index of the directory (gzipped)
> sysdb/find/*/qtn        # map table (QID, mtime, name)
> where `*' is a directory QID
>
> usrdb is same as sysdb.
>
> -------------
>
> Experiment
>
> (a) hardware
> GA-H61M-USB3-B3
> Intel Pentium G860 (3GHz)
> DDR3 PC3 4GB
>
> (b) software
> 9front
> cwfs64x
>
> -------------
>
> The performance (compression ratio)
>
> target   target  num_of_dirs      indexing
> sysdb:   556 MB    1790 dirs             49 MB
> usrdb:  6620 MB    8948 dirs            150 MB
>
> compression ratio: 49/556 (sysdb)
> note: usrdb includes many non-text file.
>
> -------------
>
> The performance (retrieval time)
> system dependent
>
> RQ search       # kfind foo
>
> 0.1 seconds.
>
> It is not important to make this time smaller.
> (sufficiently small)
>
> RE search       # kfind -r foo
>
> a few seconds
>
>
> -------------
>
> The performance (construction/update)
>
> (a) Construction time
> system dependent
>
> Initial construction
> need
>         10 minutes for sysdb
>         30 minutes for usrdb
>
> (b) Updating time
> two commands for update
>
> mkdb
>         20 seconds to a few minutes for usrdb
>         depends largely on state of cache
>
> mkdb1 (currently only for usrdb)
>         5 to 15 seconds for usrdb
>         mkdb1 needs event log
>
> -------------
>
> Scalability
>
> Main factors
>
> (a) retrieval time
> QE search: proportional to number of dirs that include the query
> RE search: proportional to size of index
>
> (b) initial construction time
> proportional to total data
>
> (c) update time
> mkdb:   proportional to number of dirs and the changes
> mkdb1:  proportional to changes and size of index
>
> -------------
>
> Used Tools
>
> (1) rc
>
> (2) grep, sed, awk, sort, diff, gzip, ...
>
> (3) some new tools written in C
>
> -------------
>
> What Kirara means?
>
> Kirara is name of a girl that appeared in a Japanese comic book.
> (But I have never read the book.)
> The name is seldom used in real world.
> From the name we Japanese imagine something glittering.
> I like the name.
>
> -------------
>
> References
>
> [1] GLIMPSE: A Tool to Search Through Entire File Systems
> Udi Manber and Sun Wu (1993)
> http://webglimpse.net/pubs/glimpse.pdf
> [2] Glimpse Documentation
> http://webglimpse.net/gdocs/glimpsehelp.html
>
>
>

--e89a8ff2545435a0bd04e342fc17
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr"><div><div>Sounds great!<br></div>Thank you, Kenji !!<br></=
div>++pac<br></div><div class=3D"gmail_extra"><br><br><div class=3D"gmail_q=
uote">On Tue, Aug 6, 2013 at 3:14 AM, arisawa <span dir=3D"ltr">&lt;<a href=
=3D"mailto:arisawa@ar.aichi-u.ac.jp" target=3D"_blank">arisawa@ar.aichi-u.a=
c.jp</a>&gt;</span> wrote:<br>
<blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1p=
x #ccc solid;padding-left:1ex">Hello 9fans,<br>
<br>
I have written a text database named Kirara.<br>
The following is a brief introduction to Kirara.<br>
If you are interested in, get Kirara from:<br>
=C2=A0 <a href=3D"http://plan9.aichi-u.ac.jp/netlib/kirara/" target=3D"_bla=
nk">http://plan9.aichi-u.ac.jp/netlib/kirara/</a><br>
<br>
Kenji Arisawa<br>
<br>
-------------<br>
<br>
Kirara<br>
<br>
-------------<br>
<br>
Kirara is a text indexing/retrieval tool for Plan 9.<br>
<br>
Personal use: index/retrieve local files.<br>
<br>
Kirara is based on the idea similar to Glimpse.<br>
<br>
(1) indexing + grep<br>
(2) multi-level indexing<br>
<br>
(a) small space for indexing<br>
(b) small update time<br>
(c) quick search<br>
<br>
Note that:<br>
small indexing =C2=A0 &lt;-&gt; =C2=A0 =C2=A0quick search<br>
Kirara makes more index -&gt; quick search<br>
Glimpse is single-level indexing.<br>
<br>
-------------<br>
<br>
Query<br>
<br>
Kirara does not support phrase search.<br>
The database is index of words,<br>
<br>
supporting:<br>
QE mode (query expression mode)<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 &#39;&amp;&#39;, &#39;|&#39;, &#39;*&#39;<br>
The example:<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 &#39;snoopy&amp;html&#39;<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 &#39;snoop*&amp;htm*&#39;<br>
<br>
RE mode (regular expression mode)<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 &#39;&amp;&#39;, RE<br>
where RE denotes regular expression.<br>
The example:<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 &#39;sn.*y&amp;h.+l&#39;<br>
<br>
RE mode is a bit slow. (a few second.)<br>
<br>
-------------<br>
<br>
Words<br>
<br>
Two or more runes.<br>
All words are converted to lower case.<br>
In English, words is composed of alphabets.<br>
The number of runes is configurable<br>
<br>
Assumption:<br>
Text is composed of space-separated words<br>
popular in English and many European Languages,<br>
but not in Japanese.<br>
<br>
-------------<br>
<br>
The user&#39;s interface<br>
<br>
Best match with Rio<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 term% kfind snoop<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 G snoop /sys/src/9/ip/<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 G snoop /sys/src/cmd/spell/<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 G snoop /sys/src/9/kw/<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 ...<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 term% G snoop /sys/src/9/ip<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 devip.c:34: =C2=A0 =C2=A0 Qsnoop,<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 devip.c:95: =C2=A0 =C2=A0 case Qsnoop:<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 devip.c:98: =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =
=C2=A0 devdir(c, q, &quot;snoop&quot;, qlen(cv-&gt;sq), cv-&gt;owner, 0400,=
 dp);<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 ...<br>
Note that: two steps<br>
1. find directories<br>
2. find files and the contents<br>
Step 2 is actually &#39;grep&#39;. we can use RE.<br>
<br>
Two-steps search is not a weekness, but a desirable feature.<br>
Because we have so many files that are hit by the query.<br>
<br>
-------------<br>
<br>
The organization<br>
<br>
My example<br>
<br>
/n/other/kirara/sysdb<br>
target: (/lib /sys/lib /sys/src /sys/man /sys/include /sys/doc /rc)<br>
<br>
/n/other/kirara/usrdb<br>
target: $home/^(bin/rc lib netlib doc adm issues srclib src sources)<br>
<br>
Indexing target is fully configurable.<br>
<br>
-------------<br>
<br>
Multi-Level Indexing<br>
<br>
(1) Indexing (top level)<br>
word to directory mapping<br>
<br>
sysdb/index =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 # main index =C2=A0 =
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0# used for RE mode<br>
sysdb/mindex =C2=A0 =C2=A0# meta index (alphabetic index) # used for QE mod=
e<br>
sysdb/dind/* =C2=A0 =C2=A0# rough index of each directory<br>
sysdb/QTDir =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 # map table (QID, mti=
me, path-to-dir)<br>
<br>
index =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 # word to dir QID<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 aa =C2=A0 =C2=A0 =C2=A00000000000014f0a<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 aa =C2=A0 =C2=A0 =C2=A0000000000001a1e0<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 aa =C2=A0 =C2=A0 =C2=A0000000000001a26e<br>
<br>
mindex =C2=A0# word to range in index<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 aa 0 126669<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 ab 126669 491569<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 ac 491569 1258566<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 ad 1258566 1852467<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 ...<br>
dind/* =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0# `*&#39; is a directory QID<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 0000000000014f05<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 0000000000014f0a<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 000000000001a1ce<br>
<br>
usrdb is same.<br>
<br>
(2) Indexing (directory level) =C2=A0# optional<br>
word to file mapping<br>
<br>
sysdb/find/*/ind.gz =C2=A0 =C2=A0 # fine index of the directory (gzipped)<b=
r>
sysdb/find/*/qtn =C2=A0 =C2=A0 =C2=A0 =C2=A0# map table (QID, mtime, name)<=
br>
where `*&#39; is a directory QID<br>
<br>
usrdb is same as sysdb.<br>
<br>
-------------<br>
<br>
Experiment<br>
<br>
(a) hardware<br>
GA-H61M-USB3-B3<br>
Intel Pentium G860 (3GHz)<br>
DDR3 PC3 4GB<br>
<br>
(b) software<br>
9front<br>
cwfs64x<br>
<br>
-------------<br>
<br>
The performance (compression ratio)<br>
<br>
target =C2=A0 target =C2=A0num_of_dirs =C2=A0 =C2=A0 =C2=A0indexing<br>
sysdb: =C2=A0 556 MB =C2=A0 =C2=A01790 dirs =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 49 MB<br>
usrdb: =C2=A06620 MB =C2=A0 =C2=A08948 dirs =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0150 MB<br>
<br>
compression ratio: 49/556 (sysdb)<br>
note: usrdb includes many non-text file.<br>
<br>
-------------<br>
<br>
The performance (retrieval time)<br>
system dependent<br>
<br>
RQ search =C2=A0 =C2=A0 =C2=A0 # kfind foo<br>
<br>
0.1 seconds.<br>
<br>
It is not important to make this time smaller.<br>
(sufficiently small)<br>
<br>
RE search =C2=A0 =C2=A0 =C2=A0 # kfind -r foo<br>
<br>
a few seconds<br>
<br>
<br>
-------------<br>
<br>
The performance (construction/update)<br>
<br>
(a) Construction time<br>
system dependent<br>
<br>
Initial construction<br>
need<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 10 minutes for sysdb<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 30 minutes for usrdb<br>
<br>
(b) Updating time<br>
two commands for update<br>
<br>
mkdb<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 20 seconds to a few minutes for usrdb<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 depends largely on state of cache<br>
<br>
mkdb1 (currently only for usrdb)<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 5 to 15 seconds for usrdb<br>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 mkdb1 needs event log<br>
<br>
-------------<br>
<br>
Scalability<br>
<br>
Main factors<br>
<br>
(a) retrieval time<br>
QE search: proportional to number of dirs that include the query<br>
RE search: proportional to size of index<br>
<br>
(b) initial construction time<br>
proportional to total data<br>
<br>
(c) update time<br>
mkdb: =C2=A0 proportional to number of dirs and the changes<br>
mkdb1: =C2=A0proportional to changes and size of index<br>
<br>
-------------<br>
<br>
Used Tools<br>
<br>
(1) rc<br>
<br>
(2) grep, sed, awk, sort, diff, gzip, ...<br>
<br>
(3) some new tools written in C<br>
<br>
-------------<br>
<br>
What Kirara means?<br>
<br>
Kirara is name of a girl that appeared in a Japanese comic book.<br>
(But I have never read the book.)<br>
The name is seldom used in real world.<br>
>>From the name we Japanese imagine something glittering.<br>
I like the name.<br>
<br>
-------------<br>
<br>
References<br>
<br>
[1] GLIMPSE: A Tool to Search Through Entire File Systems<br>
Udi Manber and Sun Wu (1993)<br>
<a href=3D"http://webglimpse.net/pubs/glimpse.pdf" target=3D"_blank">http:/=
/webglimpse.net/pubs/glimpse.pdf</a><br>
[2] Glimpse Documentation<br>
<a href=3D"http://webglimpse.net/gdocs/glimpsehelp.html" target=3D"_blank">=
http://webglimpse.net/gdocs/glimpsehelp.html</a><br>
<br>
<br>
</blockquote></div><br></div>

--e89a8ff2545435a0bd04e342fc17--