From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <7fb94c8c89d208107f12d7e69d02f2d6@neinchan.znet>
To: 9fans@9fans.net
Date: Tue, 6 Aug 2013 14:14:07 -0400
From: brz-systemd-dev@intma.in
In-Reply-To: <5E38D9FC-75B6-40C9-AB64-E210DACE0B4E@ar.aichi-u.ac.jp>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="upas-xgljvioaohfaqwaewbzosbxttp"
Subject: Re: [9fans] text database Kirara
Topicbox-Message-UUID: 6f746288-ead8-11e9-9d60-3106f5b1d025

This is a multi-part message in MIME format.
--upas-xgljvioaohfaqwaewbzosbxttp
Content-Disposition: inline
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit

I've played around with Kirara for a couple of hours now, and am
pretty surprised at how simple it is.
It's already become integrated into my workflow.

Being able to quickly (and easily) search for relevant snippets
of code throughout the system is quite useful.

I feel compelled to mention that the code is abnormally high in quality.
(This shows even in the rc scripts.)

Now I'm going to have to look through your other projects.

Thanks for releasing this.

- BurnZeZ

Bug:
kirara-1.1/INSTALL:9: mkdir -p $kirarar/bin/^(rc $objtype)

Here (and on line 11), '$kirarar' is used instead of '$kirara'.
--upas-xgljvioaohfaqwaewbzosbxttp
Content-Type: message/rfc822
Content-Disposition: inline

Return-Path: <9fans-bounces@9fans.net>
Delivered-To: brz-systemd-dev@intma.in
Received: (qmail 34532 invoked by uid 1005); 5 Aug 2013 21:18:45 -0400
Received: from mail.9fans.net (67.207.142.3)
    by intma.in with SMTP; 5 Aug 2013 21:18:45 -0400
Received: from localhost ([127.0.0.1] helo=[67.207.142.3])
    by mail.9fans.net with esmtp (Exim 4.71)
    (envelope-from <9fans-bounces@9fans.net>)
    id 1V6W3m-0005Q7-G4; Tue, 06 Aug 2013 01:27:38 +0000
Received: from gw19.lax01.mailroute.net ([199.89.0.119] helo=mail.mailroute.net)
    by mail.9fans.net with esmtp (Exim 4.71)
    (envelope-from )
    id 1V6W3k-0005Q2-5p
    for 9fans@9fans.net; Tue, 06 Aug 2013 01:27:36 +0000
Received: from localhost (localhost.localdomain [127.0.0.1])
    by gw19.lax01.mailroute.net (Postfix) with ESMTP id 3c8Hv85zg4z1jvhH
    for <9fans@9fans.net>; Tue, 6 Aug 2013 01:14:44 +0000 (GMT)
X-Virus-Scanned: by MailRoute
X-X-Spam-Flag: NO
X-X-Spam-Score: 0.001
X-X-Spam-Level:
X-X-Spam-Status: No, score=0.001 tagged_above=-9999
    tests=[RCVD_VIA_APNIC=0.001] autolearn=disabled
Received: from gw19.lax01.mailroute.net ([127.0.0.1])
    by localhost (gw19.lax01.mailroute.net.mailroute.net [127.0.0.1])
    (mroute_mailscanner, port 10024) with LMTP id YEhOfxfUMtw0
    for <9fans@9fans.net>; Tue, 6 Aug 2013 01:14:42 +0000 (GMT)
Received: from ar.aichi-u.ac.jp (ar.aichi-u.ac.jp [202.250.160.40])
    by gw19.lax01.mailroute.net (Postfix) with ESMTP id 3c8Hv63NDvz1jvhP
    for <9fans@9fans.net>; Tue, 6 Aug 2013 01:14:42 +0000 (GMT)
Received: from [192.168.1.106] ([125.192.156.8]) by ar; Tue Aug 6 10:14:37 JST 2013
From: arisawa
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: quoted-printable
Message-Id: <5E38D9FC-75B6-40C9-AB64-E210DACE0B4E@ar.aichi-u.ac.jp>
Date: Tue, 6 Aug 2013 10:14:36 +0900
To: 9fans@9fans.net
Mime-Version: 1.0 (Mac OS X Mail 6.3 \(1503\))
X-Mailer: Apple Mail (2.1503)
Subject: [9fans] text database Kirara
X-BeenThere: 9fans@9fans.net
X-Mailman-Version: 2.1.13
Precedence: list
Reply-To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
List-Id: Fans of the OS Plan 9 from Bell Labs <9fans.9fans.net>
Sender: 9fans-bounces@9fans.net
Errors-To: 9fans-bounces@9fans.net

Hello 9fans,

I have written a text database named Kirara.
The following is a brief introduction to Kirara.
If you are interested, get Kirara from:
http://plan9.aichi-u.ac.jp/netlib/kirara/

Kenji Arisawa

-------------
Kirara
-------------
Kirara is a text indexing/retrieval tool for Plan 9.
Personal use: index/retrieve local files.

Kirara is based on an idea similar to Glimpse.
(1) indexing + grep
(2) multi-level indexing
(a) small space for indexing
(b) small update time
(c) quick search

Note that:
    small index <-> quick search (a trade-off)
    Kirara builds more index -> quicker search
    Glimpse is single-level indexing.

-------------
Query

Kirara does not support phrase search.
The database is an index of words, supporting:

QE mode (query expression mode)
    '&', '|', '*'
    Examples:
        'snoopy&html'
        'snoop*&htm*'

RE mode (regular expression mode)
    '&', RE
    where RE denotes a regular expression.
    Example:
        'sn.*y&h.+l'
    RE mode is a bit slow (a few seconds).

-------------
Words

A word is two or more runes.
All words are converted to lower case.
In English, words are composed of alphabetic characters.
The number of runes is configurable.

Assumption:
    Text is composed of space-separated words.
    This is common in English and many European languages,
    but not in Japanese.

-------------
The user's interface

Best match with Rio

term% kfind snoop
G snoop /sys/src/9/ip/
G snoop /sys/src/cmd/spell/
G snoop /sys/src/9/kw/
...
term% G snoop /sys/src/9/ip
devip.c:34: Qsnoop,
devip.c:95: case Qsnoop:
devip.c:98: devdir(c, q, "snoop", qlen(cv->sq), cv->owner, 0400, dp);
...

Note that there are two steps:
    1. find directories
    2. find files and the contents
Step 2 is actually 'grep', so we can use RE.
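
The two steps above can be sketched roughly as follows. This is a toy
illustration in Python, not Kirara's code (Kirara itself is rc scripts
plus some C tools). The in-memory INDEX and FILES tables, their sample
contents, and the function names dirs_for_term, find_dirs and grep_dir
are all assumptions made for the example. The precedence of '&' over '|'
is also an assumption, since the announcement does not state it.

```python
import re

# Hypothetical word -> directories index. Kirara's real index lives on
# disk and maps words to directory QIDs; a dict stands in for it here.
INDEX = {
    "snoop": {"/sys/src/9/ip", "/sys/src/cmd/spell"},
    "snoopy": {"/sys/src/cmd/ip/snoopy"},
    "html": {"/sys/src/cmd/ip/snoopy", "/sys/doc"},
    "htm": {"/sys/doc"},
}

# Hypothetical file contents, keyed by directory and then file name.
FILES = {
    "/sys/src/9/ip": {
        "devip.c": ["enum {", "\tQsnoop,", "};", "\tcase Qsnoop:"],
    },
}


def dirs_for_term(term, index):
    """Return the directories indexed under one query term.

    A trailing '*' is treated as a prefix wildcard. Terms are lowered
    because the index holds lower-case words only.
    """
    term = term.strip().lower()
    if term.endswith("*"):
        prefix = term[:-1]
        hits = [dirs for word, dirs in index.items() if word.startswith(prefix)]
    else:
        hits = [index.get(term, set())]
    # Build a fresh set so the caller can modify it without touching INDEX.
    return set().union(*hits)


def find_dirs(query, index):
    """Step 1: evaluate a QE-style query such as 'snoop*&htm*'.

    Assumption: '&' binds tighter than '|'.
    """
    result = set()
    for alternative in query.split("|"):
        terms = alternative.split("&")
        dirs = dirs_for_term(terms[0], index)
        for term in terms[1:]:
            dirs &= dirs_for_term(term, index)
        result |= dirs
    return sorted(result)


def grep_dir(pattern, directory, files):
    """Step 2: grep the files of one directory with a regular expression."""
    regex = re.compile(pattern)
    out = []
    for name, lines in sorted(files.get(directory, {}).items()):
        for lineno, line in enumerate(lines, start=1):
            if regex.search(line):
                out.append(f"{name}:{lineno}: {line.strip()}")
    return out


if __name__ == "__main__":
    for d in find_dirs("snoop", INDEX):
        print("G snoop", d)
    for hit in grep_dir("snoop", "/sys/src/9/ip", FILES):
        print(hit)
```

Step 1 touches only the small word index, and the expensive scan in
step 2 is confined to the one directory the user picks.
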
Two-step search is not a weakness but a desirable feature,
because so many files are hit by a query.

-------------
The organization

My example
/n/other/kirara/sysdb
    target: (/lib /sys/lib /sys/src /sys/man /sys/include /sys/doc /rc)
/n/other/kirara/usrdb
    target: $home/^(bin/rc lib netlib doc adm issues srclib src sources)

The indexing target is fully configurable.

-------------
Multi-Level Indexing

(1) Indexing (top level)
word to directory mapping
    sysdb/index     # main index
                    # used for RE mode
    sysdb/mindex    # meta index (alphabetic index)
                    # used for QE mode
    sysdb/dind/*    # rough index of each directory
    sysdb/QTDir     # map table (QID, mtime, path-to-dir)

index   # word to dir QID
    aa 0000000000014f0a
    aa 000000000001a1e0
    aa 000000000001a26e
mindex  # word to range in index
    aa 0 126669
    ab 126669 491569
    ac 491569 1258566
    ad 1258566 1852467
    ...
dind/*  # `*' is a directory QID
    0000000000014f05
    0000000000014f0a
    000000000001a1ce

usrdb is the same.

(2) Indexing (directory level) # optional
word to file mapping
    sysdb/find/*/ind.gz     # fine index of the directory (gzipped)
    sysdb/find/*/qtn        # map table (QID, mtime, name)
where `*' is a directory QID

usrdb is the same as sysdb.

-------------
Experiment

(a) hardware
    GA-H61M-USB3-B3
    Intel Pentium G860 (3GHz)
    DDR3 PC3 4GB
(b) software
    9front
    cwfs64x

-------------
The performance (compression ratio)

    db       target size    num_of_dirs    index size
    sysdb:   556 MB         1790 dirs      49 MB
    usrdb:   6620 MB        8948 dirs      150 MB

compression ratio: 49/556 (sysdb), about 9%
note: usrdb includes many non-text files.

-------------
The performance (retrieval time)

system dependent
QE search # kfind foo
    0.1 seconds.
    It is not important to make this time smaller.
    (It is already sufficiently small.)
RE search # kfind -r foo
    a few seconds

-------------
The performance (construction/update)

(a) Construction time
    system dependent
    Initial construction needs
        10 minutes for sysdb
        30 minutes for usrdb
(b) Updating time
    two commands for update
    mkdb
        20 seconds to a few minutes for usrdb
        depends largely on the state of the cache
    mkdb1 (currently only for usrdb)
        5 to 15 seconds for usrdb
        mkdb1 needs an event log

-------------
Scalability

Main factors
(a) retrieval time
    QE search: proportional to the number of dirs that include the query
    RE search: proportional to the size of the index
(b) initial construction time
    proportional to total data
(c) update time
    mkdb: proportional to the number of dirs and the changes
    mkdb1: proportional to the changes and the size of the index

-------------
Used Tools

(1) rc
(2) grep, sed, awk, sort, diff, gzip, ...
(3) some new tools written in C

-------------
What does Kirara mean?

Kirara is the name of a girl who appears in a Japanese comic book.
(But I have never read the book.)
The name is seldom used in the real world.
From the name, we Japanese imagine something glittering.
I like the name.

-------------
References

[1] Udi Manber and Sun Wu,
    GLIMPSE: A Tool to Search Through Entire File Systems (1993)
    http://webglimpse.net/pubs/glimpse.pdf
[2] Glimpse Documentation
    http://webglimpse.net/gdocs/glimpsehelp.html

--upas-xgljvioaohfaqwaewbzosbxttp--