From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Delivered-To: caml-list@yquem.inria.fr Received: from concorde.inria.fr (concorde.inria.fr [192.93.2.39]) by yquem.inria.fr (Postfix) with ESMTP id B6D90BC87 for ; Thu, 17 Mar 2005 03:12:25 +0100 (CET) Received: from first.in-berlin.de (dialin-145-254-064-192.arcor-ip.net [145.254.64.192]) by concorde.inria.fr (8.13.0/8.13.0) with ESMTP id j2H2CM3I008342 for ; Thu, 17 Mar 2005 03:12:23 +0100 Received: by first.in-berlin.de (Postfix, from userid 501) id 19891BC443; Thu, 17 Mar 2005 02:55:37 +0100 (CET) Date: Thu, 17 Mar 2005 02:55:37 +0100 From: Oliver Bandel To: caml-list@yquem.inria.fr Subject: Re: [Caml-list] OCaml mentioned in BraveGNUWorld Message-ID: <20050317015537.GF439@first.in-berlin.de> References: <20050315082053.GM321@first.in-berlin.de> <20050316212544.GA21821@pegasos> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20050316212544.GA21821@pegasos> User-Agent: Mutt/1.5.6i X-Miltered: at concorde with ID 4238E786.000 by Joe's j-chkmail (http://j-chkmail.ensmp.fr)! X-Spam: no; 0.00; oliver:01 bandel:01 oliver:01 in-berlin:01 caml-list:01 ocaml:01 sven:01 luther:01 bandel:01 ocaml:01 caml-list:01 regexps:01 regexps:01 abstraction:01 notation:01 X-Spam-Checker-Version: SpamAssassin 3.0.2 (2004-11-16) on yquem.inria.fr X-Spam-Status: No, score=0.1 required=5.0 tests=DRUGS_ERECTILE, FORGED_RCVD_HELO autolearn=disabled version=3.0.2 X-Spam-Level: On Wed, Mar 16, 2005 at 10:25:44PM +0100, Sven Luther wrote: > On Tue, Mar 15, 2005 at 09:20:54AM +0100, Oliver Bandel wrote: > > > > Hello, > > > > a small tool I wrote (mbox-cleaner) was mentioned in > > the current BraveGNUWorld. > > OCaml was mentioned there too and it was written, > > that "for a scripting language it's fast". ;-) > > > > So maybe OCaml will be more known in the future, if people > > would do more marketing on it. ;-) > > > > (But other things written in that article were not all correct :( > > but ... well...... ) > > > > > > http://www.linux-magazin.de/Artikel/ausgabe/2005/04/gnuwelt/gnu.html > > (in german only) > > > > (I don't know why since many months there are no english translations of the BraveGNUWorld ...)...) > > Hi Oliver, ... > > I read the article with interest, and i saw it mentioned spam-filtering too. Yes. I mentioned spam-filtering to Georg Greve, but as mentioned above, some things were out of context and therefore it may be irritating (and seems it IS irritating), what in the BraveGNUWorld was to read. I mentioned Spam-filtering to Georg Greve in two ways: 1.: in respect to the tool mbox-cleaner it was: It's useful to use that tool to filer out multiple mails with same contents (doublettes), when you may have a folder that will be the spam-folder. If you use mbox-cleaner to clean it up, it may shrinks the spam-folder to maybe 0.1 the size or less, than before - depending on how many spam with the same mailcontents you received. Before filtering my mailaccount on allowed/disallowed mail addresses, I had toms of spam that goes to random mail addresses. during weeks/months I had many hundred or many thousands (sic!) mails with same contents sent to many random addresses on my computer. So for a while I saved the spam in a special folder. I had written stuff in Perl that splits up mails into different files. Then I used my tool "multiple" to throw away files with same body. Then I use cat to rebuild a mbox-folder that contains only mails with unique body. Now I can use mbox-cleaner to do that job. But as I now filter out unused mailaddresses this is not a typical usage for that tool for me now, because pressure by spam has shrinked down to about 1% of before. 10 or 20 spammails a day is much less (correct english to say "much less"??!) than 800 or 2000 mails spam a day... But when getting mails as PM as well as via a mailing list (e.g. Caml-list), but saving them in the same folder (tagging mails with "Caml-list" in subject to sort them into the folder, I could use that tool to throw out the unnecessary stuff (same body of mail). 2.: My provider uses more than one tool to filter spam, and some can be configured by the user. One of these tools is websieve. But I hate the web-GUI for that tool. I have to use regexp's (and it seems that stuff doesn't work) and it's ugly to use regexps for every task, when they do not provide things like character classes... if I have to use regexps for matching instead of saying "filter out any charcters with ascii code > 127 and beyond 32 but allow german umlauts". So a stuff that is more high level would be fine. So I thought about implementing a tool that can filter on different levels of abstraction. Not using regexp's for all... not using charcter codes for all... not using charcter classes for all... (and so on). Instead using THAT description of the mail's contents to filter out, that is good to match the stuff that should be filtered. For example: If there is a lot of Spam that contains a certain cahacter or certain character codes than it makes sense to filter out all such stuff based on charcodes. (e.g.: if I disabled UTF-8 and I get stupid characters on the screen, when displaying the subject of the mail, I know it's <> ISO-8859-1 and if I don't want to accept mails with different charset, I can filter it out => So I thought about filtering out charclasses. => filtering on charcode / charcode sets --> who cares about regexp's description with octal notation, if I can say: "block all stuff that is <> ISO-8859-1" or "allow ISO-8859-1 as well as japanese" or sometging like that. If there is a certain string (keywords/phrases) that is copntained in the spam-mails, then I ignore the charset and look at the contents of that mails (e.g. contents of the subject). => filtering on strings/phrases --> Who cares about a certain charcode, if it is possible to allow/disallow certain phrases (maybe directly encoded in a certain language charset)... And many other ways to describe text data on a higher-level way and in a style that depends on the spam-data as well as the taste of the one who writes the filter rules... disallow.languages.all # should not allow any char to pass the filter => always SPAM (like let f x = true) allow.language.english allow.language.us allow.language.german allow.language.japanese allow.charset.iso8859-1 disallow.charcode.above.127 disallow.charcode.below.32 disallow.charcode.equals.char.'\t' allow.charcode.equals.char.'ä' allow.charcode.equals.char.'ö' allow.charcode.equals.char.'ü' allow.charcode.equals.char.'Ä' allow.charcode.equals.char.'Ö' allow.charcode.equals.char.'Ü' allow.charcode.equals.char.'ß' allow.charcode.equals.char.'\n' allow.charcode.equals.char.'\r' disallow.stringmatch."viagra" disallow.stringmatch."v.i.a.g.r.a" disallow.subjectline.length.above.120 disallow.subjectline.length.below.3 disallow.subjectline.stringmatch.no_case." ! ! look here " disallow.subjectline.re_stringmatch.no_case.".*v.+i.+a.+g.+r.+a" disallow.body.ask.spamassassin disallow.body.ask.my_little_spam_filter_that_matches_the_rest # ;-) disallow.body.ask.your_little_spam_filter_that_matches_the_rest_of_the_rest # ;-) (or something like that) And a newer rule overwrites older rules (as in Postscript new paint overpaints old paint). Such an approach seems to be more practical in writing ad-hoc spamfilter rules than to be constrained to have only classical Regexp-notation, using octal numbers to encode chars that can't be matched by the regexp's in an other way. > > How does it compare/interact/whatever to Xavier's spamoracle tool, and do you I don't know this tool. I will google for it (or, if you have a link there...) > think a generic mail library containing mail filtering code would be a good > thing ? Well, I really think such a maillib would be a nice thing. I'm currently writing a Mbox-module that supports filtering mechanisms. During the weekend I had a littlebid time and worked on it. Filtering mail is easy with this lib. It's only special for mbox-files because of the lexer (I used the mbox-lexer from my mbox-cleaner tool). If providing a different lexer it should be easy to use it to read/write different mailbox formats too. OCaml is so powerful that this should be easy to do. What already is working: ============================================================ type mbox_t type mail_t type match_t = Case | NoCase (* case-sensitive/case insensitive pattern matching *) (* ================================= *) (* basic functions for mbox-handling *) (* ================================= *) val read : string -> mbox_t val show : mbox_t -> unit val write_force : mbox_t -> unit val write : mbox_t -> unit val change_name : mbox_t -> string -> mbox_t (* maybe "renamed_copy" would be a better name...?! *) val append : mbox_t -> mbox_t -> mbox_t val append_rename : mbox_t -> mbox_t -> string -> mbox_t (* these two maybe can be romeved now... was helping me in development. *) (* but maybe it helps other developers (library users) too?! *) val dummy_matcher : mail_t -> bool (* always returns true *) val dummy_nonmatcher : mail_t -> bool (* always returns false *) (* ============================ *) (* FILTERing/MATCHing functions *) (* ============================ *) val filter : (mail_t -> bool) -> mbox_t -> mbox_t val filter_rename : (mail_t -> bool) -> mbox_t -> string -> mbox_t val create_header_match : (string -> bool) -> (mail_t -> bool) val create_body_match : (string -> bool) -> (mail_t -> bool) val match_mail_contains : char -> (mail_t -> bool) val match_header_contains : char -> (mail_t -> bool) val match_body_contains : char -> (mail_t -> bool) val match_mail_regexp : string -> match_t -> (mail_t -> bool) val match_body_regexp : string -> match_t -> (mail_t -> bool) val match_header_regexp : string -> match_t -> (mail_t -> bool) val partition_rename : (mail_t -> bool) -> mbox_t -> string -> string -> mbox_t * mbox_t (* let partition_rename matcher mbox name_matched name_unmatched = (...) *) ============================================================ What I think about implementing (interface may change a lot, because I have to think about it in more detail): ============================================================ val multi_partition : (string * (mail_t -> bool)) list -> mbox_t -> mbox_t list val get_mails : (mail_t -> bool) -> mbox_t -> (string,string) list (* don't know if I provide these two *) val rewind : mbox_t -> unit (* => rewinds mbox-pointer to first mail in mbox *) val next_mail : unit -> mail_t (* => gets next mail from mbox *) val create_mail_match : (string -> bool) -> mail_t -> bool (* ################ *) (* RegExp-Matchings *) (* ################ *) val match_mail_contains_crange : (char * char) -> (mail_t -> bool) val match_header_contains_crange : (char * char) -> (mail_t -> bool) val match_body_contains_crange : (char * char) -> (mail_t -> bool) val match_mail_contains_irange : (int * int) -> (mail_t -> bool) val match_header_contains_irange : (int * int) -> (mail_t -> bool) val match_body_contains_irange : (int * int) -> (mail_t -> bool) val header_contains : string -> mail_t -> bool val body_contains : string -> mail_t -> bool val grep : string -> mail_t -> string list val grep_nocase : string -> mail_t -> string list OR better: val grep : string -> match_t -> mail_t -> string list (* Case/NoCase *) val func : (string -> 'a) -> (mail_t -> 'a) (* iter/map/filter are preserving the order of the mails *) val iter : (mail_t -> unit) -> mbox_t -> unit val map : (mail_t -> 'a) -> mbox_t -> 'a list val sort : (string -> string -> int) -> mbox_t -> mbox_t ============================================================ Examples of usage: let filename = "sent-mail.experimental" in (* set mbox-filename *) let input_mbox = Mbox.read filename in (* reading mbox-file *) let rematcher = Mbox.match_body_regexp ".*ocaml" Mbox.NoCase (* create matcher-function *) in (* next line: splitting the mbox into matching/nonmatching mails *) let result = Mbox.partition_rename rematcher mb2 "OCAML-matched" "OCAML-NON-MATCHED" in print_endline "MATCHING OCAML:"; Mbox.show (fst result); (* prints resulting (matching) mbox to stdout *) print_endline "**NOT** MATCHING OCAML:"; Mbox.show (snd result); (* prints resulting (not matching) mbox to stdout *) (* now write new renew sulting mbox-files *) Mbox.write (fst result); (* writes matching mbox (filename: "OCAML-matched" *) Mbox.write (snd result) (* writes non-matching mbox (filename: "OCAML-NON-MATCHED") *) When using Mbox.multi_partition (not implemented right now), the tool get's extraordinary strength (IM(H?)O). The you can implement complex mail/mbox--filtering mechanisms with only some lines of code. The library provides powerful functions and adapting some CLI-stuff to the Lib seems to be the main task to implement a lot of different small (or some or one big mail/mbox-filtering tool). As you see above, I want to have matching mechanisms that work on chars as well as ints as well as strings/regexps to provide huge flexibility. And that approach matches the things I want to have in a Spam-detector-/blocker-tool. So writing my Mbox-library can go into the direction of Spam-detection, even if it's not the main intention right now. Ciao, Oliver P.S.: The charset operations may need to have UTF-8 functions inside OCaml. But IMHO that also could be done easily, because it seems to me (in very small example code) that OCaml-C binding really is easy/simple! :) So adding a unicode lib seems not to be a terrifying job.