Gnus development mailing list
* Splitting based on character sets
@ 2011-04-03 18:03 Lars Magne Ingebrigtsen
  2011-04-04 13:42 ` Ted Zlatanov
  0 siblings, 1 reply; 9+ messages in thread
From: Lars Magne Ingebrigtsen @ 2011-04-03 18:03 UTC (permalink / raw)
  To: ding

90% of the spam that gets through SpamAssassin is in Russian, but
encoded in utf-8.

Does anybody have a nice recipe for how to split articles with Subject
headers that are obviously Russian into the spam group?
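One way to approximate "obviously Russian" is to decode the RFC 2047
encoded Subject and measure what share of its letters are Cyrillic.
The following is a minimal Python sketch of that idea, not a Gnus
recipe: the function names and the 0.5 threshold are hypothetical, and
a Cyrillic check cannot tell Russian apart from Bulgarian or Ukrainian.

```python
import unicodedata
from email.header import decode_header, make_header


def decoded_subject(raw_subject):
    """Decode RFC 2047 encoded-words (=?utf-8?B?...?=) into plain text."""
    try:
        return str(make_header(decode_header(raw_subject)))
    except (LookupError, UnicodeError):
        # Unknown charset or undecodable bytes: fall back to the raw value.
        return raw_subject


def cyrillic_ratio(text):
    """Return the fraction of alphabetic characters that are Cyrillic."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    cyrillic = sum(1 for ch in letters
                   if "CYRILLIC" in unicodedata.name(ch, ""))
    return cyrillic / len(letters)


def looks_cyrillic(raw_subject, threshold=0.5):
    """Hypothetical helper: True if most letters in the Subject are Cyrillic."""
    return cyrillic_ratio(decoded_subject(raw_subject)) >= threshold
```

A split rule could then send any article for which looks_cyrillic()
returns true to the spam group.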

-- 
(domestic pets only, the antidote for overdose, milk.)
  bloggy blog http://lars.ingebrigtsen.no/





* Re: Splitting based on character sets
  2011-04-03 18:03 Splitting based on character sets Lars Magne Ingebrigtsen
@ 2011-04-04 13:42 ` Ted Zlatanov
  2011-04-12 16:26   ` Lars Magne Ingebrigtsen
  0 siblings, 1 reply; 9+ messages in thread
From: Ted Zlatanov @ 2011-04-04 13:42 UTC (permalink / raw)
  To: ding

On Sun, 03 Apr 2011 20:03:48 +0200 Lars Magne Ingebrigtsen <larsi@gnus.org> wrote: 

LMI> 90% of the spam that gets through SpamAssassin is in Russian, but
LMI> encoded in utf-8.

LMI> Does anybody have a nice recipe for how to split articles with Subject
LMI> headers that are obviously Russian into the spam group?

Use CRM114 before mail is delivered.  It works *really well* for Russian
and any other languages (at least Bulgarian and Spanish in my experience).

Ted





* Re: Splitting based on character sets
  2011-04-04 13:42 ` Ted Zlatanov
@ 2011-04-12 16:26   ` Lars Magne Ingebrigtsen
  2011-04-12 16:44     ` David Engster
  2011-04-12 16:48     ` Adam Sjøgren
  0 siblings, 2 replies; 9+ messages in thread
From: Lars Magne Ingebrigtsen @ 2011-04-12 16:26 UTC (permalink / raw)
  To: ding

Ted Zlatanov <tzz@lifelogs.com> writes:

> Use CRM114 before mail is delivered.  It works *really well* for Russian
> and any other languages (at least Bulgarian and Spanish in my experience).

Hm...  any particular reason CRM114 isn't used by SpamAssassin already?
(I've just skimmed the CRM114 page, and I'm somewhat unclear about what
it does.  :-)

-- 
(domestic pets only, the antidote for overdose, milk.)
  bloggy blog http://lars.ingebrigtsen.no/





* Re: Splitting based on character sets
  2011-04-12 16:26   ` Lars Magne Ingebrigtsen
@ 2011-04-12 16:44     ` David Engster
  2011-04-12 16:48     ` Adam Sjøgren
  1 sibling, 0 replies; 9+ messages in thread
From: David Engster @ 2011-04-12 16:44 UTC (permalink / raw)
  To: ding

Lars Magne Ingebrigtsen writes:
> Ted Zlatanov <tzz@lifelogs.com> writes:
>
>> Use CRM114 before mail is delivered.  It works *really well* for Russian
>> and any other languages (at least Bulgarian and Spanish in my experience).
>
> Hm...  any particular reason CRM114 isn't used by SpamAssassin already?
> (I've just skimmed the CRM114 page, and I'm somewhat unclear about what
> it does.  :-)

There is a CRM114 plugin for SpamAssassin; it's in the "CoolThings"
section of the CRM114 site. It may be that it's well suited for foreign
languages, but I tried it some time ago, and wasn't particularly
impressed, especially regarding the elaborate setup. The thing which
made me drop it was that I got false positives (yes, I read the docs and
trained it correctly). Middle-of-the-road SpamAssassin in combination
with the Bayes plugin, Razor and a few blacklists catches practically
all spam for me, without any false positives.

-David




* Re: Splitting based on character sets
  2011-04-12 16:26   ` Lars Magne Ingebrigtsen
  2011-04-12 16:44     ` David Engster
@ 2011-04-12 16:48     ` Adam Sjøgren
  2011-04-12 17:20       ` Ted Zlatanov
  1 sibling, 1 reply; 9+ messages in thread
From: Adam Sjøgren @ 2011-04-12 16:48 UTC (permalink / raw)
  To: ding

On Tue, 12 Apr 2011 18:26:05 +0200, Lars wrote:

> (I've just skimmed the CRM114 page, and I'm somewhat unclear about what
> it does.  :-)

The short version, I think, is that CRM114 is a language that is good at
classifying text. It comes with a mailfilter.crm program that works
quite nicely.

I guess the result is similar to bogofilter, only crazy along a
different axis.


  Best regards,

    Adam

-- 
 "Shining for the sun is what we do"                          Adam Sjøgren
                                                         asjo@koldfront.dk





* Re: Splitting based on character sets
  2011-04-12 16:48     ` Adam Sjøgren
@ 2011-04-12 17:20       ` Ted Zlatanov
  2011-04-14  8:28         ` David Engster
  0 siblings, 1 reply; 9+ messages in thread
From: Ted Zlatanov @ 2011-04-12 17:20 UTC (permalink / raw)
  To: ding

On Tue, 12 Apr 2011 18:44:49 +0200 David Engster <deng@randomsample.de> wrote: 

DE> Lars Magne Ingebrigtsen writes:
>> Ted Zlatanov <tzz@lifelogs.com> writes:
>> 
>>> Use CRM114 before mail is delivered.  It works *really well* for Russian
>>> and any other languages (at least Bulgarian and Spanish in my experience).
>> 
>> Hm...  any particular reason CRM114 isn't used by SpamAssassin already?
>> (I've just skimmed the CRM114 page, and I'm somewhat unclear about what
>> it does.  :-)

DE> There is a CRM114 plugin for SpamAssassin; it's in the "CoolThings"
DE> section of the CRM114 site. It may be that it's well suited for foreign
DE> languages, but I tried it some time ago, and wasn't particularly
DE> impressed, especially regarding the elaborate setup. The thing which
DE> made me drop it was that I got false positives (yes, I read the docs and
DE> trained it correctly). Middle-of-the-road SpamAssassin in combination
DE> with the Bayes plugin, Razor and a few blacklists catches practically
DE> all spam for me, without any false positives.

I've been happy with CRM114 (since the first Spam Conference :) so I
can't say why it didn't work for you.  I like that it only has one way
to classify spam, as opposed to the SA multi-pronged approach.

It was definitely better against foreign languages than SA 5 years ago,
when I tested it.

Ted





* Re: Splitting based on character sets
  2011-04-12 17:20       ` Ted Zlatanov
@ 2011-04-14  8:28         ` David Engster
  2011-04-14 14:54           ` Ted Zlatanov
  2011-05-01 16:48           ` Lars Magne Ingebrigtsen
  0 siblings, 2 replies; 9+ messages in thread
From: David Engster @ 2011-04-14  8:28 UTC (permalink / raw)
  To: ding

Ted Zlatanov writes:
> On Tue, 12 Apr 2011 18:44:49 +0200 David Engster <deng@randomsample.de> wrote: 
> DE> There is a CRM114 plugin for SpamAssassin; it's in the "CoolThings"
> DE> section of the CRM114 site. It may be that it's well suited for foreign
> DE> languages, but I tried it some time ago, and wasn't particularly
> DE> impressed, especially regarding the elaborate setup. The thing which
> DE> made me drop it was that I got false positives (yes, I read the docs and
> DE> trained it correctly). Middle-of-the-road SpamAssassin in combination
> DE> with the Bayes plugin, Razor and a few blacklists catches practically
> DE> all spam for me, without any false positives.
>
> I've been happy with CRM114 (since the first Spam Conference :) so I
> can't say why it didn't work for you.  I like that it only has one way
> to classify spam, as opposed to the SA multi-pronged approach.

See, that's exactly what I like about SA. :-)

I've long tried to find a single, pure black-box machine-learning
spam detector which could rival SpamAssassin, the main reason being that
SA can be such a memory hog on smaller servers. But I always came back.

> It was definitely better against foreign languages than SA 5 years ago,
> when I tested it.

At least for German, I subscribe to a custom rule list which is
regularly updated. But most of the German spam is actually caught by
iXhash (similar to Razor/Pyzor) and iXRBL, which are also located in
Germany.

But I guess what Lars originally asked was not to classify mails in
Russian, but to just classify them as spam, since he doesn't get Russian
ham. For this, one can use the Textcat plugin from SA, which will try to
guess the language of the mail and include an X-Language header.
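
As a rough illustration of acting on such a header, here is a small
Python sketch that picks a destination group from the detected
languages. The header name X-Language is taken from the description
above and may differ depending on how SpamAssassin is configured; the
function names, the group names and the "ru" language code are
assumptions for illustration only.

```python
from email import message_from_string


def detected_languages(raw_message, header_name="X-Language"):
    """Return the set of language codes listed in the given header."""
    msg = message_from_string(raw_message)
    value = msg.get(header_name, "")
    # Accept both space- and comma-separated lists such as "ru en" or "ru, en".
    return {code.lower() for code in value.replace(",", " ").split()}


def target_group(raw_message, unwanted=("ru",), spam_group="spam",
                 default_group="inbox", header_name="X-Language"):
    """Choose a group name: spam_group if any unwanted language was detected."""
    if detected_languages(raw_message, header_name) & set(unwanted):
        return spam_group
    return default_group
```

A message without the header simply falls through to the default
group, so the check only has an effect when the language guess is
present.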

-David




* Re: Splitting based on character sets
  2011-04-14  8:28         ` David Engster
@ 2011-04-14 14:54           ` Ted Zlatanov
  2011-05-01 16:48           ` Lars Magne Ingebrigtsen
  1 sibling, 0 replies; 9+ messages in thread
From: Ted Zlatanov @ 2011-04-14 14:54 UTC (permalink / raw)
  To: ding

On Thu, 14 Apr 2011 10:28:24 +0200 David Engster <deng@randomsample.de> wrote: 

DE> I've long tried to find a single, pure black-box machine-learning
DE> spam detector which could rival SpamAssassin, the main reason being
DE> that SA can be such a memory hog on smaller servers. But I always
DE> came back.

Yeah, the memory usage killed SA for me.  With 1-10 users it's OK, but
for a large mail server (where I worked at the time) it was not usable
in my testing.  When I stopped using SA there I also stopped using it
for myself.

DE> But I guess what Lars originally asked was not to classify mails in
DE> Russian, but to just classify them as spam, since he doesn't get
DE> Russian ham.

Yeah, I was thinking sort of sideways to his question.  Sorry.

DE> For this, one can use the Textcat plugin from SA, which will try to
DE> guess the language of the mail and include an X-Language header.

Yeah, that should work.  That's really useful, thanks for mentioning it.

Ted





* Re: Splitting based on character sets
  2011-04-14  8:28         ` David Engster
  2011-04-14 14:54           ` Ted Zlatanov
@ 2011-05-01 16:48           ` Lars Magne Ingebrigtsen
  1 sibling, 0 replies; 9+ messages in thread
From: Lars Magne Ingebrigtsen @ 2011-05-01 16:48 UTC (permalink / raw)
  To: ding

David Engster <deng@randomsample.de> writes:

> But I guess what Lars originally asked was not to classify mails in
> Russian, but to just classify them as spam, since he doesn't get Russian
> ham. For this, one can use the Textcat plugin from SA, which will try to
> guess the language of the mail and include an X-Language header.

Ah, right.  Thanks, I've now installed it on the MTA.

-- 
(domestic pets only, the antidote for overdose, milk.)
  bloggy blog http://lars.ingebrigtsen.no/





