Gnus development mailing list
 help / color / mirror / Atom feed
* messkeyw.el
@ 1998-10-28 19:23 Karl Kleinpaste
  1998-10-30 11:57 ` messkeyw.el [patch] Karl Kleinpaste
  0 siblings, 1 reply; 3+ messages in thread
From: Karl Kleinpaste @ 1998-10-28 19:23 UTC (permalink / raw)


This is a new module for Gnus, which provides for keyword generation
at the time you send a message.  Tie it in with
	(add-hook 'message-send-hook 'message-keyword-insert)
plus either an autoload of message-keyword-insert or a (require
'messkeyw), and you're done.

Several config variables:
- a regexp of stopwords, not to be counted (default English set provided).
- limits on #keywords to be generated, based on article length.
- line length limits, below and above which keywords are not generated.
  (triviality, below; excessive computation, above.)
- selectively interactive use and editable result, so you can make
  your own choices if you don't like what the generator found.

Based on rather simplistic word counting.  I've been using this module
for just about 2 years now; most of my messages are sent with a
Keywords header.  Highly effective.

Works fine with any Gnus since message mode has existed, including
current 5.6.* and Pterodactyl.

With this, plus recent pgnus' ability to do extra NOV headers, good
scoring upon words of significance to oneself can be done, at least
among Gnusers. :-)  Convince your newsadmins to enable Keywords in
overview.fmt (at least passively, with article-supplied Keywords, if
not active server-side auto-gen) and we might even start a mighty
useful trend.

--karl

;;; messkeyw.el --- automatic keyword support during composition.
;; Copyright (C) 1996,1997,1998 Free Software Foundation, Inc.

;; Author: Karl Kleinpaste <karl@jprc.com>
;; Keywords: mail, news, keywords

;; This file is part of GNU Emacs.

;; GNU Emacs is free software; you can redistribute it and/or modify
;; it under the terms of the GNU General Public License as published by
;; the Free Software Foundation; either version 2, or (at your option)
;; any later version.

;; GNU Emacs is distributed in the hope that it will be useful,
;; but WITHOUT ANY WARRANTY; without even the implied warranty of
;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
;; GNU General Public License for more details.

;; You should have received a copy of the GNU General Public License
;; along with GNU Emacs; see the file COPYING.  If not, write to the
;; Free Software Foundation, Inc., 59 Temple Place - Suite 330,
;; Boston, MA 02111-1307, USA.

;;; Commentary:

;; This provides a hookable mechanism by which to have Keywords
;; headers automatically generated based on word frequency of the
;; body.  The goal is to make it possible to score on Keywords
;; provided, of course, that Keywords gets to the overview files.

;; To make this work, all one need do is:
;; (add-hook 'message-send-hook 'message-keyword-insert)

;; Keywords get into overviews at the server if it is capable and
;; configured to do so.  In INN, since 1.5b2, article-supplied
;; Keywords can be gotten into overviews by enabling the line
;; "Keywords:full" in overview.fmt.

;; Note as well that INN since 2.0 has been capable of doing server-
;; side keyword auto-generation, whether or not there are article-
;; supplied Keywords headers.  (Also available in 1.7, as a patch,
;; predating the integration of this code for 2.x.)  INN grinds
;; article bodies on the way through, supplying additional data
;; to overchan, thereby providing useful bits of article content
;; for scoring/killing purposes, without requiring full-body searches.

;----------------------------------------------------------------

;;; Code:

;; data bits for generator.

(defvar message-keyword-punctuation-regexp "[][\\~%^+!'`\"(){}<>:;,.|?=*_@$/]"
  "*Punctuation characters in need of removal.")

(defvar message-keyword-whitespace-regexp "[ \t]"
  "*Whitespace characters to be turned into newlines.")

(defvar message-keyword-english-trivia-regexp "^\\(.\\|..\\|[-+/0-9][-+/0-9]*\\|.*1st\\|.*2nd\\|.*3rd\\|.*[04-9]th\\|about\\|after\\|ago\\|all\\|already\\|also\\|among\\|and\\|any\\|anybody\\|anyhow\\|anyone\\|anywhere\\|are\\|bad\\|because\\|been\\|before\\|being\\|between\\|but\\|can\\|could\\|did\\|does\\|doing\\|done\\|dont\\|during\\|eight\\|eighth\\|eleven\\|else\\|elsewhere\\|every\\|everywhere\\|few\\|five\\|fifth\\|first\\|for\\|four\\|fourth\\|from\\|get\\|going\\|gone\\|good\\|got\\|had\\|has\\|have\\|having\\|he\\|her\\|here\\|hers\\|herself\\|him\\|himself\\|his\\|how\\|ill\\|into\\|its\\|ive\\|just\\|kn[eo]w\\|least\\|less\\|let\\|like\\|look\\|many\\|may\\|more\\|m[ou]st\\|myself\\|next\\|nine\\|ninth\\|not\\|now\\|off\\|one\\|only\\|onto\\|our\\|out\\|over\\|really\\|said\\|saw\\|says\\|second\\|see\\|set\\|seven\\|seventh\\|several\\|shall\\|she\\|should\\|since\\|six\\|sixth\\|some\\|somehow\\|someone\\|something\\|somewhere\\|such\\|take\\|ten\\|tenth\\|tha
 n!
!
!
\\|that\\|the\\|their\\|them\\|then\\|there\\|therell\\|theres\\|these\\|they\\|thing\\|things\\|third\\|this\\|those\\|three\\|thus\\|together\\|told\\|too\\|twelve\\|two\\|under\\|upon\\|very\\|via\\|want\\|wants\\|was\\|wasnt\\|way\\|were\\|weve\\|what\\|whatever\\|when\\|where\\|wherell\\|wheres\\|whether\\|which\\|while\\|who\\|why\\|will\\|will\\|with\\|would\\|write\\|writes\\|wrote\\|yes\\|yet\\|you\\|your\\|youre\\|yourself\\)$"
  "*Trivial (therefore ignorable) words in English.  Very empirical.")

(defvar message-keyword-trivia-regexp message-keyword-english-trivia-regexp
  "*General trivia regexp.")

;; data bits for interface to generator.

(defvar message-keyword-too-few-lines-to-bother 5
  "*Number of lines below which we won't generate Keywords at all.")

(defvar message-keyword-far-too-many-lines 500
  "*Number of lines above which we won't generate Keywords at all.")

(defvar message-keyword-short-article-limit 100
  "*Number of lines below which we generate a short-count Keywords list.")

(defvar message-keyword-short-count 8
  "*Max number of keywords to generate for articles of length between
message-keyword-too-few-lines-to-bother & message-keyword-short-article-limit.")

(defvar message-keyword-long-count 12
  "*Max number of keywords to generate for articles of length greater
than message-keyword-short-article-limit.")

(defvar message-keyword-interactive t
  "*Nil means just build it and insert it.  Non-nil means ask if
the result is ok.  The latter is tempered with the fact that there are
still no questions asked if there are no keywords and the message is
too short (== 2*too-few-lines).")

;----------------------------------------------------------------

(defun message-keyword-generate (maxkeys text)
  "*Generate a list of MAXKEYS keywords from the supplied TEXT.
Downcase, remove punctuation, whitespace->newline to create word list,
remove trivial words, sort, count unique occurrences > 2, ultimately
building a simple lisp list of the words."
  (interactive)
  (save-excursion
    (let ((retval nil)
	  (count 0)
	  (word "")
	  (textbuf  (generate-new-buffer " *ChewableText*"))
	  (countbuf (generate-new-buffer " *WordCount*"))
	  pmin)
      (if (or (not (numberp maxkeys))
	      (> maxkeys 25))			; absurdity defense.
	  (error "1st arg to keyword-generate must be an integer < 25"))
      (set-buffer textbuf)
      (insert text "\n")			; guarantee newline.
      (setq pmin (point-min))
      (downcase-region pmin (point-max))	; "tr [A-Z] [a-z]".
      (goto-char pmin)
      (while (search-forward-regexp message-keyword-punctuation-regexp nil t)
	(replace-match "" nil t))		; "tr -d [these chars]".
      (goto-char pmin)
      (while (search-forward-regexp message-keyword-whitespace-regexp nil t)
	(replace-match "\n" nil t))		; "tr [SPC TAB] '\012'".
      (goto-char pmin)
      (while (search-forward-regexp message-keyword-trivia-regexp nil t)
	(replace-match "" nil t))		; "egrep -v '^these|words'".
      (sort-lines nil pmin (point-max))
      (goto-char pmin)
      (while (looking-at "\n")
	(delete-char 1))			; "sed -e '/^$/d'".
      ; "uniq -c":
      ; while there are words to be counted...
      ;	get a word with leading ^ and newline attached.
      ;	count occurrences while deleting them (ignore counts < 3).
      (while (not (eobp))
	(setq count 0
	      word (concat "^" (buffer-substring
				1 (+ 2 (skip-chars-forward "^\n")))))
	(goto-char pmin)
	(while (search-forward-regexp word nil t)
	  (replace-match "" nil t)
	  (setq count (1+ count)))
	(if (>= count 3)
	    (insert-string (format "%5d\t%s" count word) countbuf)))
      ; in countbuf, "sort -nr".
      ; delete the counts. ("sed -e 's/^.*\t^//'")
      ; build a list with the results.
      (kill-this-buffer)
      (set-buffer countbuf)
      (sort-lines t pmin (point-max))
      (goto-char pmin)
      (while (search-forward-regexp "^.+\t^" nil t)
	(replace-match "" nil t))
      (goto-char pmin)
      (while (and (not (eobp))
		  (> maxkeys 0))
	(setq retval (append retval
			     (list (buffer-substring
				    (point)
				    (+ (point) (skip-chars-forward "^\n"))))))
	(decf maxkeys)
	(forward-char 1))
      (kill-this-buffer)
      retval
      )
    )
  )

(defun message-keyword-insert ()
  "*Take current (*message*) buffer's total contents,
and compute a Keywords header for it."
  (interactive)
  (let ((keywords "")
	quickbuf count)
    (save-excursion
      (unless (message-fetch-field "keywords")
	; get full article text, including header & signature.
	(setq quickbuf (generate-new-buffer " *BodyOnly*"))
	(insert-string (buffer-substring) quickbuf)
	(set-buffer quickbuf)
	(narrow-to-region (message-goto-body)
			  (progn (message-goto-signature)
				 (point)))
	(setq count (count-lines (point-min) (point-max)))
	; compute a string of keywords.  #keywords is based on line count.
	(setq keywords (mapconcat
			'identity
			(if (or
			     (< count message-keyword-too-few-lines-to-bother)
			     (> count message-keyword-far-too-many-lines))
			    nil
			  (message-keyword-generate
			   (if (> count message-keyword-short-article-limit)
			       message-keyword-long-count
			     message-keyword-short-count)
			   (buffer-substring)))
			  ","))
	(kill-this-buffer)))
    ; back in *message* buffer now.
    ; if user wants to have a hand in things, this is his chance.
    (if (and message-keyword-interactive
	     (or (not (string-equal keywords ""))
		 (> count (* 2 message-keyword-too-few-lines-to-bother))))
	(setq keywords (read-string "Keywords: " keywords)))
    ; but one way or another, now insert the keywords.
    (if (not (string-equal keywords ""))
	(save-excursion
	  (message-goto-keywords)
	  (insert-string keywords)))
    )
  )

(provide 'messkeyw)


^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: messkeyw.el [patch]
  1998-10-28 19:23 messkeyw.el Karl Kleinpaste
@ 1998-10-30 11:57 ` Karl Kleinpaste
  1998-10-30 13:46   ` William M. Perry
  0 siblings, 1 reply; 3+ messages in thread
From: Karl Kleinpaste @ 1998-10-30 11:57 UTC (permalink / raw)


Thanx to Jean-Yves Perrier, who found an Emacs /vs/ XEmacs
compatibility bug.  Apparently, buffer-substring's START and END args
aren't optional in Emacs.

--karl

--- messkeyw.el.~1~	Fri Oct 30 06:17:27 1998
+++ messkeyw.el	Fri Oct 30 06:18:31 1998
@@ -170,7 +170,7 @@
       (unless (message-fetch-field "keywords")
 	; get full article text, including header & signature.
 	(setq quickbuf (generate-new-buffer " *BodyOnly*"))
-	(insert-string (buffer-substring) quickbuf)
+	(insert-string (buffer-substring (point-min) (point-max)) quickbuf)
 	(set-buffer quickbuf)
 	(narrow-to-region (message-goto-body)
 			  (progn (message-goto-signature)
@@ -187,7 +187,7 @@
 			   (if (> count message-keyword-short-article-limit)
 			       message-keyword-long-count
 			     message-keyword-short-count)
-			   (buffer-substring)))
+			   (buffer-substring (point-min) (point-max))))
 			  ","))
 	(kill-this-buffer)))
     ; back in *message* buffer now.


^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: messkeyw.el [patch]
  1998-10-30 11:57 ` messkeyw.el [patch] Karl Kleinpaste
@ 1998-10-30 13:46   ` William M. Perry
  0 siblings, 0 replies; 3+ messages in thread
From: William M. Perry @ 1998-10-30 13:46 UTC (permalink / raw)
  Cc: ding

Karl Kleinpaste <karl@jprc.com> writes:

> Thanx to Jean-Yves Perrier, who found an Emacs /vs/ XEmacs compatibility
> bug.  Apparently, buffer-substring's START and END args aren't optional
> in Emacs.

  If you are going to be using point-min/max, you might as well just use
buffer-string. :)

-Bill P.


^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~1998-10-30 13:46 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
1998-10-28 19:23 messkeyw.el Karl Kleinpaste
1998-10-30 11:57 ` messkeyw.el [patch] Karl Kleinpaste
1998-10-30 13:46   ` William M. Perry

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).