From: John (_You_ hide, they seek.) Mackin <john@civil.su.oz.au>
To: The rc Mailing List <rc@hawkwind.utcs.toronto.edu>,
Boyd Roberts <boyd@prl.dec.com>
Subject: compact - a history-file compaction command
Date: Sat, 3 Oct 1992 19:17:33 -0400 [thread overview]
Message-ID: <199210040917.15729.rc.baboz@civil.su.oz.au> (raw)
Here's my version. I'm sorry I didn't send it out much earlier and save
everyone the effort of writing those foul script versions :). Please
comment.
OK,
John.
# To unbundle, sh this file
echo README 1>&2
sed 's/.//' >README <<'//GO.SYSIN DD README'
-This is compact version 0.9, by John Mackin <john@civil.su.oz.au>.
-
--- What is compact?
-
-compact is a tool which analyses its input line-wise, and only outputs
-one copy of each duplicated line. Unlike uniq, compact detects duplicate
-lines even if they are not adjacent. Unlike sort | uniq, compact is stable
-with respect to the order of the lines; duplicated lines are output in the
-most recent (farthest along) position.
-
--- Why is that any use?
-
-I admit that I can't think offhand of any application for compact other
-than the specific one for which I wrote it. I hope your mileage varies :).
-
-It was written to compact shell history files. Some shells are able to
-simply append each command line to a file after it is read and before
-it is executed. This allows very simple, clean, and flexible implementation
-of history entirely outside the shell itself. With this kind of history,
-people seem to fall into two classes: those who like to throw their
-history away frequently (e.g., when a window or a login session dies),
-and those who like to keep it forever. I'm one of the latter. Of
-course, in that case, unless you do something about it your history
-file grows without bound. compact is what I do about it.
-
-It really helps; for a typical example, it shrinks one raw history file
-I have here from 164 000 characters to 67 000. In terms of lines, the
-saving is even more impressive, from 12 000 to 4 000 on the same example
-file. (This implies that rarer commands are longer, which is as it should
-be.)
-
-A history file that's compacted regularly still grows without bound,
-but the growth is much more under control.
-
--- Why a new program? Why not implement this using existing tools?
-
-I couldn't think of any way to do what compact does using normal
-standard UNIX tools that was even reasonably clean and efficient.
-If you _can_ think of one, please tell me. The solutions that I
-have since seen don't impress me, efficiency-wise; that matters,
-since my history file ends up being pretty big.
-
--- Why did you pick a command name that was already used?
-
-I am aware that a `compact' command already exists on some machines.
-For those who don't know, it was written at Berkeley, and introduced
-in 4.2BSD. It was a file compression command, like pack or compress.
-It predated compress and was intended as an alternative to pack.
-Unfortunately, on the majority of files, it took far more CPU time than
-pack to produce a result that was usually around the same in terms of
-disk space. Even before the advent of compress, it was almost
-never used. These days, it is effectively totally dead as a command.
-(How long has it been since _you_ saw a filename with a .C suffix?
-That wasn't C++?) It has even been deprecated at Berkeley, and is
-not present in the 4.4BSD sources on ftp.uu.net. Therefore, I have
-no qualms about re-using its name.
-
-The reason I chose to do so is because the idea for compact wasn't
-mine. The first `compact' I know of that did what mine does was
-written by Bruce Ellis at the University of Sydney (and it predated
-the Berkeley `compact'). I never saw the source to Brucee's implementation;
-I've just re-implemented his idea.
-
--- How do I compile and install it?
-
-Link Makefile.dist to Makefile and type "make". There is no
-configuration and I believe it should be 100% portable, with the single
-possible exception that you might need to take care of strrchr vs.
-rindex. Please tell me if it isn't, or if you have any problems.
-Installation is done by hand; strip the executable and copy it to where
-you want it. Copy the manual page to wherever it should go. All
-done.
-
-There is one other possible portability glitch: your compiler might
-not understand one or both of the declarations "unsigned char" or
-"unsigned long". If so, it is truly prehistoric, and you should mail
-me because I'm interested (unless you're on some sort of DOS box or such).
-
--- You're a dope! The algorithm you chose is ludicrously memory-hungry!
- I don't have a snowball's chance in hell of compacting any significant
- file on my PDP-11.
-
-Yes, I know. Please accept my apologies. As we are all aware, we have
-to constantly decide how to trade off speed for space and vice versa, and
-I wanted this program to run quickly. I do have an algorithm in mind for
-machines with small memories, but I haven't implemented it yet. When I
-get around to that depends in large part on demand: if you have a need for it,
-let me know.
-
--- What is this "flint" business? It looks like something separate.
-
-Yes, it was going to be. "flint" stands for fast, line-oriented input.
-It is part of a library I was developing which I was going to release
-publically. Unfortunately, I was beaten to the punch by Phong Vo and
-David Korn's sfio package. flint as it stands is still much lighter-weight
-and simpler than sfio, but then it only provides one function: efficient,
-safe, unlimited length line-oriented file reading. If you want to rip it
-off and use it in something else, let me know and I can send you some
-documentation. If you're interested in sfio, see <16948@ulysses.att.com>
-in comp.lang.c, or if you don't have unlimited access to old netnews (grin)
-the package is available from netlib:
-
- echo send sfio.shar from incoming | mail netlib@research.att.com
-
--- Why does compact handle null characters (at all, and as it does)?
-
-Good question. A well-formed UNIX text file consists of zero or more
-lines. A line is a sequence of non-null characters terminated by a
-newline. Hence, a well-formed UNIX text file never contains null
-characters. Since the shell I use is friendly enough to ignore null
-characters on input (with a warning) should I happen to type one
-accidentally, then null characters will never be written to my
-history file, right? Wrong. I run shells on more than one machine,
-all appending to the same history file by courtesy of that absolute
-_marvel_ of modern software technology, NFS (read: Not a FileSystem).
-One of the lovely features that gives me is large chunks of null
-characters where I should have data. Hence, compact needs to
-ignore lines containing null characters. Come on up to NFS, your
-international passport to crossmounting pleasure! You'll be so
-glad you did!
-
--- Where did this version of compact come from? Where can I get
- the latest version?
-
-Version 0.9 of compact is a limited release for portability evaluation and
-comments. Please let me know how it works (or doesn't work) for you,
-and what you think of it. This version was released by mail to the rc
-mailing list and selected other people. The program may be made available
-by anonymous ftp, and posted to netnews, later. Version 0.9 was released
-in October 1992.
-
-John Mackin <john@civil.su.oz.au>
-
- If that address should fail (shouldn't happen prior to January
- 1993, but who knows!), you could try <john@physiol.su.oz.au> or
- <mackin@vast.unsw.edu.au>.
-
-Zero is greater than minus zero, but don't ask by how much.
- -- Control Data Corporation 6600 reference manual
//GO.SYSIN DD README
echo compact.1 1>&2
sed 's/.//' >compact.1 <<'//GO.SYSIN DD compact.1'
-.TH COMPACT 1 local
-.SH NAME
-compact \- shrink a file by eliminating duplicate lines
-.SH SYNOPSIS
-.B compact
-[
-.I file
-]
-.SH DESCRIPTION
-.I compact
-reads the named input
-file,
-or the standard input
-if no file is named, and writes a subset of the lines read
-to the standard output. Lines are output in the same order that
-they appear in the input, but lines which appear in the input
-more than once will appear in the output exactly once,
-in the position in which they
-.I last
-appear in the input.
-.PP
-Blank lines and lines containing null characters are ignored. Should the last
-character of the input not be a newline, all characters after
-the final newline in the input will be ignored. Input characters
-other than null and newline are not distinguished in any way;
-in particular, characters with the high bit on will be treated normally.
-.SH EXAMPLE
-.I compact
-is ideal for shrinking shell history files. You might put something
-like this in your .profile, if your version of sh supports history:
-.PP
- trap 'compact $HISTORY >$HOME/.temphist; mv $HOME/.temphist $HISTORY' 0
-.SH AUTHOR
-John Mackin, <john@civil.su.oz.au>; original concept by Bruce Ellis
-(who had nothing to do with this implementation).
-.SH BUGS
-\&
-.SH "SEE ALSO"
-sort(1), uniq(1), comm(1)
//GO.SYSIN DD compact.1
echo Makefile.dist 1>&2
sed 's/.//' >Makefile.dist <<'//GO.SYSIN DD Makefile.dist'
-CFLAGS = -O
-LDFLAGS = # -g
-LIBS =
-
-compact: compact.o memory.o flint.o
- cc $(LDFLAGS) compact.o memory.o flint.o -o $@ $(LIBS)
-
-compact.o: compact.h
-
-memory.o: compact.h
//GO.SYSIN DD Makefile.dist
echo compact.h 1>&2
sed 's/.//' >compact.h <<'//GO.SYSIN DD compact.h'
-/*
- * Header file for compact. Include all random headers & define
- * some types.
- */
-
-#include <stdio.h>
-#include <ctype.h>
-#include "flint.h"
-
-#define SYSERROR (-1)
-
-#define HASHSIZE 3319
-
-struct cmd {
- struct cmd * c_next;
- struct cmd * c_prev;
- struct cmd * c_hlink;
- char * c_cmd;
- int c_len;
-};
-
-extern char * Name;
-
-extern struct cmd * newcmd();
-extern char * newstr();
//GO.SYSIN DD compact.h
echo compact.c 1>&2
sed 's/.//' >compact.c <<'//GO.SYSIN DD compact.c'
-/*
- * History file compaction. Usage: compact history-file.
- */
-
-#include "compact.h"
-
-/*
- * Head and Tail are the doubly linked list of commands in print
- * order. Hash is the hash table.
- */
-
-struct cmd * Head;
-struct cmd * Tail;
-struct cmd * Hash[HASHSIZE];
-
-char * Name;
-
-extern char * strrchr();
-
-long hashpjw();
-
-main(argc, argv)
-int argc;
-char ** argv;
-{
- register char * p;
-
- Name = argv[0];
- if (p = strrchr(Name, '/'))
- Name = ++p;
-
- if (argc == 1)
- compactfd("standard input", 0);
- else if (argc == 2)
- compact(argv[1]);
- else
- usage();
-
- exit(0);
-}
-
-compact(file)
-char * file;
-{
- int fd;
-
- fd = open(file, 0);
- if (fd == SYSERROR) {
- fprintf(stderr, "%s: can't open ", Name);
- perror(file);
- exit(1);
- }
- compactfd(file, fd);
- (void)close(fd);
-}
-
-compactfd(file, fd)
-char * file;
-int fd;
-{
- flint f;
- register char * s;
- register struct cmd * c;
- extern int errno;
-
- if (flopen(&f, fd, 'r') != 0)
- fatal("can't flopen");
-
- while ((s = flread(&f)) != (char *)0)
- if (fllength(&f) > 0) /* ignore blank lines */
- hashprobe(s, fllength(&f));
-
- if (flerror(&f)) {
- fprintf(stderr, "%s: read error on ", Name);
- errno = flerrno(&f);
- perror(file);
- exit(1);
- }
- flclose(&f);
-
- c = Head;
- while (c) {
- puts(c->c_cmd);
- c = c->c_next;
- }
-}
-
-hashprobe(cmd, len)
-char * cmd;
-register int len;
-{
- register struct cmd ** bucket;
- register struct cmd * c;
- register struct cmd * new;
- register long hashval;
-
- hashval = hashpjw(cmd, len);
- if (hashval == -1L) /* The line has embedded nulls. */
- return; /* Sodding NFS!! Ignore it. */
-
- bucket = &Hash[(int)(hashval % HASHSIZE)];
- c = *bucket;
- if (c == (struct cmd *)0) {
- new = newcmd();
- *bucket = new;
- new->c_cmd = newstr(cmd, len);
- new->c_len = len;
- new->c_hlink = (struct cmd *)0;
- tailadd(new);
- }
- else {
- /*
- * This hash bucket has something in it. Search to see
- * if this command is there already or not. We only need
- * to call strcmp if the length of the strings is the same,
- * which cuts down the number of calls by around 2/3 in a
- * typical case.
- */
- for (;;) {
- if (c->c_len == len && strcmp(cmd, c->c_cmd) == 0)
- break;
- if (c->c_hlink == (struct cmd *)0) {
- new = newcmd();
- c->c_hlink = new;
- new->c_cmd = newstr(cmd, len);
- new->c_len = len;
- new->c_hlink = (struct cmd *)0;
- tailadd(new);
- return;
- }
- c = c->c_hlink;
- }
- /*
- * This command is already in this bucket. Move it to the
- * end of the doubly-linked print order list, unless it is
- * already at the end in which case nothing need be done.
- */
- if (c->c_next != (struct cmd *)0) {
- if (c == Tail)
- fatal("can't be Tail");
- if (c->c_prev != (struct cmd *)0) {
- if (c == Head)
- fatal("can't be Head");
- c->c_prev->c_next = c->c_next;
- }
- else
- Head = c->c_next;
- c->c_next->c_prev = c->c_prev;
- tailadd(c);
- }
- }
-}
-
-tailadd(c)
-register struct cmd * c;
-{
- if (Head == (struct cmd *)0) { /* first time */
- Head = Tail = c;
- c->c_prev = c->c_next = (struct cmd *)0;
- }
- else {
- Tail->c_next = c;
- c->c_prev = Tail;
- c->c_next = (struct cmd *)0;
- Tail = c;
- }
-}
-
-/*
- * Weinberger's hash function.
- */
-
-long
-hashpjw(s, len)
-register char * s;
-register int len;
-{
- register unsigned char * p;
- register unsigned long g;
- register unsigned long h;
- register unsigned char c;
-
- h = 0;
- p = (unsigned char *)s;
- while (len-- > 0) {
- c = *p++;
- if (c == '\0')
- return (-1L);
- h = (h << 4) + c;
- if (g = h & 0xF0000000) {
- h = h ^ (g >> 24);
- h = h ^ g;
- }
- }
-
- return (h);
-}
-
-fatal(s)
-{
- fprintf(stderr, "%s: %s\n", Name, s);
- exit(1);
-}
-
-usage()
-{
- fprintf(stderr, "usage: %s [file]\n", Name);
- exit(1);
-}
//GO.SYSIN DD compact.c
echo memory.c 1>&2
sed 's/.//' >memory.c <<'//GO.SYSIN DD memory.c'
-/*
- * Memory allocation.
- */
-
-#include "compact.h"
-
-char * strcpy();
-
-static char * salloc();
-
-/*
- * It's definitely worth not having to call strlen() here, and since we
- * know the length anyway, why not? I considered making newstr() allocate
- * chunks and dole them out, in order to cut down on calls to malloc(),
- * but on the Ultrix MIPS box where I profiled this, malloc came to
- * only 3% in the end so I didn't bother. If your mileage varies,
- * that might be a good shot.
- */
-
-char *
-newstr(s, len)
-char * s;
-int len;
-{
- char * p;
-
- p = salloc(len + 1);
- strcpy(p, s);
- return (p);
-}
-
-static char *
-salloc(n)
-int n;
-{
- register char * s;
- extern char * malloc();
-
- s = malloc((unsigned)n);
- if (s == (char *)0) {
- fprintf(stderr, "%s: out of memory\n", Name);
- exit(1);
- }
-
- return (s);
-}
-
-/*
- * Special-purpose allocator for command struct's. Avoids all that
- * time spent in malloc. Freeing them isn't an issue.
- */
-
-#define CHUNK 512
-
-struct cmd *
-newcmd()
-{
- static int chunk = CHUNK;
- static int avail = 0;
- static struct cmd * cp;
- register int n;
-
- if (avail <= 0) {
- cp = (struct cmd *)salloc(chunk * sizeof (struct cmd));
- avail = chunk;
- n = chunk << 1;
- if (n > 0)
- chunk = n;
- }
-
- avail--;
- return (cp++);
-}
//GO.SYSIN DD memory.c
echo flint.h 1>&2
sed 's/.//' >flint.h <<'//GO.SYSIN DD flint.h'
-/*
- * Header file for flint - fast line oriented input
- */
-
-/* $Id: flint.h,v 1.2 1991/12/27 12:24:08 john Exp $ */
-
-#ifndef _FLINT_H
-#define _FLINT_H
-
-typedef struct _flint {
- char * fl_buf;
- int fl_bufsize;
- char * fl_ptr;
- char * fl_end;
- int fl_flags;
- int fl_fd;
- int fl_errno;
- int fl_length;
-} flint;
-
-/* flags in fl_flags */
-
-#define FLREAD 01
-#define FLEOF 02
-#define FLERROR 04
-
-#define FLBUFSIZE 4096 /* initial size */
-
-/* publically visible routines */
-
-extern int flopen();
-extern char * flread();
-extern void flclose();
-extern void flclear();
-extern long fltell();
-extern void (*flerrset())();
-extern void flreset();
-
-/* macros */
-
-#define flisopen(f) ((f)->fl_flags & FLREAD)
-#define fleof(f) ((f)->fl_flags & FLEOF)
-#define flerror(f) ((f)->fl_flags & FLERROR)
-#define flerrno(f) ((int)(f)->fl_errno)
-#define fllength(f) ((int)(f)->fl_length)
-#define flfd(f) ((int)(f)->fl_fd)
-
-#endif _FLINT_H
-/* Don't add anything following the above #endif */
//GO.SYSIN DD flint.h
echo flint.c 1>&2
sed 's/.//' >flint.c <<'//GO.SYSIN DD flint.c'
-/*
- * flint - fast line input
- */
-
-static char RCSid[] = "$Id: flint.c,v 1.3 1992/06/02 04:01:56 john Exp $";
-
-#include "flint.h"
-
-#ifndef SYSERROR
-#define SYSERROR (-1)
-#endif
-
-static void flexpand();
-static void flfill();
-static void flmoveback();
-static char * flmalloc();
-static char * flrealloc();
-static void flfatal();
-
-static void (*flerrfunc)() = (void (*)())0;
-
-int
-flopen(f, fd, mode)
-register flint * f;
-int fd;
-int mode;
-{
- if (mode != 'r')
- return (-1);
- f->fl_bufsize = FLBUFSIZE;
- f->fl_buf = flmalloc(FLBUFSIZE);
- f->fl_ptr = f->fl_end = f->fl_buf;
- f->fl_flags = FLREAD;
- f->fl_fd = fd;
- f->fl_errno = 0;
- f->fl_length = 0;
- return (0);
-}
-
-void
-flclose(f)
-register flint * f;
-{
- free(f->fl_buf);
- f->fl_flags = 0;
-}
-
-void
-flclear(f)
-register flint * f;
-{
- f->fl_flags &= ~(FLEOF | FLERROR);
- f->fl_errno = 0;
-}
-
-void
-flreset(f)
-register flint * f;
-{
- f->fl_ptr = f->fl_end = f->fl_buf;
-}
-
-long
-fltell(f)
-register flint * f;
-{
- register long pos;
- extern long lseek();
-
- if ((f->fl_flags & FLREAD) == 0)
- return (-2L);
- if (f->fl_flags & FLERROR)
- return (-2L);
-
- pos = lseek(f->fl_fd, 0L, 1);
- if (pos == (long)SYSERROR)
- return (-1L);
-
- pos -= f->fl_end - f->fl_ptr;
-
- if (pos >= 0L)
- return (pos);
- else
- return (-3L);
-}
-
-
-char *
-flread(f)
-register flint * f;
-{
- register char * p;
- register char * end;
- char * line;
-
- if ((f->fl_flags & FLREAD) == 0)
- return ((char *)0);
- for (;;) {
- p = f->fl_ptr;
- end = f->fl_end;
- while (p < end) {
- if (*p == '\n')
- goto break2;
- p++;
- }
- if (f->fl_flags & (FLEOF | FLERROR))
- return ((char *)0);
-
- if (f->fl_end > f->fl_ptr) {
- if (f->fl_ptr == f->fl_buf)
- flexpand(f);
- else
- flmoveback(f);
- }
- else
- f->fl_ptr = f->fl_end = f->fl_buf;
-
- flfill(f);
- }
-break2:
- /*
- * Getting here means that we have found a valid line which
- * we can proceed to return to the user.
- */
-
- line = f->fl_ptr;
- f->fl_length = p - line;
- *p++ = '\0'; /* null out the newline on the end of the line */
- f->fl_ptr = p;
-
- return (line);
-}
-
-static void
-flexpand(f)
-register flint * f;
-{
- char * save;
- register int n;
-
- n = f->fl_bufsize << 1;
- if (n > 0)
- f->fl_bufsize = n;
- else
- flfatal("buffer tried to grow too large");
-
- save = f->fl_buf;
- f->fl_buf = flrealloc(f->fl_buf, n);
- if (save != f->fl_buf) { /* did it move? */
- f->fl_ptr = f->fl_ptr - save + f->fl_buf;
- f->fl_end = f->fl_end - save + f->fl_buf;
- }
-}
-
-static void
-flfill(f)
-register flint * f;
-{
- register int n;
- register int try;
- extern int errno;
-
- if (f->fl_flags & (FLEOF | FLERROR))
- flfatal("flfill called with eof or error");
-
- try = (f->fl_buf + f->fl_bufsize) - f->fl_end;
- if (try <= 0)
- flfatal("can't happen in flfill");
- n = read(f->fl_fd, f->fl_end, try);
- if (n == 0) {
- f->fl_flags |= FLEOF;
- return;
- }
- if (n == SYSERROR) {
- f->fl_flags |= FLERROR;
- f->fl_errno = errno;
- return;
- }
- f->fl_end += n;
- if (f->fl_end > f->fl_buf + f->fl_bufsize)
- flfatal("buffer overrun in flfill");
-}
-
-static void
-flmoveback(f)
-register flint * f;
-{
- register int n;
- register char * from;
- register char * to;
-
- n = f->fl_end - f->fl_ptr;
- if (n < 0 || n >= f->fl_bufsize)
- flfatal("can't happen in flmoveback");
- from = f->fl_ptr;
- to = f->fl_buf;
- while (n-- > 0)
- *to++ = *from++;
- f->fl_ptr = f->fl_buf;
- f->fl_end = to;
-}
-
-static char *
-flmalloc(n)
-register int n;
-{
- register char * p;
- extern char * malloc();
-
- p = malloc((unsigned)n);
- if (p == (char *)0)
- flfatal("out of memory for flint buffers");
- return (p);
-}
-
-static char *
-flrealloc(old, n)
-register char * old;
-register int n;
-{
- register char * p;
- extern char * realloc();
-
- p = realloc(old, (unsigned)n);
- if (p == (char *)0)
- flfatal("out of memory for flint buffers");
- return (p);
-}
-
-static void
-flfatal(s)
-char * s;
-{
- char buf[256]; /* avoid stdio */
-
- if (flerrfunc == (void (*)())0) {
- strcpy(buf, "fatal error in flint: ");
- strcat(buf, s);
- strcat(buf, "\n");
- write(2, buf, strlen(buf));
- exit(1);
- }
- else
- (*flerrfunc)(s);
-}
-
-void
-(*flerrset(fp))()
-void (*fp)();
-{
- void (*ret)();
-
- ret = flerrfunc;
- flerrfunc = fp;
- return (ret);
-}
//GO.SYSIN DD flint.c
next reply other threads:[~1992-10-03 23:23 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
1992-10-03 23:17 John Mackin [this message]
1992-10-04 5:50 Byron Rakitzis
1992-10-04 5:57 ` John Mackin
1992-10-04 6:20 Byron Rakitzis
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=199210040917.15729.rc.baboz@civil.su.oz.au \
--to=john@civil.su.oz.au \
--cc=boyd@prl.dec.com \
--cc=rc@hawkwind.utcs.toronto.edu \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).