9fans - fans of the OS Plan 9 from Bell Labs
From: "James A. Robinson" <jim.robinson@stanford.edu>
To: 9fans@cse.psu.edu
Subject: Re: [9fans] Gecko based web browser
Date: Wed, 19 Jul 2000 21:05:25 -0700
Message-ID: <200007200405.AAA27933@cse.psu.edu>

Someone just pointed out 'hget ...|awk ...' would do what I was talking
about. Yes, I did know about it, but perhaps I should explain some of
the advantages I could see coming out of a web retrieval engine.
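
For the record, a bare-bones version of that pipeline is something
like the following (the awk program is just a stand-in for whatever
you actually want to pull out of the page):

        hget http://bmj.com/index.dtl | awk '/href=/ { print }'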

I realize a lot of this could be built into a standalone binary; I
just don't see the point of doing that instead of an easier-to-navigate
fs-style server.  In any case, after this I won't harp on the subject
any more.

cache   A cache is a wonderful thing. This is from my point of view
        as one of the poor schmoes who gets paged when Science Magazine
        has an article or three that a million people want *right now.*
        A few weeks ago they published a breaking article about water
        on Mars, an article about some new AIDS research, and an article
        about dinosaurs who had feathers.  This brought the merry hordes
        breaking down our door, and we were seeing 30 hits per second
        sustained for 48 hours.

proxy   You start up webfs on your firewall machine, and export the
        service to everyone on the local net.  They bind the firewall's
        /webfs to their own root, and off they go (taking advantage of
        the cache, and not having to futz with http vs https proxies like
        you might have to in Netscape).
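
        A rough sketch of what that could look like from a client, with
        made-up names all around (webfs doesn't exist, so the paths and
        the machine name 'fw' are pure speculation):

        ; import fw /webfs              # firewall's /webfs appears locally
        ; cat /webfs/now/bmj.com/index.dtl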

past    Anyone else think that IE actually has a pretty spiffy history
        of last visited links?  The breakdown into Yesterday, Last Week,
        2 Weeks Ago, etc., is a lot nicer than NS's default history buffer.
        Imagine what you could do in terms of an fs system:

        diff /webfs/past/20000718/bmj.com/index.dtl /webfs/now/bmj.com/index.dtl

        Of course I don't pretend to know how one would handle the file
        hierarchy stuff.  As I wrote to someone in private e-mail
        a while ago, I don't think that just because a URL has a '/'
        it should automatically be a candidate for an fs storage
        hierarchy.  But I do agree that a LOT of it CAN be mapped.
        What if you could do searches in the cache of places you've
        visited over the past few weeks?  For example, you remember
        seeing a real nice algorithm, but now you've forgotten some
        important detail.  You could visit Google and comb the web again,
        but if you know it's somewhere in that cache on the local fs...
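
        For example, if the hierarchy really did look like the diff
        example above (a big if), something as dumb as this would do:

        ; grep -l algorithm /webfs/past/2000071*/*/*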

anon    I can't remember if it was Bell Labs or AT&T Labs, but someone at
        one of those places wrote about a neat proxy server which helped
        protect your identity.  You browsed via the proxy, and it would
        substitute codes like '\@u', which you POST to forms, with an
        expanded username like 'anon10007' or what-not.  It could
        generate a unique e-mail address for each domain you sent '\@e'
        to, so that you could tell which corporation sold your e-mail
        address.

junk    Anyone else use the junkbuster? It strips out banner ads and other
        ad junk, replacing graphics with a transparent gif (and since
        it is transparent it can expand to the size of the gif it's
        replacing without looking weird). So sites look normal, but you
        don't have to view the ads.  It also ignores HREFs to places
        like doubleclick.  Wouldn't it be nice to be able to write a
        set of common rules to control such features, and allow users
        to bind their own /webfs/junkbuster/rules into place?
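
        I'm imagining something as simple as one pattern per line --
        the format here is invented purely for illustration -- which a
        user could then bind over the default set:

        ; cat $home/lib/junkrules
        doubleclick.net
        */banner/*
        ; bind $home/lib/junkrules /webfs/junkbuster/rules

        (The same bind trick would cover the 'secure' rules below.)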

secure  Strip out postscript or junk javascript according to default or
        user rules.

I guess all these are variations on a theme. Like I said, I understand
you can do all this in a standalone binary. But for me it just seems to
call out for an fs-style approach -- I'm under the impression that it
would be easier to manage than a bazillion command-line options like
wget has (there's a sketch of the ctl-file alternative after the listing):

        ; wget --help
        GNU Wget 1.5.3, a non-interactive network retriever.
        Usage: wget [OPTION]... [URL]...
        
        Mandatory arguments to long options are mandatory for short options too.
        
        Startup:
          -V,  --version           display the version of Wget and exit.
          -h,  --help              print this help.
          -b,  --background        go to background after startup.
          -e,  --execute=COMMAND   execute a `.wgetrc' command.
        
        Logging and input file:
          -o,  --output-file=FILE     log messages to FILE.
          -a,  --append-output=FILE   append messages to FILE.
          -d,  --debug                print debug output.
          -q,  --quiet                quiet (no output).
          -v,  --verbose              be verbose (this is the default).
          -nv, --non-verbose          turn off verboseness, without being quiet.
          -i,  --input-file=FILE      read URL-s from file.
          -F,  --force-html           treat input file as HTML.
        
        Download:
          -t,  --tries=NUMBER           set number of retries to NUMBER (0 unlimits).
          -O   --output-document=FILE   write documents to FILE.
          -nc, --no-clobber             don't clobber existing files.
          -c,  --continue               restart getting an existing file.
               --dot-style=STYLE        set retrieval display style.
          -N,  --timestamping           don't retrieve files if older than local.
          -S,  --server-response        print server response.
               --spider                 don't download anything.
          -T,  --timeout=SECONDS        set the read timeout to SECONDS.
          -w,  --wait=SECONDS           wait SECONDS between retrievals.
          -Y,  --proxy=on/off           turn proxy on or off.
          -Q,  --quota=NUMBER           set retrieval quota to NUMBER.
        
        Directories:
          -nd  --no-directories            don't create directories.
          -x,  --force-directories         force creation of directories.
          -nH, --no-host-directories       don't create host directories.
          -P,  --directory-prefix=PREFIX   save files to PREFIX/...
               --cut-dirs=NUMBER           ignore NUMBER remote directory components.
        
        HTTP options:
               --http-user=USER      set http user to USER.
               --http-passwd=PASS    set http password to PASS.
          -C,  --cache=on/off        (dis)allow server-cached data (normally allowed).
               --ignore-length       ignore `Content-Length' header field.
               --header=STRING       insert STRING among the headers.
               --proxy-user=USER     set USER as proxy username.
               --proxy-passwd=PASS   set PASS as proxy password.
          -s,  --save-headers        save the HTTP headers to file.
          -U,  --user-agent=AGENT    identify as AGENT instead of Wget/VERSION.
        
        FTP options:
               --retr-symlinks   retrieve FTP symbolic links.
          -g,  --glob=on/off     turn file name globbing on or off.
               --passive-ftp     use the "passive" transfer mode.
        
        Recursive retrieval:
          -r,  --recursive             recursive web-suck -- use with care!.
          -l,  --level=NUMBER          maximum recursion depth (0 to unlimit).
               --delete-after          delete downloaded files.
          -k,  --convert-links         convert non-relative links to relative.
          -m,  --mirror                turn on options suitable for mirroring.
          -nr, --dont-remove-listing   don't remove `.listing' files.
        
        Recursive accept/reject:
          -A,  --accept=LIST                list of accepted extensions.
          -R,  --reject=LIST                list of rejected extensions.
          -D,  --domains=LIST               list of accepted domains.
               --exclude-domains=LIST       comma-separated list of rejected domains.
          -L,  --relative                   follow relative links only.
               --follow-ftp                 follow FTP links from HTML documents.
          -H,  --span-hosts                 go to foreign hosts when recursive.
          -I,  --include-directories=LIST   list of allowed directories.
          -X,  --exclude-directories=LIST   list of excluded directories.
          -nh, --no-host-lookup             don't DNS-lookup hosts.
          -np, --no-parent                  don't ascend to the parent directory.
        
        Mail bug reports and suggestions to <bug-wget@gnu.org>.
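
The fs version of a few of those options might be nothing more than
writes to a ctl file -- the file name and the verbs here are invented,
just to show the flavor of it:

        ; echo 'useragent Mozilla/4.0' > /webfs/ctl     # wget's -U
        ; echo 'timeout 30' > /webfs/ctl                # wget's -T
        ; echo 'proxy on' > /webfs/ctl                  # wget's -Y on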


