From: "James A. Robinson"
To: 9fans@cse.psu.edu
Subject: Re: [9fans] Gecko based web browser
Date: Wed, 19 Jul 2000 21:05:25 -0700

Someone just pointed out 'hget ...|awk ...' would do what I was talking about. Yes, I did know about it, but perhaps I should explain some of the advantages I could see coming out of a web retrieval engine. I realize a lot of this could be built into a stand-alone binary; I just don't see the point of doing that instead of an easier-to-navigate fs-style server. In any case, after this I won't harp on the subject any more.

cache

A cache is a wonderful thing. This is from my point of view as one of the poor schmoes who gets paged when Science Magazine has an article or three that a million people want *right now.* A few weeks ago they published a breaking article about water on Mars, an article about some new AIDS research, and an article about dinosaurs who had feathers. This brought the merry hordes breaking down our door, and we were seeing 30 hits per second sustained for 48 hours.

proxy

You start up webfs on your firewall machine and export the service to everyone on the local net. They bind the firewall's /webfs to their own root, and off they go (taking advantage of the cache, and not having to futz with http vs https proxies like you might have to in Netscape). There's a rough sketch of this sort of thing after the list below.

past

Anyone else think that IE actually has a pretty spiffy history of last visited links? The breakdown into Yesterday, Last Week, 2 Weeks Ago, etc., is a lot nicer than NS's default history buffer. Imagine what you could do in terms of an fs system:

	diff /webfs/past/20000718/bmj.com/index.dtl /webfs/now/bmj.com/index.dtl

Of course I don't pretend to know how one can handle the file hierarchy stuff. As I wrote to someone in private e-mail awhile ago, I don't think that just because a url has a '/' it should automatically be a candidate for an fs storage hierarchy. But I do agree that a LOT of it CAN be mapped.

What if you could do searches in the cache of places you've visited over the past few weeks? For example, you remember seeing a real nice algorithm, but now you've forgotten some important detail. You can visit Google and comb the web again, but if you know it's somewhere in that cache on the local fs...

anon

I can't remember if it was Bell Labs or AT&T Labs, but someone at one of those places wrote about a neat proxy server which helped protect your identity. You browsed via the proxy, and it would substitute codes like '\@u', which you POST to forms, with an expanded username like 'anon10007' or what-not. It could generate a unique e-mail address for each domain you sent '\@e' to, so that you could tell which corporation sold your e-mail address.

junk

Anyone else use the junkbuster? It strips out banner ads and other ad junk, replacing graphics with a transparent gif (and since it is transparent it can expand to the size of the gif it's replacing without looking weird). So sites look normal, but you don't have to view the ads. It also ignores HREFs to places like doubleclick. Wouldn't it be nice to be able to write a set of common rules to control such features, and allow users to bind their own /webfs/junkbuster/rules into place?

secure

Strip out PostScript or junk JavaScript according to default or user rules.
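To make some of this concrete, here is a rough sketch of what a session against such a server might look like. Everything under /webfs is hypothetical (it is just the interface I'm imagining), as are the machine name 'fire' and the rules file; import, bind, and grep are the ordinary commands.

	; import fire /webfs					# attach the firewall's shared, caching webfs
	; bind /usr/jim/lib/webrules /webfs/junkbuster/rules	# swap in my own ad-stripping rules
	; grep -l quicksort /webfs/past/2000071*/*/*		# which pages from last week mentioned quicksort?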
I guess all these are variations on a theme. Like I said, I understand you can do all this in a stand-alone binary. But for me it just seems to call out for an fs-style approach -- I'm under the impression that it would be easier to manage than a bazillion command-line options like wget has:

; wget --help
GNU Wget 1.5.3, a non-interactive network retriever.
Usage: wget [OPTION]... [URL]...

Mandatory arguments to long options are mandatory for short options too.

Startup:
  -V,  --version           display the version of Wget and exit.
  -h,  --help              print this help.
  -b,  --background        go to background after startup.
  -e,  --execute=COMMAND   execute a `.wgetrc' command.

Logging and input file:
  -o,  --output-file=FILE     log messages to FILE.
  -a,  --append-output=FILE   append messages to FILE.
  -d,  --debug                print debug output.
  -q,  --quiet                quiet (no output).
  -v,  --verbose              be verbose (this is the default).
  -nv, --non-verbose          turn off verboseness, without being quiet.
  -i,  --input-file=FILE      read URL-s from file.
  -F,  --force-html           treat input file as HTML.

Download:
  -t,  --tries=NUMBER           set number of retries to NUMBER (0 unlimits).
  -O   --output-document=FILE   write documents to FILE.
  -nc, --no-clobber             don't clobber existing files.
  -c,  --continue               restart getting an existing file.
       --dot-style=STYLE        set retrieval display style.
  -N,  --timestamping           don't retrieve files if older than local.
  -S,  --server-response        print server response.
       --spider                 don't download anything.
  -T,  --timeout=SECONDS        set the read timeout to SECONDS.
  -w,  --wait=SECONDS           wait SECONDS between retrievals.
  -Y,  --proxy=on/off           turn proxy on or off.
  -Q,  --quota=NUMBER           set retrieval quota to NUMBER.

Directories:
  -nd  --no-directories            don't create directories.
  -x,  --force-directories         force creation of directories.
  -nH, --no-host-directories       don't create host directories.
  -P,  --directory-prefix=PREFIX   save files to PREFIX/...
       --cut-dirs=NUMBER           ignore NUMBER remote directory components.

HTTP options:
       --http-user=USER      set http user to USER.
       --http-passwd=PASS    set http password to PASS.
  -C,  --cache=on/off        (dis)allow server-cached data (normally allowed).
       --ignore-length       ignore `Content-Length' header field.
       --header=STRING       insert STRING among the headers.
       --proxy-user=USER     set USER as proxy username.
       --proxy-passwd=PASS   set PASS as proxy password.
  -s,  --save-headers        save the HTTP headers to file.
  -U,  --user-agent=AGENT    identify as AGENT instead of Wget/VERSION.

FTP options:
       --retr-symlinks   retrieve FTP symbolic links.
  -g,  --glob=on/off     turn file name globbing on or off.
       --passive-ftp     use the "passive" transfer mode.

Recursive retrieval:
  -r,  --recursive             recursive web-suck -- use with care!.
  -l,  --level=NUMBER          maximum recursion depth (0 to unlimit).
       --delete-after          delete downloaded files.
  -k,  --convert-links         convert non-relative links to relative.
  -m,  --mirror                turn on options suitable for mirroring.
  -nr, --dont-remove-listing   don't remove `.listing' files.

Recursive accept/reject:
  -A,  --accept=LIST                list of accepted extensions.
  -R,  --reject=LIST                list of rejected extensions.
  -D,  --domains=LIST               list of accepted domains.
       --exclude-domains=LIST       comma-separated list of rejected domains.
  -L,  --relative                   follow relative links only.
       --follow-ftp                 follow FTP links from HTML documents.
  -H,  --span-hosts                 go to foreign hosts when recursive.
  -I,  --include-directories=LIST   list of allowed directories.
  -X,  --exclude-directories=LIST   list of excluded directories.
  -nh, --no-host-lookup             don't DNS-lookup hosts.
  -np, --no-parent                  don't ascend to the parent directory.

Mail bug reports and suggestions to .
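For contrast, here is how a few of those knobs might look if they were files instead of flags. The /webfs/ctl file and the messages written to it are pure invention on my part, just to show the flavor of the thing; echo and cp are the only real programs involved.

	; echo 'useragent Mozilla/4.08' >/webfs/ctl		# instead of -U/--user-agent
	; echo 'proxy off' >/webfs/ctl				# instead of -Y/--proxy=off
	; cp /webfs/now/www.gnu.org/index.html /tmp/index.html	# instead of -O/--output-document

Per-user rules, caching, and history would then come along with the namespace, instead of each one needing its own option.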