Date: Tue, 3 Sep 2002 02:44:10 +0200
From: Gerd Stolpmann
To: Oliver Bandel
Cc: Caml list
Subject: Re: [Caml-list] Ocaml-Weblib?
Message-ID: <20020903004410.GG818@ice.gerd-stolpmann.de>

On 2002.09.03 01:24, Oliver Bandel wrote:
> On Tue, 3 Sep 2002, Gerd Stolpmann wrote:
>
> [...]
> > No, I have not yet found the time to do it.
> > There are some aspects
> > that cannot be explained in mli files well enough, and a tutorial
> > would be a great thing. E.g. why ocamlnet has an object-oriented
> > layer on top of channels (netchannels). You find them everywhere,
> > but no introduction.
> >
> > There is an "examples" directory containing some very simple, and
> > some advanced examples, especially for CGI programming.
>
> Yes, I have seen it.
> The simple CGI-example is very nice. :)
> Looks like easy programming. :)
> The advanced examples are a little bit too large for
> Ocaml-and-ocamlnet-beginners. ;-)
>
> I first asked, because I want to write two different
> tools:
>
> first one: a wget-like tool, which can parse the html-pages
> (if possible, this f..... javascript-stuff too) and
> can download not only recursively, but can also
> select the pages for download e.g. by pattern-matching
> on href-tags (url or text of the link) or by selection
> of filesizes or so.

HTML parsing can be done with Nethtml. Simple example:

    Nethtml.parse (new Netchannels.input_string "......")

This returns something like

    [ Element("html",[],
        [ Element("head",[], [ ... ]);
          Element("body",[], [ ... ])
        ])
    ]

Just try it in the toploop. If there are attributes
(e.g. <body bgcolor="#EEFF54">), you get them instead of [], e.g.
Element("body",["bgcolor", "#EEFF54"], [ ... ]). Note that all names
are returned in lower case.

If you want to read from a file instead of a string, just use

    new Netchannels.input_channel ch

instead of input_string (where ch is an open in_channel).

There is no parser for javascript.

To download the HTML pages you can use netclient (distributed
separately). Simple example:

    Http_client.Convenience.http_get "http://caml.inria.fr"

returns the contents of this location. If you need the HTTP headers
(which sometimes announce the file size), you can use

    let m = Http_client.Convenience.http_get_message "http://caml.inria.fr" in
    let contents = m # get_resp_body () in
    let size = m # assoc_resp_header "content-length" in ...
    (* assoc_resp_header may raise Not_found *)

If you use http_head_message, only the headers are requested from the
server, so you have the chance to determine the file size before
downloading. In my own experiments I found that some HTTP servers
handle HEAD like GET, which causes protocol errors. So be prepared to
get a strange exception when you call http_head_message.

> And the second tool I wanted to write was a similar tool
> for nntp-protocol: Download by attributes (size, date,
> MsgID, Subject, author, thread-length, ...).
> (I once wrote such stuff (not completed) in Perl
> and after the program grew more and more, it
> became more and more of a mess...).

As far as I know there is no ready-to-use NNTP client. There are
important components for an NNTP client, though. For example, there
are parsers for messages in email format, and there is a working
implementation of the POP protocol, which has some similarities.

To parse an email message, you can call Netmime.read_mime_message,
e.g.

    Netmime.read_mime_message
      (new Netchannels.input_string "subject: xxx\nsize: 50\n...\n\nbody")

This returns a structure like

    (header, `Body(body))

where "header" and "body" are objects:

    header # field "subject"    returns "xxx"
    header # field "size"       returns "50"
    body # value                returns "body"

Note that Netmime.read_mime_message decodes multipart messages by
default, so you can also get something like

    (header, `Parts [ (part1_header, `Body part1_body);
                      ...
                      (partN_header, `Body partN_body) ])

or even more deeply nested structures. You can control this by passing
the argument multipart_style.

There is also the function Netmime.read_mime_header, which returns
only the header, but it is a bit more complicated to use.
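To illustrate the nested (header, `Body ...) / (header, `Parts [...])
shape that read_mime_message returns, here is a minimal sketch that
walks such a tree and collects the leaf bodies. The types below are
simplified local stand-ins (a header is an association list, a body is
a plain string), not the real Netmime objects, and leaf_bodies is a
hypothetical helper name, so the code runs without ocamlnet:

```ocaml
(* Local stand-ins, NOT the real Netmime types: a header is an
   association list and a body is a plain string. *)
type header = (string * string) list

(* Same recursive shape as described above: a message is either a
   single body or a list of sub-messages. *)
type message = header * body
and body = [ `Body of string | `Parts of message list ]

(* Collect the leaf bodies of a possibly nested message, left to right. *)
let rec leaf_bodies ((_, b) : message) : string list =
  match b with
  | `Body s -> [ s ]
  | `Parts parts -> List.concat (List.map leaf_bodies parts)

(* A two-level multipart example built by hand. *)
let example : message =
  ( [ ("subject", "xxx") ],
    `Parts
      [ ([ ("content-type", "text/plain") ], `Body "first");
        ([], `Parts [ ([], `Body "second") ]) ] )

let () = List.iter print_endline (leaf_bodies example)
```

With the real library you would match on the same two variants, but
ask the header and body objects for their values (as with
header # field and body # value above) instead of reading strings
directly.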
To parse a header from a string with Netmime.read_mime_header:

    let header =
      Netmime.read_mime_header
        (new Netstream.input_stream
           (new Netchannels.input_string "..."))

There is another object involved (input_stream). It has no effect if
you only read from a string, but it allows you to read the header from
non-seekable files (e.g. pipelines or sockets). But this is definitely
a feature for experts.

> So I need access to sockets, some low-level stuff
> (Unix.read) and such, or a good library, which helps
> here.

See the sources in netpop.ml for an example of how to write a
"telnet-style" client. Note that netpop.ml does not open sockets
itself; it expects the user of the module to pass channels that are
already connected to sockets. See the example mbox_list.ml for the
socket stuff (very simple).

> I need a library, which can parse me the
> html-pages and maybe nntp-headers, and I want only
> to implement the logic of the tool, and let the
> network stuff programming be the work, that the
> lib can do.
>
> And I hope the ocamlnet/netstring can help here.
> But if it will be more effort to understand the library
> than writing the networking-code by myself, then
> I will write the sockets-stuff by myself.

I hope this short introduction gives you the right impression of the
library.

Gerd
------------------------------------------------------------
Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany
gerd@gerd-stolpmann.de          http://www.gerd-stolpmann.de
------------------------------------------------------------