caml-list - the Caml user's mailing list
From: Gerd Stolpmann <info@gerd-stolpmann.de>
To: Oliver Bandel <oliver@first.in-berlin.de>
Cc: Caml list <caml-list@inria.fr>
Subject: Re: [Caml-list] Ocaml-Weblib?
Date: Tue, 3 Sep 2002 02:44:10 +0200
Message-ID: <20020903004410.GG818@ice.gerd-stolpmann.de>
In-Reply-To: <Pine.LNX.3.95.1020903011208.2548B-100000@first.in-berlin.de>; from oliver@first.in-berlin.de on Die, Sep 03, 2002 at 01:24:23 +0200


On 2002.09.03 01:24, Oliver Bandel wrote:
> On Tue, 3 Sep 2002, Gerd Stolpmann wrote:
> 
> [...]
> > No, I have not yet found the time to do it. There are some aspects
> > that cannot be explained in mli files well enough, and a tutorial
> > would be a great thing. E.g. why ocamlnet has an object-oriented
> > layer on top of channels (netchannels). You find them everywhere,
> > but no introduction.
> > 
> > There is an "examples" directory containing some very simple, and
> > some advanced examples, especially for CGI programming.
> 
> Yes, I have seen it.
> The simple CGI-example is very nice. :)
> Looks like easy programming. :)
> The advanced examples are a little bit too large for
> Ocaml-and-ocamlnet-beginners. ;-)
> 
> I first asked, because I want to write two different
> tools:
> 
> first one: a wget-like tool, which can parse the html-pages
> (if possible, this f..... javascript-stuff too) and
> can download not only recursively, but can also
> select the pages for download e.g. by pattern-matching
> on href-tags (url or text of the link) or by selection
> of filesizes or so.

HTML parsing can be done with Nethtml. Simple example:

Nethtml.parse 
  (new Netchannels.input_string "<HTML><HEAD>...</HEAD><BODY>...</BODY></HTML>")

Returns something like

[ Element("html",[], [ Element("head",[], [ ... ]);
                       Element("body",[], [ ... ]) ]) ]

just try it in the toploop.

If there are attributes (e.g. <BODY BGCOLOR="#EEFF54">) you get them
instead of [], e.g. Element("body",[("bgcolor", "#EEFF54")], [ ... ]).
Note that all names are returned in lower-case.
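
For the wget-like tool, a tree of this shape can be walked with a small
recursive function. What follows is only a sketch: the type is a local
stand-in that mirrors the Element shape shown above (plus a Data case for
text nodes), so that the snippet runs without ocamlnet installed, and
collect_hrefs is a hypothetical helper name, not a library function.

```ocaml
(* Local stand-in mirroring the shape of the parser result (an assumption
   made so that this runs without ocamlnet installed). *)
type document =
  | Element of string * (string * string) list * document list
  | Data of string

(* Hypothetical helper: collect the href attribute of every <a> element,
   in document order. Element names are compared in lower case. *)
let rec collect_hrefs (docs : document list) : string list =
  List.concat
    (List.map
       (function
         | Data _ -> []
         | Element (name, attrs, children) ->
             let here =
               if name = "a" then
                 match List.assoc_opt "href" attrs with
                 | Some url -> [ url ]
                 | None -> []
               else []
             in
             here @ collect_hrefs children)
       docs)

(* A small hand-built tree to try the function on. *)
let sample =
  [ Element ("html", [],
      [ Element ("body", [ ("bgcolor", "#EEFF54") ],
          [ Element ("a", [ ("href", "http://caml.inria.fr") ], [ Data "Caml" ]);
            Element ("p", [],
              [ Element ("a", [ ("href", "/faq.html") ], [ Data "FAQ" ]) ]) ]) ]) ]
```

Filtering the resulting list by a pattern on the URL is then an ordinary
List.filter. Matching on the link text instead would mean returning the
children of each <a> element along with its href.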

If you want to read from a file instead of a string, just use

  new Netchannels.input_channel ch 

instead of input_string (where ch is an open in_channel).

There is no parser for javascript.

To download the HTML pages you can use netclient (distributed
separately). Simple example:

Http_client.Convenience.http_get "http://caml.inria.fr"

returns the contents of this location. If you need the HTTP headers
(sometimes announcing file sizes), you can use

let m = Http_client.Convenience.http_get_message "http://caml.inria.fr" in
let contents = m # get_resp_body () in
let size = m # assoc_resp_header "content-length" in ... (* may raise Not_found *)

If you use http_head_message, only the headers are requested from the server,
so you can determine the file size before downloading. In my own experiments
I found that some HTTP servers handle HEAD like GET, which causes protocol
errors. So be prepared for a strange exception when you call
http_head_message.
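
Selecting downloads by size then comes down to interpreting the
content-length value, which may be missing or malformed. A sketch of that
decision, using a plain association list as a stand-in for the response
headers (the list representation and the names content_length and wanted
are assumptions for illustration, not part of netclient):

```ocaml
(* Headers modelled as a plain (name, value) list with lower-case names;
   a stand-in for illustration, not the netclient message object. *)
let content_length (headers : (string * string) list) : int option =
  match List.assoc_opt "content-length" headers with
  | None -> None
  | Some v -> int_of_string_opt (String.trim v)

(* Hypothetical policy: download when the size is known and within the
   limit; when the size is unknown or unparsable, fall back to ~default. *)
let wanted ?(default = true) ~max_bytes headers =
  match content_length headers with
  | Some n -> n >= 0 && n <= max_bytes
  | None -> default
```

For example, wanted ~max_bytes:100_000 headers accepts pages up to that
size and also pages whose size the server did not announce; pass
~default:false to skip the latter.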
> 
> And the second tool I wanted to write was a similar tool
> for nntp-protocol: Download by attributes (size, date,
> MsgID, Subject, author, thread-length, ...).
> (I once wrote such stuff (not completed) in Perl
>  and after the program grew more and more, it
>  becomes more and more a mess...).

As far as I know there is no ready-to-use NNTP client. There are important
components for one, though. For example, there are parsers for messages in
email format, and there is a working implementation of the POP protocol,
which has some similarities to NNTP.

To parse an email message, you can call Netmime.read_mime_message, e.g.

Netmime.read_mime_message
  (new Netchannels.input_string "subject: xxx\nsize: 50\n...\n\nbody")

This returns a structure like

  (header, `Body(body))

where "header" and "body" are objects:

  header # field "subject"    returns "xxx"
  header # field "size"       returns "50"
  body # value                returns "body"

Note that Netmime.read_mime_message decodes multipart messages by default,
and you can also get something like

  (header, `Parts [ (part1_header, `Body part1_body); ... (partN_header, `Body partN_body)])

or even deeper nested structures. You can control this by passing the argument
multipart_style.
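
Because the nesting can be arbitrarily deep, code that consumes such a
value is usually recursive. The sketch below shows the idea on a simplified
stand-in type: headers are plain (name, value) lists and bodies are plain
strings, whereas the real values are objects. The type and the helper name
leaves are assumptions for illustration only.

```ocaml
(* Simplified stand-in for the nested structure described above. *)
type header = (string * string) list

type message = header * body
and body =
  [ `Body of string
  | `Parts of message list ]

(* Hypothetical helper: return every leaf (header, body text) pair,
   depth first, however deeply the parts are nested. *)
let rec leaves ((hdr, body) : message) : (header * string) list =
  match body with
  | `Body text -> [ (hdr, text) ]
  | `Parts parts -> List.concat (List.map leaves parts)

(* A two-level example: one plain part and one nested multipart. *)
let example : message =
  ( [ ("subject", "xxx") ],
    `Parts
      [ ([ ("content-type", "text/plain") ], `Body "hello");
        ( [ ("content-type", "multipart/mixed") ],
          `Parts [ ([ ("content-type", "text/html") ], `Body "<p>hi</p>") ] ) ] )
```

With the real library the same match on `Body and `Parts should apply,
with header # field and body # value taking the place of the list lookups
and plain strings used here.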

There is also the function Netmime.read_mime_header returning only the
header, but it is a bit more complicated to use. To parse a string:

let header = Netmime.read_mime_header 
               (new Netstream.input_stream
                  (new Netchannels.input_string "..."))

There is another object involved (input_stream) that has no effect if you
read only from a string, but that allows you to read the header from
non-seekable sources (e.g. pipes or sockets). But this is definitely
a feature for experts.

> 
> So I need access to sockets, some low-level stuff
> (Unix.read) and such, or a good library, which helps
> here. 

See the sources in netpop.ml for an example of how to write a "telnet-style"
client. Note that netpop.ml does not open sockets itself; it expects the
user of the module to pass channels that are already connected to a socket.
See the example mbox_list.ml for the socket stuff (very simple).
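
To give an idea of what "telnet-style" means in code: the client writes one
command line, reads one reply line, and inspects its status prefix. The
sketch below does this for POP3, whose status lines start with "+OK" or
"-ERR" (RFC 1939). It uses the standard library's in_channel/out_channel
rather than the object channels netpop.ml works with, and all names in it
are hypothetical; it is not how netpop.ml is written.

```ocaml
(* Result of reading one POP3 status line. *)
type reply = Ok_reply of string | Err_reply of string | Malformed of string

let starts_with prefix s =
  String.length s >= String.length prefix
  && String.sub s 0 (String.length prefix) = prefix

(* Text following the status prefix, with surrounding blanks removed. *)
let rest_after prefix s =
  let n = String.length prefix in
  String.trim (String.sub s n (String.length s - n))

let parse_reply (line : string) : reply =
  let line = String.trim line in (* also drops a trailing CR/LF *)
  if starts_with "+OK" line then Ok_reply (rest_after "+OK" line)
  else if starts_with "-ERR" line then Err_reply (rest_after "-ERR" line)
  else Malformed line

(* Send one command over already-connected channels and read the reply.
   How the channels get connected is left to the caller. *)
let command (ic : in_channel) (oc : out_channel) (cmd : string) : reply =
  output_string oc (cmd ^ "\r\n");
  flush oc;
  parse_reply (input_line ic)
```

A pair of connected channels can be obtained from Unix.open_connection in
the standard Unix library. An NNTP client would follow the same pattern,
with numeric status codes in place of "+OK" and "-ERR".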

> I need a library, which can parse me the
> html-pages and maybe nntp-headers, and I want only
> to implement the logic of the tool, and let the
> network stuff programming be the work, that the
> lib can do.
> 
> And I hope the ocamlnet/netstring can help here.
> But if it will be more effort to understand the library
> than writing the networking-code by myself, then
> I will write the sockets-stuff by myself.

I hope this short introduction gives you the right impression of
the library.

Gerd
------------------------------------------------------------
Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany 
gerd@gerd-stolpmann.de          http://www.gerd-stolpmann.de
------------------------------------------------------------