caml-list - the Caml user's mailing list
From: Prashanth Mundkur <pmundkur.ocaml@gmail.com>
To: caml-list@inria.fr
Subject: [Caml-list] [ANNOUNCE] ODisco, for large-scale data processing in OCaml
Date: Wed, 11 May 2011 11:16:05 -0700
Message-ID: <20110511181605.GA26425@damage.nokiapaloalto.com>

Hello,

The Disco team is pleased to announce ODisco, which makes it possible
to do large-scale data analysis (à la map-reduce) in OCaml.

Disco [1] is an open-source distributed computing framework inspired
by the map-reduce paradigm.  It includes a distributed replicating
tag-based filesystem that allows you to store your datasets in a
fault-tolerant manner.  Disco comes with additional tools: DiscoDB [2]
for implementing efficient mapping objects and Discodex [3] for
distributed indices for querying large datasets.

Disco has been in production use at Nokia for two years, and is used
to process terabytes of data daily [4].

The core job scheduling, cluster monitoring and filesystem logic of
Disco is written in Erlang, leveraging the strengths of Erlang in
concurrency and distribution.  The primary language for writing
compute jobs is currently Python; however, the latest Disco 0.4
release [5] has opened up the Disco worker interface, allowing jobs
to be written in any language.

ODisco is the first available non-Python implementation of this Disco
worker interface, and allows distributed processing of large-scale
datasets in OCaml.  The computation is not restricted to a
record-oriented key-value style interface; the OCaml task directly
gets access to the input data source and writes the output data in
whatever format it chooses.  However, the overall computation
currently still follows the traditional map-reduce dataflow, with
map/shuffle/reduce stages.
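
For readers new to this dataflow, here is a minimal in-memory sketch of
the three stages as a word count.  It is an illustration only and does
not use the ODisco API: the names map_line, shuffle, reduce_group and
word_count are hypothetical, and a real job would read its input from
Disco and run its tasks across a cluster instead of in one process.

```ocaml
(* In-memory illustration only: this does not use the ODisco API.
   All function names below are hypothetical. *)

(* map: turn one input line into (word, 1) pairs. *)
let map_line (line : string) : (string * int) list =
  String.split_on_char ' ' line
  |> List.filter (fun w -> w <> "")
  |> List.map (fun w -> (w, 1))

(* shuffle: group all values that share a key, sorted by key. *)
let shuffle (pairs : (string * int) list) : (string * int list) list =
  let groups = Hashtbl.create 16 in
  List.iter
    (fun (k, v) ->
      let vs = try Hashtbl.find groups k with Not_found -> [] in
      Hashtbl.replace groups k (v :: vs))
    pairs;
  Hashtbl.fold (fun k vs acc -> (k, vs) :: acc) groups []
  |> List.sort compare

(* reduce: collapse the values of one key into a single result. *)
let reduce_group ((word, counts) : string * int list) : string * int =
  (word, List.fold_left (+) 0 counts)

(* The full pipeline: map every line, shuffle by key, reduce each group. *)
let word_count (lines : string list) : (string * int) list =
  lines
  |> List.map map_line
  |> List.concat
  |> shuffle
  |> List.map reduce_group
```

In a distributed setting the map and reduce steps run as separate tasks
on different nodes, and the shuffle moves intermediate data between
them; the sketch collapses all of that into ordinary function calls.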

ODisco is available at https://github.com/pmundkur/odisco and also in
the 3.12 section of Godi as the godi-odisco package.

Please let us know if you have any issues with either ODisco or Disco
on the Disco mailing list.

Happy hacking!

[1] Disco Project, http://discoproject.org
[2] DiscoDB, http://discoproject.org/doc/contrib/discodb/discodb.html
[3] Discodex, http://discoproject.org/doc/contrib/discodex/discodex.html#discodex
[4] Disco at Nokia, http://www.erlang-factory.com/conference/SFBay2011/speakers/VilleTuulos
[5] Disco 0.4 release, http://disco.posterous.com/disco-04

--prashanth
