9fans - fans of the OS Plan 9 from Bell Labs
* [9fans] ceph
@ 2009-07-30  0:43 Roman V Shaposhnik
  2009-07-30 16:31 ` sqweek
  0 siblings, 1 reply; 21+ messages in thread
From: Roman V Shaposhnik @ 2009-07-30  0:43 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

This is sort of off-topic, but does anybody have any experience with
Ceph?
   http://ceph.newdream.net/

Good or bad war stories (and general thoughts) would be quite welcome.

Thanks,
Roman.





^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [9fans] ceph
  2009-07-30  0:43 [9fans] ceph Roman V Shaposhnik
@ 2009-07-30 16:31 ` sqweek
  2009-08-01  5:24   ` Roman Shaposhnik
  0 siblings, 1 reply; 21+ messages in thread
From: sqweek @ 2009-07-30 16:31 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

2009/7/30 Roman V Shaposhnik <rvs@sun.com>:
> This is sort of off-topic, but does anybody have any experience with
> Ceph?
>   http://ceph.newdream.net/
>
> Good or bad war stories (and general thoughts) would be quite welcome.

 Not with ceph itself, but the description and terminology they use
remind me a lot of lustre (seems like it's a userspace version) which
we use at work. Does a damn fine job - as long as you get a stable
version. We have run into issues trying out new versions several
times...
-sqweek



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [9fans] ceph
  2009-07-30 16:31 ` sqweek
@ 2009-08-01  5:24   ` Roman Shaposhnik
  2009-08-01  5:41     ` ron minnich
  0 siblings, 1 reply; 21+ messages in thread
From: Roman Shaposhnik @ 2009-08-01  5:24 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs


On Jul 30, 2009, at 9:31 AM, sqweek wrote:

> 2009/7/30 Roman V Shaposhnik <rvs@sun.com>:
>> This is sort of off-topic, but does anybody have any experience with
>> Ceph?
>>   http://ceph.newdream.net/
>>
>> Good or bad war stories (and general thoughts) would be quite
>> welcome.
>
> Not with ceph itself, but the description and terminology they use
> remind me a lot of lustre (seems like it's a userspace version) which
> we use at work. Does a damn fine job - as long as you get a stable
> version. We have run into issues trying out new versions several
> times...

I guess that sums up my impression of ceph so far: I don't see where it
would fit. I think that in HPC it is 99% Lustre, in enterprise it is
either CIFS or NFS, etc.

There's some internal push for it around here so I was wondering
whether I missed a memo once again...

Thanks,
Roman.



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [9fans] ceph
  2009-08-01  5:24   ` Roman Shaposhnik
@ 2009-08-01  5:41     ` ron minnich
  2009-08-01  5:53       ` Roman Shaposhnik
  0 siblings, 1 reply; 21+ messages in thread
From: ron minnich @ 2009-08-01  5:41 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

I'm not a big fan of lustre. In fact I'm talking to someone who really
wants 9p working well so he can have lustre on all but a few nodes,
and those lustre nodes export 9p.

ron



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [9fans] ceph
  2009-08-01  5:41     ` ron minnich
@ 2009-08-01  5:53       ` Roman Shaposhnik
  2009-08-01 15:47         ` ron minnich
  0 siblings, 1 reply; 21+ messages in thread
From: Roman Shaposhnik @ 2009-08-01  5:53 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Jul 31, 2009, at 10:41 PM, ron minnich wrote:
> I'm not a big fan of lustre. In fact I'm talking to someone who really
> wants 9p working well so he can have lustre on all but a few nodes,
> and those lustre nodes export 9p.


What are your clients running? What are their requirements as
far as POSIX is concerned? How much storage are we talking about?

I'd be interested in discussing some aspects of what you're trying to
accomplish with 9P for the HPC guys.

Thanks,
Roman.

P.S. If it is ok with everybody else -- I'll keep the conversation on
the list.



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [9fans] ceph
  2009-08-01  5:53       ` Roman Shaposhnik
@ 2009-08-01 15:47         ` ron minnich
  2009-08-04  1:32           ` Roman V Shaposhnik
  0 siblings, 1 reply; 21+ messages in thread
From: ron minnich @ 2009-08-01 15:47 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Fri, Jul 31, 2009 at 10:53 PM, Roman Shaposhnik<rvs@sun.com> wrote:

> What are your clients running?

Linux

> What are their requirements as
> far as POSIX is concerned?

10,000 machines, working on a single app, must have access to a common
file store with full posix semantics, and it all has to work as if it
were one machine (their desktop, of course).

This gets messy. It turns into an exercise in attempting to manage a
competing set of race conditions. It's like tuning
a multi-carbureted engine from years gone by, assuming we ever had an
engine with 10,000 cylinders.

> How much storage are we talking about?
In round numbers, for the small clusters, usually a couple hundred T.
For anything else, more.

>
> I'd be interested in discussing some aspects of what you're trying to
> accomplish with 9P for the HPC guys.

The request: for each of the (lots of) compute nodes, have them mount
over 9p to, say 100x fewer io nodes, each of those to run lustre.
Which tells you right away that our original dreams for lustre did not
quite work out.
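
The compute-node half of that is just the stock Linux v9fs client. A
rough sketch of what each node would do (the address, port, aname and
mount point below are made up, and it assumes the io node is already
exporting its lustre tree over 9p):

#include <stdio.h>
#include <sys/mount.h>

/*
 * Compute node: attach an io node's lustre tree over 9p with the
 * in-kernel v9fs client.  Equivalent to
 *   mount -t 9p -o trans=tcp,port=564,aname=/lustre 10.1.0.1 /mnt/lustre
 */
int
main(void)
{
	const char *opts = "trans=tcp,port=564,version=9p2000.u,aname=/lustre";

	if (mount("10.1.0.1", "/mnt/lustre", "9p", 0, opts) < 0) {
		perror("mount 9p");
		return 1;
	}
	return 0;
}

The point is that the compute nodes only ever speak 9p; lustre itself
never leaves the io nodes.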

In all honesty, however, the 20K-node Jaguar machine at ORNL claims to
run lustre and have it all "just work". Then again, I know as many
people who have de-installed lustre as use it.

ron



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [9fans] ceph
  2009-08-01 15:47         ` ron minnich
@ 2009-08-04  1:32           ` Roman V Shaposhnik
  2009-08-04  2:56             ` ron minnich
                               ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Roman V Shaposhnik @ 2009-08-04  1:32 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Sat, 2009-08-01 at 08:47 -0700, ron minnich wrote:
> > What are their requirements as
> > far as POSIX is concerned?
>
> 10,000 machines, working on a single app, must have access to a common
> file store with full posix semantics, and it all has to work as if it
> were one machine (their desktop, of course).
>
> This gets messy. It turns into an exercise in attempting to manage a
> competing set of race conditions. It's like tuning
> a multi-carbureted engine from years gone by, assuming we ever had an
> engine with 10,000 cylinders.

Well, with Linux, at least you have the benefit of gazillions of FS
clients being available either natively or via FUSE. With Solaris...
oh well...

> > How much storage are we talking about?
> In round numbers, for the small clusters, usually a couple hundred T.
> For anything else, more.

Is all of this storage attached to a very small number of IO nodes, or
is it evenly spread across the cluster?

In fact, I'm interested in both scenarios, so here come two questions:
  1. do we have anybody successfully managing that much storage (let's
     say ~100T) via something like a humongous fossil installation (or
     kenfs for that matter)?

  2. do we have anybody successfully managing that much storage that is
     also spread across the nodes? And if so, what are the best practices
     out there to make the client not worry about where the storage
     actually comes from (IOW, any kind of proxying of I/O, etc.)?

I'm trying to see what life after NFSv4 or AFS might look like for
the clients still clinging to the old ways of doing things, yet
trying to cooperatively use hundreds of T of storage.

> > I'd be interested in discussing some aspects of what you're trying to
> > accomplish with 9P for the HPC guys.
>
> The request: for each of the (lots of) compute nodes, have them mount
> over 9p to, say 100x fewer io nodes, each of those to run lustre.

Sorry for being dense, but what exactly is going to be accomplished
by proxying I/O in such a way?

Thanks,
Roman.




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [9fans] ceph
  2009-08-04  1:32           ` Roman V Shaposhnik
@ 2009-08-04  2:56             ` ron minnich
  2009-08-04  5:07               ` roger peppe
  2009-08-04 23:51               ` Roman V Shaposhnik
  2009-08-04  7:23             ` Tim Newsham
  2009-08-04  8:43             ` Steve Simon
  2 siblings, 2 replies; 21+ messages in thread
From: ron minnich @ 2009-08-04  2:56 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Mon, Aug 3, 2009 at 6:32 PM, Roman V Shaposhnik<rvs@sun.com> wrote:

> Is all of this storage attached to a very small number of IO nodes, or
> is it evenly spread across the cluster?

It's on a server: a big Lustre server using a DDN over (currently, I
believe) Fibre Channel.


>  2. do we have anybody successfully managing that much storage that is
>     also spread across the nodes? And if so, what are the best practices
>     out there to make the client not worry about where the storage
>     actually comes from (IOW, any kind of proxying of I/O, etc.)?

Google?



>> The request: for each of the (lots of) compute nodes, have them mount
>> over 9p to, say 100x fewer io nodes, each of those to run lustre.
>
> Sorry for being dense, but what exactly is going to be accomplished
> by proxying I/O in such a way?

it makes the unscalable distributed lock manager and other such stuff
work, because you stop asking it to scale.

ron



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [9fans] ceph
  2009-08-04  2:56             ` ron minnich
@ 2009-08-04  5:07               ` roger peppe
  2009-08-04  8:23                 ` erik quanstrom
  2009-08-04  9:55                 ` C H Forsyth
  2009-08-04 23:51               ` Roman V Shaposhnik
  1 sibling, 2 replies; 21+ messages in thread
From: roger peppe @ 2009-08-04  5:07 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

2009/8/4 ron minnich <rminnich@gmail.com>:
>>  2. do we have anybody successfully managing that much storage that is
>>     also spread across the nodes? And if so, what are the best practices
>>     out there to make the client not worry about where the storage
>>     actually comes from (IOW, any kind of proxying of I/O, etc.)?
>
> Google?

the exception that proves the rule? they emphatically
don't go for posix semantics...



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [9fans] ceph
  2009-08-04  1:32           ` Roman V Shaposhnik
  2009-08-04  2:56             ` ron minnich
@ 2009-08-04  7:23             ` Tim Newsham
  2009-08-05  0:27               ` Roman V Shaposhnik
  2009-08-04  8:43             ` Steve Simon
  2 siblings, 1 reply; 21+ messages in thread
From: Tim Newsham @ 2009-08-04  7:23 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

>  2. do we have anybody successfully managing that much storage that is
>     also spread across the nodes? And if so, what are the best practices
>     out there to make the client not worry about where the storage
>     actually comes from (IOW, any kind of proxying of I/O, etc.)?

http://labs.google.com/papers/gfs.html
http://hadoop.apache.org/common/docs/current/hdfs_design.html

> I'm trying to see what life after NFSv4 or AFS might look like for
> the clients still clinging to the old ways of doing things, yet
> trying to cooperatively use hundreds of T of storage.

The two I mention above are both used in conjunction with
distributed map/reduce calculations. Calculations are done
on the nodes where the data is stored...
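
If you'd rather poke at HDFS from C than Java, libhdfs ships with
hadoop; from memory (the host, path and sizes below are arbitrary),
writing a file looks roughly like this:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include "hdfs.h"

/* Write a small file into HDFS through libhdfs.  "default" picks up
   whatever namenode the hadoop config points at; /tmp/demo.txt is
   an arbitrary path. */
int
main(void)
{
	const char *msg = "hello, hdfs\n";
	hdfsFS fs = hdfsConnect("default", 0);
	hdfsFile f;

	if (fs == NULL) {
		fprintf(stderr, "hdfsConnect failed\n");
		return 1;
	}
	f = hdfsOpenFile(fs, "/tmp/demo.txt", O_WRONLY | O_CREAT, 0, 0, 0);
	if (f == NULL) {
		fprintf(stderr, "hdfsOpenFile failed\n");
		return 1;
	}
	hdfsWrite(fs, f, (void *)msg, strlen(msg));
	hdfsFlush(fs, f);	/* blocks get replicated across datanodes */
	hdfsCloseFile(fs, f);
	hdfsDisconnect(fs);
	return 0;
}

It links against libhdfs and a JVM, since the library is just JNI glue
around the Java client.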

> Thanks,
> Roman.

Tim Newsham
http://www.thenewsh.com/~newsham/



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [9fans] ceph
  2009-08-04  5:07               ` roger peppe
@ 2009-08-04  8:23                 ` erik quanstrom
  2009-08-04  8:52                   ` roger peppe
  2009-08-04  9:55                 ` C H Forsyth
  1 sibling, 1 reply; 21+ messages in thread
From: erik quanstrom @ 2009-08-04  8:23 UTC (permalink / raw)
  To: 9fans

> >
> > Google?
>
> the exception that proves the rule? they emphatically
> don't go for posix semantics...

why would purveyors of 9p give a rip about posix semantics?

- erik



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [9fans] ceph
  2009-08-04  1:32           ` Roman V Shaposhnik
  2009-08-04  2:56             ` ron minnich
  2009-08-04  7:23             ` Tim Newsham
@ 2009-08-04  8:43             ` Steve Simon
  2009-08-05  0:01               ` Roman V Shaposhnik
  2 siblings, 1 reply; 21+ messages in thread
From: Steve Simon @ 2009-08-04  8:43 UTC (permalink / raw)
  To: 9fans

> Well, with Linux, at least you have the benefit of gazillions of FS
> clients being available either natively or via FUSE.

Do you have a link to a site which lists interesting FUSE filesystems?
I am definitely not trying to troll; I am always intrigued by others'
ideas of how to represent data/APIs as fs.

Sadly, the FUSE fs I have seen have mostly been disappointing. There are
a few which would be handy on plan9 (gmail, ipod, svn), but most
seem less useful.

-Steve



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [9fans] ceph
  2009-08-04  8:23                 ` erik quanstrom
@ 2009-08-04  8:52                   ` roger peppe
  0 siblings, 0 replies; 21+ messages in thread
From: roger peppe @ 2009-08-04  8:52 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

2009/8/4 erik quanstrom <quanstro@quanstro.net>:
>> >
>> > Google?
>>
>> the exception that proves the rule? they emphatically
>> don't go for posix semantics...
>
> why would purveyors of 9p give a rip about posix semantics?

from ron:
> 10,000 machines, working on a single app, must have access to a common
> file store with full posix semantics, and it all has to work as if it
> were one machine (their desktop, of course).

... and gfs does things that aren't easily compatible with 9p either,
such as returning the actually-written offset when appending to a file.
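
to make that concrete (sketch only; the gfs call is pseudocode from
their paper, not a real C api):

#include <unistd.h>

/* posix append: the kernel picks the offset atomically for each write,
   but write() reports only a byte count, never where the record landed */
void
posix_append(int fd, const void *rec, size_t len)
{
	write(fd, rec, len);	/* fd was opened with O_APPEND */
}

/* gfs record append tells the client which offset it got, and that
   has no obvious home in a 9p Rwrite:
	offset = RecordAppend(file, rec, len);
*/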



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [9fans] ceph
  2009-08-04  5:07               ` roger peppe
  2009-08-04  8:23                 ` erik quanstrom
@ 2009-08-04  9:55                 ` C H Forsyth
  2009-08-04 10:25                   ` roger peppe
                                     ` (2 more replies)
  1 sibling, 3 replies; 21+ messages in thread
From: C H Forsyth @ 2009-08-04  9:55 UTC (permalink / raw)
  To: 9fans

>they emphatically don't go for posix semantics...

what are "posix semantics"?



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [9fans] ceph
  2009-08-04  9:55                 ` C H Forsyth
@ 2009-08-04 10:25                   ` roger peppe
  2009-08-04 14:45                   ` ron minnich
  2009-08-04 23:56                   ` Roman V Shaposhnik
  2 siblings, 0 replies; 21+ messages in thread
From: roger peppe @ 2009-08-04 10:25 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

2009/8/4 C H Forsyth <forsyth@vitanuova.com>:
>>they emphatically don't go for posix semantics...
>
> what are "posix semantics"?

perhaps wrongly, i'd assumed that the posix standard
implied some semantics in defining its file API, and
ron was referring to those. perhaps it defines less
than i assume - i've not studied it.

i was alluding to this sentence from the paper:

: GFS provides a familiar file system interface, though it
: does not implement a standard API such as POSIX.



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [9fans] ceph
  2009-08-04  9:55                 ` C H Forsyth
  2009-08-04 10:25                   ` roger peppe
@ 2009-08-04 14:45                   ` ron minnich
  2009-08-04 23:56                   ` Roman V Shaposhnik
  2 siblings, 0 replies; 21+ messages in thread
From: ron minnich @ 2009-08-04 14:45 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Tue, Aug 4, 2009 at 2:55 AM, C H Forsyth<forsyth@vitanuova.com> wrote:
>>they emphatically don't go for posix semantics...
>
> what are "posix semantics"?

whatever today's customer happens to think they are.

ron



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [9fans] ceph
  2009-08-04  2:56             ` ron minnich
  2009-08-04  5:07               ` roger peppe
@ 2009-08-04 23:51               ` Roman V Shaposhnik
  1 sibling, 0 replies; 21+ messages in thread
From: Roman V Shaposhnik @ 2009-08-04 23:51 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Mon, 2009-08-03 at 19:56 -0700, ron minnich wrote:
> >  2. do we have anybody successfully managing that much storage that is
> >     also spread across the nodes? And if so, what are the best practices
> >     out there to make the client not worry about where the storage
> >     actually comes from (IOW, any kind of proxying of I/O, etc.)?
>
> Google?

By "we" I mostly meant this community, but even if we don't focus on
9fans, Google is a non-example. They have no clients for this filesystem
per se.

> >> The request: for each of the (lots of) compute nodes, have them mount
> >> over 9p to, say 100x fewer io nodes, each of those to run lustre.
> >
> > Sorry for being dense, but what exactly is going to be accomplished
> > by proxying I/O in such a way?
>
> it makes the unscalable distributed lock manager and other such stuff
> work, because you stop asking it to scale.

So, strictly speaking, you are not really using 9P as a filesystem
protocol, but rather as a convenient way of doing RPC, right?

Thanks,
Roman.




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [9fans] ceph
  2009-08-04  9:55                 ` C H Forsyth
  2009-08-04 10:25                   ` roger peppe
  2009-08-04 14:45                   ` ron minnich
@ 2009-08-04 23:56                   ` Roman V Shaposhnik
  2 siblings, 0 replies; 21+ messages in thread
From: Roman V Shaposhnik @ 2009-08-04 23:56 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Tue, 2009-08-04 at 10:55 +0100, C H Forsyth wrote:
> >they emphatically don't go for posix semantics...
>
> what are "posix semantics"?

I'll bite:
   http://www.opengroup.org/onlinepubs/009695399/

   [ anything else that would take an FD as an argument ]


Thanks,
Roman.





^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [9fans] ceph
  2009-08-04  8:43             ` Steve Simon
@ 2009-08-05  0:01               ` Roman V Shaposhnik
  0 siblings, 0 replies; 21+ messages in thread
From: Roman V Shaposhnik @ 2009-08-05  0:01 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Tue, 2009-08-04 at 09:43 +0100, Steve Simon wrote:
> > Well, with Linux, at least you have the benefit of gazillions of FS
> > clients being available either natively or via FUSE.
>
> Do you have a link to a site which lists interesting FUSE filesystems?
> I am definitely not trying to troll; I am always intrigued by others'
> ideas of how to represent data/APIs as fs.

I don't, and I probably should start documenting it. The easiest way
to find them, though, is to be subscribed to the FUSE ML and collect
the domain names of posters. It turns out that anybody who's doing cloud
storage these days does it via FUSE (which might not be as surprising
if you think about what the dominant OS on EC2 is). You have companies
ranging from startups:
   http://www.nirvanix.com/
all the way to tyrannosaurs like EMC and IBM betting on FUSE to get
them to storage in the cloud.

Sadly, none of them are open source as far as I can tell.
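
The machinery itself is tiny, which is part of the appeal: a read-only
FUSE fs that serves one in-memory file is roughly the following C
(written from memory of the libfuse "hello" example, untested, all
names made up):

#define FUSE_USE_VERSION 26

#include <fuse.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>

/* the entire "data source" of this fs: one string, exposed as /hello */
static const char *hello_str = "hello from fuse\n";
static const char *hello_path = "/hello";

static int hellofs_getattr(const char *path, struct stat *st)
{
	memset(st, 0, sizeof *st);
	if (strcmp(path, "/") == 0) {
		st->st_mode = S_IFDIR | 0755;
		st->st_nlink = 2;
	} else if (strcmp(path, hello_path) == 0) {
		st->st_mode = S_IFREG | 0444;
		st->st_nlink = 1;
		st->st_size = strlen(hello_str);
	} else
		return -ENOENT;
	return 0;
}

static int hellofs_readdir(const char *path, void *buf, fuse_fill_dir_t fill,
                           off_t off, struct fuse_file_info *fi)
{
	if (strcmp(path, "/") != 0)
		return -ENOENT;
	fill(buf, ".", NULL, 0);
	fill(buf, "..", NULL, 0);
	fill(buf, hello_path + 1, NULL, 0);	/* strip leading '/' */
	return 0;
}

static int hellofs_open(const char *path, struct fuse_file_info *fi)
{
	if (strcmp(path, hello_path) != 0)
		return -ENOENT;
	if ((fi->flags & O_ACCMODE) != O_RDONLY)
		return -EACCES;
	return 0;
}

static int hellofs_read(const char *path, char *buf, size_t size, off_t off,
                        struct fuse_file_info *fi)
{
	size_t len = strlen(hello_str);

	if ((size_t)off >= len)
		return 0;
	if (off + size > len)
		size = len - off;
	memcpy(buf, hello_str + off, size);
	return (int)size;
}

static struct fuse_operations hellofs_ops = {
	.getattr = hellofs_getattr,
	.readdir = hellofs_readdir,
	.open    = hellofs_open,
	.read    = hellofs_read,
};

int main(int argc, char *argv[])
{
	return fuse_main(argc, argv, &hellofs_ops, NULL);
}

Mount it with ./hellofs /mnt/point and the string shows up as
/mnt/point/hello; everything interesting about representing an API as
a fs goes into getattr/readdir/read.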

> Sadly, the FUSE fs I have seen have mostly been disappointing. There are
> a few which would be handy on plan9 (gmail, ipod, svn), but most
> seem less useful.

The open-source ones are not all that impressive, I agree.

Thanks,
Roman.




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [9fans] ceph
  2009-08-04  7:23             ` Tim Newsham
@ 2009-08-05  0:27               ` Roman V Shaposhnik
  2009-08-05  3:30                 ` Tim Newsham
  0 siblings, 1 reply; 21+ messages in thread
From: Roman V Shaposhnik @ 2009-08-05  0:27 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Mon, 2009-08-03 at 21:23 -1000, Tim Newsham wrote:
> >  2. do we have anybody successfully managing that much storage that is
> >     also spread across the nodes? And if so, what are the best practices
> >     out there to make the client not worry about where the storage
> >     actually comes from (IOW, any kind of proxying of I/O, etc.)?
>
> http://labs.google.com/papers/gfs.html
> http://hadoop.apache.org/common/docs/current/hdfs_design.html
>
> > I'm trying to see what life after NFSv4 or AFS might look like for
> > the clients still clinging to the old ways of doing things, yet
> > trying to cooperatively use hundreds of T of storage.
>
> the two I mention above are both used in conjunction with
> distributed map/reduce calculations.  Calculations are done
> on the nodes where the data is stored...

Hadoop and GFS are good examples and they work great for the
single distributed application that is *written* with them
in mind.

Unfortunately, I cannot stretch my imagination hard enough
to see them as general-purpose filesystems backing up data
for gazillions of non-cooperative applications. The sort
of thing NFS and AFS were built to accomplish.

In that respect, ceph is more what I have in mind: it
assembles storage from clusters of unrelated OSDs into
a hierarchy with a single point of entry for every
user/application.

The question, however, is how to avoid the complexity of
ceph and still have it look like a humongous kenfs or
fossil from the outside.

Thanks,
Roman.




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [9fans] ceph
  2009-08-05  0:27               ` Roman V Shaposhnik
@ 2009-08-05  3:30                 ` Tim Newsham
  0 siblings, 0 replies; 21+ messages in thread
From: Tim Newsham @ 2009-08-05  3:30 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> Hadoop and GFS are good examples and they work great for the
> single distributed application that is *written* with them
> in mind.
>
> Unfortunately, I cannot stretch my imagination hard enough
> to see them as general-purpose filesystems backing up data
> for gazillions of non-cooperative applications. The sort
> of thing NFS and AFS were built to accomplish.

I *think* the folks at google also use GFS for shared $HOME
(i.e., to stash files they want to share with others). I
could be wrong.

> Roman.

Tim Newsham
http://www.thenewsh.com/~newsham/



^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2009-08-05  3:30 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-07-30  0:43 [9fans] ceph Roman V Shaposhnik
2009-07-30 16:31 ` sqweek
2009-08-01  5:24   ` Roman Shaposhnik
2009-08-01  5:41     ` ron minnich
2009-08-01  5:53       ` Roman Shaposhnik
2009-08-01 15:47         ` ron minnich
2009-08-04  1:32           ` Roman V Shaposhnik
2009-08-04  2:56             ` ron minnich
2009-08-04  5:07               ` roger peppe
2009-08-04  8:23                 ` erik quanstrom
2009-08-04  8:52                   ` roger peppe
2009-08-04  9:55                 ` C H Forsyth
2009-08-04 10:25                   ` roger peppe
2009-08-04 14:45                   ` ron minnich
2009-08-04 23:56                   ` Roman V Shaposhnik
2009-08-04 23:51               ` Roman V Shaposhnik
2009-08-04  7:23             ` Tim Newsham
2009-08-05  0:27               ` Roman V Shaposhnik
2009-08-05  3:30                 ` Tim Newsham
2009-08-04  8:43             ` Steve Simon
2009-08-05  0:01               ` Roman V Shaposhnik
