From: "ron minnich"
Date: Fri, 26 Oct 2007 22:44:38 -0700
To: "Fans of the OS Plan 9 from Bell Labs" <9fans@cse.psu.edu>
Subject: Re: [9fans] parallel/distributed computation

On 10/26/07, erik quanstrom wrote:
> could you elaborate or give a pointer explaining why
> bsp is insufficient?

BSP is essentially all about "the inner loop". In that loop you do the
work, and at the bottom of the loop you tell everyone what you have done.
So at any moment you are either computing or communicating, which means
that on your $100M computer you are using only about $50M of it, averaged
over time. That is undesirable. Nowadays people work fairly hard to ensure
that while computation is happening, the network is busy moving data.

This problem with BSP is well known, which is why some folks have tried to
time-share the nodes in the following way
(www.ccs3.lanl.gov/pal/publications/papers/petrini01:feng.pdf): run N jobs
per node (N is usually 2), so that while N-1 jobs are using the network,
and hence not computing, one job is computing. Of course, matching this
all up is hard, and most compute jobs are sized to use all of memory, so
this approach has not seen much use; the nodes on the big machines are
typically not shared between jobs.

BSP was an interesting idea, but it is not commonly used any more, at
least on the systems I know about. Rather, people work hard to overlap
communication and computation.

ron

p.s. for more recent work see: www.cs.unm.edu/~fastos/06meeting/sft.pdf
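
For concreteness, here is a minimal sketch of the distinction the post
draws, written in C with MPI (the post itself names no library; compute(),
the outbox/inbox buffers, and the loop structure are hypothetical). The
BSP-style superstep computes and then stops to exchange; the overlapped
version posts nonblocking transfers first and computes while the data
moves.

/* Rough sketch: BSP-style superstep vs. overlapped compute/communicate.
 * Assumes MPI (build with mpicc, run with mpirun); compute(), outbox,
 * and inbox are placeholders, not anything from the original post. */
#include <mpi.h>
#include <stdlib.h>

#define N 1024

static void compute(double *local, int n)
{
    int i;

    for (i = 0; i < n; i++)             /* stand-in for the real work */
        local[i] = local[i] * 0.5 + 1.0;
}

/* BSP style: compute, then stop and exchange. The network idles while
 * compute() runs, and the CPUs idle during the exchange. */
static void bsp_step(double *local, double *outbox, double *inbox,
                     int left, int right)
{
    compute(local, N);
    MPI_Sendrecv(outbox, N, MPI_DOUBLE, right, 0,
                 inbox,  N, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Barrier(MPI_COMM_WORLD);        /* end of the superstep */
}

/* Overlapped style: post nonblocking transfers first, compute while the
 * network moves the data, then wait for completion. */
static void overlapped_step(double *local, double *outbox, double *inbox,
                            int left, int right)
{
    MPI_Request req[2];

    MPI_Irecv(inbox,  N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(outbox, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[1]);
    compute(local, N);                  /* local work overlaps the transfer */
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
}

int main(int argc, char **argv)
{
    int rank, size, left, right, step;
    double *local, *outbox, *inbox;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    left = (rank - 1 + size) % size;
    right = (rank + 1) % size;

    local = calloc(N, sizeof(double));
    outbox = calloc(N, sizeof(double));
    inbox = calloc(N, sizeof(double));

    for (step = 0; step < 10; step++)   /* or bsp_step(...) for comparison */
        overlapped_step(local, outbox, inbox, left, right);

    free(local);
    free(outbox);
    free(inbox);
    MPI_Finalize();
    return 0;
}

In the BSP version the network sits idle during compute() and the CPUs sit
idle during the exchange; in the overlapped version the MPI_Isend/MPI_Irecv
proceed while compute() runs, which is the overlap of communication and
computation the post describes.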