From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Thu, 22 Oct 2009 02:43:23 +1100
From: Sam Watkins
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
Message-ID: <20091021154323.GA10118@nipl.net>
References: <20091015105328.GA18947@nipl.net> <4030fb6ae37f8ca8ae9c43ceefbdf57b@ladd.quanstro.net> <20091019155738.GB13857@nipl.net> <4ADD1D76.8050603@maht0x0r.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <4ADD1D76.8050603@maht0x0r.net>
User-Agent: Mutt/1.5.13 (2006-08-11)
Subject: Re: [9fans] Barrelfish
Topicbox-Message-UUID: 8d67e2ae-ead5-11e9-9d60-3106f5b1d025

I wrote:
> I calculated roughly that encoding a 2-hour video could be parallelized by
> a factor of perhaps 20 trillion, using pipelining and divide-and-conquer.

On Tue, Oct 20, 2009 at 03:16:22AM +0100, matt wrote:
> I know you are using video / audio encoding as an example and there are
> probably datasets that make sense but in this case, what use is it?

I was using it to work out the *maximum* extent to which a common task can
be parallelized. 20-trillion-fold is the answer I came up with. Someone was
talking about Amdahl's Law and saying that having large numbers of
processors is not much use, because Amdahl's Law limits their utilization.
I disagree. In reality, 10,000 processing units might be a more sensible
number to have than 20 trillion.

If you have ever done H.264 video encoding on a PC, you will know that it
is very slow; even normal MPEG encoding is barely faster than real time on
a 1 GHz PC. Few people like having to wait 2 hours for a task to complete.

This whole argument / discussion has come out of nowhere, since it appears
Ken's original comment was criticising the normal sort of multi-core
systems, and he is more in favor of other approaches such as FPGAs. I
fully agree with that.

> You can't watch 2 hours of video per second and you can't write it to
> disk fast enough to empty the pipeline.
If I had a computer with 20 trillion processing units, capable of recoding
2 billion hours of video per second, I would have superior storage media
and IO systems to go with it. The system I described could encode 2
BILLION hours of video per second, not 2 hours per second.

> You've got to feed in 2 hours of source material - 820Gb per stream, how?

I suppose some sort of parallel bus of wires or optic fibres. If I have
massively parallel processing, I would want massively parallel IO to go
with it, i.e. something like "read data starting from here" -> "here it
is, streaming one megabit in parallel down the bus at 1 GHz over 1 million
channels".

> Once you have your uncompressed stream, MPEG-2 encoding requires seeking
> through the time dimension with keyframes every n frames and out of order
> macro blocks, so we have to wait for n frames to be composited. For the
> best quality the datarate is unconstrained on the first processing run
> and then macro blocks best-fitted and re-ordered on the second to match
> the desired output datarate, but again, this is n frames at a time.
>
> Amdahl is punching you in the face every time you say "see, it's easy".

I'm no expert on video encoding, but it seems to me you are assuming I
would approach it the conventional, stupid, serial way. With massively
parallel processing, one could "seek" through the time dimension simply by
comparing data from all time offsets at once, in parallel.

Can you give one example of a slow task that you think cannot benefit much
from parallel processing? Video is an extremely obvious example of one
that certainly does benefit from just about as much parallel processing as
you can throw at it, so I'm surprised you would argue about it. Probably
my "20 trillion" upset you or something; it seems you didn't get my point.

It might have been better to consider a simpler example, such as frequency
analysis of audio data to perform pitch correction (for out-of-tune
singers).
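To sketch what I mean by that pitch-analysis example: a naive DFT, a few
lines of stdlib Python. The tone, sample rate, and durations here are made
up for illustration; the point is that every frequency bin (and every block
of audio) is computed independently of the others, so the whole thing is
embarrassingly parallel.

```python
import cmath, math

def dominant_frequency(samples, sample_rate):
    """Naive O(N^2) DFT: return the frequency whose bin has the largest
    magnitude. A real system would use an FFT, but note that each bin's
    sum is independent -- each could run on its own processing unit."""
    n = len(samples)
    best_bin, best_mag = 0, 0.0
    for k in range(1, n // 2):
        coeff = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        if abs(coeff) > best_mag:
            best_bin, best_mag = k, abs(coeff)
    return best_bin * sample_rate / n

# A synthetic 100 Hz sine sampled at 800 Hz for half a second:
rate = 800
tone = [math.sin(2 * math.pi * 100 * t / rate) for t in range(400)]
print(dominant_frequency(tone, rate))  # 100.0
```

A pitch corrector would then shift each block toward the nearest in-tune
frequency; every block of the recording can be analysed at the same time.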
I can write a simple shell script using ffmpeg to do H.264 video encoding
which would take advantage of perhaps 720 "cores", encoding a two-hour
video in 10-second chunks with barely any Amdahl effects, running the
encoding over a LAN.

A server should be able to pipe the whole 800Mb input - I am assuming it
is already encoded in xvid or something - over the network in about 10
seconds on a gigabit (or faster) network. Each participating computer will
receive the chunk of data it needs. The encoding would take perhaps 30
seconds for the 10 seconds of video on each of 720 1 GHz computers, and
another 10 seconds to pipe the data back to the server. Concatenating the
video should take very little time, although perhaps the mp4 format is not
the best for that; I'm not sure.

The entire operation takes 50 seconds, as opposed to 6 hours (21600
seconds). With my 721 computers I achieve a 432-fold speedup. Amdahl's Law
is not costing much there, only a little for transferring data around. And
each computer could be doing something else while waiting for its chunk of
data to arrive, so the total actual utilization can be over 99%. People do
this stuff every day. Have you heard of a render farm?

This applies to all Amdahl arguments: if part of the system is idle due to
serial constraints in the algorithm, it could likely be working on
something else. Perhaps you have a couple of videos to recode? Then you
can achieve close to 100% utilization. The time taken for a single task
may be limited by the method or the hardware, but a batch of several tasks
can be completed close to N times faster if you have N
processors/computers.

I'm not sure why I'm wasting time writing about this; it's obvious anyway.

Sam
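PS: for anyone who wants to check the arithmetic, here it is in a few
lines of Python. All figures are the ones from the scenario above; this is
a back-of-envelope sketch, not a benchmark.

```python
# Figures from the 720-machine scenario: 2-hour video, 10-second chunks.
serial_seconds = 21600   # one 1 GHz machine: ~6 hours for the whole video
distribute = 10          # pipe the source chunks out over the gigabit LAN
encode = 30              # encode one 10-second chunk on one machine
collect = 10             # pipe the encoded chunks back to the server

# All 720 chunks are encoded concurrently, so wall-clock time is just one
# chunk's distribute + encode + collect:
parallel_seconds = distribute + encode + collect
speedup = serial_seconds / parallel_seconds
print(parallel_seconds, speedup)   # 50 seconds total, 432.0x

# The Amdahl-style limit: the 20 s of data transfer is the serial part,
# so even with unlimited machines the job can never finish in under 20 s,
# capping the speedup at 21600 / 20 = 1080x -- unless, as above, the idle
# time is spent working on another video.
max_speedup = serial_seconds / (distribute + collect)
print(max_speedup)                 # 1080.0
```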