Originally posted by TreadUpon
They're wasting their time and money. I've seen the answer. It is built. It's been built up to petaflop processing power and it's the size of a VCR. It's already passed a technical audit with AT&T and Verizon because of our plan...that of course I'm not at liberty to talk about yet in detail.
It's coming but not on our time schedule. It's kinda messy at the top...
Hints? No moving parts.
This is a series-parallel argument, here. A parallel architecture can operate on series instructions - an arithmetic unit will perform an operation and pass it off into the cache for another (perhaps specialized) unit to operate on it before updating the value in the cache. It might, technically, be slower than a single complex instruction architecture - but you can operate on hundreds of the things at any given time.
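To make that series-parallel point concrete, here is a minimal sketch (invented for illustration, not any poster's code; worker count and the formula are arbitrary): each element runs through the same serial chain of arithmetic, but many elements are in flight at once across worker threads.

#include <cmath>
#include <future>
#include <vector>

int main() {
    std::vector<double> data(100000, 2.0), out(data.size());
    const std::size_t workers = 8;                        // assumed worker count
    const std::size_t chunk = data.size() / workers;

    std::vector<std::future<void>> tasks;
    for (std::size_t w = 0; w < workers; ++w) {
        std::size_t begin = w * chunk;
        std::size_t end = (w + 1 == workers) ? data.size() : begin + chunk;
        tasks.push_back(std::async(std::launch::async, [&, begin, end] {
            for (std::size_t i = begin; i < end; ++i) {
                double v = data[i] + 1.0;                 // serial chain of steps
                v = std::sqrt(v);                         // ...per element...
                out[i] = v * 0.5;                         // ...but elements are independent
            }
        }));
    }
    for (auto& t : tasks) t.get();                        // wait for every chunk
}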
A parallel architecture can operate on multiple instructions at once only if the instructions don't depend on the previous ones, and for a lot of problems they do. A simple example is data compression, which is inherently serial.
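For what it's worth, the loop-carried dependency being described can be sketched with a toy adaptive model (illustrative only, not a real compressor): the cost charged to symbol i depends on statistics updated by every symbol before it, so the loop cannot simply be split up.

#include <array>
#include <cmath>
#include <cstdint>
#include <string>
#include <vector>

int main() {
    std::string input = "abracadabra";
    std::array<std::uint32_t, 256> freq{};
    freq.fill(1);                                  // adaptive model state
    std::uint64_t total = 256;
    std::vector<double> bits;                      // ideal code length per symbol

    for (unsigned char c : input) {                // inherently serial loop
        bits.push_back(std::log2(double(total) / freq[c]));  // depends on ALL prior updates
        ++freq[c];                                 // the update feeds the next iteration
        ++total;
    }
}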
The goal of our recent research was to develop parallel lossless compression algorithms that avoid both problems described above. We want to have B computational units processing a single length-N input, such that the compression ratio is almost as good as that obtained with a regular (serial) compression algorithm.
Building on these insights, we developed techniques that compress using B parallel computational units, where each unit has only O(N/B) memory. The units learn the statistics of the input in a collaborative manner; the compression itself is then performed in parallel. Our approach results in a speedup that is linear in the number of parallel processing units B while the compression quality experiences only a minor degradation.
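Read charitably, the two-phase idea can be sketched like this (a toy under my own assumptions, not the authors' algorithm): B workers each scan an N/B slice to gather statistics, the statistics are merged into one shared model, and each slice is then encoded against that model in parallel. Real entropy coding is replaced by an ideal code-length estimate.

#include <array>
#include <cmath>
#include <cstdint>
#include <future>
#include <string>
#include <vector>

using Hist = std::array<std::uint64_t, 256>;

Hist count(const std::string& s, std::size_t b, std::size_t e) {
    Hist h{};
    for (std::size_t i = b; i < e; ++i) ++h[(unsigned char)s[i]];
    return h;
}

double encode_cost(const std::string& s, std::size_t b, std::size_t e,
                   const Hist& model, double total) {
    double bits = 0.0;
    for (std::size_t i = b; i < e; ++i)
        bits += std::log2(total / (model[(unsigned char)s[i]] + 1.0));  // +1 smoothing
    return bits;
}

int main() {
    std::string input(1 << 20, 'a');                        // toy input
    const std::size_t B = 8;                                // parallel units
    const std::size_t slice = input.size() / B;

    // Phase 1: each unit learns the statistics of its own slice in parallel.
    std::vector<std::future<Hist>> learn;
    for (std::size_t w = 0; w < B; ++w)
        learn.push_back(std::async(std::launch::async, count, std::cref(input),
                                   w * slice, (w + 1 == B) ? input.size() : (w + 1) * slice));

    // The slice statistics are merged into one shared model (the "collaborative" step).
    Hist model{};
    for (auto& f : learn) {
        Hist h = f.get();
        for (std::size_t c = 0; c < 256; ++c) model[c] += h[c];
    }
    const double total = double(input.size()) + 256.0;

    // Phase 2: each unit encodes its slice against the shared model, in parallel.
    std::vector<std::future<double>> enc;
    for (std::size_t w = 0; w < B; ++w)
        enc.push_back(std::async(std::launch::async, encode_cost, std::cref(input),
                                 w * slice, (w + 1 == B) ? input.size() : (w + 1) * slice,
                                 std::cref(model), total));
    double compressed_bits = 0.0;
    for (auto& f : enc) compressed_bits += f.get();
}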
Abstract: In this paper, we present parallel algorithms for lossless data compression based on the Burrows-Wheeler Transform (BWT) block-sorting technique. We investigate the performance of using data parallelism and task parallelism for both multi-threaded and message-passing programming. The output produced by the parallel algorithms is fully compatible with their sequential counterparts. To balance the workload among processors we develop a task scheduling strategy. An extensive set of experiments is performed with a shared memory NUMA system using up to 120 processors and on a distributed memory cluster using up to 100 processors. Our experimental results show that significant speedup can be achieved with both data parallel and task parallel methodologies. These algorithms will greatly reduce the amount of time it takes to compress large amounts of data while the compressed data remains in a form that users without access to multiple processor systems can still use.
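Not the paper's implementation, but for anyone unfamiliar with BWT, here is a naive single-block transform: each block can be transformed independently like this, which is what makes block-sorting compressors friendly to the data-parallel approach the abstract describes.

#include <algorithm>
#include <string>
#include <vector>

std::string bwt(const std::string& block) {
    std::vector<std::string> rotations;
    for (std::size_t i = 0; i < block.size(); ++i)
        rotations.push_back(block.substr(i) + block.substr(0, i));  // all rotations
    std::sort(rotations.begin(), rotations.end());                  // the block sort
    std::string last;
    for (const auto& r : rotations) last += r.back();               // last column is the output
    return last;
}

int main() { std::string out = bwt("banana"); }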
Sorry, but you are wrong here. Tasks that depend on previous results cannot be parallelized. Think of something as simple as sin(log(x*x)) or the Fibonacci sequence, f(n) = f(n-1) + f(n-2).
This is simply not the case. Virtually any task out there can be accomplished in a parallel manner.
compression quality experiences only a minor degradation
I am aware of parallel compression approaches. What they essentially do is split the data into blocks and compress them in parallel while sharing some state. But you have to understand that the compression algorithms themselves are serial: they need to be aware of the previous state to work.
Sorry, but you are wrong here. Tasks that depend on previous results cannot be parallelized. Think of something as simple as sin(log(x*x)) or the Fibonacci sequence, f(n) = f(n-1) + f(n-2).
Originally posted by Aim64C
In that context, yes, every operation is serial in nature. The formula used to assign a hue and shade to a pixel is serial in nature (or many parts of it). All the programming for a graphics card does is assign 'computers' to each pixel before handing it off to a z-buffer with some post processing.
Blocks of fragments (not pixels, blame MS for that misconception) are processed in lockstep. They get their own registers etc. but they will execute the same instructions.
This is why something like branching (which a CPU handles with ease) is such a headache: all fragments in the block must take the same branch in order to avoid a stall.
Furthermore, fragments are written to the framebuffer. The z-buffer is entirely optional (it can be disabled) and is used only for rendering fragments in the correct z-order.[/pedantry]
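As a hedged, CPU-side toy (not real GPU code; the block size and values are arbitrary) of why that divergence hurts: the whole block executes in lockstep, so when a branch splits the fragments, both paths are run under a mask and every fragment effectively pays for both.

#include <array>
#include <cstdio>

int main() {
    constexpr int kBlock = 4;                        // e.g. a 2x2 fragment block
    std::array<float, kBlock> depth{0.1f, 0.9f, 0.2f, 0.8f};
    std::array<float, kBlock> color{};
    std::array<bool, kBlock> mask{};

    // "if (depth < 0.5)" is evaluated for every fragment in lockstep.
    for (int i = 0; i < kBlock; ++i) mask[i] = depth[i] < 0.5f;

    // Taken path: executed by the whole block, results masked in.
    for (int i = 0; i < kBlock; ++i) { float v = depth[i] * 2.0f; if (mask[i]) color[i] = v; }
    // Not-taken path: also executed by the whole block, masked the other way.
    for (int i = 0; i < kBlock; ++i) { float v = 1.0f - depth[i]; if (!mask[i]) color[i] = v; }

    for (int i = 0; i < kBlock; ++i) std::printf("fragment %d -> %.2f\n", i, color[i]);
}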
Originally posted by Aim64C
*sigh* There is no misconception. There is a -linear- computational process behind the values assigned to each -pixel- on a display, regardless of how the hardware and software go about establishing it.
The CPU doesn't handle it with "ease." A single processor simplifies coding; there is a considerable difference.
Parallel branching programs are the same as the Fibonacci sequence: the bigger your problem, the greater the gain from processing in parallel. It may not always be the simplest programming execution, but a lot of that is becoming transparent these days with software-assisted debugging and coding.
Originally posted by jonnywhite
I program as a hobby. Got a 2-year degree over 10 years ago, but I have always hated myself for not keeping up with the times. I've felt for some time now that parallel/multi-threaded programming is where the future seems to be going. I haven't researched this as I should. I think my brain is very sequential, and it's difficult for me to imagine how things I've done in the past could be converted to a multi-threaded approach. Additionally, I have to wonder whether completely abandoning the old ways of doing things in favor of something that's more friendly to the multi-threaded world is not the better path to go down. What I am saying here is that we shouldn't wrap our processors around our programs. It should be the other way around: we should wrap our programs around the processors. This might mean a drastic change in how games or other software is created and used or played. Just a thought!
Originally posted by TreadUpon
reply to post by _Phoenix_
We were supposed to have a press conference next week but it might have been extended...SOON. And it's market ready.
Pixels are the final output of the framebuffer. Fragments are the intermediary values; a pixel will often be shaded by many fragments. That is the misconception. Of course the operations behind fragment shading are serial in nature, but it is the nature of graphics processing that makes GPUs massively parallel: all of the fragments written in a single pass will execute the same instructions.
Even on recent GPUs, branching can (and often will) cause a stall. The penalty of the stall depends on the number of fragments processed in the block. The block size can be as small as 2x2, but also much larger. A single fragment taking a different branch will cause the entire block to stall.
The debugging tools for GPUs are nearly non-existent. They do exist, but in very crude and limited forms.
Originally posted by Aim64C
This is not necessarily so in today's architectures. DX and OpenGL are expanding not only the types of processing done within the GPU but also the very architecture that does the processing. Today's GPUs are substantially different from what they were just seven years ago.
This is down to your process management strategies, really. It is also somewhat dependent upon the architecture (as to what management strategies will be most effective). This is no different than branch prediction in the CPU - you are simply dedicating a process to it and applying it across various smaller processes.
Pish posh. AMD Stream and Nvidia's CUDA are both fully supported with native C and C++ language support. Many programming environments (including Visual C++) have tools that support automatic serial-to-parallel coding conversions (where available). The field of GPU computing has -exploded- since the appearance of DX10 and has only gotten crazier since DX11-compliant hardware appeared.
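As one concrete (and deliberately vendor-neutral) flavor of the serial-to-parallel conversions mentioned above, C++17's standard parallel algorithms hand the same per-element work to multiple cores; CUDA and Stream specifics differ, so treat this only as a sketch of the idea.

#include <algorithm>
#include <cmath>
#include <execution>
#include <vector>

int main() {
    std::vector<float> in(1 << 20, 3.0f), out(in.size());

    // Serial form:
    for (std::size_t i = 0; i < in.size(); ++i) out[i] = std::sqrt(in[i]) + 1.0f;

    // Parallel form: same per-element work, distributed across cores.
    std::transform(std::execution::par, in.begin(), in.end(), out.begin(),
                   [](float x) { return std::sqrt(x) + 1.0f; });
}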
Parallel execution of Fibonacci algorithms is both an example of the limitations of parallel computing and an example of how dynamic it can be. As the number of recursive operations increases, the parallelism available in the algorithm also increases. For a short run of the sequence, parallel execution is not necessary and only increases the overhead. However, when you expect a long run of the sequence, there is a substantial benefit to implementing parallel processing.
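A minimal sketch of that trade-off (my own toy, not anyone's production code; the cutoff value is arbitrary): the recursive Fibonacci tree is split across tasks only while n is above a cutoff, and below it the cost of spawning a task outweighs the work, so the code falls back to plain serial recursion.

#include <cstdint>
#include <future>

std::uint64_t fib_serial(int n) {
    return n < 2 ? n : fib_serial(n - 1) + fib_serial(n - 2);
}

std::uint64_t fib_parallel(int n, int cutoff = 25) {
    if (n < cutoff) return fib_serial(n);            // small subproblem: overhead not worth it
    auto left = std::async(std::launch::async, fib_parallel, n - 1, cutoff);
    std::uint64_t right = fib_parallel(n - 2, cutoff);  // reuse the current thread
    return left.get() + right;
}

int main() { std::uint64_t f = fib_parallel(32); (void)f; }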