Originally posted by TreadUpon
They're wasting their time and money. I've seen the answer. It is built. It's been built up to petaflop processing power and it's the size of a VCR. It's already passed a technical audit with AT&T and Verizon because of our plan...that of course I'm not at liberty to talk about yet in detail.
It's coming but not on our time schedule. It's kinda messy at the top...
Hints? No moving parts.
This is a series-parallel argument, here. A parallel architecture can operate on series instructions - an arithmetic unit will perform an operation and pass it off into the cache for another (perhaps specialized) unit to operate on it before updating the value in the cache. It might, technically, be slower than a single complex instruction architecture - but you can operate on hundreds of the things at any given time.
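To make that series-parallel point concrete, here is a minimal sketch (invented for illustration, not any poster's code; worker count and the formula are arbitrary): each element runs through the same serial chain of arithmetic, but many elements are in flight at once across worker threads.

#include <cmath>
#include <future>
#include <vector>

int main() {
    std::vector<double> data(100000, 2.0), out(data.size());
    const std::size_t workers = 8;                        // assumed worker count
    const std::size_t chunk = data.size() / workers;

    std::vector<std::future<void>> tasks;
    for (std::size_t w = 0; w < workers; ++w) {
        std::size_t begin = w * chunk;
        std::size_t end = (w + 1 == workers) ? data.size() : begin + chunk;
        tasks.push_back(std::async(std::launch::async, [&, begin, end] {
            for (std::size_t i = begin; i < end; ++i) {
                double v = data[i] + 1.0;                 // serial chain of steps
                v = std::sqrt(v);                         // ...per element...
                out[i] = v * 0.5;                         // ...but elements are independent
            }
        }));
    }
    for (auto& t : tasks) t.get();                        // wait for every chunk
}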
A parallel architecture can operate on multiple instructions at once only if the instructions don't depend on the previous ones, and for a lot of problems they do. A simple example is data compression, which is inherently serial.
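For what it's worth, the loop-carried dependency being described can be sketched with a toy adaptive model (illustrative only, not a real compressor): the cost charged to symbol i depends on statistics updated by every symbol before it, so the loop cannot simply be split up.

#include <array>
#include <cmath>
#include <cstdint>
#include <string>
#include <vector>

int main() {
    std::string input = "abracadabra";
    std::array<std::uint32_t, 256> freq{};
    freq.fill(1);                                  // adaptive model state
    std::uint64_t total = 256;
    std::vector<double> bits;                      // ideal code length per symbol

    for (unsigned char c : input) {                // inherently serial loop
        bits.push_back(std::log2(double(total) / freq[c]));  // depends on ALL prior updates
        ++freq[c];                                 // the update feeds the next iteration
        ++total;
    }
}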
The goal of our recent research was to develop parallel lossless compression algorithms that avoid both problems described above. We want to have B computational units processing a single length-N input, such that the compression ratio is almost as good as that obtained with a regular (serial) compression algorithm.
Building on these insights, we developed techniques that compress using B parallel computational units, where each unit has only O(N/B) memory. The units learn the statistics of the input in a collaborative manner; the compression itself is then performed in parallel. Our approach results in a speedup that is linear in the number of parallel processing units B while the compression quality experiences only a minor degradation.
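Read charitably, the two-phase idea can be sketched like this (a toy under my own assumptions, not the authors' algorithm): B workers each scan an N/B slice to gather statistics, the statistics are merged into one shared model, and each slice is then encoded against that model in parallel. Real entropy coding is replaced by an ideal code-length estimate.

#include <array>
#include <cmath>
#include <cstdint>
#include <future>
#include <string>
#include <vector>

using Hist = std::array<std::uint64_t, 256>;

Hist count(const std::string& s, std::size_t b, std::size_t e) {
    Hist h{};
    for (std::size_t i = b; i < e; ++i) ++h[(unsigned char)s[i]];
    return h;
}

double encode_cost(const std::string& s, std::size_t b, std::size_t e,
                   const Hist& model, double total) {
    double bits = 0.0;
    for (std::size_t i = b; i < e; ++i)
        bits += std::log2(total / (model[(unsigned char)s[i]] + 1.0));  // +1 smoothing
    return bits;
}

int main() {
    std::string input(1 << 20, 'a');                        // toy input
    const std::size_t B = 8;                                // parallel units
    const std::size_t slice = input.size() / B;

    // Phase 1: each unit learns the statistics of its own slice in parallel.
    std::vector<std::future<Hist>> learn;
    for (std::size_t w = 0; w < B; ++w)
        learn.push_back(std::async(std::launch::async, count, std::cref(input),
                                   w * slice, (w + 1 == B) ? input.size() : (w + 1) * slice));

    // The slice statistics are merged into one shared model (the "collaborative" step).
    Hist model{};
    for (auto& f : learn) {
        Hist h = f.get();
        for (std::size_t c = 0; c < 256; ++c) model[c] += h[c];
    }
    const double total = double(input.size()) + 256.0;

    // Phase 2: each unit encodes its slice against the shared model, in parallel.
    std::vector<std::future<double>> enc;
    for (std::size_t w = 0; w < B; ++w)
        enc.push_back(std::async(std::launch::async, encode_cost, std::cref(input),
                                 w * slice, (w + 1 == B) ? input.size() : (w + 1) * slice,
                                 std::cref(model), total));
    double compressed_bits = 0.0;
    for (auto& f : enc) compressed_bits += f.get();
}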
Abstract: In this paper, we present parallel algorithms for lossless data compression based on the Burrows-Wheeler Transform (BWT) block-sorting technique. We investigate the performance of using data parallelism and task parallelism for both multi-threaded and message-passing programming. The output produced by the parallel algorithms is fully compatible with their sequential counterparts. To balance the workload among processors we develop a task scheduling strategy. An extensive set of experiments is performed with a shared memory NUMA system using up to 120 processors and on a distributed memory cluster using up to 100 processors. Our experimental results show that significant speedup can be achieved with both data parallel and task parallel methodologies. These algorithms will greatly reduce the amount of time it takes to compress large amounts of data while the compressed data remains in a form that users without access to multiple processor systems can still use.
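Not the paper's implementation, but for anyone unfamiliar with BWT, here is a naive single-block transform: each block can be transformed independently like this, which is what makes block-sorting compressors friendly to the data-parallel approach the abstract describes.

#include <algorithm>
#include <string>
#include <vector>

std::string bwt(const std::string& block) {
    std::vector<std::string> rotations;
    for (std::size_t i = 0; i < block.size(); ++i)
        rotations.push_back(block.substr(i) + block.substr(0, i));  // all rotations
    std::sort(rotations.begin(), rotations.end());                  // the block sort
    std::string last;
    for (const auto& r : rotations) last += r.back();               // last column is the output
    return last;
}

int main() { std::string out = bwt("banana"); }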
Sorry, but you are wrong here. Tasks that depend on previous results cannot be parallelized. Think of something as simple as sin(log(x*x)) or the Fibonacci sequence, f(n) = f(n-1) + f(n-2).
This is simply not the case. Virtually any task out there can be accomplished in a parallel manner.
compression quality experiences only a minor degradation
I am aware of parallel compression approaches. What they essentially do is split the data into blocks and compress them in parallel while sharing some state. But you have to understand that the compression algorithms themselves are serial: they need to be aware of the previous state to work.
Sorry, but you are wrong here. Tasks that depend on previous results cannot be parallelized. Think of something as simple as sin(log(x*x)) or the Fibonacci sequence, f(n) = f(n-1) + f(n-2).
Originally posted by Aim64C
In that context, yes, every operation is serial in nature. The formula used to assign a hue and shade to a pixel is serial in nature (or many parts of it). All the programming for a graphics card does is assign 'computers' to each pixel before handing it off to a z-buffer with some post processing.
Blocks of fragments (not pixels, blame MS for that misconception) are processed in lockstep. They get their own registers etc. but they will execute the same instructions.
This is why something like branching (which a CPU handles with ease) is such a headache: all fragments in the block must take the same branch in order to avoid a stall.
Furthermore, fragments are written to the framebuffer. The z-buffer is entirely optional (it can be disabled) and is used only for rendering fragments in the correct z-order.[/pedantry]
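As a hedged, CPU-side toy (not real GPU code; the block size and values are arbitrary) of why that divergence hurts: the whole block executes in lockstep, so when a branch splits the fragments, both paths are run under a mask and every fragment effectively pays for both.

#include <array>
#include <cstdio>

int main() {
    constexpr int kBlock = 4;                        // e.g. a 2x2 fragment block
    std::array<float, kBlock> depth{0.1f, 0.9f, 0.2f, 0.8f};
    std::array<float, kBlock> color{};
    std::array<bool, kBlock> mask{};

    // "if (depth < 0.5)" is evaluated for every fragment in lockstep.
    for (int i = 0; i < kBlock; ++i) mask[i] = depth[i] < 0.5f;

    // Taken path: executed by the whole block, results masked in.
    for (int i = 0; i < kBlock; ++i) { float v = depth[i] * 2.0f; if (mask[i]) color[i] = v; }
    // Not-taken path: also executed by the whole block, masked the other way.
    for (int i = 0; i < kBlock; ++i) { float v = 1.0f - depth[i]; if (!mask[i]) color[i] = v; }

    for (int i = 0; i < kBlock; ++i) std::printf("fragment %d -> %.2f\n", i, color[i]);
}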
Originally posted by Aim64C
*sigh* There is no misconception. There is a -linear- computational process behind the values assigned to each -pixel- on a display, regardless of how the hardware and software go about establishing it.
The CPU doesn't handle it with "ease." A single processor simplifies coding; there is a considerable difference.
Parallel branching programs are the same as the Fibonacci sequence: the bigger your problem, the greater the gain from processing in parallel. It may not always be the simplest programming execution, but a lot of that is becoming transparent these days with software-assisted debugging and coding.
Originally posted by jonnywhite
I program as a hobby. Got a 2-year degree over 10 years ago, but I have always hated myself for not keeping up with the times. I've felt for some time now that parallel/multi-threaded programming is where the future seems to be going. I haven't researched this as I should. I think my brain is very sequential, and it's difficult for me to imagine how things I've done in the past could be converted to a multi-threaded approach. Additionally, I have to wonder whether completely abandoning the old ways of doing things in favor of something that's more friendly to the multi-threaded world is not the better path to go down. What I am saying here is that we shouldn't wrap our processors around our programs. It should be the other way around: we should wrap our programs around the processors. This might mean a drastic change in how games or other software is created and used or played. Just a thought!
Originally posted by TreadUpon
reply to post by _Phoenix_
We were supposed to have a press conference next week but it might have been extended...SOON. And it's market ready.
Pixels are the final output of the framebuffer. Fragments are the intermediary values; a pixel will often be shaded by many fragments. That is the misconception. Of course the operations behind fragment shading are serial in nature, but it is the nature of graphics processing that makes GPUs massively parallel: all of the fragments written in a single pass will execute the same instructions.
Even on recent GPUs, branching can (and often will) cause a stall. The penalty of the stall depends on the number of fragments processed in the block. The block size can be as small as 2x2, but also much larger. A single fragment taking a different branch will cause the entire block to stall.
The debugging tools for GPUs are nearly non-existent. They do exist, but in very crude and limited forms.
Originally posted by Aim64C
This is not necessarily so in today's architectures. DX and OpenGL are expanding not only the types of processing done within the GPU but also the very architecture that does the processing. Today's GPUs are substantially different from what they were just seven years ago.
This is down to your process management strategies, really. It is also somewhat dependent upon the architecture (as to what management strategies will be most effective). This is no different than branch prediction in the CPU - you are simply dedicating a process to it and applying it across various smaller processes.
Pish posh. AMD Stream and Nvidia's CUDA are both fully supported with native C and C++ language support. Many programming environments (including Visual C++) have tools that support automatic serial-to-parallel coding conversions (where available). The field of GPU computing has -exploded- since the appearance of DX10 and has only gotten crazier since DX11-compliant hardware appeared.
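As one concrete (and deliberately vendor-neutral) flavor of the serial-to-parallel conversions mentioned above, C++17's standard parallel algorithms hand the same per-element work to multiple cores; CUDA and Stream specifics differ, so treat this only as a sketch of the idea.

#include <algorithm>
#include <cmath>
#include <execution>
#include <vector>

int main() {
    std::vector<float> in(1 << 20, 3.0f), out(in.size());

    // Serial form:
    for (std::size_t i = 0; i < in.size(); ++i) out[i] = std::sqrt(in[i]) + 1.0f;

    // Parallel form: same per-element work, distributed across cores.
    std::transform(std::execution::par, in.begin(), in.end(), out.begin(),
                   [](float x) { return std::sqrt(x) + 1.0f; });
}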
Parallel execution of Fibonacci algorithms is both an example of the limitations of parallel computing and an example of how dynamic it can be. As the number of recursive operations increases, the parallelism available in the algorithm also increases. For a short run of the sequence, parallel execution is not necessary and only increases the overhead. However, when you expect a long run of the sequence, there is a substantial benefit to implementing parallel processing.
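A minimal sketch of that trade-off (my own toy, not anyone's production code; the cutoff value is arbitrary): the recursive Fibonacci tree is split across tasks only while n is above a cutoff, and below it the cost of spawning a task outweighs the work, so the code falls back to plain serial recursion.

#include <cstdint>
#include <future>

std::uint64_t fib_serial(int n) {
    return n < 2 ? n : fib_serial(n - 1) + fib_serial(n - 2);
}

std::uint64_t fib_parallel(int n, int cutoff = 25) {
    if (n < cutoff) return fib_serial(n);            // small subproblem: overhead not worth it
    auto left = std::async(std::launch::async, fib_parallel, n - 1, cutoff);
    std::uint64_t right = fib_parallel(n - 2, cutoff);  // reuse the current thread
    return left.get() + right;
}

int main() { std::uint64_t f = fib_parallel(32); (void)f; }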