As already mentioned, the key to AMD’s unified shader architecture is the superscalar stream processors at the heart of the GPU. There are 320 stream processors, but these are split up into four SIMD (Single Instruction Multiple Data) arrays with 80 stream units in each. Despite the fact that the stream processors are separated into four SIMD arrays, pretty much any instruction thread can be piped to any SIMD array.
Although the stream processors themselves are pretty clever, it’s the way the data is handled that’s the really smart bit. Every stream processor can be sent two threads simultaneously, so that one is queued up and ready while the other is executing. The second thread is held in reserve so that, if the executing thread stalls, the stream processor can switch to it straight away instead of waiting.
Thread arbiter units constantly monitor the execution process to ensure maximum efficiency. If it is determined that a currently executing thread is in a hold state as it waits for data from elsewhere, that thread will instantly be sidelined and replaced with a new one, thus ensuring that every stream processor is actively executing code, rather than sitting there idle.
All the temporary data associated with the bumped thread is saved so that it can continue to execute later. There can be literally hundreds of queued threads waiting for the requested data they need to continue processing. As soon as that data is received, these queued threads are moved straight back to a stream unit and completed.
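The effect of sidelining a stalled thread can be illustrated with a toy model. The sketch below is not AMD’s scheduling logic: the `simulate` function, the `"alu"` and `"fetch"` operation names, the one-cycle cost per operation and the fixed fetch latency are all assumptions made purely for illustration. It models a single stream unit and counts how many cycles that unit spends doing useful work, with and without an arbiter that swaps in another thread whenever the current one is waiting for data.

```python
def simulate(threads, fetch_latency, swap_on_stall):
    """Toy model of one stream unit; returns (busy_cycles, total_cycles).

    Hypothetical rules: every op takes one cycle to issue, and after a
    "fetch" the issuing thread cannot run again for fetch_latency cycles.
    """
    # Per-thread saved state: remaining ops and the cycle its data arrives.
    pending = [{"ops": list(ops), "wake": 0} for ops in threads]
    cycle = busy = 0
    while pending:
        if swap_on_stall:
            # Arbiter behaviour: pick the first thread that is not waiting.
            current = next((t for t in pending if t["wake"] <= cycle), None)
        else:
            # No arbiter: stay with the head thread even while it waits.
            current = pending[0] if pending[0]["wake"] <= cycle else None
        if current is None:
            cycle += 1  # nothing runnable, so the unit sits idle
            continue
        op = current["ops"].pop(0)
        busy += 1
        cycle += 1
        if op == "fetch":
            current["wake"] = cycle + fetch_latency
        if not current["ops"]:
            # Thread finished: drop it from the pending list.
            pending = [t for t in pending if t is not current]
    return busy, cycle


if __name__ == "__main__":
    work = [["alu", "fetch", "alu"] for _ in range(4)]
    for swap in (False, True):
        busy, total = simulate(work, fetch_latency=4, swap_on_stall=swap)
        print(f"swap_on_stall={swap}: busy {busy} of {total} cycles")
```

Both runs perform the same amount of work, but with swapping enabled the idle cycles of one thread are filled by the others, so the total cycle count drops. A real GPU does this across hundreds of threads and many units at once, which this single-unit model does not attempt to capture.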
A sequencer unit is attached to each thread arbiter unit. It determines the optimal instruction order for each thread, which again helps to ensure efficient processing across the entire array of stream processors.
Above the arbiter units are the shader command queues – here there are queues for vertex, geometry and pixel shaders. As shader instructions are queued up and fed to the arbiters, they are then routed to each of the four SIMD arrays for execution on the stream processors. As mentioned earlier, there is no distinction between vertex, geometry or pixel shaders, with each stream processor able to act as any type of shader.
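The routing described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `dispatch` function, the round-robin draining of the three queues and the "least loaded array" policy are hypothetical, since the article does not say how the hardware chooses a SIMD array. The one thing taken from the architecture is that any array can accept any shader type.

```python
from collections import deque

NUM_SIMD_ARRAYS = 4      # four SIMD arrays, as described in the article
UNITS_PER_ARRAY = 80     # 80 stream units in each
TOTAL_STREAM_PROCESSORS = NUM_SIMD_ARRAYS * UNITS_PER_ARRAY


def dispatch(queues):
    """Route queued shader work to SIMD arrays, ignoring shader type.

    queues maps a shader kind ("vertex", "geometry", "pixel") to a list of
    work items. The balancing policy here is an assumption for illustration.
    """
    arrays = [[] for _ in range(NUM_SIMD_ARRAYS)]
    pending = {kind: deque(items) for kind, items in queues.items()}
    while any(pending.values()):
        # Take one item from each non-empty queue in turn.
        for kind, queue in pending.items():
            if queue:
                # Unified shaders: every array is a valid target for every kind.
                target = min(arrays, key=len)
                target.append((kind, queue.popleft()))
    return arrays


if __name__ == "__main__":
    result = dispatch({
        "vertex": ["v0", "v1"],
        "geometry": ["g0"],
        "pixel": ["p0", "p1", "p2", "p3", "p4"],
    })
    for index, items in enumerate(result):
        print(f"SIMD array {index}: {items}")
```

Because no array is reserved for a particular shader type, a pixel-heavy workload like the one in the usage example still spreads evenly across all four arrays rather than leaving dedicated vertex hardware idle.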
What’s interesting is that AMD is pitching the HD 2900 XT up against nVidia’s GeForce 8800 GTS 640MB card, which only sports 96 stream processors. In fact, even the GeForce 8800 GTX only has 128 stream units, and AMD freely admits that the GTX is a faster card. With that in mind, it’s clear that real-world performance is not simply dictated by the number of stream processors you can squeeze on a die.