RV770: AMD ATI Radeon HD 4870 - RV770: The Architecture
As well as the difference in texturing configuration, the SFUs in nVidia's architecture only appear at this point, whereas there is an SFU in each MSPU in ATI's architecture. So, for every SIMD you get 16 SFUs, whereas each SM has only two.
Moving out a further step, things look even more different. nVidia combines three SMs and adds texturing capabilities (eight texture units per cluster) to create a Texture/Processing Cluster (TPC), then combines 10 of these. ATI, on the other hand, skips straight to adding 10 SIMD cores together. The result is that RV770 ends up with 640 SPUs, 160 SFUs (remember these can also do everything the SPUs can), and 40 texture units, whereas GT200 has 240 SPUs, 60 SFUs, and 80 texture units. In comparison, R600 and RV670 had four SIMDs, making for a total of 256 SPUs, 64 SFUs, and 16 texture units.
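To make the arithmetic behind those totals explicit, here is a rough sketch in Python. The function names are ours, purely for illustration, and the per-block figures marked "derived" are simply the quoted totals divided by the block count rather than numbers stated by either vendor.

```python
# Building-block counts quoted in the text; derived values are marked as such.

def rv770_totals(simds=10):
    """Totals for an ATI-style chip built from identical SIMD cores."""
    sfus_per_simd = 16   # stated in the text: 16 SFUs per SIMD
    spus_per_simd = 64   # derived: 640 SPUs / 10 SIMDs
    tex_per_simd = 4     # derived: 40 texture units / 10 SIMDs
    return {
        "spus": simds * spus_per_simd,
        "sfus": simds * sfus_per_simd,
        "tex": simds * tex_per_simd,
    }


def gt200_totals(tpcs=10, sms_per_tpc=3):
    """Totals for an nVidia-style chip built from TPCs of several SMs."""
    sps_per_sm = 8       # derived: 240 SPs / (10 TPCs * 3 SMs)
    sfus_per_sm = 2      # stated in the text: two SFUs per SM
    tex_per_tpc = 8      # stated in the text: eight texture units per TPC
    sms = tpcs * sms_per_tpc
    return {
        "spus": sms * sps_per_sm,
        "sfus": sms * sfus_per_sm,
        "tex": tpcs * tex_per_tpc,
    }


if __name__ == "__main__":
    print("RV770:", rv770_totals())
    print("GT200:", gt200_totals())
```

The point of laying it out this way is that ATI scales one repeated block, while nVidia nests two levels (SMs inside TPCs), which is why the texture unit count tracks the TPC count rather than the SM count.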
Now it only takes a brief glance at those numbers to realise that somewhere along the way nVidia's and ATI's tactics produce massively differing results, as ATI doesn't even pretend the 800 (640 + 160) SPUs in RV770 come close to competing, unit for unit, with the 240 on GT200. We already mentioned that ATI's SPUs can only work on one thread at a time, so in many ways the 800 total SPUs could be considered to have only the same processing power as 160 SPs, which does more closely reflect average real-world performance figures.
However, it's not quite as simple as that. If software is written in such a way that it benefits from the extra processors in ATI's architecture then it will run considerably faster; if it isn't, it will be slower than on nVidia's cards. The only real conclusion we can draw at the moment is that nVidia's simpler approach will likely give more consistent performance in the short term. If software developers begin to embrace more complicated routines, then ATI's hardware could pull away in the long term.
The same is true when you consider RV770 vs GT200 for GPGPU applications. While RV770 theoretically has more compute power, at 1.2 TeraFLOPs compared to GT200's 933 GigaFLOPs, it requires programs to be written in such a way that they take full advantage of it. Time, and future benchmarks, will tell which turns out to be the better method.
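For the curious, the headline FLOPs figures can be roughly reproduced with a few lines of Python. Note the assumptions, none of which appear in the text above: a 750MHz core clock for the HD 4870 with two floating point operations per SPU per clock, and a 1,296MHz shader clock for the GT200-based GTX 280 with three operations per SP per clock. The `slots_filled` parameter is our own crude stand-in for how many of the five SPUs in each ATI unit the software manages to keep busy.

```python
def peak_gflops(alus, flops_per_alu_per_clock, clock_mhz):
    """Theoretical peak in GigaFLOPs: units x operations per clock x clock speed."""
    return alus * flops_per_alu_per_clock * clock_mhz / 1000.0


# Assumed clocks and per-clock rates (not taken from the article text).
RV770_CLOCK_MHZ = 750
GT200_SHADER_CLOCK_MHZ = 1296


def rv770_gflops(slots_filled=5):
    """RV770 throughput when only `slots_filled` of the 5 SPUs per unit do work."""
    units = 160  # 800 SPUs grouped in fives
    return peak_gflops(units * slots_filled, 2, RV770_CLOCK_MHZ)


def gt200_gflops():
    """GT200 theoretical peak under the assumptions above."""
    return peak_gflops(240, 3, GT200_SHADER_CLOCK_MHZ)


if __name__ == "__main__":
    print("GT200 peak:", gt200_gflops())
    for slots in range(1, 6):
        print("RV770 with", slots, "of 5 slots busy:", rv770_gflops(slots))
```

Under these assumptions the fully packed case gives RV770's 1.2 TeraFLOPs, while the worst case of one busy SPU per unit gives only a fifth of that. This is a paper exercise only; real applications rarely hit either chip's theoretical peak.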