Intel designed Nehalem with scalability in mind, so that more cores and/or cache can be added as easily as possible when required. Word has it that we'll even see dual-core Core i7 CPUs with integrated graphics in the near future (currently codenamed Havendale), which sounds like a very interesting proposition.
For the initial launch we'll only see standard quad-core chips, capable of handling eight threads, but eight-core, 16-thread chips are definitely on the roadmap, too. Obviously leading with a quad-core solution makes sense for Intel as it allows for the best direct comparison to its last-generation chips. Who really needs to process 16 simultaneous threads anyway? Then again, there are probably a large number of workstation users with their hands up right now.
Nehalem's cores have a lot of similarities to those of the Core generation, but with some significant tweaks that, on paper at least, translate into some decent performance gains.
The same four-wide instruction issue design remains, so there's no improvement on that front. There is, however, a 33 per cent increase in the number of micro-ops that can be in flight at once, now up to 128 from 96 with Penryn. The upshot is an increase in parallelism at an instruction level, which is further benefited by the fact that each core is capable of handling two threads at any one time.
Further to those improvements, Nehalem also brings with it SSE4.2, which has a few tweaks, one of which improves the processing of XML files. As XML is becoming more prevalent almost by the day, that's no bad thing. Other new instructions can help speed up error-detection code calculation, voice recognition and DNA sequencing.
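To give a feel for what those last instructions do, here is a minimal Python sketch of the two operations usually meant here: a CRC-32C checksum (the Castagnoli polynomial that the SSE4.2 CRC32 instruction accumulates) and a population count (what POPCNT computes). This is a slow software model for illustration only, not how you would call the hardware; the function names are our own, and the initial and final inversion in `crc32c` follows the common software convention rather than anything the instruction does itself.

```python
def crc32c(data: bytes, crc: int = 0) -> int:
    """Bit-by-bit CRC-32C (Castagnoli) over `data`.

    The hardware instruction performs the accumulate step in one go;
    this loop spells out the same arithmetic one bit at a time.
    """
    poly = 0x82F63B78          # bit-reflected form of 0x1EDC6F41
    crc ^= 0xFFFFFFFF          # conventional initial inversion (done in software)
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift right; fold in the polynomial when the low bit was set.
            crc = (crc >> 1) ^ (poly if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF    # conventional final inversion


def popcount(x: int) -> int:
    """Number of set bits in a non-negative integer (what POPCNT returns)."""
    return bin(x).count("1")


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions where two equal-length bit strings differ."""
    return popcount(a ^ b)


if __name__ == "__main__":
    print(hex(crc32c(b"123456789")))           # standard CRC-32C check input
    print(hamming_distance(0b1100, 0b1010))    # two positions differ
```

Counting differing bits between two encoded sequences is the kind of inner loop that turns up in sequence comparison and pattern matching, which is why a single-instruction population count is useful for the workloads mentioned above.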
Branch prediction - the processor's ability to 'guess' what it is likely to have to do next and set itself up beforehand so it does it faster - has also seen some tweaks. A second-level branch predictor has been added with a much larger data set and deeper history, allowing for broader searching and thus more accurate prediction.
Although this second-level predictor is inevitably slower, its ability to correct mis-predictions made at the first level, and thus avoid the speed penalty the CPU incurs by doing unnecessary work, should more than make up for that. This couples with the Renamed Return Stack Buffer to ensure that the correct data is pulled from the CPU's stack, even if there is a mis-prediction.
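Intel hasn't published the inner workings of its predictors, so the following is only a toy Python model of the general idea that deeper history makes for better guesses. It uses a textbook scheme (a table of two-bit saturating counters indexed by the last few branch outcomes), and the `simulate` function and its parameters are hypothetical names of our own, not anything taken from Nehalem.

```python
def simulate(outcomes, history_bits):
    """Return the fraction of branches predicted correctly.

    A table of 2-bit saturating counters is indexed by the last
    `history_bits` outcomes. With history_bits=0 there is a single
    counter, i.e. a predictor with no memory of recent branches.
    """
    table = [1] * (1 << history_bits)   # every counter starts at "weakly not taken"
    mask = (1 << history_bits) - 1
    history = 0
    hits = 0
    for taken in outcomes:
        counter = table[history]
        predicted = counter >= 2        # 2 or 3 means "predict taken"
        hits += (predicted == taken)
        # Nudge the counter towards the actual outcome, saturating at 0 and 3.
        table[history] = min(3, counter + 1) if taken else max(0, counter - 1)
        # Shift the outcome into the history register.
        history = ((history << 1) | int(taken)) & mask
    return hits / len(outcomes)


if __name__ == "__main__":
    # A loop branch: taken three times, then not taken on exit, repeated.
    pattern = [True, True, True, False] * 250
    print("no history:    ", simulate(pattern, 0))
    print("4-bit history: ", simulate(pattern, 4))
```

On this repeating loop pattern the history-free predictor gets about 75 per cent right, because it trips over every loop exit, while the version with four bits of history learns the pattern after a short warm-up and is almost never wrong. A real second-level predictor is far more elaborate, but the trade-off is the same: more state and a slower lookup in exchange for fewer wasted cycles.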
In real-world terms, what all this means is that applications, be they multi-threaded or not, should run more efficiently (i.e. faster) on Nehalem than they did on Penryn on a clock-for-clock basis.