One of the features of Nehalem's cores that caught many people's attention is Simultaneous Multi-Threading (SMT), or HyperThreading as Intel calls it. The justification for enabling each core to execute multiple threads simultaneously is pretty sound. Not only do you get much better multi-threaded performance, but that improvement is achieved within a much tighter thermal envelope and requires a lot less silicon than adding another core.
HyperThreading can also mean less CPU idle time and fewer wasted clock cycles, as work can be carried out on one thread while the core waits for data required by the other. As Nehalem's cores are inherently faster than those of the Netburst era, thanks to a shorter execution pipeline and an improved ability to access data when needed, HyperThreading should bring even more of an advantage this time around. Even on the Pentium 4, though, HyperThreading brought performance benefits in threaded applications, so we're expecting great things from Nehalem in this regard.
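The latency-hiding idea can be illustrated with a toy software analogy. This is not real hardware threading, just a sketch: each worker stalls briefly (simulated with a sleep, standing in for a memory wait) before doing some work, and because the two overlap, the total wall time is far less than running them back to back. All names and timings here are invented for the example.

```python
import threading
import time

def worker(results, name):
    time.sleep(0.15)                     # simulate a stall waiting on memory
    results[name] = sum(range(50_000))   # then do some useful computation

results = {}
start = time.perf_counter()
threads = [threading.Thread(target=worker, args=(results, n))
           for n in ("thread_a", "thread_b")]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start
# Run serially the two workers would take roughly 0.3s; overlapped,
# they finish in a little over 0.15s, since one's stall hides the other's.
```

The same principle, in miniature, is what lets an SMT core make progress on one thread while the other is stalled waiting for data.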
Importantly, this resource sharing also benefits Nehalem when it's dealing with single-threaded tasks. Each core's resources are shared out as the workload requires: if two threads are executing, each gets half the available resources, but if only one thread is being pushed through the core, it gets the core's full attention. It's the proverbial win-win situation.
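That sharing policy can be captured in a one-line model. The function name and the figure of 128 entries below are purely illustrative, not a real Nehalem resource count:

```python
def per_thread_share(total_entries, active_threads):
    """Toy model: share a core resource evenly among active threads."""
    return total_entries // max(active_threads, 1)

# With two threads each gets half; with one, it gets everything.
two_way = per_thread_share(128, 2)   # half the entries per thread
one_way = per_thread_share(128, 1)   # the full resource for one thread
```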
As a by-product of moving the memory controller onto the CPU die, Intel has also changed the cache structure. As with Phenom, Nehalem uses a three-level hierarchy, although unlike AMD's, Intel's L3 cache is inclusive - meaning it also contains a copy of everything held in the L1 and L2 caches.
On a quad-core Nehalem die, there is 8MB of shared L3 cache, 256KB of L2 cache per core and 64KB of L1 cache per core, divided into a 32KB instruction cache and a 32KB data cache. L1 latency is slightly higher on Nehalem than on Penryn, but L2 latency is lower by a greater proportion, so the upshot should be faster cache access overall.
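The inclusive property is easiest to see in a minimal sketch. The dict-based class below is an invented model, not how the hardware works: the point is simply that a line filled into L1 also lands in L2 and L3, so evicting it from L1 still leaves copies in the larger levels.

```python
class InclusiveCacheModel:
    """Toy model of an inclusive three-level hierarchy (sizes in comments
    match Nehalem's per-core figures, but the dicts are unbounded)."""

    def __init__(self):
        self.l1 = {}   # 32KB data + 32KB instruction per core
        self.l2 = {}   # 256KB per core
        self.l3 = {}   # 8MB, shared, inclusive of L1 and L2

    def fill(self, address, line):
        # A fill from memory places the line in every level,
        # preserving the inclusive property.
        self.l1[address] = line
        self.l2[address] = line
        self.l3[address] = line

    def evict_l1(self, address):
        # Evicting from L1 leaves the L2 and L3 copies intact,
        # so a later L1 miss can still hit in the larger caches.
        self.l1.pop(address, None)

cache = InclusiveCacheModel()
cache.fill(0x1000, "line")
cache.evict_l1(0x1000)
```

One practical consequence of inclusivity is that the L3 can answer on behalf of all cores: if a line isn't in L3, no core's private caches hold it either, which simplifies checking other cores' caches.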
Nehalem can get away with a relatively smaller cache than Conroe and Penryn primarily because it no longer needs a large cache to hide the latency of sending data over the FSB - it has a shiny new integrated memory controller and QPI for that.
Intel has added a second-level Translation Look-aside Buffer (TLB) to Nehalem, which works in much the same way as the second-level branch predictor. If the first-level TLB doesn't hold a requested address mapping, the CPU can look in the second-level TLB which, while slower, is still far quicker than walking the page tables in memory to find the mapping. Here, as with the cache hierarchy and the addition of SMT, the upshot is that the CPU is kept fed with data when it needs it and avoids wasting clock cycles.
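The two-level lookup can be sketched in a few lines. Everything below is invented for illustration - the addresses, dict-based TLBs, and the promote-on-hit policy are assumptions of the sketch, not documented Nehalem behaviour:

```python
PAGE_TABLE = {0x4000: 0x9000}    # virtual page -> physical frame (stand-in)

l1_tlb = {}                      # small and fast first-level TLB
l2_tlb = {0x4000: 0x9000}        # larger, slower second-level TLB

def translate(vpage):
    if vpage in l1_tlb:                     # first-level hit: fastest path
        return l1_tlb[vpage]
    if vpage in l2_tlb:                     # second-level hit: no page walk
        l1_tlb[vpage] = l2_tlb[vpage]       # promote into the first level
        return l2_tlb[vpage]
    frame = PAGE_TABLE[vpage]               # both miss: full page-table walk
    l1_tlb[vpage] = l2_tlb[vpage] = frame   # cache the mapping at both levels
    return frame

frame = translate(0x4000)   # second-level hit, promoted into the first level
```

After the first lookup the mapping sits in the first-level TLB, so repeat translations take the fastest path - the same effect the extra TLB level has in hardware.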