The problem Intel sees itself solving with Larrabee is that of combining the programmability of a traditional CPU with the throughput performance of a traditional GPU in a single package. When first considering this, rather than create an entirely new architecture, Intel instead looked at how much it would have to change its current chips to improve throughput performance.
As a purely theoretical exercise, Intel's engineers considered what kind of performance could be achieved using the same die size and power budget as a current dual-core CPU, but with an architecture optimised for throughput. In this case Intel decided to focus specifically on vector processing, as it is a heavy component of graphics applications.
What Intel came up with was a chip with 10 in-order cores, each capable of issuing two instructions per clock and equipped with a 16-lane Vector Processing Unit (VPU), as opposed to a desktop CPU's two out-of-order cores issuing four instructions per clock. The result was a peak vector throughput of 160 operations per clock, versus eight on a Core 2 Duo: a twentyfold increase within the same die area and TDP.
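The arithmetic behind those figures can be sketched as a quick back-of-the-envelope calculation. Only the core and lane counts come from the text; the helper function below is purely illustrative.

```python
# Back-of-the-envelope comparison of peak vector throughput per clock.
# Core counts and VPU lane widths are the figures quoted in the article;
# everything else is simple multiplication.

def peak_vector_lanes(cores, lanes_per_core):
    """Peak vector operations retired per clock across all cores."""
    return cores * lanes_per_core

# Core 2 Duo: two out-of-order cores, 4-wide SSE vector units
core2duo = peak_vector_lanes(cores=2, lanes_per_core=4)

# The theoretical throughput-optimised chip: ten in-order cores,
# each with a 16-lane VPU
throughput_chip = peak_vector_lanes(cores=10, lanes_per_core=16)

print(core2duo)                     # 8 operations per clock
print(throughput_chip)              # 160 operations per clock
print(throughput_chip // core2duo)  # 20x the peak vector throughput
```

Note this is peak arithmetic throughput only; sustained performance depends on keeping all those lanes fed, which is exactly what the cache and ring-bus design discussed below is for.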
As suggested already, it isn't necessary to create an entirely new architecture to get the kind of core best suited to graphics processing. To that end, in creating Larrabee Intel decided to start with a CPU core it already had and tweak it into the GPU core it wanted. Larrabee is, at heart, derived from the original Pentium processor design, although you could hardly tell by looking at it.
As the block diagram above shows, Larrabee still looks more like a multi-core CPU than a GPU such as the nVidia GeForce GTX 280, although there are some similarities too, as you would expect. Notable features of the Larrabee design include a 1,024-bit ring bus (512 bits in each direction) over which all the components communicate, dedicated texture units and a fixed-function unit.
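To put that ring-bus width in perspective, a quick sketch of its per-clock bandwidth follows. The bus width comes from the text; the clock speed used is an assumption for illustration only, since Larrabee's final clocks were not public.

```python
# Rough bandwidth of the ring bus described above: 1,024 bits total,
# i.e. 512 bits in each direction per clock.

BITS_PER_DIRECTION = 512
bytes_per_direction_per_clock = BITS_PER_DIRECTION // 8  # 64 bytes

# Hypothetical 1 GHz clock, purely illustrative -- not an Intel figure.
assumed_clock_hz = 1.0e9
bandwidth_gb_per_s = bytes_per_direction_per_clock * assumed_clock_hz / 1e9

print(bytes_per_direction_per_clock)  # 64 bytes per direction per clock
print(bandwidth_gb_per_s)             # 64.0 GB/s per direction at 1 GHz
```

Even at this conservative assumed clock, each direction of the ring can move an entire 64-byte cache line every cycle, which is what makes the shared-cache scheme below practical.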
Larrabee uses a shared L2 cache whereby each core has an allocated area to which it has read/write access, while every other core has read access to it, allowing data to be shared between cores. Coupled with the wide, high-speed ring bus, this provides both high bandwidth and fast access to the fixed-function blocks.
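The access pattern that cache arrangement allows can be modelled in a few lines. This is a toy sketch of the read/write rules described above, not a simulation of Larrabee's actual cache or coherence protocol; all names are hypothetical.

```python
# Toy model of the shared L2 scheme: each core owns a slice it can both
# read and write, while other cores may only read from it.

class SharedL2:
    def __init__(self, num_cores):
        # One slice (modelled as a dict) allocated per core.
        self.slices = [dict() for _ in range(num_cores)]

    def write(self, core, key, value):
        # A core only ever writes into its own allocated slice.
        self.slices[core][key] = value

    def read(self, reader, owner, key):
        # Any core may read any other core's slice, enabling sharing.
        return self.slices[owner].get(key)

l2 = SharedL2(num_cores=4)
l2.write(core=0, key="tile_data", value=[1, 2, 3])
print(l2.read(reader=3, owner=0, key="tile_data"))  # [1, 2, 3]
```

Because a core never writes into another core's slice, producer/consumer sharing of this kind avoids the write contention a single fully shared cache would suffer.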