Delving deeper into the cores themselves, we see one important difference from any GPU currently on the market: Larrabee is fully x86 capable. In real terms, that means that whereas nVidia’s and AMD’s cards can only run the APIs they have been designed for (OpenGL and DirectX), Larrabee can run any code that runs on any other x86 processor. Each core boasts separate scalar and vector units, each with its own dedicated registers, coupled in turn to a shared L1 cache, which finally backs onto the core’s own 256KB local subset of the L2 cache.
Each core supports four separate execution threads with separate registers per thread, allowing for a short, efficient in-order pipeline without sacrificing the latency-hiding benefit of a more complex out-of-order pipeline. The vector unit itself has a 16-lane-wide vector ALU controlled by mask registers, which determine the lanes each instruction actually writes to. This handles data flow control and in turn enables a separate execution kernel to be mapped to each of the VPU lanes. The upshot is a vector unit that is extremely efficient at crunching through maths operations.
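To make the role of the mask registers more concrete, here is a minimal sketch in Python of how a per-lane branch can be run on a 16-lane unit: both sides of the branch are executed, and a mask decides which lanes keep each result. The helper names (`masked_op`, `compare_gt`) are hypothetical and are not Larrabee instructions; this only models the general idea, not Intel's actual instruction set.

```python
LANES = 16  # width of the vector ALU described above

def masked_op(op, dest, src_a, src_b, mask):
    """Apply op lane by lane; lanes whose mask bit is 0 keep their old dest value."""
    return [op(a, b) if (mask >> i) & 1 else d
            for i, (d, a, b) in enumerate(zip(dest, src_a, src_b))]

def compare_gt(src_a, src_b):
    """Build a 16-bit mask with bit i set where src_a[i] > src_b[i]."""
    mask = 0
    for i, (a, b) in enumerate(zip(src_a, src_b)):
        if a > b:
            mask |= 1 << i
    return mask

# Per-lane "kernel": if x > 7 then y = x - 7, else y = x + 100
x = list(range(LANES))
seven = [7] * LANES
hundred = [100] * LANES
y = [0] * LANES

taken = compare_gt(x, seven)               # lanes where the branch is taken
not_taken = ~taken & ((1 << LANES) - 1)    # all remaining lanes

# Both sides of the branch run across the whole vector; the masks
# ensure each lane only keeps the result that belongs to it.
y = masked_op(lambda a, b: a - b, y, x, seven, taken)
y = masked_op(lambda a, b: a + b, y, x, hundred, not_taken)
```

Each lane here behaves as if it were running its own copy of the kernel, even though only one instruction stream is being issued.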
The VPU instruction set supports numeric type conversion, an operation that can otherwise cause a lot of slowdowns, on the cache read and write cycles, and also allows data replication and lane rearrangement on the register read cycle. Notably, Larrabee also supports both 32-bit (single precision) and 64-bit (double precision) floating point data.
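As a rough illustration of those three operations, the sketch below models them in Python on 16-element lists. The function names and the choice of an 8-bit-to-float conversion are assumptions made for the example; the article does not specify which formats or instruction encodings Larrabee uses.

```python
LANES = 16

def load_unorm8_as_float(memory, offset):
    """Type conversion on load: read 16 bytes, return floats in 0.0..1.0."""
    return [memory[offset + i] / 255.0 for i in range(LANES)]

def broadcast(value):
    """Data replication: copy one scalar into all 16 lanes."""
    return [value] * LANES

def swizzle(vec, pattern):
    """Lane rearrangement: output lane i takes input lane pattern[i]."""
    return [vec[p] for p in pattern]

memory = bytes(range(0, 256, 17))             # 16 bytes: 0, 17, ..., 255
colours = load_unorm8_as_float(memory, 0)     # converted as it is read
scale = broadcast(2.0)                        # one value seen by every lane
reversed_lanes = swizzle(colours, list(range(LANES - 1, -1, -1)))
```

The point is that when conversion, replication and rearrangement happen as part of the read, they do not need separate instructions of their own.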
Unlike those of a typical GPU, Larrabee’s cores offer such features as context switching and pre-emptive multitasking, helping each core make the most efficient use of each available clock cycle. Other differences Intel highlights include the addition of a ring bus for inter-block communication, low-latency L1 and L2 caches and the removal of most fixed-function logic - Larrabee features no hardware setup or rasterisation units, for example.
That said, in one area that is important for graphics, Larrabee does still defer to fixed function logic. For texture sampling, Intel concluded that a software implementation was simply far too inefficient: 12 times less efficient for filtering, and 40 times less if decompression was necessary. Larrabee’s texture logic unit, then, performs standard texture operations such as anisotropic filtering and decompression, communicating with the cores via the chip’s L2 cache using 2x2-pixel tiles of 8-bit colour values.
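To give a sense of the work involved, here is a minimal sketch in Python of the simplest kind of texture filtering, a bilinear blend across one 2x2 tile of 8-bit values. It is an illustration only, under the assumption of a single colour channel and a hypothetical `bilinear_2x2` helper; it is not Intel's implementation, and anisotropic filtering and decompression are considerably more involved.

```python
def bilinear_2x2(tile, fx, fy):
    """Blend a 2x2 tile of 8-bit values for one colour channel.

    tile is ((top_left, top_right), (bottom_left, bottom_right));
    fx and fy give the sample position within the tile, each 0.0..1.0.
    """
    (tl, tr), (bl, br) = tile
    top = tl + (tr - tl) * fx            # blend along the top row
    bottom = bl + (br - bl) * fx         # blend along the bottom row
    value = top + (bottom - top) * fy    # blend between the two rows
    return max(0, min(255, int(value + 0.5)))  # round and clamp to 8 bits

tile = ((0, 100), (100, 200))
centre = bilinear_2x2(tile, 0.5, 0.5)
corner = bilinear_2x2(tile, 1.0, 1.0)
```

Even this basic case costs three blends per channel per sample, repeated for every channel of every pixel drawn, which helps explain why the job is left to dedicated hardware.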
The second fixed function unit Larrabee sports hasn’t had its use specified yet but, just like the texture unit, it will do anything Intel decides isn’t efficient enough to perform in code. As an example, given the 2009-2010 launch schedule for retail GPUs, it seems likely that Intel will add a hardware tessellation engine to Larrabee, as that feature is being introduced with the DirectX 11 spec.
So, just how does Larrabee deal with current graphics implementations?