Back before cloud computing became a buzzword and quad-core CPUs became mainstream, fabless semiconductor startup Tilera launched its initial attempt to redo the RISC revolution with a 64-core processor that put an unprecedented amount of hardware under the control of the compiler. The original Tile64 was based on MIT's RAW project, and the basic idea behind it was to use an on-chip mesh network and a grid of lightweight cores to boost CPU performance—wire delays from hopping between cores are exposed to the compiler for scheduling purposes.
Never having seen a set of third-party benchmarks for a Tilera CPU, I can't really speak to the company's success in boosting CPU performance, but it has now repackaged the many-core-plus-mesh idea as a performance-per-watt play for cloud datacenters. In connection with this cloud push, Tilera and Quanta are announcing a new "cloud server," the S2Q, that packs 512 cores into just two rack units. This is considerably less space (and power) than the 512-core SeaMicro server announced last week, but despite having the same core count and target market, the two aren't necessarily directly comparable.
Tilera's cores implement a very simple VLIW design with two integer ALUs and a load-store pipe (at least, I'm pretty sure that the third execution pipeline is load-store). Each core also has a small bit of L1 and L2 cache associated with it, and it's connected to the larger mesh network and to a chip-wide, fully coherent L3 cache via a small, private switch. The lack of floating-point and vector hardware won't really hurt Tilera much on cloud workloads, but the differences between 512 cores of Tilera and 512 cores of Atom are much deeper than just a lack of support for two popular arithmetic types.
Tilera advertises its TILEPro64 chip as 64 cores' worth of general-purpose compute power, but one of its signal features is that the cores are made to be ganged together to solve problems. The cores' VLIW nature puts 100 percent of the burden of scheduling instructions for maximum performance onto the compiler, and the compiler knows the latency between cores. As a result, a Tilera CPU can theoretically do a kind of reverse SMT, where two or more cores can execute instructions from a single thread. In other words, the compiler can actually do a kind of instruction pipelining across cores. The upshot of all this is that, depending on how the compiler and hypervisor layer are partitioning the hardware, more than one core may be doing the same work that a single Atom core does.
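To make the idea concrete, here is a toy sketch of what statically scheduling one thread's instructions across cores might look like when the compiler knows the inter-core hop latency. This is not Tilera's actual compiler; the hop latency, issue model, and instruction list are invented for illustration.

```python
# Toy list scheduler: greedily place each single-cycle instruction on
# whichever core lets it issue earliest, charging a fixed hop latency
# when an operand must be forwarded from a different core. All numbers
# are assumptions for illustration, not TILEPro64 specifics.

HOP_LATENCY = 2  # assumed cycles to forward a result across the mesh

# Each instruction: (name, dependency or None)
instrs = [("a", None), ("b", "a"), ("c", None), ("d", "c"), ("e", "b")]

def schedule(instrs, num_cores=2):
    core_free = [0] * num_cores  # next free issue cycle per core
    done = {}                    # name -> (finish_cycle, core)
    placement = {}               # name -> (start_cycle, core)
    for name, dep in instrs:
        best = None
        for core in range(num_cores):
            ready = core_free[core]
            if dep is not None:
                fin, dep_core = done[dep]
                hop = HOP_LATENCY if dep_core != core else 0
                ready = max(ready, fin + hop)
            if best is None or ready < best[0]:
                best = (ready, core)
        start, core = best
        core_free[core] = start + 1      # one instruction per core per cycle
        done[name] = (start + 1, core)   # single-cycle ops in this sketch
        placement[name] = (start, core)
    return placement
```

Running this on the sample instruction list, the scheduler keeps each dependent chain (a→b, c→d) on one core to dodge the hop penalty while spreading the independent chains across both cores—a crude version of the compile-time tradeoff the exposed mesh latencies make possible.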
All of the complexity that I've described above is hidden from the programmer via virtualization and a standard Linux and C development environment. TILEPro64 currently runs the standard LAMP stack, so the new S2Q server will slot right into some datacenters.
Ultimately, Tilera's design does in fact look like a pretty great fit for cloud workloads, at least in theory. Groups of cores can be partitioned off from one another so that they run separate OS and application stacks. Those partitions are pretty hard—the machine will throw an exception if a packet from one group of cores tries to access another group. So multitenancy is built into the processor's DNA.
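The hard-partitioning behavior described above can be modeled in a few lines. This is a loose software analogy, not Tilera's hardware: the partition map, core count, and exception type are all invented for illustration.

```python
# Toy model of hardware-enforced partitioning on a core mesh: a packet
# addressed outside its source partition raises an exception, loosely
# analogous to the fault the chip raises on cross-partition access.

class PartitionViolation(Exception):
    """Raised when a packet crosses a partition boundary (assumed name)."""

# Assumed layout: 8 cores split between two tenants, A and B.
PARTITION = {0: "A", 1: "A", 2: "A", 3: "A",
             4: "B", 5: "B", 6: "B", 7: "B"}

def route(src_core, dst_core, payload):
    """Deliver a mesh packet only if source and destination share a partition."""
    if PARTITION[src_core] != PARTITION[dst_core]:
        raise PartitionViolation(
            f"core {src_core} ({PARTITION[src_core]}) -> "
            f"core {dst_core} ({PARTITION[dst_core]})")
    return (dst_core, payload)
```

In this sketch, `route(0, 3, ...)` succeeds because both cores belong to tenant A, while `route(0, 5, ...)` raises `PartitionViolation`—the software equivalent of the exception the processor throws when one group of cores tries to reach another.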
The thing that's holding Tilera back is that the company is still producing parts on a positively stone-age 90nm process. I can imagine only two reasons for this. The first is that 90nm is dirt cheap, and Tilera, still unprofitable, can't yet afford to move up even to the 65nm node. The second, more charitable reason is that Tilera's main competition right now is probably an FPGA, which considerably lags the leading edge in clockspeed, performance, and process technology.
If Tilera stays afloat into 2011—and there's no reason that I know of to suspect that it won't—the company has announced a move to a 40nm half-node process and a family of 64- to 100-core processors. Tilera will need that process jump if it plans to seriously compete with x86, either in its Xeon and Opteron incarnations or in the form of a SeaMicro-style Atom server.
Source: Ars Technica