With so much riding on AMD's upcoming server architecture, codenamed Bulldozer, it's no surprise that the recent appearance of a set of benchmarks of a Bulldozer engineering sample is creating quite a stir. Interpreting these benchmarks is not easy, given their complete lack of critical context (e.g., compiler options, software versions, and optimizations) and the unknown state of the sample chip that they were run on. But David Kanter at RealWorldTech has done a heroic job of taking them apart and looking for clues as to what, if anything, the benchmarks signify.
I won't recap his analysis here, since it's perfectly accessible and worth reading. But I will summarize a bit.
First, as noted above, the benchmarks are so devoid of critical contextual details that they're borderline worthless. Optimizations can make huge differences in performance, but it's impossible to tell what, if any, optimizations were applied in this instance. The engineering sample itself also seems to be crippled in a few respects. The clock speed is an unnaturally low 1.8GHz, and the part's extremely poor memory performance suggests that the probe filter was disabled and that valuable bandwidth is being eaten up by cache coherence traffic.
All of that said, Kanter tries to isolate a few benchmarks that might provide some glimpse of how Bulldozer will ultimately perform vs. its predecessor, Istanbul. The results are a mixed bag.
Bulldozer's "cores" (AMD calls them "modules" since, due to 'Dozer's unique design, each module is equivalent to roughly one and a half normal cores) range from 0.6 times the performance of Istanbul's cores to 1.3 times. This is a huge amount of variation, and it appears highly workload-dependent.
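For readers who want to see how a per-core ratio like this is typically derived, here is a minimal sketch. I don't know exactly how Kanter normalized the leaked numbers, so treat this as one plausible approach: divide each chip's total score by its core (or module) count, optionally divide again by clock speed, and then take the ratio. The function name, its parameters, and every number in the example are placeholders invented for illustration; none of them come from the leaked benchmarks.

```python
def per_core_ratio(new_score, new_cores, old_score, old_cores,
                   new_ghz=None, old_ghz=None):
    """Per-core throughput of the new chip relative to the old one.

    Scores must be "higher is better" throughput numbers from the same
    benchmark. If both clock speeds are given, the result is also
    normalized per GHz (an assumption; a real comparison may not do this).
    """
    new_per_core = new_score / new_cores
    old_per_core = old_score / old_cores
    if new_ghz is not None and old_ghz is not None:
        new_per_core /= new_ghz
        old_per_core /= old_ghz
    return new_per_core / old_per_core


# Placeholder inputs, NOT real benchmark data.
raw = per_core_ratio(new_score=80.0, new_cores=8,
                     old_score=100.0, old_cores=6)
per_clock = per_core_ratio(new_score=80.0, new_cores=8,
                           old_score=100.0, old_cores=6,
                           new_ghz=1.8, old_ghz=2.4)
print(f"per-core ratio: {raw:.2f}x, per-core-per-GHz ratio: {per_clock:.2f}x")
```

Note how much the answer moves depending on whether you correct for clock speed. That is one more reason to be careful with numbers from a sample running at an unusually low frequency.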
When I saw these results, I was immediately reminded of the first benchmarks that came out for the hyperthreaded Pentium 4. Hyperthreading (or simultaneous multithreading as it's commonly called in non-Intel implementations) turned out to be a mixed bag for the P4, sometimes reducing the per-core performance, and sometimes boosting it. And in fact, this sort of 0.5x to 1.5x range was about what we saw at hyperthreading's debut (if memory serves).
I've previously described Bulldozer as a sort of "extreme hyperthreading" approach, where the integer ALUs are replicated along with the normal replication and expansion of queues and buffers. So intuitively, it makes some sense that Bulldozer will give SMT-like results. Or, to put it differently, SMT is sort of finicky and your mileage will vary depending on the workload; it may turn out that Bulldozer is also finicky, and that it's going to work great for some niches and not-so-great for others.
Right now, however, all of this discussion is extremely preliminary—probably even premature. As Kanter points out, there are too many unknowns with both the engineering sample and the actual benches to lean too heavily on any interpretation. But I will admit that these results have got me thinking, and my expectation now is that Bulldozer will benefit very heavily from optimization work, and that it will be more finicky than Istanbul about what workloads it likes.