Okay, it's been too long since I typed a Tech Babble. But before I start, let me just say that this is, like all my Tech Babbles, an opinion piece based on facts and my own knowledge. This one even more so, as I don't have concrete information on why this is the case, but I think what follows is highly likely the reason. This probably won't be a very long one, but hey, I figured I'd type a bit on it anyway.
How Zen cores are connected within a CCX
It's fairly well known by now how the "Zen" architecture groups cores together. They are clustered into quartets that AMD calls "Core Complexes" - or just a Core Complex, singular. That's a CCX.
Forgive my poorly drawn annotations. :D
So, from the above pictures we can clearly see that although the "Zen2" ("Matisse") CPU die has most of the I/O moved off-die (it's on a separate I/O die, but that's not what I wanted to type about here), the overall structure in which the cores are arranged remains the same. Of course, you also have the doubled L3 cache in the middle - there's a lot more of that on "Zen2", as you probably know.
I've read a lot of people asking why Zen2 didn't move from a quad-core complex to a six or even an eight-core complex. And I did some thinking, and here I am, going to type up why I think that is.
It's all about how those cores are connected
Okay, so firstly we need to understand how the individual cores are connected within the CCX, and how that differs from, for example, Intel's Coffee Lake dies, which come in 2, 4, 6 and 8-core configurations without a "Complex" of their own (well, kinda).
The cores within each CCX in Zen are tightly linked via the L3 cache. The cache itself, as with most CPU designs, isn't a single, monolithic block, but is in fact split into chunks ("slices"). And each core is connected to the chunk closest to it.
So from the above slide, which describes how the cores communicate, we can see how it works. The 2MB slice of cache closest to the core is that core's direct link into the L3, and that slice is wired into all of the other slices. So in essence, each of the four cores has a direct link to the other three. So how does this differ from Intel's approach?
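Before I get to that - here's a tiny toy model of the CCX layout I just described, so you can see what "every core has a direct link to every other core" means in practice. This is purely my own illustrative sketch (a made-up graph in Python, not AMD's actual wiring): each core and each L3 slice is a node, each core hangs off its local slice, and the slices are fully cross-connected.

```python
from collections import deque
from itertools import combinations

# Hypothetical model of one 4-core CCX: core_i attaches to its local slice_i,
# and every slice is wired to every other slice. These are graph edges, not real latencies.
graph = {f"core{i}": {f"slice{i}"} for i in range(4)}
graph.update({
    f"slice{i}": {f"core{i}"} | {f"slice{j}" for j in range(4) if j != i}
    for i in range(4)
})

def hops(src, dst):
    """Shortest path length (in graph edges) via a simple breadth-first search."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        queue.extend((n, dist + 1) for n in graph[node] - seen)
        seen |= graph[node]

for a, b in combinations([f"core{i}" for i in range(4)], 2):
    print(f"{a} -> {b}: {hops(a, b)} hops")  # every pair comes out identical
```

The number itself isn't the point - the point is that every core-to-core path looks the same (local slice, remote slice, remote core), which is what I mean by "direct links".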
Well, Intel has two different mechanisms for connecting cores right now. They use the "Ring-bus" topology for their smaller dies of up to 8 cores (and maybe 10 cores next year). This design is like a train track of sorts, or a bus route (it's a ring bus, after all): a high-speed "data highway" where data is carried around to the core it needs to reach. Data going from, for example, core 1 to core 7 would follow the ring bus and then exit it upon reaching the core it needs to go to. It's like the Mesh I talk about below, but much faster and somewhat less direct - it also doesn't scale very well to large core counts, hence why the Mesh is used there...
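To picture the ring a bit more concretely, here's a rough back-of-the-envelope sketch. It's my own simplification - a real Coffee Lake ring also has stops for the iGPU, system agent and so on, which I'm ignoring here.

```python
# Toy bidirectional ring with one stop per core; data takes the shorter way around.
def ring_hops(src: int, dst: int, stops: int) -> int:
    forward = (dst - src) % stops
    return min(forward, stops - forward)

print(ring_hops(1, 7, 8))  # core 1 -> core 7 on an 8-stop ring: 2 hops the "short way"
average = sum(ring_hops(0, d, 8) for d in range(1, 8)) / 7
print(f"average hops on an 8-stop ring: {average:.2f}")  # ~2.3, and it grows as you add stops
```

The important bit is that the average trip length keeps growing roughly linearly as you add stops, which is where the scaling trouble comes from.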
The Mesh, on the other hand, is actually similar to the CCX design, but instead of direct links, the cores have "stations" where data can be passed on to the next core on its road to wherever it ultimately needs to go. In essence, the fabric works out the shortest path to the destination core and then routes the data through the Mesh, station by station. It's not as fast as a direct link to said core, but uhm... I will come to that just now.
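And a similarly rough sketch of the Mesh idea - I'm not claiming this is exactly how Intel routes data, it's just the textbook grid-routing picture, to show why the station-hopping approach copes better with lots of cores:

```python
from itertools import product

def mesh_hops(src, dst):
    """Hops on a grid-style mesh: walk the X distance, then the Y distance."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

def average_mesh_hops(width, height):
    nodes = list(product(range(width), range(height)))
    pairs = [(a, b) for a in nodes for b in nodes if a != b]
    return sum(mesh_hops(a, b) for a, b in pairs) / len(pairs)

# Average hops grow roughly with the square root of the core count,
# instead of linearly like the ring - so big dies stay manageable,
# even though each individual hop is slower than a direct link.
for w, h in [(3, 3), (5, 4), (6, 6)]:
    print(f"{w}x{h} mesh ({w*h} stops): {average_mesh_hops(w, h):.2f} average hops")
```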
Direct connections are complicated
And that problem gets much worse as the core count goes up. So I made some roughly annotated pictures of the Zen1/+ "Zeppelin" die's CCX, and I also made an attempt to annotate a theoretical 8-core CCX using the direct links that Zen uses as an architectural basis.
In these pictures I simplify things by drawing links between the cores directly, when in actuality each core is linked to the L3 cache slice closest to it. FYI.
Okay, so this sort of explains my theory. Zen is built using a very fast, low-latency connection between all of the cores in the complex, with the same average latency between any pair of them. It's actually slightly faster than Intel's Ring-bus on their 6 and 8-core dies, and comparable to, if slightly faster than, the 4-core ring-bus too. Zen is built on this design of core modularity, and it's how AMD solved the large-core-count scaling issues that Intel had with their original Ring-bus (Haswell-based Xeons ran into the limit of the ring bus design at around ~20 cores or so).
So why, exactly? Well, if you look at the second picture, that is a hugely complex amount of wiring between all the cores and cache slices in order to facilitate that direct connection system. It would increase the transistor requirements, the cache complexity and the sheer amount of metal wiring in the circuit immensely. Would it be fast? Maybe - it would likely be quicker when those eight cores need to communicate than an inter-CCX (4-core to 4-core) data transmission, but the complexity of the circuit and cache design needed to allow that 8-way connectivity is pretty mental, I think.
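If you want the back-of-the-envelope version of why: fully cross-connecting n slices needs a link for every pair, which is n(n-1)/2, and that grows quadratically. (Again, my own simplification - I'm only counting point-to-point links, not the actual wires or the cache arbitration logic behind them.)

```python
# Point-to-point links needed to fully cross-connect an n-core / n-slice complex.
for n in (4, 6, 8):
    links = n * (n - 1) // 2
    print(f"{n}-core complex: {links} links")
# 4-core:  6 links
# 6-core: 15 links
# 8-core: 28 links - nearly 5x the quad-core CCX's wiring, for 2x the cores
```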
Zen is built to be simple, economical, modular and highly scalable
One of the Zen architecture's strengths is that its topology can scale from small quad-core dies (even dual-cores can be made from this) to the 32-core EPYCs of the 1st generation, and later the "ROME"-based 2nd-generation EPYC parts with up to 64 cores - without a huge, mesh-based single-die solution with terrible yields (Intel Xeon), or an impossibly complex large-CCX architecture. Zen makes some trade-offs for this scalability and simplicity, but it pays off in development and manufacturing costs, allowing AMD to push tons of Zen cores into the market, aggressively cutting into Intel's high-margin product stack.
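Just to show how neatly the same 4-core block composes into the bigger parts I mentioned - counts only, with a hypothetical little helper of my own, glossing over the actual packaging and fabric details:

```python
# Core counts from stacking the same 4-core CCX building block.
def total_cores(dies, ccx_per_die, cores_per_ccx=4):
    return dies * ccx_per_die * cores_per_ccx

print(total_cores(dies=1, ccx_per_die=2))  #  8 - one "Zeppelin" die / one "Matisse" chiplet
print(total_cores(dies=4, ccx_per_die=2))  # 32 - 1st-gen EPYC, four Zeppelin dies
print(total_cores(dies=8, ccx_per_die=2))  # 64 - 2nd-gen EPYC "ROME", eight chiplets
```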
And it also performs very well, even with these trade-offs. Ryzen and EPYC are examples of that. Zen is a huge engineering accomplishment from AMD and its engineers, achieved with a budget a fraction of the competition's.
A conclusion, I guess
So, to conclude: the quad-core CCX persists with Zen2 despite the 'opportunity' to use a larger 8-core CCX design in the "Matisse" CPU chiplet - likely, in my opinion, due to interconnect complexity. And the trade-offs aren't that bad, either.
If you're wondering why AMD didn't go with a "Ring-bus" design for more cores per CCX, I would say that keeping Zen's "building blocks" to a minimum in size increases the granularity at which core counts can be deployed. For Zen2, the quad-core design kept it simple and was evidently the best choice - otherwise AMD's engineers wouldn't have kept it. At least, that's what I think anyway.
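A toy way to see that granularity argument (again just my own illustration, ignoring harvested parts with cores fused off, which add even more options):

```python
# Product core counts reachable by simply adding whole building blocks.
def reachable_core_counts(block_size, max_blocks=8):
    return [block_size * n for n in range(1, max_blocks + 1)]

print(reachable_core_counts(4))  # [4, 8, 12, 16, 20, 24, 28, 32] - fine-grained 4-core steps
print(reachable_core_counts(8))  # [8, 16, 24, 32, 40, 48, 56, 64] - coarser 8-core steps
```

In reality AMD also disables cores within a CCX (the 6-core Ryzens are 3+3, for example), so the real-world granularity is even finer - the point is just that a small block gives you more ways to slice things up.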
Who knows what the future holds - Zen3, or even Zen4 and beyond, may switch to a "ring-bus"-like design within each chiplet or complex and have more cores per complex, as core counts across multiple markets begin to increase rapidly (yay!). But for now, Zen2 makes sense: it's cost-efficient, simple and powerful.
Remember, these are just my thoughts - it's an opinion piece.
Thanks for reading. <3