Sasha W.

(Mini Tech Babble) "IPC" and Instruction Level Parallelism are related but different

This mini tech babble is aimed at people who like to use the industry-known term "IPC" as a blanket term for a CPU core's "performance per clock in a vacuum". That usage is incorrect: there are actually many different 'branches' of the CPU core that contribute to its overall "IPC" (Instructions Per Clock).


So, to start, I will say that it makes me feel better if I type "mini Tech Babble" in the category brackets instead of one of the "larger" numbered series. That is because I am more likely to commit to typing it (even if it becomes large enough to justify a regular full tech babble) if I go into the attempt thinking it is a small one. Thanks, ADHD.


Anyway, the actual start. Here it is. This is in 'response' to reading comments and talking to people about "IPC", specifically in relation to CPU cores, and specifically AMD's new Ryzen 7 5800X3D processor with the stacked L3 cache. I think I can accurately summarise my points in a mini babble so here goes.


Here I am, on disqus, doing that keyboard-warrior thing (is that the correct term in this context?) with people who seem to think that the ability for a CPU core to execute instructions in a vacuum (let's say, completely separate to the caches) is what "IPC" actually means. Well, that's wrong because just how many instructions is the CPU core going to execute, per clock, if it's waiting 1000 clock cycles each time it prepares an instruction just to fetch data from memory? Not many, for those 1000 clock cycles, I'll tell you that.


Goddamnit, my attention de-railed replying to more comments. I need to take my medication, but I will put a transcript of my pretty simple explanation of my take on IPC from that post, here:


Say you have 4 instructions waiting to execute, but 3 of them are memory / latency sensitive. The core is the exact same in this comparison.
CPU Core A) with a smaller L3 is executing instruction 1 but waiting for instruction 2, 3 and 4 to retrieve data from DRAM because they exceed the L3 capacity. IPC is 1.
CPU Core B) with a larger L3 executes instruction 1, too, but it can also execute instruction 2 because instruction 2's data dependency was in the larger L3 cache prefetched by the core from DRAM earlier, where it was evicted on CPU Core A. IPC is 2.
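If it helps, here is that quote as a tiny Python toy. It is only a sketch of my example, not a real CPU model: the Instr class, the ipc_this_cycle function and the cache line names are all made up for illustration, and I'm assuming a single cycle where an instruction can execute only if it needs no data or its data is already sitting in the L3.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instr:
    name: str
    needs: Optional[str] = None  # hypothetical cache line this instruction must load


def ipc_this_cycle(instrs, l3_lines):
    """Count the instructions whose data is available in this one cycle."""
    return sum(1 for i in instrs if i.needs is None or i.needs in l3_lines)


# Four waiting instructions; three of them depend on memory.
window = [
    Instr("i1"),
    Instr("i2", "line_a"),
    Instr("i3", "line_b"),
    Instr("i4", "line_c"),
]

core_a_l3 = set()        # smaller L3: line_a has already been evicted
core_b_l3 = {"line_a"}   # larger L3: line_a is still resident

print("Core A IPC:", ipc_this_cycle(window, core_a_l3))
print("Core B IPC:", ipc_this_cycle(window, core_b_l3))
```

Same window of instructions, same "core" function, different cache contents, different IPC.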

I also want to note the difference between the overall IPC of the CPU core (which includes the entire memory subsystem, from internal caches to main system memory) and the notion of Instruction Level Parallelism, which is the ability of the CPU core's execution resources (as long as they're fed by the front end and have backend bandwidth; CPU cores aren't simple lol) to execute instructions from a single 'serial instruction stream' (i.e. a thread) in parallel. This is what most people simply call "IPC", and in a vacuum, it is roughly that.
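To make the ILP side a bit more concrete, here is a rough sketch that estimates how much parallelism is available in a short instruction stream purely from its dependencies. The assumptions are mine and very generous: every instruction takes one cycle, the core has unlimited execution units, and memory never stalls anything. The ilp function and the instruction names are hypothetical.

```python
def ilp(deps):
    """Estimate available ILP as instruction count / critical path length.

    deps maps each instruction to the instructions it must wait for.
    Assumes the dict is written in program order, so every dependency
    appears before the instruction that uses it.
    """
    depth = {}
    for instr, waits_on in deps.items():
        # An instruction can issue one cycle after its slowest dependency.
        depth[instr] = 1 + max((depth[d] for d in waits_on), default=0)
    critical_path = max(depth.values())
    return len(deps) / critical_path


independent = {"a": [], "b": [], "c": [], "d": []}           # nothing depends on anything
chain = {"a": [], "b": ["a"], "c": ["b"], "d": ["c"]}        # fully serial
mixed = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"]}      # somewhere in between

for name, stream in [("independent", independent), ("chain", chain), ("mixed", mixed)]:
    print(name, ilp(stream))
```

The independent stream can, in this idealised model, run all four instructions at once, while the chain is stuck at one per cycle no matter how wide the core is. This is the "in a vacuum" number, before the memory subsystem gets a say.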


The issue is that effective "IPC" is heavily dependent on the memory subsystem, as in the quote I posted above. Two identical CPU cores with the exact same circuitry can have vastly different IPC in the exact same workload, depending on the performance and capacity of the memory subsystem. As in, the observed "performance per clock" would be greatly different with absolutely no direct change to the CPU core logic itself.
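A back-of-the-envelope way to see this is the simple textbook stall model: cycles per instruction = the core's best-case cycles per instruction + the memory stall cycles per instruction. The sketch below uses that model. It is deliberately crude (it ignores out-of-order overlap and multiple misses being in flight at once), and every number in it is a placeholder I picked for illustration, not a measurement of any real CPU.

```python
def effective_ipc(peak_ipc, loads_per_instr, miss_rate, miss_penalty_cycles):
    """Effective IPC under a simple 'every miss fully stalls the core' model."""
    base_cpi = 1.0 / peak_ipc                                    # core in a vacuum
    stall_cpi = loads_per_instr * miss_rate * miss_penalty_cycles  # waiting on DRAM
    return 1.0 / (base_cpi + stall_cpi)


# Identical core (peak IPC of 4), identical workload (1 load per 4 instructions),
# identical DRAM penalty (200 cycles). Only the miss rate differs, standing in
# for a smaller versus a larger L3. All values are illustrative placeholders.
small_l3 = effective_ipc(4.0, 0.25, miss_rate=0.04, miss_penalty_cycles=200)
large_l3 = effective_ipc(4.0, 0.25, miss_rate=0.01, miss_penalty_cycles=200)

print(f"smaller L3: {small_l3:.2f} IPC")
print(f"larger L3:  {large_l3:.2f} IPC")
```

With these made-up inputs the "same" core lands at very different effective IPC, and neither figure is anywhere near the peak of 4 that the execution resources could manage on their own.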


So when using the term "IPC", I think this is a semantic issue with most people thinking IPC is entirely related to the core's execution resources, without understanding that the number of instructions per clock is dependent on the workload and memory performance, and actually, a whole bunch of other factors.


So, in summary, the hardware ILP capabilities of a core are a separate component of IPC, just like the size and speed of the core's caches. So yes, IPC is directly affected by L3 cache size and/or performance, so yes, 5800X3D will have higher IPC in many workloads gated by this than 5800X, and no, IPC isn't "fixed" over all workloads; it's usually given as an average number across a wide variety of workloads but each one has different gains (sometimes even regressions).
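For the "average number" part, this is roughly how a single IPC uplift figure falls out of a set of per-workload results. The workload names and IPC values below are invented placeholders, not benchmark results for the 5800X, the 5800X3D or anything else, and using a geometric mean is my assumption about how the averaging might be done.

```python
from statistics import geometric_mean

# Placeholder per-workload IPC figures, invented purely for illustration.
base_core = {"game": 1.00, "compile": 1.50, "render": 2.40}
big_cache_core = {"game": 1.25, "compile": 1.56, "render": 2.35}

# Per-workload uplift: above 1.0 is a gain, below 1.0 is a regression.
uplift = {w: big_cache_core[w] / base_core[w] for w in base_core}
average_uplift = geometric_mean(list(uplift.values()))

for workload, ratio in uplift.items():
    print(f"{workload}: {ratio:.3f}x")
print(f"average: {average_uplift:.3f}x")
```

The single averaged figure hides the spread: in this toy data one workload gains a lot, one barely moves, and one regresses slightly.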


Also, no, I refuse to acknowledge anyone that uses the term "IPC" when referring to GPU shader processors. Ugh. (For the record, you could very broadly apply the term "IPC" to a GPU's SM for NVIDIA or CU/WGP for Radeon. VERY BROADLY, it could be analogous to measuring how many vector operations were performed per clock there, but it's generally not applicable to wide SIMD engines vs serial processors like CPUs.)
