CPU Test time! Yay! Today I am testing whether SMT benefits this workload with a fixed power limit enforced, but without the frequency locked. That is, Precision Boost Override is enabled and the clock speed of the processor depends on AMD's internal power/health management system with the variables I have specified.
This test is not to determine how much Zen3 gains from SMT in a given workload, though you could theoretically extrapolate that information from the results. The test is to determine the end-result performance of the processor within a given power limit in this workload, and that includes the fact that with SMT disabled, the processor is able to run at elevated frequencies due to lower current from lower circuit utilisation per core.
Here is the video:
I typed a conclusion at the end of the video, but I will summarise the result here, too. Please note, this is my theory. It is not guaranteed to be 100% accurate.
I observe a significant performance difference between SMT enabled with 32 threads and SMT disabled with only 16. In the latter, each core only runs a single thread, so there is no sharing of per-core resources or caches between threads. The result shows that in the initial render phase, the SMT advantage is a 26% reduction in the time taken to complete that phase. This is likely because render is CPU-core limited with an emphasis on driving instructions through the execution engine, where SMT lets the core keep its multiple execution ports busy by extracting Thread Level Parallelism in addition to the Instruction Level Parallelism within each thread.
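A reduction in time is not the same number as a gain in throughput, so here is a minimal sketch of the conversion for the 26% figure above. The function name is my own and purely illustrative; the only input taken from the test is the 26% time reduction.

```python
def speedup_from_time_reduction(reduction: float) -> float:
    """Convert a fractional reduction in run time into a throughput factor.

    If a phase takes (1 - reduction) of the original time, the work done
    per unit of time rises by 1 / (1 - reduction).
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in the range [0, 1)")
    return 1.0 / (1.0 - reduction)


if __name__ == "__main__":
    factor = speedup_from_time_reduction(0.26)
    # 1 / 0.74 is roughly 1.35, i.e. about 35% more render throughput
    print(f"{factor:.2f}x throughput")
```

In other words, finishing the render phase in 26% less time works out to roughly 35% more work done per second with SMT on.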
However, the filter phase is a different story. It is likely quite reliant on memory access/bandwidth, due to the nature of comparing a tile / pixel to nearby ones and averaging out colours to remove noise. In this situation, the additional threads per core take up work from the software and place additional contention on the internal caches and the relatively narrow dual-channel memory controller of this AM4 'consumer' desktop 16-core processor. I have backed this up by measuring the "DRAM Read/Write bandwidth" sensor reported by HWINFO64, which is significantly higher during Filter, along with the program using more than twice the system memory during this phase, all of which is consistent with a lot of L3 cache misses. The performance loss here, compared to SMT off, is not caused by clock speed. The processor is so restricted by memory in 32T mode that the cores are, ironically, less utilised than with SMT disabled, which is backed up by the lower total package power (cores use less power when under less load).
Having 32 threads fighting for the Dual-Channel, DDR4-3600 memory interface is causing each thread's performance to drop below the point where SMT is a net gain, when all threads are in use. This is compounded by the lower per-thread hit-rate on all cache levels including the internal uOp cache (though I doubt this is a major bottleneck) and so the processor slows down.
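To put a rough number on that contention, here is a small sketch of the theoretical peak bandwidth of dual-channel DDR4-3600 and what an even split per thread would look like. It assumes the standard 64-bit (8-byte) DDR4 channel width; the function names are mine, and real sustained bandwidth will be noticeably lower than this theoretical peak, so treat it as an upper bound rather than a measurement.

```python
def peak_dram_bandwidth_gbs(mega_transfers: int, channels: int = 2,
                            bytes_per_transfer: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (decimal gigabytes).

    mega_transfers: the DDR rating, e.g. 3600 for DDR4-3600.
    bytes_per_transfer: 8 bytes for a standard 64-bit DDR4 channel.
    """
    return mega_transfers * 1e6 * bytes_per_transfer * channels / 1e9


def even_share_gbs(total_gbs: float, threads: int) -> float:
    """Bandwidth per thread if all threads pulled on memory equally."""
    return total_gbs / threads


if __name__ == "__main__":
    total = peak_dram_bandwidth_gbs(3600)  # dual-channel DDR4-3600
    for threads in (16, 32):
        share = even_share_gbs(total, threads)
        print(f"{threads} threads: {share:.1f} GB/s each of {total:.1f} GB/s peak")
```

At best that is 57.6 GB/s shared by everything, so doubling the thread count from 16 to 32 halves the theoretical share per thread from 3.6 GB/s to 1.8 GB/s.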
The Ryzen 9 5950X is hitting the limits, I believe, of what can be achieved with Dual-channel DDR4, at least until the stacked V-Cache models hit next year, where each 8-core CCD would have access to up to 96MiB of combined L3, which will surely help alleviate this bandwidth issue that I believe is occurring.
Astute readers might see the potential work-around here: using dynamic software-based thread affinity control (e.g. via something like Process Lasso) to limit the filter stage of the workload to 16T while running with SMT enabled. This would allow Render to use all 32T but would only allow the program to address 16 threads in Filter. The SMT-aware Windows scheduler would then load up the 16 physical cores first, giving me the best of both worlds. That, of course, is a lot of effort, and I cannot be bothered. For now, SMT is remaining off until I need it.
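For anyone who does want to try it, a slightly different route to the same end is to pin the filter stage to one logical processor per physical core with an affinity mask. Here is a minimal sketch that builds such a mask. It assumes SMT siblings are numbered next to each other (logical CPUs 0 and 1 on core 0, 2 and 3 on core 1, and so on), which I have not verified for every system, so check your own topology first. The function name is hypothetical.

```python
def one_thread_per_core_mask(logical_cpus: int, threads_per_core: int = 2) -> int:
    """Build an affinity bitmask selecting the first logical CPU of each core.

    ASSUMPTION: sibling threads of a core are enumerated adjacently,
    so stepping by threads_per_core lands on a different physical core
    every time.
    """
    if logical_cpus <= 0 or threads_per_core <= 0:
        raise ValueError("counts must be positive")
    mask = 0
    for cpu in range(0, logical_cpus, threads_per_core):
        mask |= 1 << cpu  # set the bit for this logical CPU
    return mask


if __name__ == "__main__":
    mask = one_thread_per_core_mask(32)  # 16 cores, 32 threads
    print(hex(mask))                     # 0x55555555
    print(bin(mask).count("1"))          # 16 logical CPUs selected
```

The resulting hex value is the kind of mask that affinity tools generally accept. Switching it on only for the filter stage would still need something that can tell when that stage starts, which is the part I cannot be bothered with.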
I already did a small (and admittedly rather limited/incomplete) test regarding SMT in World Community Grid, and the results show it favours SMT on Zen2, but only slightly.