assembly - Can Zen 4 run more than 1 branch per cycle - Stack Overflow

In "Performance optimization, and how to do it wrong", the author claims:

  1. The CPU can't predict more than one branch per cycle.

  2. A single if statement inside a loop is enough to stop any further instructions from being decoded in that cycle.

Claim 1 contradicts measurements by uops.info for jz and jnz, which show a reciprocal throughput of 0.50. Are there different port limits for taken vs. not-taken branches, as on Haswell/Skylake?
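
To spell out the arithmetic behind that objection, here is a minimal sketch. The function name is my own and is not taken from uops.info. Reciprocal throughput is the average number of cycles per instruction in steady state, so its inverse is instructions per cycle. This only restates the reported figure; it says nothing about whether the measured branches were taken or not taken.

```python
def branches_per_cycle(reciprocal_throughput: float) -> float:
    """Convert a reciprocal throughput (cycles per instruction)
    into a sustained rate (instructions per cycle).

    Hypothetical helper for illustration only.
    """
    if reciprocal_throughput <= 0:
        raise ValueError("reciprocal throughput must be positive")
    return 1.0 / reciprocal_throughput


# 0.50 is the figure quoted above for jz/jnz
print(branches_per_cycle(0.50))  # 2.0
```

So the reported 0.50 corresponds to two jcc per cycle, which is what appears to conflict with claim 1.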

Claim 2 is not mentioned in the Software Optimization Guide for the AMD Zen4 Microarchitecture. The only similar note is in section 2.9, Instruction Fetch and Decode (quoted below), but a jcc is only 6 bytes long. Do the decoders stop after decoding a branch?

Only the first decode slot (of four) can decode instructions greater than 10 bytes in length. Avoid having more than one instruction in a sequence of four that is greater than 10 bytes in length.

Are these performance limits for Zen 4 documented anywhere?
