https://uops.info/html-instr/SHLX_R64_R64_R64.html#ADL-P lists `SHLX` as having 3-cycle latency for both operands. This is in contrast to Intel's docs and InstLatx64's measurements. So what gives?

I looked into it and figured out something strange. If you run the specified nanoBench command, you'll see:
Here R10 has some arbitrary value. Let's try setting it to 1:
So it actually matters how the count register is initialized. It seems that if it is initialized with a 64-bit op, you get 3c latency, but a 32-bit op that implicitly zeroes the top half gets it down to 1c latency. Other things like `MOV R10D, R10D` also work.

I'm not sure how the nanoBench commands on uops.info are generated, so I'm reporting this here.
Relevant discussion here: