uops.info Alder Lake-P latency for SHLX #33

Open
tavianator opened this issue Dec 31, 2024 · 1 comment
Comments

@tavianator

https://uops.info/html-instr/SHLX_R64_R64_R64.html#ADL-P lists SHLX as having 3-cycle latency for both operands. This is in contrast to Intel's docs and InstLatx64's measurements. So what gives?

I looked into it and found something strange. If you run the nanoBench command given on that page, you'll see:

# lscpu
...
  Model name:             12th Gen Intel(R) Core(TM) i7-1280P
...
# ./nanoBench.sh -f -unroll 100 -warm_up_count 10 -config configs/cfg_AlderLakeP_common.txt -cpu 0 \
    -asm "SHLX R9, R8, R10; MOVSX R8, R9D; MOVSX R8, R8D; MOVSX R8, R8D; MOVSX R8, R8D; MOVSX R8, R8D" \
...
Core cycles: 8.00
...

Here R10 has some arbitrary value. Let's try setting it to 1:

# ./nanoBench.sh -f -unroll 100 -warm_up_count 10 -config configs/cfg_AlderLakeP_common.txt -cpu 0 \
    -asm "SHLX R9, R8, R10; MOVSX R8, R9D; MOVSX R8, R8D; MOVSX R8, R8D; MOVSX R8, R8D; MOVSX R8, R8D" \
    -asm_init "MOV R10, 1"
...
Core cycles: 8.00
...
# ./nanoBench.sh -f -unroll 100 -warm_up_count 10 -config configs/cfg_AlderLakeP_common.txt -cpu 0 \
    -asm "SHLX R9, R8, R10; MOVSX R8, R9D; MOVSX R8, R8D; MOVSX R8, R8D; MOVSX R8, R8D; MOVSX R8, R8D" \
    -asm_init "MOV R10D, 1"
...
Core cycles: 6.00
...

So it actually matters how the count register is initialized. It seems that if R10 was last written by a 64-bit operation, SHLX has 3-cycle latency, but a 32-bit operation that implicitly zeroes the upper half brings it down to 1 cycle. Other 32-bit writes such as MOV R10D, R10D also work.
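
For anyone checking the arithmetic, here is a rough sketch of how the SHLX latency falls out of the cycle counts above. It assumes each of the five MOVSX instructions in the chain contributes 1 cycle, which the issue does not state explicitly, and `shlx_latency` is just a hypothetical helper name for illustration:

```python
# Assumption (not stated in the issue): each MOVSX in the dependency
# chain has 1-cycle latency, so the rest of the loop time is SHLX.
MOVSX_LATENCY = 1
MOVSX_COUNT = 5  # MOVSX instructions in the -asm string above


def shlx_latency(core_cycles: float) -> float:
    """Subtract the MOVSX chain from the measured core cycles per iteration."""
    return core_cycles - MOVSX_COUNT * MOVSX_LATENCY


# "Core cycles" values reported by the three nanoBench runs above.
measurements = {
    "no -asm_init (arbitrary R10)": 8.00,
    "MOV R10, 1": 8.00,
    "MOV R10D, 1": 6.00,
}

for init, cycles in measurements.items():
    print(f"{init}: SHLX latency = {shlx_latency(cycles):.0f}c")
```

Under that assumption, the 8.00-cycle runs correspond to 3c for SHLX and the 6.00-cycle run corresponds to 1c.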

I'm not sure how the nanoBench commands on uops.info are generated, so I'm reporting this here.

Relevant discussion here:

@tavianator

More details here: https://tavianator.com/2025/shlxplained.html

It's related to the "small immediate add renaming" optimization introduced in Alder Lake.
