Selective merged prefill #643
base: mlperf_features
Conversation
Force-pushed da935cf to d6bdc90
Force-pushed 6ddfcac to f6c0c84
Force-pushed b6f6961 to 2d6ceb9
vllm/attention/backends/hpu_attn.py
Outdated
key_cache, value_cache = HPUPagedAttention.split_kv_cache(
    kv_cache, self.num_kv_heads, self.head_size)

key_cache = self.k_cache(padded_key_tensor, key_cache, block_indices,
It seems that when decoding, padded_key_tensor is not defined. Would this be a problem?
I think you're right, but it didn't trigger any error. I'll look into it.
@yangw1234, after checking the code: since enable_merged_prefill is only enabled in prefill_fwd, this path is never reached during decode. I'll clean up the code to make it more readable.
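To make the resolution concrete, here is a minimal, self-contained sketch of the control flow described above. The names here (is_prompt, enable_merged_prefill, the write_kv helper) are illustrative assumptions, not the actual hpu_attn.py implementation: the point is only that padded_key_tensor exists solely on the merged-prefill prompt path, so decode never references an undefined name.

import torch
import torch.nn.functional as F

def write_kv(key: torch.Tensor, is_prompt: bool,
             enable_merged_prefill: bool) -> torch.Tensor:
    # Illustrative only: padded_key_tensor is created and consumed
    # entirely inside the merged-prefill branch, so the decode path
    # below can never hit an undefined name.
    if is_prompt and enable_merged_prefill:
        # merged prefill: right-pad the sequence dim before the cache write
        padded_key_tensor = F.pad(key, (0, 0, 0, 2))
        return padded_key_tensor
    # decode: write the unpadded keys directly
    return key

print(write_kv(torch.ones(3, 4), True, True).shape)    # torch.Size([5, 4])
print(write_kv(torch.ones(3, 4), False, True).shape)   # torch.Size([3, 4])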
vllm/attention/backends/hpu_attn.py
Outdated
max_len = attn_metadata.slot_mapping.size(1)
seq_lens_tensor_list = attn_metadata.seq_lens_tensor.tolist()
# we need to copy the key and value tensors to the padded tensors
# shape is [batch_size, entire_seq_len, num_kv_heads, head_size]
padded_key_tensor = split_and_pad_to_length(key, max_len, seq_lens_tensor_list)
padded_value_tensor = split_and_pad_to_length(value, max_len, seq_lens_tensor_list)
padded_key_tensor = padded_key_tensor.flatten(0, 1).unflatten(0, (block_indices.size(0), -1))
padded_value_tensor = padded_value_tensor.flatten(0, 1).unflatten(0, (block_indices.size(0), -1))
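For readers without the full diff, here is a plausible reconstruction of split_and_pad_to_length, inferred from the shape comment above; the actual helper in the PR may differ in its details:

import torch

def split_and_pad_to_length(x: torch.Tensor, max_len: int,
                            seq_lens: list) -> torch.Tensor:
    # Split a packed [total_tokens, num_kv_heads, head_size] tensor into
    # per-sequence chunks and right-pad each to max_len, producing
    # [batch_size, max_len, num_kv_heads, head_size].
    chunks = torch.split(x, seq_lens, dim=0)
    padded = x.new_zeros((len(seq_lens), max_len) + x.shape[1:])
    for i, chunk in enumerate(chunks):
        padded[i, :chunk.size(0)] = chunk
    return padded

# two sequences of 3 and 5 tokens, 2 kv heads, head size 4
packed = torch.randn(8, 2, 4)
print(split_and_pad_to_length(packed, 6, [3, 5]).shape)  # torch.Size([2, 6, 2, 4])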
Is it possible to get rid of these lines if we prepare block_indices and block_offsets in a way that excludes the padded tokens?
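A rough sketch of that suggestion (all names and shapes here are assumed for illustration, not taken from the PR): derive the block indices and offsets through a padding mask, so only real tokens are scattered into the cache and the padded key/value copies above become unnecessary.

import torch

def unpadded_block_indices(slot_mapping: torch.Tensor,
                           seq_lens: torch.Tensor, block_size: int):
    # Keep only the cache slots of real tokens (mask out the right
    # padding), so the packed key/value tensors could be written into
    # the cache directly, without building padded copies first.
    max_len = slot_mapping.size(1)
    mask = torch.arange(max_len)[None, :] < seq_lens[:, None]  # [B, max_len]
    valid_slots = slot_mapping[mask]  # [total_real_tokens]
    return valid_slots // block_size, valid_slots % block_size

slot_mapping = torch.tensor([[0, 1, 2, -1], [8, 9, 10, 11]])
idx, off = unpadded_block_indices(slot_mapping, torch.tensor([3, 4]), 4)
print(idx.tolist(), off.tolist())  # [0, 0, 0, 2, 2, 2, 2] [0, 1, 2, 0, 1, 2, 3]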
Force-pushed d18df2d to 92bf903
Force-pushed b7d0931 to ce48860
Force-pushed ce48860 to a3602f2