-
Notifications
You must be signed in to change notification settings - Fork 89
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[controller] Fix mismatch between hybrid version partition count and real-time partition count #1338
base: main
Are you sure you want to change the base?
[controller] Fix mismatch between hybrid version partition count and real-time partition count #1338
Conversation
Moving to draft as I would like to move topic creation in addVersion and see how that works out |
...ient/src/main/java/com/linkedin/davinci/kafka/consumer/LeaderFollowerStoreIngestionTask.java
Show resolved
Hide resolved
...enice-common/src/main/java/com/linkedin/venice/controllerapi/RequestTopicForPushRequest.java
Outdated
Show resolved
Hide resolved
...ces/venice-controller/src/main/java/com/linkedin/venice/controller/server/CreateVersion.java
Outdated
Show resolved
Hide resolved
...ces/venice-controller/src/main/java/com/linkedin/venice/controller/server/CreateVersion.java
Outdated
Show resolved
Hide resolved
I agree with the solution "STREAM push type is now disallowed if there is no online hybrid version." However, I think, instead of returning error, we should create a new version (that would be of right partition count) and then let the push start. If we look carefully, changing partition count without creating a version (even for batch stores) is similar to changing partition for a hybrid store; even for the batch stores, because batch stores can be changed to hybrid I am skeptical about "The requestTopicForPushing method no longer creates a real-time topic if it does not already exist."; I believe in most of the cases, this is a benign operation and we should not prevent creation of the RT. |
I think this is not safe in some cases, for example, user wants to create a new version but keep old data.
I'm don't really follow the rest of the comment. Could you please elaborate? Thanks
We do not allow this as of now
I disagree that this is a harmless operation. The case mentioned in the commit message is one example why this is not a good idea. Moreover, we should tie hybrid version materialization to RT creation for better alignment and not let RT creation happen at every other place IMHO |
c0d5fa3
to
c709994
Compare
6547aa6
to
7f27513
Compare
internal/venice-common/src/main/java/com/linkedin/venice/pubsub/PubSubConstants.java
Outdated
Show resolved
Hide resolved
...e-controller/src/main/java/com/linkedin/venice/controller/VeniceControllerClusterConfig.java
Show resolved
Hide resolved
services/venice-controller/src/main/java/com/linkedin/venice/controller/VeniceHelixAdmin.java
Show resolved
Hide resolved
…rtition count This fix addresses an issue where the real-time topic partition count did not align with the hybrid version partition count, causing errors during hybrid store operations. The issue occurred in the following scenario: 1. Create a store with 1 partition. 2. Perform a batch push, creating a batch version with 1 partition. 3. Update the store to 3 partitions and convert it to a hybrid store. 4. Start real-time writes using push type STREAM. 5. Perform a full push to create a hybrid version with 3 partitions. This push fails because, after the topic switch, real-time consumers cannot find partitions 2 and 3 due to the real-time topic having only 1 partition. Root Cause: - In step 4, if the real-time topic did not exist, it was created with a partition count derived from the largest existing version (batch version with 1 partition), leading to a mismatch. Solution: - STREAM push type is now disallowed if there is no online hybrid version. - If an online hybrid version exists, it ensures the real-time topic partition count matches the hybrid version partition count. - The `requestTopicForPushing` method no longer creates a real-time topic if it does not already exist. Move real-time topic creation logic in addVersion enable participant message store and revert HB interval Fix tests in 1430 Fix test Fix tests Fix cc tests revert log4j2 Disable topic creation in RT topic switcher Fix flakies Do not call getRealTimeTopic fix tests
7f27513
to
07d57b0
Compare
...ient/src/main/java/com/linkedin/davinci/kafka/consumer/LeaderFollowerStoreIngestionTask.java
Show resolved
Hide resolved
internal/venice-common/src/main/java/com/linkedin/venice/pubsub/PubSubConstants.java
Show resolved
Hide resolved
...st-common/src/integrationTest/java/com/linkedin/venice/controller/TestInstanceRemovable.java
Show resolved
Hide resolved
...e-controller/src/main/java/com/linkedin/venice/controller/VeniceControllerClusterConfig.java
Show resolved
Hide resolved
services/venice-controller/src/main/java/com/linkedin/venice/controller/VeniceHelixAdmin.java
Show resolved
Hide resolved
services/venice-controller/src/main/java/com/linkedin/venice/controller/VeniceHelixAdmin.java
Show resolved
Hide resolved
public Version getIncrementalPushVersion(String clusterName, String storeName) { | ||
Version incrementalPushVersion = getVeniceHelixAdmin().getIncrementalPushVersion(clusterName, storeName); | ||
public Version getIncrementalPushVersion(String clusterName, String storeName, String pushJobId) { | ||
Version incrementalPushVersion = getVeniceHelixAdmin().getIncrementalPushVersion(clusterName, storeName, pushJobId); | ||
String incrementalPushTopic = incrementalPushVersion.kafkaTopicName(); | ||
ExecutionStatus status = getOffLinePushStatus(clusterName, incrementalPushTopic).getExecutionStatus(); | ||
|
||
return getIncrementalPushVersion(incrementalPushVersion, status); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did not read this code path before, but I am confused by the method implementation, as it seems like it requires the retrieved hybrid version to be terminal status, which I don't think is really needed?
services/venice-controller/src/main/java/com/linkedin/venice/controller/Admin.java
Show resolved
Hide resolved
services/venice-controller/src/main/java/com/linkedin/venice/controller/VeniceHelixAdmin.java
Show resolved
Hide resolved
if (!getTopicManager().containsTopicAndAllPartitionsAreOnline(separateRtTopic) | ||
|| isTopicTruncated(separateRtTopic.getName())) { | ||
LOGGER.error( | ||
"Incremental push: {} cannot be started for store: {} in cluster: {} because the topic: {} is either absent or being truncated", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to clarify that by implementation, inc push is only pushed to one of the above topics based on VPJ config. But I think more checking is always good. nit: can consider wrapping the lines into a small check method and reuse for RT and sepRT.
Ensure real-time topic partition count matches hybrid version partition count
This fix addresses an issue where the real-time topic partition count did not align with the hybrid version
partition count, causing errors during hybrid store ingestion. The issue occurred in the following scenario:
switch, real-time consumers cannot find partitions 2 and 3 due to the real-time topic having only 1 partition.
Root Cause:
existing version (batch version with 1 partition), which lead to mismatch.
Solution in this PR:
partition count.
requestTopicForPushing
method no longer creates a real-time topic if it does not already exist with STREAM PushType.Misc:
How was this PR tested?
WIP
Does this PR introduce any user-facing changes?