-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(fuzzer): Support multiple joins in the join node "toSql" methods for reference query runners #11801
base: main
Are you sure you want to change the base?
Conversation
✅ Deploy Preview for meta-velox canceled.
|
This pull request was exported from Phabricator. Differential Revision: D66977480 |
e573440
to
f24cd12
Compare
This pull request was exported from Phabricator. Differential Revision: D66977480 |
f24cd12
to
6e9702c
Compare
…in Presto QR (facebookincubator#11801) Summary: Currently, the hash join "toSql" method for PrestoQueryRunner only supports a single join. This change extends it to support multiple joins, only needing the join node of the last join in the tree. It traverses up the tree and recursively builds the sql query. Differential Revision: D66977480
This pull request was exported from Phabricator. Differential Revision: D66977480 |
6e9702c
to
3cbd9e0
Compare
…for reference query runners (facebookincubator#11801) Summary: Currently, the hash join "toSql" method for all reference query runners only support a single join. This change extends it to support multiple joins, only needing the join node of the last join in the tree. It traverses up the tree and recursively builds the sql query. Differential Revision: D66977480
This pull request was exported from Phabricator. Differential Revision: D66977480 |
3cbd9e0
to
28f2244
Compare
… for reference query runners (facebookincubator#11801) Summary: Currently, the hash join and nested loop join "toSql" methods for all reference query runners only support a single join. This change extends it to support multiple joins, only needing the join node of the last join in the tree. It traverses up the tree and recursively builds the sql query. Differential Revision: D66977480
This pull request was exported from Phabricator. Differential Revision: D66977480 |
28f2244
to
180fe64
Compare
… for reference query runners (facebookincubator#11801) Summary: Currently, the hash join and nested loop join "toSql" methods for all reference query runners only support a single join. This change extends it to support multiple joins, only needing the join node of the last join in the tree. It traverses up the tree and recursively builds the sql query. Differential Revision: D66977480
This pull request was exported from Phabricator. Differential Revision: D66977480 |
180fe64
to
731d39f
Compare
… for reference query runners (facebookincubator#11801) Summary: Currently, the hash join and nested loop join "toSql" methods for all reference query runners only support a single join. This change extends it to support multiple joins, only needing the join node of the last join in the tree. It traverses up the tree and recursively builds the sql query. Differential Revision: D66977480
This pull request was exported from Phabricator. Differential Revision: D66977480 |
731d39f
to
211a0b3
Compare
… for reference query runners (facebookincubator#11801) Summary: Currently, the hash join and nested loop join "toSql" methods for all reference query runners only support a single join. This change extends it to support multiple joins, only needing the join node of the last join in the tree. It traverses up the tree and recursively builds the sql query. Differential Revision: D66977480
This pull request was exported from Phabricator. Differential Revision: D66977480 |
211a0b3
to
3befde2
Compare
… for reference query runners (facebookincubator#11801) Summary: Currently, the hash join and nested loop join "toSql" methods for all reference query runners only support a single join. This change extends it to support multiple joins, only needing the join node of the last join in the tree. It traverses up the tree and recursively builds the sql query. Differential Revision: D66977480
This pull request was exported from Phabricator. Differential Revision: D66977480 |
3befde2
to
1db434c
Compare
… for reference query runners (facebookincubator#11801) Summary: Currently, the hash join and nested loop join "toSql" methods for all reference query runners only support a single join. This change extends it to support multiple joins, only needing the join node of the last join in the tree. It traverses up the tree and recursively builds the sql query. Differential Revision: D66977480
… for reference query runners (facebookincubator#11801) Summary: Currently, the hash join and nested loop join "toSql" methods for all reference query runners only support a single join. This change extends it to support multiple joins, only needing the join node of the last join in the tree. It traverses up the tree and recursively builds the sql query. Differential Revision: D66977480
… for reference query runners (facebookincubator#11801) Summary: Currently, the hash join and nested loop join "toSql" methods for all reference query runners only support a single join. This change extends it to support multiple joins, only needing the join node of the last join in the tree. It traverses up the tree and recursively builds the sql query. Differential Revision: D66977480
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for looking into this @DanielHunte . Left a few comments
probeTableName = fmt::format("({})", *probeSubQuery); | ||
buildTableName = fmt::format("({})", *buildSubQuery); | ||
} else { | ||
return std::nullopt; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this throw an exception, or this is a legit valid case?
@@ -66,6 +66,13 @@ class ReferenceQueryRunner { | |||
return true; | |||
} | |||
|
|||
/// Executes SQL query returned by the 'toSql' method based on the plan. | |||
virtual std::multiset<std::vector<velox::variant>> execute( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not making this pure virtual? ( = 0;
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SparkQueryRunner does not currently need an implementation of this method.
@@ -88,6 +95,13 @@ class ReferenceQueryRunner { | |||
return false; | |||
} | |||
|
|||
/// Similar to 'execute' but returns results in RowVector format. | |||
virtual std::vector<RowVectorPtr> executeVector( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SparkQueryRunner does not currently need an implementation of this method.
3cbb0cf
to
f19fcc8
Compare
… for reference query runners (facebookincubator#11801) Summary: Currently, the hash join and nested loop join "toSql" methods for all reference query runners only support a single join. This change extends it to support multiple joins, only needing the join node of the last join in the tree. It traverses up the tree and recursively builds the sql query. Differential Revision: D66977480
This pull request was exported from Phabricator. Differential Revision: D66977480 |
… for reference query runners (facebookincubator#11801) Summary: Currently, the hash join and nested loop join "toSql" methods for all reference query runners only support a single join. This change extends it to support multiple joins, only needing the join node of the last join in the tree. It traverses up the tree and recursively builds the sql query. Differential Revision: D66977480
… for reference query runners (facebookincubator#11801) Summary: Currently, the hash join and nested loop join "toSql" methods for all reference query runners only support a single join. This change extends it to support multiple joins, only needing the join node of the last join in the tree. It traverses up the tree and recursively builds the sql query. Differential Revision: D66977480
… for reference query runners (facebookincubator#11801) Summary: Currently, the hash join and nested loop join "toSql" methods for all reference query runners only support a single join. This change extends it to support multiple joins, only needing the join node of the last join in the tree. It traverses up the tree and recursively builds the sql query. Differential Revision: D66977480
… for reference query runners (facebookincubator#11801) Summary: Currently, the hash join and nested loop join "toSql" methods for all reference query runners only support a single join. This change extends it to support multiple joins, only needing the join node of the last join in the tree. It traverses up the tree and recursively builds the sql query. Differential Revision: D66977480
… for reference query runners (facebookincubator#11801) Summary: Currently, the hash join and nested loop join "toSql" methods for all reference query runners only support a single join. This change extends it to support multiple joins, only needing the join node of the last join in the tree. It traverses up the tree and recursively builds the sql query. Differential Revision: D66977480
f19fcc8
to
a872daa
Compare
This pull request was exported from Phabricator. Differential Revision: D66977480 |
… for reference query runners (facebookincubator#11801) Summary: Currently, the hash join and nested loop join "toSql" methods for all reference query runners only support a single join. This change extends it to support multiple joins, only needing the join node of the last join in the tree. It traverses up the tree and recursively builds the sql query. Differential Revision: D66977480
… for reference query runners (facebookincubator#11801) Summary: Currently, the hash join and nested loop join "toSql" methods for all reference query runners only support a single join. This change extends it to support multiple joins, only needing the join node of the last join in the tree. It traverses up the tree and recursively builds the sql query. Differential Revision: D66977480
… for reference query runners (facebookincubator#11801) Summary: Currently, the hash join and nested loop join "toSql" methods for all reference query runners only support a single join. This change extends it to support multiple joins, only needing the join node of the last join in the tree. It traverses up the tree and recursively builds the sql query. Differential Revision: D66977480
… for reference query runners (facebookincubator#11801) Summary: Currently, the hash join and nested loop join "toSql" methods for all reference query runners only support a single join. This change extends it to support multiple joins, only needing the join node of the last join in the tree. It traverses up the tree and recursively builds the sql query. Differential Revision: D66977480
… for reference query runners (facebookincubator#11801) Summary: Currently, the hash join and nested loop join "toSql" methods for all reference query runners only support a single join. This change extends it to support multiple joins, only needing the join node of the last join in the tree. It traverses up the tree and recursively builds the sql query. Differential Revision: D66977480
a872daa
to
4d95f9d
Compare
This pull request was exported from Phabricator. Differential Revision: D66977480 |
… for reference query runners (facebookincubator#11801) Summary: Currently, the hash join and nested loop join "toSql" methods for all reference query runners only support a single join. This change extends it to support multiple joins, only needing the join node of the last join in the tree. It traverses up the tree and recursively builds the sql query. Differential Revision: D66977480
… for reference query runners (facebookincubator#11801) Summary: Currently, the hash join and nested loop join "toSql" methods for all reference query runners only support a single join. This change extends it to support multiple joins, only needing the join node of the last join in the tree. It traverses up the tree and recursively builds the sql query. Differential Revision: D66977480
… for reference query runners (facebookincubator#11801) Summary: Currently, the hash join and nested loop join "toSql" methods for all reference query runners only support a single join. This change extends it to support multiple joins, only needing the join node of the last join in the tree. It traverses up the tree and recursively builds the sql query. Differential Revision: D66977480
… for reference query runners (facebookincubator#11801) Summary: Currently, the hash join and nested loop join "toSql" methods for all reference query runners only support a single join. This change extends it to support multiple joins, only needing the join node of the last join in the tree. It traverses up the tree and recursively builds the sql query. Differential Revision: D66977480
… for reference query runners (facebookincubator#11801) Summary: Currently, the hash join and nested loop join "toSql" methods for all reference query runners only support a single join. This change extends it to support multiple joins, only needing the join node of the last join in the tree. It traverses up the tree and recursively builds the sql query. Differential Revision: D66977480
… for reference query runners (facebookincubator#11801) Summary: Currently, the hash join and nested loop join "toSql" methods for all reference query runners only support a single join. This change extends it to support multiple joins, only needing the join node of the last join in the tree. It traverses up the tree and recursively builds the sql query. Differential Revision: D66977480
4d95f9d
to
2d9c75b
Compare
This pull request was exported from Phabricator. Differential Revision: D66977480 |
virtual std::vector<RowVectorPtr> executeAndReturnVector( | ||
const std::string& sql, | ||
const core::PlanNodePtr& plan) { | ||
VELOX_UNSUPPORTED(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does executeAndReturnVector need to be in ReferenceQueryRunner/virtual? It looks like it's only called from PrestoQueryRunner and could just be a private helper method.
if (const auto joinNode = | ||
std::dynamic_pointer_cast<const core::ValuesNode>(plan)) { | ||
return toSql(joinNode); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like you copied this from above, valuesNode would be a more appropriate variable name here.
@@ -60,6 +65,8 @@ class DuckQueryRunner : public ReferenceQueryRunner { | |||
const RowTypePtr& resultType) override; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still use this function anywhere, or can we delete this flavor of execute?
const std::string& sql, | ||
const core::PlanNodePtr& plan) override; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw the discussion that caused you to add sql as parameter this function, but it feels a little unnecessarily dangerous to me to pass in plan and sql (which is wholly derived from plan) and rely on them to be in sync without being able to check.
What do you think about passing in the plan and only converting the plan to SQL in execute?
The code in JoinFuzzer is
if (auto sql = referenceQueryRunner_->toSql(plan)) {
return referenceQueryRunner_->execute(*sql, plan);
}
LOG(INFO) << "Query not supported by the reference DB";
return std::nullopt;
if we do the conversion in execute and return std::optional<std::multiset<std::vectorvelox::variant>> then all that code in JoinFuzzer could just be
return referenceQueryRunner_->execute(plan);
which still addresses the duplication that Wei had concerns about.
(as a small bonus the log message could be a little more explicit "Query not supported by DuckDB" and "Query not supported by Presto" in their respective implementations)
/// Returns the name of the values node table in the form t_<id>. | ||
std::string getTableName(const core::ValuesNodePtr& valuesNode) { | ||
return fmt::format("t_{}", valuesNode->id()); | ||
} | ||
// Traverses all nodes in the plan and returns all tables and their names. | ||
std::unordered_map<std::string, std::vector<velox::RowVectorPtr>> | ||
getAllTables(const core::PlanNodePtr& plan) { | ||
std::unordered_map<std::string, std::vector<velox::RowVectorPtr>> result; | ||
if (const auto valuesNode = | ||
std::dynamic_pointer_cast<const core::ValuesNode>(plan)) { | ||
result.insert({getTableName(valuesNode), valuesNode->values()}); | ||
} else { | ||
for (const auto& source : plan->sources()) { | ||
auto tablesAndNames = getAllTables(source); | ||
result.insert(tablesAndNames.begin(), tablesAndNames.end()); | ||
} | ||
} | ||
return result; | ||
} | ||
|
||
bool isSupportedDwrfType(const TypePtr& type) { | ||
if (type->isDate() || type->isIntervalDayTime() || type->isUnKnown()) { | ||
return false; | ||
} | ||
|
||
for (auto i = 0; i < type->size(); ++i) { | ||
if (!isSupportedDwrfType(type->childAt(i))) { | ||
return false; | ||
} | ||
} | ||
|
||
return true; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These could be protected right? (It looks like getTableName and isSupportedDwrfType could even be private)
out << filterToSql(joinNode.filter()); | ||
} | ||
return out.str(); | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
} | ||
|
||
/// Same as the above toSql but for hash join nodes. | ||
virtual std::optional<std::string> toSql( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before this change ReferenceQueryRunner was pretty trivial. With these more complicated function definitions it's probably worth adding a ReferenceQueryRunner.cpp and moving the implementation of these toSql methods there (just like they were split in PrestoQueryRunner).
If you do that, the static methods declared above could probably be moved outside the class into functions declared in an anonymous namespace in the cpp file.
probeTableName = probeSubQuery->find(" ") != std::string::npos | ||
? fmt::format("({})", *probeSubQuery) | ||
: *probeSubQuery; | ||
buildTableName = buildSubQuery->find(" ") != std::string::npos | ||
? fmt::format("({})", *buildSubQuery) | ||
: *buildSubQuery; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: it should still work if you just always wrap the sub queries in parentheses right? that seems simpler
std::string probeTableName; | ||
std::string buildTableName; | ||
const std::optional<std::string> probeSubQuery = | ||
toSql(joinNode->sources()[0]); | ||
const std::optional<std::string> buildSubQuery = | ||
toSql(joinNode->sources()[1]); | ||
if (probeSubQuery && buildSubQuery) { | ||
probeTableName = probeSubQuery->find(" ") != std::string::npos | ||
? fmt::format("({})", *probeSubQuery) | ||
: *probeSubQuery; | ||
buildTableName = buildSubQuery->find(" ") != std::string::npos | ||
? fmt::format("({})", *buildSubQuery) | ||
: *buildSubQuery; | ||
} else { | ||
return std::nullopt; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this code is duplicated, could you make a helper function, e.g. something like
std::optional<std::string> joinSourceToSql(const PlanNodePtr& node) {
const std::optional<std::string> subQuery = toSql(node);
if (!subQuery) {
return std::nullopt;
}
return fmt::format("({})", *subQuery);
}
Then in both join toSql functions you'd have
auto probeTableName = joinSourceToSql(joinNode->sources()[0]);
auto buildTableName = joinSourceToSql(joinNode->sources()[1]);
if (!probeTableName || !buildTableName) {
return std::nullopt;
}
VELOX_CHECK( | ||
joinNode->joinCondition() == nullptr, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: you could make this VELOX_CHECK_NULL(joinNode->joinCondition(), ....
Summary: Currently, the hash join and nested loop join "toSql" methods for all reference query runners only support a single join. This change extends it to support multiple joins, only needing the join node of the last join in the tree. It traverses up the tree and recursively builds the sql query.
Differential Revision: D66977480