feat(fuzzer): Support multiple joins in the join node "toSql" methods for reference query runners #11801

Open

@DanielHunte DanielHunte commented Dec 9, 2024

Summary: Currently, the hash join and nested loop join "toSql" methods for all reference query runners support only a single join. This change extends them to support multiple joins, requiring only the join node of the last join in the tree. It traverses up the tree and recursively builds the SQL query.
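
As a rough illustration of the recursive approach described in the summary (this is not the code in this PR), a minimal sketch might look like the following. `Node`, its fields, the `p<id>`/`b<id>` aliases, and the emitted SQL shape are all hypothetical stand-ins for the Velox plan nodes and the runners' real output.

```cpp
#include <cassert>
#include <memory>
#include <optional>
#include <string>
#include <vector>

// Hypothetical, simplified plan node; not the Velox core::PlanNode API.
struct Node {
  enum class Kind { kValues, kJoin, kOther };
  Kind kind;
  std::string id;                              // Used to name tables and aliases.
  std::string condition;                       // Join condition; joins only.
  std::vector<std::shared_ptr<Node>> sources;  // Children; two for a join.
};

// Returns SQL for 'node', or std::nullopt if any node in its subtree is
// not supported. Join sources are converted recursively, so passing the
// last join in the tree is enough to cover every join below it.
std::optional<std::string> toSql(const std::shared_ptr<Node>& node) {
  if (node->kind == Node::Kind::kValues) {
    return "SELECT * FROM t_" + node->id;
  }
  if (node->kind == Node::Kind::kJoin) {
    assert(node->sources.size() == 2);
    const auto probe = toSql(node->sources[0]);
    const auto build = toSql(node->sources[1]);
    if (!probe || !build) {
      return std::nullopt;
    }
    // Each source is always wrapped in parentheses as a subquery.
    return "SELECT * FROM (" + *probe + ") AS p" + node->id +
        " INNER JOIN (" + *build + ") AS b" + node->id + " ON " +
        node->condition;
  }
  return std::nullopt;
}
```

Calling `toSql` on a join whose probe side is itself a join yields a nested subquery, while a single unsupported node anywhere in the tree makes the whole conversion return `std::nullopt`.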

Differential Revision: D66977480

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Dec 9, 2024

netlify bot commented Dec 9, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 2d9c75b
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/67733cdebc5f9e0008dbf81d

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D66977480

DanielHunte pushed a commit to DanielHunte/velox that referenced this pull request Dec 9, 2024
…in Presto QR (facebookincubator#11801)

Summary:

Currently, the hash join "toSql" method for PrestoQueryRunner only supports a single join. This change extends it to support multiple joins, only needing the join node of the last join in the tree. It traverses up the tree and recursively builds the sql query.

Differential Revision: D66977480

DanielHunte pushed a commit to DanielHunte/velox that referenced this pull request Dec 10, 2024
…for reference query runners (facebookincubator#11801)

Summary:

Currently, the hash join "toSql" method for all reference query runners only support a single join. This change extends it to support multiple joins, only needing the join node of the last join in the tree. It traverses up the tree and recursively builds the sql query.

Differential Revision: D66977480

@DanielHunte DanielHunte changed the title feat(fuzzer): Support multiple joins in the hash node "toSql" method in Presto QR feat(fuzzer): Support multiple joins in the hash node "toSql" method for reference query runners Dec 10, 2024
@DanielHunte DanielHunte changed the title feat(fuzzer): Support multiple joins in the hash node "toSql" method for reference query runners feat(fuzzer): Support multiple joins in the join node "toSql" methods for reference query runners Dec 10, 2024
DanielHunte pushed a commit to DanielHunte/velox that referenced this pull request Dec 10, 2024
… for reference query runners (facebookincubator#11801)

Summary:

Currently, the hash join and nested loop join "toSql" methods for all reference query runners only support a single join. This change extends it to support multiple joins, only needing the join node of the last join in the tree. It traverses up the tree and recursively builds the sql query.

Differential Revision: D66977480

@pedroerp pedroerp left a comment

Thanks for looking into this, @DanielHunte. Left a few comments.

velox/core/PlanNode.h (resolved)
velox/exec/fuzzer/DuckQueryRunner.cpp (outdated, resolved)
probeTableName = fmt::format("({})", *probeSubQuery);
buildTableName = fmt::format("({})", *buildSubQuery);
} else {
return std::nullopt;
Contributor

Should this throw an exception, or is this a legitimate, valid case?

velox/exec/fuzzer/PrestoQueryRunner.cpp (outdated, resolved)
velox/exec/fuzzer/PrestoQueryRunner.cpp (outdated, resolved)
@@ -66,6 +66,13 @@ class ReferenceQueryRunner {
return true;
}

/// Executes SQL query returned by the 'toSql' method based on the plan.
virtual std::multiset<std::vector<velox::variant>> execute(
Contributor

Why not make this pure virtual (= 0;)?

Author

SparkQueryRunner does not currently need an implementation of this method.
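
To make the trade-off in this thread concrete, here is a minimal sketch, under simplified assumptions, of a base class whose virtual method throws by default instead of being pure virtual. `RunnerBase`, `SqlRunner`, and `NoExecuteRunner` are hypothetical names, and `std::logic_error` stands in for whatever the real code would use to signal an unsupported call.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the base runner; not the Velox declaration.
class RunnerBase {
 public:
  virtual ~RunnerBase() = default;

  // Default implementation throws. A subclass that never needs this method
  // is not forced to override it, which a pure virtual ("= 0") would require.
  virtual std::string execute(const std::string& /*sql*/) {
    throw std::logic_error("execute is not supported by this runner");
  }
};

// A runner that supports execute and overrides it.
class SqlRunner : public RunnerBase {
 public:
  std::string execute(const std::string& sql) override {
    return "ran: " + sql;
  }
};

// A runner that does not need execute. It is still instantiable, but
// calling execute on it fails at run time rather than at compile time.
class NoExecuteRunner : public RunnerBase {};
```

The cost of this design is that a missing override is only detected when the method is called, whereas a pure virtual would make the omission a compile error.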

@@ -88,6 +95,13 @@ class ReferenceQueryRunner {
return false;
}

/// Similar to 'execute' but returns results in RowVector format.
virtual std::vector<RowVectorPtr> executeVector(
Contributor

ditto

Author

SparkQueryRunner does not currently need an implementation of this method.

Comment on lines +100 to +104
virtual std::vector<RowVectorPtr> executeAndReturnVector(
const std::string& sql,
const core::PlanNodePtr& plan) {
VELOX_UNSUPPORTED();
}
Contributor

Why does executeAndReturnVector need to be in ReferenceQueryRunner/virtual? It looks like it's only called from PrestoQueryRunner and could just be a private helper method.

Comment on lines +179 to +182
if (const auto joinNode =
std::dynamic_pointer_cast<const core::ValuesNode>(plan)) {
return toSql(joinNode);
}
Contributor

It looks like you copied this from above; valuesNode would be a more appropriate variable name here.

@@ -60,6 +65,8 @@ class DuckQueryRunner : public ReferenceQueryRunner {
const RowTypePtr& resultType) override;
Contributor

Do we still use this function anywhere, or can we delete this flavor of execute?

Comment on lines +51 to +52
const std::string& sql,
const core::PlanNodePtr& plan) override;
Contributor

I saw the discussion that caused you to add sql as a parameter to this function, but it feels unnecessarily dangerous to me to pass in both plan and sql (which is wholly derived from plan) and rely on them being in sync without being able to check.

What do you think about passing in the plan and only converting the plan to SQL in execute?

The code in JoinFuzzer is

if (auto sql = referenceQueryRunner_->toSql(plan)) {
  return referenceQueryRunner_->execute(*sql, plan);
}
LOG(INFO) << "Query not supported by the reference DB";
return std::nullopt;

If we do the conversion in execute and return std::optional<std::multiset<std::vector<velox::variant>>>, then all that code in JoinFuzzer could just be

return referenceQueryRunner_->execute(plan);

which still addresses the duplication that Wei had concerns about.

(as a small bonus the log message could be a little more explicit "Query not supported by DuckDB" and "Query not supported by Presto" in their respective implementations)
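
A minimal sketch of the shape suggested in this comment, assuming simplified types: `Plan`, `Row`, `Results`, `Runner`, and the private `run` helper are hypothetical stand-ins, not the Velox `core::PlanNodePtr`, `velox::variant`, or runner classes.

```cpp
#include <optional>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-ins for the plan and result types.
struct Plan {
  bool supported;   // Whether the plan can be expressed as SQL.
  std::string sql;  // The SQL the plan would translate to.
};
using Row = std::vector<int>;
using Results = std::multiset<Row>;

class Runner {
 public:
  // Takes only the plan and derives the SQL internally, so the caller can
  // never pass a plan and a SQL string that are out of sync. Returns
  // std::nullopt when the plan cannot be converted.
  std::optional<Results> execute(const Plan& plan) const {
    const auto sql = toSql(plan);
    if (!sql) {
      // A runner-specific log message could be emitted here.
      return std::nullopt;
    }
    return run(*sql);
  }

 private:
  std::optional<std::string> toSql(const Plan& plan) const {
    if (!plan.supported) {
      return std::nullopt;
    }
    return plan.sql;
  }

  // Placeholder for sending the SQL to the reference database. It returns
  // one row holding the SQL length so the sketch has something to return.
  Results run(const std::string& sql) const {
    Results results;
    results.insert(Row{static_cast<int>(sql.size())});
    return results;
  }
};
```

With this shape the caller reduces to a single `return runner.execute(plan);` and only needs to handle the empty optional.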

Comment on lines +133 to +165
/// Returns the name of the values node table in the form t_<id>.
std::string getTableName(const core::ValuesNodePtr& valuesNode) {
  return fmt::format("t_{}", valuesNode->id());
}
// Traverses all nodes in the plan and returns all tables and their names.
std::unordered_map<std::string, std::vector<velox::RowVectorPtr>>
getAllTables(const core::PlanNodePtr& plan) {
  std::unordered_map<std::string, std::vector<velox::RowVectorPtr>> result;
  if (const auto valuesNode =
          std::dynamic_pointer_cast<const core::ValuesNode>(plan)) {
    result.insert({getTableName(valuesNode), valuesNode->values()});
  } else {
    for (const auto& source : plan->sources()) {
      auto tablesAndNames = getAllTables(source);
      result.insert(tablesAndNames.begin(), tablesAndNames.end());
    }
  }
  return result;
}

bool isSupportedDwrfType(const TypePtr& type) {
  if (type->isDate() || type->isIntervalDayTime() || type->isUnKnown()) {
    return false;
  }

  for (auto i = 0; i < type->size(); ++i) {
    if (!isSupportedDwrfType(type->childAt(i))) {
      return false;
    }
  }

  return true;
}
Contributor

These could be protected, right? (It looks like getTableName and isSupportedDwrfType could even be private.)

out << filterToSql(joinNode.filter());
}
return out.str();
};
Contributor

ditto

}

/// Same as the above toSql but for hash join nodes.
virtual std::optional<std::string> toSql(
Contributor

Before this change ReferenceQueryRunner was pretty trivial. With these more complicated function definitions it's probably worth adding a ReferenceQueryRunner.cpp and moving the implementation of these toSql methods there (just like they were split in PrestoQueryRunner).

If you do that, the static methods declared above could probably be moved outside the class into functions declared in an anonymous namespace in the cpp file.
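
For illustration only, the split suggested here could look roughly like the single-file sketch below, with comments marking what would go in the header and what would go in the source file. `RunnerSketch`, `tableSql`, and `tableName` are hypothetical names, not identifiers from this PR.

```cpp
#include <string>

// --- Would live in a header (hypothetical RunnerSketch.h) ---
class RunnerSketch {
 public:
  // Only the declaration is exposed; the helper it relies on is not.
  std::string tableSql(int valuesNodeId) const;
};

// --- Would live in the source file (hypothetical RunnerSketch.cpp) ---
namespace {

// A function in an anonymous namespace has internal linkage, so it is
// visible only inside this translation unit and stays out of the class
// interface entirely.
std::string tableName(int valuesNodeId) {
  return "t_" + std::to_string(valuesNodeId);
}

} // namespace

std::string RunnerSketch::tableSql(int valuesNodeId) const {
  return "SELECT * FROM " + tableName(valuesNodeId);
}
```

Compared with static member functions declared in the header, this keeps helper signatures out of the public header, so changing them does not force dependents to recompile.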

Comment on lines +224 to +229
probeTableName = probeSubQuery->find(" ") != std::string::npos
? fmt::format("({})", *probeSubQuery)
: *probeSubQuery;
buildTableName = buildSubQuery->find(" ") != std::string::npos
? fmt::format("({})", *buildSubQuery)
: *buildSubQuery;
Contributor

Minor: it should still work if you always wrap the subqueries in parentheses, right? That seems simpler.

Comment on lines +314 to +329
std::string probeTableName;
std::string buildTableName;
const std::optional<std::string> probeSubQuery =
toSql(joinNode->sources()[0]);
const std::optional<std::string> buildSubQuery =
toSql(joinNode->sources()[1]);
if (probeSubQuery && buildSubQuery) {
probeTableName = probeSubQuery->find(" ") != std::string::npos
? fmt::format("({})", *probeSubQuery)
: *probeSubQuery;
buildTableName = buildSubQuery->find(" ") != std::string::npos
? fmt::format("({})", *buildSubQuery)
: *buildSubQuery;
} else {
return std::nullopt;
}
Contributor

It looks like this code is duplicated. Could you make a helper function, e.g. something like

std::optional<std::string> joinSourceToSql(const PlanNodePtr& node) {
  const std::optional<std::string> subQuery = toSql(node);

  if (!subQuery) {
    return std::nullopt;
  }

  return fmt::format("({})", *subQuery);
}

Then in both join toSql functions you'd have

auto probeTableName = joinSourceToSql(joinNode->sources()[0]);
auto buildTableName = joinSourceToSql(joinNode->sources()[1]);

if (!probeTableName || !buildTableName) {
  return std::nullopt;
}

Comment on lines +335 to +336
VELOX_CHECK(
joinNode->joinCondition() == nullptr,
Contributor

nit: you could make this VELOX_CHECK_NULL(joinNode->joinCondition(), ....
