Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: Support for Multi-Level Partition Tables #88

Conversation

shamb0
Copy link

@shamb0 shamb0 commented Aug 20, 2024

Closes #56

What

Implements a demonstration test for multi-level partition tables, addressing issue #56.

Why

This demo showcases the pg_analytics extension's capability to support multi-level partitioned tables. The implementation organizes data hierarchically, enabling efficient access to context-relevant information.

How

  1. Generates a simulated Auto Sales dataset and creates a local Parquet data source.
  2. Utilizes DataFusion to load sales data from the Parquet file into a DataFrame.
  3. Partitions the data and uploads partitions to an S3 bucket.
  4. Configures necessary tables and pg_analytics Foreign Data Wrapper (FDW) in PostgreSQL using S3 data.
  5. Executes analytics test queries on the multi-level partitioned table.

Tests

To run the demonstration test:

RUST_LOG=info \
    cargo test \
    --test \
    test_mlp_auto_sales \
    -- \
    --nocapture

Test traces are available in the attached log file :: https://gist.githubusercontent.com/shamb0/2ed909ac9604c610af1d7fa0e87f9a82/raw/02a4203cdc1d675181d9f9700c578c81405becdb/wk2434-pg_analytics-mlp-demo.md

@CLAassistant
Copy link

CLAassistant commented Aug 20, 2024

CLA assistant check
All committers have signed the CLA.

@shamb0 shamb0 marked this pull request as ready for review August 20, 2024 09:13
@shamb0 shamb0 changed the title Demo: Support for Multi-Level Partition Tables test: Support for Multi-Level Partition Tables Aug 20, 2024
Copy link
Contributor

@rebasedming rebasedming left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain in the PR README how you set up the Postgres tables to match the Parquet partitions? It looks like this is something we'll need to document in the main repo to show users how to do.

@shamb0
Copy link
Author

shamb0 commented Aug 22, 2024

Could you explain in the PR README how you set up the Postgres tables to match the Parquet partitions? It looks like this is something we'll need to document in the main repo to show users how to do.

Hi @rebasedming,

I’ve put together a detailed README for the PR, which you can find here.

I aimed to be thorough in capturing all the necessary details. If you find it too lengthy or in need of restructuring, please let me know, and I’ll make the necessary adjustments.

Copy link
Contributor

@rebasedming rebasedming left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've put up #91 which brings all our text fixtures into this crate. If possible, I'd appreciate it if you could align your PR with this one once it's merged. For instance, you can now put your fixtures as a new file in the fixtures crate, and auto_sales can go under tables.

}

// Implement Printable for tuples up to 12 elements
impl_printable_for_tuple!(T1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not bundle this into this PR, it's a separate thing that we can discuss whether it's worth bringing into the repo.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, done. I've removed the macros. Let me know if you want tests/common/print_utils.rs removed as well.

@rebasedming
Copy link
Contributor

Could you explain in the PR README how you set up the Postgres tables to match the Parquet partitions? It looks like this is something we'll need to document in the main repo to show users how to do.

Hi @rebasedming,

I’ve put together a detailed README for the PR, which you can find here.

I aimed to be thorough in capturing all the necessary details. If you find it too lengthy or in need of restructuring, please let me know, and I’ll make the necessary adjustments.

Nice, thank you for the extremely thorough writeup.

What I was most interested in is your approach of putting partitioned heap tables in front of foreign tables to pass partition keys to the Parquet file string. I wasn't aware this was possible and was hoping you could elaborate on that.

@shamb0
Copy link
Author

shamb0 commented Aug 24, 2024

Thanks for pointing out the missing critical piece. I've introduced the section Partitioned Table Structure and S3 Integration with the necessary details, and I've also created a TL;DR quick overview.

@shamb0 shamb0 force-pushed the shamb0/demo-multi-level-partition-table-dset-auto-sales branch from ffc9515 to 50cdc5b Compare August 24, 2024 04:35
@shamb0 shamb0 requested a review from philippemnoel as a code owner August 24, 2024 04:35
@shamb0 shamb0 force-pushed the shamb0/demo-multi-level-partition-table-dset-auto-sales branch 2 times, most recently from b8c27f9 to 9491b82 Compare August 24, 2024 05:15
@shamb0
Copy link
Author

shamb0 commented Aug 24, 2024

Hi @rebasedming,

I've made some changes to address the clippy warnings in this PR:

  1. Created a new crate: pg_analytics_test_helpers
  2. Moved fixtures and common modules from the tests folder to this new crate

These changes were necessary to get the CI build passing. I know this might be outside the scope of our current PR, so let me know if you'd prefer I revert these changes.

Thanks for your input on this.

@philippemnoel
Copy link
Collaborator

Hi @rebasedming,

I've made some changes to address the clippy warnings in this PR:

  1. Created a new crate: pg_analytics_test_helpers
  2. Moved fixtures and common modules from the tests folder to this new crate

These changes were necessary to get the CI build passing. I know this might be outside the scope of our current PR, so let me know if you'd prefer I revert these changes.

Thanks for your input on this.

Can you explain which Clippy warnings you needed a new crate for? I don't think we should complicate the project with an extra crate. We can simply ignore some clippy warnings in the test files if they're causing issues.

@philippemnoel
Copy link
Collaborator

I also took a look at your CI error, and I'm a bit confused by it. I suspect this has to do with the newly introduced crate. If you keep a single crate, it shouldn't find two different PG versions.

@shamb0
Copy link
Author

shamb0 commented Aug 24, 2024

Hi @philippemnoel,

I've identified an issue with commit 9491b824e5dfffd3d636dcb7bb4b67a3fe9ee858:

  1. cargo test runs successfully
  2. cargo clippy --all-targets fails

The problem:

  • Error occurs in: use crate::common::{execute_query, fetch_results, print_utils};
  • Clippy searches for the common module in the pg_analytics crate instead of the tests folder

I tried several solutions (e.g., using super) to fix the module path. But no luck. As a temporary fix, I've:

  • Moved the module fixture and common out of the tests folder
  • Created a new crate: pg_analytics_test_helpers

image

@philippemnoel
Copy link
Collaborator

Hi @philippemnoel,

I've identified an issue with commit 9491b824e5dfffd3d636dcb7bb4b67a3fe9ee858:

  1. cargo test runs successfully
  2. cargo clippy --all-targets fails

The problem:

  • Error occurs in: use crate::common::{execute_query, fetch_results, print_utils};
  • Clippy searches for the common module in the pg_analytics crate instead of the tests folder

I tried several solutions (e.g., using super) to fix the module path. But no luck. As a temporary fix, I've:

  • Moved the module fixture and common out of the tests folder
  • Created a new crate: pg_analytics_test_helpers

image

This code is added by your PR right? Can we instead fix the import path and keep things in the same crate?

@shamb0 shamb0 force-pushed the shamb0/demo-multi-level-partition-table-dset-auto-sales branch 2 times, most recently from e947e42 to 06f0b36 Compare August 25, 2024 02:21
@shamb0
Copy link
Author

shamb0 commented Aug 25, 2024

Hi @philippemnoel,

The PR is ready for the next level of review. I've removed the duplicate code, performed a cleanup, and ensured proper integration with PR #91. Additionally, fixtures/tables/auto_sales.rs now follows the pattern used in other test implementations across the workspace.

@philippemnoel
Copy link
Collaborator

Hi @shamb0. This looks super clean! Thank you for integrating everything properly, I'm very excited about this PR.

I believe it should also have documentation, so that the users can know that partitions are supported and know how to use them. Our documentation is stored in https://github.com/paradedb/paradedb/tree/dev/docs. Would you be willing to submit a PR with documentation to that repository as well? Then I think everything will be complete :). I'll let Ming do a more thorough review

@shamb0
Copy link
Author

shamb0 commented Aug 25, 2024

Hi @philippemnoel,

I wanted to update you on PR#1568, which includes recent documentation changes.

Currently, I’ve placed the new topic under ingest/configuration/multi-level-partitioned-tables. Could you please review and suggest if this is the most appropriate location, or if you have any preferred path for this topic?

image

Copy link
Collaborator

@philippemnoel philippemnoel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @shamb0! Thank you for doing this. We appreciate the docs PRs. We'll probably push a commit to it to make it more similar to our existing documentation, and then this should be good. I'll let @rebasedming do a proper review

@philippemnoel
Copy link
Collaborator

Sorry, we had to merge a few other PRs to get v0.1.1 out. There shouldn't be anything else that introduces a conflict, but could you please rebase it? We'll prioritize it :)

@shamb0 shamb0 force-pushed the shamb0/demo-multi-level-partition-table-dset-auto-sales branch from 06f0b36 to 3ea0fc0 Compare August 26, 2024 02:44
@shamb0
Copy link
Author

shamb0 commented Aug 26, 2024

Hi @philippemnoel,

The rebase is complete, and the PR should now be ready for intake review. Please let me know if you encounter any issues.

Thanks again!

@philippemnoel
Copy link
Collaborator

Hi @philippemnoel,

The rebase is complete, and the PR should now be ready for intake review. Please let me know if you encounter any issues.

Thanks again!

Thank you! Could you please take a look at the failing test?

@rebasedming
Copy link
Contributor

Hi @shamb0 , I've spent some time testing this PR and I have some bad news.

While the strategy you used of setting partitions to foreign tables works, it comes at a significant performance penalty. In pg_analytics, we push down the entire query to DuckDB by intercepting it in the executor hook. By querying the top-level partitioned table you put in front of the foreign tables, the executor hook is not run, and the query is executed by the Postgres FDW API, which essentially performs a sequential scan of the underlying Parquet file (with predicate/limit pushdown). This scan is significantly slower than a full DuckDB query for lots of use cases and obviates the performance benefits of only scanning one Parquet file.

@shamb0
Copy link
Author

shamb0 commented Aug 27, 2024

Hi @shamb0 , I've spent some time testing this PR and I have some bad news.

While the strategy you used of setting partitions to foreign tables works, it comes at a significant performance penalty. In pg_analytics, we push down the entire query to DuckDB by intercepting it in the executor hook. By querying the top-level partitioned table you put in front of the foreign tables, the executor hook is not run, and the query is executed by the Postgres FDW API, which essentially performs a sequential scan of the underlying Parquet file (with predicate/limit pushdown). This scan is significantly slower than a full DuckDB query for lots of use cases and obviates the performance benefits of only scanning one Parquet file.

Hi @rebasedming,

Thank you for the insightful feedback and for highlighting the performance issue with the strategy used in the PR.

Based on your comments, I understand that the current approach of using partitioned heap tables in front of foreign tables bypasses the executor hook, which normally pushes queries to DuckDB. This results in slower query execution through PostgreSQL's FDW API, as it ends up performing a sequential scan of the underlying Parquet files.

I will investigate this issue further and work on an improved strategy that maintains the performance benefits of DuckDB. I’ll get back to you soon with a better solution.

Thanks for your patience!

@philippemnoel
Copy link
Collaborator

Hi @shamb0 , I've spent some time testing this PR and I have some bad news.
While the strategy you used of setting partitions to foreign tables works, it comes at a significant performance penalty. In pg_analytics, we push down the entire query to DuckDB by intercepting it in the executor hook. By querying the top-level partitioned table you put in front of the foreign tables, the executor hook is not run, and the query is executed by the Postgres FDW API, which essentially performs a sequential scan of the underlying Parquet file (with predicate/limit pushdown). This scan is significantly slower than a full DuckDB query for lots of use cases and obviates the performance benefits of only scanning one Parquet file.

Hi @rebasedming,

Thank you for the insightful feedback and for highlighting the performance issue with the strategy used in the PR.

Based on your comments, I understand that the current approach of using partitioned heap tables in front of foreign tables bypasses the executor hook, which normally pushes queries to DuckDB. This results in slower query execution through PostgreSQL's FDW API, as it ends up performing a sequential scan of the underlying Parquet files.

I will investigate this issue further and work on an improved strategy that maintains the performance benefits of DuckDB. I’ll get back to you soon with a better solution.

Thanks for your patience!

We're excited to see the next iteration :)

@rebasedming
Copy link
Contributor

Closing this now as per the above discussion

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support multi-level partition tables
4 participants