Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(iceberg): Add support for writing iceberg tables #10996

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

imjalpreet
Copy link
Contributor

Introduce IcebergPageSink

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Sep 13, 2024
Copy link

netlify bot commented Sep 13, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit a1ee3c5
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/6776b341b064e5000838222c


virtual ~IcebergInsertTableHandle() = default;

std::shared_ptr<const VeloxIcebergSchema> schema() const {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is this used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the delay. I was occupied with other work, which limited my progress on this PR over the past couple of months.

The current changes do not utilize the schema and partitionSpec, but they will be necessary when adding support for partition transforms. At present, the changes only handle identity partitions for Iceberg tables, and the required partition and schema information is available in the table handle. However, since the table handle does not include details about partition transforms, we will need the Iceberg schema and partitionSpec to support them.

I will be looking into adding support for partition transforms next.

@Yuhta Yuhta requested a review from xiaoxmeng September 16, 2024 18:07
#include "velox/connectors/hive/iceberg/IcebergDataSink.h"

#include <utility>
#include "velox/common/base/Counters.h"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are lots of headers unused. Please remove.

velox/connectors/hive/iceberg/IcebergDataSink.cpp Outdated Show resolved Hide resolved
velox/connectors/hive/iceberg/IcebergDataSink.cpp Outdated Show resolved Hide resolved
velox/connectors/hive/HiveDataSink.cpp Show resolved Hide resolved
velox/connectors/hive/HiveDataSink.cpp Outdated Show resolved Hide resolved
return;
}

dataChannels_ = getDataChannels(partitionChannels_, inputType_->size());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to use insertTableHandle number of channels. partitionChannels_ come from it. I'm not sure inputType_ size is always the same as insertTableHandle's.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used inputType_ since we are already using it in HiveDataSink, I can verify how inputType_ is derived and whether it will always be the same as insertTableHandle columns.

const std::vector<column_index_t>& partitionChannels,
const column_index_t childrenSize) const {
// Create a vector of all possible channels
std::vector<column_index_t> dataChannels(childrenSize);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to add dataChannels_ field to IcebergDataSink than allocating a vector every time appendData() is called.

@@ -808,7 +786,7 @@ HiveWriterId HiveDataSink::getWriterId(size_t row) const {
return HiveWriterId{partitionId, bucketId};
}

void HiveDataSink::splitInputRowsAndEnsureWriters() {
void HiveDataSink::splitInputRowsAndEnsureWriters(RowVectorPtr input) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment out input (RowVectorPtr /* input */) {

insertTableHandle_);

for (auto i = 0; i < partitionChannels_.size(); ++i) {
auto type =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

type not used

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I pushed it by mistake in this commit, it was part of the changes for writing tables with partition transforms. I will remove it from this commit.

icebergInsertTableHandle->inputColumns()[partitionChannels_[i]]
->dataType();
auto block = input->childAt(partitionChannels_[i]);
partitionValues.insert(partitionValues.begin() + i, block->toString(row));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

partitionValues[i] = block->toString(row);

partitionValues.insert(partitionValues.begin() + i, block->toString(row));
}

partitionData_.insert(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

partitionData_[index] = std::make_shared(partitionValues);


void IcebergDataSink::extendBuffersForPartitionedTables() {
// Extends the buffer used for partition rows calculations.
partitionSizes_.emplace_back(0);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replace the 3 lines to call HiveDataSink::extendBuffersForPartitionedTables()

@imjalpreet imjalpreet force-pushed the icebergWriter branch 2 times, most recently from f3565a6 to e3aa199 Compare November 14, 2024 23:18
};

class PartitionData {
private:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move the private fields to the end.


#include "velox/common/base/Fs.h"
#include "velox/connectors/hive/TableHandle.h"
#include "velox/exec/OperatorUtils.h"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not used

@zhouyuan
Copy link
Contributor

FWIW, there's a new project to add C++ support for iceberg:
apache/iceberg-cpp#2

@imjalpreet imjalpreet marked this pull request as ready for review January 2, 2025 15:37
@imjalpreet imjalpreet marked this pull request as draft January 2, 2025 15:42
@imjalpreet imjalpreet changed the title Add support for writing iceberg tables feat(iceberg): Add support for writing iceberg tables Jan 2, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants