-
-
Notifications
You must be signed in to change notification settings - Fork 179
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
schema evolution when writing the row groups does not work #907
Comments
Fastparquet likes to maintain a global _metadata for its datasets for speed of loading and filtering data. This (and _common_metadata) are not compatible with schema evolution: it contains exactly one schema which all files must meet. There is no current way to avoid rewriting _metadata and requiring consistency during append. (writing via dask may work however) If, instead, you write new files into your directories without |
example
|
Thanks a lot for the detailed response!
Since in your API, update metadata is kind of allowed. Then this should not be a problem. At each writing, in theory, we could update the schema in common meta to the merged/evolved schema. The fact is we already update the _metadata, but why not also the _common_metadata? Can we have something like arrow's unify_schema()? The real problem here is even I manually update the schema in _*metadata to make it compatible with both files (with the c column), the fastparquet cannot load the data and complain column c is missing from one file. Maybe I did something wrong here.
The point is to use the append to let it automatically update _metadata and automatically generate proper file name. And we also like to be able to read data written by other programs for example spark, etc. There are TB level data you would never want to import/export all the times. That is probably the biggest reason we choose to use parquet at the first place. So now I see dask seems not intend to maintain the compatibility for data exchange but rather to treat parquet as a kind of internal data format should only be read/write by dask/fastparquet. It is pretty fine though, different projects will have different priorities. If this is the case we can close this issue. |
It could be done, and is probably worthwhile, but I don't know who has the effort to spare. I'll just clarify a couple of points:
So, to implement this (given fastparquet already has some of the basic functionality), the main task is to decide what should happen for the various cases. |
Describe the issue:
when we write row groups, schema evolution should be easy and should be supported.
This is very important for long existing live dataset, we usually want to add/remove some additional columns
Rewrite the entire dataset to adopt the schema is too expensive.
The big advantage of parquet is to avoid re-write the entire dataset for small schema changes like this
arrow actually support it at least in rust and R I think.
Minimal Complete Verifiable Example:
ValueError: Column names of new data are ['a', 'c']. But column names in existing file are ['a']. {'c'} are columns being either only in existing file or only in new data. This is not possible.
Environment:
The text was updated successfully, but these errors were encountered: