Support different vector serialization formats for streaming shuffle (#11445)

Summary:
Pull Request resolved: #11445

Add support for row-wise shuffle to optimize both CPU and memory efficiency in workloads with a very large number of shuffle columns. This PR includes:

(1) Make the vector serde format configurable through the plan node: on the producer side, partitioned output; on the consumer side, exchange and merge exchange (see the first sketch below).

(2) All three shuffle operators obtain the vector serde used for serialization/deserialization from the format specified in the corresponding query plan node. A separate change in the Presto coordinator will set the format based on the number of column streams in the shuffle data type, plus a per-query session property to enable/disable the feature.

(3) Add two sets of APIs in VectorSerde and the corresponding serializer to estimate and serialize vectors in two row formats: compact row and unsafe row. Constructing these row formats is CPU intensive, so we construct them once per partitioned output batch. Their serialized-size estimates are exact, unlike the columnar estimate, which is approximate and whose memory allocation happens progressively during serialization through a memory arena. Row-wise serialization allocates memory once before serialization, so we use the estimated serialized size directly to allocate the serialization buffer: the row-wise vector is constructed once and its serialized size is estimated once (see the second sketch below).

(4) Change exchange to optimize for the row deserializer, which does not support append deserialization (for now), by merging the iobufs from multiple serialized pages and deserializing them all together; given its serialization format, the row deserializer is expected to consume all the serialized vectors (see the third sketch below).

(5) Fix a bug in the compact row and unsafe row deserialization code paths: when the vector is extended, the pointers (string views) into the existing buffers become invalid. The current handling is very inefficient; optimize it by holding a unique_ptr to the string buffer and avoiding unnecessary data copies.

(6) Add an operator runtime stat that reports the serialization format used for shuffle, to help performance debugging.

(7) Fix a couple of issues in operator trace replay, such as making partitioned output close call the base operator close so that the summary trace file is written.

(8) Some code and test refactoring and cleanup in the relevant code paths.

With partitioned output replay using a trace collected from production, we have seen peak memory usage reduced by 10x and replay execution time cut in half.

A couple of follow-up optimizations:
(1) Support compression for unsafe row and compact row on flush, if needed.
(2) Support column-wise row size estimation, which a draft implementation suggests could improve estimation time by about 2x.
(3) Support row-wise deserialize-append to avoid potentially small vectors on the consumer side.

Reviewed By: arhimondr, oerling

Differential Revision: D65258176
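First sketch, for items (1) and (2): a minimal, hypothetical illustration of carrying a serde kind on the shuffle plan nodes and resolving it in the operators. The `SerdeKind`, `ShufflePlanNodeStub`, and `resolveSerde` names are illustrative stand-ins, not the actual Velox API.

```cpp
#include <memory>
#include <stdexcept>
#include <string>

// Illustrative only: the real change threads a serde kind through Velox's
// partitioned output / exchange / merge exchange plan nodes.
enum class SerdeKind { kPresto, kCompactRow, kUnsafeRow };

struct VectorSerdeStub {
  virtual ~VectorSerdeStub() = default;
  virtual std::string name() const = 0;
};

struct PrestoSerdeStub : VectorSerdeStub {
  std::string name() const override { return "Presto"; }
};
struct CompactRowSerdeStub : VectorSerdeStub {
  std::string name() const override { return "CompactRow"; }
};
struct UnsafeRowSerdeStub : VectorSerdeStub {
  std::string name() const override { return "UnsafeRow"; }
};

// The plan node carries the serde kind chosen by the coordinator, e.g. based
// on the number of column streams plus a per-query session property.
struct ShufflePlanNodeStub {
  SerdeKind serdeKind{SerdeKind::kPresto};
};

// Each shuffle operator resolves its serde from the plan node instead of
// using a process-wide default.
std::unique_ptr<VectorSerdeStub> resolveSerde(const ShufflePlanNodeStub& node) {
  switch (node.serdeKind) {
    case SerdeKind::kPresto:
      return std::make_unique<PrestoSerdeStub>();
    case SerdeKind::kCompactRow:
      return std::make_unique<CompactRowSerdeStub>();
    case SerdeKind::kUnsafeRow:
      return std::make_unique<UnsafeRowSerdeStub>();
  }
  throw std::logic_error("unknown serde kind");
}
```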
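Second sketch, for item (3): the estimate-once, allocate-once, serialize-once pattern. This uses a toy length-prefixed row writer rather than the actual CompactRow/UnsafeRow serializers; the `estimateRowSizes` and `serializeOnce` names are illustrative.

```cpp
#include <cstdint>
#include <cstring>
#include <numeric>
#include <string>
#include <vector>

// Toy payload standing in for one row-wise serialized row.
struct Row {
  std::string payload;
};

// Exact per-row serialized sizes: row formats can compute these precisely,
// unlike the approximate columnar estimate.
std::vector<size_t> estimateRowSizes(const std::vector<Row>& rows) {
  std::vector<size_t> sizes;
  sizes.reserve(rows.size());
  for (const auto& row : rows) {
    sizes.push_back(sizeof(uint32_t) + row.payload.size()); // length prefix + bytes
  }
  return sizes;
}

// Allocate the output buffer once from the exact estimate, then serialize
// every row into it; no progressive arena allocation is needed.
std::vector<char> serializeOnce(const std::vector<Row>& rows) {
  const auto sizes = estimateRowSizes(rows);
  const size_t total = std::accumulate(sizes.begin(), sizes.end(), size_t{0});
  std::vector<char> buffer(total);
  char* out = buffer.data();
  for (const auto& row : rows) {
    const uint32_t len = static_cast<uint32_t>(row.payload.size());
    std::memcpy(out, &len, sizeof(len));
    std::memcpy(out + sizeof(len), row.payload.data(), len);
    out += sizeof(len) + len;
  }
  return buffer;
}
```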
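Third sketch, for item (4): merging several serialized pages into one contiguous buffer so a deserializer without append support can consume all rows in a single pass. The real code merges iobufs; plain byte vectors are used here for a self-contained example.

```cpp
#include <cstddef>
#include <vector>

// Each element stands in for one serialized shuffle page (an iobuf in the
// real exchange code).
using Page = std::vector<char>;

// Merge all pending pages into a single buffer before handing the data to a
// row deserializer that cannot append to an existing vector.
Page mergePages(const std::vector<Page>& pages) {
  size_t total = 0;
  for (const auto& page : pages) {
    total += page.size();
  }
  Page merged;
  merged.reserve(total);
  for (const auto& page : pages) {
    merged.insert(merged.end(), page.begin(), page.end());
  }
  return merged;
}
```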