Inefficient conversions while iterating a dataframe #3634

Open
sagiahrac opened this issue Dec 25, 2024 · 0 comments
Labels
bug Something isn't working needs triage

Contributor

sagiahrac commented Dec 25, 2024

Describe the bug

Iterating over the following dataframe with Daft's iter_rows() is over 1000x slower (32.2 s vs. 20.8 ms in the timings below) than converting the Daft dataframe to pandas and iterating over the pandas dataframe instead.

To Reproduce

import numpy as np
import daft

np.random.seed(0)
n_rows = 1_000
list_size = 100_000
data = {"list": np.random.randint(0, 256, (n_rows, list_size), dtype=np.uint8)}
df = daft.from_pydict(data)

print("Iter with pandas:")
%timeit for row in df.to_pandas().itertuples(index=False): pass

print("Iter with daft:")
%timeit for x in df.iter_rows(): pass
Iter with pandas:
20.8 ms ± 551 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Iter with daft:
32.2 s ± 115 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Expected behavior

I would expect iteration alone to be faster than conversion followed by iteration.
Viewing the underlying pyarrow array as a numpy ndarray, instead of converting each row, could resolve this issue and potentially similar ones for structs:
https://arrow.apache.org/docs/python/numpy.html#arrow-to-numpy
I wonder if the same approach could also work for tensors (using pyarrow.Tensor or reshaping a flattened list), as these casts are quite slow too.

Component(s)

Expressions, Python Runner

Additional context

Cast operations appear to be heavy in general, and at times only one CPU core seems to be utilized while they run.

daft v0.4.0
python 3.10.13

@sagiahrac sagiahrac added bug Something isn't working needs triage labels Dec 25, 2024