Describe the bug
Iterating over the following dataframe with daft is over 1000x slower than converting the daft dataframe to pandas and iterating over the pandas dataframe instead.
To Reproduce
```python
import numpy as np
import daft

np.random.seed(0)
n_rows = 1_000
list_size = 100_000
data = {"list": np.random.randint(0, 256, (n_rows, list_size), dtype=np.uint8)}
df = daft.from_pydict(data)

print("Iter with pandas:")
%timeit for row in df.to_pandas().itertuples(index=False): pass
print("Iter with daft:")
%timeit for x in df.iter_rows(): pass
```
```
Iter with pandas:
20.8 ms ± 551 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Iter with daft:
32.2 s ± 115 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
Expected behavior
I would expect iteration alone to be faster than conversion + iteration.
Using a pyarrow array view as a numpy ndarray could resolve that issue and potentially similar ones for structs: https://arrow.apache.org/docs/python/numpy.html#arrow-to-numpy
I wonder if it could also work for tensors (using pyarrow.Tensor or reshaping flattened lists), as these casts are quite slow too.
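For illustration, here is a rough sketch of the kind of zero-copy view I mean, assuming the column is stored as a fixed-size list of a primitive type with no nulls (this is just pyarrow/numpy, not daft internals):

```python
import numpy as np
import pyarrow as pa

# Build a small fixed-size-list array of uint8 values.
values = pa.array(np.random.randint(0, 256, 12, dtype=np.uint8))
arr = pa.FixedSizeListArray.from_arrays(values, 4)  # 3 rows of 4 values each

# View the flat values buffer as numpy without copying, then reshape into rows.
flat = arr.values.to_numpy(zero_copy_only=True)
matrix = flat.reshape(len(arr), 4)  # reshape returns a view, still no copy
print(matrix.shape)  # (3, 4)
```

Individual rows could then be exposed as slices of that array instead of being converted element by element.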
Component(s)
Expressions, Python Runner
Additional context
It appears that cast operations are where most of the time goes, and at times only one CPU core seems to be utilized during them.
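One way to check where iter_rows() spends its time (and whether it stays on a single core) is to profile it directly; a minimal, self-contained sketch using a smaller version of the dataframe from the reproduction above:

```python
import cProfile
import pstats

import daft
import numpy as np

# Smaller version of the dataframe from the reproduction, so the profile finishes quickly.
data = {"list": np.random.randint(0, 256, (100, 10_000), dtype=np.uint8)}
df = daft.from_pydict(data)

with cProfile.Profile() as profiler:
    for _ in df.iter_rows():
        pass

# Show which calls dominate; watching a system monitor at the same time
# shows whether more than one core is busy.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(15)
```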
daft v0.4.0
python 3.10.13