(UDF) Simplified multi-input multi-output (ala HuggingFace Datasets, Ray, ..) #3567
Comments
Do you have any thoughts, @kevinzwang?
I think this is a good idea. I'm thinking we could use
Hi @Lundez, thanks for bringing this up. I have a few questions:
Here's a quick example of doing multi-input multi-output with the things I mentioned above:
>>> import daft
>>> @daft.udf(return_dtype=daft.DataType.struct({
...     "x": daft.DataType.int64(),
...     "y": daft.DataType.int64(),
... }))
... def my_udf(a, b):
...     # simple UDF that just returns the two inputs as a struct column
...     result = []
...     for a_elem, b_elem in zip(a.to_pylist(), b.to_pylist()):
...         result.append({"x": a_elem, "y": b_elem})
...     return result
...
>>> df = daft.from_pydict({"a": [1, 2, 3], "b": [4, 5, 6]})
>>> # call UDF
>>> df = df.select(my_udf(df["a"], df["b"]).alias("udf_result"))
>>> # unnest struct fields
>>> df = df.select("udf_result.*")
>>> df.show()
╭───────┬───────╮
│ x     ┆ y     │
│ ---   ┆ ---   │
│ Int64 ┆ Int64 │
╞═══════╪═══════╡
│ 1     ┆ 4     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 5     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3     ┆ 6     │
╰───────┴───────╯
(Showing first 3 of 3 rows)
What we could maybe do is also support returning a dictionary of lists instead of a list of dictionaries for struct-type columns.
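To make the two shapes concrete, here is a minimal pure-Python sketch that converts a dictionary of lists into the list of dictionaries that a struct-returning UDF produces today. The helper name columns_to_rows is hypothetical and is not part of Daft; it only illustrates the conversion that such support would imply.

```python
from typing import Any, Dict, List


def columns_to_rows(columns: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Hypothetical helper: turn {"x": [...], "y": [...]} into [{"x": ..., "y": ...}, ...]."""
    # Every column must describe the same number of rows.
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same length")
    keys = list(columns)
    # zip(*values) walks the columns in lockstep, yielding one tuple per row.
    return [dict(zip(keys, row)) for row in zip(*columns.values())]


rows = columns_to_rows({"x": [1, 2, 3], "y": [4, 5, 6]})
print(rows)
```

With the dictionary-of-lists form, the UDF body could build each output column independently rather than assembling one dict per row.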
Hi, I know it's technically possible to do right now (as I noted in my comment regarding struct). If that's how you prefer the DX to be, I'm fine with it. I'm merely suggesting adding another way that feels easier to work with, which could potentially help adoption. Feel free to close the issue if you're happy with the current state 👍
Ah I see, thanks for the feedback. I do want to get around to improving the ergonomics of UDFs; I think we'll have some time after the new year to flesh it out. Will keep this issue open for others in the community to voice their thoughts too.
Here's my proposal:
@jaychia do you have any thoughts?
I like this.
Is your feature request related to a problem?
Hi,
I'd like to see a simplified way to work with multiple columns in, multiple columns out.
One of the more pythonic approaches I've seen is to use dict[str, np.ndarray] -> dict[str, np.ndarray] (alternatively dict[str, Any]).
This approach is taken by Ray (map_batches) and HuggingFace Datasets (map).
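As a rough illustration of that dict-in, dict-out shape, the sketch below uses the dict[str, Any] variant with plain lists so that it runs without NumPy. The function and column names (add_area, width, height, area, is_square) are made up for the example and do not come from Ray, HuggingFace Datasets, or Daft.

```python
from typing import Any, Dict

# A batch maps column names to a list of values, one entry per row.
Batch = Dict[str, Any]


def add_area(batch: Batch) -> Batch:
    """Hypothetical batch function: reads two input columns, returns two output columns."""
    widths, heights = batch["width"], batch["height"]
    return {
        "area": [w * h for w, h in zip(widths, heights)],
        "is_square": [w == h for w, h in zip(widths, heights)],
    }


batch = {"width": [2, 3, 4], "height": [2, 5, 4]}
out = add_area(batch)
print(out)
```

The function sees all of its inputs by name and returns all of its outputs by name, so no struct has to be built or unnested by the caller.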
Why is this important for Deep Learning?
When working with tasks such as Object Detection, you need to transform the Bounding Box and the Image in the same way. Transforming them "in parallel" is cumbersome but possible. It turns into a big problem when it comes to augmenting data: augmentation is commonly applied with a probability p, and what is applied is also random (e.g. RandomCrop, RandomRescale, MixUp, ...). This means the augmentation has to be applied in exactly the same way to both the BBox and the Image. The only way I see to do this now is by building a struct, which is possible but not pythonic.
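The coupling can be sketched with a toy horizontal flip that draws its random decision once and applies it to both fields of a dict sample. Everything here is an assumption made for illustration: the function name random_hflip, the image as a nested list of pixel rows, and the bbox as (x_min, y_min, x_max, y_max) in pixel coordinates.

```python
import random
from typing import Any, Dict


def random_hflip(sample: Dict[str, Any], p: float, rng: random.Random) -> Dict[str, Any]:
    """Flip image and bbox together with probability p (hypothetical toy augmentation)."""
    image = sample["image"]  # list of rows, each row a list of pixel values
    x_min, y_min, x_max, y_max = sample["bbox"]
    # One random draw decides the outcome for BOTH the image and the bbox.
    if rng.random() >= p:
        return {"image": image, "bbox": (x_min, y_min, x_max, y_max)}
    width = len(image[0])
    flipped = [row[::-1] for row in image]
    # Mirroring swaps the roles of the left and right edges of the box.
    return {"image": flipped, "bbox": (width - x_max, y_min, width - x_min, y_max)}


sample = {"image": [[1, 2, 3, 4], [5, 6, 7, 8]], "bbox": (0, 0, 1, 2)}
always = random_hflip(sample, p=1.0, rng=random.Random(0))
never = random_hflip(sample, p=0.0, rng=random.Random(0))
print(always, never)
```

If the image and the bbox were processed by two separate single-column UDFs, each would make its own random draw and the two could disagree, which is the problem described above.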
P.S. It's great that batch_size is already enabled, as batched transforms are excellent for certain augmentations, e.g. MixUp.
Describe the solution you'd like
A multi-input, multi-output API for UDFs.
Describe alternatives you've considered
I've thought of using struct, but it's not as smooth as the more "pythonic" approach of using dict.
Wondering what your idea is.
Additional Context
is how albumentations is applied, where sample is a dict of values. transforms takes kwargs.
A guide on how to use Albumentations with HF Datasets.
Would you like to implement a fix?
Maybe, if you guide me I could try to get it done during the weekend.