(UDF) Simplified multi-input multi-output (ala HuggingFace Datasets, Ray, ..) #3567

Lundez · 2024-12-13T19:29:49Z

Is your feature request related to a problem?

Hi,

I'd like to see a simplified way to work with multiple columns in, multiple columns out.
One of the more pythonic approaches I've seen is to use dict[str, np.ndarray] -> dict[str, np.ndarray] (alternatively dict[str, Any]).

This approach is taken by Ray (map_batches) and HuggingFace Datasets (map)
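
To make the dict-in, dict-out contract concrete, here is a minimal NumPy-only sketch. The function name `add_box_area` and the column names `image` and `bbox` are illustrative assumptions, not part of Daft, Ray, or HF Datasets; it only shows the general shape of a function that takes several columns and returns several columns.

```python
import numpy as np


def add_box_area(batch: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
    """Hypothetical batch transform: several columns in, several columns out."""
    boxes = batch["bbox"]  # shape (n, 4), pascal_voc: x_min, y_min, x_max, y_max
    widths = boxes[:, 2] - boxes[:, 0]
    heights = boxes[:, 3] - boxes[:, 1]
    return {
        "image": batch["image"].astype(np.float32) / 255.0,  # scale pixels to [0, 1]
        "bbox": boxes,  # passed through unchanged
        "area": widths * heights,  # new output column
    }


batch = {
    "image": np.full((2, 4, 4, 3), 255, dtype=np.uint8),
    "bbox": np.array([[0, 0, 2, 2], [1, 1, 4, 3]]),
}
out = add_box_area(batch)
print(sorted(out))  # ['area', 'bbox', 'image']
print(out["area"].tolist())  # [4, 6]
```

The caller never has to pack or unpack a struct: the keys of the returned dict are the output columns.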

Why is this important for Deep Learning?
When working with tasks such as object detection, you need to transform the bounding box and the image the same way. Transforming could be done "in parallel", which is cumbersome but possible. It turns into a big problem when augmenting data: augmentation is commonly applied with a probability p, and what is applied is also random (e.g. RandomCrop, RandomRescale, MixUp, ...). This means the augmentation has to be applied in exactly the same way to both the BBox and the Image. The only way I see to do this right now is by building a struct, which is possible but not pythonic.
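
A minimal sketch of why the two columns have to travel together, assuming NumPy arrays and pascal_voc pixel boxes. `random_hflip` and the key names `image` and `bboxes` are hypothetical, chosen for illustration only. The point is that one random draw decides the fate of both the image and its boxes.

```python
import numpy as np


def random_hflip(sample: dict, p: float, rng: np.random.Generator) -> dict:
    """Flip image and boxes together, driven by a single random draw."""
    if rng.random() >= p:
        return sample  # not applied: image and boxes both stay untouched
    image = sample["image"]  # shape (H, W, C)
    boxes = sample["bboxes"]  # shape (n, 4): x_min, y_min, x_max, y_max
    width = image.shape[1]
    flipped = boxes.copy()
    flipped[:, 0] = width - boxes[:, 2]  # new x_min comes from old x_max
    flipped[:, 2] = width - boxes[:, 0]  # new x_max comes from old x_min
    return {**sample, "image": image[:, ::-1], "bboxes": flipped}


rng = np.random.default_rng(0)
sample = {
    "image": np.arange(2 * 4 * 1).reshape(2, 4, 1),
    "bboxes": np.array([[0, 0, 1, 2]]),
}
always = random_hflip(sample, p=1.0, rng=rng)  # flip is always applied
never = random_hflip(sample, p=0.0, rng=rng)  # flip is never applied
print(always["bboxes"].tolist())  # [[3, 0, 4, 2]]
```

If the image and the boxes were processed by two separate single-column functions, each would make its own random draw and the two could disagree.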

P.S. It's great that batch_size is already enabled as batched transforms are excellent for certain augmentations, e.g. MixUp.
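
For illustration, a rough MixUp sketch over a dict batch, assuming NumPy arrays, one-hot labels, and a single mixing weight per batch. `mixup` and the key names are hypothetical. It needs the whole batch at once because every sample is blended with a partner drawn from the same batch.

```python
import numpy as np


def mixup(
    batch: dict[str, np.ndarray], alpha: float, rng: np.random.Generator
) -> dict[str, np.ndarray]:
    """Blend every sample with a randomly chosen partner from the same batch."""
    n = len(batch["image"])
    lam = rng.beta(alpha, alpha)  # one mixing weight shared by the whole batch
    partner = rng.permutation(n)  # partner index for each sample
    return {
        "image": lam * batch["image"] + (1 - lam) * batch["image"][partner],
        "label": lam * batch["label"] + (1 - lam) * batch["label"][partner],
    }


rng = np.random.default_rng(0)
batch = {
    "image": rng.random((4, 8, 8, 3)),
    "label": np.eye(4),  # one-hot labels
}
mixed = mixup(batch, alpha=0.2, rng=rng)
print(mixed["image"].shape)  # (4, 8, 8, 3)
```

Image and label are mixed with the same weight and the same partner, which is again a multi-column-in, multi-column-out operation.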

Describe the solution you'd like

A multi-input, multi-output API for UDFs

Describe alternatives you've considered

I've thought of using struct but it's not as smooth as the more "pythonic" approach of using dict.

Wondering what your idea is.

Additional Context

import albumentations as A

transforms = A.Compose(
    [
        A.RandomResizedCrop(size=(224, 224), antialias=True),
        A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ],
    bbox_params=A.BboxParams(format="pascal_voc", label_fields=["category_id"]),
)
transforms(**sample)

This is how Albumentations is applied: sample is a dict of values, and transforms takes them as keyword arguments.
There is a guide on how to use Albumentations with HF Datasets.

Would you like to implement a fix?

Maybe, if you guide me I could try to get it done during the weekend.


andrewgazelka · 2024-12-13T19:34:36Z

Do you have any thoughts @kevinzwang ?

universalmind303 · 2024-12-13T19:48:42Z

I think this is a good idea. I'm thinking we could use struct under the hood, but provide some nice abstractions over it to make the udf experience as seamless as possible.

kevinzwang · 2024-12-13T19:58:34Z

Hi @Lundez, thanks for bringing this up. I have a few questions:

- You should already be able to construct UDFs with multiple inputs by simply adding more arguments to your UDF. Does that work for you?
- It's true that UDFs don't have a great mechanism for outputting multiple values at the moment. Is there an interface that you would like to propose for this? The workaround that we recommend at the moment is returning a struct dtype as a list of dictionaries in your UDF. Then, you can expand the struct fields with col("struct_col.*").

Here's a quick example of doing multi-input multi-output with the things I mentioned above:

>>> import daft
>>> @daft.udf(return_dtype=daft.DataType.struct({
...     "x": daft.DataType.int64(),
...     "y": daft.DataType.int64(),
... }))
... def my_udf(a, b):
...     # simple UDF that just returns the two inputs as a struct column
...     result = []
...     for a_elem, b_elem in zip(a.to_pylist(), b.to_pylist()):
...         result.append({"x": a_elem, "y": b_elem})
...     return result
... 
>>> df = daft.from_pydict({"a": [1, 2, 3], "b": [4, 5, 6]})
>>> # call UDF
>>> df = df.select(my_udf(df["a"], df["b"]).alias("udf_result"))
>>> # unnest struct fields
>>> df = df.select("udf_result.*")
>>> df.show()
╭───────┬───────╮
│ x     ┆ y     │
│ ---   ┆ ---   │
│ Int64 ┆ Int64 │
╞═══════╪═══════╡
│ 1     ┆ 4     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 5     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3     ┆ 6     │
╰───────┴───────╯

(Showing first 3 of 3 rows)

kevinzwang · 2024-12-13T19:59:54Z

What we could maybe do is also support returning a dictionary of lists instead of a list of dictionaries for struct type columns.
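
Until something like that is supported, a small helper can bridge the two shapes. This is a minimal sketch in plain Python. `columns_to_rows` is a hypothetical name, not a Daft function. It turns a dictionary of lists into the list of dictionaries that the struct workaround above returns.

```python
def columns_to_rows(columns: dict[str, list]) -> list[dict]:
    """Turn {"x": [1, 2], "y": [3, 4]} into [{"x": 1, "y": 3}, {"x": 2, "y": 4}]."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError(f"columns have different lengths: {sorted(lengths)}")
    names = list(columns)
    # zip(*values) walks the columns row by row
    return [dict(zip(names, row)) for row in zip(*columns.values())]


rows = columns_to_rows({"x": [1, 2, 3], "y": [4, 5, 6]})
print(rows)  # [{'x': 1, 'y': 4}, {'x': 2, 'y': 5}, {'x': 3, 'y': 6}]
```

A UDF body could then build its outputs column-wise and call the helper on the last line before returning.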

Lundez · 2024-12-13T20:24:48Z

Hi,

I know it's technically possible to do right now (as I noted with my comment regarding struct). If it's how you prefer the DX to be I'm fine.

I'm merely suggesting adding another way that feels easier to work with, which could potentially help adoption.

The col("struct.*") syntax was quite cool, though the ".unnest()" approach seems clearer (IMO).

Feel free to close the issue if you're happy with the state of today 👍

kevinzwang · 2024-12-13T20:33:16Z

Ah I see, thanks for the feedback. I do want to get around to improving the ergonomics of UDFs, and I think we'll have some time after the new year to flesh it out. Will keep this issue open for others in the community to voice their thoughts too.

kevinzwang · 2024-12-13T20:42:22Z

Here's my proposal:

- Add something like an unnest_output parameter in @daft.udf that tells Daft to automatically convert a struct output into columns.
- More ways to return struct type arrays (in particular, dict of list).
- A way to configure a UDF to take in an arbitrary number of input columns, plus something like selectors in Polars to allow users to easily pass in specific sets of columns.
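
A rough plain-Python sketch of the selector idea, to make the last point concrete. Everything here (`starts_with`, `call_with_selected`, `sum_features`) is a hypothetical name loosely modeled on the Polars selector concept. None of it is an existing Daft or Polars API.

```python
from typing import Callable


def starts_with(prefix: str) -> Callable[[str], bool]:
    """Hypothetical selector: matches column names beginning with `prefix`."""
    return lambda name: name.startswith(prefix)


def call_with_selected(fn, columns: dict[str, list], selector) -> dict[str, list]:
    """Pass every column matched by `selector` to `fn` as keyword arguments."""
    selected = {name: values for name, values in columns.items() if selector(name)}
    return fn(**selected)


def sum_features(**features: list) -> dict[str, list]:
    # Accepts an arbitrary number of input columns and returns one output column.
    return {"total": [sum(row) for row in zip(*features.values())]}


table = {"feat_a": [1, 2], "feat_b": [10, 20], "label": [0, 1]}
result = call_with_selected(sum_features, table, starts_with("feat_"))
print(result)  # {'total': [11, 22]}
```

The function never names its inputs explicitly, so adding a `feat_c` column later would not require changing the function or the call.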

@jaychia do you have any thoughts?

Lundez · 2024-12-13T20:47:50Z

> Here's my proposal:
>
> - Add something like an unnest_output parameter in @daft.udf that tells Daft to automatically convert a struct output into columns.
> - More ways to return struct type arrays (in particular, dict of list).
> - A way to configure a UDF to take in an arbitrary number of input columns, plus something like selectors in Polars to allow users to easily pass in specific sets of columns.
>
> @jaychia do you have any thoughts?

I like this.
And regarding selectors from polars, those are exceptional. Great idea to add!

Lundez added enhancement New feature or request needs triage labels Dec 13, 2024

andrewgazelka added the p2 Nice to have features label Dec 13, 2024

kevinzwang removed the needs triage label Dec 13, 2024

kevinzwang self-assigned this Dec 13, 2024

ccmao1130 added the help wanted Extra attention is needed label Dec 16, 2024
