[FEATURE] Added a Pandas based Transformation and BaseTransformation #141
Draft: dannymeijer wants to merge 4 commits into main from feature/79-dbr-ml-support

Commits
- 356ea1e Added a Pandas based Transformtion class after abstracting Transforma… (dannymeijer)
- cddc781 Merge branch 'main' into feature/79-dbr-ml-support (dannymeijer)
- f922e82 Merge branch 'main' into feature/79-dbr-ml-support (dannymeijer)
- f8f9d9b Merge branch 'main' into feature/79-dbr-ml-support (dannymeijer)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,166 @@ | ||
""" | ||
Module for the BaseReader and BaseTransformation classes | ||
""" | ||
|
||
from typing import Optional, TypeVar | ||
from abc import ABC, abstractmethod | ||
|
||
from koheesio import Step | ||
from koheesio.models import Field | ||
|
||
DataFrameType = TypeVar("DataFrameType") | ||
"""Defines a type variable that can be any type of DataFrame""" | ||
|
||
|
||
class BaseReader(Step, ABC): | ||
"""Base class for all Readers | ||
|
||
Concept | ||
------- | ||
A Reader is a Step that reads data from a source based on the input parameters | ||
and stores the result in self.output.df (DataFrame). | ||
|
||
When implementing a Reader, the execute() method should be implemented. | ||
The execute() method should read from the source and store the result in self.output.df. | ||
|
||
The Reader class implements a standard read() method that calls the execute() method and returns the result. This | ||
method can be used to read data from a Reader without having to call the execute() method directly. The read() | ||
method does not need to be implemented in the child class. | ||
|
||
The Reader class also implements a shorthand for accessing the output DataFrame through the to_df property | ||
(aliases: toDF, df). If output.df is None, .execute() will be run first. | ||
""" | ||
|
||
@property | ||
def to_df(self) -> Optional[DataFrameType]: | ||
"""Shorthand for accessing self.output.df | ||
If the output.df is None, .execute() will be run first | ||
|
||
aliases: | ||
- toDF, mimics the Delta API | ||
- df | ||
""" | ||
if self.output.df is None: | ||
self.execute() | ||
return self.output.df | ||
|
||
toDF = to_df | ||
df = to_df | ||
|
||
@abstractmethod | ||
def execute(self) -> Step.Output: | ||
"""Execute on a Reader should handle self.output.df (output) as a minimum | ||
Read from whichever source -> store result in self.output.df | ||
""" | ||
pass | ||
|
||
def read(self) -> DataFrameType: | ||
"""Read from a Reader without having to call the execute() method directly""" | ||
self.execute() | ||
return self.output.df | ||
|
||
|
||
class BaseTransformation(Step, ABC): | ||
"""Base class for all Transformations | ||
|
||
Concept | ||
------- | ||
A Transformation is a Step that takes a DataFrame as input and returns a DataFrame as output. The DataFrame is | ||
transformed based on the logic implemented in the `execute` method. Any additional parameters that are needed for | ||
the transformation can be passed to the constructor. | ||
|
||
When implementing a Transformation, the `execute` method should be implemented. The `execute` method should take the | ||
input DataFrame, transform it, and store the result in `self.output.df`. | ||
|
||
The Transformation class implements a standard `transform` method that calls the `execute` method and returns the | ||
result. This method can be used to transform a DataFrame without having to call the `execute` method directly. The | ||
`transform` method does not need to be implemented in the child class. | ||
|
||
The Transformation class also implements a shorthand for accessing the output DataFrame through the | ||
`to_df` property (alias: `toDF`). If `output.df` is `None`, `.execute()` will be run first. | ||
""" | ||
|
||
df: Optional[DataFrameType] = Field(default=None, description="The input DataFrame") | ||
|
||
@abstractmethod | ||
def execute(self) -> Step.Output: | ||
"""Execute on a Transformation should handle self.df (input) and set self.output.df (output) | ||
|
||
This method should be implemented in the child class. The input DataFrame is available as `self.df` and the | ||
output DataFrame should be stored in `self.output.df`. | ||
|
||
For example: | ||
```python | ||
# pyspark example (assumes `from pyspark.sql import functions as f`) | ||
def execute(self): | ||
self.output.df = self.df.withColumn( | ||
"new_column", f.col("old_column") + 1 | ||
) | ||
``` | ||
|
||
The transform method will call this method and return the output DataFrame. | ||
""" | ||
# self.df # input DataFrame | ||
# self.output.df # output DataFrame | ||
self.output.df = ... # implement the transformation logic | ||
raise NotImplementedError | ||
|
||
@property | ||
def to_df(self) -> Optional[DataFrameType]: | ||
"""Shorthand for accessing self.output.df | ||
If the output.df is None, .execute() will be run first | ||
""" | ||
if self.output.df is None: | ||
self.execute() | ||
return self.output.df | ||
|
||
toDF = to_df | ||
"""Alias for the to_df property - mimics the Delta API""" | ||
|
||
def transform(self, df: Optional[DataFrameType] = None) -> DataFrameType: | ||
"""Execute the transformation and return the output DataFrame | ||
|
||
Note: when creating a child from this, don't implement this transform method. Instead, implement `execute`. | ||
|
||
See Also | ||
-------- | ||
`BaseTransformation.execute` | ||
|
||
Parameters | ||
---------- | ||
df: Optional[DataFrameType] | ||
The DataFrame to apply the transformation to. If not provided, the DataFrame passed to the constructor | ||
will be used. | ||
|
||
Returns | ||
------- | ||
DataFrameType | ||
The transformed DataFrame | ||
""" | ||
if df is not None: | ||
self.df = df | ||
if self.df is None: | ||
raise RuntimeError("No valid DataFrame was passed") | ||
self.execute() | ||
return self.output.df | ||
|
||
def __call__(self, *args, **kwargs): | ||
"""Allow the class to be called as a function. | ||
This is especially useful when using a PySpark DataFrame's transform method. | ||
|
||
Example | ||
------- | ||
Using pyspark's DataFrame transform method: | ||
```python | ||
input_df = spark.range(3) | ||
|
||
output_df = input_df.transform(AddOne(target_column="foo")).transform( | ||
AddOne(target_column="bar") | ||
) | ||
``` | ||
|
||
In the above example, the `AddOne` transformation is applied twice to the `input_df` DataFrame using the | ||
`transform` method. The `output_df` will contain the original DataFrame with two additional columns, `foo` and | ||
`bar`, each holding the value of `id` + 1. | ||
""" | ||
return self.transform(*args, **kwargs) |
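
The `execute` / `transform` / `__call__` contract in this file can be tried without koheesio installed. The following is a minimal, stdlib-only sketch of that contract, not the PR's implementation: `MiniTransformation`, `AddOne`, and the list-of-dicts "DataFrame" are hypothetical stand-ins invented for illustration.

```python
from abc import ABC, abstractmethod
from types import SimpleNamespace
from typing import Dict, List, Optional

# Stand-in for a DataFrame: a list of row dicts (an assumption for this sketch)
Rows = List[Dict[str, int]]


class MiniTransformation(ABC):
    """Hypothetical, simplified analogue of BaseTransformation."""

    def __init__(self, df: Optional[Rows] = None):
        self.df = df  # input "DataFrame"
        self.output = SimpleNamespace(df=None)  # output holder

    @abstractmethod
    def execute(self) -> None:
        """Read self.df and store the result in self.output.df."""

    def transform(self, df: Optional[Rows] = None) -> Rows:
        # An explicitly passed df overrides the one given to the constructor
        if df is not None:
            self.df = df
        if self.df is None:
            raise RuntimeError("No valid DataFrame was passed")
        self.execute()
        return self.output.df

    def __call__(self, *args, **kwargs) -> Rows:
        # Lets an instance be used wherever a plain function is expected
        return self.transform(*args, **kwargs)


class AddOne(MiniTransformation):
    """Hypothetical child class: only execute() is implemented."""

    def __init__(self, target_column: str, df: Optional[Rows] = None):
        super().__init__(df)
        self.target_column = target_column

    def execute(self) -> None:
        # Build new rows so the input is left untouched
        self.output.df = [
            {**row, self.target_column: row["id"] + 1} for row in self.df
        ]


rows = [{"id": 0}, {"id": 1}, {"id": 2}]
result = AddOne(target_column="foo")(rows)
print(result)
```

Child classes only implement `execute`; `transform` and `__call__` stay in the base class, which is what allows an instance to be handed to a function-accepting API such as a DataFrame's `transform` method.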
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,50 +1,9 @@ | ||
""" | ||
Module for the BaseReader class | ||
""" | ||
|
||
from typing import Optional, TypeVar | ||
from abc import ABC, abstractmethod | ||
|
||
from koheesio import Step | ||
|
||
# Define a type variable that can be any type of DataFrame | ||
DataFrameType = TypeVar("DataFrameType") | ||
|
||
|
||
class BaseReader(Step, ABC): | ||
"""Base class for all Readers | ||
|
||
A Reader is a Step that reads data from a source based on the input parameters | ||
and stores the result in self.output.df (DataFrame). | ||
|
||
When implementing a Reader, the execute() method should be implemented. | ||
The execute() method should read from the source and store the result in self.output.df. | ||
|
||
The Reader class implements a standard read() method that calls the execute() method and returns the result. This | ||
method can be used to read data from a Reader without having to call the execute() method directly. Read method | ||
does not need to be implemented in the child class. | ||
|
||
The Reader class also implements a shorthand for accessing the output Dataframe through the df-property. If the | ||
output.df is None, .execute() will be run first. | ||
""" | ||
|
||
@property | ||
def df(self) -> Optional[DataFrameType]: | ||
"""Shorthand for accessing self.output.df | ||
If the output.df is None, .execute() will be run first | ||
""" | ||
if not self.output.df: | ||
self.execute() | ||
return self.output.df | ||
This acts as a pass-through for the BaseReader class in the models.dataframe module. | ||
""" | ||
|
||
@abstractmethod | ||
def execute(self) -> Step.Output: | ||
"""Execute on a Reader should handle self.output.df (output) as a minimum | ||
Read from whichever source -> store result in self.output.df | ||
""" | ||
pass | ||
from koheesio.models.dataframe import BaseReader | ||
|
||
def read(self) -> DataFrameType: | ||
"""Read from a Reader without having to call the execute() method directly""" | ||
self.execute() | ||
return self.output.df | ||
__all__ = ["BaseReader"] |
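
The removed `df` property above guards on `if not self.output.df`, while its docstring describes the condition as "output.df is None". The two are not equivalent: a truthiness test calls `bool()` on the object, and a pandas DataFrame raises `ValueError` when evaluated as a boolean. The sketch below illustrates the difference with a hypothetical `FakeFrame` stand-in (not a pandas or koheesio class), so it runs with the standard library alone.

```python
class FakeFrame:
    """Hypothetical stand-in that, like a pandas DataFrame, refuses bool()."""

    def __bool__(self):
        raise ValueError("The truth value of a FakeFrame is ambiguous")


def needs_execute_truthy(df) -> bool:
    # Mirrors `if not self.output.df`: implicitly calls bool(df)
    return not df


def needs_execute_none(df) -> bool:
    # Mirrors `if self.output.df is None`: identity check only
    return df is None


frame = FakeFrame()

try:
    needs_execute_truthy(frame)
    truthy_check_failed = False
except ValueError:
    truthy_check_failed = True

print(truthy_check_failed)        # the truthiness guard raised
print(needs_execute_none(frame))  # an existing frame does not need execute()
print(needs_execute_none(None))   # a missing frame does
```

An identity check also avoids re-running `execute()` for outputs that are present but falsy, such as an empty list.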
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
""" | ||
Module for the BaseTransformation class | ||
|
||
This acts as a pass-through for the BaseTransformation class in the models.dataframe module. | ||
""" | ||
|
||
from koheesio.models.dataframe import BaseTransformation | ||
|
||
__all__ = ["BaseTransformation"] |
Newly added is `ml`; I sorted the features afterwards.