Cucat Featurization base #486

Open
wants to merge 98 commits into base: master
Changes from 3 commits
Commits (98 total)
cf07249
cucat feat support
tanmoyio May 15, 2023
d73a2db
cudf test env var added for test_feature_utils.py
tanmoyio May 15, 2023
382e18b
some import fixes
tanmoyio May 15, 2023
44200ac
passthru DT encode/umap, add back for timebar
dcolinmorgan Jun 13, 2023
777afd4
lint
dcolinmorgan Jul 21, 2023
c1bc6f1
updated cu-cat version for optional install
dcolinmorgan Jul 26, 2023
48e4017
type check without loading cudf, via getmodule
dcolinmorgan Jul 28, 2023
6b0b52b
ok we still need the check_cudf def
dcolinmorgan Jul 28, 2023
e4b0c0a
swap lazy import defs
dcolinmorgan Jul 29, 2023
7c0c0c6
working thru comments
dcolinmorgan Aug 4, 2023
f344dd8
address few issues
dcolinmorgan Aug 6, 2023
b6f6388
swap cudf=None type sig for lazy calls
dcolinmorgan Aug 8, 2023
f185a2f
swap cudf=None type sig for lazy calls
dcolinmorgan Aug 8, 2023
410c40d
swap cudf=None type sig for lazy calls
dcolinmorgan Aug 8, 2023
b9067c0
type check lint
dcolinmorgan Aug 8, 2023
8f0bc3a
lint isinstance all over
dcolinmorgan Aug 8, 2023
b7b8e63
lint isinstance all over
dcolinmorgan Aug 8, 2023
e8eb85a
rename lazy cucat to cuda
dcolinmorgan Aug 8, 2023
501ff3b
cudf df constructor change
dcolinmorgan Aug 9, 2023
918ebee
towards single engine=cuda flag
dcolinmorgan Aug 9, 2023
ccf6f47
towards single engine=cuda flag
dcolinmorgan Aug 9, 2023
60de1cf
single cuda flag
dcolinmorgan Aug 11, 2023
0b66776
lint
dcolinmorgan Aug 11, 2023
9f086c8
robust logging for cu_cat
dcolinmorgan Aug 11, 2023
78015f1
single cuda flag
dcolinmorgan Aug 11, 2023
616009b
assert after if
dcolinmorgan Aug 11, 2023
dc38d3b
super > table
dcolinmorgan Aug 11, 2023
376890e
Update feature_utils.py
dcolinmorgan Aug 11, 2023
b9828c5
rollback constant CUDA_CAT
dcolinmorgan Aug 11, 2023
8d13cbe
rollback constant CUDA_CAT
dcolinmorgan Aug 11, 2023
92769bf
else all
dcolinmorgan Aug 11, 2023
af0fc8a
else all
dcolinmorgan Aug 11, 2023
4f78b76
else all
dcolinmorgan Aug 11, 2023
b8a0db2
feat pytest tweaks
dcolinmorgan Aug 15, 2023
6e11117
feat pytest tweaks
dcolinmorgan Aug 15, 2023
b0d36cd
see if last commit induced numba install error
dcolinmorgan Aug 15, 2023
5677bea
feat pytest tweaks
dcolinmorgan Aug 15, 2023
8e15e5e
datetime passthrough for cudf
dcolinmorgan Aug 17, 2023
20200d6
add unadulterated dt back
dcolinmorgan Aug 20, 2023
26cd39c
more flexible multi-dt column add
dcolinmorgan Aug 21, 2023
c4c1bd8
start DT test
dcolinmorgan Aug 23, 2023
d889581
start DT test
dcolinmorgan Aug 24, 2023
48a7308
Merge branch 'master' into feat/gpu-featurization
dcolinmorgan Aug 24, 2023
ba25c89
Merge branch 'feat/gpu-featurization' of https://github.com/graphistr…
dcolinmorgan Aug 25, 2023
8a0ab5c
lint
dcolinmorgan Aug 25, 2023
151ab5b
lint
dcolinmorgan Aug 25, 2023
d63d729
cucat may be erroneously involked
dcolinmorgan Aug 28, 2023
ada126e
maybe fastencoder issue
dcolinmorgan Aug 28, 2023
21a475d
defaulting to cucat, concrete mixedup perhaps
dcolinmorgan Aug 29, 2023
49976e8
defaulting to cucat, concrete mixedup perhaps
dcolinmorgan Aug 29, 2023
f24411e
try basic assert isinstance
dcolinmorgan Aug 30, 2023
d303afb
nope
dcolinmorgan Aug 30, 2023
b34ee85
nope
dcolinmorgan Aug 30, 2023
2456b70
type checking node attributes causing issues
dcolinmorgan Aug 30, 2023
8fc0b22
type checking node attributes causing issues
dcolinmorgan Aug 30, 2023
ee6c523
type checking node attributes causing issues
dcolinmorgan Aug 30, 2023
4808428
defaulting to cucat, concrete mixedup perhaps
dcolinmorgan Aug 30, 2023
a22e85e
type checking node attributes causing issues
dcolinmorgan Aug 30, 2023
86fc662
type checking node attributes causing issues
dcolinmorgan Aug 30, 2023
614fff4
type checking node attributes causing issues
dcolinmorgan Aug 30, 2023
b88e3ea
type checking node attributes causing issues
dcolinmorgan Aug 30, 2023
a72d4b1
type checking node attributes causing issues
dcolinmorgan Aug 30, 2023
4eef71c
type checking node attributes causing issues
dcolinmorgan Aug 30, 2023
0522981
check which column is off
dcolinmorgan Aug 30, 2023
73ba5d1
trying everything
dcolinmorgan Aug 30, 2023
9da0b11
remove print, add print
dcolinmorgan Aug 30, 2023
f9e9260
same df every time, remove [cols]
dcolinmorgan Aug 30, 2023
58d1461
revert, remove +target_names_node from targets
dcolinmorgan Aug 30, 2023
d5acc1a
revert, remove +target_names_node from targets
dcolinmorgan Aug 30, 2023
614d9f3
nan raising equality issues, filled with 0
dcolinmorgan Aug 31, 2023
31b5f5e
add feat tests back
dcolinmorgan Sep 7, 2023
bc4f290
Merge branch 'master' into feat/gpu-featurization
dcolinmorgan Sep 7, 2023
74a2460
Merge branch 'feat/gpu-featurization' of https://github.com/graphistr…
dcolinmorgan Sep 7, 2023
624c721
comment anxiety assert
dcolinmorgan Sep 7, 2023
2fc6be5
single cuda engine flag
dcolinmorgan Sep 9, 2023
178adba
try constant substitution
dcolinmorgan Sep 9, 2023
90bd8b7
add cuda/gpu generic engine flag for full gpu pipeline
dcolinmorgan Sep 19, 2023
5d16a9e
most comments
dcolinmorgan Sep 21, 2023
e931456
most comments
dcolinmorgan Sep 21, 2023
fc212a8
most comments
dcolinmorgan Sep 21, 2023
d4b1fbe
most comments
dcolinmorgan Sep 21, 2023
498a4de
most comments
dcolinmorgan Sep 21, 2023
aab2ad9
remove single engine flag, try in next PR
dcolinmorgan Sep 21, 2023
f0eb1bf
latest cu-cat version
dcolinmorgan Sep 21, 2023
867874d
edge concat interop
dcolinmorgan Dec 29, 2023
5a69233
Merge branch 'master' into feat/gpu-featurization
dcolinmorgan Dec 29, 2023
cdda3e7
better dc default
dcolinmorgan Dec 29, 2023
63398b3
renaming
dcolinmorgan Jan 3, 2024
b720bc1
renaming
dcolinmorgan Jan 3, 2024
ed824ec
cupyx csr toarray for features_out
dcolinmorgan Jan 4, 2024
1735134
cupyx csr toarray for features_out
dcolinmorgan Jan 4, 2024
824d940
cupyx csr toarray for features_out
dcolinmorgan Jan 4, 2024
c7ce92c
add gpu-umap test, allow cucat to test w/o gpu
dcolinmorgan Jan 4, 2024
30a04a4
add gpu-umap test, allow cucat to test w/o gpu
dcolinmorgan Jan 4, 2024
50df365
dirty_cat version with Table&SuperVectorizer
dcolinmorgan Jan 4, 2024
a654f9f
dirty_cat version with Table&SuperVectorizer
dcolinmorgan Jan 4, 2024
a86be5c
better dimension try
dcolinmorgan Jan 5, 2024
4bd056c
Merge branch 'master' into feat/gpu-featurization
dcolinmorgan Jul 10, 2024
1 change: 0 additions & 1 deletion docker/test-gpu-local.sh
@@ -44,5 +44,4 @@ docker run \
${NETWORK} \
graphistry/test-gpu:${TEST_CPU_VERSION} \
--maxfail=1 \
--ignore=graphistry/tests/test_feature_utils.py \
$@
169 changes: 141 additions & 28 deletions graphistry/feature_utils.py
@@ -49,6 +49,16 @@
SuperVectorizer = Any
GapEncoder = Any
SimilarityEncoder = Any
try:
from cu_cat import (
SuperVectorizer,
GapEncoder,
SimilarityEncoder,
) # type: ignore
except:
SuperVectorizer = Any
GapEncoder = Any
SimilarityEncoder = Any
try:
from sklearn.preprocessing import FunctionTransformer
from sklearn.base import BaseEstimator, TransformerMixin
@@ -93,6 +103,28 @@ def lazy_import_has_min_dependancy():
except ModuleNotFoundError as e:
return False, e

def lazy_import_has_dependancy_cu_cat():
import warnings
warnings.filterwarnings("ignore")
try:
import scipy.sparse # noqa
from scipy import __version__ as scipy_version
from cu_cat import __version__ as cu_cat_version
import cu_cat
from sklearn import __version__ as sklearn_version
from cuml import __version__ as cuml_version
import cuml
from cudf import __version__ as cudf_version
import cudf
logger.debug(f"SCIPY VERSION: {scipy_version}")
logger.debug(f"Cuda CAT VERSION: {cu_cat_version}")
logger.debug(f"sklearn VERSION: {sklearn_version}")
logger.debug(f"cuml VERSION: {cuml_version}")
logger.debug(f"cudf VERSION: {cudf_version}")
return True, 'ok', cudf
except ModuleNotFoundError as e:
return False, e, None


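The `lazy_import_has_dependancy_cu_cat` helper above follows a general pattern: probe an optional dependency at call time and return an `(ok, message_or_exception, module)` triple instead of failing at import time. A minimal stdlib-only sketch of that pattern (the `probe_optional_dependency` name is hypothetical, not part of this PR):

```python
import importlib
from types import ModuleType
from typing import Optional, Tuple, Union

def probe_optional_dependency(
    name: str,
) -> Tuple[bool, Union[str, Exception], Optional[ModuleType]]:
    """Try to import an optional dependency lazily, returning the same
    (ok, message_or_exception, module) triple shape used in the PR."""
    try:
        mod = importlib.import_module(name)
        return True, "ok", mod
    except ModuleNotFoundError as e:
        return False, e, None

# stdlib module: succeeds on any interpreter
ok, msg, mod = probe_optional_dependency("json")
```

Callers can then branch on the boolean and keep the returned module handle, which is how the PR threads `cudf` through without a top-level `import cudf`.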
def assert_imported_text():
has_dependancy_text_, import_text_exn, _ = lazy_import_has_dependancy_text()
@@ -114,6 +146,33 @@ def assert_imported():
raise import_min_exn


def assert_cuml_cucat():
has_cuml_dependancy_, import_cuml_exn, cudf = lazy_import_has_dependancy_cu_cat()
if not has_cuml_dependancy_:
logger.error( # noqa
"cuml not found, trying running" # noqa
"`pip install rapids`" # noqa
)
raise import_cuml_exn


def make_safe_gpu_dataframes(X, y, engine):
has_cudf_dependancy_, _, cudf = lazy_import_has_dependancy_cu_cat()
Contributor: Add assert cudf is not None ?
Contributor: also probably good to switch from lazy_import...cu_cat to a cudf one
Contributor: ok -- after the if statement seems best here again like other assert you mentioned

if has_cudf_dependancy_:
new_kwargs = {}
kwargs = {'X': X, 'y': y}
for key, value in kwargs.items():
if isinstance(value, cudf.DataFrame) and engine in ["pandas", "dirty_cat", "torch"]:
new_kwargs[key] = value.to_pandas()
elif isinstance(value, pd.DataFrame) and engine in ["cuml", "cu_cat"]:
new_kwargs[key] = cudf.from_pandas(value)
else:
new_kwargs[key] = value
return new_kwargs['X'], new_kwargs['y']
else:
return X, y

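`make_safe_gpu_dataframes` above moves each frame to the device its engine expects: cudf frames become pandas for CPU engines, and pandas frames become cudf for GPU engines. A minimal CPU-only sketch of that dispatch, using stub classes in place of `pd.DataFrame`/`cudf.DataFrame` (all names here are illustrative, not part of the PR):

```python
class PandasLikeFrame:
    """Stand-in for pd.DataFrame in this sketch."""
    device = "cpu"

class CudfLikeFrame:
    """Stand-in for cudf.DataFrame in this sketch."""
    device = "gpu"

CPU_ENGINES = {"pandas", "dirty_cat", "torch"}
GPU_ENGINES = {"cuml", "cu_cat"}

def make_safe_frames(X, y, engine):
    """Convert each frame to the device the engine expects; pass
    through anything already on the right side (or None)."""
    def convert(value):
        if value is None:
            return None
        if isinstance(value, CudfLikeFrame) and engine in CPU_ENGINES:
            return PandasLikeFrame()   # value.to_pandas() in the real code
        if isinstance(value, PandasLikeFrame) and engine in GPU_ENGINES:
            return CudfLikeFrame()     # cudf.from_pandas(value) in the real code
        return value
    return convert(X), convert(y)
```

The real function adds a guard for the case where cudf is not installed at all, in which case both frames pass through unchanged.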

# ############################################################################
#
# Rough calltree
@@ -137,29 +196,32 @@ def assert_imported():
#
# _featurize_or_get_edges_dataframe_if_X_is_None

FeatureEngineConcrete = Literal["none", "pandas", "dirty_cat", "torch"]
FeatureEngineConcrete = Literal["none", "pandas", "dirty_cat", "torch", "cu_cat"]
FeatureEngine = Literal[FeatureEngineConcrete, "auto"]


def resolve_feature_engine(
feature_engine: FeatureEngine,
) -> FeatureEngineConcrete: # noqa

if feature_engine in ["none", "pandas", "dirty_cat", "torch"]:
if feature_engine in ["none", "pandas", "dirty_cat", "torch", "cu_cat"]:
return feature_engine # type: ignore

if feature_engine == "auto":
has_dependancy_text_, _, _ = lazy_import_has_dependancy_text()
if has_dependancy_text_:
return "torch"
has_cuml_dependancy_, _, cudf = lazy_import_has_dependancy_cu_cat()
if has_cuml_dependancy_:
return "cu_cat"
has_min_dependancy_, _ = lazy_import_has_min_dependancy()
if has_min_dependancy_:
return "dirty_cat"
return "pandas"

raise ValueError( # noqa
f'feature_engine expected to be "none", '
'"pandas", "dirty_cat", "torch", or "auto"'
'"pandas", "dirty_cat", "torch", "cu_cat", or "auto"'
f'but received: {feature_engine} :: {type(feature_engine)}'
)

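The `"auto"` branch of `resolve_feature_engine` above is a fallback ladder: torch if text dependencies are present, else cu_cat if RAPIDS is present, else dirty_cat, else pandas. That ladder can be sketched generically with injected availability probes (the `resolve_auto_engine` helper is hypothetical, not part of the PR):

```python
from typing import Callable, List, Tuple

def resolve_auto_engine(
    probes: List[Tuple[str, Callable[[], bool]]],
    fallback: str = "pandas",
) -> str:
    """Return the first engine whose availability probe succeeds,
    mirroring the torch -> cu_cat -> dirty_cat -> pandas ladder."""
    for engine, is_available in probes:
        if is_available():
            return engine
    return fallback

probes = [
    ("torch", lambda: False),     # pretend sentence-transformers is absent
    ("cu_cat", lambda: False),    # pretend RAPIDS is absent
    ("dirty_cat", lambda: True),  # minimal CPU deps present
]
engine = resolve_auto_engine(probes)  # → "dirty_cat"
```

Note the ordering matters: placing `cu_cat` before `dirty_cat` is what makes the PR prefer the GPU encoder whenever RAPIDS is importable.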
@@ -230,18 +292,19 @@ def features_without_target(
:param y: target DataFrame
:return: DataFrames of model and target
"""
_, _, cudf = lazy_import_has_dependancy_cu_cat()
if y is None:
return df
remove_cols = []
if y is None:
pass
elif isinstance(y, pd.DataFrame):
elif isinstance(y, pd.DataFrame) or isinstance(y, cudf.DataFrame):
Contributor: great catch
Contributor: @dcolinmorgan or (cudf is not None and isinstance(y, cudf.DataFrame)
Contributor: maybe same problem elsewhere?
Contributor: If cudf is None, I think the current would throw an exn

yc = y.columns
xc = df.columns
for c in yc:
if c in xc:
remove_cols.append(c)
elif isinstance(y, pd.Series):
elif isinstance(y, pd.Series) or isinstance(y, cudf.Series):
Contributor: handle non-cu_cat import returning None for cudf

if y.name and (y.name in df.columns):
remove_cols = [y.name]
elif isinstance(y, List):
@@ -265,7 +328,7 @@ def remove_node_column_from_symbolic(X_symbolic, node):
logger.info(f"Removing `{node}` from input X_symbolic list")
X_symbolic.remove(node)
return X_symbolic
if isinstance(X_symbolic, pd.DataFrame):
if isinstance(X_symbolic, pd.DataFrame) or 'cudf' in str(getmodule(X_symbolic)):
logger.info(f"Removing `{node}` from input X_symbolic DataFrame")
return X_symbolic.drop(columns=[node], errors="ignore")

@@ -619,11 +682,19 @@ def fit_pipeline(
columns = X.columns
index = X.index

X = transformer.fit_transform(X)
if keep_n_decimals:
X = np.round(X, decimals=keep_n_decimals) # type: ignore # noqa

return pd.DataFrame(X, columns=columns, index=index)
X_type = str(getmodule(X))
if 'cudf' not in X_type:
X = transformer.fit_transform(X)
if keep_n_decimals:
X = np.round(X, decimals=keep_n_decimals) # type: ignore # noqa
X = pd.DataFrame(X, columns=columns, index=index)
else:
X = transformer.fit_transform(X.to_numpy())
Contributor: how do we know if the transformer is cpu vs gpu? it seems to always assume cpu here, but if X is cudf and transformer is gpu, can't we keep X on gpu?
lmeyerov (Aug 2, 2023): a sometimes-ok soln would be checking transformer for being from cuml or maybe cu_cat, but that seems non-generalizable
Contributor: wow this nearly gave me a heart attack -- good thoughts i will work with... but this is an artifact, no .to_numpy needed

if keep_n_decimals:
X = np.round(X, decimals=keep_n_decimals) # type: ignore # noqa
_, _, cudf = lazy_import_has_dependancy_cu_cat()
Contributor: assert cudf is not None
Contributor: good practice to assert even after if statement check? good to know, yay learning
Contributor: It's useful when you don't want the 'if' but can imagine future changes or misuses accidentally getting the assumption wrong

X = cudf.DataFrame(X, columns=columns, index=index)
return X


def impute_and_scale_df(
@@ -848,6 +919,7 @@ def process_dirty_dataframes(
similarity: Optional[str] = None, # "ngram",
categories: Optional[str] = "auto",
multilabel: bool = False,
feature_engine: Optional[str] = "dirty_cat",
lmeyerov marked this conversation as resolved.
) -> Tuple[
pd.DataFrame,
Optional[pd.DataFrame],
Expand All @@ -873,8 +945,16 @@ def process_dirty_dataframes(
:return: Encoded data matrix and target (if not None),
the data encoder, and the label encoder.
"""
from dirty_cat import SuperVectorizer, GapEncoder, SimilarityEncoder
from sklearn.preprocessing import FunctionTransformer

if feature_engine == 'cu_cat':
lazy_import_has_dependancy_cu_cat()
from cu_cat import SuperVectorizer, GapEncoder, SimilarityEncoder
from cuml.preprocessing import FunctionTransformer

else:
from dirty_cat import SuperVectorizer, GapEncoder, SimilarityEncoder
from sklearn.preprocessing import FunctionTransformer

t = time()

if not is_dataframe_all_numeric(ndf):
@@ -911,12 +991,19 @@
)
# now just set the feature names, since dirty cat changes them in
# a weird way...
data_encoder.get_feature_names_out = callThrough(features_transformed)

X_enc = pd.DataFrame(
X_enc, columns=features_transformed, index=ndf.index
)
X_enc = X_enc.fillna(0.0)
data_encoder.get_feature_names_out = callThrough(features_transformed)
if 'cudf' not in str(getmodule(ndf)):
X_enc = pd.DataFrame(
X_enc, columns=features_transformed, index=ndf.index
)
X_enc = X_enc.fillna(0.0)
else:
_, _, cudf = lazy_import_has_dependancy_cu_cat()
Contributor: what if cudf is passed in but cu_cat is not installed?
Contributor: (and are there tests for this?)
dcolinmorgan (Aug 4, 2023): in general you mean? this gpu/feat branch was an attempt to handle this dilemma
dcolinmorgan: i think in future if cudf is passed as general engine it should initiate cucat and we remove feature_engine flag; also e.g. if cucat + umap-learn passed -- just to bypass user selection error / bad combinations
Contributor: Yes Tanmoy is out for a bit unexpectedly, so that would help
Contributor: And yes we need to handle multiple envs & multiple user prefs in them, so we can do smart defaults + allow overrides. Makes automated testing here important bc so tricky ...

X_enc = cudf.DataFrame(
X_enc, columns=features_transformed, index=ndf.index
)
X_enc = X_enc.fillna(0.0).to_pandas() # will be removed for future cu_cat release

else:
logger.info("-*-*- DataFrame is completely numeric")
X_enc, _, data_encoder, _ = get_numeric_transformers(ndf, None)
@@ -1117,7 +1204,8 @@ def process_nodes_dataframes(
n_topics_target=n_topics_target,
similarity=similarity,
categories=categories,
multilabel=multilabel
multilabel=multilabel,
feature_engine=feature_engine,
)

if embedding:
@@ -1235,20 +1323,31 @@ def encode_edges(edf, src, dst, mlb, fit=False):
"""
# uses mlb with fit=T/F so we can use it in transform mode
# to recreate edge feature concat definition
edf_type = str(getmodule(edf))
source = edf[src]
destination = edf[dst]
source_dtype = str(getmodule(source))
logger.debug("Encoding Edges using MultiLabelBinarizer")
if fit:
if fit and 'cudf' not in source_dtype:
T = mlb.fit_transform(zip(source, destination))
else:
elif fit and 'cudf' in source_dtype:
T = mlb.fit_transform(zip(source.to_pandas(), destination.to_pandas()))
elif not fit and 'cudf' not in source_dtype:
T = mlb.transform(zip(source, destination))
elif not fit and 'cudf' in source_dtype:
T = mlb.transform(zip(source.to_pandas(), destination.to_pandas()))

T = 1.0 * T # coerce to float
columns = [
str(k) for k in mlb.classes_
] # stringify the column names or scikits.base throws error
mlb.get_feature_names_out = callThrough(columns)
mlb.columns_ = [src, dst]
T = pd.DataFrame(T, columns=columns, index=edf.index)
if 'cudf' in edf_type:
_, _, cudf = lazy_import_has_dependancy_cu_cat()
T = cudf.DataFrame(T, columns=columns, index=edf.index)
else:
T = pd.DataFrame(T, columns=columns, index=edf.index)
logger.info(f"Shape of Edge Encoding: {T.shape}")
return T, mlb

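`encode_edges` above one-hot encodes each edge's `{src, dst}` node pair with scikit-learn's `MultiLabelBinarizer` (converting cudf columns to pandas first, since the binarizer is CPU-only). The core encoding can be sketched in pure Python (the `binarize_edge_pairs` helper is illustrative, not part of the PR):

```python
def binarize_edge_pairs(pairs):
    """One-hot encode each edge's set of endpoint nodes: one column per
    distinct node, 1.0 where the node is an endpoint of that edge.
    A minimal stand-in for MultiLabelBinarizer.fit_transform."""
    classes = sorted({node for pair in pairs for node in pair})
    index = {node: i for i, node in enumerate(classes)}
    rows = []
    for pair in pairs:
        row = [0.0] * len(classes)
        for node in set(pair):       # dedupe self-loops
            row[index[node]] = 1.0
        rows.append(row)
    return rows, classes

rows, classes = binarize_edge_pairs([("a", "b"), ("b", "c")])
# classes == ['a', 'b', 'c']; each row marks the edge's two endpoints
```

The real code keeps the fitted binarizer (`mlb`) so the same column layout can be reused in transform mode on new edges, which is why `fit` is threaded through as a flag.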
@@ -1321,6 +1420,7 @@ def process_edge_dataframes(
MultiLabelBinarizer()
) # create new one so we can use encode_edges later in
# transform with fit=False
_, _, cudf = lazy_import_has_dependancy_cu_cat()
T, mlb_pairwise_edge_encoder = encode_edges(
edf, src, dst, mlb_pairwise_edge_encoder, fit=True
)
@@ -1406,7 +1506,11 @@ def process_edge_dataframes(
if not X_enc.empty and not T.empty:
logger.debug("-" * 60)
logger.debug("<= Found Edges and Dirty_cat encoding =>")
X_enc = pd.concat([T, X_enc], axis=1)
T_type = str(getmodule(T))
if 'cudf' in T_type:
X_enc = cudf.concat([T, X_enc], axis=1)
else:
X_enc = pd.concat([T, X_enc], axis=1)
elif not T.empty and X_enc.empty:
logger.debug("-" * 60)
logger.debug("<= Found only Edges =>")
@@ -1811,7 +1915,7 @@ def prune_weighted_edges_df_and_relabel_nodes(
" -- Pruning weighted edge DataFrame "
f"from {len(wdf):,} to {len(wdf2):,} edges."
)
if index_to_nodes_dict is not None:
if index_to_nodes_dict is not None and type(index_to_nodes_dict) == dict:
wdf2[config.SRC] = wdf2[config.SRC].map(index_to_nodes_dict)
wdf2[config.DST] = wdf2[config.DST].map(index_to_nodes_dict)
return wdf2
@@ -1952,7 +2056,8 @@ def _featurize_nodes(
X_resolved = resolve_X(ndf, X)
y_resolved = resolve_y(ndf, y)

feature_engine = resolve_feature_engine(feature_engine)
res.feature_engine = feature_engine
dcolinmorgan marked this conversation as resolved.
X_resolved, y_resolved = make_safe_gpu_dataframes(X_resolved, y_resolved, engine=feature_engine)

from .features import ModelDict

@@ -2076,6 +2181,9 @@ def _featurize_edges(
**{res._destination: res._edges[res._destination]}
)

res.feature_engine = feature_engine
dcolinmorgan marked this conversation as resolved.
X_resolved, y_resolved = make_safe_gpu_dataframes(X_resolved, y_resolved, engine=feature_engine)

# now that everything is set
fkwargs = dict(
X=X_resolved,
@@ -2487,13 +2595,18 @@ def featurize(
default True.
:return: graphistry instance with new attributes set by the featurization process.
"""
assert_imported()
feature_engine = resolve_feature_engine(feature_engine)

if feature_engine == 'dirty_cat':
assert_imported()
dcolinmorgan marked this conversation as resolved.
elif feature_engine == 'cu_cat':
assert_cuml_cucat()

if inplace:
res = self
else:
res = self.bind()

feature_engine = resolve_feature_engine(feature_engine)

if kind == "nodes":
res = res._featurize_nodes(