-
Notifications
You must be signed in to change notification settings - Fork 209
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cucat Featurization base #486
base: master
Are you sure you want to change the base?
Conversation
graphistry/feature_utils.py
Outdated
if y is None: | ||
return df | ||
remove_cols = [] | ||
if y is None: | ||
pass | ||
elif isinstance(y, pd.DataFrame): | ||
elif isinstance(y, pd.DataFrame) or isinstance(y, cudf.DataFrame): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great catch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dcolinmorgan or (cudf is not None and isinstance(y, cudf.DataFrame)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe same problem elsewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If cudf is None, I think the current would throw an exn
@tanmoyio can we close, or this is still live and needs review? |
|
Awesome - is the plan to start landing, or more first? And would it make sense to start reviewing any PRs? If a sequence, can you stack them & point out so clear? |
landing would be wonderful -- before end of july is my dream DT4 is latest cu_cat PR branch which passes many pytests + works as expected in every demo ive done in last few months |
ok @tanmoyio can you help double check tests, take for a testdrive, and land first in cu_cat and then here? After, can you help add to main graphistry (https://github.com/graphistry/graphistry/blob/master/compose/dockerfiles/base/05-nvidia.Dockerfile) ? I think we should keep default-off for now, and should test that it's truly default off -- that existence doesn't (yet) trigger it to be used, only explicit use. |
test-full-ai test L395 seems to be getting hung up by 1 of 3 features being exactly reproduced
|
graphistry/embed_utils.py
Outdated
# return False, object | ||
def check_cudf(): | ||
try: | ||
import cudf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
Who calls this on import?
-
And can this be a a) cached call that b) checks module path vs an import?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
its only test_embed_utils#L14
that calls check_cudf
, swapped out for lazy_cudf_import
from umap_utils
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, so that shouldn't be the issue, right? test/*
shouldn't get imported by import graphistry
..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, no i wasnt clear, test_embed_utils
is only OTHER place lazy_cudf_import
was present. It was used in embed_utils
and imported cudf to check df dtype, but I have swapped it out in place of just checking via getmodule e.g. if 'cudf' in str(getmodule(self._nodes)):
, so I believe the problem is solved -- tuna looks much better
|
graphistry/feature_utils.py
Outdated
@@ -62,7 +72,7 @@ | |||
SentenceTransformer = Any | |||
SuperVectorizer = Any | |||
GapEncoder = Any | |||
SimilarityEncoder = Any | |||
# SimilarityEncoder = Any |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove all these dead lines
graphistry/feature_utils.py
Outdated
X = np.round(X, decimals=keep_n_decimals) # type: ignore # noqa | ||
X = pd.DataFrame(X, columns=columns, index=index) | ||
else: | ||
X = transformer.fit_transform(X.to_numpy()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how do we know if the transformer is cpu vs gpu? it seems to always assume cpu here, but if X is cudf and transformer is gpu, can't we keep X on gpu?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a sometimes-ok soln would be checking transformer for being from cuml or maybe cu_cat, but that seems non-generalizable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wow this nearly gave me a heart attack -- good thoughts i will work with... but this is an artifact, no .to_numpy
needed
graphistry/feature_utils.py
Outdated
|
||
|
||
def make_safe_gpu_dataframes(X, y, engine): | ||
has_cudf_dependancy_, _, cudf = lazy_import_has_dependancy_cu_cat() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add assert cudf is not None
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also probably good to switch from lazy_import...cu_cat
to a cudf
one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok -- after the if statement seems best here again like other assert you mentioned
graphistry/feature_utils.py
Outdated
yc = y.columns | ||
xc = df.columns | ||
for c in yc: | ||
if c in xc: | ||
remove_cols.append(c) | ||
elif isinstance(y, pd.Series): | ||
elif isinstance(y, pd.Series) or isinstance(y, cudf.Series): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
handle non-cu_cat import returning None
for cudf
graphistry/feature_utils.py
Outdated
X = transformer.fit_transform(X.to_numpy()) | ||
if keep_n_decimals: | ||
X = np.round(X, decimals=keep_n_decimals) # type: ignore # noqa | ||
_, _, cudf = lazy_import_has_dependancy_cu_cat() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assert cudf is not None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good practice to assert even after if statement check? good to know, yay learning
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's useful when you don't want the 'if' but can imagine future changes or misuses accidentally getting the assumption wrong
umap match transpose index type-spec concat type-spec concat dc for comp_cluster dirty_cat as default, cc passes most tests ;) source cu_cat from pypi source cu_cat from pypi remove cc tests, tested for in dc place remove cc tests, tested for in dc place init 1dc > 2cc init 1dc > 2cc use constants throughout revert from constants revert from constants init 1dc > 2cc better dc default better dc default
bab7c02
to
cdda3e7
Compare
@@ -157,6 +157,54 @@ jobs: | |||
source pygraphistry/bin/activate | |||
./bin/test-umap-learn-core.sh | |||
|
|||
|
|||
test-gpu-umap: # well cpu until get a github actions gpu node |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we should add some sort of test to confirm the gpu is enabled + gpu is used, e.g.,
nvidia-smi || exit 1
python3 -c "import cudf; cudf.DataFrame({'x': [0,1,2]})['x'].sum()"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if that is overreach for this pr, we should comment this out
this seems to have drifted a bit from main, see merge conflict i'm unsure about the cu_cat bits here, but if this pr replaces a bunch of |
Starter script