-
Notifications
You must be signed in to change notification settings - Fork 209
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cudf #445
Cudf #445
Conversation
graphistry/umap_utils.py
Outdated
index_to_nodes_dict = dict(zip(range(len(nodes)), nodes)) | ||
elif isinstance(X_,cudf.DataFrame): | ||
index_to_nodes_dict=cudf.DataFrame(nodes).reset_index() | ||
X_=pd.DataFrame(X_.to_numpy()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't we keep X_ as a GPU df, and enable process_umap etc handle it? when in gpu mode, generally easier to keep everything as gpu obj and only temporarily back out to cpu for calls that can't... a lot less to keep track of
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
X_ here is cupy not cudf
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see notes
graphistry/feature_utils.py
Outdated
@@ -1891,7 +1891,7 @@ def _featurize_nodes( | |||
ndf = res._nodes | |||
node = res._node | |||
|
|||
if remove_node_column: | |||
if remove_node_column and 'cudf.core.dataframe' not in str(getmodule(ndf)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this makes sense? Shouldn't the code path still run?
I see some pandas-specific code, so maybe we just need a cudf case as well:
pygraphistry/graphistry/feature_utils.py
Line 269 in c6ece09
if isinstance(X_symbolic, pd.DataFrame): |
if 'cudf.core.dataframe' not in str(getmodule(emb)): ## cuda cannot support nulls https://github.com/cupy/cupy/issues/5918#issuecomment-946327237 | ||
df[x_name] = emb.values.T[0] # if embedding is greater | ||
# than two dimensions will only take first two coordinates | ||
df[y_name] = emb.values.T[1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think the code still has to run, we can't just skip, right?
maybe we need a fillna or something?
also, how are nulls even getting here?
graphistry/umap_utils.py
Outdated
@@ -286,8 +286,7 @@ def _bundle_embedding(self, emb, index): | |||
if emb.shape[1] == 2 and 'cudf.core.dataframe' not in str(getmodule(emb)): | |||
emb = pd.DataFrame(emb, columns=[config.X, config.Y], index=index) | |||
elif emb.shape[1] == 2 and 'cudf.core.dataframe' in str(getmodule(emb)): | |||
import cudf | |||
emb = cudf.DataFrame(emb, columns=[config.X, config.Y], index=index) | |||
emb = pd.DataFrame(emb.to_numpy(), columns=[config.X, config.Y], index=index.to_numpy()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should stick w GPU vs CPU if we can
I thought we have an embedding in the form of a cudf df with multiple x y z etc cols to begin with, is it just we are being sloppy in how.we reference the cols ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
umap works with cudf by setting remove_node_column=False, but then get issues with g.plot() from g._nodes.to_arrow(). feeling closer
graphistry/umap_utils.py
Outdated
@@ -287,7 +287,7 @@ def _bundle_embedding(self, emb, index): | |||
emb = pd.DataFrame(emb, columns=[config.X, config.Y], index=index) | |||
elif emb.shape[1] == 2 and 'cudf.core.dataframe' in str(getmodule(emb)): | |||
import cudf | |||
emb = cudf.DataFrame(emb.to_cupy(), columns=[config.X, config.Y], index=index.to_cupy()) | |||
emb = cudf.DataFrame(emb.values, columns=[config.X, config.Y], index=index.values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is emb
-- isn't it a cudf.DataFrame
already? this might just be a emb.rename(columns={..})
... or nothing? I'm not sure of what/why here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very strange... getmodule does say it is a cudf df, but rename does not work on emb since its a cupy
following works but not pretty enough im guessing:
emb.columns=[config.X, config.Y]
emb.index=index
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what do isinstance(emb, cudf.DataFrame)
and type(emb)
say?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, <class 'cudf.core.dataframe.DataFrame'>
can use rename when just 2 columns
graphistry/umap_utils.py
Outdated
else: | ||
columns = [config.X, config.Y] + [ | ||
f"umap_{k}" for k in range(2, emb.shape[1] - 2) | ||
] | ||
if 'cudf.core.dataframe' not in str(getmodule(emb)): | ||
emb = pd.DataFrame(emb, columns=columns, index=index) | ||
elif 'cudf.core.dataframe' in str(getmodule(emb)): | ||
import cudf | ||
emb = cudf.DataFrame(emb.values, columns=columns, index=index.values) | ||
emb.columns=columns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why does shape matter? maybe 286-300 can all be emb.columns=columns
?
i'm pretty lost on cases here, and feels like there's happy-path coding here that may merit unit tests for 1/2/3/4-dim to ensure we're agreed on intended output col names
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It takes care of building an arbitrary umap embedding. I can set n_components=10 and still plot, send EMB to torch, etc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, if emb
is a df, I don't know why any of this munging is here, is all this just a df.rename(columns={..})?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(earlier there was even weirder numpy & .values stuff)
graphistry/umap_utils.py
Outdated
if isinstance(X_, pd.DataFrame): | ||
index_to_nodes_dict = dict(zip(range(len(nodes)), nodes)) | ||
elif 'cudf.core.dataframe' in str(getmodule(X_)): | ||
index_to_nodes_dict = nodes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when cudf, is index_to_nodes_dict = nodes
still a dict? @dcolinmorgan
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about... always make it a df, and just varies whether cpu vs gpu?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah its already a dict from cudf
…g umap input and outputs if engine=cuml or pandas
@tanmoyio can we close? |
I believe we can close, have merged this into #486 already |
not sure how to check if X_ is cudf.DataFrame without loading cudf at the top