Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cudf #445

Closed
wants to merge 463 commits into from
Closed

Cudf #445

wants to merge 463 commits into from

Conversation

dcolinmorgan
Copy link
Contributor

not sure how to check if X_ is cudf.DataFrame without loading cudf at the top

graphistry/umap_utils.py Outdated Show resolved Hide resolved
index_to_nodes_dict = dict(zip(range(len(nodes)), nodes))
elif isinstance(X_,cudf.DataFrame):
index_to_nodes_dict=cudf.DataFrame(nodes).reset_index()
X_=pd.DataFrame(X_.to_numpy())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we keep X_ as a GPU df, and enable process_umap etc handle it? when in gpu mode, generally easier to keep everything as gpu obj and only temporarily back out to cpu for calls that can't... a lot less to keep track of

Copy link
Contributor Author

@dcolinmorgan dcolinmorgan Feb 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

X_ here is cupy not cudf

Copy link
Contributor

@lmeyerov lmeyerov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see notes

@@ -1891,7 +1891,7 @@ def _featurize_nodes(
ndf = res._nodes
node = res._node

if remove_node_column:
if remove_node_column and 'cudf.core.dataframe' not in str(getmodule(ndf)):
Copy link
Contributor

@lmeyerov lmeyerov Mar 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this makes sense? Shouldn't the code path still run?

I see some pandas-specific code, so maybe we just need a cudf case as well:

if isinstance(X_symbolic, pd.DataFrame):

if 'cudf.core.dataframe' not in str(getmodule(emb)): ## cuda cannot support nulls https://github.com/cupy/cupy/issues/5918#issuecomment-946327237
df[x_name] = emb.values.T[0] # if embedding is greater
# than two dimensions will only take first two coordinates
df[y_name] = emb.values.T[1]
Copy link
Contributor

@lmeyerov lmeyerov Mar 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think the code still has to run, we can't just skip, right?

maybe we need a fillna or something?

also, how are nulls even getting here?

@@ -286,8 +286,7 @@ def _bundle_embedding(self, emb, index):
if emb.shape[1] == 2 and 'cudf.core.dataframe' not in str(getmodule(emb)):
emb = pd.DataFrame(emb, columns=[config.X, config.Y], index=index)
elif emb.shape[1] == 2 and 'cudf.core.dataframe' in str(getmodule(emb)):
import cudf
emb = cudf.DataFrame(emb, columns=[config.X, config.Y], index=index)
emb = pd.DataFrame(emb.to_numpy(), columns=[config.X, config.Y], index=index.to_numpy())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should stick w GPU vs CPU if we can

I thought we have an embedding in the form of a cudf df with multiple x y z etc cols to begin with, is it just we are being sloppy in how.we reference the cols ?

Copy link
Contributor Author

@dcolinmorgan dcolinmorgan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

umap works with cudf by setting remove_node_column=False, but then get issues with g.plot() from g._nodes.to_arrow(). feeling closer

@@ -287,7 +287,7 @@ def _bundle_embedding(self, emb, index):
emb = pd.DataFrame(emb, columns=[config.X, config.Y], index=index)
elif emb.shape[1] == 2 and 'cudf.core.dataframe' in str(getmodule(emb)):
import cudf
emb = cudf.DataFrame(emb.to_cupy(), columns=[config.X, config.Y], index=index.to_cupy())
emb = cudf.DataFrame(emb.values, columns=[config.X, config.Y], index=index.values)
Copy link
Contributor

@lmeyerov lmeyerov Mar 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is emb -- isn't it a cudf.DataFrame already? this might just be a emb.rename(columns={..})... or nothing? I'm not sure of what/why here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very strange... getmodule does say it is a cudf df, but rename does not work on emb since its a cupy
following works but not pretty enough im guessing:
emb.columns=[config.X, config.Y]
emb.index=index

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do isinstance(emb, cudf.DataFrame) and type(emb) say?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, <class 'cudf.core.dataframe.DataFrame'>
can use rename when just 2 columns

else:
columns = [config.X, config.Y] + [
f"umap_{k}" for k in range(2, emb.shape[1] - 2)
]
if 'cudf.core.dataframe' not in str(getmodule(emb)):
emb = pd.DataFrame(emb, columns=columns, index=index)
elif 'cudf.core.dataframe' in str(getmodule(emb)):
import cudf
emb = cudf.DataFrame(emb.values, columns=columns, index=index.values)
emb.columns=columns
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does shape matter? maybe 286-300 can all be emb.columns=columns ?

i'm pretty lost on cases here, and feels like there's happy-path coding here that may merit unit tests for 1/2/3/4-dim to ensure we're agreed on intended output col names

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It takes care of building an arbitrary umap embedding. I can set n_components=10 and still plot, send EMB to torch, etc

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, if emb is a df, I don't know why any of this munging is here, is all this just a df.rename(columns={..})?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(earlier there was even weirder numpy & .values stuff)

if isinstance(X_, pd.DataFrame):
index_to_nodes_dict = dict(zip(range(len(nodes)), nodes))
elif 'cudf.core.dataframe' in str(getmodule(X_)):
index_to_nodes_dict = nodes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when cudf, is index_to_nodes_dict = nodes still a dict? @dcolinmorgan

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about... always make it a df, and just varies whether cpu vs gpu?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah its already a dict from cudf

@lmeyerov lmeyerov added the WIP label Apr 30, 2023
@lmeyerov
Copy link
Contributor

@tanmoyio can we close?

@dcolinmorgan
Copy link
Contributor Author

I believe we can close, have merged this into #486 already

@lmeyerov lmeyerov closed this Jul 24, 2023
@lmeyerov lmeyerov deleted the cudf branch July 24, 2023 09:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants