Cudf #445

dcolinmorgan · 2023-02-21T09:07:39Z

not sure how to check if X_ is cudf.DataFrame without loading cudf at the top
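One possible way to do this check, sketched here as an illustration rather than what the PR ended up doing: look `cudf` up in `sys.modules` instead of importing it. If `cudf` has never been imported in the process, no object can be a `cudf.DataFrame`, so the check can return `False` without touching the GPU stack. The helper name `is_cudf_dataframe` is hypothetical and not part of graphistry.

```python
import sys


def is_cudf_dataframe(obj) -> bool:
    """Return True if obj is a cudf.DataFrame, without importing cudf.

    Hypothetical helper: relies on the fact that an instance of
    cudf.DataFrame can only exist if cudf is already in sys.modules.
    """
    cudf = sys.modules.get("cudf")
    if cudf is None:
        # cudf was never imported, so obj cannot be a cudf object
        return False
    # getattr guards against a partially initialised module
    df_cls = getattr(cudf, "DataFrame", None)
    return df_cls is not None and isinstance(obj, df_cls)
```

Compared with matching the string `'cudf.core.dataframe'` against `getmodule(...)`, this keeps a real `isinstance` check while still working on CPU-only installs.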

graphistry/feature_utils.py

graphistry/umap_utils.py

lmeyerov · 2023-02-21T09:23:19Z

graphistry/umap_utils.py

+                index_to_nodes_dict = dict(zip(range(len(nodes)), nodes))
+            elif isinstance(X_,cudf.DataFrame):
+                index_to_nodes_dict=cudf.DataFrame(nodes).reset_index()
+                X_=pd.DataFrame(X_.to_numpy())


shouldn't we keep X_ as a GPU df, and enable process_umap etc. to handle it? when in gpu mode, it's generally easier to keep everything as a gpu obj and only temporarily back out to cpu for calls that can't handle it... a lot less to keep track of

X_ here is cupy not cudf

lmeyerov

see notes

lmeyerov · 2023-03-06T07:12:32Z

graphistry/feature_utils.py

@@ -1891,7 +1891,7 @@ def _featurize_nodes(
        ndf = res._nodes
        node = res._node

-        if remove_node_column:
+        if remove_node_column and 'cudf.core.dataframe' not in str(getmodule(ndf)):


I'm not sure this makes sense? Shouldn't the code path still run?

I see some pandas-specific code, so maybe we just need a cudf case as well:

pygraphistry/graphistry/feature_utils.py

Line 269 in c6ece09

if isinstance(X_symbolic, pd.DataFrame):

lmeyerov · 2023-03-06T07:14:54Z

graphistry/umap_utils.py

+        if 'cudf.core.dataframe' not in str(getmodule(emb)): ## cuda cannot support nulls https://github.com/cupy/cupy/issues/5918#issuecomment-946327237
+            df[x_name] = emb.values.T[0]  # if embedding is greater
+            # than two dimensions will only take first two coordinates
+            df[y_name] = emb.values.T[1]


i think the code still has to run, we can't just skip, right?

maybe we need a fillna or something?

also, how are nulls even getting here?

lmeyerov · 2023-03-06T14:38:45Z

graphistry/umap_utils.py

@@ -286,8 +286,7 @@ def _bundle_embedding(self, emb, index):
        if emb.shape[1] == 2 and 'cudf.core.dataframe' not in str(getmodule(emb)):
            emb = pd.DataFrame(emb, columns=[config.X, config.Y], index=index)
        elif emb.shape[1] == 2 and 'cudf.core.dataframe' in str(getmodule(emb)):
-            import cudf
-            emb = cudf.DataFrame(emb, columns=[config.X, config.Y], index=index)
+            emb = pd.DataFrame(emb.to_numpy(), columns=[config.X, config.Y], index=index.to_numpy())


We should stick w GPU vs CPU if we can

I thought we have an embedding in the form of a cudf df with multiple x, y, z, etc. cols to begin with. Is it just that we are being sloppy in how we reference the cols?

dcolinmorgan

umap works with cudf by setting remove_node_column=False, but then get issues with g.plot() from g._nodes.to_arrow(). feeling closer

lmeyerov · 2023-03-07T02:37:27Z

graphistry/umap_utils.py

@@ -287,7 +287,7 @@ def _bundle_embedding(self, emb, index):
            emb = pd.DataFrame(emb, columns=[config.X, config.Y], index=index)
        elif emb.shape[1] == 2 and 'cudf.core.dataframe' in str(getmodule(emb)):
            import cudf
-            emb = cudf.DataFrame(emb.to_cupy(), columns=[config.X, config.Y], index=index.to_cupy())
+            emb = cudf.DataFrame(emb.values, columns=[config.X, config.Y], index=index.values)


what is emb -- isn't it a cudf.DataFrame already? this might just be a emb.rename(columns={..})... or nothing? I'm not sure of what/why here

very strange... getmodule does say it is a cudf df, but rename does not work on emb since it's a cupy
the following works, but I'm guessing it's not pretty enough:
emb.columns=[config.X, config.Y]
emb.index=index

what do isinstance(emb, cudf.DataFrame) and type(emb) say?

True, <class 'cudf.core.dataframe.DataFrame'>
can use rename when just 2 columns

lmeyerov · 2023-03-07T06:12:21Z

graphistry/umap_utils.py

        else:
            columns = [config.X, config.Y] + [
                f"umap_{k}" for k in range(2, emb.shape[1] - 2)
            ]
            if 'cudf.core.dataframe' not in str(getmodule(emb)):
                emb = pd.DataFrame(emb, columns=columns, index=index)
            elif 'cudf.core.dataframe' in str(getmodule(emb)):
-                import cudf
-                emb = cudf.DataFrame(emb.values, columns=columns, index=index.values)
+                emb.columns=columns


why does shape matter? maybe 286-300 can all be emb.columns=columns ?

i'm pretty lost on cases here, and feels like there's happy-path coding here that may merit unit tests for 1/2/3/4-dim to ensure we're agreed on intended output col names

It takes care of building an arbitrary umap embedding. I can set n_components=10 and still plot, send EMB to torch, etc

yes, if emb is a df, I don't know why any of this munging is here, is all this just a df.rename(columns={..})?

(earlier there was even weirder numpy & .values stuff)
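To make the 1/2/3/4-dim question concrete, here is a minimal sketch of a naming helper that could be unit tested independently of pandas or cudf. It is an assumption about the intended output, not the code in the PR: the helper name `embedding_column_names` is hypothetical, and the defaults `"x"` and `"y"` stand in for `config.X` and `config.Y`, whose actual values are not shown in this thread. Note that the quoted `range(2, emb.shape[1] - 2)` appears to produce fewer names than columns once there are more than two components (for 3 components it yields no `umap_` names at all), so the sketch uses `range(2, n_components)` instead.

```python
def embedding_column_names(n_components, x_name="x", y_name="y"):
    """Return one column name per embedding dimension.

    Hypothetical helper: x_name / y_name are placeholders for
    config.X / config.Y; extra dimensions are named umap_2, umap_3, ...
    """
    if n_components < 1:
        raise ValueError("n_components must be at least 1")
    # First two dimensions get the x/y names (only x for a 1-dim embedding)
    base = [x_name, y_name][:n_components]
    # Remaining dimensions are numbered from 2 up to n_components - 1
    extra = [f"umap_{k}" for k in range(2, n_components)]
    return base + extra


# The shape-independent assignment suggested above would then be:
#     emb.columns = embedding_column_names(emb.shape[1])
```

Because the helper only builds a list of strings, the same result can be assigned to `emb.columns` whether `emb` lives on the CPU or the GPU.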

silkspace · 2023-03-10T01:14:43Z

graphistry/umap_utils.py

+            if isinstance(X_, pd.DataFrame):
+                index_to_nodes_dict = dict(zip(range(len(nodes)), nodes))
+            elif 'cudf.core.dataframe' in str(getmodule(X_)):
+                index_to_nodes_dict = nodes


when cudf, is index_to_nodes_dict = nodes still a dict? @dcolinmorgan

how about... always make it a df, and just varies whether cpu vs gpu?

yeah, it's already a dict from cudf

lmeyerov · 2023-06-15T00:52:23Z

@tanmoyio can we close?

dcolinmorgan · 2023-07-24T05:35:22Z

I believe we can close, have merged this into #486 already

lmeyerov reviewed Feb 21, 2023

lmeyerov reviewed Feb 21, 2023

lmeyerov reviewed Feb 21, 2023

lmeyerov reviewed Feb 21, 2023

lmeyerov requested changes Feb 21, 2023

cudf all the way thru, cuda cannot handle nulls so few more ifs

23df5bc

lmeyerov reviewed Mar 6, 2023

cudf+umap working on numerics

a2640cd

lmeyerov reviewed Mar 6, 2023

dc added 2 commits March 7, 2023 09:08

full numeric cudf-- needs hack to plot

ec151b8

full numeric cudf-- needs hack to plot

0ec412a

dcolinmorgan commented Mar 7, 2023

lmeyerov reviewed Mar 7, 2023

use rename if 2 columns, otherwise = to columns list

55c2b07

lmeyerov reviewed Mar 7, 2023

umap cudf

0a42090

silkspace reviewed Mar 10, 2023

Alex and others added 12 commits March 9, 2023 20:37

untested, adds decorator for cuml and pandas dataframes, standardizin…

1a013a8

…g umap input and outputs if engine=cuml or pandas

rst change PlotterBase to plotter

c23529b

add graphistry.compute to graphistry.rst

c79fac6

Delete graphistry.compute.rst

fbc33e3

Update modules.rst

35345ff

add graphistry.compute to toctree

811d3e5

resolve short underline error

d6806cf

test1: resolve blank line error

0b301ca

test2: resolve blank line error

24638cf

test3: resolve blank line error

9e0c3d2

doc fix

ac15070

doc fix

f53710c

dcolinmorgan added 24 commits April 25, 2023 10:47

lazy cudf import

4b779ac

Merge branch 'cudf-cat-final' into cudf

9b25eec

lint

c1a0cca

better lazy cudf import

f853472

Merge branch 'cudf-cat-final' into cudf

8e05dd4

lazy merge

b6b148b

lint

17f0af6

lint

2a5c879

functiontransform cuml import

80ba095

functiontransform cuml import

6709bea

functiontransform cuml import

901846c

functiontransform cuml import

118ea80

functiontransform cuml import

f1ee230

use dirty_cat superVec for torch/etc, except if cu_cat

7fc02e2

use dirty_cat superVec for torch/etc, except if cu_cat

d38f469

use dirty_cat superVec for torch/etc, except if cu_cat

6436067

sklearn functiontransformer & MLB

25573ea

sklearn functiontransformer & MLB

dee5ad4

all preprocess back to sklearn

cac6cc4

import FT again

3757b10

rewrite g_n_t

523d180

revert g_n_t

97b725d

import FT in get_numeric_transform

7ab97a4

import FT in get_numeric_transform

aba0c55

lmeyerov added the WIP label Apr 30, 2023

latest release opt-in install

fb96400

lmeyerov closed this Jul 24, 2023

lmeyerov deleted the cudf branch July 24, 2023 09:06
