Cucat Featurization base #486

tanmoyio · 2023-05-15T15:41:17Z

Starter script

import pandas as pd
import cudf
import graphistry

df = pd.read_csv('https://gist.githubusercontent.com/silkspace/c7b50d0c03dc59f63c48d68d696958ff/raw/31d918267f86f8252d42d2e9597ba6fc03fcdac2/redteam_50k.csv', index_col=0)
red_team = pd.read_csv('https://gist.githubusercontent.com/silkspace/5cf5a94b9ac4b4ffe38904f20d93edb1/raw/888dabd86f88ea747cf9ff5f6c44725e21536465/redteam_labels.csv', index_col=0)
df['feats'] = df.src_computer + ' ' + df.dst_computer + ' ' + df.auth_type + ' ' + df.logontype
tdf = pd.concat([red_team.reset_index(), df.reset_index()])
tdf['node'] = range(len(tdf))


g = graphistry.nodes(tdf, 'node')
g1 = g.umap(X=['feats'], feature_engine='cu_cat')
print(g1._node_features)
g2 = g.umap(X=['feats'], feature_engine='dirty_cat')
print(g2._node_features)

silkspace · 2023-05-16T16:22:23Z

graphistry/feature_utils.py

    if y is None:
        return df
    remove_cols = []
    if y is None:
        pass
-    elif isinstance(y, pd.DataFrame):
+    elif isinstance(y, pd.DataFrame) or isinstance(y, cudf.DataFrame):


great catch

@dcolinmorgan or (cudf is not None and isinstance(y, cudf.DataFrame))

maybe same problem elsewhere?

If cudf is None, I think the current code would throw an exception
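
A minimal sketch of the guarded check suggested above, assuming the optional import yields `None` when the package is missing. The helper name `isinstance_optional` is hypothetical and not part of pygraphistry; the stdlib `fractions` module stands in for `cudf` so the snippet runs without a GPU stack.

```python
import fractions
from types import ModuleType
from typing import Any, Optional


def isinstance_optional(obj: Any, module: Optional[ModuleType], attr: str) -> bool:
    """Return True only if `module` was imported and `obj` is an instance of module.<attr>."""
    if module is None:
        # Optional dependency missing: nothing can be an instance of its types.
        return False
    cls = getattr(module, attr, None)
    # Guard against a missing or non-class attribute before calling isinstance.
    return isinstance(cls, type) and isinstance(obj, cls)


# Stand-in demo: `fractions` plays the role of the optional dependency.
half = fractions.Fraction(1, 2)
found = isinstance_optional(half, fractions, "Fraction")
missing = isinstance_optional(half, None, "Fraction")
```

In the diff above, the branch could then read `isinstance(y, pd.DataFrame) or isinstance_optional(y, cudf, 'DataFrame')`, which never touches `cudf.DataFrame` when `cudf` is `None`.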

lmeyerov · 2023-06-15T00:49:20Z

@tanmoyio can we close, or is this still live and in need of review?

dcolinmorgan · 2023-07-19T02:38:39Z

For cu_cat itself (DT3 branch), I have worked out dynamic memory handling so it runs flexibly on both T4 and A100.
I also worked out datetime passthrough. However, this needs to bypass the cudf DataFrame conversion in both cu_cat AND pygraphistry so that g.plotter infers datetime correctly and provides the time series box
-- currently I accomplish this in a hacky way by binding it to the embeddings after transforming but before plotting, thus avoiding the cudf requirement
I have now refactored the code to require only the gapencoder and tablevectorizer files/functions; this is the DT4 branch, forked from DT3

lmeyerov · 2023-07-19T07:01:10Z

Awesome - is the plan to start landing, or more first?

And would it make sense to start reviewing any PRs? If they form a sequence, can you stack them and point out the order so it is clear?

dcolinmorgan · 2023-07-21T03:45:17Z

landing would be wonderful -- before end of july is my dream

DT4 is the latest cu_cat PR branch, which passes many pytests + works as expected in every demo I've done in the last few months

lmeyerov · 2023-07-23T00:34:38Z

ok @tanmoyio can you help double check tests, take for a testdrive, and land first in cu_cat and then here?

After, can you help add to main graphistry (https://github.com/graphistry/graphistry/blob/master/compose/dockerfiles/base/05-nvidia.Dockerfile) ? I think we should keep default-off for now, and should test that it's truly default off -- that existence doesn't (yet) trigger it to be used, only explicit use.

dcolinmorgan · 2023-07-26T10:04:09Z

The test-full-ai test (L395) seems to be getting hung up because only 1 of 3 features is exactly reproduced

when first discussing this with @silkspace -- this is exactly what we realized approximate estimation would likely return, and the user must make sure the features make sense, just like with dirty_cat
we likely need to test a few runs so that 2/3 are always reproduced across several estimations, rather than requiring a single case of 3/3 reproduction
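
One way the "2 of 3 across several estimations" idea could be expressed as a test helper is sketched below. This is an illustration only: the function names and the placeholder feature names are hypothetical, and the real test would feed in feature names produced by the encoder on each run.

```python
from typing import Iterable, List, Sequence


def min_overlap_across_runs(expected: Sequence[str], runs: Iterable[Sequence[str]]) -> int:
    """Smallest number of expected feature names recovered by any single run."""
    expected_set = set(expected)
    overlaps: List[int] = [len(expected_set & set(run)) for run in runs]
    if not overlaps:
        raise ValueError("at least one run is required")
    return min(overlaps)


def assert_mostly_reproduced(expected: Sequence[str],
                             runs: Iterable[Sequence[str]],
                             min_matches: int) -> None:
    """Fail unless every run reproduces at least `min_matches` expected features."""
    worst = min_overlap_across_runs(expected, runs)
    assert worst >= min_matches, (
        f"worst run reproduced {worst} of {len(expected)} features, "
        f"need at least {min_matches}"
    )


# Placeholder feature names standing in for encoder output on three runs.
expected_features = ["feat_a", "feat_b", "feat_c"]
observed_runs = [
    ["feat_a", "feat_b", "feat_x"],
    ["feat_a", "feat_b", "feat_c"],
    ["feat_b", "feat_c", "feat_y"],
]
assert_mostly_reproduced(expected_features, observed_runs, min_matches=2)
```

Requiring a minimum overlap on every run tolerates the approximate estimation while still catching a run that drifts badly.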

lmeyerov · 2023-07-28T13:54:20Z

graphistry/embed_utils.py

-#         return False, object
+def check_cudf():
+    try:
+        import cudf


Who calls this on import?

And can this be a) a cached call that b) checks the module path instead of doing an import?

It's only test_embed_utils#L14 that calls check_cudf; I swapped it out for lazy_cudf_import from umap_utils

oh, so that shouldn't be the issue, right? test/* shouldn't get imported by import graphistry..

Sorry, no, I wasn't clear: test_embed_utils is the only OTHER place lazy_cudf_import was present. It was used in embed_utils and imported cudf to check the df dtype, but I have swapped it out for a check via getmodule, e.g. if 'cudf' in str(getmodule(self._nodes)): , so I believe the problem is solved -- tuna looks much better
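
The two ideas in this thread (a cached availability check that does not import the package, and a type check based on where an object's class is defined) can be sketched as follows. Both helper names are hypothetical, and the second uses `type(obj).__module__` rather than the `str(getmodule(...))` form mentioned above; it is a close variant, not the PR's code.

```python
import importlib.util
from functools import lru_cache
from typing import Any


@lru_cache(maxsize=None)
def module_available(name: str) -> bool:
    """Check whether a top-level module can be located, without importing it."""
    return importlib.util.find_spec(name) is not None


def defined_in_package(obj: Any, package: str) -> bool:
    """True if the class of `obj` lives in `package` or one of its submodules."""
    root = type(obj).__module__.split(".")[0]
    return root == package


# Neither call imports cudf, so `import graphistry`-style startup stays fast.
cudf_installed = module_available("cudf")
is_cudf_object = defined_in_package([1, 2, 3], "cudf")
```

Comparing the root package name exactly avoids false positives from unrelated modules whose path merely contains the substring `cudf`.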

lmeyerov · 2023-07-29T06:19:28Z

@silkspace regarding the comment above on Cucat Featurization base #486, you may understand this better?

graphistry/feature_utils.py

graphistry/embed_utils.py

lmeyerov · 2023-08-02T16:15:13Z

graphistry/feature_utils.py

@@ -62,7 +72,7 @@
    SentenceTransformer = Any
    SuperVectorizer = Any
    GapEncoder = Any
-    SimilarityEncoder = Any
+    # SimilarityEncoder = Any


remove all these dead lines

lmeyerov · 2023-08-02T16:18:00Z

graphistry/feature_utils.py

+            X = np.round(X, decimals=keep_n_decimals)  #  type: ignore  # noqa
+        X = pd.DataFrame(X, columns=columns, index=index)
+    else:
+        X = transformer.fit_transform(X.to_numpy())


how do we know if the transformer is cpu vs gpu? it seems to always assume cpu here, but if X is cudf and transformer is gpu, can't we keep X on gpu?

a sometimes-ok solution would be checking whether the transformer comes from cuml or maybe cu_cat, but that seems non-generalizable

wow this nearly gave me a heart attack -- good thoughts, I will work with them... but this is an artifact, no .to_numpy needed
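
The "sometimes-ok" approach mentioned above could look like the sketch below. It is an assumption-laden illustration: the package list, helper names, and the fake classes are all hypothetical, and, as noted in the comment, checking the defining package does not generalize to arbitrary GPU transformers.

```python
from typing import Any, Tuple

# Assumption: transformers defined under these packages accept GPU frames.
GPU_PACKAGES: Tuple[str, ...] = ("cuml", "cu_cat")


def transformer_is_gpu(transformer: Any, gpu_packages: Tuple[str, ...] = GPU_PACKAGES) -> bool:
    """Guess GPU support from the package that defines the transformer's class."""
    root = type(transformer).__module__.split(".")[0]
    return root in gpu_packages


def prepare_input(transformer: Any, X: Any) -> Any:
    """Keep X on the GPU for GPU transformers; otherwise move GPU frames to CPU."""
    if transformer_is_gpu(transformer):
        return X
    if hasattr(X, "to_pandas"):
        # cudf objects expose to_pandas(); pandas objects do not and pass through.
        return X.to_pandas()
    return X


# Stand-ins so the sketch runs without cuml, sklearn, or cudf installed.
class FakeGpuTransformer:
    __module__ = "cuml.fake"


class FakeCpuTransformer:
    __module__ = "sklearn.fake"


class FakeGpuFrame:
    def to_pandas(self) -> str:
        return "cpu-frame"
```

With this split, a GPU frame is only copied to host memory when the transformer cannot consume it directly.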

lmeyerov · 2023-08-02T16:24:18Z

graphistry/feature_utils.py

+
+
+def make_safe_gpu_dataframes(X, y, engine):
+    has_cudf_dependancy_, _, cudf = lazy_import_has_dependancy_cu_cat()


Add assert cudf is not None ?

also probably good to switch from lazy_import...cu_cat to a cudf one

OK -- placing it after the if statement seems best here, like the other assert you mentioned

lmeyerov · 2023-08-02T16:26:52Z

graphistry/feature_utils.py

        yc = y.columns
        xc = df.columns
        for c in yc:
            if c in xc:
                remove_cols.append(c)
-    elif isinstance(y, pd.Series):
+    elif isinstance(y, pd.Series) or isinstance(y, cudf.Series):


Handle the case where cu_cat is not installed and the lazy import returns None for cudf

lmeyerov · 2023-08-02T16:28:49Z

graphistry/feature_utils.py

+        X = transformer.fit_transform(X.to_numpy())
+        if keep_n_decimals:
+            X = np.round(X, decimals=keep_n_decimals)  #  type: ignore  # noqa
+        _, _, cudf = lazy_import_has_dependancy_cu_cat()


assert cudf is not None

good practice to assert even after if statement check? good to know, yay learning

It's useful when you don't want the 'if' but can imagine future changes or misuses accidentally getting the assumption wrong
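
A small sketch of the assert-after-lazy-import pattern discussed above. The helpers `lazy_import` and `require_module` are hypothetical stand-ins for the PR's lazy import functions; they only mirror the `(has_dependency, message, module)` tuple shape seen in the diff.

```python
import importlib
from types import ModuleType
from typing import Optional, Tuple


def lazy_import(name: str) -> Tuple[bool, str, Optional[ModuleType]]:
    """Import `name` on demand; return (ok, message, module or None)."""
    try:
        module = importlib.import_module(name)
        return True, "ok", module
    except ModuleNotFoundError as exc:
        return False, str(exc), None


def require_module(name: str) -> ModuleType:
    """Use in code paths that are only valid when the dependency is present."""
    _, message, module = lazy_import(name)
    # Documents the assumption and fails loudly if a future change breaks it.
    assert module is not None, f"{name} is required here: {message}"
    return module


json_module = require_module("json")  # stdlib stand-in for cudf
```

The assert costs nothing on the happy path and turns a later, confusing `AttributeError` on `None` into an immediate, labeled failure.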

graphistry/feature_utils.py

setup.py

umap match transpose index type-spec concat type-spec concat dc for comp_cluster dirty_cat as default, cc passes most tests ;) source cu_cat from pypi source cu_cat from pypi remove cc tests, tested for in dc place remove cc tests, tested for in dc place init 1dc > 2cc init 1dc > 2cc use constants throughout revert from constants revert from constants init 1dc > 2cc better dc default better dc default

lmeyerov · 2024-07-04T05:34:43Z

.github/workflows/ci.yml

@@ -157,6 +157,54 @@ jobs:
        source pygraphistry/bin/activate
        ./bin/test-umap-learn-core.sh

+
+  test-gpu-umap:  # well cpu until get a github actions gpu node


this might work now? https://medium.com/@tajinder.singh1985/exploring-github-nvidia-powered-gpu-hosted-runner-32b172a92c7e

maybe we should add some sort of test to confirm the gpu is enabled + gpu is used, e.g.,

nvidia-smi || exit 1
python3 -c "import cudf; cudf.DataFrame({'x': [0,1,2]})['x'].sum()"

if that is overreach for this pr, we should comment this out
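
If a shell one-liner is too coarse, the same smoke test could be written as a small Python script, sketched below. The function names are hypothetical; the only external pieces assumed are the `nvidia-smi` binary and the `cudf` calls already shown in the comment above.

```python
import shutil
import subprocess


def nvidia_smi_ok(timeout_s: float = 30.0) -> bool:
    """True if nvidia-smi exists on PATH and exits with status 0."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False
    try:
        result = subprocess.run(
            [exe],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def cudf_sum_ok() -> bool:
    """True if cudf imports and a tiny column sum runs and gives 0 + 1 + 2."""
    try:
        import cudf  # imported here so machines without cudf can still run this file
        total = cudf.DataFrame({"x": [0, 1, 2]})["x"].sum()
    except Exception:
        # Missing package, missing driver, or no usable device all count as failure.
        return False
    return int(total) == 3
```

A CI step could fail the job when either function returns `False`, which confirms both that a GPU is visible and that cudf can actually use it.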

lmeyerov · 2024-07-04T05:39:30Z

this seems to have drifted a bit from main, see merge conflict

i'm unsure about the cu_cat bits here, but if this pr replaces a bunch of import cudf with dynamic "cudf.dataframe" in str(module(df)) checks to avoid slow imports, sounds useful & overdue...

tanmoyio added 3 commits May 15, 2023 21:04

cucat feat support

cf07249

cudf test env var added for test_feature_utils.py

d73a2db

some import fixes

382e18b

silkspace reviewed May 16, 2023

passthru DT encode/umap, add back for timebar

44200ac

lint

777afd4

lmeyerov assigned dcolinmorgan and tanmoyio Jul 23, 2023

This was referenced Jul 24, 2023

Cudf #445

Closed

include cuCat in ai deps #444

Closed

[BUG] lazy loading regression #481

Open

dcolinmorgan added 3 commits July 26, 2023 18:12

updated cu-cat version for optional install

c1bc6f1

type check without loading cudf, via getmodule

48e4017

ok we still need the check_cudf def

6b0b52b

lmeyerov reviewed Jul 28, 2023

swap lazy import defs

e4b0c0a

dcolinmorgan reviewed Aug 2, 2023

graphistry/feature_utils.py

lmeyerov reviewed Aug 2, 2023

graphistry/embed_utils.py (outdated)

lmeyerov reviewed Aug 2, 2023

lmeyerov reviewed Sep 20, 2023

graphistry/feature_utils.py (outdated)

lmeyerov reviewed Sep 20, 2023

graphistry/feature_utils.py (outdated)

lmeyerov reviewed Sep 20, 2023

graphistry/feature_utils.py (outdated)

lmeyerov reviewed Sep 20, 2023

setup.py (outdated)

dcolinmorgan added 7 commits September 21, 2023 11:20

most comments

5d16a9e

most comments

e931456

most comments

fc212a8

most comments

d4b1fbe

most comments

498a4de

remove single engine flag, try in next PR

aab2ad9

latest cu-cat version

f0eb1bf

dcolinmorgan requested a review from lmeyerov October 12, 2023 07:40

dcolinmorgan added 3 commits December 29, 2023 08:50

edge concat interop

867874d

Merge branch 'master' into feat/gpu-featurization

5a69233

dcolinmorgan force-pushed the feat/gpu-featurization branch from bab7c02 to cdda3e7 on January 2, 2024 09:14

dcolinmorgan added 9 commits January 3, 2024 14:06

renaming

63398b3

renaming

b720bc1

cupyx csr toarray for features_out

ed824ec

cupyx csr toarray for features_out

1735134

cupyx csr toarray for features_out

824d940

add gpu-umap test, allow cucat to test w/o gpu

c7ce92c

add gpu-umap test, allow cucat to test w/o gpu

30a04a4

dirty_cat version with Table&SuperVectorizer

50df365

dirty_cat version with Table&SuperVectorizer

a654f9f

dcolinmorgan mentioned this pull request Jan 4, 2024

Dev/depman gpufeat #517

Open

better dimension try

a86be5c

lmeyerov reviewed Jul 4, 2024

Merge branch 'master' into feat/gpu-featurization

4bd056c
