Cucat Featurization base #486

Open
wants to merge 98 commits into
base: master

@tanmoyio (Member) commented May 15, 2023

Starter script

import pandas as pd
import cudf
import graphistry

df = pd.read_csv('https://gist.githubusercontent.com/silkspace/c7b50d0c03dc59f63c48d68d696958ff/raw/31d918267f86f8252d42d2e9597ba6fc03fcdac2/redteam_50k.csv', index_col=0)
red_team = pd.read_csv('https://gist.githubusercontent.com/silkspace/5cf5a94b9ac4b4ffe38904f20d93edb1/raw/888dabd86f88ea747cf9ff5f6c44725e21536465/redteam_labels.csv', index_col=0)
df['feats'] = df.src_computer + ' ' + df.dst_computer + ' ' + df.auth_type + ' ' + df.logontype
tdf = pd.concat([red_team.reset_index(), df.reset_index()])
tdf['node'] = range(len(tdf))


g = graphistry.nodes(tdf, 'node')
g1 = g.umap(X=['feats'], feature_engine='cu_cat')
print(g1._node_features)
g2 = g.umap(X=['feats'], feature_engine='dirty_cat')
print(g2._node_features)

 if y is None:
     return df
 remove_cols = []
 if y is None:
     pass
-elif isinstance(y, pd.DataFrame):
+elif isinstance(y, pd.DataFrame) or isinstance(y, cudf.DataFrame):
Contributor:

great catch

Contributor:

@dcolinmorgan or (cudf is not None and isinstance(y, cudf.DataFrame))

Contributor:

maybe same problem elsewhere?

Contributor:

If cudf is None, I think the current code would throw an exception.
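A minimal sketch of the guard being suggested here, assuming cudf is resolved to None when it is not installed. The helper name dataframe_types and the stand-in classes are hypothetical and exist only so the example runs without pandas or cudf:

```python
from types import SimpleNamespace


def dataframe_types(pd_module, cudf_module=None):
    """Collect only the DataFrame classes that are actually available."""
    types = [pd_module.DataFrame]
    if cudf_module is not None:  # cudf is optional, so guard before touching it
        types.append(cudf_module.DataFrame)
    return tuple(types)


# Stand-ins for the real modules (hypothetical, for illustration only).
class FakePandasFrame:
    pass


class FakeCudfFrame:
    pass


fake_pd = SimpleNamespace(DataFrame=FakePandasFrame)
fake_cudf = SimpleNamespace(DataFrame=FakeCudfFrame)

# With cudf missing (None), the check still works instead of raising AttributeError.
cpu_only_match = isinstance(FakePandasFrame(), dataframe_types(fake_pd, None))
# With cudf present, both frame types are accepted.
gpu_match = isinstance(FakeCudfFrame(), dataframe_types(fake_pd, fake_cudf))
```

Building the tuple once keeps each isinstance call short; the inline form `cudf is not None and isinstance(y, cudf.DataFrame)` is equivalent for a single check.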

@lmeyerov (Contributor) commented Jun 15, 2023

@tanmoyio can we close, or is this still live and in need of review?

@dcolinmorgan (Contributor)

  • For cu_cat itself (DT3 branch), I have worked out the dynamic memory handling for T4 vs. A100 flexibility.
  • I also worked out datetime passthrough. However, this needs to bypass cudf dataframing in both cu_cat AND pygraphistry so that g.plotter infers datetime correctly and provides the time series box
    -- currently I accomplish this in a hacky way by binding it to the embeddings after transforming but before plotting, thus avoiding the cudf requirement
  • I have now refactored the code to require only the gapencoder and tablevectorizer files/functions (DT4 branch, forked from DT3)

@lmeyerov (Contributor)

Awesome - is the plan to start landing, or is there more to do first?

And would it make sense to start reviewing any PRs? If they form a sequence, can you stack them and point out the order so it's clear?

@dcolinmorgan (Contributor)

Landing would be wonderful -- before the end of July is my dream.

DT4 is the latest cu_cat PR branch; it passes many pytests and has worked as expected in every demo I've done in the last few months.

@lmeyerov (Contributor)

ok @tanmoyio can you help double-check the tests, take it for a test drive, and land first in cu_cat and then here?

After, can you help add it to main graphistry (https://github.com/graphistry/graphistry/blob/master/compose/dockerfiles/base/05-nvidia.Dockerfile)? I think we should keep it default-off for now, and should test that it's truly default-off -- that its existence doesn't (yet) trigger it to be used, only explicit use.
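One way to read the default-off requirement is sketched below. The function resolve_feature_engine and its cu_cat_installed flag are hypothetical and are not part of pygraphistry; only the engine names 'cu_cat' and 'dirty_cat' come from the starter script above:

```python
def resolve_feature_engine(requested=None, cu_cat_installed=False):
    """Pick a featurization engine; cu_cat is used only on explicit request."""
    if requested is None:
        # Default-off: having cu_cat installed must not change the default.
        return "dirty_cat"
    if requested == "cu_cat" and not cu_cat_installed:
        raise ImportError("feature_engine='cu_cat' was requested but cu_cat is not installed")
    return requested


# Installed but not requested: the default stays on the CPU engine.
default_engine = resolve_feature_engine(None, cu_cat_installed=True)
# Installed and explicitly requested: the GPU engine is selected.
explicit_engine = resolve_feature_engine("cu_cat", cu_cat_installed=True)
```

A test for "truly default off" would then assert on the first case in an environment where cu_cat is importable.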

@dcolinmorgan (Contributor) commented Jul 26, 2023

The test-full-ai test at L395 seems to be getting hung up on 1 of 3 features being exactly reproduced

  • when first discussing this with @silkspace, this is exactly what we realized approximate estimation would likely return; the user must make sure the features make sense, just like with dirty_cat
  • we likely need to test a few runs so that 2/3 are always reproduced across several estimations, rather than requiring one case of 3/3 reproduction
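The relaxed check proposed in the second bullet might look like the following sketch. The helper names and the feature labels are made up for illustration; they are not taken from the actual test suite:

```python
def count_reproduced(expected, observed):
    """Number of expected feature names that also appear in one estimation."""
    return len(set(expected) & set(observed))


def mostly_reproduced(expected, runs, min_matches=2):
    """True if every estimation reproduces at least `min_matches` expected features."""
    return all(count_reproduced(expected, run) >= min_matches for run in runs)


# Hypothetical feature names from three approximate estimations.
expected = ["feat_a", "feat_b", "feat_c"]
runs = [
    ["feat_a", "feat_b", "feat_x"],  # 2 of 3 reproduced
    ["feat_a", "feat_b", "feat_c"],  # 3 of 3 reproduced
    ["feat_b", "feat_c", "feat_y"],  # 2 of 3 reproduced
]
stable = mostly_reproduced(expected, runs)
```

This tolerates one differing feature per run while still failing when an estimation drifts further from the expected set.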

# return False, object
def check_cudf():
    try:
        import cudf
Contributor:

  1. Who calls this on import?

  2. And can this be a) a cached call that b) checks the module path instead of doing an import?
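A sketch of what points a) and b) could look like with the standard library. The name has_module is hypothetical; importlib.util.find_spec locates a top-level module without executing it, and functools.lru_cache makes repeat calls free:

```python
import importlib.util
from functools import lru_cache


@lru_cache(maxsize=None)
def has_module(name: str) -> bool:
    """True if the top-level module `name` can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


json_available = has_module("json")
missing_available = has_module("surely_not_a_real_module_xyz123")
has_module("json")  # second call is served from the cache
```

This only answers whether the module is installed. A module that is found can still fail at import time (for example, cudf on a machine without a usable GPU), so the real import may still need its own error handling.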

@dcolinmorgan (Contributor), Jul 29, 2023:

It's only test_embed_utils#L14 that calls check_cudf; I swapped it out for lazy_cudf_import from umap_utils

Contributor:

oh, so that shouldn't be the issue, right? test/* shouldn't get imported by import graphistry..

Contributor:

Sorry, I wasn't clear: test_embed_utils is the only OTHER place lazy_cudf_import was present. It was used in embed_utils, where it imported cudf to check the df dtype, but I have swapped it out for a check via getmodule, e.g. if 'cudf' in str(getmodule(self._nodes)): , so I believe the problem is solved -- tuna looks much better
Screenshot 2023-08-01 at 10 44 04
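For illustration, here is a variant of that idea that looks at the class's module name rather than the string form of the module object. The helper comes_from_package and the stand-in class are hypothetical; the PR itself uses the getmodule expression quoted above:

```python
def comes_from_package(obj, package: str) -> bool:
    """True if obj's class is defined under `package`, judged by module name only."""
    root = type(obj).__module__.split(".")[0]
    return root == package


# Stand-in class so the sketch runs without cudf installed (hypothetical).
class FakeGpuFrame:
    __module__ = "cudf.core.dataframe"


gpu_frame_detected = comes_from_package(FakeGpuFrame(), "cudf")
plain_list_detected = comes_from_package([1, 2, 3], "cudf")
```

Comparing the root package exactly avoids false positives from unrelated modules whose names merely contain "cudf", and neither form triggers an import of cudf.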

@lmeyerov (Contributor)

graphistry/embed_utils.py (outdated, resolved)
@@ -62,7 +72,7 @@
 SentenceTransformer = Any
 SuperVectorizer = Any
 GapEncoder = Any
-SimilarityEncoder = Any
+# SimilarityEncoder = Any
Contributor:

remove all these dead lines

X = np.round(X, decimals=keep_n_decimals) # type: ignore # noqa
X = pd.DataFrame(X, columns=columns, index=index)
else:
X = transformer.fit_transform(X.to_numpy())
Contributor:

how do we know if the transformer is cpu vs gpu? it seems to always assume cpu here, but if X is cudf and transformer is gpu, can't we keep X on gpu?

@lmeyerov (Contributor), Aug 2, 2023:

A sometimes-OK solution would be checking whether the transformer is from cuml or maybe cu_cat, but that seems non-generalizable

Contributor:

Wow, this nearly gave me a heart attack -- good thoughts that I will work with... but this is an artifact; no .to_numpy is needed



def make_safe_gpu_dataframes(X, y, engine):
    has_cudf_dependancy_, _, cudf = lazy_import_has_dependancy_cu_cat()
Contributor:

Add assert cudf is not None ?

Contributor:

also probably good to switch from lazy_import...cu_cat to a cudf one

Contributor:

OK -- placing it after the if statement seems best here again, like the other assert you mentioned

     yc = y.columns
     xc = df.columns
     for c in yc:
         if c in xc:
             remove_cols.append(c)
-elif isinstance(y, pd.Series):
+elif isinstance(y, pd.Series) or isinstance(y, cudf.Series):
Contributor:

handle non-cu_cat import returning None for cudf

X = transformer.fit_transform(X.to_numpy())
if keep_n_decimals:
X = np.round(X, decimals=keep_n_decimals) # type: ignore # noqa
_, _, cudf = lazy_import_has_dependancy_cu_cat()
Contributor:

assert cudf is not None

Contributor:

good practice to assert even after if statement check? good to know, yay learning

Contributor:

It's useful when you don't want the 'if' but can imagine future changes or misuses accidentally getting the assumption wrong
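A small sketch of that pattern, assuming a lazy import helper that returns None for cudf when it is unavailable. The names lazy_import_cudf and to_gpu_frame, and the engine value "gpu", are hypothetical and are not the PR's actual functions:

```python
def lazy_import_cudf():
    """Return (available, module); module is None when cudf cannot be imported."""
    try:
        import cudf
        return True, cudf
    except ImportError:
        return False, None


def to_gpu_frame(data, engine, lazy_import=lazy_import_cudf):
    """Convert data to a cudf DataFrame only when a GPU engine is requested."""
    if engine != "gpu":
        return data
    _, cudf = lazy_import()
    # Callers should only pass engine="gpu" when cudf exists; the assert makes a
    # broken assumption fail loudly here rather than as an AttributeError later.
    assert cudf is not None, "engine='gpu' requires cudf"
    return cudf.DataFrame(data)


cpu_result = to_gpu_frame({"x": [0, 1, 2]}, engine="cpu")
```

Note that Python strips assert statements when run with -O, so this documents and checks an assumption during development and testing rather than serving as user-facing validation.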

setup.py (outdated, resolved)
@dcolinmorgan dcolinmorgan requested a review from lmeyerov October 12, 2023 07:40
umap match transpose index

type-spec concat

type-spec concat

dc for comp_cluster

dirty_cat as default, cc passes most tests ;)

source cu_cat from pypi

source cu_cat from pypi

remove cc tests, tested for in dc place

remove cc tests, tested for in dc place

init 1dc > 2cc

init 1dc > 2cc

use constants throughout

revert from constants

revert from constants

init 1dc > 2cc

better dc default

better dc default
@dcolinmorgan dcolinmorgan force-pushed the feat/gpu-featurization branch from bab7c02 to cdda3e7 Compare January 2, 2024 09:14
@@ -157,6 +157,54 @@ jobs:
source pygraphistry/bin/activate
./bin/test-umap-learn-core.sh


test-gpu-umap: # well cpu until get a github actions gpu node
Contributor:

maybe we should add some sort of test to confirm the gpu is enabled + gpu is used, e.g.,

nvidia-smi || exit 1

python3 -c "import cudf; cudf.DataFrame({'x': [0,1,2]})['x'].sum()"

Contributor:

if that is overreach for this pr, we should comment this out

@lmeyerov (Contributor) commented Jul 4, 2024

This seems to have drifted a bit from main; see the merge conflict.

I'm unsure about the cu_cat bits here, but if this PR replaces a bunch of import cudf statements with dynamic "cudf.dataframe" in str(module(df)) checks to avoid slow imports, that sounds useful & overdue...

4 participants