Cudf #445

Closed
wants to merge 463 commits into from
463 commits
23df5bc
cudf all the way thru, cuda cannot handle nulls so few more ifs
Mar 6, 2023
a2640cd
cudf+umap working on numerics
Mar 6, 2023
ec151b8
full numeric cudf-- needs hack to plot
Mar 7, 2023
0ec412a
full numeric cudf-- needs hack to plot
Mar 7, 2023
55c2b07
use rename if 2 columns, otherwise = to columns list
Mar 7, 2023
0a42090
umap cudf
Mar 9, 2023
1a013a8
untested, adds decorator for cuml and pandas dataframes, standardizin…
Mar 10, 2023
c23529b
rst change PlotterBase to plotter
tanmoyio Mar 10, 2023
c79fac6
add graphistry.compute to graphistry.rst
tanmoyio Mar 10, 2023
fbc33e3
Delete graphistry.compute.rst
tanmoyio Mar 10, 2023
35345ff
Update modules.rst
tanmoyio Mar 10, 2023
811d3e5
add graphistry.compute to toctree
tanmoyio Mar 10, 2023
d6806cf
resolve short underline error
tanmoyio Mar 10, 2023
0b301ca
test1: resolve blank line error
tanmoyio Mar 10, 2023
24638cf
test2: resolve blank line error
tanmoyio Mar 10, 2023
9e0c3d2
test3: resolve blank line error
tanmoyio Mar 10, 2023
ac15070
doc fix
tanmoyio Mar 10, 2023
f53710c
doc fix
tanmoyio Mar 10, 2023
a8a9dc1
doc fix
tanmoyio Mar 10, 2023
8618b1a
doc fix
tanmoyio Mar 10, 2023
3a6611d
doc fix
tanmoyio Mar 10, 2023
82d1d3b
doc fix
tanmoyio Mar 10, 2023
3fe4cee
add versioneer
tanmoyio Mar 10, 2023
8199471
add versioneer
tanmoyio Mar 10, 2023
d01e2bf
fix(docstr): removed unneccesary indents/sections
dess890 Mar 10, 2023
812f5c7
fix(docstr): removed unneccessary spacing
dess890 Mar 10, 2023
b88ce77
fix(docstr): fixed unindent
dess890 Mar 10, 2023
940be4e
fix(docstr): removed line/spacing
dess890 Mar 10, 2023
fdb7d2b
Revert "umap cudf"
Mar 11, 2023
17fd316
Revert "umap cudf"
Mar 11, 2023
af399f6
revert b4 alex borq
Mar 11, 2023
3f3a1f4
adds cudf support and wraps dataframe if engine=cuml
Mar 12, 2023
446bdc8
adds safe gpu wrapper to edges as well
Mar 12, 2023
271e3e1
merges cleanup branch and resolves discrepencies
Mar 12, 2023
96363f5
adds safer handling of cupy/cudf arrays
Mar 12, 2023
6dc826a
removes unused code
Mar 12, 2023
27c879b
handles pandas via if statement
Mar 12, 2023
4f01bde
adds missing engine flag
Mar 12, 2023
93cadb1
adds missing _ to output
Mar 12, 2023
f479857
begin testing cudf-cu_cat
Mar 13, 2023
923ec9c
begin testing cudf-cu_cat
Mar 13, 2023
b4e110d
begin cudf_cat for 3x to 10x
Mar 13, 2023
b6a60ea
adds handling of g.transform_umap when engine=cuml
Mar 13, 2023
84741c8
safe converts X, y before infer_graph method
Mar 13, 2023
6aefc49
debugging why node not in df
Mar 13, 2023
caf367c
fix typo
Mar 13, 2023
40914dc
adds test if node not in df, adds numeric index
Mar 13, 2023
b8a7eb5
adds solution to both infer_graph and infer_self_graph functions
Mar 13, 2023
b71d8ad
fixes concat between cudf and pd
Mar 13, 2023
8be6aa0
fixes bug
Mar 13, 2023
c1bd253
merges recent cleanup changes into cudf-alex2 in new branch
Mar 14, 2023
7482eb6
bring enc_X to cudf
Mar 14, 2023
e1ceb59
coerces previously fit X to pandas if cudf
Mar 14, 2023
0dd3ced
seemingly unavoidable cudf edge transforms to pandas
Mar 15, 2023
593b31f
merge cudf, cudf-cat, cudf-alex3
Mar 16, 2023
73f8c95
incorp cudf-alex3 for cudf-cu_cat
Mar 16, 2023
c2ce1df
incorp cudf-alex3 for cudf-cu_cat
Mar 16, 2023
ba56292
naive cudf tests
Mar 16, 2023
f7b7156
merge clean, add naive tests
Mar 16, 2023
0b8028a
fix(docstr): added layout & compute to nav
dess890 Mar 16, 2023
299adf1
fix(docstr): removed unneccessary lines
dess890 Mar 16, 2023
eaf69a5
forces dbscan to run on cpu only -- once https://github.com/rapidsai/…
Mar 16, 2023
8782a6d
adds engine_dbscan flag and if [sklearn ,umap_learn] will allow g.tra…
Mar 16, 2023
44351ba
bug fix
Mar 17, 2023
c2d70e2
explicit engine_dbscan flags
Mar 17, 2023
48fe85f
bug fix
Mar 17, 2023
d2611a3
adds pandas coercion before infer_graph
Mar 17, 2023
d4030bc
bug fix
Mar 17, 2023
42e8eb3
bug fix
Mar 17, 2023
2027c35
bug fix
Mar 17, 2023
13d9173
bug fix
Mar 17, 2023
d288597
bug fix
Mar 17, 2023
dfba41c
bug fix
Mar 17, 2023
fa2a6ba
lint
Mar 17, 2023
ea6ce34
lint
Mar 17, 2023
118e36d
adds modified test for dbscan params
Mar 18, 2023
911529c
feat(iframe): added a graph homepg, can be removed
dess890 Mar 20, 2023
2c9dc78
adds NVIDIA GTC demo that installs from branch
Mar 20, 2023
e1f64c5
removes print statements and changes default umap settings
Mar 20, 2023
c9249c0
typoo
Mar 20, 2023
6e85a37
merges cudf-alex3
Mar 21, 2023
c583814
moves feature_engine resolve to logical order
Mar 21, 2023
a5a626a
changes umap spread parameter default to 1
Mar 21, 2023
c5dc84d
refactors core featurize and umap engines so that cudf and pd are con…
Mar 21, 2023
69659f8
fixes typo and disambiguates between umap_engine and umap_engine_
Mar 21, 2023
aedc81e
lazy cudf import (thx alex)
Mar 21, 2023
465d486
lazy cudf import (thx alex)
Mar 21, 2023
87d706c
changed umap spread to 1
silkspace Mar 22, 2023
08ad02c
lazy cudf import, pin |torch for now
Mar 28, 2023
521dc5b
lazy cudf import, pin |torch for now
Mar 28, 2023
ba99c40
resolve f_engine
Mar 28, 2023
5b55f24
adds cudf conversion inside _featurize_* calls
Mar 30, 2023
5e0a1ce
adds print
Mar 30, 2023
23da7fc
pulls cudf X into pandas for FAISS indexing
Mar 31, 2023
720c7f7
merge cudf + cudf_cat_alex2
Mar 31, 2023
8b13311
placeholder cu_cat setup
Mar 31, 2023
434256c
tweaks needed for gpu cu_cat
Apr 3, 2023
0807e76
fix(docs): removed iframe (for now)
dess890 Apr 4, 2023
d25faf8
fix: cuml umap and tests fix
tanmoyio Apr 4, 2023
d9987de
lint: flake8 typo
tanmoyio Apr 4, 2023
168af4b
fix: _dgl_graph fix
tanmoyio Apr 4, 2023
fee5452
typo
tanmoyio Apr 4, 2023
60d3b97
test: remove xfail test_dgl_utils
tanmoyio Apr 4, 2023
4d74ade
test: remove StartTime test_dgl_utils temp
tanmoyio Apr 4, 2023
aaf275f
pinned pandas
tanmoyio Apr 4, 2023
fd423a2
alex2/3 cucat req
Apr 5, 2023
5f2cb69
more tests
tanmoyio Apr 5, 2023
408ee52
stable
tanmoyio Apr 5, 2023
c9d1c95
Merge branch 'master' into cudf-final
tanmoyio Apr 5, 2023
762b2d2
fix(modules.rst) testing to ci warnings
dess890 Apr 5, 2023
e0abb95
fix(conf.py); added plugins to nitpick
dess890 Apr 5, 2023
6ebb74a
fix(modules): added title
dess890 Apr 5, 2023
63d665b
fix(modules.rst): added title for ci testing
dess890 Apr 5, 2023
18e0bef
doc fix
tanmoyio Apr 5, 2023
ee5be00
Merge branch 'cudf-final' of https://github.com/graphistry/pygraphist…
tanmoyio Apr 5, 2023
6a5bec9
fix(rst) docs fixes for CI passing
dess890 Apr 5, 2023
a58f279
umap trick for cudf dfs
tanmoyio Apr 5, 2023
827ae22
ignore args type
tanmoyio Apr 5, 2023
b95400e
feat(rst) added badges
dess890 Apr 5, 2023
cb10f3c
fix(plotter): adding plotter to menu (will update)
dess890 Apr 5, 2023
10907c3
plotterbase to plotter
tanmoyio Apr 5, 2023
9b5c3db
merge
tanmoyio Apr 5, 2023
dcf60ac
addStyle to add_style
tanmoyio Apr 5, 2023
75f11b0
fix(rst): revert plotter changes for ci test
dess890 Apr 5, 2023
953cccc
all addStyle to add_style
tanmoyio Apr 5, 2023
edc1f6d
resolve conflicts
tanmoyio Apr 6, 2023
8d64481
fix(plotter): expanding menu
dess890 Apr 6, 2023
5900e2f
fix(docst) added umap to articles
dess890 Apr 6, 2023
0850778
feat(docst) added photo for home pg
dess890 Apr 6, 2023
c501bca
Merge branch 'navbar-fixes' of github.com:graphistry/pygraphistry int…
dess890 Apr 6, 2023
1234dd7
test add chain in __init__.py
tanmoyio Apr 6, 2023
3b3654f
Merge branch 'navbar-fixes' of https://github.com/graphistry/pygraphi…
tanmoyio Apr 6, 2023
c7bc46d
test nitpick
tanmoyio Apr 6, 2023
a88961b
test nitpick 2
tanmoyio Apr 6, 2023
f8d6ee1
test
tanmoyio Apr 6, 2023
8115020
test 3
tanmoyio Apr 6, 2023
e581f97
test 4
tanmoyio Apr 6, 2023
bde96b1
test 5
tanmoyio Apr 6, 2023
f66856c
test 6
tanmoyio Apr 6, 2023
06f2652
test 7
tanmoyio Apr 6, 2023
254b31b
test 8
tanmoyio Apr 6, 2023
1d9722c
test 9
tanmoyio Apr 6, 2023
073eaec
test 10
tanmoyio Apr 6, 2023
1e82076
test 11
tanmoyio Apr 6, 2023
1a68d74
test 12
tanmoyio Apr 6, 2023
8320dd5
test 13
tanmoyio Apr 6, 2023
a1941ba
test 14
tanmoyio Apr 6, 2023
4361ae6
final fix
tanmoyio Apr 6, 2023
db8c228
final fix 1
tanmoyio Apr 6, 2023
5e9a577
final fix 2
tanmoyio Apr 6, 2023
52506b1
fix(conf.py): added converter for badges
dess890 Apr 6, 2023
7114bc8
fix(conf.py): removed img converter
dess890 Apr 6, 2023
224351f
test(conf.py): using only directive for ci testing
dess890 Apr 7, 2023
cf25d29
test(conf.py): testing only directive
dess890 Apr 7, 2023
b17af60
fix(docstr): removed slack badge
dess890 Apr 7, 2023
385a0ad
test(docstr): testing to see if uptime is failing
dess890 Apr 7, 2023
7a74dc8
cudf edge reqs
Apr 7, 2023
402c544
merge cudf-final
Apr 7, 2023
6b0056a
cu_cat refactor
Apr 7, 2023
800e2ba
merge conflicts with navbar-fixes
tanmoyio Apr 10, 2023
e79a3e6
typo: graphistry.rst compute.cluster
tanmoyio Apr 10, 2023
9c445f1
fix: duplicate entries docs
tanmoyio Apr 10, 2023
da34a54
TestFeatureCUMLProcessors
Apr 11, 2023
20dc72d
merge cudf_cat to main cudf
Apr 11, 2023
ce17b8e
tests: add cudf umap pass through
tanmoyio Apr 11, 2023
5ff14f8
lint: flake8 fixes
tanmoyio Apr 11, 2023
77eb8c2
fix: cudf umap skip
tanmoyio Apr 11, 2023
c2d2fcb
need to make cudf import for edges lazy
Apr 12, 2023
c703a42
test: cudf tests with docker flag
tanmoyio Apr 12, 2023
006aa7d
delete: .swp files
tanmoyio Apr 12, 2023
d634d91
add: test_umap_utils on test-gpu-local.sh
tanmoyio Apr 12, 2023
400b632
fix: revert back to addStyle from add_style
tanmoyio Apr 13, 2023
e6fc323
added warnings for predict_links
tanmoyio Apr 13, 2023
c300050
fix: test cudf flag
tanmoyio Apr 13, 2023
1305db1
doc: changelog update for _dgl_graph
tanmoyio Apr 13, 2023
7ad6449
passing gpu test_feature_utils
tanmoyio Apr 13, 2023
2395147
additional checks for embed utils
tanmoyio Apr 13, 2023
7334b4b
merge
dcolinmorgan Apr 14, 2023
d81932e
merge cudf-cat-final
dcolinmorgan Apr 14, 2023
8e99fe3
cu_cat flag in umap
dcolinmorgan Apr 14, 2023
2e9820c
added tanmoy umap changes
dcolinmorgan Apr 14, 2023
e06ce0c
flake8 fix
tanmoyio Apr 17, 2023
b7ce57e
Merge branch 'cudf-final' of https://github.com/graphistry/pygraphist…
tanmoyio Apr 17, 2023
8f5a40f
mypy ignore hyperdask
tanmoyio Apr 17, 2023
1660774
mypy ignore _version.py
tanmoyio Apr 17, 2023
7574e97
temp fix for cudf objects in embed
tanmoyio Apr 17, 2023
a9017fa
lint
dcolinmorgan Apr 18, 2023
65ce26b
lint
dcolinmorgan Apr 18, 2023
ab7fd8e
lint
dcolinmorgan Apr 18, 2023
d3d3071
type: ignore cu_cat import
dcolinmorgan Apr 18, 2023
88adafc
type: ignore cu_cat import
dcolinmorgan Apr 18, 2023
63f6044
base_extras_heavy[cu_cat]
dcolinmorgan Apr 18, 2023
1887a82
base_extras_heavy[cu_cat]
dcolinmorgan Apr 18, 2023
8ec6c6e
base_extras_heavy[cu-cat]
dcolinmorgan Apr 18, 2023
6f85ee1
base_extras_heavy[cu-cat]
dcolinmorgan Apr 18, 2023
4acc4d8
base_extras_heavy[cu-cat]
dcolinmorgan Apr 18, 2023
0e0ae32
long_version_py ignore type
dcolinmorgan Apr 18, 2023
7ec91f7
egg-0.02.0
dcolinmorgan Apr 18, 2023
d78628b
egg-0.02.0
dcolinmorgan Apr 18, 2023
429f6a3
rm egg
dcolinmorgan Apr 18, 2023
6d6ae25
fix: cu_cat missing stubs ignore
tanmoyio Apr 18, 2023
4204713
skip feature_utils cudf tests
tanmoyio Apr 18, 2023
f1a0b2c
Merge branch 'cudf-final' of https://github.com/graphistry/pygraphist…
tanmoyio Apr 18, 2023
037975e
sklearn FunctionTransformer, no lazy cuml import
dcolinmorgan Apr 19, 2023
2631564
sklearn FunctionTransformer, no lazy cuml import
dcolinmorgan Apr 19, 2023
71cfcd5
some fixes for cpu checks(gpu issues are still there)
tanmoyio Apr 19, 2023
9755e44
flake: fix
tanmoyio Apr 19, 2023
a03a294
merge with cudf-final embed
dcolinmorgan Apr 19, 2023
19e9f6c
merge with cudf-final embed
dcolinmorgan Apr 19, 2023
9d673b4
assert cudf not import
dcolinmorgan Apr 20, 2023
03a4042
assert cudf not import
dcolinmorgan Apr 20, 2023
071f1c1
lazy not assert
dcolinmorgan Apr 20, 2023
dd38945
Update embed_utils.py
tanmoyio Apr 24, 2023
e96bf01
migrate check_cudf to embed_utils.py
tanmoyio Apr 24, 2023
827984d
Update embed_utils.py
tanmoyio Apr 24, 2023
a95371a
Merge branch 'cudf-cat-final' into cudf
dcolinmorgan Apr 25, 2023
a7e28cd
Merge branch 'cudf-final' into cudf
dcolinmorgan Apr 25, 2023
13f6b7e
lint
dcolinmorgan Apr 25, 2023
339779b
lint
dcolinmorgan Apr 25, 2023
ff213fb
lint
dcolinmorgan Apr 25, 2023
084395a
lint
dcolinmorgan Apr 25, 2023
3068e6a
merge cudf-final
dcolinmorgan Apr 25, 2023
70b50d9
lint
dcolinmorgan Apr 25, 2023
79cafba
Merge branch 'cudf-cat-final' into cudf
dcolinmorgan Apr 25, 2023
06a691b
lazy cudf import
dcolinmorgan Apr 25, 2023
4b779ac
lazy cudf import
dcolinmorgan Apr 25, 2023
9b25eec
Merge branch 'cudf-cat-final' into cudf
dcolinmorgan Apr 25, 2023
c1a0cca
lint
dcolinmorgan Apr 25, 2023
f853472
better lazy cudf import
dcolinmorgan Apr 25, 2023
8e05dd4
Merge branch 'cudf-cat-final' into cudf
dcolinmorgan Apr 25, 2023
b6b148b
lazy merge
dcolinmorgan Apr 25, 2023
17f0af6
lint
dcolinmorgan Apr 25, 2023
2a5c879
lint
dcolinmorgan Apr 25, 2023
80ba095
functiontransform cuml import
dcolinmorgan Apr 25, 2023
6709bea
functiontransform cuml import
dcolinmorgan Apr 25, 2023
901846c
functiontransform cuml import
dcolinmorgan Apr 25, 2023
118ea80
functiontransform cuml import
dcolinmorgan Apr 25, 2023
f1ee230
functiontransform cuml import
dcolinmorgan Apr 25, 2023
7fc02e2
use dirty_cat superVec for torch/etc, except if cu_cat
dcolinmorgan Apr 26, 2023
d38f469
use dirty_cat superVec for torch/etc, except if cu_cat
dcolinmorgan Apr 26, 2023
6436067
use dirty_cat superVec for torch/etc, except if cu_cat
dcolinmorgan Apr 26, 2023
25573ea
sklearn functiontransformer & MLB
dcolinmorgan Apr 26, 2023
dee5ad4
sklearn functiontransformer & MLB
dcolinmorgan Apr 26, 2023
cac6cc4
all preprocess back to sklearn
dcolinmorgan Apr 26, 2023
3757b10
import FT again
dcolinmorgan Apr 26, 2023
523d180
rewrite g_n_t
dcolinmorgan Apr 26, 2023
97b725d
revert g_n_t
dcolinmorgan Apr 26, 2023
7ab97a4
import FT in get_numeric_transform
dcolinmorgan Apr 27, 2023
aba0c55
import FT in get_numeric_transform
dcolinmorgan Apr 27, 2023
fb96400
latest release opt-in install
dcolinmorgan May 10, 2023
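
Several commits in this list ("lazy cudf import", "better lazy cudf import", "need to make cudf import for edges lazy") defer importing GPU libraries until they are actually needed, so CPU-only installs keep working. A minimal sketch of that pattern follows. The helper names `lazy_import` and `resolve_engine` are hypothetical and chosen for illustration; they are not taken from the PR's code.

```python
import importlib
from types import ModuleType
from typing import Optional, Tuple


def lazy_import(name: str) -> Tuple[bool, Optional[ModuleType]]:
    """Import `name` only when called; report failure instead of raising."""
    try:
        return True, importlib.import_module(name)
    except ImportError:
        # Module is not installed (or failed to import): signal that to the caller.
        return False, None


def resolve_engine(requested: str = "auto") -> str:
    """Return an explicit request unchanged; for 'auto', probe for cudf."""
    if requested != "auto":
        return requested
    has_cudf, _ = lazy_import("cudf")
    return "cudf" if has_cudf else "pandas"
```

Because the probe happens inside a function rather than at module import time, merely importing the package never touches cudf.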
10 changes: 10 additions & 0 deletions .github/workflows/ci.yml
@@ -199,11 +199,21 @@ jobs:
source pygraphistry/bin/activate
./bin/typecheck.sh

- name: Full dbscan tests (rich featurize)
run: |
source pygraphistry/bin/activate
./bin/test-dbscan.sh

- name: Full feature tests (rich featurize)
run: |
source pygraphistry/bin/activate
./bin/test-features.sh

- name: Full search tests (rich featurize)
run: |
source pygraphistry/bin/activate
./bin/test-text.sh

- name: Full umap tests (rich featurize)
run: |
source pygraphistry/bin/activate
4 changes: 4 additions & 0 deletions .gitignore
@@ -1,3 +1,7 @@
# vim temporary files
*.swp
*.swo

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
10 changes: 10 additions & 0 deletions CHANGELOG.md
@@ -7,9 +7,19 @@ This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.htm

## [Development]

### Added
* AI: moves public `g.g_dgl` from KG `embed` method to private method `g._kg_dgl`
* AI: moves public `g.DGL_graph` to private attribute `g._dgl_graph`
* AI: BREAKING CHANGE: to return matrices during transform, set `return_graph=False`, as in `X, y = g.transform(df, return_graph=False)`. The default, `g2 = g.transform(df)`, now returns a Plottable instance.

## [0.28.7 - 2022-12-22]

### Added
* AI: all `transform_*` methods return graphistry Plottable instances, using an infer_graph method. To return matrices, set the `return_graph=False` flag.
* AI: adds `g.get_matrix(**kwargs)` general method to retrieve (sub)-feature/target matrices
* AI: DBSCAN -- `g.featurize().dbscan()` and `g.umap().dbscan()` with options to use UMAP embedding, feature matrix, or subset of feature matrix via `g.dbscan(cols=[...])`
* AI: Demo cleanup using ModelDict & new features, refactoring demos using `dbscan` and `transform` methods.
* Tests: dbscan tests
* AI: Easy import of featurization kwargs for `g.umap(**kwargs)` and `g.featurize(**kwargs)`
* AI: `g.get_features_by_cols` returns featurized submatrix with `col_part` in their columns
* AI: `g.conditional_graph` and `g.conditional_probs` assessing conditional probs and graph
90 changes: 72 additions & 18 deletions README.md
@@ -358,61 +358,72 @@ Automatically and intelligently transform text, numbers, booleans, and other for
g = g.umap() # UMAP, GNNs, use features if already provided, otherwise will compute

# other pydata libraries
X = g._node_features # g._get_feature('nodes')
y = g._node_target # g._get_target('nodes')
X = g._node_features # g._get_feature('nodes') or g.get_matrix()
y = g._node_target # g._get_target('nodes') or g.get_matrix(target=True)
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor().fit(X, y) #assumes train/test split
new_df = pandas.read_csv(...)
X_new, _ = g.transform(new_df, None, kind='nodes')
model = RandomForestRegressor().fit(X, y) # assumes train/test split
new_df = pandas.read_csv(...) # mini batch
X_new, _ = g.transform(new_df, None, kind='nodes', return_graph=False)
preds = model.predict(X_new)
```

* Encode model definitions and compare models against each other

```python
# graphistry
from graphistry.features import search_model, topic_model, ngrams_model, ModelDict, default_featurize_parameters
from graphistry.features import search_model, topic_model, ngrams_model, ModelDict, default_featurize_parameters, default_umap_parameters

g = graphistry.nodes(df)
g2 = g.umap(X=[..], y=[..], **search_model)

# set custom encoding model with any feature kwargs
# set custom encoding model with any feature/umap/dbscan kwargs
new_model = ModelDict(message='encoding new model parameters is easy', **default_featurize_parameters)
new_model.update(dict(
y=[...],
kind='edges',
model_name='sbert/hf/a_cool_transformer_model',
model_name='sbert/cool_transformer_model',
use_scaler_target='kbins',
n_bins=11,
strategy='normal'))
print(new_model)

g3 = g.umap(X=[..], **new_model)
# compare g2 vs g3 or add to different pipelines
# ...
```


See `help(g.featurize)` for more options

### [sklearn-based UMAP](https://umap-learn.readthedocs.io/en/latest/), [cuML-based UMAP](https://docs.rapids.ai/api/cuml/stable/api.html?highlight=umap#cuml.UMAP)

* Reduce dimensionality and plot a similarity graph from feature vectors:
* Reduce dimensionality by plotting a similarity graph from feature vectors:

```python
# automatic feature engineering, UMAP
g = graphistry.nodes(df).umap()

# plot the similarity graph even though there was no explicit edge_dataframe passed in -- it is created during UMAP.
# plot the similarity graph without any explicit edge_dataframe passed in -- it is created during UMAP.
g.plot()
```

* Apply a trained model to new data:

```python
new_df = pd.read_csv(...)
embeddings, X_new, _ = g.transform_umap(new_df, None, kind='nodes')
embeddings, X_new, _ = g.transform_umap(new_df, None, kind='nodes', return_graph=False)
```
* Infer a new graph from new data using the old umap coordinates to run inference without having to train a new umap model.

```python
new_df = pd.read_csv(...)
g2 = g.transform_umap(new_df, return_graph=True) # return_graph=True is default
g2.plot() #

# or if you want the new minibatch to cluster to closest points in previous fit:
g3 = g.transform_umap(new_df, return_graph=True, merge_policy=True)
g3.plot() # useful to see how new data connects to old -- play with `sample` and `n_neighbors` to control how much of old to include
```


* UMAP supports many options, such as supervised mode, working on a subset of columns, and passing arguments to underlying `featurize()` and UMAP implementations (see `help(g.umap)`):

@@ -451,11 +462,11 @@ See `help(g.umap)` for more options

from [your_training_pipeline] import train, model
# Train
g = graphistry.nodes(df).build_gnn(y=`target`)
g = graphistry.nodes(df).build_gnn(y_nodes=`target`)
G = g.DGL_graph
train(G, model)
# predict on new data
X_new, _ = g.transform(new_df, None, kind='nodes' or 'edges') # no targets
X_new, _ = g.transform(new_df, None, kind='nodes' or 'edges', return_graph=False) # no targets
predictions = model.predict(G_new, X_new)
```

@@ -480,12 +491,21 @@ GNN support is rapidly evolving, please contact the team directly or on Slack fo
#encode text as paraphrase embeddings, supports any sbert model
model_name = "paraphrase-MiniLM-L6-v2")

# or use convenience `ModelDict` to store parameters

from graphistry.features import search_model
g2 = g.featurize(X = ['text_col_1', .., 'text_col_n'], kind='nodes', **search_model)

# query using the power of transformers to find richly relevant results

results_df, query_vector = g2.search('my natural language query', ...)

print(results_df[['_distance', 'text_col_1', ..., 'text_col_n']]) #sorted by relevancy
print(results_df[['_distance', 'text_col', ..]]) #sorted by relevancy

# or see graph of matching entities and original edges

# or see graph of matching entities and similarity edges (or optional original edges)
g2.search_graph('my natural language query', ...).plot()

```


@@ -521,7 +541,7 @@ See `help(g.search_graph)` for options
relation=['relationship_1', 'relationship_4', ..],
destination=['entity_l', 'entity_m', ..],
threshold=0.9, # score threshold
return_dataframe=False) # set to `True` to return dataframe, or just access via `g5._edges`
return_dataframe=False) # set to `True` to return dataframe, or just access via `g4._edges`
```

* Detect Anomalous Behavior (example use cases such as Cyber, Fraud, etc)
@@ -552,8 +572,42 @@ See `help(g.search_graph)` for options
g2.predict_links_all(threshold=0.95).plot()
```

See `help(g.embed)`, `help(g.predict_links)` , `help(g.predict_links_all)` for options
See `help(g.embed)`, `help(g.predict_links)` , or `help(g.predict_links_all)` for options

### DBSCAN

* Enrich UMAP embeddings or featurization dataframe with GPU or CPU DBSCAN

```python
g = graphistry.edges(edf, 'src', 'dst').nodes(ndf, 'node')

# cluster by UMAP embeddings
kind = 'nodes'  # or 'edges'
g2 = g.umap(kind=kind).dbscan(kind=kind)
print(g2._nodes['_dbscan'])  # or g2._edges['_dbscan'] when kind='edges'

# dbscan in `umap` or `featurize` via flag
g2 = g.umap(dbscan=True, min_dist=0.2, min_samples=1)

# or via chaining,
g2 = g.umap().dbscan(min_dist=1.2, min_samples=2, **kwargs)

# cluster by feature embeddings
g2 = g.featurize().dbscan(**kwargs)

# cluster by a given set of feature column attributes, inherited from `g.get_matrix(cols)`
g2 = g.featurize().dbscan(cols=['ip_172', 'location', 'alert'], **kwargs)

# equivalent to above (ie, cols != None and umap=True will still use features dataframe, rather than UMAP embeddings)
g2 = g.umap().dbscan(cols=['ip_172', 'location', 'alert'], umap=True | False, **kwargs)
g2.plot() # color by `_dbscan`

new_df = pd.read_csv(..)
# transform on new data according to fit dbscan model
g3 = g2.transform_dbscan(new_df)
```

See `help(g.dbscan)` or `help(g.transform_dbscan)` for options
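
For readers unfamiliar with the clustering step, here is a toy, pure-Python sketch of DBSCAN-style density clustering. It is not the implementation used by graphistry, scikit-learn, or cuML. The function name `toy_dbscan` is hypothetical, the parameter names `eps` and `min_samples` mirror scikit-learn's, and labeling noise as `-1` follows scikit-learn's convention.

```python
from math import dist


def toy_dbscan(points, eps, min_samples):
    """Label each point with a cluster id, or -1 for noise."""
    NOISE, UNSEEN = -1, None
    labels = [UNSEEN] * len(points)

    def neighbors(i):
        # All points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNSEEN:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_samples:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not UNSEEN:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_samples:
                queue.extend(j_nbrs)  # core point: keep growing the cluster
    return labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
print(toy_dbscan(points, eps=1.5, min_samples=2))  # [0, 0, 0, 1, 1, -1]
```

The brute-force neighbor search is quadratic in the number of points, which is fine for illustration; production libraries rely on spatial indexes or GPU kernels instead.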

### Quickly configurable

15 changes: 15 additions & 0 deletions bin/test-dbscan.sh
@@ -0,0 +1,15 @@
#!/bin/bash
set -ex

# Run from project root
# - Args get passed to pytest phase
# Non-zero exit code on fail

# Assume [umap-learn,test]

python -m pytest --version

python -B -m pytest -vv \
graphistry/tests/test_compute_cluster.py

#chmod +x bin/test-dbscan.sh
15 changes: 15 additions & 0 deletions bin/test-text.sh
@@ -0,0 +1,15 @@
#!/bin/bash
set -ex

# Run from project root
# - Args get passed to pytest phase
# Non-zero exit code on fail

# Assume [umap-learn,test]

python -m pytest --version

python -B -m pytest -vv \
graphistry/tests/test_text_utils.py

# chmod +x bin/test-text.sh