Merge branch 'cleanup' into fix_embed_pred_links_gpu
silkspace authored Mar 7, 2023
2 parents b8a4f5a + d80f66f commit 260c02f
Showing 41 changed files with 17,928 additions and 2,417 deletions.
10 changes: 10 additions & 0 deletions .github/workflows/ci.yml
@@ -199,11 +199,21 @@ jobs:
          source pygraphistry/bin/activate
          ./bin/typecheck.sh
      - name: Full dbscan tests (rich featurize)
        run: |
          source pygraphistry/bin/activate
          ./bin/test-dbscan.sh
      - name: Full feature tests (rich featurize)
        run: |
          source pygraphistry/bin/activate
          ./bin/test-features.sh
      - name: Full search tests (rich featurize)
        run: |
          source pygraphistry/bin/activate
          ./bin/test-text.sh
      - name: Full umap tests (rich featurize)
        run: |
          source pygraphistry/bin/activate
9 changes: 9 additions & 0 deletions CHANGELOG.md
@@ -7,7 +7,16 @@ This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html)

## [Development]

### Changed
* AI: moves public `g.g_dgl` from the KG `embed` method to the private method `g._kg_dgl`
* AI: BREAKING CHANGE: to return matrices during transform, set the flag: `X, y = g.transform(df, return_graph=False)`; the default behavior `g2 = g.transform(df)` now returns a Plottable instance.
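A minimal sketch of the change (dataframe and column names are hypothetical):

```python
import pandas as pd
import graphistry

df = pd.DataFrame({'node': [0, 1, 2], 'feat': [0.1, 0.7, 0.4]})
g = graphistry.nodes(df, 'node').featurize()

g2 = g.transform(df)                        # new default: returns a Plottable graph
X, y = g.transform(df, return_graph=False)  # matrices are now opt-in
```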

### Added
* AI: all `transform_*` methods return graphistry Plottable instances via an infer_graph method; to return matrices, set the `return_graph=False` flag
* AI: adds `g.get_matrix(**kwargs)`, a general method to retrieve (sub)feature/target matrices
* AI: DBSCAN -- `g.featurize().dbscan()` and `g.umap().dbscan()`, with options to cluster on the UMAP embedding, the feature matrix, or a subset of the feature matrix via `g.dbscan(cols=[...])`
* AI: demo cleanup using ModelDict & new features; refactors demos around the `dbscan` and `transform` methods
* Tests: DBSCAN tests
* AI: easy import of featurization kwargs for `g.umap(**kwargs)` and `g.featurize(**kwargs)`
* AI: `g.get_features_by_cols` returns the featurized submatrix whose column names contain `col_part`
* AI: `g.conditional_graph` and `g.conditional_probs` for assessing conditional probabilities and building the conditional graph
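A sketch of the new accessors, assuming the kwargs named above (column substrings hypothetical):

```python
X = g.get_matrix()                      # full node feature matrix
y = g.get_matrix(target=True)           # target matrix
X_sub = g.get_matrix(['ip_', 'alert'])  # submatrix of columns matching the given substrings
```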
90 changes: 72 additions & 18 deletions README.md
@@ -358,61 +358,72 @@ Automatically and intelligently transform text, numbers, booleans, and other formats
g = g.umap() # UMAP, GNNs, use features if already provided, otherwise will compute

# other pydata libraries
X = g._node_features  # g._get_feature('nodes') or g.get_matrix()
y = g._node_target    # g._get_target('nodes') or g.get_matrix(target=True)
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor().fit(X, y)  # assumes train/test split
new_df = pandas.read_csv(...)  # mini batch
X_new, _ = g.transform(new_df, None, kind='nodes', return_graph=False)
preds = model.predict(X_new)
```

* Encode model definitions and compare models against each other

```python
# graphistry
from graphistry.features import search_model, topic_model, ngrams_model, ModelDict, default_featurize_parameters, default_umap_parameters

g = graphistry.nodes(df)
g2 = g.umap(X=[..], y=[..], **search_model)

# set custom encoding model with any feature/umap/dbscan kwargs
new_model = ModelDict(message='encoding new model parameters is easy', **default_featurize_parameters)
new_model.update(dict(
    y=[...],
    kind='edges',
    model_name='sbert/cool_transformer_model',
    use_scaler_target='kbins',
    n_bins=11,
    strategy='normal'))
print(new_model)

g3 = g.umap(X=[..], **new_model)
# compare g2 vs g3 or add to different pipelines
# ...
```


See `help(g.featurize)` for more options
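For instance, a quick sketch of reusing a packaged model definition (assuming the `graphistry.features` exports shown above):

```python
from graphistry.features import topic_model

g2 = g.featurize(**topic_model)  # same encoding, reusable across pipelines
```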

### [sklearn-based UMAP](https://umap-learn.readthedocs.io/en/latest/), [cuML-based UMAP](https://docs.rapids.ai/api/cuml/stable/api.html?highlight=umap#cuml.UMAP)

* Reduce dimensionality by plotting a similarity graph from feature vectors:

```python
# automatic feature engineering, UMAP
g = graphistry.nodes(df).umap()

# plot the similarity graph without any explicit edge_dataframe passed in -- it is created during UMAP.
g.plot()
```

* Apply a trained model to new data:

```python
new_df = pd.read_csv(...)
embeddings, X_new, _ = g.transform_umap(new_df, None, kind='nodes', return_graph=False)
```
* Infer a new graph from new data, using the fitted UMAP coordinates to run inference without retraining the model.

```python
new_df = pd.read_csv(...)
g2 = g.transform_umap(new_df, return_graph=True)  # return_graph=True is the default
g2.plot()

# or, to cluster the new minibatch onto the closest points of the previous fit:
g3 = g.transform_umap(new_df, return_graph=True, merge_policy=True)
g3.plot()  # useful to see how new data connects to old -- play with `sample` and `n_neighbors` to control how much of the old graph to include, as in the sketch below
```
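A hedged sketch of tuning that connectivity (`sample` and `n_neighbors` semantics assumed from `help(g.transform_umap)`):

```python
# sample: roughly how much of the old graph to pull in around each new point;
# n_neighbors: how many nearest old neighbors each new point may attach to
g4 = g.transform_umap(new_df, return_graph=True, merge_policy=True, sample=10, n_neighbors=7)
g4.plot()
```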


* UMAP supports many options, such as supervised mode, working on a subset of columns, and passing arguments to underlying `featurize()` and UMAP implementations (see `help(g.umap)`):
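For example, a sketch of supervised UMAP on a subset of columns (column names hypothetical, kwargs assumed from `help(g.umap)`):

```python
g2 = g.umap(
    X=['col_1', 'col_2'],  # featurize only these columns
    y=['label'],           # supervised mode
    n_neighbors=15,        # passed through to the UMAP implementation
    min_dist=0.1)
```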

@@ -451,11 +462,11 @@ See `help(g.umap)` for more options

from [your_training_pipeline] import train, model
# Train
g = graphistry.nodes(df).build_gnn(y_nodes='target')
G = g.DGL_graph
train(G, model)
# predict on new data
X_new, _ = g.transform(new_df, None, kind='nodes', return_graph=False)  # no targets
G_new = graphistry.nodes(new_df).build_gnn().DGL_graph  # assumption: rebuild the DGL graph for the new mini batch
predictions = model.predict(G_new, X_new)
```

@@ -480,12 +491,21 @@ GNN support is rapidly evolving, please contact the team directly or on Slack for support
# encode text as paraphrase embeddings; supports any SBERT model
model_name = "paraphrase-MiniLM-L6-v2")

# or use the convenience `ModelDict` to store parameters

from graphistry.features import search_model
g2 = g.featurize(X = ['text_col_1', .., 'text_col_n'], kind='nodes', **search_model)

# query using the power of transformers to find richly relevant results

results_df, query_vector = g2.search('my natural language query', ...)

print(results_df[['_distance', 'text_col', ..]])  # sorted by relevancy

# or see graph of matching entities and similarity edges (or optional original edges)
g2.search_graph('my natural language query', ...).plot()

```


@@ -521,7 +541,7 @@ See `help(g.search_graph)` for options
relation=['relationship_1', 'relationship_4', ..],
destination=['entity_l', 'entity_m', ..],
threshold=0.9, # score threshold
return_dataframe=False)  # set to `True` to return a dataframe, or just access via `g4._edges`
```

* Detect Anomalous Behavior (example use cases such as Cyber, Fraud, etc.)

@@ -552,8 +572,42 @@ See `help(g.search_graph)` for options
g2.predict_links_all(threshold=0.95).plot()
```

See `help(g.embed)`, `help(g.predict_links)`, or `help(g.predict_links_all)` for options
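Putting the pieces together, a minimal sketch of the embedding-to-prediction flow (relation column name hypothetical, assuming the `embed` and `predict_links_all` methods referenced above):

```python
g = graphistry.edges(edf, 'src', 'dst')
g2 = g.embed(relation='relationship')        # train link embeddings over the given relation column
g2.predict_links_all(threshold=0.95).plot()  # score candidate links and plot the likely ones
```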

### DBSCAN

* Enrich UMAP embeddings or the featurization dataframe with GPU or CPU DBSCAN clustering

```python
g = graphistry.edges(edf, 'src', 'dst').nodes(ndf, 'node')

# cluster by UMAP embeddings
kind = 'nodes'  # or 'edges'
g2 = g.umap(kind=kind).dbscan(kind=kind)
print(g2._nodes['_dbscan'])  # or g2._edges['_dbscan']

# dbscan in `umap` or `featurize` via flag
g2 = g.umap(dbscan=True, min_dist=0.2, min_samples=1)

# or via chaining,
g2 = g.umap().dbscan(min_dist=1.2, min_samples=2, **kwargs)

# cluster by feature embeddings
g2 = g.featurize().dbscan(**kwargs)

# cluster by a given set of feature column attributes, inherited from `g.get_matrix(cols)`
g2 = g.featurize().dbscan(cols=['ip_172', 'location', 'alert'], **kwargs)

# equivalent to the above (i.e., cols != None with umap=True will still use the features dataframe, rather than UMAP embeddings)
g2 = g.umap().dbscan(cols=['ip_172', 'location', 'alert'], umap=True, **kwargs)  # or umap=False
g2.plot()  # color by `_dbscan`

new_df = pd.read_csv(..)
# transform new data according to the fitted dbscan model
g3 = g2.transform_dbscan(new_df)
```

See `help(g.dbscan)` or `help(g.transform_dbscan)` for options

### Quickly configurable

15 changes: 15 additions & 0 deletions bin/test-dbscan.sh
@@ -0,0 +1,15 @@
#!/bin/bash
set -ex

# Run from project root
# - Args get passed to pytest phase
# Non-zero exit code on fail

# Assume [umap-learn,test]

python -m pytest --version

python -B -m pytest -vv \
graphistry/tests/test_compute_cluster.py

# chmod +x bin/test-dbscan.sh
15 changes: 15 additions & 0 deletions bin/test-text.sh
@@ -0,0 +1,15 @@
#!/bin/bash
set -ex

# Run from project root
# - Args get passed to pytest phase
# Non-zero exit code on fail

# Assume [umap-learn,test]

python -m pytest --version

python -B -m pytest -vv \
graphistry/tests/test_text_utils.py

# chmod +x bin/test-text.sh