Merge branch 'cleanup' into fix_embed_pred_links_gpu
silkspace authored Mar 7, 2023
2 parents b8a4f5a + d80f66f commit 260c02f
Showing 41 changed files with 17,928 additions and 2,417 deletions.
10 changes: 10 additions & 0 deletions .github/workflows/ci.yml
@@ -199,11 +199,21 @@ jobs:
          source pygraphistry/bin/activate
          ./bin/typecheck.sh
      - name: Full dbscan tests (rich featurize)
        run: |
          source pygraphistry/bin/activate
          ./bin/test-dbscan.sh
      - name: Full feature tests (rich featurize)
        run: |
          source pygraphistry/bin/activate
          ./bin/test-features.sh
      - name: Full search tests (rich featurize)
        run: |
          source pygraphistry/bin/activate
          ./bin/test-text.sh
      - name: Full umap tests (rich featurize)
        run: |
          source pygraphistry/bin/activate
9 changes: 9 additions & 0 deletions CHANGELOG.md
@@ -7,7 +7,16 @@ This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html)

## [Development]

### Changed
* AI: moves public `g.g_dgl` from the KG `embed` method to the private method `g._kg_dgl`
* AI: BREAKING CHANGE: to return matrices during transform, set the flag: `X, y = g.transform(df, return_graph=False)`; the default behavior `g2 = g.transform(df)` now returns a Plottable instance.
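A minimal sketch of the change (dataframe and column names are hypothetical):

```python
import pandas as pd
import graphistry

df = pd.DataFrame({'node': [0, 1, 2], 'feat': [0.1, 0.7, 0.4]})
g = graphistry.nodes(df, 'node').featurize()

g2 = g.transform(df)                        # new default: returns a Plottable graph
X, y = g.transform(df, return_graph=False)  # matrices are now opt-in
```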

### Added
* AI: all `transform_*` methods return graphistry Plottable instances via an infer_graph method; to return matrices, set the `return_graph=False` flag
* AI: adds `g.get_matrix(**kwargs)`, a general method to retrieve (sub)feature/target matrices
* AI: DBSCAN -- `g.featurize().dbscan()` and `g.umap().dbscan()`, with options to cluster on the UMAP embedding, the feature matrix, or a subset of the feature matrix via `g.dbscan(cols=[...])`
* AI: demo cleanup using ModelDict & new features; refactors demos around the `dbscan` and `transform` methods
* Tests: DBSCAN tests
* AI: easy import of featurization kwargs for `g.umap(**kwargs)` and `g.featurize(**kwargs)`
* AI: `g.get_features_by_cols` returns the featurized submatrix whose column names contain `col_part`
* AI: `g.conditional_graph` and `g.conditional_probs` for assessing conditional probabilities and building the conditional graph
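A sketch of the new accessors, assuming the kwargs named above (column substrings hypothetical):

```python
X = g.get_matrix()                      # full node feature matrix
y = g.get_matrix(target=True)           # target matrix
X_sub = g.get_matrix(['ip_', 'alert'])  # submatrix of columns matching the given substrings
```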
90 changes: 72 additions & 18 deletions README.md
@@ -358,61 +358,72 @@ Automatically and intelligently transform text, numbers, booleans, and other formats
g = g.umap() # UMAP, GNNs, use features if already provided, otherwise will compute

# other pydata libraries
X = g._node_features  # g._get_feature('nodes') or g.get_matrix()
y = g._node_target    # g._get_target('nodes') or g.get_matrix(target=True)
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor().fit(X, y)  # assumes train/test split
new_df = pandas.read_csv(...)  # mini batch
X_new, _ = g.transform(new_df, None, kind='nodes', return_graph=False)
preds = model.predict(X_new)
```

* Encode model definitions and compare models against each other

```python
# graphistry
from graphistry.features import search_model, topic_model, ngrams_model, ModelDict, default_featurize_parameters, default_umap_parameters

g = graphistry.nodes(df)
g2 = g.umap(X=[..], y=[..], **search_model)

# set custom encoding model with any feature/umap/dbscan kwargs
new_model = ModelDict(message='encoding new model parameters is easy', **default_featurize_parameters)
new_model.update(dict(
    y=[...],
    kind='edges',
    model_name='sbert/cool_transformer_model',
    use_scaler_target='kbins',
    n_bins=11,
    strategy='normal'))
print(new_model)

g3 = g.umap(X=[..], **new_model)
# compare g2 vs g3 or add to different pipelines
# ...
```


See `help(g.featurize)` for more options
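For instance, a quick sketch of reusing a packaged model definition (assuming the `graphistry.features` exports shown above):

```python
from graphistry.features import topic_model

g2 = g.featurize(**topic_model)  # same encoding, reusable across pipelines
```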

### [sklearn-based UMAP](https://umap-learn.readthedocs.io/en/latest/), [cuML-based UMAP](https://docs.rapids.ai/api/cuml/stable/api.html?highlight=umap#cuml.UMAP)

* Reduce dimensionality by plotting a similarity graph from feature vectors:

```python
# automatic feature engineering, UMAP
g = graphistry.nodes(df).umap()

# plot the similarity graph without any explicit edge_dataframe passed in -- it is created during UMAP.
g.plot()
```

* Apply a trained model to new data:

```python
new_df = pd.read_csv(...)
embeddings, X_new, _ = g.transform_umap(new_df, None, kind='nodes', return_graph=False)
```
* Infer a new graph from new data, using the fitted UMAP coordinates to run inference without retraining the model.

```python
new_df = pd.read_csv(...)
g2 = g.transform_umap(new_df, return_graph=True)  # return_graph=True is the default
g2.plot()

# or, to cluster the new minibatch onto the closest points of the previous fit:
g3 = g.transform_umap(new_df, return_graph=True, merge_policy=True)
g3.plot()  # useful to see how new data connects to old -- play with `sample` and `n_neighbors` to control how much of the old graph to include, as in the sketch below
```
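A hedged sketch of tuning that connectivity (`sample` and `n_neighbors` semantics assumed from `help(g.transform_umap)`):

```python
# sample: roughly how much of the old graph to pull in around each new point;
# n_neighbors: how many nearest old neighbors each new point may attach to
g4 = g.transform_umap(new_df, return_graph=True, merge_policy=True, sample=10, n_neighbors=7)
g4.plot()
```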


* UMAP supports many options, such as supervised mode, working on a subset of columns, and passing arguments to underlying `featurize()` and UMAP implementations (see `help(g.umap)`):
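For example, a sketch of supervised UMAP on a subset of columns (column names hypothetical, kwargs assumed from `help(g.umap)`):

```python
g2 = g.umap(
    X=['col_1', 'col_2'],  # featurize only these columns
    y=['label'],           # supervised mode
    n_neighbors=15,        # passed through to the UMAP implementation
    min_dist=0.1)
```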

@@ -451,11 +462,11 @@ See `help(g.umap)` for more options

from [your_training_pipeline] import train, model
# Train
g = graphistry.nodes(df).build_gnn(y_nodes='target')
G = g.DGL_graph
train(G, model)
# predict on new data
X_new, _ = g.transform(new_df, None, kind='nodes', return_graph=False)  # no targets
G_new = graphistry.nodes(new_df).build_gnn().DGL_graph  # assumption: rebuild the DGL graph for the new mini batch
predictions = model.predict(G_new, X_new)
```

@@ -480,12 +491,21 @@ GNN support is rapidly evolving, please contact the team directly or on Slack for support
# encode text as paraphrase embeddings; supports any SBERT model
model_name = "paraphrase-MiniLM-L6-v2")

# or use the convenience `ModelDict` to store parameters

from graphistry.features import search_model
g2 = g.featurize(X = ['text_col_1', .., 'text_col_n'], kind='nodes', **search_model)

# query using the power of transformers to find richly relevant results

results_df, query_vector = g2.search('my natural language query', ...)

print(results_df[['_distance', 'text_col', ..]])  # sorted by relevancy

# or see graph of matching entities and similarity edges (or optional original edges)
g2.search_graph('my natural language query', ...).plot()

```


@@ -521,7 +541,7 @@ See `help(g.search_graph)` for options
relation=['relationship_1', 'relationship_4', ..],
destination=['entity_l', 'entity_m', ..],
threshold=0.9, # score threshold
return_dataframe=False)  # set to `True` to return a dataframe, or just access via `g4._edges`
```

* Detect Anomalous Behavior (example use cases such as Cyber, Fraud, etc.)

@@ -552,8 +572,42 @@ See `help(g.search_graph)` for options
g2.predict_links_all(threshold=0.95).plot()
```

See `help(g.embed)`, `help(g.predict_links)`, or `help(g.predict_links_all)` for options
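Putting the pieces together, a minimal sketch of the embedding-to-prediction flow (relation column name hypothetical, assuming the `embed` and `predict_links_all` methods referenced above):

```python
g = graphistry.edges(edf, 'src', 'dst')
g2 = g.embed(relation='relationship')        # train link embeddings over the given relation column
g2.predict_links_all(threshold=0.95).plot()  # score candidate links and plot the likely ones
```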

### DBSCAN

* Enrich UMAP embeddings or the featurization dataframe with GPU or CPU DBSCAN clustering

```python
g = graphistry.edges(edf, 'src', 'dst').nodes(ndf, 'node')

# cluster by UMAP embeddings
kind = 'nodes'  # or 'edges'
g2 = g.umap(kind=kind).dbscan(kind=kind)
print(g2._nodes['_dbscan'])  # or g2._edges['_dbscan']

# dbscan in `umap` or `featurize` via flag
g2 = g.umap(dbscan=True, min_dist=0.2, min_samples=1)

# or via chaining,
g2 = g.umap().dbscan(min_dist=1.2, min_samples=2, **kwargs)

# cluster by feature embeddings
g2 = g.featurize().dbscan(**kwargs)

# cluster by a given set of feature column attributes, inherited from `g.get_matrix(cols)`
g2 = g.featurize().dbscan(cols=['ip_172', 'location', 'alert'], **kwargs)

# equivalent to the above (i.e., cols != None with umap=True will still use the features dataframe, rather than UMAP embeddings)
g2 = g.umap().dbscan(cols=['ip_172', 'location', 'alert'], umap=True, **kwargs)  # or umap=False
g2.plot()  # color by `_dbscan`

new_df = pd.read_csv(..)
# transform new data according to the fitted dbscan model
g3 = g2.transform_dbscan(new_df)
```

See `help(g.dbscan)` or `help(g.transform_dbscan)` for options

### Quickly configurable

15 changes: 15 additions & 0 deletions bin/test-dbscan.sh
@@ -0,0 +1,15 @@
#!/bin/bash
set -ex

# Run from project root
# - Args get passed to pytest phase
# Non-zero exit code on fail

# Assume [umap-learn,test]

python -m pytest --version

python -B -m pytest -vv \
graphistry/tests/test_compute_cluster.py

# chmod +x bin/test-dbscan.sh
15 changes: 15 additions & 0 deletions bin/test-text.sh
@@ -0,0 +1,15 @@
#!/bin/bash
set -ex

# Run from project root
# - Args get passed to pytest phase
# Non-zero exit code on fail

# Assume [umap-learn,test]

python -m pytest --version

python -B -m pytest -vv \
graphistry/tests/test_text_utils.py

# chmod +x bin/test-text.sh