Dev/ai demos merge (#430)

* mypy thinks mask is Series[int], hence wrapped in np.array(dtype=bool)
* bind title fix
* removes feature flags that are deprecated (similarity, categories, confidence) and adds docstrings in main featurize function call
* type check
* deletes search_index if it exists when mutating nodes. g.search will rebuilt if not there
* makes safe search under the case where nodes are mutated via assertion
* lint
* resolves Leos requested changes
* fix(temp fix on imputer so that it doesnt drop nan columns)
* removes print
* changes `impute` type and logic for more robust pipeline. removes deprecated flags
* adds outlier poc code
* switched to rgcn
* typo
* typo fix
* works
* predict mode
* commented out
* adds Ask-HackerNews demo tutorial, adds all recall nodes to subgraph, rather than just those with found edges in text_utils.py, and adds what should be on master in feature_utiils
* adds helper model dict imports for featurization
* adds easier model presets for users when using .featurize
* adds cli
* docs
* adds update to README
* adds update to README
* adds update to README
* rel + feat
* better naming
* bug
* batch_size arg
* adds corrections and args
* typo
* fixes errors and adds embeddings
* bug
* bug
* save before merge with work Tanmoy and I did
* adds scoring in main graphistry instance
* adds working node priors from featurization
* fixes nodes issues
* adds state changing code that removes edges not in nodes and vice versa
* lazy import returns modules
* cherry picks other branches to add dgl fix and outliers lib
* adds CyberSecurity CTU-13 dataset GNN pipeline demo
* adds Jack Dorsey Social Good Pledge dataset
* infer with [s, r] -> d
* adds function methods like get_matrix etc
* wip(logger): make uniform
* handles missing nodes df
* adds chavismo OSINT demo
* no-node feat() fix
* bigfix num_node
* train_split bugfix
* typo
* eval
* typo bug
* typo
* save before stash
* reverting back
* stable
* merged
* adds stable algo and namespace
* breaks up training so that repeated .embed calls trains existing model
* faster chaining if model and preprocessing has already occured
* faster chaining if model and preprocessing has already occured
* faster chaining if model and preprocessing has already occured
* faster chaining if model and preprocessing has already occured
* faster chaining if model and preprocessing has already occured
* logger
* better chaining
* update default lr
* adds evaluation flag
* to device
* fixes hard coded cuda and sets args
* flake8, isort, black
* basic unit tests
* bug
* adds .to device for outside features
* docs(rgcn demos): infosec jupyterthon 2022
* adds query naming in g.search_graph so it shows up in hub with name
* logger
* node idx converted to pd.series
* map
* efficient predict_link
* remap pred_links wrt dict
* linters in networks.py
* dummy numpy doc with annotations
* lint
* lint
* more annotations
* lint
* some annotations
* commit before merge
* adds logic for chaining and when parameters change
* adds passing tests, adds args for sample_size, num_steps in g_iterator
* adds passing tests, adds args for sample_size, num_steps in g_iterator
* more type hints
* lint
* mypy checks
* mypy checks more
* mypy checks more
* lazy imports
* trange
* trial 7 none(s)
* embed outside minimal test
* unittest min dep required
* unittest min dep required
* small comments for later
* fixes score issues over train_idx that were expand_dims in error prone way
* lint
* empty
* typo
* infra(adds ai-embed test hook into ci gha)
* infra(adds bin/test-embed.sh)
* infra(adds sphinx nitpick)
* adds README and CHANGELOG
* adds README and CHANGELOG
* adds README and CHANGELOG
* feat(adds `anomalous` flag to score low confidence edges, updates readme)
* feat(adds default KG args to PlotterBase)
* fix(removes pd.Series as it is not needed, lint)
* docs(changelog): rgcn
* refactor(mypy): reducing type: ignore count from 47 -> 19
* perf: scalable predict_links_all, some cleanup of old funcs, migrating gcn_node_embeddings to property
* ci: adding tqdm-stubs to setup.py
* feat(streamlines predict code)
* feat: New inference api with targeted source, relation and destination arguments
* fix: mypy checks in predict_links method
* fix: predict_links input type changed from pd.Series -> list
* feat(adds factory method for scoring triplets)
* fix(adds test given refactor, and CHANGELOG public methods)
* feat(adds RED team hunt UMAP notebook for simplified outlier detection and alert volume reduction)
* feat(handles returning dataframe as flag)
* feat(handles returning dataframe as flag)
* feat(sorts scored triplets)
* adds more README
* updates readme
* lint
* fix: some mypy-pandas typecheck fix
* fix: some mypy-pandas typecheck fix
* fix: some mypy-pandas typecheck fix
* fix: some mypy-pandas typecheck fix
* Readme
* adds tests, README
* adds tests
* removing some comments
* fix: mypy fix List[str] -> List
* fix: mypy fix
* adds changes to demo given new api changes. Adds logging in networks
* demo notebook change to reflect new api
* changes name of notebook
* updates networks.py from heteroembed branch (which passes lint)
* black reformattingg
* merges feature_utils from heteroembed, adds linting changes
* lint
* linting hyper_dask.py
* lint
* adds `get_features_by_cols`, updates CHANGELOG and README, and small change in features.py
* lint
* lint
* feat(adds conditional prob): for some reason this was not on branch...
* feat(adds separate mixin for conditional.py methods)
* lint
* lint
* lint
* lint
* changes compute import in plotter.py
* changelog.md
* sphinx adds conditional.ConditionMixin
* typo
* feat(adds tests for conditional.py)
* feat(adds tests for conditional.py)
* lint
* test
* test
* test
* Update CHANGELOG.md
* docs(ModelDict): main example
* docs(hackernew)
* docs(more hnews)
* doc(ask hacker news demo)
* doc(ask hacker news demo)
* changes(keywords in setup, HackerNews demo)
* Delete cyber-fraud-umap-demo.ipynb
  Renamed but it didn't delete on remote.
* mypi
* mypi
* adds type ignore
* adds type ignore
* comments out test
* removes test
* adds working changes from ai_demos branch for single file
* fix(tests): sso_login tests no longer tolerate unexpected exns
* garden(sso): clearer unexpected exn msg
* docs(changelog): sso fixes
* fix(tests): reenable test_hyper_evil
* garden(tests): print veresion of mypy, pandas, numpy
* adds docstrings
* adds docstrings
* adds docstrings and lint
* lint
* lint
* fix(ci): tolerate hypergraph evil warning
* fix(ci): redo warning supression

Co-authored-by: Alex <[email protected]>
Co-authored-by: tanmoyio <[email protected]>
Co-authored-by: Alex Morrise <[email protected]>
Co-authored-by: Tanmoy Sarkar <[email protected]>
graphistry · Dec 23, 2022 · c6ece09
1 parent 980923d
Showing 36 changed files with 13,171 additions and 376 deletions.
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -157,7 +157,7 @@ jobs:
         source pygraphistry/bin/activate
         ./bin/test-umap-learn-core.sh
 
-  test-full-umap:
+  test-full-ai:
 
     needs: [ test-minimal-python ]
     runs-on: ubuntu-latest
@@ -209,6 +209,12 @@ jobs:
         source pygraphistry/bin/activate
         ./bin/test-umap-learn-core.sh
 
+    - name: Full embed tests (rich featurize)
+      run: |
+        source pygraphistry/bin/activate
+        ./bin/test-embed.sh
+
+
   test-neo4j:
 
     needs: [ test-minimal-python ]

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -7,11 +7,26 @@ This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.htm
 
 ## [Development]
 
+### Added
+* AI: Easy import of featurization kwargs for `g.umap(**kwargs)` and `g.featurize(**kwargs)`
+* AI: `g.get_features_by_cols` returns the featurized submatrix of columns whose names contain `col_part`
+* AI: `g.conditional_graph` and `g.conditional_probs` for computing conditional probabilities and the corresponding graph
+* AI Demos folder: OSINT, CYBER demos
+* AI: Full text & semantic search (`g.search(..)` and `g.search_graph(..).plot()`)
+* AI: Featurization: support for dataframe columns that are list of lists -> multilabel targets
+                  set using `g.featurize(y=['list_of_lists_column'], multilabel=True,...)`
+* AI: `g.embed(..)` code for fast knowledge graph embedding (2-layer RGCN) and its usage for link scoring and prediction
+* AI: Exposes public methods `g.predict_links(..)` and `g.predict_links_all()`
+* AI: automatic naming of graphistry objects during `g.search_graph(query)` -> `g._name = query`
+* AI: RGCN demos - Infosec Jupyterthon 2022, SSH anomaly detection
+
 ### Fixed
 
 * GIB: Add missing import during group-in-a-box cudf layout of 0-degree nodes
+* Tests: SSO login tests catch more unexpected exns
+
+## [0.28.6 - 2022-11-29]
 
-## [0.28.6 - 2022-11-29]
 
 ### Added
 

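The first changelog entry above ("Easy import of featurization kwargs") comes down to keeping named parameter presets as dictionaries and unpacking them into a call. A minimal sketch of that pattern in plain Python follows. It assumes nothing about the real `graphistry.features` objects: `DEFAULTS`, `make_preset`, and `featurize_stub` are hypothetical names used only for illustration, and the parameter values are made up.

```python
from typing import Any, Dict, List, Optional

# Hypothetical defaults, standing in for a library-provided parameter dict.
DEFAULTS: Dict[str, Any] = {
    "kind": "nodes",
    "min_words": 2,
    "model_name": "paraphrase-MiniLM-L6-v2",
}


def make_preset(**overrides: Any) -> Dict[str, Any]:
    """Return a new preset: a copy of DEFAULTS with overrides applied."""
    preset = dict(DEFAULTS)  # copy, so the shared defaults are never mutated
    preset.update(overrides)
    return preset


def featurize_stub(X: Optional[List[str]] = None, **kwargs: Any) -> Dict[str, Any]:
    """Stand-in for a featurization call; it only echoes its arguments."""
    return {"X": X, **kwargs}


# Build a preset once, then unpack it wherever the same settings are needed.
text_preset = make_preset(min_words=0, kind="edges")
result = featurize_stub(X=["title"], **text_preset)
print(result)
```

Copying the defaults before updating keeps presets independent, so two pipelines can be compared without one silently changing the other's parameters.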
diff --git a/README.md b/README.md
@@ -358,15 +358,41 @@ Automatically and intelligently transform text, numbers, booleans, and other for
     g = g.umap()  # UMAP, GNNs, use features if already provided, otherwise will compute
 
     # other pydata libraries
-    X = g._node_features
-    y = g._node_target
+    X = g._node_features  # g._get_feature('nodes')
+    y = g._node_target  # g._get_target('nodes')
     from sklearn.ensemble import RandomForestRegressor
     model = RandomForestRegressor().fit(X, y) #assumes train/test split
     new_df = pandas.read_csv(...)
     X_new, _ = g.transform(new_df, None, kind='nodes')
     preds = model.predict(X_new)
     ```
 
+ * Encode model definitions and compare models against each other
+
+   ```python
+    # graphistry
+    from graphistry.features import search_model, topic_model, ngrams_model, ModelDict, default_featurize_parameters
+
+    g = graphistry.nodes(df)
+    g2 = g.umap(X=[..], y=[..], **search_model)  
+
+    # set custom encoding model with any feature kwargs
+    new_model = ModelDict(message='encoding new model parameters is easy', **default_featurize_parameters)
+    new_model.update(dict(
+                      y=[...],
+                      kind='edges', 
+                      model_name='sbert/hf/a_cool_transformer_model', 
+                      use_scaler_target='kbins', 
+                      n_bins=11, 
+                      strategy='normal'))
+    print(new_model)
+
+    g3 = g.umap(X=[..], **new_model)
+    # compare g2 vs g3 or add to different pipelines
+    # ...
+    ```
+
+
 See `help(g.featurize)` for more options
 
 ### [sklearn-based UMAP](https://umap-learn.readthedocs.io/en/latest/), [cuML-based UMAP](https://docs.rapids.ai/api/cuml/stable/api.html?highlight=umap#cuml.UMAP)
@@ -450,16 +476,18 @@ GNN support is rapidly evolving, please contact the team directly or on Slack fo
       g = graphistry.nodes(ndf, 'node').edges(edf, 'src', 'dst')
 
       g2 = g.featurize(X = ['text_col_1', .., 'text_col_n'], kind='nodes',
-                        min_words=0,  # forces all named columns as textual ones
-                        #encode text as paraphrase embeddings, supports any sbert/Huggingface model
-                        model_name: str = "paraphrase-MiniLM-L6-v2")
+                        min_words = 0,  # forces all named columns as textual ones
+                        #encode text as paraphrase embeddings, supports any sbert model
+                        model_name = "paraphrase-MiniLM-L6-v2")
 
       results_df, query_vector = g2.search('my natural language query', ...)
-      print(results_df[['distance', 'text_col_1', ..., 'text_col_n']])  #sorted by relevancy
+
+      print(results_df[['_distance', 'text_col_1', ..., 'text_col_n']])  #sorted by relevancy
 
       # or see graph of matching entities and similarity edges (or optional original edges)
       g2.search_graph('my natural language query', ...).plot()
     ```
+
 
 * If edges are not given, `g.umap(..)` will supply them: 
 
@@ -473,6 +501,59 @@ GNN support is rapidly evolving, please contact the team directly or on Slack fo
 
 See `help(g.search_graph)` for options
 
+### Knowledge Graph Embeddings
+
+* Train a RGCN model and predict:
+
+    ```python
+      edf = pd.read_csv('edges.csv')
+      g = graphistry.edges(edf, 'src', 'dst')
+      g2 = g.embed(relation='relationship_column_of_interest', **kwargs)
+
+      # predict links over all nodes
+      g3 = g2.predict_links_all(threshold=0.95)  # score high confidence predicted edges
+      g3.plot()
+
+      # predict over any set of entities and/or relations. 
+      # Set any `source`, `destination` or `relation` to `None` to predict over all of them.
+      # if all are None, it is better to use `g.predict_links_all` for speed.
+      g4 = g2.predict_links(source=['entity_k'], 
+                      relation=['relationship_1', 'relationship_4', ..], 
+                      destination=['entity_l', 'entity_m', ..], 
+                      threshold=0.9,  # score threshold
+                      return_dataframe=False)  # set to `True` to return dataframe, or just access via `g4._edges`
+    ```
+
+* Detect Anomalous Behavior (example use cases such as Cyber, Fraud, etc.)
+
+    ```python
+      # Score anomalous edges by setting the flag `anomalous` to True and setting a low confidence threshold
+      # g2 is the embedded graph returned by `g.embed(..)` above
+      g5 = g2.predict_links_all(threshold=0.05, anomalous=True)  # score low confidence predicted edges
+      g5.plot()
+
+      g6 = g2.predict_links(source=['ip_address_1', 'user_id_3'],
+                      relation=['attempt_logon', 'phishing', ..], 
+                      destination=['user_id_1', 'active_directory', ..], 
+                      anomalous=True,
+                      threshold=0.05)
+      g6.plot()
+    ```
+
+* Train a RGCN model including auto-featurized node embeddings
+
+    ```python
+      edf = pd.read_csv('edges.csv')
+      ndf = pd.read_csv('nodes.csv')  # adding node dataframe
+
+      g = graphistry.edges(edf, 'src', 'dst').nodes(ndf, 'node_column')
+
+      # inherits all the featurization `kwargs` from `g.featurize`
+      g2 = g.embed(relation='relationship_column_of_interest', use_feat=True, **kwargs)
+      g2.predict_links_all(threshold=0.95).plot()
+    ```
+
+See `help(g.embed)`, `help(g.predict_links)`, and `help(g.predict_links_all)` for options
+
 
 ### Quickly configurable
 

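The `threshold` and `anomalous` arguments in the README examples above suggest a simple filter over scored triplets. The sketch below is an assumption about those semantics, not the library's implementation: it keeps triplets scoring at or above the threshold by default, and at or below it when `anomalous=True`. `Triplet` and `filter_scored` are hypothetical names, and the scores are invented.

```python
from typing import List, NamedTuple


class Triplet(NamedTuple):
    source: str
    relation: str
    destination: str
    score: float  # assumed model confidence in [0, 1]


def filter_scored(
    triplets: List[Triplet], threshold: float, anomalous: bool = False
) -> List[Triplet]:
    """Keep confident triplets, or unlikely ones when `anomalous` is set."""
    if anomalous:
        # Assumed semantics: low-scoring edges are the anomalies of interest.
        kept = [t for t in triplets if t.score <= threshold]
        return sorted(kept, key=lambda t: t.score)  # least likely first
    kept = [t for t in triplets if t.score >= threshold]
    return sorted(kept, key=lambda t: t.score, reverse=True)  # most likely first


scored = [
    Triplet("ip_address_1", "attempt_logon", "user_id_1", 0.97),
    Triplet("user_id_3", "phishing", "active_directory", 0.02),
    Triplet("ip_address_1", "phishing", "user_id_1", 0.40),
]

likely = filter_scored(scored, threshold=0.95)
unlikely = filter_scored(scored, threshold=0.05, anomalous=True)
print(likely)
print(unlikely)
```

Under this reading, the same scored set serves both use cases: a high threshold surfaces probable missing links, while a low threshold with `anomalous=True` surfaces edges the model considers improbable.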
diff --git a/bin/test-embed.sh b/bin/test-embed.sh
@@ -0,0 +1,15 @@
+#!/bin/bash
+set -ex
+
+# Run from project root
+# - Args get passed to pytest phase
+# Non-zero exit code on fail
+
+# Assume [umap-learn,test]
+
+python -m pytest --version
+
+python -B -m pytest -vv \
+    graphistry/tests/test_embed_utils.py
+
+#chmod +x bin/test-embed.sh
diff --git a/bin/test-minimal.sh b/bin/test-minimal.sh
@@ -18,3 +18,4 @@ python -B -m pytest -vv \
     --ignore=graphistry/tests/test_feature_utils.py \
     --ignore=graphistry/tests/test_umap_utils.py \
     --ignore=graphistry/tests/test_dgl_utils.py \
+    --ignore=graphistry/tests/test_embed_utils.py \