Dev/ai demos merge (#430)
* mypy thinks mask is Series[int], hence wrapped in np.array(dtype=bool)

* bind title fix

* removes feature flags that are deprecated (similarity, categories, confidence) and adds docstrings in main featurize function call

* type check

* deletes search_index if it exists when mutating nodes. g.search will rebuilt if not there

* makes safe search under the case where nodes are mutated via assertion

* lint

* resolves Leos requested changes

* fix(temp fix on imputer so that it doesnt drop nan columns)

* removes print

* changes `impute` type and logic for more robust pipeline. removes deprecated flags

* adds outlier poc code

* switched to rgcn

* typo

* typo fix

* works

* predict mode

* commented out

* adds Ask-HackerNews demo tutorial, adds all recall nodes to subgraph, rather than just those with found edges in text_utils.py, and adds what should be on master in feature_utiils

* adds helper model dict imports for featurization

* adds easier model presets for users when using .featurize

* adds cli

* docs

* adds update to README

* adds update to README

* adds update to README

* rel + feat

* better naming

* bug

* batch_size arg

* adds corrections and args

* typo

* fixes errors and adds embeddings

* bug

* bug

* save before merge with work Tanmoy and I did

* adds scoring in main graphistry instance

* adds working node priors from featurization

* fixes nodes issues

* adds state changing code that removes edges not in nodes and vice versa

* lazy import returns modules

* cherry picks other branches to add dgl fix and outliers lib

* adds CyberSecurity CTU-13 dataset GNN pipeline demo

* adds Jack Dorsey Social Good Pledge dataset

* infer with [s, r] -> d

* adds function methods like get_matrix etc

* wip(logger): make uniform

* handles missing nodes df

* adds chavismo OSINT demo

* no-node feat() fix

* bigfix num_node

* train_split bugfix

* typo

* eval

* typo bug

* typo

* save before stash

* reverting back

* stable

* merged

* adds stable algo and namespace

* breaks up training so that repeated .embed calls trains existing model

* faster chaining if model and preprocessing has already occured

* faster chaining if model and preprocessing has already occured

* faster chaining if model and preprocessing has already occured

* faster chaining if model and preprocessing has already occured

* faster chaining if model and preprocessing has already occured

* logger

* better chaining

* update default lr

* adds evaluation flag

* to device

* fixes hard coded cuda and sets args

* flake8, isort, black

* basic unit tests

* bug

* adds .to device for outside features

* docs(rgcn demos): infosec jupyterthon 2022

* adds query naming in g.search_graph so it shows up in hub with name

* logger

* node idx converted to pd.series

* map

* efficient predict_link

* remap pred_links wrt dict

* linters in networks.py

* dummy numpy doc with annotations

* lint

* lint

* more annotations

* lint

* some annotations

* commit before merge

* adds logic for chaining and when parameters change

* adds passing tests, adds args for sample_size, num_steps in g_iterator

* adds passing tests, adds args for sample_size, num_steps in g_iterator

* more type hints

* lint

* mypy checks

* mypy checks more

* mypy checks more

* lazy imports

* trange

* trial 7 none(s)

* embed outside minimal test

* unittest min dep required

* unittest min dep required

* small comments for later

* fixes score issues over train_idx that were expand_dims in error prone way

* lint

* empty

* typo

* infra(adds ai-embed test hook into ci gha)

* infra(adds bin/test-embed.sh)

* infra(adds sphinx nitpick)

* adds README and CHANGELOG

* adds README and CHANGELOG

* adds README and CHANGELOG

* feat(adds `anomalous` flag to score low confidence edges, updates readme)

* feat(adds default KG args to PlotterBase)

* fix(removes pd.Series as it is not needed, lint)

* docs(changelog): rgcn

* refactor(mypy): reducing type: ignore count from 47 -> 19

* perf: scalable predict_links_all, some cleanup of old funcs, migrating gcn_node_embeddings to property

* ci: adding tqdm-stubs to setup.py

* feat(streamlines predict code)

* feat: New inference api with targeted source, relation and destination arguments

* fix: mypy checks in predict_links method

* fix: predict_links input type changed from pd.Series -> list

* feat(adds factory method for scoring triplets)

* fix(adds test given refactor, and CHANGELOG public methods)

* feat(adds RED team hunt UMAP notebook for simplified outlier detection and alert volume reduction)

* feat(handles returning dataframe as flag)

* feat(handles returning dataframe as flag)

* feat(sorts scored triplets)

* adds more README

* updates readme

* lint

* fix: some mypy-pandas typecheck fix

* fix: some mypy-pandas typecheck fix

* fix: some mypy-pandas typecheck fix

* fix: some mypy-pandas typecheck fix

* Readme

* adds tests, README

* adds tests

* removing some comments

* fix: mypy fix List[str] -> List

* fix: mypy fix

* adds changes to demo given new api changes. Adds logging in networks

* demo notebook change to reflect new api

* changes name of notebook

* updates networks.py from heteroembed branch (which passes lint)

* black reformattingg

* merges feature_utils from heteroembed, adds linting changes

* lint

* linting hyper_dask.py

* lint

* adds `get_features_by_cols`, updates CHANGELOG and README, and small change in features.py

* lint

* lint

* feat(adds conditional prob): for some reason this was not on branch...

* feat(adds separate mixin for conditional.py methods)

* lint

* lint

* lint

* lint

* changes compute import in plotter.py

* changelog.md

* sphinx adds conditional.ConditionMixin

* typo

* feat(adds tests for conditional.py)

* feat(adds tests for conditional.py)

* lint

* test

* test

* test

* Update CHANGELOG.md

* docs(ModelDict): main example

* docs(hackernew)

* docs(more hnews)

* doc(ask hacker news demo)

* doc(ask hacker news demo)

* changes(keywords in setup, HackerNews demo)

* Delete cyber-fraud-umap-demo.ipynb

Renamed but it didn't delete on remote.

* mypi

* mypi

* adds type ignore

* adds type ignore

* comments out test

* removes test

* adds working changes from ai_demos branch for single file

* fix(tests): sso_login tests no longer tolerate unexpected exns

* garden(sso): clearer unexpected exn msg

* docs(changelog): sso fixes

* fix(tests): reenable test_hyper_evil

* garden(tests): print veresion of mypy, pandas, numpy

* adds docstrings

* adds docstrings

* adds docstrings and lint

* lint

* lint

* fix(ci): tolerate hypergraph evil warning

* fix(ci): redo warning supression

Co-authored-by: Alex <[email protected]>
Co-authored-by: tanmoyio <[email protected]>
Co-authored-by: Alex Morrise <[email protected]>
Co-authored-by: Tanmoy Sarkar <[email protected]>
5 people authored Dec 23, 2022
1 parent 980923d commit c6ece09
Showing 36 changed files with 13,171 additions and 376 deletions.
8 changes: 7 additions & 1 deletion .github/workflows/ci.yml
@@ -157,7 +157,7 @@ jobs:
source pygraphistry/bin/activate
./bin/test-umap-learn-core.sh
test-full-umap:
test-full-ai:

needs: [ test-minimal-python ]
runs-on: ubuntu-latest
@@ -209,6 +209,12 @@ jobs:
source pygraphistry/bin/activate
./bin/test-umap-learn-core.sh
- name: Full embed tests (rich featurize)
run: |
source pygraphistry/bin/activate
./bin/test-embed.sh
test-neo4j:

needs: [ test-minimal-python ]
17 changes: 16 additions & 1 deletion CHANGELOG.md
@@ -7,11 +7,26 @@ This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.htm

## [Development]

### Added
* AI: Easy import of featurization kwargs for `g.umap(**kwargs)` and `g.featurize(**kwargs)`
* AI: `g.get_features_by_cols` returns the featurized submatrix whose column names contain `col_part`
* AI: `g.conditional_graph` and `g.conditional_probs` for computing conditional probabilities and the corresponding graph
* AI Demos folder: OSINT, CYBER demos
* AI: Full text & semantic search (`g.search(..)` and `g.search_graph(..).plot()`)
* AI: Featurization: support for dataframe columns that are list of lists -> multilabel targets
set using `g.featurize(y=['list_of_lists_column'], multilabel=True,...)`
* AI: `g.embed(..)` code for fast knowledge graph embedding (2-layer RGCN) and its usage for link scoring and prediction
* AI: Exposes public methods `g.predict_links(..)` and `g.predict_links_all()`
* AI: automatic naming of graphistry objects during `g.search_graph(query)` -> `g._name = query`
* AI: RGCN demos - Infosec Jupyterthon 2022, SSH anomaly detection

### Fixed

* GIB: Add missing import during group-in-a-box cudf layout of 0-degree nodes
* Tests: SSO login tests catch more unexpected exns

## [0.28.6 - 2022-29-22]

## [0.28.6 - 2022-11-29]

### Added

93 changes: 87 additions & 6 deletions README.md
@@ -358,15 +358,41 @@ Automatically and intelligently transform text, numbers, booleans, and other for
g = g.umap() # UMAP, GNNs, use features if already provided, otherwise will compute

# other pydata libraries
X = g._node_features
y = g._node_target
X = g._node_features # g._get_feature('nodes')
y = g._node_target # g._get_target('nodes')
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor().fit(X, y) #assumes train/test split
new_df = pandas.read_csv(...)
X_new, _ = g.transform(new_df, None, kind='nodes')
preds = model.predict(X_new)
```

* Encode model definitions and compare models against each other

```python
# graphistry
from graphistry.features import search_model, topic_model, ngrams_model, ModelDict, default_featurize_parameters

g = graphistry.nodes(df)
g2 = g.umap(X=[..], y=[..], **search_model)

# set custom encoding model with any feature kwargs
new_model = ModelDict(message='encoding new model parameters is easy', **default_featurize_parameters)
new_model.update(dict(
    y=[...],
    kind='edges',
    model_name='sbert/hf/a_cool_transformer_model',
    use_scaler_target='kbins',
    n_bins=11,
    strategy='normal'))
print(new_model)

g3 = g.umap(X=[..], **new_model)
# compare g2 vs g3 or add to different pipelines
# ...
```


See `help(g.featurize)` for more options
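
The preset pattern above (start from shared defaults, override a few kwargs, then compare variants) can be sketched with plain dictionaries. This is a minimal illustration only: `BASE_PARAMS`, `make_preset`, and `diff_presets` are hypothetical names that are not part of graphistry, and the default values shown are placeholders rather than the library's real defaults.

```python
from copy import deepcopy

# Hypothetical defaults, for illustration only; the real ones are whatever
# graphistry.features.default_featurize_parameters contains.
BASE_PARAMS = {"kind": "nodes", "model_name": "paraphrase-MiniLM-L6-v2", "n_bins": 10}


def make_preset(base, **overrides):
    """Return a new kwargs dict: a copy of `base` with `overrides` applied."""
    preset = deepcopy(base)  # never mutate the shared defaults
    preset.update(overrides)
    return preset


def diff_presets(a, b):
    """Map each key whose value differs to its (a, b) pair of values."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


edge_preset = make_preset(BASE_PARAMS, kind="edges", n_bins=11)
print(diff_presets(BASE_PARAMS, edge_preset))
```

Copying before updating keeps one preset from leaking into another, which matters when several pipelines are built from the same defaults and compared side by side.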

### [sklearn-based UMAP](https://umap-learn.readthedocs.io/en/latest/), [cuML-based UMAP](https://docs.rapids.ai/api/cuml/stable/api.html?highlight=umap#cuml.UMAP)
@@ -450,16 +476,18 @@ GNN support is rapidly evolving, please contact the team directly or on Slack fo
g = graphistry.nodes(ndf, 'node').edges(edf, 'src', 'dst')

g2 = g.featurize(X = ['text_col_1', .., 'text_col_n'], kind='nodes',
min_words=0, # forces all named columns as textual ones
#encode text as paraphrase embeddings, supports any sbert/Huggingface model
model_name: str = "paraphrase-MiniLM-L6-v2")
min_words = 0, # forces all named columns as textual ones
#encode text as paraphrase embeddings, supports any sbert model
model_name = "paraphrase-MiniLM-L6-v2")

results_df, query_vector = g2.search('my natural language query', ...)
print(results_df[['distance', 'text_col_1', ..., 'text_col_n']]) #sorted by relevancy

print(results_df[['_distance', 'text_col_1', ..., 'text_col_n']]) #sorted by relevancy

# or see graph of matching entities and similarity edges (or optional original edges)
g2.search_graph('my natural language query', ...).plot()
```


* If edges are not given, `g.umap(..)` will supply them:

@@ -473,6 +501,59 @@ GNN support is rapidly evolving, please contact the team directly or on Slack fo

See `help(g.search_graph)` for options
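
To make the "sorted by relevancy" idea concrete, here is a minimal sketch of ranking rows by distance to a query vector. It assumes cosine distance over toy 3-dimensional vectors; the metric and index that `g.search` actually uses may differ, and `cosine_distance` and `rank_by_distance` are hypothetical helpers, not graphistry functions.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank_by_distance(query_vector, rows, top_n=2):
    """Return (distance, text) pairs for the `top_n` rows closest to the query."""
    scored = [(cosine_distance(query_vector, vec), text) for text, vec in rows]
    return sorted(scored)[:top_n]  # smallest distance first


# Toy "embeddings" (made-up values); a sentence-transformer model would
# produce vectors with hundreds of dimensions.
rows = [
    ("login failure on host A", [0.9, 0.1, 0.0]),
    ("quarterly sales report", [0.0, 0.2, 0.9]),
    ("ssh brute force attempt", [0.8, 0.3, 0.1]),
]
query_vector = [1.0, 0.2, 0.0]

for distance, text in rank_by_distance(query_vector, rows):
    print(f"{distance:.3f}  {text}")
```

Smaller distances mean closer matches, which is why the results are sorted ascending.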

### Knowledge Graph Embeddings

* Train an RGCN model and predict:

```python
import pandas as pd
import graphistry

edf = pd.read_csv('edges.csv')
g = graphistry.edges(edf, 'src', 'dst')  # 'src' / 'dst': names of the edge endpoint columns
g2 = g.embed(relation='relationship_column_of_interest', **kwargs)

# predict links over all nodes
g3 = g2.predict_links_all(threshold=0.95) # score high confidence predicted edges
g3.plot()

# predict over any set of entities and/or relations.
# Set any `source`, `destination` or `relation` to `None` to predict over all of them.
# if all are None, it is better to use `g.predict_links_all` for speed.
g4 = g2.predict_links(source=['entity_k'],
                      relation=['relationship_1', 'relationship_4', ..],
                      destination=['entity_l', 'entity_m', ..],
                      threshold=0.9,  # score threshold
                      return_dataframe=False)  # set to `True` to return dataframe, or just access via `g4._edges`
```

* Detect anomalous behavior (example use cases include cyber, fraud, etc.)

```python
# Score anomalous edges by setting the `anomalous` flag to True and using a low confidence threshold
# (g2 is a graph that has already been trained with `g.embed(..)`)
g5 = g2.predict_links_all(threshold=0.05, anomalous=True)  # score low confidence predicted edges
g5.plot()

g6 = g2.predict_links(source=['ip_address_1', 'user_id_3'],
                      relation=['attempt_logon', 'phishing', ..],
                      destination=['user_id_1', 'active_directory', ..],
                      anomalous=True,
                      threshold=0.05)
g6.plot()
```

* Train an RGCN model including auto-featurized node embeddings

```python
edf = pd.read_csv('edges.csv')
ndf = pd.read_csv('nodes.csv')  # adding node dataframe

g = graphistry.edges(edf, 'src', 'dst').nodes(ndf, 'node_column')

# inherits all the featurization `kwargs` from `g.featurize`
g2 = g.embed(relation='relationship_column_of_interest', use_feat=True, **kwargs)
g2.predict_links_all(threshold=0.95).plot()
```

See `help(g.embed)`, `help(g.predict_links)`, `help(g.predict_links_all)` for options
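
The roles of `threshold` and `anomalous` in the examples above can be sketched as a filter over scored triplets. This is an assumption-laden illustration, not the library's implementation: `select_links` is a hypothetical helper, the scores are made up, and whether the real comparison is inclusive is not stated in the source.

```python
def select_links(scored_triplets, threshold, anomalous=False):
    """Filter (source, relation, destination, score) tuples by score.

    Assumed semantics: anomalous=False keeps score >= threshold (confident
    links); anomalous=True keeps score <= threshold (unlikely links).
    """
    if anomalous:
        kept = [t for t in scored_triplets if t[3] <= threshold]
        return sorted(kept, key=lambda t: t[3])  # least plausible first
    kept = [t for t in scored_triplets if t[3] >= threshold]
    return sorted(kept, key=lambda t: t[3], reverse=True)  # most plausible first


# Made-up scores for illustration; a trained model would supply these.
scored = [
    ("ip_address_1", "attempt_logon", "user_id_1", 0.97),
    ("user_id_3", "phishing", "active_directory", 0.03),
    ("ip_address_1", "phishing", "user_id_1", 0.40),
]

likely = select_links(scored, threshold=0.95)
unusual = select_links(scored, threshold=0.05, anomalous=True)
print(likely)
print(unusual)
```

Under this reading, a high threshold with `anomalous=False` surfaces links the model is confident about, while a low threshold with `anomalous=True` surfaces the edges it finds least plausible.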


### Quickly configurable

15 changes: 15 additions & 0 deletions bin/test-embed.sh
@@ -0,0 +1,15 @@
#!/bin/bash
set -ex

# Run from project root
# - Args get passed to pytest phase
# Non-zero exit code on fail

# Assume [umap-learn,test]

python -m pytest --version

python -B -m pytest -vv \
graphistry/tests/test_embed_utils.py

#chmod +x bin/test-embed.sh
1 change: 1 addition & 0 deletions bin/test-minimal.sh
@@ -18,3 +18,4 @@ python -B -m pytest -vv \
--ignore=graphistry/tests/test_feature_utils.py \
--ignore=graphistry/tests/test_umap_utils.py \
--ignore=graphistry/tests/test_dgl_utils.py \
--ignore=graphistry/tests/test_embed_utils.py \