Merge pull request #518 from graphistry/dev/fix-hop

fix(hop)
graphistry · Dec 5, 2023 · 32f4d14 · 32f4d14
2 parents 238e9d0 + 0ca28cf
commit 32f4d14
Show file tree

Hide file tree

Showing 31 changed files with 4,684 additions and 128 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -7,12 +7,47 @@ This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.htm
 
 ## [Development]
 
+## [0.30.0 - 2023-12-04]
+
 ### Added
 
 * Neptune: Can now use PyGraphistry OpenCypher/BOLT bindings with Neptune, in addition to existing Gremlin bindings
+* chain/hop: `is_in()` membership predicate, `.chain([ n({'type': is_in(['a', 'b'])}) ])`
+* hop: optional df queries - `hop(..., source_node_query='...', edge_query='...', destination_node_query='...')`
+* chain: optional df queries:
+  - `chain([n(query='...')])`
+  - `chain([e_forward(..., source_node_query='...', edge_query='...', destination_node_query='...')])`
+* `ASTPredicate` base class for filter matching
+* Additional predicates for hop and chain match expressions:
+  - categorical: is_in (example above), duplicated
+  - temporal: is_month_start, is_month_end, is_quarter_start, is_quarter_end, is_year_start, is_year_end, is_leap_year
+  - numeric: gt, lt, ge, le, eq, ne, between, isna, notna
+  - str: contains, startswith, endswith, match, isnumeric, isalpha, isdigit, islower, isupper, isspace, isalnum, isdecimal, istitle, isnull, notnull
+
+### Fixed
+
+* chain/hop: source_node_match was being mishandled when multiple node attributes exist
+* chain: backwards validation pass was too permissive; add `target_wave_front` check`
+* hop: multi-hops with `source_node_match` specified was not checking intermediate hops
+* hop: multi-hops reverse validation was mishandling intermediate nodes
+* compute logging no longer default-overrides level to DEBUG
+
+### Infra
+
+* Docker tests support LOG_LEVEL
+
+### Changed
+
+* refactor: move `is_in`, `IsIn` implementations to `graphistry.ast.predicates`; old imports preserved
+* `IsIn` now implements `ASTPredicate`
+* Refactor: use `setup_logger(__name__)` more consistently instead of `logging.getLogger(__name__)`
+* Refactor: drop unused imports
+* Redo `setup_logger()` to activate formatted stream handler iff verbose / LOG_LEVEL
 
 ### Docs
 
+* hop/chain: new query and predicate forms
+* hop/chain graph pattern mining tutorial: [ipynb demo](demos/more_examples/graphistry_features/hop_and_chain_graph_pattern_mining.ipynb)
 * Neptune: Initial tutorial for using PyGraphistry with Amazon Neptune's OpenCypher/BOLT bindings
 
 ## [0.29.7 - 2023-11-02]

diff --git a/README.md b/README.md
@@ -147,6 +147,25 @@ It is easy to turn arbitrary data into insightful graphs. PyGraphistry comes wit
     g2.plot()
     ```
 
+* Cypher-style graph pattern mining queries on dataframes ([ipynb demo](demos/more_examples/graphistry_features/hop_and_chain_graph_pattern_mining.ipynb))
+
+  Run Cypher-style graph queries natively on dataframes without going to a database or Java:
+
+    ```python
+    from graphistry import n, e_undirected, is_in
+
+    g2 = g.chain([
+      n({'user': 'Biden'}),
+      e_undirected(),
+      n(name='bridge'),
+      e_undirected(),
+      n({'user': is_in(['Trump', 'Obama'])})
+    ])
+
+    print('# bridges', len(g2._nodes[g2._nodes.bridge]))
+    g2.plot()
+    ```
+
 * [Spark](https://spark.apache.org/)/[Databricks](https://databricks.com/) ([ipynb demo](demos/demos_databases_apis/databricks_pyspark/graphistry-notebook-dashboard.ipynb), [dbc demo](demos/demos_databases_apis/databricks_pyspark/graphistry-notebook-dashboard.dbc))
 
     ```python
@@ -1073,7 +1092,7 @@ g.addStyle(logo={
 The below methods let you quickly manipulate graphs directly and with dataframe methods: Search, pattern mine, transform, and more:
 
 ```python
-from graphistry import n, e_forward, e_reverse, e_undirected
+from graphistry import n, e_forward, e_reverse, e_undirected, is_in
 g = (graphistry
   .edges(pd.DataFrame({
     's': ['a', 'b'],
@@ -1101,21 +1120,53 @@ g2.plot() # nodes are values from cols s, d, k1
   .hop( # filter to subgraph
     #almost all optional
     direction='forward', # 'reverse', 'undirected'
-    hops=1, # number or None if to_fixed_point
+    hops=2, # number (1..n hops, inclusive) or None if to_fixed_point
     to_fixed_point=False, 
-    source_node_match={"k2": 0},
+
+    #every edge source node must match these
+    source_node_match={"k2": 0, "k3": is_in(['a', 'b', 3, 4])},
+    source_node_query='k2 == 0',
+
+    #every edge must match these
     edge_match={"k1": "x"},
-    destination_node_match={"k2": 2})
+    edge_query='k1 == "x"',
+
+    #every edge destination node must match these
+    destination_node_match={"k2": 2},
+    destination_node_query='k2 == 2 or k2 == 4',
+  )
   .chain([ # filter to subgraph
     n(),
-    n({'k2': 0}),
+    n({'k2': 0, "m": 'ok'}), #specific values
+    n({'type': is_in(["type1", "type2"])}), #multiple valid values
+    n(query='k2 == 0 or k2 == 4'), #dataframe query
     n(name="start"), # add column 'start':bool
     e_forward({'k1': 'x'}, hops=1), # same API as hop()
     e_undirected(name='second_edge'),
+    e_reverse(
+      {'k1': 'x'}, # edge property match
+      hops=2, # 1 to 2 hops
+      #same API as hop()
+      source_node_match={"k2": 2},
+      source_node_query='k2 == 2 or k2 == 4',
+      edge_match={"k1": "x"},
+      edge_query='k1 == "x"',
+      destination_node_match={"k2": 0},
+      destination_node_query='k2 == 0')
   ])
+  # replace as one node the node w/ given id + transitively connected nodes w/ col=attr
   .collapse(node='some_id', column='some_col', attribute='some val')
 ```
 
+Both `hop()` and `chain()` match dictionary expressions support dataframe series *predicates*. The above examples show `is_in([x, y, z, ...])`. Additional predicates include:
+
+* categorical: is_in, duplicated
+* temporal: is_month_start, is_month_end, is_quarter_start, is_quarter_end, is_year_start, is_year_end
+* numeric: gt, lt, ge, le, eq, ne, between, isna, notna
+* string: contains, startswith, endswith, match, isnumeric, isalpha, isdigit, islower, isupper, isspace, isalnum, isdecimal, istitle, isnull, notnull
+
+
+
 #### Table to graph
 
 ```python
@@ -1125,6 +1176,30 @@ g = hg['graph']  # g._edges: | src, dst, user, email, org, time, ... |
 g.plot()
 ```
 
+```python
+hg = graphistry.hypergraph(
+  df,
+  ['from_user', 'to_user', 'email', 'org'],
+  direct=True,
+  opts={
+
+   # when direct=True, can define src -> [ dst1, dst2, ...] edges
+  'EDGES': {
+    'org': ['from_user'], # org->from_user
+    'from_user': ['email', 'to_user'],  #from_user->email, from_user->to_user
+  },
+
+  'CATEGORIES': {
+    # determine which columns share the same namespace for node generation:
+    # - if user 'louie' is both a from_user and to_user, show as 1 node
+    # - if a user & org are both named 'louie', they will appear as 2 different nodes
+    'user': ['from_user', 'to_user']
+  }
+})
+g = hg['graph']
+g.plot()
+```
+
 #### Generate node table
 
 ```python
@@ -1162,6 +1237,10 @@ assert 'pagerank' in g2._nodes.columns
 
 #### Graph pattern matching
 
+PyGraphistry supports a PyData-native variant of the popular Cypher graph query language, meaning you can do graph pattern matching directly from Pandas dataframes without installing a database or Java
+
+See also [graph pattern matching tutorial](demos/more_examples/graphistry_features/hop_and_chain_graph_pattern_mining.ipynb)
+
 Traverse within a graph, or expand one graph against another
 
 Simple node and edge filtering via `filter_edges_by_dict()` and `filter_nodes_by_dict()`:
@@ -1178,16 +1257,37 @@ Method `.hop()` enables slightly more complicated edge filters:
 
 ```python
 
+from graphistry import is_in, gt
+
 # (a)-[{"v": 1, "type": "z"}]->(b) based on g
 g2b = g2.hop(
   source_node_match={g2._node: "a"},
   edge_match={"v": 1, "type": "z"},
   destination_node_match={g2._node: "b"})
+g2b = g2.hop(
+  source_node_query='n == "a"',
+  edge_query='v == 1 and type == "z"',
+  destination_node_query='n == "b"')
+
+# (a {x in [1,2] and y > 3})-[e]->(b) based on g
+g2c = g2.hop(
+  source_node_match={
+    g2._node: "a",
+    "x": is_in([1,2]),
+    "y": gt(3)
+  },
+  destination_node_match={g2._node: "b"})
+)
 
 # (a or b)-[1 to 8 hops]->(anynode), based on graph g2
 g3 = g2.hop(pd.DataFrame({g2._node: ['a', 'b']}), hops=8)
 
+# (a or b)-[1 to 8 hops]->(anynode), based on graph g2
+g3 = g2.hop(pd.DataFrame({g2._node: is_in(['a', 'b'])}), hops=8)
+
 # (c)<-[any number of hops]-(any node), based on graph g3
+# Note multihop matches check source/destination/edge match/query predicates 
+# against every encountered edge for it to be included
 g4 = g3.hop(source_node_match={"node": "c"}, direction='reverse', to_fixed_point=True)
 
 # (c)-[incoming or outgoing edge]-(any node),
@@ -1200,10 +1300,12 @@ g5.plot()
 Rich compound patterns are enabled via `.chain()`:
 
 ```python
-from graphistry import n, e_forward, e_reverse, e_undirected
+from graphistry import n, e_forward, e_reverse, e_undirected, is_in
 
 g2.chain([ n() ])
-g2.chain([ n({"v": 1, "y": True}) ])
+g2.chain([ n({"x": 1, "y": True}) ]),
+g2.chain([ n(query='x == 1 and y == True') ]),
+g2.chain([ n({"z": is_in([1,2,4,'z'])}) ]), # multiple valid values
 g2.chain([ e_forward({"type": "x"}, hops=2) ]) # simple multi-hop
 g3 = g2.chain([
   n(name="start"),  # tag node matches
@@ -1216,6 +1318,8 @@ print('# end nodes: ', len(g3._nodes[ g3._nodes.end ]))
 print('# end edges: ', len(g3._edges[ g3._edges.final_edge ]))
 ```
 
+See table above for more predicates like `is_in()` and `gt()`
+
 #### Pipelining
 
 ```python