Commit

Merge pull request #518 from graphistry/dev/fix-hop
fix(hop)
lmeyerov authored Dec 5, 2023
2 parents 238e9d0 + 0ca28cf commit 32f4d14
Showing 31 changed files with 4,684 additions and 128 deletions.
35 changes: 35 additions & 0 deletions CHANGELOG.md
@@ -7,12 +7,47 @@ This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.htm

## [Development]

## [0.30.0 - 2023-12-04]

### Added

* Neptune: Can now use PyGraphistry OpenCypher/BOLT bindings with Neptune, in addition to existing Gremlin bindings
* chain/hop: `is_in()` membership predicate, `.chain([ n({'type': is_in(['a', 'b'])}) ])`
* hop: optional df queries - `hop(..., source_node_query='...', edge_query='...', destination_node_query='...')`
* chain: optional df queries:
  - `chain([n(query='...')])`
  - `chain([e_forward(..., source_node_query='...', edge_query='...', destination_node_query='...')])`
* `ASTPredicate` base class for filter matching
* Additional predicates for hop and chain match expressions:
  - categorical: is_in (example above), duplicated
  - temporal: is_month_start, is_month_end, is_quarter_start, is_quarter_end, is_year_start, is_year_end, is_leap_year
  - numeric: gt, lt, ge, le, eq, ne, between, isna, notna
  - str: contains, startswith, endswith, match, isnumeric, isalpha, isdigit, islower, isupper, isspace, isalnum, isdecimal, istitle, isnull, notnull

### Fixed

* chain/hop: `source_node_match` was mishandled when multiple node attributes exist
* chain: backwards validation pass was too permissive; add `target_wave_front` check
* hop: multi-hops with `source_node_match` specified were not checking intermediate hops
* hop: multi-hop reverse validation was mishandling intermediate nodes
* compute logging no longer default-overrides level to DEBUG

### Infra

* Docker tests support LOG_LEVEL

### Changed

* refactor: move `is_in`, `IsIn` implementations to `graphistry.ast.predicates`; old imports preserved
* `IsIn` now implements `ASTPredicate`
* Refactor: use `setup_logger(__name__)` more consistently instead of `logging.getLogger(__name__)`
* Refactor: drop unused imports
* Redo `setup_logger()` to activate formatted stream handler iff verbose / LOG_LEVEL

### Docs

* hop/chain: new query and predicate forms
* hop/chain graph pattern mining tutorial: [ipynb demo](demos/more_examples/graphistry_features/hop_and_chain_graph_pattern_mining.ipynb)
* Neptune: Initial tutorial for using PyGraphistry with Amazon Neptune's OpenCypher/BOLT bindings

## [0.29.7 - 2023-11-02]
118 changes: 111 additions & 7 deletions README.md
@@ -147,6 +147,25 @@ It is easy to turn arbitrary data into insightful graphs. PyGraphistry comes wit
g2.plot()
```

* Cypher-style graph pattern mining queries on dataframes ([ipynb demo](demos/more_examples/graphistry_features/hop_and_chain_graph_pattern_mining.ipynb))

Run Cypher-style graph queries natively on dataframes without going to a database or Java:

```python
from graphistry import n, e_undirected, is_in

g2 = g.chain([
    n({'user': 'Biden'}),
    e_undirected(),
    n(name='bridge'),
    e_undirected(),
    n({'user': is_in(['Trump', 'Obama'])})
])

print('# bridges', len(g2._nodes[g2._nodes.bridge]))
g2.plot()
```

* [Spark](https://spark.apache.org/)/[Databricks](https://databricks.com/) ([ipynb demo](demos/demos_databases_apis/databricks_pyspark/graphistry-notebook-dashboard.ipynb), [dbc demo](demos/demos_databases_apis/databricks_pyspark/graphistry-notebook-dashboard.dbc))

```python
@@ -1073,7 +1092,7 @@ g.addStyle(logo={
The methods below let you quickly manipulate graphs, both directly and via dataframe methods, to search, pattern mine, transform, and more:

```python
from graphistry import n, e_forward, e_reverse, e_undirected
from graphistry import n, e_forward, e_reverse, e_undirected, is_in
g = (graphistry
.edges(pd.DataFrame({
's': ['a', 'b'],
@@ -1101,21 +1120,53 @@ g2.plot() # nodes are values from cols s, d, k1
.hop( # filter to subgraph
#almost all optional
direction='forward', # 'reverse', 'undirected'
hops=1, # number or None if to_fixed_point
hops=2, # number (1..n hops, inclusive) or None if to_fixed_point
to_fixed_point=False,
source_node_match={"k2": 0},

#every edge source node must match these
source_node_match={"k2": 0, "k3": is_in(['a', 'b', 3, 4])},
source_node_query='k2 == 0',

#every edge must match these
edge_match={"k1": "x"},
destination_node_match={"k2": 2})
edge_query='k1 == "x"',

#every edge destination node must match these
destination_node_match={"k2": 2},
destination_node_query='k2 == 2 or k2 == 4',
)
.chain([ # filter to subgraph
n(),
n({'k2': 0}),
n({'k2': 0, "m": 'ok'}), #specific values
n({'type': is_in(["type1", "type2"])}), #multiple valid values
n(query='k2 == 0 or k2 == 4'), #dataframe query
n(name="start"), # add column 'start':bool
e_forward({'k1': 'x'}, hops=1), # same API as hop()
e_undirected(name='second_edge'),
e_reverse(
{'k1': 'x'}, # edge property match
hops=2, # 1 to 2 hops
#same API as hop()
source_node_match={"k2": 2},
source_node_query='k2 == 2 or k2 == 4',
edge_match={"k1": "x"},
edge_query='k1 == "x"',
destination_node_match={"k2": 0},
destination_node_query='k2 == 0')
])
# replace as one node the node w/ given id + transitively connected nodes w/ col=attr
.collapse(node='some_id', column='some_col', attribute='some val')
```
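The `hops` and `to_fixed_point` parameters above can be read as a bounded breadth-first expansion. The following is a minimal pure-Python sketch of that reading, assuming forward direction only and no match or query filters. The function name `hop_sketch` and the edge-list representation are hypothetical and are not part of PyGraphistry, whose implementation operates on dataframes.

```python
from typing import Hashable, Iterable, List, Optional, Sequence, Tuple

Edge = Tuple[Hashable, Hashable]  # (source, destination)


def hop_sketch(
    edges: Sequence[Edge],
    seeds: Iterable[Hashable],
    hops: Optional[int] = 1,
    to_fixed_point: bool = False,
) -> List[Edge]:
    """Collect edges reachable from `seeds` in 1..hops forward steps.

    Illustrative only: a plain breadth-first expansion over an edge list.
    """
    unbounded = to_fixed_point or hops is None
    frontier = set(seeds)   # nodes to expand on the next step
    expanded = set()        # nodes already used as a source
    kept: List[Edge] = []
    steps = 0
    while frontier and (unbounded or steps < hops):
        step_edges = [(s, d) for (s, d) in edges if s in frontier]
        for edge in step_edges:
            if edge not in kept:
                kept.append(edge)
        expanded |= frontier
        # Only expand newly reached nodes, so cycles terminate.
        frontier = {d for (_, d) in step_edges} - expanded
        steps += 1
    return kept


sample_edges = [('a', 'b'), ('b', 'c'), ('c', 'd'), ('x', 'y')]
one_hop = hop_sketch(sample_edges, ['a'], hops=1)
two_hops = hop_sketch(sample_edges, ['a'], hops=2)
everything = hop_sketch(sample_edges, ['a'], to_fixed_point=True)
```

Here `two_hops` includes the edges found at hop 1 as well as hop 2, which mirrors the "1..n hops, inclusive" comment in the listing above.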

The match dictionary expressions of both `hop()` and `chain()` support dataframe series *predicates*. The examples above show `is_in([x, y, z, ...])`. Additional predicates include:

* categorical: is_in, duplicated
* temporal: is_month_start, is_month_end, is_quarter_start, is_quarter_end, is_year_start, is_year_end
* numeric: gt, lt, ge, le, eq, ne, between, isna, notna
* string: contains, startswith, endswith, match, isnumeric, isalpha, isdigit, islower, isupper, isspace, isalnum, isdecimal, istitle, isnull, notnull
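To make the match-dictionary semantics concrete, here is a small pure-Python sketch, assuming that every key in the dictionary must match (a conjunction) and that a value is either a literal compared by equality or a predicate applied to the column value. The helpers `is_in_sketch`, `gt_sketch`, and `filter_rows` are hypothetical stand-ins written for illustration; they are not the `graphistry` predicates, which work on dataframe series.

```python
from typing import Any, Callable, Dict, Iterable, List

Row = Dict[str, Any]
Predicate = Callable[[Any], bool]


def is_in_sketch(options: Iterable[Any]) -> Predicate:
    """Hypothetical stand-in for a membership predicate."""
    allowed = list(options)
    return lambda value: value in allowed


def gt_sketch(threshold: Any) -> Predicate:
    """Hypothetical stand-in for a greater-than predicate."""
    return lambda value: value > threshold


def row_matches(row: Row, match: Dict[str, Any]) -> bool:
    """True when every key in `match` is satisfied by `row`."""
    for column, expected in match.items():
        if column not in row:
            return False
        value = row[column]
        ok = expected(value) if callable(expected) else value == expected
        if not ok:
            return False
    return True


def filter_rows(rows: List[Row], match: Dict[str, Any]) -> List[Row]:
    return [row for row in rows if row_matches(row, match)]


rows = [
    {'n': 'a', 'x': 1, 'y': 5},
    {'n': 'a', 'x': 3, 'y': 5},   # fails x in [1, 2]
    {'n': 'b', 'x': 2, 'y': 9},   # fails n == 'a'
    {'n': 'a', 'x': 2, 'y': 2},   # fails y > 3
]
hits = filter_rows(rows, {'n': 'a', 'x': is_in_sketch([1, 2]), 'y': gt_sketch(3)})
```

Only the first row satisfies all three conditions, so `hits` contains just that row.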



#### Table to graph

```python
@@ -1125,6 +1176,30 @@ g = hg['graph'] # g._edges: | src, dst, user, email, org, time, ... |
g.plot()
```

```python
hg = graphistry.hypergraph(
    df,
    ['from_user', 'to_user', 'email', 'org'],
    direct=True,
    opts={

        # when direct=True, can define src -> [ dst1, dst2, ...] edges
        'EDGES': {
            'org': ['from_user'], # org->from_user
            'from_user': ['email', 'to_user'], #from_user->email, from_user->to_user
        },

        'CATEGORIES': {
            # determine which columns share the same namespace for node generation:
            # - if user 'louie' is both a from_user and to_user, show as 1 node
            # - if a user & org are both named 'louie', they will appear as 2 different nodes
            'user': ['from_user', 'to_user']
        }
    })
g = hg['graph']
g.plot()
```
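The effect of the `EDGES` and `CATEGORIES` options can be sketched in plain Python. The sketch below assumes that each row yields one edge per configured source/destination column pair, and that node identity is the pair of namespace and value. The `category::value` id format and the helper names are assumptions made for illustration, not a statement of how `graphistry.hypergraph()` labels nodes internally.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def node_id(column: str, value: Any, categories: Dict[str, List[str]]) -> str:
    """Namespace a value by its category, falling back to the column name.

    The 'namespace::value' format is a hypothetical choice for this sketch.
    """
    for category, columns in categories.items():
        if column in columns:
            return f"{category}::{value}"
    return f"{column}::{value}"


def direct_edges_sketch(
    rows: List[Row],
    edges_opt: Dict[str, List[str]],
    categories: Dict[str, List[str]],
) -> List[Tuple[str, str]]:
    """Emit one (src, dst) edge per row and per configured column pair."""
    out = []
    for row in rows:
        for src_col, dst_cols in edges_opt.items():
            for dst_col in dst_cols:
                out.append((
                    node_id(src_col, row[src_col], categories),
                    node_id(dst_col, row[dst_col], categories),
                ))
    return out


rows = [
    {'from_user': 'louie', 'to_user': 'ann', 'email': 'e1', 'org': 'louie'},
    {'from_user': 'ann', 'to_user': 'louie', 'email': 'e2', 'org': 'acme'},
]
edges_opt = {'org': ['from_user'], 'from_user': ['email', 'to_user']}
categories = {'user': ['from_user', 'to_user']}

edges = direct_edges_sketch(rows, edges_opt, categories)
nodes = {endpoint for edge in edges for endpoint in edge}
```

In this sample, the user `louie` appears as one node (`user::louie`) whether it came from `from_user` or `to_user`, while the org named `louie` stays a separate node (`org::louie`).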

#### Generate node table

```python
@@ -1162,6 +1237,10 @@ assert 'pagerank' in g2._nodes.columns

#### Graph pattern matching

PyGraphistry supports a PyData-native variant of the popular Cypher graph query language, meaning you can do graph pattern matching directly on Pandas dataframes without installing a database or Java.

See also the [graph pattern matching tutorial](demos/more_examples/graphistry_features/hop_and_chain_graph_pattern_mining.ipynb).

Traverse within a graph, or expand one graph against another

Simple node and edge filtering via `filter_edges_by_dict()` and `filter_nodes_by_dict()`:
@@ -1178,16 +1257,37 @@ Method `.hop()` enables slightly more complicated edge filters:

```python

from graphistry import is_in, gt

# (a)-[{"v": 1, "type": "z"}]->(b) based on g
g2b = g2.hop(
source_node_match={g2._node: "a"},
edge_match={"v": 1, "type": "z"},
destination_node_match={g2._node: "b"})
g2b = g2.hop(
source_node_query='n == "a"',
edge_query='v == 1 and type == "z"',
destination_node_query='n == "b"')

# (a {x in [1,2] and y > 3})-[e]->(b) based on g
g2c = g2.hop(
    source_node_match={
        g2._node: "a",
        "x": is_in([1,2]),
        "y": gt(3)
    },
    destination_node_match={g2._node: "b"}
)

# (a or b)-[1 to 8 hops]->(anynode), based on graph g2
g3 = g2.hop(pd.DataFrame({g2._node: ['a', 'b']}), hops=8)

# (a or b)-[1 to 8 hops]->(anynode), based on graph g2
g3 = g2.hop(pd.DataFrame({g2._node: is_in(['a', 'b'])}), hops=8)

# (c)<-[any number of hops]-(any node), based on graph g3
# Note multihop matches check source/destination/edge match/query predicates
# against every encountered edge for it to be included
g4 = g3.hop(source_node_match={"node": "c"}, direction='reverse', to_fixed_point=True)

# (c)-[incoming or outgoing edge]-(any node),
@@ -1200,10 +1300,12 @@ g5.plot()
Rich compound patterns are enabled via `.chain()`:

```python
from graphistry import n, e_forward, e_reverse, e_undirected
from graphistry import n, e_forward, e_reverse, e_undirected, is_in

g2.chain([ n() ])
g2.chain([ n({"v": 1, "y": True}) ])
g2.chain([ n({"x": 1, "y": True}) ])
g2.chain([ n(query='x == 1 and y == True') ])
g2.chain([ n({"z": is_in([1,2,4,'z'])}) ]) # multiple valid values
g2.chain([ e_forward({"type": "x"}, hops=2) ]) # simple multi-hop
g3 = g2.chain([
n(name="start"), # tag node matches
@@ -1216,6 +1318,8 @@ print('# end nodes: ', len(g3._nodes[ g3._nodes.end ]))
print('# end edges: ', len(g3._edges[ g3._edges.final_edge ]))
```

See the predicate list above for more predicates like `is_in()` and `gt()`.
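One way to picture how a chain of node and edge steps resolves is as a forward pass that grows a wavefront step by step, followed by a backward pass that drops edges not lying on a complete match. The sketch below is a simplified pure-Python illustration under stated assumptions: single forward hops only, node predicates given as plain callables, and a hypothetical `chain_sketch` helper. It is not PyGraphistry's implementation.

```python
from typing import Any, Callable, Dict, Hashable, List, Sequence, Tuple

Edge = Tuple[Hashable, Hashable]
Attrs = Dict[str, Any]
NodePredicate = Callable[[Attrs], bool]


def chain_sketch(
    nodes: Dict[Hashable, Attrs],
    edges: Sequence[Edge],
    node_preds: Sequence[NodePredicate],
) -> List[List[Edge]]:
    """Match n(p0) -> n(p1) -> ... with one forward edge between node steps.

    Returns the surviving edges for each edge step, in pattern order.
    """
    # Forward pass: grow the wavefront one edge step at a time.
    wavefront = {n for n, attrs in nodes.items() if node_preds[0](attrs)}
    step_edges: List[List[Edge]] = []
    for pred in node_preds[1:]:
        hop = [(s, d) for (s, d) in edges if s in wavefront and pred(nodes[d])]
        step_edges.append(hop)
        wavefront = {d for (_, d) in hop}

    # Backward pass: keep only edges that lead to a surviving target node.
    kept: List[List[Edge]] = []
    target = wavefront
    for hop in reversed(step_edges):
        valid = [(s, d) for (s, d) in hop if d in target]
        kept.append(valid)
        target = {s for (s, _) in valid}
    kept.reverse()
    return kept


nodes = {
    'a': {'user': 'Biden'},
    'b': {'user': 'Bridge'},
    'c': {'user': 'Trump'},
    'd': {'user': 'Other'},
    'e': {'user': 'DeadEnd'},
}
edges = [('a', 'b'), ('b', 'c'), ('a', 'e'), ('e', 'd')]
pattern = [
    lambda attrs: attrs['user'] == 'Biden',
    lambda attrs: True,                          # any intermediate node
    lambda attrs: attrs['user'] in ('Trump', 'Obama'),
]
matched = chain_sketch(nodes, edges, pattern)
```

The forward pass initially admits `('a', 'e')`, but the backward pass removes it because `e` never reaches a node matching the final step.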

#### Pipelining

```python