In this notebook, we’ll use neo4j, a Graph Database to efficiently store and retrieve graph data, and perform some interesting analysis around superheroes.
neo4j Cheatsheet: https://neo4j.com/docs/cypher-refcard/current/
http://www2.ece.rochester.edu/~gmateosb/ECE442/Slides/block_2_graphs_inference_review_part_a.pdf
Relational databases are collection of records (aka row or tuple) all having the same structure that are contained in table. In the following image, we have an Employees and a Departments table and a Dept_Members (many-to-many table) that link these two tables together:
As a graph the above relationship looks like this:
In a Graph database, instead of records and tables, we have nodes with labels (or attributes). These are explicitly linked together through relationships. In Graph Theory these are called nodes and edges.
A SQL Query to link these tables together might look like this:
SELECT name FROM Person
LEFT JOIN Person_Department
ON Person.Id = Person_Department.PersonId
LEFT JOIN Department
ON Department.Id = Person_Department.DepartmentId
WHERE Department.name = "IT Department"
Where the corresponding Graph Query Language (neo4j's language is called Cypher), looks like this:
MATCH (p:Person)-[:BELONGS_TO]->(d:Dept)
WHERE d.name = "IT Department"
RETURN p.name
Download and install neo4j community edition from https://neo4j.com/download-center/#releases. Unzip the package and start neo4j.
Instructions below should work for Mac OS.
navigate to the directory where neo4j was downloaded (~/Downloads) and run the following commands:
tar xvzf neo4j-community-3.5.3-unix.tar.gz
cd neo4j-community-3.5.3
bin/neo4j start
Browse to http://localhost:7474/. In the browser window, connect to the database (default username: neo4j, default password: neo4j), change your password.
If you have never used neo4J before, it is recommended that you follow:
Online Training: Introduction to Graph Databases with Neo4j https://neo4j.com/graphacademy/online-training/introduction-graph-databases/
Now that you’re comfortable with neo4j, we’re going to use it to answer questions about the Marvel Comics Universe. First, we need to clean up our database and load the new data.
Clear the graph
MATCH (n)
OPTIONAL MATCH (n)-[r]-()
DELETE n,r
Load the nodes
LOAD CSV FROM "http://portal.nersc.gov/project/m342/datahack-dw/hero_nodes.csv" AS line
MERGE (:Hero {name:line[0] , degree: toInt(line[1])})
Create an Index on Hero Names
CREATE INDEX ON :Hero(name)
Load the edges
LOAD CSV FROM
"http://portal.nersc.gov/project/m342/datahack-dw/hero_edges.csv" AS line
MATCH (u:Hero {name:line[0]})
MATCH (v:Hero {name:line[1]})
CREATE UNIQUE (u) -[:APPEARED { w: toInt(line[2])}]-> (v)
MATCH (spiderman:Hero)
WHERE spiderman.name STARTS WITH "SPIDER"
RETURN spiderman
MATCH p=(peter:Hero {name: 'SPIDER-MAN/PETER PAR'})-[:APPEARED*0..2]-(logan:Hero {name: 'WOLVERINE/LOGAN'})
RETURN p, length(p)
ORDER BY length(p)
LIMIT 1
MATCH (peter:Hero { name: 'SPIDER-MAN/PETER PAR' })-[:APPEARED*2..2]-(friend_of_friend)
WHERE NOT (peter)-[:APPEARED]-(friend_of_friend)
AND friend_of_friend.name <> 'SPIDER-MAN/PETER PAR'
RETURN friend_of_friend.name, COUNT(*)
ORDER BY COUNT(*) DESC , friend_of_friend.name
MATCH (peter:Hero { name: 'SPIDER-MAN/PETER PAR' })-[:APPEARED*2..2]-(friend_of_friend)
WHERE NOT (peter)-[:APPEARED]-(friend_of_friend)
AND friend_of_friend.name <> 'SPIDER-MAN/PETER PAR'
RETURN friend_of_friend
LIMIT 20
MATCH (tony:Hero {name:'IRON MAN/TONY STARK'}) -[e:APPEARED]-> (other) <-[f:APPEARED]- (donald:Hero {name:'THOR/DR. DONALD BLAK'})
RETURN other
ORDER BY e.w DESC, f.w DESC
LIMIT 5
Modify the above templates to find answers to the following questions
- What is the shortest path between DR. STRANGE and DR. DOOM?
- List the 5 shortest paths between DR. STRANGE and DR. DOOM
- List 5 Friends of Friends with the most connections and COLOSSUS II.
- Visualize 10 Friends of friends for IRON MAN
- Discover how the Avengers grew over time from 5 to 10. Find team members starting with 5 and incrementing to 10. Who was added to the team? Is the resulting graph ever not fully connected?
You can also connect to neo4j with python (and other languages). Following is an example of connecting to your local neo4j instance with the loaded Marvel data.
Note: You'll need to update the password in the code to get this to work.
Neo4j Python developer page: https://neo4j.com/developer/python/
API Manual: https://neo4j.com/docs/driver-manual/1.7/
You'll likely not have the neo4j drivers installed in your Python instance. To install the Python3 libraries for neo4j using pip, execute the following command at the command line:
pip3 install neo4j
Once you have the noe4j library installed, and updated the 4th line in the below block to have the correct password, you can run the following code to query for all the friends of the the Marvel character Cassiopeia.
from neo4j import GraphDatabase
user = 'neo4j'
password = 'neo4j' # You'll need to put in your password you setI think
uri = 'bolt://localhost:7687'
driver = GraphDatabase.driver(uri, auth=(user, password))
def print_friends_of(tx, name):
for record in tx.run('MATCH (peter:Hero { name: {name} })-->(friend) '
'WHERE friend.name <> {name} '
'RETURN friend, COUNT(*) '
'ORDER BY COUNT(*) DESC, friend.name', name=name):
print(record['friend']['name'])
with driver.session() as session:
session.read_transaction(print_friends_of, 'CASSIOPEIA')
driver.close()