Data Warehouse Hack Part 2

neo4j - Introduction

In this notebook, we’ll use neo4j, a Graph Database to efficiently store and retrieve graph data, and perform some interesting analysis around superheroes.

neo4j Cheatsheet: https://neo4j.com/docs/cypher-refcard/current/

Graph Theory Review

http://www2.ece.rochester.edu/~gmateosb/ECE442/Slides/block_2_graphs_inference_review_part_a.pdf

Concepts: Relational to Graph

Relational databases are collection of records (aka row or tuple) all having the same structure that are contained in table. In the following image, we have an Employees and a Departments table and a Dept_Members (many-to-many table) that link these two tables together:

Relational Model

As a graph the above relationship looks like this:

Relational Model as Graph

In a Graph database, instead of records and tables, we have nodes with labels (or attributes). These are explicitly linked together through relationships. In Graph Theory these are called nodes and edges.

Graph

A SQL Query to link these tables together might look like this:

SELECT name FROM Person
LEFT JOIN Person_Department
  ON Person.Id = Person_Department.PersonId
LEFT JOIN Department
  ON Department.Id = Person_Department.DepartmentId
WHERE Department.name = "IT Department"

Where the corresponding Graph Query Language (neo4j's language is called Cypher), looks like this:

MATCH (p:Person)-[:BELONGS_TO]->(d:Dept)
WHERE d.name = "IT Department"
RETURN p.name

From: https://neo4j.com/developer/graph-db-vs-rdbms/

Step 1. Download and install Neo4J

Download and install neo4j community edition from https://neo4j.com/download-center/#releases. Unzip the package and start neo4j.

Instructions below should work for Mac OS.

navigate to the directory where neo4j was downloaded (~/Downloads) and run the following commands:

tar xvzf neo4j-community-3.5.3-unix.tar.gz
cd neo4j-community-3.5.3
bin/neo4j start

Browse to http://localhost:7474/. In the browser window, connect to the database (default username: neo4j, default password: neo4j), change your password.

If you have never used neo4J before, it is recommended that you follow:

Online Training: Introduction to Graph Databases with Neo4j https://neo4j.com/graphacademy/online-training/introduction-graph-databases/

Step 2. Exploring Marvel Character Relationships

Now that you’re comfortable with neo4j, we’re going to use it to answer questions about the Marvel Comics Universe. First, we need to clean up our database and load the new data.

  1. Clear the graph

    MATCH (n)
    OPTIONAL MATCH (n)-[r]-()
    DELETE n,r
  1. Load the nodes

    LOAD CSV FROM "http://portal.nersc.gov/project/m342/datahack-dw/hero_nodes.csv" AS line
    MERGE (:Hero {name:line[0] , degree: toInt(line[1])})
  1. Create an Index on Hero Names

    CREATE INDEX ON :Hero(name)
  1. Load the edges

    LOAD CSV FROM
    "http://portal.nersc.gov/project/m342/datahack-dw/hero_edges.csv" AS line
    MATCH (u:Hero {name:line[0]})
    MATCH (v:Hero {name:line[1]})
    CREATE UNIQUE (u) -[:APPEARED { w: toInt(line[2])}]-> (v)

Step 3. Query Patterns

  1. Finding Characters By Name
    MATCH (spiderman:Hero)
    WHERE spiderman.name STARTS WITH "SPIDER"
    RETURN spiderman
  1. Finding the shortest path between 2 characters
    MATCH p=(peter:Hero {name: 'SPIDER-MAN/PETER PAR'})-[:APPEARED*0..2]-(logan:Hero {name: 'WOLVERINE/LOGAN'})
    RETURN p, length(p)
    ORDER BY length(p)
    LIMIT 1
  1. Finding and Counting Friends of Friends
    MATCH (peter:Hero { name: 'SPIDER-MAN/PETER PAR' })-[:APPEARED*2..2]-(friend_of_friend)
    WHERE NOT (peter)-[:APPEARED]-(friend_of_friend)
    AND friend_of_friend.name <> 'SPIDER-MAN/PETER PAR'
    RETURN friend_of_friend.name, COUNT(*)
    ORDER BY COUNT(*) DESC , friend_of_friend.name
  1. Finding Visualizing Connections Between Friends of Friends
    MATCH (peter:Hero { name: 'SPIDER-MAN/PETER PAR' })-[:APPEARED*2..2]-(friend_of_friend)
    WHERE NOT (peter)-[:APPEARED]-(friend_of_friend)
    AND friend_of_friend.name <> 'SPIDER-MAN/PETER PAR'
    RETURN friend_of_friend
    LIMIT 20
  1. Finding Teammates
    MATCH (tony:Hero {name:'IRON MAN/TONY STARK'}) -[e:APPEARED]-> (other) <-[f:APPEARED]- (donald:Hero {name:'THOR/DR. DONALD BLAK'})
    RETURN other
    ORDER BY e.w DESC, f.w DESC
    LIMIT 5

Step 4. Some Queries to Try On Your Own

Modify the above templates to find answers to the following questions

  1. What is the shortest path between DR. STRANGE and DR. DOOM?
  2. List the 5 shortest paths between DR. STRANGE and DR. DOOM
  3. List 5 Friends of Friends with the most connections and COLOSSUS II.
  4. Visualize 10 Friends of friends for IRON MAN
  5. Discover how the Avengers grew over time from 5 to 10. Find team members starting with 5 and incrementing to 10. Who was added to the team? Is the resulting graph ever not fully connected?

Querying the graph database with Python

You can also connect to neo4j with python (and other languages). Following is an example of connecting to your local neo4j instance with the loaded Marvel data.

Note: You'll need to update the password in the code to get this to work.

Neo4j Python developer page: https://neo4j.com/developer/python/

API Manual: https://neo4j.com/docs/driver-manual/1.7/

You'll likely not have the neo4j drivers installed in your Python instance. To install the Python3 libraries for neo4j using pip, execute the following command at the command line:

pip3 install neo4j

Once you have the noe4j library installed, and updated the 4th line in the below block to have the correct password, you can run the following code to query for all the friends of the the Marvel character Cassiopeia.

In [ ]:
from neo4j import GraphDatabase

user = 'neo4j'
password = 'neo4j'  # You'll need to put in your password you setI think 
uri = 'bolt://localhost:7687'

driver = GraphDatabase.driver(uri, auth=(user, password))

def print_friends_of(tx, name):
    for record in tx.run('MATCH (peter:Hero { name: {name} })-->(friend) '
                         'WHERE friend.name <> {name} '
                         'RETURN friend, COUNT(*) '
                         'ORDER BY COUNT(*) DESC, friend.name', name=name):
        print(record['friend']['name'])
    
with driver.session() as session:
    session.read_transaction(print_friends_of, 'CASSIOPEIA')
    
driver.close()
In [ ]: