An Introduction to Graph Databases with Neo4j

Graph Databases

Over the last decade there has been a rise in the use of non-relational stores such as in-memory, document, multi-model, and graph databases. This post is about the last of these, graph databases. We will look specifically at Neo4j and how it can be used to answer questions about highly connected data.

 

Graph Theory

In mathematics, the study of graphs is known as graph theory. There are many types of graphs, and among them is the directed graph. Directed graphs consist of a set of nodes, connected by directed edges.


The Property Graph Model

Property graphs contain nodes and relationships, which are further described by properties and labels.

Nodes

Nodes are the main data points of interest in a property graph.

Relationships

Relationships are the directed connections between two nodes.

Properties

Properties are key-value attributes that belong to nodes or relationships.

Labels

Labels are used to group nodes together. A node can have zero or more labels.
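
As a rough illustration of how these four pieces fit together, the sketch below uses Cypher (the Neo4j query language introduced later in this post) to create two nodes and one relationship. The labels, property names, values, and the WORKS_AT relationship type are all made up for this example and are not part of the data used later in the post.

```cypher
// Hypothetical example data, not from the Baseball Databank.
// (variable:Label {property: value}) describes a node.
// -[:TYPE {property: value}]-> describes a directed relationship.
CREATE (p:Person {name: 'Ada'})
CREATE (c:Company {name: 'Sharp Notions'})
CREATE (p)-[:WORKS_AT {since: 2015}]->(c)
RETURN p, c;
```

Here Person and Company are labels, name and since are properties, and WORKS_AT is a relationship pointing from the person to the company.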

Property Graph

Image originally uploaded by Ahzf (transferred by Obersachse) on en.wikipedia, CC0, https://commons.wikimedia.org/w/index.php?curid=19279472

 

What is Neo4j?

Neo4j is a graph database management system. The system features native graph storage and processing, as well as support for the property graph model.

 

Using the Neo4j Sandbox

Image of the Neo4J Getting Started

To work with Neo4j without having to install it locally, we will use the Neo4j Sandbox.

Using your browser, open a new tab or window and go to https://neo4j.com/sandbox-v2/.

You should see a splash screen with a dialog like the one pictured.

Click the Start Now button.

Neo4j Login Screen

The Log in/Sign up dialog page will appear.

Select the best option for you and log in.

After logging in, you should see several sandbox options.

Neo4J Start a Blank Sandbox

Select the Blank Sandbox, and click Launch Sandbox.

After a few moments you should see a sandbox dialog with tabs across the top.

You will now have a sandbox that is available to you for a few days.

Click the Details tab. Make a note of the information displayed and click the Neo4j Browser link to continue.

Neo4j Sandbox Details Panel

 

Neo4j Browser

Once you have completed the steps above, you should be viewing the Neo4j Browser.
In the center of the screen at the top you should see the Query Editor. We will make extensive use of the editor throughout the remainder of this post.

On the left side of the browser, click the database icon. You will see the Database Information panel. It includes the sections Node Labels, Relationship Types, and Property Keys. The database is presently empty, but as data is added those sections will display icons that you can click to execute queries.

Please consult this beginner’s guide for more details on how to use and customize the browser.

The Neo4j Interface

 

Cypher Query Language

Neo4j’s Cypher Query Language is a declarative graph query language that aims to be intuitive and human-readable. Nodes, relationships, and properties are described using ASCII art: nodes are written in parentheses and relationships as arrows between them. Pattern matching is used to query and update the data stored in the database.

This query would return every node labeled Companies whose name property is “Sharp Notions”:

MATCH (company:Companies {name: 'Sharp Notions'})
RETURN company;
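
Relationships are drawn as arrows between nodes. As a sketch, assuming hypothetical Employees nodes connected to Companies nodes by a hypothetical WORKS_AT relationship, a pattern like the following would return everyone who works at Sharp Notions:

```cypher
// The Employees label and WORKS_AT type are illustrative assumptions.
// The arrow gives the direction: employee -> company.
MATCH (employee:Employees)-[:WORKS_AT]->(company:Companies {name: 'Sharp Notions'})
RETURN employee, company;
```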

Please consult Cypher’s documentation for a better understanding of the queries used later in the post.

 

Importing Data

To utilize the concepts mentioned so far, we are going to create a new property graph by importing data to our sandbox.

The Baseball Databank is a compilation of historical baseball data distributed under Open Data terms. It is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

The database is updated annually, prior to the start of the next season. Some of the data it contains dates back to the 1870s.

With this in mind, we can use Neo4j as the means to view historical baseball data using the Property Graph Model.

 

Creating Nodes

In the next few sections we will create nodes in the database using the Cypher import statements located here.

Each import statement contains the clause USING PERIODIC COMMIT, a query hint that tells Neo4j to commit the import in batches. This helps prevent an out-of-memory error when importing large amounts of data using LOAD CSV.

To avoid issues during import we will execute the import statements one at a time.

LOAD CSV is used to import data from a comma-delimited file at a specified location. In this case, we are loading files from a GitHub repository. Neo4j does not require a predefined schema, so we can import the files without declaring their structure first.
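
For orientation, an import statement of this kind could look roughly like the sketch below. It is not the post's actual import statement: the URL is a placeholder, the batch size of 1000 is an arbitrary choice, and copying every CSV column onto the node with SET is just one simple way to do it.

```cypher
// Sketch only. Replace the placeholder URL with the real location of the CSV file.
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'https://example.com/path/to/People.csv' AS row
// Create one node per CSV row and label it after the file's basename.
CREATE (p:People)
// Copy every column of the row onto the node as a (string) property.
SET p = row;
```

Note that LOAD CSV reads every value as a string, so numeric columns need an explicit conversion such as toInteger() if you want to compare them as numbers. In Neo4j 5 and later, USING PERIODIC COMMIT is no longer available and CALL { ... } IN TRANSACTIONS serves the same purpose.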

The readme file included with the data contains important information about the history of the database and its structure. This will serve as our guide for the tables we import.

People Nodes

Neo4j People Import

The aforementioned readme file indicates that the Master table is the most important. At some point Baseball Databank renamed Master to People.

We will create People nodes first, as they contain each person’s name, date of birth, and other biographical information.

To import data from the People table, begin by copying the People import statement provided.

Next, click inside the query editor, and hit the ESC key. The editor should expand.

Paste, and then execute the query by using the Run button in the top right of the editor.

Neo4j Database Information Panel

As each node is created, it is given the label People, matching the basename of the file the data is pulled from. This will be the convention for each table imported.

After a few moments, the process should complete.

The database is now populated with nodes labeled People.

If you click the database icon you will now see Node Labels and Property Keys are filled in.

Click People in the Node Labels section to execute a query. It will return 25 nodes.
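
Clicking a label runs a small Cypher query on your behalf. It should look similar to the sketch below, where the LIMIT clause is the reason only 25 nodes come back. The exact text generated by your version of the Browser may differ.

```cypher
// Return up to 25 nodes that carry the People label.
MATCH (n:People)
RETURN n
LIMIT 25;
```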

 

Neo4J People Query Results

Clicking the label with the number of results at the top of the query pane will update the dialog at the bottom of the pane, where the color and size of the nodes can be changed.

Neo4J Results Label Image

To the far right, there is an arrow that, when clicked, expands to display the caption options for a node. Select playerID to display it as the caption for People nodes.

Neo4j Change Node


You can repeat these steps for the other labeled nodes as you deem necessary.

Lastly, if you hover over a node, its properties will be revealed in the bottom panel of the query results.

Neo4j Node

Let’s continue with the rest of our data imports. The process is the same as the one we followed above to import People. Copy the matching import statements provided for the Batting, Pitching, Fielding, Teams, and HallOfFame tables. Be sure to import them one at a time.

 

Creating Relationships

Relationships on a property graph can help with answering questions about our data.

At this point we have created nodes labeled People, Batting, Pitching, Fielding, Teams, and HallOfFame in the database. We will now take a look at creating relationships between these nodes to answer questions such as “Who is in the Hall of Fame?”

Who is in the Hall of Fame?

The ultimate honor for the people involved in baseball is to be inducted into The National Baseball Hall of Fame.

HallOfFame nodes contain a playerID property, which we can use to select People nodes with the corresponding playerID.

Each HallOfFame node also has an inducted property that indicates if a person was voted into the Hall of Fame.

The following query will return nodes for People who have been inducted to the Hall of Fame, and HallOfFame votes resulting in a person’s induction:
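
A minimal sketch of such a query, assuming the inducted property holds the string 'Y' for a successful vote (as it does in the Baseball Databank CSV file):

```cypher
// Find the HallOfFame votes that resulted in an induction...
MATCH (h:HallOfFame {inducted: 'Y'})
// ...and the People node that shares the same playerID.
MATCH (p:People {playerID: h.playerID})
RETURN p, h;
```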

Image of the Result of the Hall of Fame Inducted Query

The query returned the nodes we wanted, but there is no relationship between them.

To create a relationship between a person and their induction vote*, execute the following query:
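
A sketch of what this could look like, again assuming that inducted is stored as the string 'Y'. MERGE is used here instead of CREATE so that running the query twice does not produce duplicate relationships; that choice is ours and may differ from the original query.

```cypher
// Match each successful induction vote with the person it belongs to.
MATCH (h:HallOfFame {inducted: 'Y'})
MATCH (p:People {playerID: h.playerID})
// Connect them with a directed relationship from the person to the vote.
MERGE (p)-[:WAS_INDUCTED_TO_HALL_OF_FAME]->(h);
```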

We have now created the relationship type WAS_INDUCTED_TO_HALL_OF_FAME.

* Please note that Neo4j does not require nodes to have matching properties in order to create a relationship. Any two nodes can be connected. Here we are using the data’s relational database origins to create relationships.

Neo4j Hall Of Fame Inducted Relationship Panel

Neo4j Hall Of Fame Was Inducted Relationship Query Results Panel

Returning to graph theory for a moment, we have created an edge going out from a People node to a HallOfFame node.

Querying by that relationship, we can easily answer the question “Who is in the Hall of Fame?”
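
One way to express that question in Cypher is to match on the relationship directly. This is a sketch; the LIMIT is only there to keep the result small enough to view comfortably.

```cypher
// Follow the relationship from each person to their induction vote.
MATCH (p:People)-[:WAS_INDUCTED_TO_HALL_OF_FAME]->(h:HallOfFame)
RETURN p, h
LIMIT 25;
```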

Neo4j Hall Of Fame Relationship Result

Let’s continue by asking a few more questions, and creating relationships to answer them.

What are this player’s statistics?

Using the following queries, we can create relationships between People (players) and the statistics they have compiled:
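
As a sketch of the pattern, the statement below links each person to their Batting nodes through the shared playerID property. The relationship type HAS_BATTING_STATS is a hypothetical name chosen for this example and may not match the one used originally.

```cypher
// HAS_BATTING_STATS is an illustrative name, not taken from the original queries.
MATCH (b:Batting)
MATCH (p:People {playerID: b.playerID})
MERGE (p)-[:HAS_BATTING_STATS]->(b);
```

The same statement, with Pitching or Fielding in place of Batting and a relationship type of your choosing, would cover the other statistics. As with the imports, run each statement separately.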

What are this team’s statistics?
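
The team queries follow the same idea. As a sketch, assuming Batting and Teams nodes both carry teamID and yearID properties (as the corresponding CSV files do) and using a hypothetical relationship type, a season of batting statistics could be connected to the team it was compiled for like this:

```cypher
// COMPILED_FOR is an illustrative name, not taken from the original queries.
// A team is identified by its teamID together with the season (yearID).
MATCH (b:Batting)
MATCH (t:Teams {teamID: b.teamID, yearID: b.yearID})
MERGE (b)-[:COMPILED_FOR]->(t);
```

The same pattern, with Fielding or Pitching in place of Batting, would link the other statistics to their teams.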

Neo4j Relationship Panel

We have finished creating relationships for the purposes of this post. You can create additional relationships by importing the other available tables, and thinking about how the resulting nodes are related.

Remember, any relationship created above can be queried by clicking its icon in the browser.

Revealing Relationships

One of the most powerful features of Neo4j is the ability to reveal a node’s relationships by expanding it within a query result.

The result below displays a Hall of Fame player’s Batting and Fielding statistics after double-clicking the People node representing the player.

Neo4j Hall of Fame Player

This next result expands the player’s Batting and Fielding statistics for a single season to reveal the team they were compiled for.

Neo4j Player

If we went on to expand the Teams node, the player’s teammates, their statistics and more would be revealed. This effectively answers “Who were this Hall of Fame player’s teammates?” or “What teams did this Hall of Fame player play for?”

 

Conclusion

This was a brief look at graph databases, and how they can help answer questions about connected data using relationships. However, there is much more to understand than can be covered here.

Consider creating a Neo4j sandbox to see if a graph database might fit your use case. Be sure to visit the Neo4j website and documentation for more information.