Overview
When I first heard about Neo4j, I thought that such databases were used only for recommendation engines. My perception changed when I saw a presentation about the GAAND Stack (GraphQL, Apollo, Angular, Neo4j Database). During this training I noticed that there are many more use cases for this database, so the day after the training I dug deeper into the technology. I realized that it is a really powerful tool because of its speed and its way of representing data (using graphs, we can model the real world).
There are two main reasons why I decided to write this post. The first is to create a complete manual for the GRAND Stack (the same as GAAND, but with React.js instead of Angular). The second is to structure my knowledge about Neo4j as preparation for professional certification.
What is a graph database?
In simple words, a graph database is a database that uses graph structures to store data. Just like a graph, such a database has nodes and edges, which can be unidirectional or bidirectional.
Structuring data into graphs is intuitive, and for highly connected data it lets a query follow relationships directly instead of searching through whole tables, which is often faster.
Neo4j database structure
What is a node?
A node is an object that represents an entity. A node itself can hold data, called properties.
What is a relation?
As the name says, this element represents a relation between two nodes. Relationships organize nodes into structures, which allows creating lists, maps, trees, etc. A relation can hold properties and must have exactly one relation type.
What is a relation type?
The relation type defines which role one node plays for another, or explains why two nodes are connected.
What is a label?
Labels are used to assign nodes to groups. One node can have zero or more labels.
We can think of labels as table names from a relational database. A label defines the type of a node. For example, in a movie graph we could have two types: Person and Movie.
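A minimal sketch of how labels look in Cypher. The Actor label and the sample data are assumptions made for illustration; Person and Movie come from the text above.

```cypher
// One node may carry several labels; here a hypothetical actor is both a Person and an Actor.
CREATE (p:Person:Actor {name: 'John Tree'})
RETURN labels(p) AS labels, p.name AS name;

// Matching on a label restricts the search to nodes of that type.
MATCH (m:Movie)
RETURN m.title;
```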
What are properties?
Properties are simply the data that a node or relation holds.
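As a rough illustration, the statement below puts properties on two nodes and on the relation between them. The property names (name, born, title, released, roles) are assumptions in the style of Neo4j's sample movie data, not a required schema.

```cypher
// Properties are key-value pairs written inside curly braces.
CREATE (p:Person {name: 'Keanu Reeves', born: 1964})
CREATE (m:Movie {title: 'The Matrix', released: 1999})
// A relation can hold properties too, here a list of role names.
CREATE (p)-[r:ACTED_IN {roles: ['Neo']}]->(m)
RETURN p.name, r.roles, m.title
```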
What is traversal?
Traversal is how you query a graph to get the data you want. Traversing a graph means visiting nodes by following relationships according to the rules set in the query.
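Two hedged examples of what such rules can look like. The KNOWS relation type is hypothetical, and both queries assume Person and Movie nodes like the ones used elsewhere in this post.

```cypher
// Start at one person, follow ACTED_IN to movies, then follow ACTED_IN backwards to co-actors.
MATCH (p:Person {name: 'Charlie Sheen'})-[:ACTED_IN]->(m:Movie)<-[:ACTED_IN]-(coActor:Person)
RETURN DISTINCT coActor.name AS coActor, m.title AS movie;

// Variable-length traversal: follow between 1 and 3 hypothetical KNOWS relations.
MATCH (p:Person {name: 'Charlie Sheen'})-[:KNOWS*1..3]->(other:Person)
RETURN DISTINCT other.name;
```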
What is an index?
As in other databases, an index allows us to increase the performance of querying data. The database creates a redundant copy of the indexed data and stores it in a form that is efficient to search. This comes at the cost of additional storage space and slower writes.
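A short sketch of creating an index on a node property. The index name person_name is an assumption, and the exact syntax depends on the Neo4j version, so treat this as illustrative rather than definitive.

```cypher
// Syntax for Neo4j 4.x and later; older 3.x releases used: CREATE INDEX ON :Person(name)
CREATE INDEX person_name
FOR (p:Person) ON (p.name);

// A lookup on the indexed property can now use the index instead of scanning all Person nodes.
MATCH (p:Person {name: 'Charlie Sheen'})
RETURN p;
```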
What is a constraint?
Databases use constraints to prevent storing unwanted data. We define rules that the data should follow, and the database checks the values against them before each commit.
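For example, a uniqueness constraint could be sketched as follows. The constraint name is an assumption, and the syntax shown is the one used by Neo4j 4.4 and later; earlier versions use a different form, noted in the comment.

```cypher
// Neo4j 4.4+ syntax (FOR ... REQUIRE). Earlier versions used:
//   CREATE CONSTRAINT ON (p:Person) ASSERT p.name IS UNIQUE
CREATE CONSTRAINT person_name_unique
FOR (p:Person) REQUIRE p.name IS UNIQUE;

CREATE (:Person {name: 'Charlie Sheen'});
// Running the same CREATE a second time would now be rejected,
// because another Person node with this name already exists.
```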
Query Language – Cypher
Cypher is the query language used in the Neo4j database. For people who have worked with SQL it will look familiar. From my perspective it is also a little bit similar to streams in Java, because writing code in this language creates something like a pipe. Reading it from left to right reminds me of reading ordinary sentences.
This query language uses ASCII art for patterns, which makes it more readable. Looking at a query, we immediately know what is a node, what is a relation, and how we are going to process the data.
Basic queries
Fetching data
MATCH (charlie { name: 'Charlie Sheen' })-[:ACTED_IN]->(movie)<-[:DIRECTED]-(director) RETURN movie.title, director.name
The query above tells the database to return the movie title and the director's name for every movie in which the node named Charlie Sheen acted (presumably a person, because the query does not define a label explicitly). In simple words, we are looking for movies where Charlie Sheen was an actor, together with their directors.
Please note that we are using a unidirectional relation by typing an "arrow" (-[:RELATION_TYPE]->). This arrow precisely describes the relation between the nodes.
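To make the role of the arrow concrete, here is a small comparison. It assumes Person and Movie labels on the nodes, which the query above does not specify.

```cypher
// Directed pattern: only relations pointing from the person to the movie match.
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
RETURN p.name, m.title;

// No arrowhead: the pattern matches the relation regardless of its direction.
MATCH (p:Person)-[:ACTED_IN]-(m:Movie)
RETURN p.name, m.title;
```

Relations are always stored with a direction; leaving out the arrowhead only tells the query to ignore it while matching.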
Creating node
CREATE (a:Artist { Name : "Strapping Young Lad" })
We can also create multiple nodes with one command by separating them with commas
CREATE (a:Album { Name: "Killers"}), (b:Album { Name: "Fear of the Dark"}) RETURN a,b
or by using separate CREATE statements
CREATE (a:Album { Name: "Piece of Mind"}) CREATE (b:Album { Name: "Somewhere in Time"}) RETURN a,b
Creating relationship
MATCH (a:Actor),(b:Movie)
WHERE a.Name = "John Tree" AND b.Name = "The neo4j movie"
CREATE (a)-[r:ACTED_IN]->(b)
RETURN r
As you can see, the same CREATE keyword is used for creating relationships, but we first have to point out the nodes that should be connected.
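When the nodes do not exist yet, the nodes and the relationship can also be created together in one pattern. This is only a sketch that reuses the names from the example above.

```cypher
// Creates both nodes and the relationship between them in a single statement.
// CREATE always adds new nodes, so this fits the case where neither node exists yet.
CREATE (a:Actor {Name: "John Tree"})-[r:ACTED_IN]->(b:Movie {Name: "The neo4j movie"})
RETURN a, r, b
```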
Difference in comparison to SQL
If you know SQL, you will see many similarities between these query languages. Clauses such as WHERE, UNION, ORDER BY or CREATE exist in both languages. The main difference is that there are no joins, since relations are modelled differently than in a relational database.
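A side-by-side sketch of the difference. The relational schema in the comment is hypothetical and exists only to show the joins that the Cypher pattern replaces.

```cypher
// Hypothetical relational schema: person(id, name), movie(id, title), acted_in(person_id, movie_id)
// SQL version with two joins:
//   SELECT m.title
//   FROM person p
//   JOIN acted_in ai ON ai.person_id = p.id
//   JOIN movie m ON m.id = ai.movie_id
//   WHERE p.name = 'Charlie Sheen'
//   ORDER BY m.title;

// Cypher version: the relation is part of the pattern, so no join is written.
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE p.name = 'Charlie Sheen'
RETURN m.title
ORDER BY m.title
```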
Transactions in Neo4j
Neo4j supports ACID properties to fully maintain data integrity and ensure good transaction behavior.
We have to perform all database operations that access the graph, indexes, or the schema in a transaction.
It is worth remembering that:
- Data retrieved by traversals is not protected from modification by other transactions.
- Non-repeatable reads may occur (only write locks are acquired and held until the end of the transaction).
- One can manually acquire locks on nodes and relationships to achieve a higher level of isolation.
- Locks are acquired at the node and relationship level.
- Deadlock detection is built into the core transaction management.
To read more about transactions in Neo4j visit this site.
Isolation Level
Transactions in the Neo4j database use the READ_COMMITTED isolation level. This means that transactions won't see any uncommitted changes from other transactions. Additionally, the Java API enables explicit locking of nodes and relationships. Obtaining and releasing locks explicitly makes it possible to simulate the effect of higher isolation levels.
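From Cypher itself, one commonly described way to get a similar effect is to write a dummy property first, since any write takes the write lock on that node until the transaction ends. The Counter label and the _lock property below are hypothetical names; this is a sketch of the idea, not a replacement for the Java locking API.

```cypher
MATCH (c:Counter {name: 'pageViews'})
// Dummy write: acquires the write lock on the node before its value is read.
SET c._lock = true
WITH c, c.value AS current
// Other transactions cannot modify this node until we commit, so the update is not lost.
SET c.value = current + 1
REMOVE c._lock
RETURN c.value
```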
Examining Neo4j queries
EXPLAIN
The EXPLAIN command allows us to see the execution plan without running our statement. To use it, we prepend our query with the EXPLAIN keyword. It returns an empty result and makes no changes to the database state.
Example:
EXPLAIN MATCH p=()-[r:ACTED_IN]->() RETURN p LIMIT 25
PROFILE
To see which operators are doing most of the work, we can use the PROFILE statement. This command runs our query and keeps track of how many rows pass through each operator, and how much each operator needs to interact with the storage layer to retrieve the necessary data.
Example:
PROFILE MATCH p=()-[r:ACTED_IN]->() RETURN p LIMIT 25
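PROFILE becomes more useful when two variants of a query are compared. The sketch below assumes Person and Movie labels and a name property; with labels in the pattern, the planner can start from a label scan (or an index, if one exists) rather than from every node, which should show up as fewer rows and storage hits in the profile.

```cypher
// Labelled, anchored pattern: compare its plan with the unlabelled query above.
PROFILE
MATCH (p:Person {name: 'Charlie Sheen'})-[:ACTED_IN]->(m:Movie)
RETURN m.title
```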
Naming in Neo4j database
Node Label
Node labels are written in CamelCase, starting with an upper-case character.
Proper name | Incorrect name |
VehicleOwner | vehicle_owner |
NetworkNode | networkNode |
Relationship type name
Relationship types are written as upper-case words separated with underscores.
Proper name | Incorrect name |
ACTED_IN | acted_in |
OWNED_BY | ownedBy |
Property
Property names are written in lower camel case, beginning with a lower-case character.
Proper name | Incorrect name |
firstName | first_name |
amountOfStudents | AMOUNT_OF_STUDENTS |
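Put together, the three conventions could look like this in a single statement. The Car label and the property values are made up for the example.

```cypher
// Labels: CamelCase (Car, VehicleOwner)
// Relationship type: upper case with underscores (OWNED_BY)
// Properties: lower camel case (licensePlate, sinceYear, firstName)
CREATE (c:Car {licensePlate: 'ABC 123'})-[:OWNED_BY {sinceYear: 2019}]->(o:VehicleOwner {firstName: 'Anna'})
RETURN c, o
```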
Comparison to Relational Database
When we want to migrate data from a relational database to Neo4j, we can think of the rows of a table as nodes. With this in mind, the table name becomes the node's label, and the node's properties are the data from the particular row. Foreign keys become relations between nodes, and the name of each foreign key column can serve as a starting point for the relation type.
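A minimal sketch of such a migration for two rows. The tables, columns and values in the comments are hypothetical.

```cypher
// Hypothetical tables:
//   person(id, name)               -> nodes labelled Person
//   movie(id, title, director_id)  -> nodes labelled Movie; director_id references person.id

// Row person(1, 'John Tree') becomes a node:
CREATE (p:Person {id: 1, name: 'John Tree'})
// Row movie(10, 'The neo4j movie', 1) becomes a node without the director_id column...
CREATE (m:Movie {id: 10, title: 'The neo4j movie'})
// ...and the link to the director becomes a relation instead.
CREATE (p)-[:DIRECTED]->(m)
```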
Communication protocol – Bolt
Bolt is a non-standardized, open source protocol designed for databases. The protocol is statement oriented: the client sends statements, each consisting of a single string with a set of parameters, and the server responds with a result message and an optional stream of data. Neo4j uses this protocol, and its default port is 7687.
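In practice, "a single string with a set of parameters" corresponds to a parameterized Cypher statement. The example below only shows the statement text; the parameter map mentioned in the comment is an assumed illustration of what a client would send with it.

```cypher
// Statement text sent to the server; $name is a parameter placeholder.
// The parameters travel alongside it as a map, for example {name: 'Charlie Sheen'}.
MATCH (p:Person {name: $name})-[:ACTED_IN]->(m:Movie)
RETURN m.title
```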
Neo4j Bloom
Bloom is an application available in the Graph Platform that allows users to interact visually with graph data. In simple words, it is a web application that shows the graph we are working on.
Watch this video to see more about Neo4j Bloom.
License
There are two types of license. The Community edition is a fully featured database that can be used for open-source projects, projects inside an organization, or applications that run on a personal device. The Enterprise edition comes with better availability and scalability for commercial usage.
Neo4j also supports startups: it is possible to get it for free after signing up for the startup program. For more details see here.
Summary
From my perspective, a graph database is a perfect choice when we need to model real-world relations or any more complicated connections between objects. Searching in graphs is very fast, which is a huge benefit nowadays. The structure of a graph is also much easier to picture than any document or tabular data structure.
As for Neo4j itself, I like the way we operate on data. Cypher is an intuitive language that shows exactly what we are going to do, thanks to its ASCII-art look. The language is clear and we can read it like a normal sentence. I'm also impressed by the Bloom tool that Neo4j provides. It visualizes the graphs we work on very well, and its user interface is intuitive and clear.
Neo4j isn't a cheap solution, but in cases where the speed of querying data counts it can be the best choice. A graph database is also more flexible and easier to maintain, which can be beneficial or even crucial for some kinds of projects.
To sum up, I would recommend the Neo4j database for every project that can benefit from path traversal and graph algorithms, as well as from flexible relation and data modelling.