Extending Neo4j – enforcing strict schema

July 5, 2016 Jarek Strzelecki

In case you’re not familiar with the name Neo4j, let me give you a brief introduction.

If I told you Neo4j was a non-relational database that, according to the ranking maintained by db-engines.com, sits around the top 20 most popular database engines, you might shrug your shoulders and ask why you should care.

While there are plenty of NoSQL solutions available on the market today, most of them are simple document or key-value stores. Neo4j is different. Much like traditional relational databases, it is built on a solid mathematical foundation, in this case graph theory.

If your data model is complex, hierarchical, or made up of large numbers of densely interconnected entities, converting it into a relational model may be challenging and lead to complicated queries. As the data set grows, so does the probability of performance issues caused by the biggest weakness of the relational model: joins between tables do not scale well. Graph databases are free of this limitation and can traverse complex associations in real time. For use case examples, please visit http://neo4j.com/use-cases/

Getting back on track, Neo4j stores its data in the form of graphs, but what does that mean? Formally speaking, every graph is a set of entities (nodes) combined with a set of associations among these entities (edges). While this definition may sound complex, working with a graph feels quite natural in practice. In fact, you have probably drawn one many times on a whiteboard.

Example of graph

In the system being modelled, every noun is represented by a node of the graph, and every verb describing an association between nodes is represented by an edge or, as Neo4j calls it, a relationship.

It wouldn’t be much of a database if you could not store additional data describing the entities stored within it. Hence, every node and every relationship can contain multiple properties of either simple types (numbers, strings, boolean values) or collections of those types.

In reality, most systems are composed of elements of multiple types, where every type may have different properties describing it. The question that may now come to mind is how to distinguish nodes of different types. Neo4j has a special feature for that very purpose: labels attached to nodes (relationships carry a single type that plays the same role).

To illustrate, we may create a node labelled “Book” with the following properties:

Properties of Node “Book”

However, assigning a label to a node, or a type to a relationship, does not change the way Neo4j treats the properties of such objects; they are always typed dynamically. While this approach offers great flexibility, it also brings risks: failing to account for inconsistent data in a single place may lead to unintended consequences.

For example, Neo4j would be completely fine with modifying the properties of the previously mentioned “Book” in such a way that their types differ from what you would expect.
Properties of Node “Book”

This effectively means that, if you are developing an application on Neo4j, all type checking has to be enforced at the application level. It matters most when your graph data comes from multiple sources (e.g. when importing data from another system).

Now, if you have ever worked with a classic relational database, you might be wondering whether it is possible to enforce type checks on a graph schema at all.

Fortunately, the answer is ‘yes’, because the Neo4j developers have made their product extensible to a great degree. Perhaps the most powerful extension point is the ability to register kernel event handlers that supervise the transactions occurring in the database.

My concept of a type checking extension was to create an event handler that would scan the nodes and relationships being persisted for labels and check whether their property values match the schema definition. In the case of a schema violation, an exception is thrown, resulting in a transaction rollback.

Here is a snippet of the event handler, which is executed before a transaction is committed.

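import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.event.PropertyEntry;
import org.neo4j.graphdb.event.TransactionData;
import org.neo4j.graphdb.event.TransactionEventHandler;

// SchemaProvider and PropertiesValidator belong to this extension; LogService comes from the Neo4j kernel.
public class SchemaEnforcerTransactionEventHandler extends TransactionEventHandler.Adapter<Object> {
    private final PropertiesValidator propertiesValidator;

    public SchemaEnforcerTransactionEventHandler(final SchemaProvider schemaProvider, final LogService logService) {
        this.propertiesValidator = new PropertiesValidator(schemaProvider, logService);
    }

    // Called before the transaction commits; an exception thrown here rolls the transaction back.
    @Override
    public Object beforeCommit(final TransactionData data) throws Exception {
        // Validate every property assigned to a node in this transaction.
        for (final PropertyEntry<Node> prop : data.assignedNodeProperties()) {
            propertiesValidator.validatePropertyEntry(prop);
        }
        // Validate every property assigned to a relationship in this transaction.
        for (final PropertyEntry<Relationship> prop : data.assignedRelationshipProperties()) {
            propertiesValidator.validatePropertyEntry(prop);
        }
        return null; // no state needs to be passed on to afterCommit/afterRollback
    }
}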

Validation is performed by a separate component, PropertiesValidator, which retrieves the schema definition for each label and then compares the types of the properties present on a node or relationship with those declared in the definition.
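The comparison step could look roughly like the sketch below. It is not the actual PropertiesValidator: the class name PropertyTypeCheckSketch, its two methods, and the plain maps standing in for a schema definition and for the properties of a node are all assumptions made for illustration. The type names (string, int, bool, array[...]) are the ones used in the “Book” schema definition in this post.

```java
import java.lang.reflect.Array;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not the PropertiesValidator used by the extension.
final class PropertyTypeCheckSketch {

    // Returns true when the runtime value matches the declared type name.
    static boolean matches(final String declaredType, final Object value) {
        if (declaredType.startsWith("array[") && declaredType.endsWith("]")) {
            if (value == null || !value.getClass().isArray()) {
                return false;
            }
            // Strip "array[" and the closing "]" to get the element type name.
            final String elementType = declaredType.substring(6, declaredType.length() - 1);
            // Array.get boxes primitive elements, so long[] and String[] are both handled.
            for (int i = 0; i < Array.getLength(value); i++) {
                if (!matches(elementType, Array.get(value, i))) {
                    return false;
                }
            }
            return true;
        }
        switch (declaredType) {
            case "string": return value instanceof String;
            case "int":    return value instanceof Integer || value instanceof Long;
            case "bool":   return value instanceof Boolean;
            default:       return false; // unknown type names count as violations here
        }
    }

    // Collects one message per property whose value does not match the schema.
    // Properties that the schema does not mention are ignored in this sketch.
    static List<String> violations(final Map<String, String> schema,
                                   final Map<String, Object> properties) {
        final List<String> result = new ArrayList<>();
        for (final Map.Entry<String, Object> entry : properties.entrySet()) {
            final String declared = schema.get(entry.getKey());
            if (declared != null && !matches(declared, entry.getValue())) {
                result.add(entry.getKey() + " is not of type " + declared);
            }
        }
        return result;
    }

    public static void main(final String[] args) {
        final Map<String, String> schema = new LinkedHashMap<>();
        schema.put("title", "string");
        schema.put("pages", "int");

        final Map<String, Object> book = new LinkedHashMap<>();
        book.put("title", "Example title");
        book.put("pages", "three hundred"); // wrong type on purpose

        System.out.println(violations(schema, book));
    }
}
```

In a real handler, a non-empty list of violations would be turned into an exception so that the transaction is rolled back.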

The schema definition is stored in the properties of special nodes labelled “Metadata”. In the case of the previously presented “Book”, the definition may look like this:

CREATE (m:Metadata)
    SET m.label = 'Book', 
        m.schema = [ 
            'title:string', 
            'pages:int', 
            'genre:bool', 
            'ratings:array[string]' 
        ]
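Each entry of m.schema is a 'name:type' string, so before any validation the entries have to be split into a property name and a type name. The following is a minimal sketch of that step under stated assumptions: SchemaEntryParserSketch and its parse method are hypothetical names, and the extension's real SchemaProvider may read the metadata quite differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; the extension's own SchemaProvider is not shown in this post.
final class SchemaEntryParserSketch {

    // Turns entries such as "pages:int" into a map from property name to type name.
    static Map<String, String> parse(final String[] entries) {
        final Map<String, String> schema = new LinkedHashMap<>();
        for (final String entry : entries) {
            // Split on the first colon only, so the type name is kept whole.
            final int separator = entry.indexOf(':');
            if (separator <= 0 || separator == entry.length() - 1) {
                throw new IllegalArgumentException("Malformed schema entry: " + entry);
            }
            schema.put(entry.substring(0, separator).trim(),
                       entry.substring(separator + 1).trim());
        }
        return schema;
    }

    public static void main(final String[] args) {
        final String[] bookSchema = {
            "title:string", "pages:int", "genre:bool", "ratings:array[string]"
        };
        System.out.println(parse(bookSchema));
    }
}
```

Rejecting malformed entries early means a typo in a “Metadata” node surfaces as a clear error instead of silently disabling a check.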

As a further enhancement, this syntax might be extended to allow the definition of complex validation rules (similar in functionality to the check constraints available in some relational databases).
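To give a feel for what such rules could do, here is a rough sketch that attaches a predicate to a property name and reports the properties that fail. Everything in it is an assumption: the class CheckRulesSketch, its require and failures methods, and the example rule for pages are invented for illustration, and the sketch deliberately leaves open how rules would be written in the schema syntax.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch of check-constraint-style rules; not part of the extension.
final class CheckRulesSketch {
    private final Map<String, Predicate<Object>> rules = new LinkedHashMap<>();

    // Registers a rule for one property; a later rule for the same property replaces it.
    CheckRulesSketch require(final String property, final Predicate<Object> rule) {
        rules.put(property, rule);
        return this;
    }

    // Returns the names of properties that are present but fail their rule.
    List<String> failures(final Map<String, Object> properties) {
        final List<String> failed = new ArrayList<>();
        for (final Map.Entry<String, Predicate<Object>> rule : rules.entrySet()) {
            final Object value = properties.get(rule.getKey());
            if (value != null && !rule.getValue().test(value)) {
                failed.add(rule.getKey());
            }
        }
        return failed;
    }

    public static void main(final String[] args) {
        // Example rule (an assumption): a book must have a positive page count.
        final CheckRulesSketch bookRules = new CheckRulesSketch()
            .require("pages", value -> value instanceof Long && (Long) value > 0);

        final Map<String, Object> book = new LinkedHashMap<>();
        book.put("pages", -5L);
        System.out.println(bookRules.failures(book));
    }
}
```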
