Building a Conference Session Recommendation engine using Neo4J Graph Database

This article describes a use case that a traditional SQL-powered relational database can handle, but for which that traditional approach is not the optimal solution. SQL is a jack of all trades: you can make it do almost anything you need. That makes it easy for SQL to become your hammer, and for every challenge to look like a nail. If anything, this article is meant to open my eyes, and maybe yours, to the fact that other technologies can complement the SQL hammer in our toolbox. In this particular case, the complementary tool is a graph database, and more specifically Neo4J.

My colleague Rosanna Denis recently published an article on our AMIS Technology Blog that provides a good first introduction to Neo4J. You may want to check it out for a little background on Neo4J.

The challenge discussed here is a CodeOne session recommendation engine. Conferences such as CodeOne have many sessions to choose from, and picking the best ones is a real challenge. My time is valuable, so how do I make sure I do not waste it on sessions by inarticulate or uninspiring speakers?

What I would like to have is an engine that recommends sessions by speakers who are liked by people who attend(ed) the same sessions that I attend. Surely, if people come to the sessions that I attend, their recommendations for speakers they like must be valuable to me.

Below I will create two implementations of this recommendation engine: one relational and SQL based, the other using a graph database and Cypher (the Neo4J query and data manipulation language).

All code for this article is on GitHub: https://github.com/lucasjellema/conference-recommendation-engine-in-graphdb

Take One – Relational and SQL

The relational data model I have designed looks something like this:

[Image: diagram of the relational data model]

I have to cater for three many-to-many relationships: attendees attend multiple sessions and sessions have multiple attendees; speakers can present multiple sessions and a session can have multiple speakers; and people can like more than one speaker, while a speaker can be liked by multiple people.

The GitHub repo contains the source file with table creation DDL and data creation DML. Both are pretty straightforward.

The SQL query I would create starts from me and the sessions I attended, then locates the other attendees of those sessions, next finds the speakers they like, and finally retrieves the sessions presented by those speakers. It looks like this:

-- start from me and the sessions I attended
select s.code
,      s.title
,      a2.attendee_name "suggested by"
from   people p1
       join attendance a1
       on (p1.name = a1.attendee_name)
       -- the attendees of those same sessions
       join attendance a2
       on (a2.session_code = a1.session_code)
       -- the speakers those attendees like
       join speaker_liking sl
       on (sl.attendee_name = a2.attendee_name)
       -- the sessions presented by those speakers
       join speakers sp
       on (sl.speaker_name = sp.speaker_name)
       join sessions s
       on (sp.session_code = s.code)
where  p1.name = 'Lucas Jellema';

The result from this query (in SQL Developer):

[Image: the query result in SQL Developer]

The query is a little long-winded, even though I took some shortcuts. In SQL, a search like this one, which relies heavily on relationships between objects, ends up using many table joins to find the right answer. This is not necessarily a bad thing; it is just the way of relational databases and SQL. However, we end up with a query that is not very intuitive to grasp and therefore not easy to maintain. Furthermore, for really big data sets this approach might not scale well in terms of performance. Smart indexing strategies can probably stretch the approach a little, but we may very well hit limits at some stage.
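To experiment with the relational take without setting up a database server, the query can be tried against an in-memory SQLite database from Python. This is only a minimal sketch: the table and column names come from the query above, but the column types, the keys and the handful of sample rows are my assumptions (the real DDL and DML are in the GitHub repo), and the title of session DEV4854 is a placeholder because it is not shown in this article. The column alias is written as suggested_by to keep the sketch simple.

```python
import sqlite3

# Table and column names follow the query in the article; types, keys and
# sample rows are assumptions made for this sketch.
SCHEMA = """
create table people         (name text primary key);
create table sessions       (code text primary key, title text);
create table attendance     (attendee_name text, session_code text);
create table speakers       (speaker_name text, session_code text);
create table speaker_liking (attendee_name text, speaker_name text);
"""

RECOMMENDATION_QUERY = """
select s.code, s.title, a2.attendee_name as suggested_by
from   people p1
       join attendance a1     on p1.name = a1.attendee_name
       join attendance a2     on a2.session_code = a1.session_code
       join speaker_liking sl on sl.attendee_name = a2.attendee_name
       join speakers sp       on sl.speaker_name = sp.speaker_name
       join sessions s        on sp.session_code = s.code
where  p1.name = ?
order  by s.code
"""


def build_demo_db():
    """Create an in-memory database holding a few illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "insert into people values (?)",
        [("Lucas Jellema",), ("Luis Weir",), ("Daniel Bryant",)],
    )
    conn.executemany(
        "insert into sessions values (?, ?)",
        [
            ("DEV4854", "(placeholder: title not shown in the article)"),
            ("DEV5349", "Continuous Delivery with Containers and Java: "
                        "Lessons Learned and Mistakes Made"),
            ("TUT5283", "AdoptOpenJDK: Lessons Learned from the New Build "
                        "Farm for Java Itself"),
        ],
    )
    conn.executemany(
        "insert into attendance values (?, ?)",
        [("Lucas Jellema", "DEV4854"), ("Luis Weir", "DEV4854")],
    )
    conn.executemany(
        "insert into speakers values (?, ?)",
        [("Daniel Bryant", "DEV5349"), ("Daniel Bryant", "TUT5283")],
    )
    conn.execute(
        "insert into speaker_liking values (?, ?)",
        ("Luis Weir", "Daniel Bryant"),
    )
    return conn


def recommend(conn, name):
    """Run the recommendation query for one attendee."""
    return conn.execute(RECOMMENDATION_QUERY, (name,)).fetchall()


if __name__ == "__main__":
    for row in recommend(build_demo_db(), "Lucas Jellema"):
        print(row)
```

With these sample rows, the two Daniel Bryant sessions come back for Lucas Jellema, each suggested by Luis Weir, because Luis attended the same session and likes Daniel as a speaker.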

So let’s consider an alternative approach.

Take Two – Graph DB (Neo4J and Cypher)

Graph databases have a very different view on data. The data stored in a graph database, and the queries performed against it, are driven by relations between objects, or edges between vertices in the graph. Graph databases can be put to good use for specific query use cases that require a very rich, scalable and well-performing access path to data based on these relations. They will typically complement a relational database, not necessarily replace it.

Neo4J is a popular Graph Database and therefore an obvious candidate for my demonstration in this article.

To get started, I followed the instructions provided by Rosanna in her article. Working in a Linux VM with a Docker Engine, this is all I needed to do:

Create directories:

mkdir ~/neo4j
mkdir ~/neo4j/data
mkdir ~/neo4j/logs

Then run:

docker run --publish=7474:7474 --publish=7687:7687 --volume=$HOME/neo4j/data:/data --volume=$HOME/neo4j/logs:/logs neo4j:3.0

Now access Neo4J through a browser at port 7474, for example: http://192.168.188.120:7474

Using Neo4J’s Cypher language, I created the vertices and edges in my graph (the GitHub repo contains the file with all data manipulation statements):

// speakers and the sessions they present
CREATE (lucas:Person {name:'Lucas Jellema'})-[:PRESENTS]->(session:Session {title:'50 Shades of Data: How, When, Why—Big, Relational, NoSQL, Elastic, Graph, Event', code:'DEV4976'});
CREATE (daniel:Person {name:'Daniel Bryant'})-[:PRESENTS]->(session:Session {title:'Continuous Delivery with Containers and Java: Lessons Learned and Mistakes Made', code:'DEV5349'});
CREATE (luis:Person {name:'Luis Weir'})-[:PRESENTS]->(session:Session {title:'The Seven Deadly Sins of API Design', code:'DEV4921'});
// Daniel Bryant already exists, so MERGE finds him before his second session is created
MERGE (daniel:Person {name:'Daniel Bryant'}) CREATE (daniel)-[:PRESENTS]->(session:Session {title:'AdoptOpenJDK: Lessons Learned from the New Build Farm for Java Itself', code:'TUT5283'});

// attendance (session DEV4854 must already exist; it is not created in this excerpt)
MATCH (luis:Person {name:'Luis Weir'}), (session:Session {code:'DEV4854'}) MERGE (luis)-[:ATTENDS]->(session);
MATCH (lucas:Person {name:'Lucas Jellema'}), (session:Session {code:'DEV4854'}) MERGE (lucas)-[:ATTENDS]->(session);

// Luis likes Daniel Bryant as a speaker
MATCH (luis:Person {name:'Luis Weir'}), (daniel:Person {name:'Daniel Bryant'}) MERGE (luis)-[:VALUES]->(daniel);

This resulted in the following graphical representation of the graph (with sessions in pink and people in blue):

[Image: the resulting graph, with sessions in pink and people in blue]

With the graph in place, I can create the query that will provide the session recommendations. In Cypher, I can stay very close to the natural language that describes my enquiry:

// now the hunt is on
//find me and the sessions I attended
match (lucas:Person {name:'Lucas Jellema'}) - [:ATTENDS] -> (s1) RETURN s1

// find people who attended the same sessions as ME
match (lucas:Person {name:'Lucas Jellema'}) - [:ATTENDS] -> (s1) <- [:ATTENDS] - (p2) RETURN p2

// find presenters valued by the people who attended the same sessions as I did:
match (lucas:Person {name:'Lucas Jellema'}) - [:ATTENDS] -> (s1) <- [:ATTENDS] - (p2) - [:VALUES] -> (p3) RETURN p3

// find sessions presented by presenters valued by the people who attended the same sessions as I did:
match (lucas:Person {name:'Lucas Jellema'}) - [:ATTENDS] -> (session1) 
  <- [:ATTENDS] - (recommender) - [:VALUES] -> (speaker) - [:PRESENTS] -> (session) RETURN session, speaker.name, recommender.name

Starting with me (vertex of type Person with property name equal to Lucas Jellema), traverse to all sessions that I attended. For these sessions, navigate to all attendees (let’s call them recommenders) and find all speakers that they value. Traverse the PRESENTS edge for all these speakers to get the sessions that they present. Return these sessions, as well as the name of the speaker and the name of the recommender.
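To make the traversal concrete outside of Neo4J, here is a rough Python sketch that walks the same path over a plain list of edges. It illustrates the idea and is not how Neo4J works internally. The edge list mirrors the Cypher statements shown earlier, with sessions identified by their code; the function names and the data layout are my own assumptions. One detail worth noting: Cypher does not traverse the same relationship twice within a single pattern, so the starting person does not come back as their own recommender, and the sketch skips that case explicitly.

```python
# Edges as (source, relationship, target) triples, mirroring the Cypher
# statements in the article. Sessions are identified by their code.
EDGES = [
    ("Lucas Jellema", "PRESENTS", "DEV4976"),
    ("Daniel Bryant", "PRESENTS", "DEV5349"),
    ("Luis Weir", "PRESENTS", "DEV4921"),
    ("Daniel Bryant", "PRESENTS", "TUT5283"),
    ("Luis Weir", "ATTENDS", "DEV4854"),
    ("Lucas Jellema", "ATTENDS", "DEV4854"),
    ("Luis Weir", "VALUES", "Daniel Bryant"),
]


def build_index(edges):
    """Index edges by (node, relationship) in both directions."""
    outgoing, incoming = {}, {}
    for source, rel, target in edges:
        outgoing.setdefault((source, rel), set()).add(target)
        incoming.setdefault((target, rel), set()).add(source)
    return outgoing, incoming


def recommend(person, edges=EDGES):
    """Follow ATTENDS -> (back along) ATTENDS -> VALUES -> PRESENTS."""
    outgoing, incoming = build_index(edges)
    results = set()
    for attended in outgoing.get((person, "ATTENDS"), ()):
        for recommender in incoming.get((attended, "ATTENDS"), ()):
            if recommender == person:
                # Same ATTENDS edge we arrived on; Cypher would not reuse it.
                continue
            for speaker in outgoing.get((recommender, "VALUES"), ()):
                for session in outgoing.get((speaker, "PRESENTS"), ()):
                    results.add((session, speaker, recommender))
    return sorted(results)


if __name__ == "__main__":
    for session, speaker, recommender in recommend("Lucas Jellema"):
        print(session, speaker, recommender)
```

Each nested loop corresponds to one hop in the Cypher pattern, which is why the Cypher query reads so much like the sentence that describes it.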

Not surprisingly, I get the desired result:

[Image: the result of the Cypher query]

Comparison

The two queries can be compared side by side. It is clear which is the shorter one. After getting a little used to the Cypher syntax, I venture that the shorter one will also prove to be the more intuitive and more maintainable one. I have not specifically tested for performance. My expectation is that the graph database approach will perform (significantly) better for this use case, even though the SQL performance may well be good enough.

[Image: the SQL query and the Cypher query side by side]

Data manipulation is definitely different in Neo4J – no fixed table structures are defined ahead of time and meta-data is derived at data creation time. For a small demo system, that is nice and convenient. For a more serious application, that is hardly relevant. All software will rely on a certain structure of the data, so strict meta-data management is still very much required.

I like the idea of having a graph database in my toolbox, as a tool that can help me deal with query challenges that might give me real headaches in the traditional, relational approach. The interesting challenge that comes with it is, of course, how to replicate data changes from the relational master to the graph database that serves as a query-only store. Separating the store that handles changes from the store that handles queries is known as CQRS (Command Query Responsibility Segregation). And we will be talking about that again.
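As a first taste of what that replication could involve, here is a small and purely hypothetical sketch that turns attendance rows from the relational side into Cypher statements with parameters. The function name, the statement text and the row layout are assumptions of mine, not code from the GitHub repo. The statement uses MERGE throughout, so sending the same row twice should not create duplicate vertices or edges. The `$name` parameter style is the one current Neo4J releases accept; older releases (possibly including the 3.0 image used above) expect the older `{name}` form instead.

```python
# Hypothetical helper for illustration only; not part of the article's repo.
MERGE_ATTENDANCE = (
    "MERGE (p:Person {name: $name}) "
    "MERGE (s:Session {code: $code}) "
    "MERGE (p)-[:ATTENDS]->(s)"
)


def attendance_to_cypher(rows):
    """Map (attendee_name, session_code) rows to (statement, parameters) pairs."""
    return [
        (MERGE_ATTENDANCE, {"name": attendee_name, "code": session_code})
        for attendee_name, session_code in rows
    ]


if __name__ == "__main__":
    changes = [("Lucas Jellema", "DEV4854"), ("Luis Weir", "DEV4854")]
    for statement, parameters in attendance_to_cypher(changes):
        print(statement, parameters)
```

How the changed rows are detected on the relational side, and how the statements are delivered to Neo4J, is exactly the part this sketch leaves open.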

Note: figuring out what the query should be in Cypher on top of the Graph DB is a lot of fun! It is like discovering SQL all over again.

Resources

Source code for this article: https://github.com/lucasjellema/conference-recommendation-engine-in-graphdb 

Slides from my 50 Shades of Data presentation: https://www.slideshare.net/lucasjellema/50-shades-of-data-how-when-and-why-bigrelationalnosqlelasticgraphevent-codeone-2018-san-francisco 
(or watch the YouTube recording from CodeOne 2018: https://www.youtube.com/watch?v=S2wEDlKzVok)