“Why should I use a graph database, instead of a relational database, for my app?”

I often get asked questions like this.

People tend to begin their tech journey with relational databases and get stuck there. So, for them, there’s a question of why move away from what they know to be ‘right’. For me, the quickest response looks like:

“Why wouldn’t you – and why was your first thought relational anyway?”

But that’s not a reason. It’s my reaction based on the experiences I’ve had and the tech I’ve used.

In this post, I’m going to pick apart why you should trust a graph database for your next app and not trust a relational database.

Too often, such discussion is posed from the viewpoint of why move off a relational platform; why do something other than the default choice? I’m going to turn that around, arguing instead that, for most apps, a graph database should be the default choice, and you’d need a good reason to go the other way.

I’ll begin by noting that relational databases are probably your default choice because of historical accident, educational bias, and a bit of groupthink, not because they fit your engineering needs. Some part of many people’s initial reaction to a different bit of tech just boils down to familiarity: you moved past much of what you were initially taught; don’t get stuck on this one.

Along the way, I’ll show how relational databases are not nice to use from an engineering perspective, and then move to the counterpoint: graph databases are easy to use and naturally fit your app, and by using one, you won’t have to spend your time engineering around the monster at the bottom of your tech stack.

From there, we’ll take up data modeling, then query execution, and finally speed and scale. In reading those sections, you’ll notice that I’ve taken up my position: a graph database is the natural choice for your app; here’s why.

I’m arguing that from the point of view of engineering ease, technology fit, and the mechanics of how the graph database processes requests. This isn’t an essay about some nice points of graph databases. It says: here is the natural way; give me a reason to do otherwise.

You may feel a knee-jerk reaction that you want to defend the choice of a relational database for your app and why all that intermediate tech to translate from your app’s graph through to some tables and joins makes sense; that’s fine, but you’ll need a great argument to get through this one.

Let’s begin – relational databases cause you to do engineering work that’s not required.

One of the first things that most engineers were taught is relational databases. It’s a mental hurdle to think of a different thing at the bottom of the tech stack. But you were taught languages you don’t use now, tech that’s grown stale, and practices that your organization doesn’t follow. This is no different.

That you were taught relational databases at all is an accident. It’s simply that they are an early tech that became widely used, so your teacher knows about them, and, importantly, that they are a dream to teach.

Relational databases come with a basis in relational algebra. That means a course in relational databases is a beautiful package: start with a theoretical basis, mix in some practical applications, dive into the engineering challenges. It’s computer science, it’s fantastic to teach, and it’s easy to assess.

But relational databases aren’t as clean and fantastic when they are the bottom-most layer in your tech stack.

Using a relational database in an app isn’t a dream to teach, or easy to assess. It’s often a hard slog of design choices and working around what the relational database presents you as an interface.

You see, in that set of courses where you learned about building software, you did a course on software design and principles of good software design. In that, you learned about what makes a software component well designed and easy to use from other components. You were taught all about abstraction and interfaces and dependencies, and then, you walked out of that class and into databases 101. And the component featured in your databases course is essentially the counterexample to all the good software engineering design principles.

Relational databases are so much the counterexample to good software design that the history of development is littered with ideas of things to wrap around them to make them nicer. Anyone who’s built a big app knows that somehow dealing with that cantankerous bit of tech at the bottom of the stack is one of the key engineering challenges you face.

You wrap it in an ORM, write something bespoke, grab all the tools at your disposal to tame the fact that the keystone of your app is exposed in a way that’s not at all nice from a software engineering perspective. You are forced to model its way, and, then, it forces its internal workings on you as an interface.

You have to hide the relational database. If you don’t, there’s too much dependency between your app’s code and the model and interface it has forced on you.

Somehow, though, it leaks everywhere. You constantly face the challenges of engineering around it. You have to work out how changes in the DB percolate through your wrapper, what they’ll mean for join performance, and how you can force your wrapper to get you the bit of data you need without adding a magic string to your app that ties you to the internal workings of the database.

In fact, we get so caught up in the challenges of engineering around the relational database, and get so tooled and skilled in dealing with the challenges, that we forget to ask if there’s a simpler way.

The situation is so ingrained that it’s become something of a cultural aspect of software engineering. To be willing to take on and solve the engineering challenges of relational databases is seen almost as a badge of honor; to know deeply how they work and how to perform dark magic with them is seen as a mark of knowledge and intelligence; to question that extra effort is something of a software engineering anathema.

Even in the GraphQL ecosystem, whole companies exist on the premise that they can tame this engineering beast, that if you deepen your tech stack by adding their layer, you’ll better be able to handle the translation and engineering challenges, and thus reach what you want — which is simply to model in a way that’s natural for your app and have an interface to your data that’s not causing you issues. Why start with relational? Why start with the premise that the core tech is fixed and that we have to find ever more creative ways of engineering around its failings?

When you deal with a relational database, you get a highly engineered piece of tech, but not a piece of tech that’s nice to engineer with. It breaks every software engineering principle you know and forces you to do more work than you need just to tame it.

So many of those challenges melt away with a graph database. Your data model is a more natural abstraction of your app, the queries are more naturally traversals in your app’s data model and, particularly if your project is a GraphQL project on Dgraph, the interface you use to access your data works with you – it’s not something that you have to engineer around.

Trust a graph database for your app: you’ll get a piece of tech that works with you, not a piece of tech that forces you to engineer around it.

And with that, the perspective in this post has changed. Stop thinking about a relational database as the natural choice for your app; it’s the unnatural choice; it’s the choice that forces you to do more work; it’s the choice that forces you to do mental gymnastics, both for modeling and query; it’s the choice that forces you to do engineering work that you don’t need to do and has nothing to do with your app.

Now I take up my stance from the natural perspective and continue with that.

Graph databases let you model in a more natural way than relational databases.

Graph databases present a simple modeling abstraction that naturally matches both how you mentally view your app and how the data structures in your programming language will use the app’s data. With GraphQL, you also get end-to-end types that mean your app and your model have the same view of the data.

Let’s take the simple example of a Trello clone with cards that might have a list of comments. In my Dgraph GraphQL world, that’s simply an edge between card and comments.

type Comment { ... }

type Card {
  ...
  comments: [Comment]
}

Let’s ignore any notions of a shared abstraction — for example, where card and comments are both extensions of some interface that lists the creator, date, etc. — which get even harder in a relational world and easier in a graph one, and just take the simplest possible model. What do I do to model this in a relational way? Well, this simplest example is similar, but already doesn’t match my mental model.

A relational model would have two tables, again card and comments, and would list the same properties, except that I’m forced to put a reference to the card in the comment table and not in the card table. And already, even in the simplest case, I have to break my mental model of this data. I’m never going to traverse the data that way; I’ll always be laying out a page or popup of a card, its details, and its list of comments. But the relational database has forced its internal workings on me. I also have to be cognizant that at query time I can’t follow my mental model; I have to convert it to joins. I want to go card.comments, not find card, then find comments where comment.card = card.id, but I can’t.

My modeling is governed, not by how I mentally view my data or how my app uses the data, but by the internal workings of the relational database. Already, even in the simplest case, my engineering would be based on the internal workings of the database, not my app.
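To make the two-table layout concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The table and column names (card, comment, card_id) are my own assumptions for illustration, not a schema from any real app. Note that the reference lives on the comment row, and getting from a card to its comments means writing the join yourself.

```python
import sqlite3

# In-memory database, illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE card (id INTEGER PRIMARY KEY, title TEXT);
    -- The link is stored on the comment side, not on the card.
    CREATE TABLE comment (
        id INTEGER PRIMARY KEY,
        card_id INTEGER NOT NULL REFERENCES card(id),
        text TEXT
    );
    INSERT INTO card VALUES (1, 'Write blog post');
    INSERT INTO comment VALUES (10, 1, 'First draft done'), (11, 1, 'Needs examples');
""")


def comments_for_card(card_id):
    """What I want is card.comments; what I write is a join."""
    rows = conn.execute(
        "SELECT comment.text FROM card "
        "JOIN comment ON comment.card_id = card.id "
        "WHERE card.id = ? ORDER BY comment.id",
        (card_id,),
    ).fetchall()
    return [text for (text,) in rows]


card_comments = comments_for_card(1)
print(card_comments)
```

The GraphQL type above says the same thing in one line: comments: [Comment].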

If the model is more complex, it stays just as nice in the graph world, but not in the relational one. For example, a card can have several assignees. In my Dgraph GraphQL model, I simply state the natural model that a card is linked to people.

type Person { ... }

type Card {
  ...
  assignees: [Person]
}

My relational model, however, gets strange. I can’t represent this. I have to create not just the person and card tables, but now a third table cardAssignment that represents this link. My queries equally changed. I can’t go card.assignees, I now have to join three blocks of data and make two projections.
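As a rough sketch of what that looks like in practice, again with Python’s sqlite3 module: the cardAssignment table is the one described above, while the column names and sample rows are assumptions made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, username TEXT);
    CREATE TABLE card (id INTEGER PRIMARY KEY, title TEXT);
    -- The third table exists only to represent the link.
    CREATE TABLE cardAssignment (
        card_id INTEGER NOT NULL REFERENCES card(id),
        person_id INTEGER NOT NULL REFERENCES person(id),
        PRIMARY KEY (card_id, person_id)
    );
    INSERT INTO person VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO card VALUES (100, 'Ship release');
    INSERT INTO cardAssignment VALUES (100, 1), (100, 2);
""")


def assignees_of(card_id):
    """card.assignees, spelled as a join across three tables."""
    rows = conn.execute(
        "SELECT person.username FROM card "
        "JOIN cardAssignment ON cardAssignment.card_id = card.id "
        "JOIN person ON person.id = cardAssignment.person_id "
        "WHERE card.id = ? ORDER BY person.username",
        (card_id,),
    ).fetchall()
    return [username for (username,) in rows]


assignees = assignees_of(100)
print(assignees)
```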

As I build more of my app, this disparity grows. On the graph side, I model my data and the relationships that my app cares about. On the relational side, I’m concerned with how the database works, what tricks I can use to squeeze my model in there, how I can make this performant, and whether I need to de-normalize my database.

The relational model does ‘work’, but it’s more effort and adds unneeded complexity and mental gymnastics. You can even deepen your tech stack and dependencies by adding in an ORM to help you. But why? I’ve been challenged in the past by lines like “I haven’t found a case that’s modeled by a graph that can’t be modeled in a relational database”, which is likely true, but I also “haven’t seen a hole dug by an excavator that couldn’t have been dug with a shovel”. Sure, both statements are true, but would you want to be left holding the shovel?

Why introduce this complexity? I wouldn’t. The model is about the graph of cards, comments, and assignees. I’d like to keep it that way. I’d like to think about my app that way. I’d like to query that way. I don’t want to engineer around something else.

Trust a graph database in modeling your app: it works with you to model your domain, not against you by forcing you to model according to its internal workings.

I’m sorry, reader. I meant to spend more text on graph modeling than relational, but the relational takes so much explaining that it’s ended up dominating the text here. The graph version is simply: cards can have a list of assignees. But to even compare to the relational version, I was forced to tell you about its internal workings – those things are so pernicious that you can’t model with them without welding yourself to the internal details, and, it seems, you can’t even talk about them without their internals getting in your way.

Query execution in a graph database matches how you think about getting the data for your app.

I’ve got my model (cards, comments, and assignees), and I want the data to lay out a page. The GraphQL syntax may be unfamiliar to you, but it’s simply a description of the data traversal I need.

getCard(id: "0x123") {
  ...
  comments { 
    text
    ...
  }
  assignees {
    username
    ...
  }
}

Get this card, follow the comments edge to get all the comments, follow the assignees edge to get the people assigned to this card. Done.

The relational version, well — nope, I’m not describing it this time. I’m not going to describe the internal workings of the database just to explain this simple query. Just like in the previous section, that leads down the relational rabbit hole. You know it’s more complex. You know it’s all about table joins and projections and cross-products. You know a relational query returns a block of data, but here, we don’t want a block, we want this subgraph. What would the rows in my block look like here? Do I write a single query and make the rows repeat the card data for each comment, or do I run multiple queries, multiple round trips?
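If you want to see the shape of that trade-off without a full tour of the internals, here is a small sketch with Python’s sqlite3 module. The schema and the sample data (one card, three comments, two assignees) are assumptions invented for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE card (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE comment (id INTEGER PRIMARY KEY, card_id INTEGER, text TEXT);
    CREATE TABLE person (id INTEGER PRIMARY KEY, username TEXT);
    CREATE TABLE cardAssignment (card_id INTEGER, person_id INTEGER);
    INSERT INTO card VALUES (1, 'Ship release');
    INSERT INTO comment VALUES (10, 1, 'a'), (11, 1, 'b'), (12, 1, 'c');
    INSERT INTO person VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO cardAssignment VALUES (1, 1), (1, 2);
""")

# Option 1: a single query. Every row repeats the card title, and each
# comment is paired with each assignee.
single_query_rows = conn.execute(
    "SELECT card.title, comment.text, person.username FROM card "
    "JOIN comment ON comment.card_id = card.id "
    "JOIN cardAssignment ON cardAssignment.card_id = card.id "
    "JOIN person ON person.id = cardAssignment.person_id "
    "WHERE card.id = 1"
).fetchall()

# Option 2: three separate queries, which against a real server would
# usually mean three round trips.
card = conn.execute("SELECT title FROM card WHERE id = 1").fetchone()
comments = conn.execute(
    "SELECT text FROM comment WHERE card_id = 1 ORDER BY id"
).fetchall()
people = conn.execute(
    "SELECT person.username FROM cardAssignment "
    "JOIN person ON person.id = cardAssignment.person_id "
    "WHERE cardAssignment.card_id = 1 ORDER BY person.username"
).fetchall()

print(len(single_query_rows))  # 3 comments x 2 assignees = 6 rows
print(len(comments), len(people))
```

Either way, the app code still has to fold the rows back into the subgraph it wanted in the first place.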

Hang on, why are we even thinking about this? We want some data for our app, not to understand the internal workings of another piece of tech. How many other libraries or components that you use for your app force you to have a deep knowledge of their internal workings to even get off the ground? None. It’s just the relational database.

Trust a graph database for querying your data: your queries will be about your data, not about the internal workings of the database.

What about how those queries are executed?

Well, the easiest way to explain how Dgraph stores data is that edges are stored like pointers in a programming language. So when a query traverses from a card to its list of comments, that’s pretty much like a (disk-based) pointer lookup. The net result is that for Dgraph to find the data to lay out our card, its comments, and assignees, Dgraph need only use the actual data involved. Look up the card, follow the edges to the comments and assignees. If there are more edges involved, just follow those.
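The pointer analogy can be sketched in a few lines of Python. To be clear, this is a toy in-memory model that I’ve made up to illustrate the idea; it is not Dgraph’s actual storage format, and the Node class, the store dictionary, and the uids are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    uid: str
    props: dict
    # Edge name -> list of target uids, i.e. the "pointers".
    edges: dict = field(default_factory=dict)


# Hypothetical data: one card, two comments, one assignee, and an
# unrelated card that the query should never need to look at.
store = {
    "0x123": Node("0x123", {"title": "Ship release"},
                  {"comments": ["0x201", "0x202"], "assignees": ["0x301"]}),
    "0x201": Node("0x201", {"text": "First draft done"}),
    "0x202": Node("0x202", {"text": "Needs examples"}),
    "0x301": Node("0x301", {"username": "alice"}),
    "0x999": Node("0x999", {"title": "Unrelated card"}),
}

touched = []  # records every node the traversal reads


def lookup(uid):
    touched.append(uid)
    return store[uid]


def get_card(uid):
    """Look up the card, then follow its edges. No scans, no joins."""
    card = lookup(uid)
    return {
        "title": card.props["title"],
        "comments": [lookup(c).props["text"]
                     for c in card.edges.get("comments", [])],
        "assignees": [lookup(p).props["username"]
                      for p in card.edges.get("assignees", [])],
    }


page = get_card("0x123")
print(touched)  # ['0x123', '0x201', '0x202', '0x301']
```

The traversal reads exactly the four nodes the page needs; the unrelated card is never touched.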

The relational version — oh, dear, so much complexity.

The graph version for both how you write the query and how the database executes the query is about the app’s data and how it’s linked. The relational version is about blocks of data, and subsets of those blocks, and multiple joins, and are my joins a problem, and do I have to make multiple trips creating an N+1 problem.

Trust a graph database for executing queries in your app: it works by using your graph to find the data you need, not by being fixed to some algebraic theory that looks nice as an exercise in a textbook.

Graph databases are designed for scale; it’s not something you have to engineer for.

With a graph database, as discussed above, the query execution method is to process the graph. That turns out to be fast, and, in Dgraph’s case, scalable. Dgraph’s data storage format and the way it processes queries are both optimized to solve the kinds of problems that GraphQL queries ask.

Two problems tend to affect the performance of GraphQL queries. One is fan-out and the other is depth. As a query requests more fields, there’s more work to do and more data to be returned at each level. As it gets deeper, that’s potentially more trips to the disk to get data, and N+1 problems.

Dgraph solves these as core features and throws scale in for fun.

Query fan-out isn’t an issue in Dgraph because it solves for each queried field independently and in parallel. That means asking for more fields doesn’t necessarily take more time.
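As a loose illustration of the idea of resolving fields independently, here is a small Python sketch using a thread pool. This is only an analogy under my own assumptions (the card dictionary and the resolve_field function are hypothetical); it says nothing about how Dgraph schedules work internally.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical card data; each key stands in for one queried field.
card = {
    "title": "Ship release",
    "comments": ["First draft done", "Needs examples"],
    "assignees": ["alice"],
}


def resolve_field(name):
    """Resolve one field without depending on any other field."""
    return name, card[name]


def resolve_fields(names):
    # Each field is submitted as its own task, so the fields can be
    # worked on concurrently rather than one after another.
    with ThreadPoolExecutor() as pool:
        return dict(pool.map(resolve_field, names))


resolved = resolve_fields(["title", "comments", "assignees"])
print(resolved)
```

Because no field waits on another, the cost of the slowest field, not the sum of all fields, is what tends to dominate.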

Depth in Dgraph just means following another level of pointers. Some ways of solving for depth, like compiling a larger query, require extra table joins; other ways move from a node to its N siblings and create an N+1 query problem. Dgraph advances the query frontier in parallel and always in batches, avoiding fetching the same nodes multiple times, so a deeper traversal isn’t a problem.

For example, if a query expands from a card to its comments to the comment author to other cards they are assigned, etc., it might look like this in GraphQL.

getCard(id: "0x123") {
  ...
  comments { 
    ...
    text
    author {
      username
      assignedCards {
        assignees {
          ...
        }
      }
    }
  }
}

There might be N comments, but often fewer than N distinct authors, since the same author may make multiple comments, so Dgraph will expand the comments and then the authors as a batch (effectively as pointer dereferences), and similarly for the assigned cards: batched and minimized to avoid duplicate work. Dgraph is engineered to solve these GraphQL query problems for you.
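Here is a small Python sketch of the difference between fetching an author per comment and expanding a deduplicated frontier as one batch. It’s a toy model under my own assumptions (the dictionaries and uids are hypothetical), not Dgraph’s implementation, but it shows why the batch does less work.

```python
# Hypothetical data: four comments on one card, written by two authors.
card_comments = {"0x123": ["c1", "c2", "c3", "c4"]}
author_of = {"c1": "u1", "c2": "u2", "c3": "u1", "c4": "u1"}
users = {"u1": {"username": "alice"}, "u2": {"username": "bob"}}


def authors_per_comment(card_id):
    """N+1 style: one author fetch for every comment."""
    fetches = 0
    names = []
    for comment in card_comments[card_id]:
        fetches += 1  # a separate fetch, even for a repeated author
        names.append(users[author_of[comment]]["username"])
    return names, fetches


def authors_batched(card_id):
    """Frontier style: dedupe the authors, then fetch them as one batch."""
    comments = card_comments[card_id]
    frontier = sorted({author_of[c] for c in comments})  # distinct authors
    fetched = {uid: users[uid] for uid in frontier}      # one batched step
    names = [fetched[author_of[c]]["username"] for c in comments]
    return names, len(frontier)


naive_names, naive_fetches = authors_per_comment("0x123")
batched_names, batched_fetches = authors_batched("0x123")
print(naive_fetches, batched_fetches)  # 4 2
```

Both functions return the same usernames; the batched version just reads each author once.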

Implementing such an app in a relational DB places the engineering challenge on you. Are you going to over-fetch or under-fetch data in your queries? Are you going to compile the query to a single join that may be inefficient because it’ll contain self-joins? Are you going to have to engineer something to deal with batching and N+1?

In Dgraph, the query cost is bounded by the data the query investigates and returns. In a relational implementation, it’s bounded by your engineering efforts to contain the various implementation problems and, often, the size of the tables being joined.

Dgraph’s query mechanism also enables scale. The data can be replicated and sharded across a distributed cluster. The query answering mechanism doesn’t change. You don’t have to write different queries or engineer a solution to distributed joins; it’s solved in the graph database. With an SQL database, you start with a single instance, engineer for queries, then engineer to scale.

Trust a graph database for speed and scale: the engineering challenges of efficiently executing GraphQL, even at scale, are solved by Dgraph, not forced onto you as another engineering challenge.

In the end, engineers should make intelligent choices that match their technology needs, so I’m going to leave that choice up to you. I just hope that I’ve at least opened a little mind-door that’ll let you realize that what many see as the ‘default’ choice may not match up with what you want to spend your time doing as an engineer.

Can you simply ignore all the arguments and just write your app? Well, no, because the tech choices you make may work with you to help you build your app, or force you to engineer around them just to get to the point of building your app.

Many large companies move away from relational databases when they can no longer sustain the engineering cost. However, they ran into the engineering problems early and spent time and effort dealing with them. Google, Facebook, Twitter, LinkedIn, and others all moved once they were large enough to have the engineering effort to build an in-house graph solution, but that doesn’t mean you have to wait that long to ease your engineering pain.