Original post

If you store data in a relational database, it’s good practice to have the data
normalized. This typically requires splitting data to multiple tables that
are logically connected through keys. As a result, most non-trivial queries
require joins on multiple tables to gather all the interesting columns. This
post is a brief tour of SQL joins, focusing on the differences between inner
and outer joins.

Cross join

To understand SQL joins, it’s best to start with cross joins, because they
are the simplest combination of tables supported by SQL. A cross join occurs
when we write:

select * from t1, t2;

Throughout this post, we’ll be working with two sample tables called t1 and
t2:

    t1                         t2

 id |   name              code | id
----+----------          ------+----
  1 | Joanne              x    |  2
  2 | Sam                 z    |  3
  3 | Emmanuel            a    |  7
  4 | Brayden

The SQL code to create these tables and run all the examples in this post is
available here. All the code
was tested on PostgreSQL 9.5.

Running the cross join on these tables results in:

 id |   name   | code | id
----+----------+------+----
  1 | Joanne   | x    |  2
  2 | Sam      | x    |  2
  3 | Emmanuel | x    |  2
  4 | Brayden  | x    |  2
  1 | Joanne   | z    |  3
  2 | Sam      | z    |  3
  3 | Emmanuel | z    |  3
  4 | Brayden  | z    |  3
  1 | Joanne   | a    |  7
  2 | Sam      | a    |  7
  3 | Emmanuel | a    |  7
  4 | Brayden  | a    |  7

The cross join performs a cross product (or Cartesian product) between the two
tables. For each row in t1, it adds all possible rows from t2. The
resulting table has all the columns of t1 and of t2, and its number of
rows is the product of numbers of rows in t1 and t2.

I find cross joins to be a good starting point because they make inner joins
much easier to understand. They are also the basis of joins in relational
algebra.

SQL also supports a more explicit way to invoke a cross join:

select * from t1 cross join t2;

This is equivalent to the first statement.

Inner join

An important component of SQL queries is filtering results with a where
clause. For example, we can create the following (slightly nonsensical) filter
on the cross join shown earlier:

select * from t1, t2 where t2.code = 'x' and t1.name like '%d%'

Resulting in:

 id |   name   | code | id
----+----------+------+----
  4 | Brayden  | x    |  2

One filter that’s particularly useful when crossing two tables is checking
whether there’s a match on some column value. Both t1 and t2 have an
id column; let’s assume these IDs refer to the same thing, and that we want
to find all combinations of rows from the two tables where the IDs match. We can
do:

select * from t1, t2 where t1.id = t2.id;

Resulting in:

 id |   name   | code | id
----+----------+------+----
  2 | Sam      | x    |  2
  3 | Emmanuel | z    |  3

This kind of filtering is so useful that it has its own concept: the inner
join
[1]:

select * from t1 inner join t2 on t1.id = t2.id;

It produces the exact same result table. When the names of the columns we
compare are the same in the two tables, there’s an even shorter syntax that can
be used:

select * from t1 inner join t2 using (id);

The result of this will only have a single id column, since we’re making it
explicit that ids match between the tables:

 id |   name   | code
----+----------+------
  2 | Sam      | x
  3 | Emmanuel | z

I find the filtering equivalence very useful to understand inner joins. Just
remember that it’s a cross product of the two tables where only rows that
satisfy a certain condition are returned. You may be wondering what’s the
difference between using where filtering and inner join ... on. While
the two are logically equivalent, some things to keep in mind:

  • At least theoretically, inner join ... on is more efficient because in
    multi-table joins (which is common) we get to apply the filtering per join and
    not at the end on one huge table. With modern SQL query optimizers it’s not
    clear whether this is a real advantage, however. It’s quite likely that the
    optimizer will generate exactly the same sequence of low-level operations for
    the two.
  • In terms of readability, it’s much nicer to be able to see what the join is
    on close to the join itself, rather than in the end of the query in one
    large where filter. This can be significant for multi-table joins.

As an example, consider customers making orders, with order details in a
separate table (since customers could have multiple orders). We could have a
complex join done as:

select *
  from customers, orders, order_details
  where customers.id = order_details.customerid and
        orders.id = order_details.orderid

Compared with:

select *
  from customers
    inner join order_details on customer.id = order_details.customerid
    inner join orders on orders.id = order_details.orderid

In the latter it’s much clearer what the criteria for each join is.

Finally, I’ll mention that some support the natural join, which
is a shortcut for “inner join tables on the columns that have the same name”.
The following query is equivalent to the variant with using shown above:

select * from t1 natural join t2;

Natural join is a term from relational algebra, and it’s not commonly used in
SQL queries.

Outer join

While the inner join is simple to understand as a special case of the cross
product, outer join is a bit trickier. Luckily, it’s not hard to grok outer
joins once you undererstand inner joins, so we can build this knowledge step by
step.

Let’s get back to our tables t1 and t2. We could assign a logical
meaning to the inner join using (id) as “show me all the codes (from t2)
matching names (from t1)”. The result is two rows where a match on id
was found in the two tables. However, sometimes we want something slightly
different; we want to ask “show me all the names (from t1) and all the
codes (from t2) that match them, if any”. In other words, we want all the
names to be in the results, perhaps with null values for code where no match
was found in the t2 table [2].

Let’s break this request to pieces. We want:

  • All names from t1 that have a match in t2, with the code from t2
  • All names from t1 that have no match in t2, with null for the code

In SQL we can express this as follows:

select id, name, code
    from t1 inner join t2 using (id)
  union
select id, name, null
    from t1 where id not in (select id from t2);

Some things to note:

  • The first query is precisely our inner join from the previous section, and
    it’s answering the first piece.
  • The second query lists all the names that don’t have a match in t2 using
    a subquery.
  • We’re listing the column names explicitly here because column names must match
    exactly for the two tables being union-ed.

What we just wrote is called a left outer join in SQL [3], and can be more
easily written as:

select * from t1 left outer join t2 using (id);

The result is:

 id |   name   | code
----+----------+------
  2 | Sam      | x
  3 | Emmanuel | z
  4 | Brayden  |
  1 | Joanne   |

This is the left outer join because we want all the rows from the left-hand
side
table to appear in the result. As you may have guessed, there’s also
a right outer join:

select * from t1 right outer join t2 using (id);
 id |   name   | code
----+----------+------
  2 | Sam      | x
  3 | Emmanuel | z
  7 |          | a

Here all the rows from the right-hand side table appear in the result, with
a matching column (name) from the left-hand side if found, null otherwise.

Finally, we may want rows from both sides of the join to always appear in the
result table. That’s called a full outer join:

select * from t1 full outer join t2 using (id);

Resulting in:

 id |   name   | code
----+----------+------
  2 | Sam      | x
  3 | Emmanuel | z
  7 |          | a
  4 | Brayden  |
  1 | Joanne   |

A full outer join is straightforward to express using a union of left and right
joins:

select * from t1 left join t2 using (id)
  union
select * from t1 right join t2 using (id);

There’s a slight caveat, though. While union removes duplicates, full
outer join
does not; therefore, the results can be different in some special
cases. In the event that you care about seeing duplicates in the output and
the database doesn’t support a full outer join, this is a more accurate
(though less efficient) translation:

select * from t1 left join t2 using (id)
  union all
select * from t1 right join t2 using (id) where t1.id is null;

Joins on multiple columns

The examples so far showed joins on a single shared column – id. While this
is the most common case, sometimes more complex matching criteria are used.
SQL doesn’t restrict the syntax of join to a single condition, so we can
join on multiple columns and arbitrary conditions. Let’s add another column to
our two tables:

        t1                             t2

 id |   name   | ranking        code | id | ranking
----+----------+---------      ------+----+--------
  1 | Joanne   |       7        x    |  2 |       8
  2 | Sam      |       7        z    |  3 |       6
  3 | Emmanuel |       6
  4 | Brayden  |       2

We can run joins on both id and ranking:

select * from t1 inner join t2 on t1.id = t2.id and t1.ranking = t2.ranking;

Resulting in:

 id |   name   | ranking | code | id | ranking
----+----------+---------+------+----+---------
  3 | Emmanuel |       6 | z    |  3 |       6

And with using:

select * from t1 inner join t2 using (id, ranking);

Resulting in:

 id | ranking |   name   | code
----+---------+----------+------
  3 |       6 | Emmanuel | z

Similarly, we can run outer joins:

select * from t1 left outer join t2 using (id, ranking);

Resulting in:

 id | ranking |   name   | code
----+---------+----------+------
  3 |       6 | Emmanuel | z
  2 |       7 | Sam      |
  4 |       2 | Brayden  |
  1 |       7 | Joanne   |

And so on.

Joins on multiple tables

In real-life databases, data is often split to multiple tables; it’s not
uncommon for queries to probe 4-5 or more tables to gather all the interesting
information. Let’s use three table for an example. We’ll have a table of
customers and a table of items:

         customers                       items

 customerid |   name          itemid | description | price
------------+----------      --------+-------------+-------
          1 | Robert               1 | Napkins     |   1.5
          2 | Jennifer             2 | Granola     |  4.25
          3 | Yoshi                3 | Cheese      |     3
          4 | Xi

In addition we’ll have a linking table to record orders made by customers:

 customerid | itemid | orderdate
------------+--------+------------
          1 |      2 | 2019-03-02
          1 |      3 | 2019-03-02
          1 |      1 | 2019-03-03
          2 |      1 | 2019-02-22
          3 |      3 | 2019-01-15
          3 |      2 | 2019-02-20
          4 |      3 | 2019-02-21
          4 |      3 | 2019-02-22

We may be interested in all the customers who ordered cheese, and the date of
the order. This requires joining all three tables:

select name, orderdate, description
    from (customers
    inner join orders using (customerid))
    inner join items using (itemid)
    where items.description = 'Cheese';

Resulting in:

  name  | orderdate  | description
--------+------------+-------------
 Robert | 2019-03-02 | Cheese
 Yoshi  | 2019-01-15 | Cheese
 Xi     | 2019-02-21 | Cheese
 Xi     | 2019-02-22 | Cheese

Note the parens around the first join. This is not strictly necessary for this
query, but I find it useful to control the order of joining explicitly. We can
join as many tables as we want, but the order has to make sense. Each join
produces a new logical table that participates in other joins, and for some
queries the order of joins is important.

While it will be more common to see sequences of inner joins in such
queries, it’s also possible to mix and match with outer joins; whatever
makes sense.


[1] When we use the join keyword in SQL, inner join is the default,
so the keyword inner is optional. That said, to distinguish inner
joins from outer joins IMHO it’s preferable to be explicit.
[2] This sounds contrived with our simplistic tables, but in reality it’s an
extremely common database query. Imagine our t1 is customers with
unique IDs and names, and our t2 is some code assigned to each
customer. Suppose we want to display all our customers, regardless of who
already has a code assigned. For customers that do have a code we want to
show it.
[3] With the keyword left before a join, the keyword outer is
optional, so we could just say left join instead of left outer
join
. I like the explicitness of having outer there. The same
applies for right and full joins.