Joins

When selecting data from CrateDB, you can join one or more relations (e.g., tables) to combine columns into one result set.

See also

Join (SQL)

Table of contents

Cross joins

Referencing two tables results in a CROSS JOIN.

The result is computed by creating every possible combination (i.e., a cartesian product) of their rows (t1 * t2 * t3 * tn) and then applying the given query operation on it (e.g., WHERE clause, SELECT list, ORDER BY clause, and so on):

cr> select articles.name as article, colors.name as color, price
... from articles cross join colors
... where price > 5000.0
... order by price, color, article;
+------------------------------+---------------+----------+
| article                      | color         |    price |
+------------------------------+---------------+----------+
| Infinite Improbability Drive | Antique White | 19999.99 |
| Infinite Improbability Drive | Gold          | 19999.99 |
| Infinite Improbability Drive | Midnight Blue | 19999.99 |
| Infinite Improbability Drive | Olive Drab    | 19999.99 |
| Starship Titanic             | Antique White | 50000.0  |
| Starship Titanic             | Gold          | 50000.0  |
| Starship Titanic             | Midnight Blue | 50000.0  |
| Starship Titanic             | Olive Drab    | 50000.0  |
+------------------------------+---------------+----------+
SELECT 8 rows in set (... sec)

Cross joins can be done explicitly using the CROSS JOIN statement as shown in the example above, or implicitly by just specifying two or more tables in the FROM list:

cr> select articles.name as article, colors.name as color, price
... from articles, colors
... where price > 5000.0
... order by price, color, article;
+------------------------------+---------------+----------+
| article                      | color         |    price |
+------------------------------+---------------+----------+
| Infinite Improbability Drive | Antique White | 19999.99 |
| Infinite Improbability Drive | Gold          | 19999.99 |
| Infinite Improbability Drive | Midnight Blue | 19999.99 |
| Infinite Improbability Drive | Olive Drab    | 19999.99 |
| Starship Titanic             | Antique White | 50000.0  |
| Starship Titanic             | Gold          | 50000.0  |
| Starship Titanic             | Midnight Blue | 50000.0  |
| Starship Titanic             | Olive Drab    | 50000.0  |
+------------------------------+---------------+----------+
SELECT 8 rows in set (... sec)

Inner joins

Inner joins require each record of one table to have matching records on the other table:

cr> select s.id, s.table_name, t.number_of_shards
... from sys.shards s, information_schema.tables t
... where s.table_name = t.table_name
... and s.table_name = 'employees'
... order by s.id;
+----+------------+------------------+
| id | table_name | number_of_shards |
+----+------------+------------------+
|  0 | employees  |                4 |
|  1 | employees  |                4 |
|  2 | employees  |                4 |
|  3 | employees  |                4 |
+----+------------+------------------+
SELECT 4 rows in set (... sec)

Outer joins

Left outer joins

Left outer join returns tuples for all matching records of the left-hand and right-hand relation like inner joins. Additionally it returns tuples for all other records from left-hand that don’t match any record on the right-hand by using NULL values for the columns of the right-hand relation:

cr> select e.name || ' ' || e.surname as employee, coalesce(d.name, '') as manager_of_department
... from employees e left join departments d
... on e.id = d.manager_id
... order by e.id;
+--------------------+-----------------------+
| employee           | manager_of_department |
+--------------------+-----------------------+
| John Doe           | Administration        |
| John Smith         | IT                    |
| Sean Lee           |                       |
| Rebecca Sean       |                       |
| Tim Ducan          |                       |
| Robert Duval       |                       |
| Clint Johnson      |                       |
| Sarrah Mcmillan    |                       |
| David Limb         |                       |
| David Bowe         |                       |
| Smith Clark        | Marketing             |
| Ted Kennedy        |                       |
| Ronald Reagan      |                       |
| Franklin Rossevelt |                       |
| Sam Malone         |                       |
| Marry Georgia      |                       |
| Tim Doe            | Human Resources       |
| Tim Malone         | Purchasing            |
+--------------------+-----------------------+
SELECT 18 rows in set (... sec)

Right outer joins

Right outer join returns tuples for all matching records of the right-hand and left-hand relation like inner joins. Additionally it returns tuples for all other records from right-hand that don’t match any record on the left-hand by using NULL values for the columns of the left-hand relation:

cr> select e.name || ' ' || e.surname as employee, d.name as manager_of_department
... from employees e right join departments d
... on e.id = d.manager_id
... order by d.id;
+-------------+-----------------------+
| employee    | manager_of_department |
+-------------+-----------------------+
| John Doe    | Administration        |
| Smith Clark | Marketing             |
| Tim Malone  | Purchasing            |
| Tim Doe     | Human Resources       |
|             | Shipping              |
| John Smith  | IT                    |
+-------------+-----------------------+
SELECT 6 rows in set (... sec)

Full outer joins

Full outer join returns tuples for all matching records of the left-hand and right-hand relation like inner joins. Additionally it returns tuples for all other records from left-hand that don’t match any record on the right-hand by using NULL values for the columns of the right-hand relation. Additionally it returns tuples for all other records from right-hand that don’t match any record on the left-hand by using NULL values for the columns of the left-hand relation:

cr> select e.name || ' ' || e.surname as employee, coalesce(d.name, '') as manager_of_department
... from employees e full join departments d
... on e.id = d.manager_id
... order by e.id;
+--------------------+-----------------------+
| employee           | manager_of_department |
+--------------------+-----------------------+
| John Doe           | Administration        |
| John Smith         | IT                    |
| Sean Lee           |                       |
| Rebecca Sean       |                       |
| Tim Ducan          |                       |
| Robert Duval       |                       |
| Clint Johnson      |                       |
| Sarrah Mcmillan    |                       |
| David Limb         |                       |
| David Bowe         |                       |
| Smith Clark        | Marketing             |
| Ted Kennedy        |                       |
| Ronald Reagan      |                       |
| Franklin Rossevelt |                       |
| Sam Malone         |                       |
| Marry Georgia      |                       |
| Tim Doe            | Human Resources       |
| Tim Malone         | Purchasing            |
|                    | Shipping              |
+--------------------+-----------------------+
SELECT 19 rows in set (... sec)

Join conditions

CrateDB supports all operators and scalar functions as join conditions in the WHERE clause.

Example with within scalar function:

cr> select photos.name, countries.name
... from countries, photos
... where within(location, geo)
... order by countries.name, photos.name;
+--------------+---------+
| name         | name    |
+--------------+---------+
| Eiffel Tower | France  |
| Berlin Wall  | Germany |
+--------------+---------+
SELECT 2 rows in set (... sec)

Available join algorithms

Nested loop join algorithm

The nested loop algorithm evaluates the join conditions on every record of the left-hand table with every record of the right-hand table in a distributed manner (for each shard of the used tables). The right-hand table is scanned once for every row in the left-hand table.

This is the default algorithm used for all types of joins.

Block hash join algorithm

The performance of equi-joins is substantially improved by using the hash join algorithm. At first, one relation is scanned and loaded into a hash table using the attributes of the join conditions as hash keys. Once the hash table is built, the second relation is scanned and the join condition values of every row are hashed and matched against the hash table.

In order to built a hash table even if the first relation wouldn’t fit into the available memory, only a certain block size of a relation is loaded at once. The whole operation will be repeated with the next block of the first relation once scanning the second relation has finished.

This optimisation cannot be applied unless the join is an INNER join and the join condition satisfies the following rules:

  • Contains at least one EQUAL operator

  • Contains no OR operator

  • Every argument of a EQUAL operator can only references fields from one relation

The hash join algorithm is faster but has a bigger memory footprint. As such it can explicitly be disabled on demand when memory is scarce using the session setting enable_hashjoin:

SET enable_hashjoin=false

Limitations

Joining more than 2 tables can result in poor execution plans.

Internally the relations are joined in pairs. So for example in a 3 table join, we’d join (r1 ⋈ r2) ⋈ r3 (r1 with r2 first, then with r3). The poor execution plan could happen as there is no optimization in place which improves the join ordering. Ideally we’d join those relations first which narrow the intermediate result set to a large degree, so that later joins have less work to do. In the example before, joining r1 ⋈ (r2 ⋈ r3) might be the better order.