Read more

Storing trees in databases

Henning Koch
May 09, 2017Software engineer at makandra GmbH

This card compares patterns to store trees in a relation database like MySQL or PostgreSQL. Implementation examples are for the ActiveRecord Show archive.org snapshot ORM used with Ruby on Rails, but the techniques can be implemented in any language or framework.

Illustration online protection

Rails Long Term Support

Rails LTS provides security patches for old versions of Ruby on Rails (2.3, 3.2, 4.2 and 5.2)

  • Prevents you from data breaches and liability risks
  • Upgrade at your own pace
  • Works with modern Rubies
Read more Show archive.org snapshot

We will be using this example tree (from the acts_as_nested_set Show archive.org snapshot docs):

root
|
+-- Child 1
|   |
|   +-- Child 1.1
|   |
|   +-- Child 1.2
|
+-- Child 2
    |
    +-- Child 2.1
    |
    +-- Child 2.2

Option 1: Parent Association

The simplest way to serialize a tree is to give each node a parent_id column that contains the ID of the parent node.

This would serialize the example above to this table:

id | parent_id | data
---+-----------+----------
 1 |      NULL | root
 2 |         1 | Child 1
 3 |         2 | Child 1.1
 4 |         2 | Child 1.2
 5 |         1 | Child 2
 6 |         5 | Child 2.1
 7 |         5 | Child 2.2

This pattern is so simple, you don't even need a library for it. With ActiveRecord you could implement it with a single belongs_to association:

class Node < ActiveRecord::Base
  belongs_to :parent, class: 'Node'
end

Any modification of the tree (like adding a node or changing a node's parent) only affect a single row in the table, so changes are fast.

The big drawback of this pattern is that you need an unbounded number of queries to read from the tree. E.g. to select all descendants of Child 1 you would need to recursively query for children until you have reached all leaf nodes:

SELECT * FROM nodes WHERE parent_id=2
SELECT * FROM nodes WHERE parent_id=3
SELECT * FROM nodes WHERE parent_id=4

The number of queries grows with the depth of your tree.

Because of this limitation this pattern should generally be avoided.

Option 2: Materialized Path

With the Materialized Path pattern you would cache all ancestor IDs of each node in a string column.

This would serialize the example above to this table:

id | ancestry | data
---+----------+----------
 1 | NULL     | root
 2 | 1/       | Child 1
 3 | 1/2/     | Child 1.1
 4 | 1/2/     | Child 1.2
 5 | 1/       | Child 2
 6 | 1/5/     | Child 2.1
 7 | 1/5/     | Child 2.2

Implementations like ancestry Show archive.org snapshot will maintain this ancestry column for you. A library will also give you convenient methods to cache this new column to effectively query your tree. E.g. we can select all descendants of Child 1 with a single query:

SELECT * FROM nodes WHERE ancestry = '1/2' OR ancestry LIKE '1/2/%'

Since LIKE queries can be indexed, this query can be fast.

Adding a node to the tree only affects a single row in the table, so changes are usually fast as well. However, changing a node's parent means updating the node and all of its descendants.

Overall Materialized Path is a well-rounded pattern that can be recommended for most use cases.

Option 3: Nested Set

The Nested Set pattern looks at the entire tree structure and assigns left and right boundary values to each node. The boundary values must contain the boundary values for all descendants.

The acts_as_nested_set Show archive.org snapshot docs provide us with this pieace of ASCII art that shows boundary values for the table above:

 ___________________________________________________________________
|  Root                                                             |
|    ____________________________    ____________________________   |
|   |  Child 1                  |   |  Child 2                  |   |
|   |   __________   _________  |   |   __________   _________  |   |
|   |  |  C 1.1  |  |  C 1.2 |  |   |  |  C 2.1  |  |  C 2.2 |  |   |
1   2  3_________4  5________6  7   8  9_________10 11_______12 13  14
|   |___________________________|   |___________________________|   |
|___________________________________________________________________|

Note that the boundary values will not necessarily correspond to row IDs.

This would serialize the example above to this table:

id | parent_id | lft | rgt | data
---+-----------+-----+-----+----------
 1 |         0 |   1 |  14 | root
 2 |         1 |   2 |   7 | Child 1
 3 |         2 |   3 |   4 | Child 1.1
 4 |         2 |   5 |   6 | Child 1.2
 5 |         1 |   8 |  13 | Child 2
 6 |         5 |   9 |  10 | Child 2.1
 7 |         5 |  11 |  12 | Child 2.2

Implementations like awesome_nested_set Show archive.org snapshot will maintain this data structure for you. A library will also give you convenient methods to cache this new column to effectively query your tree. E.g. we can select all descendants of Child 1 with a single query:

SELECT * FROM nodes WHERE lft >= 2 AND rgt < 7 AND id != 2

By indexing the lft and rgt columns we can make this query very fast, since the database only needs to compare indexed numbers.

The big drawback of this pattern is that this structure is extremely expensive to maintain during writes. Adding a node or changing a node's parent might require reading and updating the entire tree.

Because of the complexity of this pattern, Nested Set is only recommended when you expect few writes and require maximum read performance. For all other cases prefer Materialized Path.

Option 4: Closure Table

The Closure Table pattern Show archive.org snapshot extends the Parent Association pattern by maintaining a separate table with hierarchy information. This second table can be used for efficient tree queries.

An ActiveRecord implementation of this pattern is closure_tree. It promises superior performance to Materialized Path with ancestry Show archive.org snapshot , but we haven't worked with it yet.

In any case this pattern requires you to maintain a second table. Reads will need to span two tables using JOINs or subselects, which costs some overhead.

Henning Koch
May 09, 2017Software engineer at makandra GmbH
Posted by Henning Koch to makandra dev (2017-05-09 07:57)