HowTo: Rebalance Elasticsearch Shards

If your Elasticsearch cluster ends up with very different disk usage across its nodes, you can use these steps to rebalance the shards.

Before we begin, it's important to understand how Elasticsearch defines balance:

The balance of the cluster depends only on the number of shards on each node and the indices to which those shards belong. It considers neither the sizes of these shards nor the available disk space on each node [...]

Rebalance

Check the current allocation

curl -s -X GET 'localhost:9200/_cat/allocation?v&s=node'

shards disk.indices disk.used disk.avail disk.total disk.percent host        ip          node
    66      258.6gb   273.3gb    230.4gb    503.7gb           54 10.10.0.1 10.10.0.1 node01
    66      204.1gb   216.9gb    267.5gb    484.4gb           44 10.10.0.2 10.10.0.2 node02
    66      204.9gb     218gb    266.3gb    484.4gb           45 10.10.0.3 10.10.0.3 node03

Note that the first node uses more disk space than the other nodes. However, no rebalancing is performed, since all nodes hold the same number of shards and stay within their disk usage watermarks.
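To put a number on the imbalance, the disk.percent column can be reduced to a single spread value. The following is a minimal sketch, assuming the cat API is asked for only the node and disk.percent columns via its h parameter. The function name disk_spread is made up for this example.

```shell
# Hypothetical helper: reads "node disk.percent" lines on stdin and prints
# the difference between the fullest and the emptiest node in percent points.
disk_spread() {
  awk '
    $2 ~ /^[0-9]+$/ {            # skip rows without a numeric disk.percent
      used = $2 + 0              # force numeric comparison
      if (n == 0 || used > max) max = used
      if (n == 0 || used < min) min = used
      n++
    }
    END { if (n > 0) print max - min }
  '
}

# Usage against a live cluster (assumes Elasticsearch on localhost:9200):
# curl -s 'localhost:9200/_cat/allocation?h=node,disk.percent' | disk_spread
```

For the allocation shown above (54, 44 and 45 percent) this prints 10.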

Watermarks

By default there are these watermark settings:

cluster.routing.allocation.disk.watermark.low: 85%
cluster.routing.allocation.disk.watermark.high: 90%
cluster.routing.allocation.disk.watermark.flood_stage: 95%

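If the disk usage (disk.percent) of a node exceeds the low watermark, no more shards will be allocated to that node. Primary shards of newly created indices are exempt from this rule.
If the disk usage exceeds the high watermark, Elasticsearch will attempt to relocate shards away from that node.
The flood_stage watermark is a last resort to prevent nodes from running out of disk space: Elasticsearch makes every index with a shard on the affected node read-only.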
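The three thresholds can be read as a simple classification of a node's disk.percent value. This is only an illustrative sketch: the function name watermark_state is hypothetical, it handles whole percentages only, and the real allocator also considers things like the projected size of relocating shards.

```shell
# Hypothetical helper: classify a disk.percent value against percentage
# watermarks. The defaults are the Elasticsearch defaults listed above.
# Arguments: used [low] [high] [flood_stage], all as whole percentages.
watermark_state() (
  used=$1; low=${2:-85}; high=${3:-90}; flood=${4:-95}
  if   [ "$used" -gt "$flood" ]; then echo flood_stage
  elif [ "$used" -gt "$high" ];  then echo high
  elif [ "$used" -gt "$low" ];   then echo low
  else echo ok
  fi
)

# Usage:
# watermark_state 54            # node01 with default watermarks
# watermark_state 54 48 50 95   # node01 with custom watermarks
```

With the default watermarks, node01 at 54 percent is classified as ok, which is why Elasticsearch sees no reason to move anything.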

You can also use fixed sizes like:

cluster.routing.allocation.disk.watermark.low: 260gb
cluster.routing.allocation.disk.watermark.high: 250gb
cluster.routing.allocation.disk.watermark.flood_stage: 10gb

With fixed sizes, the watermarks refer to free space (disk.avail) instead of used space (disk.percent). The order of the thresholds is therefore reversed: low is the largest value and flood_stage the smallest.

Rebalancing

First, check and back up your current cluster settings:

curl -s -X GET 'localhost:9200/_cluster/settings' | tee settings.json

Initiate rebalancing

curl -s -X PUT "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "48%",
    "cluster.routing.allocation.disk.watermark.high": "50%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "95%",
    "cluster.routing.rebalance.enable": "all",
    "cluster.routing.allocation.allow_rebalance": "always"
  }
}
'
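Before applying temporary watermarks, it can help to check which nodes they would affect. The sketch below is a simplification and not how Elasticsearch itself decides: it only compares each node's disk.percent with the chosen low and high values. The function name plan_relocation and the labels sheds, blocked and receives are made up for this example.

```shell
# Hypothetical helper: given low and high percentage watermarks, read
# "node disk.percent" lines on stdin and print the expected role per node.
#   sheds    = above the high watermark, shards should move away
#   blocked  = above the low watermark, no new shards are allocated
#   receives = below the low watermark, can take shards
plan_relocation() {
  awk -v low="$1" -v high="$2" '
    $2 ~ /^[0-9]+$/ {
      used = $2 + 0
      if (used > high + 0)     role = "sheds"
      else if (used > low + 0) role = "blocked"
      else                     role = "receives"
      print $1 " " role
    }
  '
}

# Usage (assumes Elasticsearch on localhost:9200):
# curl -s 'localhost:9200/_cat/allocation?h=node,disk.percent' | plan_relocation 48 50
```

For the allocation shown earlier, 48 and 50 percent mark node01 as sheds and node02 and node03 as receives, which is the intended effect of the temporary settings.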

Check relocations

watch "curl -s -X GET 'localhost:9200/_cluster/health?pretty'"

watch "curl -s -X GET 'localhost:9200/_cat/shards?v&s=index' | grep RELOCATING"

watch "curl -s -X GET 'localhost:9200/_cat/allocation?v&s=node'"
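If you would rather have a script wait for the relocations than watch them, the relocating_shards field of the cluster health response can be polled. This is a minimal sketch under assumptions: the function names, the default host and the polling interval are made up, and the number is extracted with sed so that no JSON tool is required.

```shell
# Hypothetical helper: extract relocating_shards from a _cluster/health
# response on stdin (works with and without ?pretty).
relocating_count() {
  sed -n 's/.*"relocating_shards"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Hypothetical helper: poll until no shard is relocating.
# Arguments: [host] [interval in seconds]. Both defaults are assumptions.
# If the cluster cannot be reached, the loop keeps retrying.
wait_for_relocations() (
  host=${1:-localhost:9200}; interval=${2:-10}
  while :; do
    count=$(curl -s "$host/_cluster/health" | relocating_count)
    [ "$count" = "0" ] && break
    echo "relocating_shards: ${count:-unknown}"
    sleep "$interval"
  done
)

# Usage:
# wait_for_relocations localhost:9200 30 && echo "relocation finished"
```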

Exclude node

If you can't find good thresholds, i.e. the relocated shards keep pushing other nodes over your watermarks, you can use cluster-level shard allocation filters to force Elasticsearch not to allocate shards on a specific node.

curl -X PUT "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
{
  "transient" : {
    "cluster.routing.allocation.exclude._name" : "node01"
  }
}
'

Elasticsearch will relocate the shards based on your defined watermarks and ignore the number of shards on each node. You will end up with an uneven number of shards per node for a while. Check the relocations to see what is going on.

Finish

After the relocation is finished ("relocating_shards" : 0), you can revert the temporary settings. Check the previous cluster settings in settings.json. To return to the default settings, execute:

curl -s -X PUT "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": null,
    "cluster.routing.allocation.disk.watermark.high": null,
    "cluster.routing.allocation.disk.watermark.flood_stage": null,
    "cluster.routing.rebalance.enable": null,
    "cluster.routing.allocation.allow_rebalance": null,
    "cluster.routing.allocation.exclude._name" : null
  }
}
'

If you have excluded a node with cluster.routing.allocation.exclude._name, Elasticsearch will now relocate shards until every node holds the same number of shards again.

Track the relocations and check the allocation to verify the rebalanced shards.

Relocation when node goes offline / leaves the cluster

When a node goes offline, Elasticsearch promotes replicas of the primary shards on this node to new primaries and tries to create new replicas on other nodes in the cluster. This does not happen immediately, and the timeout is configurable. Please see "Delaying allocation when a node leaves" in the Elasticsearch documentation for details.

By default, we have an index template that matches all created indices and changes the delay to 1h. This way we prevent unnecessary I/O load when we reboot a node.
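Such a template could look roughly like the following. This is a sketch under assumptions, not our actual template: it uses the legacy _template API (newer Elasticsearch versions prefer _index_template), the template name delayed-allocation and the function name are hypothetical, and only the setting index.unassigned.node_left.delayed_timeout is taken from the Elasticsearch documentation.

```shell
# Hypothetical helper: print a template body that sets the allocation delay
# for all matching indices. Argument: [delay], defaults to 1h.
delayed_allocation_template() {
  cat <<EOF
{
  "index_patterns": ["*"],
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "${1:-1h}"
  }
}
EOF
}

# Usage (assumes Elasticsearch on localhost:9200, template name is made up):
# delayed_allocation_template | curl -s -X PUT \
#   'localhost:9200/_template/delayed-allocation?pretty' \
#   -H 'Content-Type: application/json' -d @-
```

A template only affects indices created after it was added. Existing indices keep their current delay until their settings are updated.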

Andreas Vöst

Last edit

2021-07-16

Claus-Theodor Riegg