If you end up with an Elasticsearch cluster that has very different disk usage on its nodes, you can use these steps to rebalance the shards.
Before we begin, it's important to understand how Elasticsearch defines balance:
The balance of the cluster depends only on the number of shards on each node and the indices to which those shards belong. It considers neither the sizes of these shards nor the available disk space on each node [...]
Rebalance
Check the current allocation
curl -s -X GET 'localhost:9200/_cat/allocation?v&s=node'
shards disk.indices disk.used disk.avail disk.total disk.percent host ip node
66 258.6gb 273.3gb 230.4gb 503.7gb 54 10.10.0.1 10.10.0.1 node01
66 204.1gb 216.9gb 267.5gb 484.4gb 44 10.10.0.2 10.10.0.2 node02
66 204.9gb 218gb 266.3gb 484.4gb 45 10.10.0.3 10.10.0.3 node03
Note that the first node uses more disk space than the others. However, no rebalancing is performed since all nodes hold the same number of shards and stay within their disk usage watermarks.
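If you want Elasticsearch's own reasoning for a shard's placement, the cluster allocation explain API can help. The index name my-index and the shard number below are just placeholders for one of your shards:
curl -s -X GET "localhost:9200/_cluster/allocation/explain?pretty" -H 'Content-Type: application/json' -d'
{
  "index": "my-index",
  "shard": 0,
  "primary": true
}
'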
Watermarks
By default there are these watermark settings:
cluster.routing.allocation.disk.watermark.low: 85%
cluster.routing.allocation.disk.watermark.high: 90%
cluster.routing.allocation.disk.watermark.flood_stage: 95%
- If the disk.percent usage exceeds the low watermark, no more shards will be allocated on that node
- If the disk usage exceeds the high threshold, Elasticsearch will attempt to relocate shards to regain disk usage balance
- The flood_stage watermark is a last resort to prevent nodes from running out of disk space
You can also use fixed sizes like:
cluster.routing.allocation.disk.watermark.low: 260gb
cluster.routing.allocation.disk.watermark.high: 250gb
cluster.routing.allocation.disk.watermark.flood_stage: 10gb
Absolute values refer to disk.avail (free disk space) instead of disk.percent (disk usage). Keep in mind that the thresholds are reversed in this case: the low watermark gets the largest value and flood_stage the smallest.
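To verify which watermark values are currently in effect (defaults included), you can query the cluster settings; include_defaults and flat_settings are standard query parameters:
curl -s 'localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty' | grep watermark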
Rebalancing
First, check and back up your current cluster settings
curl -s -X GET 'localhost:9200/_cluster/settings' | tee settings.json
Initiate rebalancing by temporarily lowering the watermarks so that the fullest node (here node01 at 54%) exceeds the high watermark while the other nodes stay below the low watermark
curl -s -X PUT "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "48%",
    "cluster.routing.allocation.disk.watermark.high": "50%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "95%",
    "cluster.routing.rebalance.enable": "all",
    "cluster.routing.allocation.allow_rebalance": "always"
  }
}
'
Check relocations
watch curl -s -X GET 'localhost:9200/_cluster/health?pretty'
watch "curl -s -X GET 'localhost:9200/_cat/shards?v&s=index' | grep RELOCATING"
watch curl -s -X GET 'localhost:9200/_cat/allocation?v\&s=node'
Exclude node
If you can't find good thresholds, i.e. the relocated shards always conflict with your watermarks, you can use cluster-level shard allocation filters to force Elasticsearch not to allocate shards on a specific node.
curl -X PUT "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.exclude._name": "node01"
  }
}
'
Elasticsearch will relocate the shards based on your defined watermarks and ignore the number of shards on each node. You will end up with an uneven number of shards for a while. Check the relocations to get an insight into what is going on.
Finish
After the relocation has finished ("relocating_shards": 0) you can revert the temporary settings. Check the previous cluster settings in settings.json. To restore the default settings, execute:
curl -s -X PUT "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": null,
    "cluster.routing.allocation.disk.watermark.high": null,
    "cluster.routing.allocation.disk.watermark.flood_stage": null,
    "cluster.routing.rebalance.enable": null,
    "cluster.routing.allocation.allow_rebalance": null,
    "cluster.routing.allocation.exclude._name": null
  }
}
'
If you have excluded a node with cluster.routing.allocation.exclude._name, Elasticsearch will now relocate shards to achieve the same number of shards on every node again.
Track the relocations and check the allocation to verify the rebalanced shards.
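As a quick check that nothing is relocating anymore, you can filter the cluster health response (filter_path is a standard query parameter):
curl -s 'localhost:9200/_cluster/health?filter_path=relocating_shards,unassigned_shards&pretty'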
Relocation when a node goes offline / leaves the cluster
When a node goes offline, Elasticsearch promotes replicas of that node's primary shards to new primaries and tries to create new replicas on other nodes in the cluster. This does not happen immediately and the timeout is configurable. Please see Delaying allocation when a node leaves for details.
We have a default index template matching all created indices that changes the delay to 1h. This way we prevent unnecessary I/O workload when we reboot a node.
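Such a template could look like the following minimal sketch using the composable index template API; the template name delay-node-left and the catch-all index pattern are assumptions, and older Elasticsearch versions use the legacy _template API instead:
curl -s -X PUT "localhost:9200/_index_template/delay-node-left?pretty" -H 'Content-Type: application/json' -d'
{
  "index_patterns": ["*"],
  "template": {
    "settings": {
      "index.unassigned.node_left.delayed_timeout": "1h"
    }
  }
}
'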