If you end up with an Elasticsearch cluster whose nodes have very different disk usage, you can use these steps to rebalance the shards.
Before we begin, it's important to understand how Elasticsearch defines balance:
The balance of the cluster depends only on the number of shards on each node and the indices to which those shards belong. It considers neither the sizes of these shards nor the available disk space on each node [...]
curl -s -X GET 'localhost:9200/_cat/allocation?v&s=node'
shards disk.indices disk.used disk.avail disk.total disk.percent host      ip        node
    66      258.6gb   273.3gb    230.4gb    503.7gb           54 10.10.0.1 10.10.0.1 node01
    66      204.1gb   216.9gb    267.5gb    484.4gb           44 10.10.0.2 10.10.0.2 node02
    66      204.9gb     218gb    266.3gb    484.4gb           45 10.10.0.3 10.10.0.3 node03
Note that the first node uses more disk space than the other nodes. However, no rebalancing is performed, since all nodes hold the same number of shards and stay within their disk usage watermarks.
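To put a number on the imbalance, the gap between the fullest and the emptiest node can be computed from the allocation output. The following is a minimal sketch; `disk_percent_spread` is a hypothetical helper name, and it assumes the `?v` header line is present so the column can be found by name.

```shell
# Hypothetical helper: reads `_cat/allocation?v` output on stdin and prints
# the gap between the highest and the lowest disk.percent value.
disk_percent_spread() {
  awk '
    NR == 1 {
      # Locate the disk.percent column by its header name.
      for (i = 1; i <= NF; i++) if ($i == "disk.percent") col = i
      next
    }
    # Skip rows without a numeric value in that column (e.g. UNASSIGNED).
    col && $col ~ /^[0-9]+$/ {
      v = $col + 0
      if (!seen || v > max) max = v
      if (!seen || v < min) min = v
      seen = 1
    }
    END { if (seen) print max - min }
  '
}

# Usage against a live cluster (assumes Elasticsearch on localhost:9200):
#   curl -s 'localhost:9200/_cat/allocation?v&s=node' | disk_percent_spread
```

For the sample output above (54, 44 and 45 percent) this prints 10.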
By default there are these watermark settings:
cluster.routing.allocation.disk.watermark.low: 85%
cluster.routing.allocation.disk.watermark.high: 90%
cluster.routing.allocation.disk.watermark.flood_stage: 95%
If the disk.percent usage of a node exceeds the low watermark, no more shards will be allocated on that node.
If it exceeds the high threshold, Elasticsearch will attempt to relocate shards away from that node to regain disk usage balance.
flood_stage is a last resort to prevent nodes from running out of disk space.
You can also use fixed sizes like:
cluster.routing.allocation.disk.watermark.low: 260gb
cluster.routing.allocation.disk.watermark.high: 250gb
cluster.routing.allocation.disk.watermark.flood_stage: 10gb
This will use disk.avail (disk free) instead of disk.percent (disk usage). Keep in mind that the thresholds are reversed in this case: low has to be the largest value and flood_stage the smallest.
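The percentage watermarks can be read as a simple ladder of thresholds. The sketch below is a simplified model of that ladder, not Elasticsearch's actual allocation logic; `watermark_state` is a hypothetical helper, it only handles whole-number percentages, and its defaults mirror the default settings listed above.

```shell
# Simplified model of the percentage watermarks (hypothetical helper).
# Arguments: used% [low% high% flood_stage%], all whole numbers.
watermark_state() (
  used=$1 low=${2:-85} high=${3:-90} flood=${4:-95}
  if   [ "$used" -gt "$flood" ]; then echo "flood_stage"  # last resort
  elif [ "$used" -gt "$high" ];  then echo "high"         # shards get relocated away
  elif [ "$used" -gt "$low" ];   then echo "low"          # no new shards allocated
  else echo "ok"
  fi
)
```

With the defaults, `watermark_state 54` prints `ok`, which is why a node at 54% triggers no relocation. With lowered thresholds, `watermark_state 54 48 50 95` prints `high`.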
curl -s -X GET 'localhost:9200/_cluster/settings' | tee settings.json
curl -s -X PUT "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "48%",
    "cluster.routing.allocation.disk.watermark.high": "50%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "95%",
    "cluster.routing.rebalance.enable": "all",
    "cluster.routing.allocation.allow_rebalance": "always"
  }
}
'
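The temporary thresholds are chosen so that only the overloaded node crosses them. In the sample allocation output shown earlier, node01 (54%) is above the 50% high watermark, while node02 and node03 (44% and 45%) stay below the 48% low watermark and can still receive shards. A candidate threshold can be checked before applying it with a small filter; `nodes_above` is a hypothetical helper that assumes the `?v` header line is present.

```shell
# Hypothetical helper: reads `_cat/allocation?v` output on stdin and prints
# the names of all nodes whose disk.percent is above the given limit.
nodes_above() {
  awk -v limit="$1" '
    NR == 1 {
      # Locate the needed columns by their header names.
      for (i = 1; i <= NF; i++) {
        if ($i == "disk.percent") pct = i
        if ($i == "node") name = i
      }
      next
    }
    pct && name && $pct ~ /^[0-9]+$/ && $pct + 0 > limit + 0 { print $name }
  '
}

# Usage against a live cluster (assumes Elasticsearch on localhost:9200):
#   curl -s 'localhost:9200/_cat/allocation?v&s=node' | nodes_above 50
```

If this lists every node, the threshold is too low and Elasticsearch has nowhere to move the shards.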
watch curl -s -X GET 'localhost:9200/_cluster/health?pretty'
watch "curl -s -X GET 'localhost:9200/_cat/shards?v&s=index' | grep RELOCATING"
watch curl -s -X GET 'localhost:9200/_cat/allocation?v\&s=node'
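The watch commands above need someone looking at the terminal. For scripted use, the relocating_shards counter of the cluster health response can be polled instead. This is a rough sketch: the function names, the ES_URL variable and the 10 second interval are assumptions, and the JSON is parsed with sed only to avoid extra dependencies.

```shell
# Extracts the relocating_shards counter from _cluster/health JSON on stdin.
relocating_shards() {
  sed -n 's/.*"relocating_shards"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Polls the cluster until no shards are relocating.
# ES_URL (default localhost:9200) and the interval are assumptions.
wait_for_relocations() {
  while :; do
    n=$(curl -s "${ES_URL:-localhost:9200}/_cluster/health" | relocating_shards)
    [ "$n" = "0" ] && break
    echo "relocating_shards: ${n:-unknown}"
    sleep 10
  done
}
```

Calling `wait_for_relocations` blocks until the counter reaches 0. A zero reading immediately after changing the settings may only mean that the relocation has not started yet, so it is worth checking the allocation as well.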
If you can't find good thresholds, i.e. the relocated shards keep upsetting your watermarks, you can use cluster-level shard allocation filters to stop Elasticsearch from allocating shards on a specific node.
curl -X PUT "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
{
  "transient" : {
    "cluster.routing.allocation.exclude._name" : "node01"
  }
}
'
Elasticsearch will relocate the shards based on your defined watermarks and ignore the number of shards on each node. You will end up with an uneven number of shards for a while. Check the relocations to see what is going on.
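The exclude filter accepts a comma-separated list of node names, so more than one node can be drained at once. As a small sketch, the request body can be generated from a list of names; `exclude_body` is a hypothetical helper, and the node names in the usage line are placeholders.

```shell
# Hypothetical helper: prints the settings body that excludes all given
# node names from shard allocation.
exclude_body() (
  IFS=,   # "$*" joins the arguments with the first character of IFS
  printf '{"transient": {"cluster.routing.allocation.exclude._name": "%s"}}\n' "$*"
)

# Usage (assumes Elasticsearch on localhost:9200):
#   exclude_body node01 node02 | curl -s -X PUT 'localhost:9200/_cluster/settings?pretty' \
#     -H 'Content-Type: application/json' -d @-
```

Make sure the remaining nodes have enough free disk space for the shards of all excluded nodes.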
After the relocation is finished ("relocating_shards" : 0), you can revert the temporary settings. Check the previous cluster settings in settings.json. To use the default settings, execute:
curl -s -X PUT "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": null,
    "cluster.routing.allocation.disk.watermark.high": null,
    "cluster.routing.allocation.disk.watermark.flood_stage": null,
    "cluster.routing.rebalance.enable": null,
    "cluster.routing.allocation.allow_rebalance": null,
    "cluster.routing.allocation.exclude._name" : null
  }
}
'
If you have excluded a node with cluster.routing.allocation.exclude._name, Elasticsearch will now relocate the shards to achieve the same number of shards on every node again.
Track the relocations and check the allocation to verify the rebalanced shards.
When a node goes offline, Elasticsearch promotes replicas of the primary shards on this node to new primaries and tries to create new replicas on other nodes in the cluster. The promotion happens right away, but the allocation of new replicas is delayed by a configurable timeout (1 minute by default). Please see Delaying allocation when a node leaves for details.
By default, we have an index template matching all created indices that changes the delay to 1h. This way we want to prevent unnecessary I/O load when we reboot a node.
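The delay is controlled by the index setting index.unassigned.node_left.delayed_timeout. The template itself is not shown here, so the following is only a sketch of what such a template might look like. It assumes the legacy `_template` API (newer clusters use composable templates under `_index_template` with a different body), and both the helper name `delayed_allocation_template` and the template name in the usage line are hypothetical.

```shell
# Hypothetical helper: prints a legacy index template body that sets the
# delayed allocation timeout (default 1h) for all newly created indices.
delayed_allocation_template() {
  cat <<EOF
{
  "index_patterns": ["*"],
  "order": 0,
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "${1:-1h}"
  }
}
EOF
}

# Usage (assumes Elasticsearch on localhost:9200 and the legacy template API):
#   delayed_allocation_template 1h | curl -s -X PUT 'localhost:9200/_template/delayed_allocation' \
#     -H 'Content-Type: application/json' -d @-
```

A template only affects indices created after it was added. Existing indices keep their current value until the setting is updated on them directly.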