Repair broken etcd node


If an etcd node is no longer a member of the remaining etcd cluster or fails to connect to it, you need to remove the node from the cluster and add it again.

Upstream documentation

Make sure to read and understand the detailed upstream instructions for etcd runtime reconfiguration.

Re-adding a faulty node

On the faulty node

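# stop etcd and move the old data directory out of the way
sudo systemctl stop etcd
sudo mv /var/lib/etcd /tmp/etcd.old
# recreate empty data and WAL directories owned by the etcd user
sudo mkdir -p /var/lib/etcd/{data,wal}
sudo chown -R etcd: /var/lib/etcd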

On a working node

Warning

Double-check that you are using the correct hostname. Removing the wrong member takes a healthy node out of the cluster.

Bash

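# look up the ID, name and peer URLs of the faulty node
etcdctl member list -w table
# remove the faulty member, then register it again
etcdctl member remove $ID_OF_FAULTY_NODE
etcdctl member add $NAME_OF_FAULTY_NODE --peer-urls=$PEER_ADDRS_OF_FAULTY_NODE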
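
The ID and peer URLs used above can be read from the member list instead of being copied by hand. This is a minimal sketch, assuming the default comma-separated output of etcdctl v3 (ID, status, name, peer URLs, ...), which is the same layout the AWK variant relies on. member_fields is a hypothetical helper name, not part of etcdctl.

```shell
# Hypothetical helper: print "<id> <peer-urls>" for the member named $1.
# Expects the default `etcdctl member list` output on stdin, e.g.
#   8e9e05c52164694d, started, etcd1.example.com, https://etcd1.example.com:2380, ...
member_fields() {
  awk -F ', ' -v name="$1" '$3 == name { print $1, $4 }'
}

# Possible usage on a working node (bash, needs a reachable cluster):
#   read -r ID_OF_FAULTY_NODE PEER_ADDRS_OF_FAULTY_NODE \
#     < <(etcdctl member list | member_fields "$NAME_OF_FAULTY_NODE")
```

If the name does not match any member, the helper prints nothing, so an empty ID is a sign that the hostname is wrong.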

AWK

# define the hostname you wish to re-add as variable
member=FQDN_OF_FAULTY_NODE
# re-add the host using an awk script
etcdctl member list | awk -F ', ' -v member="$member" '{
  if ( member == $3 ){
    system("etcdctl member remove " $1);
    system("etcdctl member add " $3 " --peer-urls=" $4);
  }
}'
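
Because the AWK script removes a member as soon as the hostname matches, it can help to print the commands first and run them only after checking them. The following dry-run variant is a sketch under the same assumption about the output format (name in the third field, peer URLs in the fourth). plan_readd is a hypothetical function name.

```shell
# Hypothetical dry run: print the etcdctl commands the AWK script would
# execute for member $1, without running them.
# Expects the default `etcdctl member list` output on stdin.
plan_readd() {
  awk -F ', ' -v member="$1" '$3 == member {
    print "etcdctl member remove " $1
    print "etcdctl member add " $3 " --peer-urls=" $4
  }'
}

# Possible usage on a working node:
#   etcdctl member list | plan_readd "$member"
```

No output means no member carries that name, and nothing would have been removed.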

On the faulty node again

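# switch the initial cluster state from "new" to "existing" so the node joins the running cluster
# note: this replaces every occurrence of "new" in the file, so review the result before restarting
sudo sed -i -e 's/new/existing/g' /etc/etcd/etcd.cfg
sudo systemctl restart etcd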
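
After the restart, it is worth confirming from a working node that the re-added member is listed as started. This is a small sketch, again assuming the default comma-separated etcdctl v3 output with the status in the second field. member_started is a hypothetical helper name.

```shell
# Hypothetical check: succeed (exit 0) only if member $1 is listed as "started".
# Expects the default `etcdctl member list` output on stdin.
member_started() {
  awk -F ', ' -v name="$1" '
    $3 == name && $2 == "started" { found = 1 }
    END { exit !found }
  '
}

# Possible usage on a working node:
#   etcdctl member list | member_started "$NAME_OF_FAULTY_NODE" && echo "rejoined"
```

A member that was added but has not joined yet shows up as unstarted with an empty name, so the check fails until etcd on the faulty node is actually running.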

Logging

Even if etcd is configured to write to /var/log/etcd/etcd.log, on new hosts (Ubuntu focal) its STDOUT may end up in the systemd journal instead. You can follow the journal with sudo journalctl -efu etcd

Help, my terminal gets spammed with error messages

You are probably connected to a Patroni cluster node. When patronictl cannot connect to etcd, it retries until it succeeds. As long as the etcd cluster is broken, that will not happen, so you need to stop it with the following:

killall patronictl
Claus-Theodor Riegg
Last edit
Marc Dierig
License
Source code in this card is licensed under the MIT License.