Ubuntu HA - Fencing Ubuntu Bionic in Microsoft Azure

This is a small procedure to install three Pacemaker + Corosync nodes called “vm01”, “vm02” and “vm03” in Microsoft Azure AND to configure the fence_azure_arm agent, available in Ubuntu Bionic, to fence the virtual nodes in case something goes wrong in your HA cluster.

This is my current setup: a “gateway” machine that reaches all 3 VMs over the public network, and each of the 3 VMs has 2 vNICs, one on the public and one on the private network:

- gateway: Internet IP mapped to an IP on the public network: 10.250.92.254
- public network: 10.250.92.11, 10.250.92.12 and 10.250.92.13
- private network: 10.250.96.11, 10.250.96.12 and 10.250.96.13
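
Throughout the procedure the nodes are reached by name (vm01, vm02 and vm03), so those names need to resolve on every node and on the gateway. A minimal sketch of what that could look like in /etc/hosts (this mapping is an assumption, not part of the original setup; the corosync configuration further below uses the 10.250.92.x addresses):

$ cat /etc/hosts
127.0.0.1     localhost
10.250.92.11  vm01
10.250.92.12  vm02
10.250.92.13  vm03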

In order to have the fence_azure_arm fencing agent fully working, we first need to install the Azure Python library packages on all nodes:

$ apt-get install python-pip
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  libexpat1-dev libpython-all-dev libpython-dev libpython2.7-dev python-all python-all-dev
  python-dev python-pip-whl python-setuptools python-wheel python-xdg python2.7-dev
Suggested packages:
  python-setuptools-doc
The following NEW packages will be installed:
  libexpat1-dev libpython-all-dev libpython-dev libpython2.7-dev python-all python-all-dev
  python-dev python-pip python-pip-whl python-setuptools python-wheel python-xdg python2.7-dev
0 upgraded, 13 newly installed, 0 to remove and 0 not upgraded.
Need to get 30.9 MB of archives.
After this operation, 47.4 MB of additional disk space will be used.
Do you want to continue? [Y/n] y

$ pip install azure-storage-blob

$ pip install azure-mgmt-storage

$ pip install azure-mgmt-compute
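
A quick sanity check (not part of the original procedure) to confirm the libraries are importable by the system Python that the fence agent will use:

$ python -c "import azure.common, azure.mgmt.compute, azure.storage.blob; print('azure libraries ok')"
azure libraries ok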

$ pip show azure-mgmt-compute
Name: azure-mgmt-compute
Version: 13.0.0
Summary: Microsoft Azure Compute Management Client Library for Python
Home-page: https://github.com/Azure/azure-sdk-for-python
Author: Microsoft Corporation
Author-email: azpysdkhelp@microsoft.com
License: MIT License
Location: /usr/local/lib/python2.7/dist-packages
Requires: azure-mgmt-nspkg, azure-common, msrest, msrestazure

$ pip show azure-storage-blob
Name: azure-storage-blob
Version: 12.4.0
Summary: Microsoft Azure Blob Storage Client Library for Python
Home-page: https://github.com/Azure/azure-sdk-for-python/tree/master/sdk/storage/azure-storage-blob
Author: Microsoft Corporation
Author-email: ascl@microsoft.com
License: MIT License
Location: /usr/local/lib/python2.7/dist-packages
Requires: cryptography, azure-core, enum34, typing, azure-storage-nspkg, msrest, futures

$ pip show azure-mgmt-storage
Name: azure-mgmt-storage
Version: 11.2.0
Summary: Microsoft Azure Storage Management Client Library for Python
Home-page: https://github.com/Azure/azure-sdk-for-python
Author: Microsoft Corporation
Author-email: azpysdkhelp@microsoft.com
License: MIT License
Location: /usr/local/lib/python2.7/dist-packages
Requires: azure-mgmt-nspkg, azure-common, msrest, msrestazure

Then let’s install the corosync engine on all nodes:

$ apt-get install corosync
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  ibverbs-providers libcfg6 libcmap4 libcorosync-common4 libcpg4 libibverbs1 libnl-route-3-200
  libnspr4 libnss3 libqb0 libquorum5 librdmacm1 libstatgrab10 libtotem-pg5 libvotequorum8 xsltproc
The following NEW packages will be installed:
  corosync ibverbs-providers libcfg6 libcmap4 libcorosync-common4 libcpg4 libibverbs1
  libnl-route-3-200 libnspr4 libnss3 libqb0 libquorum5 librdmacm1 libstatgrab10 libtotem-pg5
  libvotequorum8 xsltproc
0 upgraded, 17 newly installed, 0 to remove and 0 not upgraded.
Need to get 2,136 kB of archives.
After this operation, 8,698 kB of additional disk space will be used.
Do you want to continue? [Y/n] 

and make sure it is working by checking if corosync has started:

$ systemctl is-active corosync.service
active

$ sudo corosync-quorumtool 
Quorum information
------------------
Date:             Tue Sep  1 18:48:19 2020
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          2130706433
Ring ID:          2130706433/4
Quorate:          No

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      1
Quorum:           2 Activity blocked
Flags:            

Membership information
----------------------
    Nodeid      Votes Name
2130706433          1 localhost (local)

In Ubuntu, corosync starts as a standalone 1-node cluster, so the output of “corosync-quorumtool” should be similar to the one shown above.

Before moving on and trying to set up anything else, let’s first make sure we can establish a fully operational corosync cluster with the nodes “vm01”, “vm02” and “vm03”.

In order to achieve that, we first have to generate a corosync key on just one node:

[rafaeldtinoco@vm02 ~]$ corosync-keygen 
Corosync Cluster Engine Authentication key generator.
Gathering 1024 bits for key from /dev/random.
Press keys on your keyboard to generate entropy.
Press keys on your keyboard to generate entropy (bits = 832).
Press keys on your keyboard to generate entropy (bits = 976).
Writing corosync key to /etc/corosync/authkey.

and copy it to the other nodes:

$ sudo scp /etc/corosync/authkey root@vm01:/etc/corosync/authkey
$ sudo scp /etc/corosync/authkey root@vm03:/etc/corosync/authkey
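
The key must stay readable by root only on every node. A quick check after the copy (a sketch; corosync-keygen creates the file with mode 0400 by default):

$ sudo ls -l /etc/corosync/authkey        # should show mode 0400 (-r--------), owner root:root
$ sudo chmod 0400 /etc/corosync/authkey   # only needed if the copy changed the mode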

All nodes should have the same /etc/corosync/corosync.conf file with contents:

$ cat /etc/corosync/corosync.conf 
totem {
        version: 2
        secauth: off
        cluster_name: clubionic
        transport: udpu
}

nodelist {
        node {
                ring0_addr: 10.250.92.11
                name: vm01
                nodeid: 1
        }
        node {
                ring0_addr: 10.250.92.12
                name: vm02
                nodeid: 2
        }
        node {
                ring0_addr: 10.250.92.13
                name: vm03
                nodeid: 3
        }
}

quorum {
        provider: corosync_votequorum
        two_node: 0
}

qb {
        ipc_type: native
}

logging {

        fileline: on
        to_stderr: on
        to_logfile: yes
        logfile: /var/log/corosync/corosync.log
        to_syslog: no
        debug: off
}

and then “corosync.service” can be restarted on all nodes:

$ systemctl restart corosync.service
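
If you are driving the nodes from the gateway, a small loop restarts it everywhere (a sketch, assuming ssh access to the three nodes with sudo rights):

$ for node in vm01 vm02 vm03; do ssh $node sudo systemctl restart corosync.service; done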

and we can check if corosync has obtained all needed votes:

$ sudo corosync-quorumtool 
Quorum information
------------------
Date:             Tue Sep  1 19:07:21 2020
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          2
Ring ID:          1/8
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2  
Flags:            Quorate 

Membership information
----------------------
    Nodeid      Votes Name
         1          1 10.250.92.11
         2          1 10.250.92.12 (local)
         3          1 10.250.92.13

NOW we are ready to move on to installing the resource and fencing agent manager (pacemaker) and the rest of the tools that will turn this into our Ubuntu Bionic HA cluster in Azure.

Let’s install everything at once: the resource manager, its command-line utilities, the crmsh interface, and the resource and fencing agents:

$ apt-get install pacemaker pacemaker-cli-utils crmsh resource-agents fence-agents 
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  cluster-glue libcib4 libcrmcluster4 libcrmcommon3 libcrmservice3 libdbus-glib-1-2 libesmtp6
  liblrm2 liblrmd1 libltdl7 libnet-telnet-perl libnet1 libopenhpi3 libopenipmi0 libpe-rules2
  libpe-status10 libpengine10 libpils2 libplumb2 libplumbgpl2 libsensors4 libsnmp-base libsnmp30
  libstonith1 libstonithd2 libtransitioner2 libxml2-utils openhpid pacemaker-common
  pacemaker-resource-agents python-bs4 python-dateutil python-html5lib python-lxml python-parallax
  python-pexpect python-ptyprocess python-pycurl python-webencodings python-yaml snmp
Suggested packages:
  ipmitool csync2 ocfs2-tools sbd python-requests python-suds lm-sensors snmp-mibs-downloader
  python-genshi python-lxml-dbg python-lxml-doc python-pexpect-doc libcurl4-gnutls-dev
  python-pycurl-dbg python-pycurl-doc
The following NEW packages will be installed:
  cluster-glue crmsh fence-agents libcib4 libcrmcluster4 libcrmcommon3 libcrmservice3
  libdbus-glib-1-2 libesmtp6 liblrm2 liblrmd1 libltdl7 libnet-telnet-perl libnet1 libopenhpi3
  libopenipmi0 libpe-rules2 libpe-status10 libpengine10 libpils2 libplumb2 libplumbgpl2 libsensors4
  libsnmp-base libsnmp30 libstonith1 libstonithd2 libtransitioner2 libxml2-utils openhpid pacemaker
  pacemaker-cli-utils pacemaker-common pacemaker-resource-agents python-bs4 python-dateutil
  python-html5lib python-lxml python-parallax python-pexpect python-ptyprocess python-pycurl
  python-webencodings python-yaml resource-agents snmp
0 upgraded, 46 newly installed, 0 to remove and 0 not upgraded.
Need to get 6,632 kB of archives.
After this operation, 27.8 MB of additional disk space will be used.
Do you want to continue? [Y/n]

Make sure to install those on all cluster nodes. In Ubuntu, pacemaker will be “ready” (enabled and started) right after the installation, and you can check that by issuing the following command on all servers:

[rafaeldtinoco@vm01 ~]$ systemctl is-active pacemaker.service 
active
[rafaeldtinoco@vm02 ~]$ systemctl is-active pacemaker.service 
active
[rafaeldtinoco@vm03 ~]$ systemctl is-active pacemaker.service 
active

or by using the crmsh tool, the preferred way to manage a pacemaker cluster in Ubuntu:

$ crm status
Stack: corosync
Current DC: vm02 (version 1.1.18-2b07d5c5a9) - partition with quorum
Last updated: Tue Sep  1 20:26:15 2020
Last change: Tue Sep  1 20:22:19 2020 by hacluster via crmd on vm02

3 nodes configured
0 resources configured

Online: [ vm01 vm02 vm03 ]

No resources

As you can see here, pacemaker is “online”, with all nodes active, which was expected since we already had corosync, the pacemaker messaging layer, up and running. Now it is time for us to configure the fence_azure_arm fencing agent.

We will end up with the following configuration in the pacemaker cluster:

node 1: vm01
node 2: vm02
node 3: vm03
primitive fence_vm01 stonith:fence_azure_arm \
	params resourceGroup=eastus plug=vm01 username=xxxxxxxx-05af-4f9d-9e80-1bf551ca02a4 login=xxxxxxxx-05af-4f9d-9e80-1bf551ca02a4 passwd="xxxxxxxxxxxx" tenantId=xxxxxxxx-e8f6-40ae-8875-da47c934f1c1 subscriptionId=xxxxxxxx-0262-4e54-9bd9-8b7458eec86b pcmk_monitor_retries=4 pcmk_action_limit=3 power_timeout=240 pcmk_reboot_timeout=900 \
	op monitor interval=3600 timeout=120 \
	meta target-role=Started
primitive fence_vm02 stonith:fence_azure_arm \
	params resourceGroup=eastus plug=vm02 username=xxxxxxxx-05af-4f9d-9e80-1bf551ca02a4 login=xxxxxxxx-05af-4f9d-9e80-1bf551ca02a4 passwd="xxxxxxxxxxxx" tenantId=xxxxxxxx-e8f6-40ae-8875-da47c934f1c1 subscriptionId=xxxxxxxx-0262-4e54-9bd9-8b7458eec86b pcmk_monitor_retries=4 pcmk_action_limit=3 power_timeout=240 pcmk_reboot_timeout=900 \
	op monitor interval=3600 timeout=120 \
	meta target-role=Started
primitive fence_vm03 stonith:fence_azure_arm \
	params resourceGroup=eastus plug=vm03 username=xxxxxxxx-05af-4f9d-9e80-1bf551ca02a4 login=xxxxxxxx-05af-4f9d-9e80-1bf551ca02a4 passwd="xxxxxxxxxxxx" tenantId=xxxxxxxx-e8f6-40ae-8875-da47c934f1c1 subscriptionId=xxxxxxxx-0262-4e54-9bd9-8b7458eec86b pcmk_monitor_retries=4 pcmk_action_limit=3 power_timeout=240 pcmk_reboot_timeout=900 \
	op monitor interval=3600 timeout=120 \
	meta target-role=Started
location l_fence_vm01 fence_vm01 -inf: vm01
location l_fence_vm02 fence_vm02 -inf: vm02
location l_fence_vm03 fence_vm03 -inf: vm03
property cib-bootstrap-options: \
	have-watchdog=false \
	dc-version=1.1.18-2b07d5c5a9 \
	cluster-infrastructure=corosync \
	cluster-name=clubionic \
	stonith-enabled=on \
	stonith-action=off \
	last-lrm-refresh=1599053050 \
	stonith-timeout=900

Note: you can use the “crmsh” tool to set the properties and to configure the primitives and locations. (Just run “crm configure edit” and you can edit the configuration so it matches the one shown above.)

This configuration was obtained by following the steps described at:

https://docs.microsoft.com/en-us/azure/azure-sql/virtual-machines/linux/rhel-high-availability-stonith-tutorial#configure-the-fencing-agent

for the Role-Based Access Control creation, which is also described at:

https://docs.microsoft.com/en-us/azure/role-based-access-control/tutorial-custom-role-cli#create-a-custom-role
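
In short, those steps create a custom role that is allowed to power VMs off and on, plus a service principal bound to that role (its ApplicationID and password become the username/login and passwd values used in the primitives). A rough sketch with the az CLI (the file name, role name and placeholders below are illustrative assumptions, not the exact values used in this cluster):

$ cat fence-agent-role.json
{
  "Name": "Linux Fence Agent Role",
  "IsCustom": true,
  "Description": "Allows to power-off and start virtual machines",
  "Actions": [
    "Microsoft.Compute/*/read",
    "Microsoft.Compute/virtualMachines/powerOff/action",
    "Microsoft.Compute/virtualMachines/start/action"
  ],
  "NotActions": [],
  "AssignableScopes": ["/subscriptions/<subscriptionId>"]
}

$ az role definition create --role-definition fence-agent-role.json

$ az ad sp create-for-rbac --name fence-agent-sp
# note the appId (ApplicationID), password and tenant fields in the output

$ az role assignment create --assignee "<ApplicationID>" \
      --role "Linux Fence Agent Role" \
      --scope "/subscriptions/<subscriptionId>/resourceGroups/<resourceGroup>"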

Once the service principal is in place, each fencing agent is configured in the following format:

crm configure primitive fence-vm03 stonith:fence_azure_arm \
  params \
    action=reboot \
    plug=<machine-to-fence> \
    resourceGroup="<AzureResourceGroup" \
    username="<ApplicationID>" \
    login="<ApplicationID>" \
    passwd="<servicePrincipalPassword>" \
    tenantId="<tenantId>" \
    subscriptionId="<subscriptionId>" \
    pcmk_monitor_retries=4 \
    pcmk_action_limit=3 \
    power_timeout=240 \
    pcmk_reboot_timeout=900 \
  \
  op monitor \
    interval=3600 \
    timeout=120

and the stonith-timeout property is set to 900:

 crm configure property stonith-timeout=900

Then make sure that your fencing agents ARE working:

[rafaeldtinoco@vm01 ~]$ crm status
Stack: corosync
Current DC: vm01 (version 1.1.18-2b07d5c5a9) - partition with quorum
Last updated: Wed Sep  2 13:25:28 2020
Last change: Wed Sep  2 13:24:10 2020 by hacluster via crmd on vm03

3 nodes configured
3 resources configured

Online: [ vm01 vm02 vm03 ]

Full list of resources:

 fence_vm03	(stonith:fence_azure_arm):	Started vm01
 fence_vm01	(stonith:fence_azure_arm):	Started vm02
 fence_vm02	(stonith:fence_azure_arm):	Started vm03

Note: you will see errors in corosync.log if a fence agent was unable to start on one of the nodes. Usually the error is caused by forgetting to install the Python libraries needed by the Azure fence agent.
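
You can also exercise the agent by hand before relying on pacemaker. A hedged example (reusing the same placeholder credentials as the primitive format above) that asks Azure to list the machines the service principal can see:

$ fence_azure_arm --action=list \
      --username="<ApplicationID>" --password="<servicePrincipalPassword>" \
      --tenantId="<tenantId>" --subscriptionId="<subscriptionId>" \
      --resourceGroup="<AzureResourceGroup>"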

Let’s test if the fencing agent is working…

With the private network removed from vm01… (using iptables; one way to do that is sketched below):
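
A minimal sketch, assuming the goal is to block the corosync ring traffic on vm01 and that corosync is using its default port 5405 (corosync.conf above does not set mcastport):

[rafaeldtinoco@vm01 ~]$ sudo iptables -A INPUT  -p udp --dport 5405 -j DROP
[rafaeldtinoco@vm01 ~]$ sudo iptables -A OUTPUT -p udp --dport 5405 -j DROP

Seen from vm02, the cluster soon reports vm01 as unclean: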

[rafaeldtinoco@vm02 ~]$ crm status
Stack: corosync
Current DC: vm02 (version 1.1.18-2b07d5c5a9) - partition with quorum
Last updated: Wed Sep  2 13:30:01 2020
Last change: Wed Sep  2 13:24:10 2020 by hacluster via crmd on vm03

3 nodes configured
3 resources configured

Node vm01: UNCLEAN (offline)
Online: [ vm02 vm03 ]

Full list of resources:

 fence_vm03	(stonith:fence_azure_arm):	Started[ vm01 vm02 ]
 fence_vm01	(stonith:fence_azure_arm):	Started vm03
 fence_vm02	(stonith:fence_azure_arm):	Started vm03

Instance vm01 is still powered on, but node vm01 is not reachable. After powering off the vm01 instance using the Azure web management interface, we have:

[rafaeldtinoco@vm02 ~]$ crm status
Stack: corosync
Current DC: vm02 (version 1.1.18-2b07d5c5a9) - partition with quorum
Last updated: Wed Sep  2 14:18:18 2020
Last change: Wed Sep  2 13:24:10 2020 by hacluster via crmd on vm03

3 nodes configured
3 resources configured

Online: [ vm02 vm03 ]
OFFLINE: [ vm01 ]

Full list of resources:

 fence_vm03	(stonith:fence_azure_arm):	Started vm02
 fence_vm01	(stonith:fence_azure_arm):	Started vm03
 fence_vm02	(stonith:fence_azure_arm):	Started vm03

And after powering vm01 back on, also using the Azure web management interface, we have:

[rafaeldtinoco@vm02 ~]$ crm status
Stack: corosync
Current DC: vm02 (version 1.1.18-2b07d5c5a9) - partition with quorum
Last updated: Wed Sep  2 14:33:37 2020
Last change: Wed Sep  2 13:24:10 2020 by hacluster via crmd on vm03

3 nodes configured
3 resources configured

Online: [ vm01 vm02 vm03 ]

Full list of resources:

 fence_vm03	(stonith:fence_azure_arm):	Started vm02
 fence_vm01	(stonith:fence_azure_arm):	Started vm03
 fence_vm02	(stonith:fence_azure_arm):	Started vm01

TO BE CONTINUED…

I still need to finish configuring the shared disk and the cluster resources. To be finished soon.