Ubuntu HA - MS SQL Server on Azure

rbasak · April 11, 2022, 7:46pm

Summary

This tutorial demonstrates how to set up MS SQL Server on Ubuntu instances running on Azure in High Availability mode.

Warning: MS SQL Server is not yet generally available for 24.04. These instructions have been updated in advance to include 24.04, but will not work until this the case. You can instead follow the steps for older releases for now.

Prerequisites

It is assumed that:

The user is running bash locally (ie. a Linux shell).
The Azure CLI client is available locally and is already authenticated. If you need help with this, see Appendix A for a quick start guide on setting this up.
Three VMs running Ubuntu (18.04, 20.04, 22.04 or 24.04) have been instantiated in Azure with internal IPs 10.0.0.11, 10.0.0.12 and 10.0.0.13, with ssh and scp to them as both the root and ubuntu users working and accessible via the names vm1, vm2 and vm3 respectively. If you need help with this, see Appendix B for a quick start guide on setting this up.

Step 1: Install HA software

ssh into vm1 with the command ssh vm1.

On Ubuntu 24.04 “Noble Numbat”:

Type the following commands to install the required software:

sudo apt-get update
sudo apt-get install -y pacemaker fence-agents-azure-arm

On Ubuntu 20.04 “Focal Fossa” or Ubuntu 22.04 “Jammy Jellyfish”:

Type the following commands to install the required software:

sudo apt-get update
sudo apt-get install -y corosync pacemaker pacemaker-cli-utils crmsh resource-agents fence-agents python3-azure

On Ubuntu 18.04 “Bionic Beaver”:

On 18.04, use of a PPA is required. Enable it and then install packages using the following commands:

sudo add-apt-repository -y ppa:canonical-server/azure-sql-ha
sudo apt-get install -y corosync pacemaker pacemaker-cli-utils crmsh resource-agents fence-agents python3-azure python-adal

Type exit to leave ssh, and repeat the above steps for the other two VMs.

Step 2a: Configure the cluster (for 24.04 or later only)

Delete the default cluster from each VM:

ssh vm1 sudo pcs cluster destroy
ssh vm2 sudo pcs cluster destroy
ssh vm3 sudo pcs cluster destroy

Set a cluster password on each VM:

echo hacluster:demopassword|ssh vm1 sudo chpasswd
echo hacluster:demopassword|ssh vm2 sudo chpasswd
echo hacluster:demopassword|ssh vm3 sudo chpasswd

Refer to Appendix C on the creation of an Azure service principal and on granting a suitable role assignment to that service principal so that it has permission to fence the cluster nodes. Once done, you should have expansions for the six variables in the command below that are required to be supplied to the Azure fence agent.

ssh into vm1 with the command ssh vm1. Then type the following commands, expanding the six variables as needed. If you have set shell variables according to Appendix C, then typing these commands as-is will expand them automatically.

sudo pcs host auth -u hacluster -p demopassword vm{1,2,3}
sudo pcs cluster setup demo vm{1,2,3} transport udpu --enable --start --wait=30
sudo pcs property set stonith-enabled=on
sudo pcs property set stonith-action=reboot
sudo pcs property set no-quorum-policy=stop
sudo pcs property set stonith-timeout=900
sudo pcs stonith create fence-vm fence_azure_arm \
    	resourceGroup="$resource_group" \
    	username="$secret_id" \
    	login="$application_id" \
    	passwd="$password" \
    	tenantId="$tenant_id" \
    	subscriptionId="$subscription_id" \
    	pcmk_reboot_timeout=900 \
    	power_timeout=60 \
	op monitor \
    	interval=3600 \
    	timeout=120

Type exit to leave ssh.

Verify the status of the cluster:

ssh vm1 sudo pcs status

You should see output like the following:

Cluster name: demo
Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: vm3 (version 2.1.6-6fdc9deea29) - partition with quorum
  * Last updated: Tue Apr 30 18:53:13 2024 on vm1
  * Last change:  Tue Apr 30 18:51:43 2024 by root via cibadmin on vm1
  * 3 nodes configured
  * 1 resource instance configured

Node List:
  * Online: [ vm1 vm2 vm3 ]

Full List of Resources:
  * fence-vm    (stonith:fence_azure_arm):       Started vm1

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

Step 2b: Configure the cluster (for 22.04 and earlier only)

Corosync

Create a temporary local file called corosync.conf using nano. Run nano corosync.conf and then enter the following contents. When done, press Ctrl-X, agree to save by pressing y, then press Return to accept the default corosync.conf filename as specified earlier. Your corosync.conf file should now look like this, which you can verify with the command cat corosync.conf:

totem {
    version: 2
    secauth: off
    cluster_name: demo
    transport: udpu
}

nodelist {
    node {
        ring0_addr: 10.0.0.11
        name: vm1
        nodeid: 1
    }
    node {
        ring0_addr: 10.0.0.12
        name: vm2
        nodeid: 2
    }
    node {
        ring0_addr: 10.0.0.13
        name: vm3
        nodeid: 3
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 0
}

qb {
    ipc_type: native
}

logging {
    fileline: on
    to_stderr: on
    to_logfile: yes
    logfile: /var/log/corosync/corosync.log
    to_syslog: no
    debug: off
}

Copy the local file corosync.conf to the three VMs:

scp corosync.conf root@vm1:/etc/
scp corosync.conf root@vm2:/etc/
scp corosync.conf root@vm3:/etc/

Generate /etc/corosync/authkey on the first VM:

ssh vm1 sudo corosync-keygen

Copy the generated authkey file temporarily to the local system, and then from there to the other two VMs. We use -p to preserve the permissions that make the file available only to the root user:

scp -p root@vm1:/etc/corosync/authkey ./
scp -p authkey root@vm2:/etc/corosync/
scp -p authkey root@vm3:/etc/corosync/

The local copies of the files corosync.conf and authkey are no longer required and you can delete them:

rm -f corosync.conf authkey

Restart corosync on all three VMs:

ssh vm1 sudo systemctl restart corosync
ssh vm2 sudo systemctl restart corosync
ssh vm3 sudo systemctl restart corosync

Verify the status of the cluster:

ssh vm1 sudo crm status

You should see output like the following:

Cluster Summary:
  * Stack: corosync
  * Current DC: vm1 (version 2.0.3-4b1f869f0f) - partition with quorum
  * Last updated: Tue Feb 22 15:38:36 2022
  * Last change:  Tue Feb 22 15:32:45 2022 by root via cibadmin on vm1
  * 3 nodes configured
  * 0 resource instances configured

Node List:
  * Online: [ vm1 vm2 vm3 ]

Full List of Resources:
  * No resources

Pacemaker

ssh into vm1 with the command ssh vm1. Then type the following commands:

sudo crm configure property stonith-enabled=on
sudo crm configure property stonith-action=reboot
sudo crm configure property no-quorum-policy=stop
sudo crm configure property stonith-timeout=900
sudo crm configure property have-watchdog=false

Refer to Appendix C on the creation of an Azure service principal and on granting a suitable role assignment to that service principal so that it has permission to fence the cluster nodes. Once done, you should have expansions for the six variables in the command below that are required to be supplied to the Azure fence agent.

Now run the following command on the first node, expanding the six variables as needed. If you have set shell variables according to Appendix C, then typing this command as-is will expand them automatically.

sudo crm configure primitive fence-vm \
    stonith:fence_azure_arm \
    params \
    action=reboot \
    resourceGroup="$resourceGroup" \
    username="$secretId" \
    login="$applicationId" \
    passwd="$password" \
    tenantId="$tenantId" \
    subscriptionId="$subscriptionId" \
    pcmk_reboot_timeout=900 \
    power_timeout=60 \
    op monitor \
    interval=3600 \
    timeout=120

Type exit to leave ssh.

Verify the status of the cluster:

ssh vm1 sudo crm status

You should see output like the following:

Cluster Summary:
  * Stack: corosync
  * Current DC: vm1 (version 2.0.3-4b1f869f0f) - partition with quorum
  * Last updated: Wed Feb 23 13:04:51 2022
  * Last change:  Wed Feb 23 12:59:43 2022 by root via cibadmin on vm1
  * 3 nodes configured
  * 1 resource instance configured

Node List:
  * Online: [ vm1 vm2 vm3 ]

Full List of Resources:
  * fence-vm    (stonith:fence_azure_arm):       Started vm1

Step 3: Install and Configure SQL Server

Follow instructions in Appendix D to install and configure SQL Server itself and to create a suitable availability group.

On 24.04 and later only

Connect Pacemaker to manage SQL Server’s availability group ag1 by running ssh vm1 and then the following command:

sudo pcs resource create ag1_cluster ocf:mssql:ag \
	ag_name="ag1" meta failure-timeout=60s \
	op start timeout=60s \
	op stop timeout=60s \
	op promote timeout=60s \
	op demote timeout=10s \
	op monitor timeout=60s interval=10s \
	op monitor timeout=60s interval=11s role="Master" \
	op monitor timeout=60s interval=12s role="Slave" \
	op notify timeout=60s
sudo pcs resource promotable ag1_cluster \
	promoted-max=1 \
	promoted-node-max=1 \
	clone-max=3 \
	clone-node-max=1 \
	notify=true

Type exit to leave ssh.

On 22.04 or earlier only

Connect Pacemaker to manage SQL Server’s availability group ag1 by running ssh vm1 and then the following command:

sudo crm configure <<EOT
primitive ag1_cluster ocf:mssql:ag \
params ag_name="ag1" meta failure-timeout=60s \
op start timeout=60s \
op stop timeout=60s \
op promote timeout=60s \
op demote timeout=10s \
op monitor timeout=60s interval=10s \
op monitor timeout=60s interval=11s role="Master" \
op monitor timeout=60s interval=12s role="Slave" \
op notify timeout=60s
ms ms-ag1_cluster ag1_cluster \
meta master-max="1" master-node-max="1" clone-max="3" \
clone-node-max="1" notify="true"
commit
EOT

Type exit to leave ssh.

Step 4: Verification

Verify the status of the cluster. If using 24.04 or later:

ssh vm1 sudo pcs status

or if using 22.04 or earlier:

ssh vm1 sudo crm status

You should see output like the following:

Cluster Summary:
  * Stack: corosync
  * Current DC: vm1 (version 2.0.3-4b1f869f0f) - partition with quorum
  * Last updated: Wed Feb 23 13:22:10 2022
  * Last change:  Wed Feb 23 13:19:00 2022 by root via cibadmin on vm1
  * 3 nodes configured
  * 4 resource instances configured

Node List:
  * Online: [ vm1 vm2 vm3 ]

Full List of Resources:
  * fence-vm    (stonith:fence_azure_arm):       Started vm1
  * Clone Set: ms-ag1_cluster [ag1_cluster] (promotable):
    * Masters: [ vm1 ]
    * Slaves: [ vm2 vm3 ]

Type ssh vm1 and then /opt/mssql-tools/bin/sqlcmd -S localhost -U SA to enter the SQL monitor. Then type the following SQL command:

SELECT synchronization_state_desc FROM sys.dm_hadr_database_replica_states;
GO

You should see the following output:

synchronization_state_desc                                  
------------------------------------------------------------
SYNCHRONIZED                                                
SYNCHRONIZED                                                
SYNCHRONIZED                                                

(3 rows affected)

Type exit to leave the SQL monitor and exit again to leave ssh.

Conclusion

You now have SQL Server set up with an availability group that is fully synchronized, with status and fencing managed by Corosync and Pacemaker.

Appendix A: Prepare the Azure CLI

Follow Microsoft’s instructions to install the Azure CLI. Once done, the az command should be available to use.

The Azure CLI needs to be signed in to access Azure resources. The simplest way is to run az login. For more details, see Microsoft’s instructions here.

Appendix B: Create and configure virtual machines

This tutorial requires three VMs named vm1, vm2 and vm3 available to each other internally on the IP addresses 10.0.0.11, 10.0.0.12 and 10.0.0.13, with ssh and scp to them available from the outside using both the ubuntu and root users. A quick way to achieve this is with the following steps. See Microsoft’s tutorial on creating a virtual network and deploying virtual machines for more details.

Create the virtual machines

Run the following commands:

resourceGroup=tutorial-rg
az group create --name $resourceGroup --location eastus
az network vnet create \
      --name myVNet \
      --resource-group "$resourceGroup" \
      --address-prefixes 10.0.0.0/16 \
      --subnet-name myBackendSubnet \
      --subnet-prefixes 10.0.0.0/24
az vm create \
    --resource-group “$resourceGroup” \
    --name vm1 \
    --image UbuntuLTS \
    --vnet-name myVNet \
    --subnet myBackendSubnet \
    --generate-ssh-keys \
    --private-ip-address 10.0.0.11
az vm create \
    --resource-group “$resourceGroup” \
    --name vm2 \
    --image UbuntuLTS \
    --vnet-name myVNet \
    --subnet myBackendSubnet \
    --generate-ssh-keys \
    --private-ip-address 10.0.0.12
az vm create \
    --resource-group “$resourceGroup” \
    --name vm3 \
    --image UbuntuLTS \
    --vnet-name myVNet \
    --subnet myBackendSubnet \
    --generate-ssh-keys \
    --private-ip-address 10.0.0.13

As you create the VMs, make a note of each publicIpAddress given in the output. To reach a particular VM over ssh, you can then use ssh ubuntu@publicIPAddress. To leave an ssh session, use the command exit.

Configure virtual machines

Permitting direct root login is the easiest way to transfer configuration files for the purposes of this tutorial. In production, configuration management tools should be used instead. Enable direct root login by ssh-ing into each VM in turn and running the following command:

sudo cp ~ubuntu/.ssh/authorized_keys ~root/.ssh/authorized_keys

Now you can also use ssh root@publicIPAddress to reach a particular VM.

Configure ssh for easy access

Run nano ~/.ssh/config and then ensure the lines exist in that file, replacing each instance of publicIPAddress with the corresponding public IP address of the VM from above. When done, press Ctrl-X, agree to save by pressing y, then press Return to accept the default config filename as specified earlier.

Host vm1
    Hostname publicIPAddress
    Username ubuntu
Host vm2
    Hostname publicIPAddress
    Username ubuntu
Host vm3
    Hostname publicIPAddress
    Username ubuntu

Now you can use ssh vm1 to reach vm1 as the ubuntu user, ssh root@vm1 to reach vm1 as the root user, and equivalently for the other two VMs. scp to and from each VM should also work equivalently.

ssh and scp to the VMs are now configured as required for the rest of this tutorial. The VMs are running Ubuntu 18.04 “Bionic Beaver” as is the default when requesting the UbuntuLTS image on Azure.

Appendix C: Managing authentication for the fence agent

Create the custom fencing role

Create a temporary local file called fence-agent-role.json using nano. Run nano fence-agent-role.json and then enter the following contents, but change subscriptionID to your actual Azure subscription ID. Your actual subscription ID is available from Azure Portal → Subscriptions. When done, press Ctrl-X, agree to save by pressing y, then press Return to accept the default fence-agent-role.json filename as specified earlier. Your fence-agent-role.json file should now look like this, which you can verify with the command cat fence-agent-role.json:

{
  "Name": "fence-ha-role-fencing",
  "Id": null,
  "IsCustom": true,
  "Description": "fence-ha-role-fencing",
  "Actions": [
    "Microsoft.Compute/*/read",
    "Microsoft.Compute/virtualMachines/powerOff/action",
    "Microsoft.Compute/virtualMachines/start/action"
  ],
  "NotActions": [
  ],
  "AssignableScopes": [
    "/subscriptions/subscriptionId"
  ]
}

Now type the following command to define the custom role inside Azure:

az role definition create --role-definition ./fence-agent-role.json

The local file fence-agent-role.json is no longer required and you can delete it with:

rm -f fence-agent-role.json

Create a service principal

The service principal is the identity that the fence agent will use to authenticate with Azure. Once set up, there are five variables specific to your credentials that you will need to set or expand to configure the fence agent with later, as well as the specific resource group name you are using for this tutorial. These are:

Subscription ID: $subscriptionId
Directory (tenant) ID: $tenantId
Application (client) ID: $applicationId
Secret ID: $secretId
Value: $password

Keep a note of these separately, but also assign them to shell variables so that they are automatically expanded by the shell in later steps in this tutorial. For example, to assign the value abcd to the shell variable $subscriptionId, type the command:

subscriptionId=abcd

Note that $ is used to refer to a variable and to request that it be expanded, but is not used for a variable assignment itself.

The actual subscription ID is available from Azure Portal → Subscriptions. Assign the variable as above but using your actual subscription ID.

Make a note of the values of the other variables and assign them to shell variables as they are generated in the following steps:

From the Azure Portal, go to Azure Active Directory → App registrations → New registration.
Specify a name for your service principal. Any name will do. For example: “TutorialFenceAgent”.
All other options can be left at defaults.
Click Register.
Save the Application (client) ID ($applicationId) and Directory (tenant) ID ($tenantId).
Go to Certificates & secrets → New client secret.
All options can be left at defaults. If you wish, you may change the expiry to match your expectations for this tutorial.
Click Add.
Save the Secret ID ($secretId) and the Value ($password).

Assign service principal to fence role scoped to VMs

To allow the fence agent to use the service principal to fence specific VMs, use the following command to assign the our custom fence agent role against our service principal and scoped to a specific VM:

az role assignment create \
    --assignee "$applicationId" \
    --role fence-ha-role-fencing
    --scope "/subscriptions/$subscriptionId/$resourceGroup/providers/Microsoft.Compute/virtualMachines/vm1"

Repeat the command twice more but adjusting the specified scope to be against the VMs vm2 and vm3.

Appendix D: Install and configure SQL Server

Install SQL Server

Follow Microsoft’s instructions to install SQL Server. Ensure that you install both their mssql-server and mssql-server-ha packages.

Configure SQL Server

To enter the SQL monitor on a particular VM, first ssh to the VM (eg. ssh vm1) and then run the following command. You will need the database password you configured in the “Install SQL Server” step previously.

/opt/mssql-tools/bin/sqlcmd -S localhost -U SA

To leave the SQL monitor, type exit. Leave the ssh session to return to your host machine as needed with a further exit.

Create a certificate with the following SQL commands using the SQL monitor on vm1:

CREATE MASTER KEY ENCRYPTION BY PASSWORD = '**<Master_Key_Password>**';
CREATE CERTIFICATE dbm_certificate WITH SUBJECT = 'dbm';
BACKUP CERTIFICATE dbm_certificate
   TO FILE = '/var/opt/mssql/data/dbm_certificate.cer'
   WITH PRIVATE KEY (
           FILE = '/var/opt/mssql/data/dbm_certificate.pvk',
           ENCRYPTION BY PASSWORD = '**<Private_Key_Password>**'
       );

Copy the certificate to the second VM from your host machine:

scp -3 root@vm1:/var/opt/mssql/data/dbm_certificate.{cer,pvk} root@vm2:/var/opt/mssql/data/

Set ownership on the second VM from your host machine:

ssh root@vm2 chown mssql.mssql /var/opt/mssql/data/dbm_certificate.{cer,pvk}

Repeat the above copy and ownership steps but for vm3 instead of vm2.

Run the following commands on the SQL monitor of each of the VMs in turn:

CREATE ENDPOINT [Hadr_endpoint]
    AS TCP (LISTENER_PORT = 5022)
    FOR DATABASE_MIRRORING (
    ROLE = ALL,
    AUTHENTICATION = CERTIFICATE dbm_certificate,
ENCRYPTION = REQUIRED ALGORITHM AES
);
ALTER ENDPOINT [Hadr_endpoint] STATE = STARTED;

USE [master];
CREATE LOGIN [pacemakerLogin] with PASSWORD= N'StrongPassword123!';
ALTER SERVER ROLE [sysadmin] ADD MEMBER [pacemakerLogin];

On your host machine, create a temporary local file called passwd using nano. Run nano passwd and then enter the following contents. When done, press Ctrl-X, agree to save by pressing y, then press Return to accept the default passwd filename as specified earlier. Your passwd file should now look like this, which you can verify with the command cat passwd:

pacemakerLogin
StrongPassword123!

On your host machine, run the following two commands to copy the file to vm1 and set its permissions so that only root can read it:

scp passwd root@vm1:/var/opt/mssql/secrets/
ssh root@vm1 chmod 600 /var/opt/mssql/secrets/passwd

Repeat the above two commands but with vm2 instead of vm1, and then again with vm3 instead of vm1.

Run the following commands on the SQL monitor of vm1 only:

CREATE DATABASE [db1];
ALTER DATABASE [db1] SET RECOVERY FULL;
BACKUP DATABASE [db1] TO DISK = N'/var/opt/mssql/data/db1.bak' with compression;
CREATE AVAILABILITY GROUP [ag1]
     WITH (DB_FAILOVER = ON, CLUSTER_TYPE = EXTERNAL)
     FOR DATABASE [db1] REPLICA ON
         N'vm1'
          WITH (
             ENDPOINT_URL = N'tcp://vm1:5022',
             AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
             FAILOVER_MODE = EXTERNAL,
             SEEDING_MODE = AUTOMATIC,
             SECONDARY_ROLE (ALLOW_CONNECTIONS = ALL, READ_ONLY_ROUTING_URL = 'TCP://10.0.0.11:1433'),
             PRIMARY_ROLE (ALLOW_CONNECTIONS = READ_WRITE, READ_ONLY_ROUTING_LIST = ('vm2', 'vm3'), READ_WRITE_ROUTING_URL = 'TCP://10.0.0.11:1433'),
             SESSION_TIMEOUT = 10
             ),
         N'vm2'
          WITH (
             ENDPOINT_URL = N'tcp://vm2:5022',
             AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
             FAILOVER_MODE = EXTERNAL,
             SEEDING_MODE = AUTOMATIC,
             SECONDARY_ROLE (ALLOW_CONNECTIONS = ALL, READ_ONLY_ROUTING_URL = 'TCP://10.0.0.12:1433'),
             PRIMARY_ROLE (ALLOW_CONNECTIONS = READ_WRITE, READ_ONLY_ROUTING_LIST = ('vm1', 'vm3'), READ_WRITE_ROUTING_URL = 'TCP://10.0.0.12:1433'),
             SESSION_TIMEOUT = 10
             ),
         N'vm3'
         WITH(
             ENDPOINT_URL = N'tcp://vm3:5022',
             AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
             FAILOVER_MODE = EXTERNAL,
             SEEDING_MODE = AUTOMATIC,
             SECONDARY_ROLE (ALLOW_CONNECTIONS = ALL, READ_ONLY_ROUTING_URL = 'TCP://10.0.0.13:1433'),
             PRIMARY_ROLE (ALLOW_CONNECTIONS = READ_WRITE, READ_ONLY_ROUTING_LIST = ('vm1', 'vm2'), READ_WRITE_ROUTING_URL = 'TCP://10.0.0.13:1433'),
             SESSION_TIMEOUT = 10
            );
ALTER AVAILABILITY GROUP [ag1] GRANT CREATE ANY DATABASE;

Run the following commands on the SQL monitors of vm2 and vm3 only:

ALTER AVAILABILITY GROUP [ag1] JOIN WITH (CLUSTER_TYPE = EXTERNAL);
ALTER AVAILABILITY GROUP [ag1] GRANT CREATE ANY DATABASE;

Now SQL Server should be set up with an availability group called ag1 on all three VMs.