Tech Tutorial: 361.3 Failover Clusters (weight: 8) #

Exam Objective: #

Candidates should be able to manage and maintain highly available services using failover clusters. This includes the understanding and implementation of high-availability techniques, managing cluster resources, and troubleshooting.

Key Knowledge Areas: #

Understanding of clustering and high availability
Configuration and management of Linux failover clusters
Resource management in a cluster
Troubleshooting cluster issues

Utilities: #

Pacemaker
Corosync
pcs (Pacemaker/Corosync configuration system)
crm_mon

Introduction #

Failover clusters are essential for maintaining high availability in critical applications. They ensure that if one or more nodes in a cluster fail, the services they provide continue running on other nodes with minimal or no downtime. In this tutorial, we’ll explore how to set up and manage a failover cluster using popular Linux tools: Pacemaker, Corosync, and pcs.

Step-by-Step Guide #

Step 1: Environment Setup #

Before we begin, ensure that you have at least two Linux servers (nodes) available for clustering. For this guide, we’ll use CentOS 8, but the steps should be similar for other Linux distributions. The examples assume the nodes can reach each other by the hostnames node1 and node2.
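Cluster tools address nodes by name, so every node should be able to resolve every other node. The following is a minimal sketch that checks a hosts-format file for the node names. It assumes names are resolved through /etc/hosts rather than DNS, and missing_from_hosts is a hypothetical helper name, not part of any cluster package.

```shell
# Print each of the given names that has no entry in a hosts-format file.
# Usage: missing_from_hosts FILE NAME...
missing_from_hosts() {
    local hosts_file=$1 name
    shift
    for name in "$@"; do
        # Strip comments, then look for the name in any column after the address.
        awk -v n="$name" '
            { sub(/#.*/, "") }
            { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit !found }
        ' "$hosts_file" || printf '%s\n' "$name"
    done
}

# Example (node1 and node2 are the placeholder names used in this guide):
#   missing_from_hosts /etc/hosts node1 node2
```

An empty result means every name was found. If DNS is used instead, a lookup tool such as getent hosts would be the more appropriate check.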

Step 2: Install Required Packages #

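On both nodes, install the necessary clustering software. On CentOS 8 these packages come from the High Availability repository, which may need to be enabled first:

sudo dnf install -y pacemaker corosync pcs

Then start the pcs daemon on both nodes and enable it at boot. pcs relies on this daemon to talk to the other nodes:

sudo systemctl enable --now pcsd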
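After installation it can help to confirm that the command-line tools used later in this guide are on the PATH. The sketch below is one way to do that. missing_tools is a hypothetical helper name, and the tool list in the example is an assumption about which binaries you want to check.

```shell
# Print each named command that cannot be found on the PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example: run on every node; no output means all tools were found.
#   missing_tools pcs crm_mon corosync-cfgtool
```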

Step 3: Configure Corosync #

Corosync is the messaging layer for the cluster, handling communication between nodes.

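Set a password for the hacluster user on every node (the account is created by the pcs package), then authenticate the nodes to each other from one of them. Run this and the later pcs commands as root:

sudo passwd hacluster
pcs host auth node1 node2 -u hacluster

pcs prompts for the hacluster password; it can also be supplied with -p for non-interactive use. On pcs 0.9 (CentOS 7) the equivalent command is pcs cluster auth.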

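Create the cluster, which generates the Corosync configuration and distributes it to the nodes, then start the cluster services everywhere:

pcs cluster setup my_cluster node1 node2
pcs cluster start --all

On pcs 0.9 (CentOS 7) the cluster name is passed with --name instead of as the first argument.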
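The bootstrap sequence in this step can be wrapped in a small script so the order of commands is explicit and can be previewed before anything is changed. This is a sketch, assuming pcs 0.10 or later (the version shipped with CentOS 8), where authentication is pcs host auth and the cluster name is a positional argument. The names run, bootstrap_cluster and DRY_RUN are hypothetical, and the final pcs cluster enable step (start cluster services at boot) is an optional addition, not part of the steps above.

```shell
# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Usage: bootstrap_cluster CLUSTER_NAME NODE...
# Stops at the first command that fails.
bootstrap_cluster() {
    local cluster=$1
    shift
    run pcs host auth "$@" -u hacluster &&      # prompts for the hacluster password
        run pcs cluster setup "$cluster" "$@" &&
        run pcs cluster start --all &&
        run pcs cluster enable --all            # optional: start services at boot
}

# Preview only:
#   DRY_RUN=1 bootstrap_cluster my_cluster node1 node2
```

Running with DRY_RUN=1 first is a cheap way to check node names and argument order before touching a real cluster.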

Step 4: Configure Pacemaker #

Pacemaker manages the resources and services on the cluster.

Check the cluster status:

pcs status

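Add a resource to the cluster. A simple example is the Dummy resource agent, which manages no real service and is useful for testing. Note that Pacemaker will not start any resource while STONITH is enabled (the default) and no fencing device is configured, so set up fencing first; on a throwaway test cluster it can be turned off with pcs property set stonith-enabled=false.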

pcs resource create Dummy ocf:pacemaker:Dummy op monitor interval=30s
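Once the resource is created, it is useful to confirm which node it actually started on. The sketch below wraps crm_resource --locate, which is part of Pacemaker. It assumes the output has the form "resource NAME is running on: NODE", which may vary between Pacemaker versions; resource_node is a hypothetical helper name.

```shell
# Print the node(s) a resource is running on; return non-zero if it is not running.
resource_node() {
    crm_resource --resource "$1" --locate 2>/dev/null |
        sed -n 's/^resource .* is running on: *\([^ ]*\).*$/\1/p' |
        grep .    # fails when sed printed nothing, i.e. resource not running
}

# Example:
#   if node=$(resource_node Dummy); then echo "Dummy runs on $node"; fi
```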

Step 5: Manage Cluster Resources #

Resource management involves adding, modifying, and deleting resources as needed.

To modify a resource, for example to change its monitoring interval:

pcs resource update Dummy op monitor interval=20s

To stop a resource and keep it stopped until it is enabled again:

pcs resource disable Dummy

To enable it again:

pcs resource enable Dummy

To delete a resource:

pcs resource delete Dummy
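The commands in this step change individual resources, but the purpose of a failover cluster is that resources move when a node becomes unavailable. One way to rehearse that without powering anything off is to put a node into standby, check where the resources went, and bring the node back. This is a sketch, assuming pcs 0.10 or later, where the commands are pcs node standby and pcs node unstandby. failover_drill and SETTLE_SECONDS are hypothetical names, and the ten-second default wait is an arbitrary choice.

```shell
# Usage: failover_drill NODE
# Moves resources off NODE, shows the result, then returns NODE to service.
failover_drill() {
    local node=$1
    pcs node standby "$node" || return 1   # resources are moved off this node
    sleep "${SETTLE_SECONDS:-10}"          # give Pacemaker time to relocate them
    pcs status resources                   # show where resources run now
    pcs node unstandby "$node"             # make the node eligible again
}

# Example:
#   failover_drill node1
```

Standby is a controlled test only; it does not exercise fencing or an actual node crash.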

Step 6: Troubleshooting Cluster Issues #

Use crm_mon to monitor cluster status and troubleshoot issues:

crm_mon -1

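If a node has a problem, crm_mon lists it as OFFLINE (or UNCLEAN if its state is unknown and it is waiting to be fenced). To investigate further, check the Pacemaker and Corosync logs. On CentOS 8 these are typically /var/log/pacemaker/pacemaker.log and /var/log/cluster/corosync.log; other distributions may use different paths, such as /var/log/corosync/corosync.log. On systemd-based systems the same messages are also available through journalctl -u corosync -u pacemaker.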
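For scripted checks, the OFFLINE line can be pulled out of the crm_mon output. The filter below is a sketch that assumes the node list contains a line of the form "OFFLINE: [ node2 node3 ]", optionally preceded by a bullet as in newer Pacemaker releases. The exact layout is an assumption and may differ between versions; offline_nodes is a hypothetical helper name.

```shell
# Read crm_mon text output on stdin and print one offline node name per line.
offline_nodes() {
    sed -n 's/^.*OFFLINE: \[ \(.*\) ].*$/\1/p' | tr -s ' ' '\n'
}

# Example:
#   crm_mon -1 | offline_nodes
```

No output means no node was reported as OFFLINE in that format.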

Conclusion #

Setting up a failover cluster on Linux with Pacemaker and Corosync involves several steps: installing the packages, configuring Corosync for node communication, adding resources for Pacemaker to manage, and monitoring and troubleshooting the cluster. Properly configured, a failover cluster improves the reliability and availability of services, which is critical for business continuity in production environments.

Remember, cluster management can be complex, especially in larger environments. Always test configurations in a controlled setting before applying them in production. Regularly review cluster logs and status to preemptively address potential issues.