Tech Tutorial: 361.3 Failover Clusters (weight: 8) #
Exam Objective: #
Candidates should be able to manage and maintain highly available services using failover clusters. This includes the understanding and implementation of high-availability techniques, managing cluster resources, and troubleshooting.
Key Knowledge Areas: #
- Understanding of clustering and high availability
- Configuration and management of Linux failover clusters
- Resource management in a cluster
- Troubleshooting cluster issues
Utilities: #
- Pacemaker
- Corosync
- pcs (Pacemaker/Corosync configuration system)
- crm_mon
Introduction #
Failover clusters are essential for maintaining high availability in critical applications. They ensure that if one or more nodes in a cluster fail, the services they provide continue running on other nodes with minimal or no downtime. In this tutorial, we’ll explore how to set up and manage a failover cluster using popular Linux tools: Pacemaker, Corosync, and pcs.
Step-by-Step Guide #
Step 1: Environment Setup #
Before we begin, ensure that you have at least two Linux servers (nodes) available for clustering. For this guide, we’ll use CentOS 8, but the steps should be similar for other Linux distributions.
Step 2: Install Required Packages #
On both nodes, install the necessary clustering software:
sudo dnf install -y pacemaker corosync pcs
Step 3: Configure Corosync #
Corosync is the messaging layer for the cluster, handling communication between nodes.
- Authenticate the pcs user across all nodes:
sudo passwd hacluster
pcs cluster auth node1 node2 -u hacluster -p password --force
- Create and configure the Corosync communication layer:
pcs cluster setup --name my_cluster node1 node2
pcs cluster start --all
Step 4: Configure Pacemaker #
Pacemaker manages the resources and services on the cluster.
- Check the cluster status:
pcs status
- Add a resource to the cluster. For example, a simple Dummy resource:
pcs resource create Dummy ocf:pacemaker:Dummy op monitor interval=30s
Step 5: Manage Cluster Resources #
Resource management involves adding, modifying, and deleting resources as needed.
- To modify a resource, for example, changing its monitoring interval:
pcs resource update Dummy op monitor interval=20s
- To disable a resource temporarily:
pcs resource disable Dummy
- To enable it again:
pcs resource enable Dummy
- To delete a resource:
pcs resource delete Dummy
Step 6: Troubleshooting Cluster Issues #
Use crm_mon
to monitor cluster status and troubleshoot issues:
crm_mon -1
If there’s a problem with a node, you might see it marked as OFFLINE
. To further investigate, check the Corosync and Pacemaker logs typically found in /var/log/corosync/corosync.log
and /var/log/pacemaker/pacemaker.log
.
Conclusion #
Setting up a failover cluster on Linux with Pacemaker and Corosync involves several steps from installing necessary packages, configuring Corosync for node communication, setting up Pacemaker for resource management, to managing and troubleshooting the cluster. Properly configured, a failover cluster enhances the reliability and availability of services, critical for business continuity in production environments.
Remember, cluster management can be complex, especially in larger environments. Always test configurations in a controlled setting before applying them in production. Regularly review cluster logs and status to preemptively address potential issues.