Block Storage High Availability
Platform9's High Availability (HA) architecture provides automatic failover for Private Cloud Director (PCD) block storage services. When a block storage host fails, its volumes are migrated automatically to healthy hosts, which keeps storage resources available and minimizes service disruption.
In a typical Private Cloud Director deployment, block storage services run on multiple hosts, each managing volumes stored on local or attached storage backends. Each volume is associated with a specific host and backend configuration, creating a dependency between the volume and its hosting infrastructure. When a block storage service host becomes unavailable due to hardware failure, network issues, or maintenance, the volumes hosted on that system become inaccessible, potentially disrupting applications and services that depend on them.
The HA system addresses this challenge by automatically migrating volumes from failed hosts to healthy hosts with compatible storage backends. It still depends on proper configuration, monitoring, and operational practices to perform reliably.
Overview
The block storage HA feature handles failures of the block storage service by orchestrating the following activities.
Automatic Detection: Monitoring block storage service health across the cluster.
Volume Migration: Automatically migrating volumes from failed hosts to healthy hosts with compatible backends.
Cross-Availability Zone Support: Handling volumes attached across Availability Zones and Glance backend volumes.
Event Processing: Managing host up/down events through a robust event processing system.
Verification: Ensuring successful migration completion before marking operations as complete.
Prerequisites
The following prerequisites must be met before enabling high availability for the block storage service.
At least 2 hosts must be assigned the persistent storage role and configured with the same volume backends, to provide redundancy and failover capability.
All hosts that will participate in block storage HA must be located in the same region. Cross-region volume migration is not currently supported.
Networking and firewall rules must allow storage-related traffic between all block storage service hosts and the HA Manager for proper communication and volume migration operations.
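The first two prerequisites can be checked mechanically before HA is enabled. The sketch below is illustrative only: the `StorageHost` record and its fields are hypothetical and not part of any PCD API, and it interprets "the same volume backends" as "every backend is served by at least 2 hosts", which is an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageHost:
    """Hypothetical inventory record; not a PCD API object."""
    name: str
    region: str
    backends: frozenset


def check_prerequisites(hosts):
    """Return a list of human-readable problems; an empty list means the checks pass."""
    problems = []

    # All participating hosts must be located in the same region.
    regions = sorted({h.region for h in hosts})
    if len(regions) > 1:
        problems.append(f"hosts span multiple regions: {', '.join(regions)}")

    # At least 2 hosts must hold the persistent storage role.
    if len(hosts) < 2:
        problems.append("fewer than 2 persistent storage hosts")

    # Assumed interpretation: each backend needs at least 2 hosts serving it.
    hosts_by_backend = defaultdict(list)
    for h in hosts:
        for backend in h.backends:
            hosts_by_backend[backend].append(h.name)
    for backend, names in sorted(hosts_by_backend.items()):
        if len(names) < 2:
            problems.append(f"backend '{backend}' has only one host: {names[0]}")
    return problems
```

Network and firewall reachability is not covered here, since it cannot be derived from an inventory alone.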
Volume Types Supported
The following types of volumes are supported by the HA service.
Standard Volumes: Regular block storage volumes with various backend types.
Cross-Availability Zone Volumes: Volumes attached to instances in different availability zones.
Glance Backend Volumes: Volumes used by Glance for image storage.
Multi-Project Volumes: Volumes across different OpenStack projects/tenants.
Backend Compatibility
Migration is supported only between hosts whose storage backends meet the following conditions.
Same Backend Type: Hosts must use the same storage backend (e.g., LVM, Ceph, NetApp).
Compatible Pools: Target hosts must have compatible storage pools.
Matching Configuration: Backend configurations must be compatible for seamless migration.
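To make the matching rule concrete: upstream Cinder identifies a volume's location with a string of the form `host@backend#pool`. Assuming PCD follows that convention, candidate targets for a failed host can be found by comparing backend names. This is a minimal sketch, not the HA service's actual logic; it matches on backend name only and does not check pool or configuration compatibility.

```python
from typing import NamedTuple, Optional


class VolumeHost(NamedTuple):
    host: str
    backend: str
    pool: Optional[str]


def parse_volume_host(value):
    """Split 'host@backend#pool' into parts; the '#pool' suffix is optional."""
    location, _, pool = value.partition("#")
    host, sep, backend = location.partition("@")
    if not sep or not host or not backend:
        raise ValueError(f"expected 'host@backend[#pool]', got {value!r}")
    return VolumeHost(host, backend, pool or None)


def compatible_targets(failed, services):
    """Return services on a different host that share the failed backend name."""
    source = parse_volume_host(failed)
    targets = []
    for candidate in services:
        parsed = parse_volume_host(candidate)
        if parsed.host != source.host and parsed.backend == source.backend:
            targets.append(candidate)
    return targets
```

For example, `compatible_targets("node1@ceph#rbd", ["node1@ceph", "node2@ceph", "node3@lvm"])` returns only `node2@ceph`; the host names here are made up.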
How It Works
1. Host Monitoring
The HA system continuously monitors block storage service hosts through the following mechanisms.
Service Status Checks: Regular polling of block storage service status via PCD APIs.
Cluster Integration: Integration with the underlying HA cluster for host health detection.
Event Generation: Creation of host up/down events when status changes are detected.
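The event generation step amounts to comparing successive status snapshots. The following sketch assumes each poll yields a mapping of host name to an up/down flag; that data shape and the function name are hypothetical, while the `HOST_UP` and `HOST_DOWN` event names come from this document.

```python
def diff_service_status(previous, current):
    """Compare two {host: is_up} snapshots and return (event_type, host) tuples."""
    events = []
    for host in sorted(current):
        was_up = previous.get(host)
        is_up = current[host]
        # Skip hosts seen for the first time and hosts whose status did not change.
        if was_up is None or was_up == is_up:
            continue
        events.append(("HOST_UP" if is_up else "HOST_DOWN", host))
    return events
```

Emitting events only on transitions, rather than on every poll, keeps a host that stays down from producing duplicate `HOST_DOWN` events. Hosts that disappear from the snapshot entirely are ignored in this sketch; a real monitor would need a policy for them.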
2. Failure Detection and Response
When a block storage service host fails, Private Cloud Director automatically performs the steps below.
Event Creation: A HOST_DOWN event is created in the events processing table.
Volume Discovery: The system identifies all volumes hosted on the failed host.
Backend Analysis: Compatible target hosts with the same backend configuration are identified.
Migration Execution: Volumes are migrated to healthy hosts using the cinder-manage command-line tool.
Verification: Migration success is verified by checking volume host attributes.
Event Completion: The event is marked as finished upon successful migration.
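The sequence above can be sketched as a planning function that produces the `cinder-manage` invocations without running them. Upstream Cinder provides `cinder-manage volume update_host --currenthost ... --newhost ...`; whether PCD invokes exactly this subcommand is an assumption, as are the `HostDownEvent` class, the status values, and the `host@backend#pool` location format.

```python
from dataclasses import dataclass, field


@dataclass
class HostDownEvent:
    """Hypothetical stand-in for a row in the events processing table."""
    failed_host: str
    status: str = "pending"
    commands: list = field(default_factory=list)


def plan_failover(event, volume_hosts, healthy_services):
    """Plan one cinder-manage call per distinct backend on the failed host.

    volume_hosts: {volume_id: 'host@backend#pool'}
    healthy_services: list of 'host@backend' strings for services that are up
    """
    # Volume discovery: distinct 'host@backend' locations on the failed host.
    affected = sorted({
        location.split("#", 1)[0]
        for location in volume_hosts.values()
        if location.split("@", 1)[0] == event.failed_host
    })
    for current in affected:
        backend = current.split("@", 1)[1]
        # Backend analysis: first healthy service on another host, same backend name.
        target = next(
            (s for s in healthy_services
             if s.split("@", 1)[1] == backend
             and s.split("@", 1)[0] != event.failed_host),
            None,
        )
        if target is None:
            event.status = "failed"
            raise LookupError(f"No other hosts found with backend {backend}")
        # Migration execution: assumed upstream subcommand, built but not executed.
        event.commands.append([
            "cinder-manage", "volume", "update_host",
            "--currenthost", current, "--newhost", target,
        ])
    event.status = "planned"
    return event.commands
```

Returning argument lists instead of executing them keeps the sketch safe to run anywhere and makes the plan easy to review. Verification and event completion would follow once the commands have actually run.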
3. Recovery Handling
When a previously failed host comes back online, Private Cloud Director performs the steps below.
Host Up Event: A HOST_UP event is processed.
Service Verification: The system verifies that block storage services are actually running.
Event Closure: The corresponding recovery event is marked as complete.
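A minimal sketch of the recovery steps follows. The event dictionary schema and the `service_is_running` callback are hypothetical; the point is only that the event is closed after the service check passes, not when the host first reports up. The steps above do not mention moving volumes back to the recovered host, so the sketch does not do so either.

```python
def process_host_up(event, service_is_running):
    """Close a HOST_UP event only after the service check passes.

    event: dict with 'type', 'host', and 'status' keys (hypothetical schema)
    service_is_running: callable(host) -> bool, supplied by the caller
    """
    if event["type"] != "HOST_UP":
        raise ValueError(f"unexpected event type: {event['type']}")
    if service_is_running(event["host"]):
        event["status"] = "finished"
    else:
        # Leave the event open so a later pass can retry the check.
        event["status"] = "pending"
    return event["status"]
```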
Supported Failure Scenarios
The block storage HA service handles the following types of failures.
Host Hardware Failure: Complete host unavailability.
Service Crashes: Block storage service process failures.
Network Partitions: Temporary network connectivity issues.
Planned Maintenance: Graceful host shutdowns for maintenance.
Best Practices
The following best practices are recommended to ensure optimal performance and reliability of the block storage HA service.
Multi-Host Backends: Deploy at least 2 hosts per backend type for redundancy.
Resource Planning: Ensure target hosts have sufficient capacity for migrated volumes.
Network Design: Use redundant network paths for storage traffic.
Monitoring: Implement comprehensive monitoring of HA events and system health.
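For resource planning, a simple worked check is whether any surviving host could absorb every volume of a failed peer. The sketch below assumes per-host free capacity is meaningful, as with host-local backends; for shared backends the pool capacity is what matters. The function name and the 10% reserve are illustrative choices, not PCD defaults.

```python
def can_absorb(free_gb_by_host, volume_sizes_gb, reserve_fraction=0.1):
    """Return hosts whose free capacity, minus a reserve, fits all listed volumes."""
    required = sum(volume_sizes_gb)
    return sorted(
        host for host, free in free_gb_by_host.items()
        if free * (1 - reserve_fraction) >= required
    )
```

With volumes of 100, 200, and 300 GB (600 GB in total), a host with 500 GB free does not qualify, while a host with 1000 GB free does, since 900 GB remains usable after the reserve.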
Limitations and Troubleshooting
Current Limitations
The current known limitations of the storage HA service are listed below.
Single Backend Migration: Volumes can only migrate between hosts with identical backends.
Manual Intervention: Some failure scenarios may require manual intervention.
Migration Downtime: Brief service interruption may occur during migration.
Cross-Region: Migration across OpenStack regions is not supported.
Common Issues and Solutions
Issue: No Compatible Target Hosts
Symptoms: Migration fails with "No other hosts found with backend".
Solution: Ensure multiple hosts are configured with the same backend type.
Issue: Volume Migration Verification Fails
Symptoms: Migration appears successful but verification fails.
Solution: Check volume host attributes and ensure proper API connectivity.
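When checking host attributes by hand, it helps to list the volumes that still point at the failed host. Upstream Cinder exposes a volume's location to admin users as the `os-vol-host-attr:host` attribute; the sketch assumes volume records are available as dictionaries carrying that key, and the function name is hypothetical.

```python
HOST_ATTR = "os-vol-host-attr:host"


def stale_volumes(volumes, failed_host):
    """Return IDs of volumes whose host attribute still names the failed host."""
    stale = []
    for volume in volumes:
        # The attribute may be absent, for example with non-admin credentials.
        location = volume.get(HOST_ATTR) or ""
        if location.split("@", 1)[0] == failed_host:
            stale.append(volume["id"])
    return stale
```

An empty result suggests the host attributes were updated. A record with no host attribute at all is not reported as stale here, so confirm that the credentials in use can actually see the attribute before trusting an empty list.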
For additional support or questions about block storage HA configuration and operation, please consult the Platform9 support documentation or contact technical support.
