Block Storage High Availability
Platform9's High Availability (HA) architecture provides automatic failover capabilities for Private Cloud Director block storage services. When you enable HA, the system automatically migrates volumes from failed hosts to healthy alternatives, ensuring continuous availability of your storage resources and minimizing service disruption.
In a typical Private Cloud Director deployment, block storage services run on multiple hosts, each managing volumes stored on local or attached storage backends. Each volume associates with a specific host and backend configuration. When a block storage service host becomes unavailable due to hardware failure, network issues, or maintenance, the volumes on that system become inaccessible, potentially disrupting your applications and services.
The HA system addresses this challenge by automatically migrating volumes from failed hosts to healthy hosts with compatible storage backends. You'll maintain continuous availability when you follow proper configuration, monitoring, and operational practices.
How Block Storage HA Works
The block storage HA feature handles failures by orchestrating these activities:
Automatic Detection: Monitors block storage service health across your cluster
Volume Migration: Automatically migrates volumes from failed hosts to healthy hosts with compatible backends
Cross-Availability Zones Support: Handles volumes attached across Availability Zones and Glance backend volumes
Event Processing: Manages host up/down events through a robust event processing system
Verification: Ensures successful migration completion before marking operations as complete
Prerequisites for Enabling Block Storage HA
Before you enable high availability for block storage service, ensure you have:
A minimum of 2 hosts assigned with the persistent storage role and the same volume backends to ensure redundancy and failover capability
All participating hosts located in the same region (cross-region volume migration is not currently supported)
Networking and firewall rules configured to allow storage-related traffic between all block storage service hosts and the PCD management plane
Supported Volume Types
You can use the HA service with these volume types:
Standard Volumes: Regular block storage volumes with various backend types
Cross-Availability Zone Volumes: Volumes attached to instances in different availability zones
Image Library Service Backend Volumes: Volumes used by the Image Library Service for image storage
Volumes across Tenants: Volumes across different Private Cloud Director tenants
Backend Compatibility Requirements
The system supports migration between hosts when you configure:
Same Backend Type: Both hosts use the same storage backend (e.g., LVM, Ceph, NetApp)
Compatible Pools: Target hosts have compatible storage pools
Matching Configuration: Backend configurations are compatible for seamless migration
Understanding the HA Monitoring and Failover Process
How Host Monitoring Works
The HA system continuously monitors your block storage service hosts through these mechanisms:
Service Status Checks: Regular polling (default 30s) of block storage service status via PCD APIs
Cluster Integration: Integration with the underlying PCD cluster for host health detection
Event Generation: Creation of host up/down events when status changes are detected
Expected outcome: The system detects host failures within 30 seconds and initiates the failover process.
What Happens During Failure Detection
When a block storage service host fails, Private Cloud Director automatically performs these steps:
Creates Event: A
HOST_DOWNevent is created in the events processing tableDiscovers Volumes: The system identifies all volumes hosted on the failed host
Analyzes Backends: Compatible target hosts with the same backend configuration are identified
Executes Migration: Volumes are migrated to healthy hosts
Verifies Success: Migration success is verified by checking volume host attributes
Completes Event: The event is marked as finished upon successful migration
Expected outcome: Your volumes automatically migrate to healthy hosts, and your services resume with minimal disruption.
What Happens During Host Recovery
When a previously failed host comes back online, Private Cloud Director performs these steps:
Processes Host Up Event: A
HOST_UPevent is processedVerifies Service: The system verifies that block storage services are actually running
Closes Event: The corresponding recovery event is marked as complete
Expected outcome: The system recognizes the recovered host and makes it available for future volume operations.
Failure Scenarios Handled by Block Storage HA
You can rely on the block storage HA service to handle these failure types:
Host Hardware Failure: Complete host unavailability
Service Crashes: Block storage service process failures
Network Partitions: Temporary network connectivity issues
Planned Maintenance: Graceful host shutdowns for maintenance
Best Practices for Block Storage HA
Follow these best practices to ensure optimal performance and reliability:
Deploy Multi-Host Backends: Deploy at least 2 hosts per backend type for redundancy
Plan Resource Capacity: Ensure target hosts have sufficient capacity for migrated volumes
Design Redundant Networks: Use redundant network paths for storage traffic
Monitor Continuously: Implement comprehensive monitoring of HA events and system health
Current Limitations
Be aware of these limitations when using the storage HA service:
Single Backend Migration: You can only migrate volumes between hosts with identical backends
Manual Intervention: Some failure scenarios may require manual intervention
Migration Downtime: Brief service interruption may occur during migration
Cross-Region Restrictions: Migration across PCD regions is not supported
Troubleshooting Common Issues
No Compatible Target Hosts Found
Symptoms: Migration fails with "No other hosts found with backend" error.
Solution: Ensure you have configured multiple hosts with the same backend type. Verify that at least one other host shares the same backend configuration as the failed host.
Volume Migration Verification Fails
Symptoms: Migration appears successful but verification fails.
Solution:
Check volume host attributes to confirm the migration completed
Ensure proper API connectivity between hosts and the management plane
Review system logs for additional error details
Getting Additional Support
For additional support or questions about block storage HA configuration and operation, consult the Platform9 support documentation or contact technical support.
Last updated
Was this helpful?
