Block Storage High Availability

Platform9's High Availability (HA) architecture provides automatic failover capabilities for Private Cloud Director block storage services. When you enable HA, the system automatically migrates volumes from failed hosts to healthy alternatives, ensuring continuous availability of your storage resources and minimizing service disruption.

In a typical Private Cloud Director deployment, block storage services run on multiple hosts, each managing volumes stored on local or attached storage backends. Each volume associates with a specific host and backend configuration. When a block storage service host becomes unavailable due to hardware failure, network issues, or maintenance, the volumes on that system become inaccessible, potentially disrupting your applications and services.

The HA system addresses this challenge by automatically migrating volumes from failed hosts to healthy hosts with compatible storage backends. You'll maintain continuous availability when you follow proper configuration, monitoring, and operational practices.

How Block Storage HA Works

The block storage HA feature handles failures by orchestrating these activities:

  • Automatic Detection: Monitors block storage service health across your cluster

  • Volume Migration: Automatically migrates volumes from failed hosts to healthy hosts with compatible backends

  • Cross-Availability Zones Support: Handles volumes attached across Availability Zones and Glance backend volumes

  • Event Processing: Manages host up/down events through a robust event processing system

  • Verification: Ensures successful migration completion before marking operations as complete

Prerequisites for Enabling Block Storage HA

Before you enable high availability for block storage service, ensure you have:

  • A minimum of 2 hosts assigned with the persistent storage role and the same volume backends to ensure redundancy and failover capability

  • All participating hosts located in the same region (cross-region volume migration is not currently supported)

  • Networking and firewall rules configured to allow storage-related traffic between all block storage service hosts and the PCD management plane

Supported Volume Types

You can use the HA service with these volume types:

  • Standard Volumes: Regular block storage volumes with various backend types

  • Cross-Availability Zone Volumes: Volumes attached to instances in different availability zones

  • Image Library Service Backend Volumes: Volumes used by the Image Library Service for image storage

  • Volumes across Tenants: Volumes across different Private Cloud Director tenants

Backend Compatibility Requirements

The system supports migration between hosts when you configure:

  • Same Backend Type: Both hosts use the same storage backend (e.g., LVM, Ceph, NetApp)

  • Compatible Pools: Target hosts have compatible storage pools

  • Matching Configuration: Backend configurations are compatible for seamless migration

Understanding the HA Monitoring and Failover Process

How Host Monitoring Works

The HA system continuously monitors your block storage service hosts through these mechanisms:

  • Service Status Checks: Regular polling (default 30s) of block storage service status via PCD APIs

  • Cluster Integration: Integration with the underlying PCD cluster for host health detection

  • Event Generation: Creation of host up/down events when status changes are detected

Expected outcome: The system detects host failures within 30 seconds and initiates the failover process.

What Happens During Failure Detection

When a block storage service host fails, Private Cloud Director automatically performs these steps:

  1. Creates Event: A HOST_DOWN event is created in the events processing table

  2. Discovers Volumes: The system identifies all volumes hosted on the failed host

  3. Analyzes Backends: Compatible target hosts with the same backend configuration are identified

  4. Executes Migration: Volumes are migrated to healthy hosts

  5. Verifies Success: Migration success is verified by checking volume host attributes

  6. Completes Event: The event is marked as finished upon successful migration

Expected outcome: Your volumes automatically migrate to healthy hosts, and your services resume with minimal disruption.

What Happens During Host Recovery

When a previously failed host comes back online, Private Cloud Director performs these steps:

  1. Processes Host Up Event: A HOST_UP event is processed

  2. Verifies Service: The system verifies that block storage services are actually running

  3. Closes Event: The corresponding recovery event is marked as complete

Expected outcome: The system recognizes the recovered host and makes it available for future volume operations.

Failure Scenarios Handled by Block Storage HA

You can rely on the block storage HA service to handle these failure types:

  • Host Hardware Failure: Complete host unavailability

  • Service Crashes: Block storage service process failures

  • Network Partitions: Temporary network connectivity issues

  • Planned Maintenance: Graceful host shutdowns for maintenance

Best Practices for Block Storage HA

Follow these best practices to ensure optimal performance and reliability:

  • Deploy Multi-Host Backends: Deploy at least 2 hosts per backend type for redundancy

  • Plan Resource Capacity: Ensure target hosts have sufficient capacity for migrated volumes

  • Design Redundant Networks: Use redundant network paths for storage traffic

  • Monitor Continuously: Implement comprehensive monitoring of HA events and system health

Current Limitations

Be aware of these limitations when using the storage HA service:

  • Single Backend Migration: You can only migrate volumes between hosts with identical backends

  • Manual Intervention: Some failure scenarios may require manual intervention

  • Migration Downtime: Brief service interruption may occur during migration

  • Cross-Region Restrictions: Migration across PCD regions is not supported

Troubleshooting Common Issues

No Compatible Target Hosts Found

Symptoms: Migration fails with "No other hosts found with backend" error.

Solution: Ensure you have configured multiple hosts with the same backend type. Verify that at least one other host shares the same backend configuration as the failed host.

Volume Migration Verification Fails

Symptoms: Migration appears successful but verification fails.

Solution:

  1. Check volume host attributes to confirm the migration completed

  2. Ensure proper API connectivity between hosts and the management plane

  3. Review system logs for additional error details

Getting Additional Support

For additional support or questions about block storage HA configuration and operation, consult the Platform9 support documentation or contact technical support.

Last updated

Was this helpful?