Volume Migration and Retype Troubleshooting

Overview

Volume migration and retype operations move a volume's data from one storage backend to another, or change the volume type while keeping the data in place when the backend supports it. These operations can fail or stall for several reasons, including driver incompatibility, insufficient capacity, NFS share imbalance, or a transient network issue.

This guide explains how to diagnose a stuck or failed migration or retype, understand backend-specific limitations, and recover a volume that is stuck in retyping, maintenance, or error status.

Prerequisites

  • pcdctl configured and authenticated against your region.

  • Access to the block storage host logs at /var/log/pf9/cindervolume-base.log.

  • For Self-Hosted deployments: kubectl access to the management-plane namespace.

Understand Migration and Retype Modes

Before troubleshooting, confirm which mode was used:

| Operation | What Happens |
| --- | --- |
| Retype (same backend, driver-assisted) | The driver changes metadata in place; no data copy occurs. Fast. |
| Retype (cross-backend) | Full data copy from source to destination backend. Slow; duration proportional to volume size. |
| Volume migrate | Explicit move to a different host/pool. Always copies data. |

A cross-backend retype or explicit migration places the volume in retyping or maintenance status during the copy and updates migration_status. A driver-assisted same-backend retype completes almost immediately with no intermediate status.

Detect a Stuck or Failed Migration

Check Volume Status and Migration Status

Key fields to inspect:

  • status — should be available on success; error on failure; retyping or maintenance while in progress.

  • migration_status — values include migrating, completing, error, success, or empty.

A migration that shows migration_status=error has failed. A migration that has been in migrating status for more than an hour (for a small volume) or proportionally longer for large volumes is likely stuck.
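The rule of thumb above can be sketched as a small classifier. This is an illustrative sketch only: the function name, the `base` allowance of one hour, and the `small_volume_gb` cutoff of 100 GB are hypothetical tuning values chosen for the example, not product defaults.

```python
from datetime import timedelta

# Statuses the guide lists as "in progress" for a copy-based retype or migration.
IN_PROGRESS_STATUSES = {"retyping", "maintenance"}
IN_PROGRESS_MIGRATION = {"migrating", "completing"}


def classify_migration(status, migration_status, elapsed, size_gb,
                       base=timedelta(hours=1), small_volume_gb=100):
    """Return 'failed', 'likely-stuck', 'in-progress', or 'idle'.

    base and small_volume_gb are hypothetical knobs: volumes up to
    small_volume_gb get `base`; larger volumes get a proportionally
    longer allowance.
    """
    if status == "error" or migration_status == "error":
        return "failed"
    if status in IN_PROGRESS_STATUSES or migration_status in IN_PROGRESS_MIGRATION:
        allowance = base * max(1.0, size_gb / small_volume_gb)
        return "likely-stuck" if elapsed > allowance else "in-progress"
    return "idle"
```

Tune both knobs to the copy throughput you actually observe between your backends; a result of likely-stuck is a prompt to read the logs, not proof of failure.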

Review the Storage Service Logs

On the block storage host, search /var/log/pf9/cindervolume-base.log for entries that contain the volume UUID.

Common error patterns and their meanings:

| Log Pattern | Meaning |
| --- | --- |
| No valid host was found | The destination backend rejected the placement (capacity or capability mismatch) |
| driver does not support migration | The source or destination driver does not implement the migration path |
| Timeout waiting for volume migration | The data copy stalled; often network or NFS issue |
| Volume copy failed | Backend-level copy failure; check the storage array |
| NFS share ... has insufficient space | NFS destination share lacks capacity for the full volume |
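As a rough aid, the pattern table can be turned into a small log scanner. The sketch below is hypothetical: the function name `diagnose` is made up for this example, and it assumes the volume UUID appears on the same line as the error text, which may not hold for multi-line tracebacks.

```python
import re

# Patterns and meanings taken from the table above. The "..." in the NFS
# pattern is matched with a wildcard because the share name varies.
KNOWN_PATTERNS = [
    (re.compile(r"No valid host was found"),
     "The destination backend rejected the placement (capacity or capability mismatch)"),
    (re.compile(r"driver does not support migration"),
     "The source or destination driver does not implement the migration path"),
    (re.compile(r"Timeout waiting for volume migration"),
     "The data copy stalled; often network or NFS issue"),
    (re.compile(r"Volume copy failed"),
     "Backend-level copy failure; check the storage array"),
    (re.compile(r"NFS share .* has insufficient space"),
     "NFS destination share lacks capacity for the full volume"),
]


def diagnose(lines, volume_id):
    """Return (log line, meaning) pairs for lines that mention volume_id
    and match one of the known error patterns."""
    findings = []
    for line in lines:
        if volume_id not in line:
            continue
        for pattern, meaning in KNOWN_PATTERNS:
            if pattern.search(line):
                findings.append((line.rstrip("\n"), meaning))
                break  # report the first matching pattern per line
    return findings
```

To use it, open /var/log/pf9/cindervolume-base.log on the block storage host and pass the file object as `lines` along with the volume UUID. If nothing matches, search for the patterns without the UUID filter, since some errors are logged without it.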

Self-Hosted deployments only

If the cinder-volume service runs as a pod, retrieve logs with:

Common Failure Modes

Incompatible Drivers

Not every driver pair supports cross-backend migration. The Persistent Storage Service relies on each driver advertising its capabilities. When a driver does not support the requested migration path, the operation fails immediately with No valid host was found or a driver capability error.

Remediation: Use the generic host-assisted migration path. This copies the volume data through the Persistent Storage Service host rather than delegating to the drivers:

--force-host-copy bypasses driver-to-driver negotiation and copies the raw volume data block-by-block. It is slower but works across any pair of backends.

Insufficient Capacity on the Destination Backend

The migration pre-checks may pass but the data copy fails if the destination backend has less available capacity than the volume's allocated size.

Check destination backend capacity:

Look at free_capacity_gb for the destination pool. It must exceed the volume's size plus the space held back by the reserved_percentage configured for that backend (a percentage of the backend's capacity that is kept free).
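The capacity rule can be expressed as a short calculation. This sketch assumes that reserved_percentage is applied to the pool's total capacity, as in the upstream Cinder capacity filter, and that an exact fit counts as sufficient; the function name and the total_capacity_gb parameter are introduced here for illustration, so verify the behavior against your deployment.

```python
import math


def fits_on_pool(volume_size_gb, free_capacity_gb, total_capacity_gb,
                 reserved_percentage):
    """Return True if the volume fits after the reserve is set aside.

    Assumption: the reserve is reserved_percentage percent of the pool's
    total capacity, rounded down to whole GB.
    """
    reserved_gb = math.floor(total_capacity_gb * reserved_percentage / 100)
    usable_gb = free_capacity_gb - reserved_gb
    return usable_gb >= volume_size_gb
```

For example, a 1000 GB pool with reserved_percentage of 10 and 150 GB free has only 50 GB usable under this model, so a 100 GB volume would not fit even though free_capacity_gb is larger than the volume.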

Remediation: Either free capacity on the destination backend, reduce reserved_percentage, or choose a different destination with sufficient capacity.

NFS Capacity Imbalance Across Shares

When an NFS-backed backend is configured with multiple NFS shares (for example, multiple NetApp ONTAP NFS exports), the Persistent Storage Service distributes volumes across shares based on available capacity. A migration that targets a backend with uneven share utilization may route the volume to an already-full share.

Detect share imbalance:

Each NFS share appears as a separate pool. Compare free_capacity_gb across pools. A pool that reports free_capacity_gb=0 or a very low value will reject new volumes even if total backend capacity is available.
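The comparison across pools can be scripted once the pool data has been collected. In this sketch the input shape (a list of dicts with name and free_capacity_gb keys) and both function names are assumptions made for the example; adapt them to whatever your pool listing actually returns.

```python
def find_full_pools(pools, min_free_gb):
    """Return the names of pools whose free capacity is below min_free_gb.

    pools: iterable of dicts with 'name' and 'free_capacity_gb' keys
    (a hypothetical shape for this example).
    """
    return sorted(
        p["name"] for p in pools if float(p["free_capacity_gb"]) < min_free_gb
    )


def free_capacity_spread(pools):
    """Return the gap in GB between the emptiest and fullest pool.

    A large spread suggests the shares are unevenly utilized.
    """
    values = [float(p["free_capacity_gb"]) for p in pools]
    return max(values) - min(values) if values else 0.0
```

Passing the size of the volume being migrated as min_free_gb lists the shares that cannot accept it.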

Remediation options:

  1. Delete or migrate volumes off the overloaded share to rebalance capacity.

  2. Add a new NFS share and update the nfs_shares_config file on the block storage host, then restart pf9-cindervolume-base to make the new share available for placement.

Backend-Specific Limitations

| Backend | Known Limitation |
| --- | --- |
| NetApp ONTAP NFS | FlexClone-based clone migrations require the destination to be on the same SVM. Cross-SVM migration falls back to file copy. |
| Tintri NFS | The Tintri driver does not support live migration of in-use volumes. The volume must be detached before retyping. |
| Pure Storage | Volume copy is performed natively on the array; requires both source and destination volumes to be visible to the same Pure array. Cross-array migration uses generic copy. |
| iSCSI / FC SAN backends | Cross-backend migration to an NFS backend (or vice versa) always uses generic host-assisted copy. |

Recover a Volume Stuck in retyping or maintenance

If a migration fails midway, the volume may be left in retyping or maintenance status with migration_status=error. In this state the volume is locked and cannot be used or deleted.

Step 1 — Confirm the Migration Has Truly Failed

Check the log for a definitive error (not just a timeout). Because copy time scales with volume size, allow at least an hour for a small volume and proportionally longer for a large one before concluding that the migration is stuck rather than slow.

Step 2 — Clean Up the Temporary Migration Volume

During a cross-backend migration, the Persistent Storage Service creates a temporary volume on the destination backend. If the migration fails, this temporary volume may be left behind. Find and delete it:

Delete any volume whose name contains the pattern migration-<VOLUME_UUID>:
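Before deleting anything, it helps to list the candidates and review them. The sketch below filters an already-fetched volume list by the naming pattern described above; the function name and the input shape (dicts with id and name keys) are assumptions made for this example.

```python
def find_leftover_migration_volumes(volumes, source_volume_id):
    """Return the IDs of volumes whose name contains
    'migration-<source_volume_id>'.

    volumes: iterable of dicts with 'id' and 'name' keys (hypothetical
    shape); a missing or null name is treated as an empty string.
    """
    marker = f"migration-{source_volume_id}"
    return [v["id"] for v in volumes if marker in (v.get("name") or "")]
```

Confirm that each returned ID is not the source volume itself and is not attached to anything before you delete it.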

Step 3 — Reset the Volume State

After the temporary volume is removed, reset the source volume's state to available:

Also reset the migration_status field:

Step 4 — Retry the Migration

After confirming the volume is in available state, address the root cause identified in the logs (capacity, driver compatibility, NFS share space), then retry:

Or for an explicit host-level migration:
