Per-Backend Performance Tuning and Troubleshooting

Overview

Storage performance problems often surface as slow volume creation, intermittent attach failures, or volume operations that time out. This guide covers the most impactful tuning knobs and common failure patterns for NFS-backed backends — specifically NetApp ONTAP NFS and Tintri NFS — because NFS introduces unique considerations around mount options, share capacity balance, and driver RPC timeouts.

For SAN-backed backends (iSCSI, Fibre Channel), the performance guidance in the vendor-specific configuration pages applies; this guide focuses on NFS.

In this guide, you will identify the cause of NFS storage performance problems and apply configuration changes to resolve them.

Prerequisites

  • Block storage host access for log review and configuration changes.

  • NFS exports accessible and mounted on block storage hosts.

  • pcdctl configured and authenticated.

Diagnose a Slow or Timing-Out NFS Backend

Check get_volume_stats RPC Latency

The Persistent Storage Service relies on periodic get_volume_stats calls to each backend to refresh capacity and capability information. If these calls are slow or time out, volume creation can fail with "No valid host was found" even when capacity is available, because the scheduler is working from stale or missing pool data.

Check for RPC timeout errors in the storage service log:

sudo grep -E 'get_volume_stats|Timeout|RPC|MessagingTimeout' /var/log/pf9/cindervolume-base.log | tail -50

A MessagingTimeout error or a "Timeout waiting for get_volume_stats" message means the backend took longer than the configured RPC timeout to respond.

Remediation:

The default RPC timeout is 60 seconds. If your backend consistently takes longer (common for large NFS shares with many volumes), increase the timeout in the backend configuration:

And increase the global RPC timeout in the [DEFAULT] section:
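To see which global value is currently set before changing it, a helper along these lines can read an option from the [DEFAULT] section. This is a minimal sketch: it assumes an INI-style configuration file and the upstream OpenStack option name rpc_response_timeout, and both the function name get_default_option and the file path in the usage comment are placeholders rather than part of the product.

```shell
# Print the value of OPTION from the [DEFAULT] section of an INI-style file.
# Prints nothing if the option is not set there (the service then uses its
# built-in default). If the option appears more than once, the last one wins.
get_default_option() {
    awk -v opt="$2" '
        # A section header switches the "inside [DEFAULT]" flag on or off.
        /^[ \t]*\[/ { in_default = ($0 ~ /^[ \t]*\[DEFAULT\]/); next }
        # Skip comment lines and lines without a key=value pair.
        in_default && $0 !~ /^[ \t]*[#;]/ && index($0, "=") > 0 {
            key = substr($0, 1, index($0, "=") - 1)
            gsub(/[ \t]/, "", key)
            if (key == opt) {
                val = substr($0, index($0, "=") + 1)
                gsub(/^[ \t]+|[ \t]+$/, "", val)
                found = 1
            }
        }
        END { if (found) print val }
    ' "$1"
}

# Usage (the path is a placeholder for your block storage configuration file):
#   get_default_option /path/to/cinder.conf rpc_response_timeout
```

An empty result means the option is not set explicitly, so the 60-second default applies.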

Restart pf9-cindervolume-base after making this change:

Verify NFS Mount Options

Incorrect or suboptimal NFS mount options are among the most common causes of NFS backend performance problems. The mount options used by the driver are set in the backend configuration's nfs_mount_options parameter and applied when the Persistent Storage Service mounts NFS shares.

Check current mounts:
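One way to list the active NFS mounts together with the options the kernel actually applied is sketched below. It assumes a Linux host that exposes /proc/mounts; the helper name list_nfs_mounts is illustrative.

```shell
# List NFS mounts with their active options, one per line:
#   <mount point> <options>
# Reads /proc/mounts by default; pass another file in the same format to
# inspect a saved copy from a different host.
list_nfs_mounts() {
    awk '$3 == "nfs" || $3 == "nfs4" { print $2, $4 }' "${1:-/proc/mounts}"
}

# Usage:
#   list_nfs_mounts
```

The options shown are the negotiated ones, which can differ from what nfs_mount_options requested (for example, a server may cap rsize and wsize).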

Recommended NFS mount options for block storage backends:

Option                      Purpose
vers=3                      Use NFSv3 (required for some drivers; check driver documentation)
rsize=262144,wsize=262144   Maximize read/write block size for throughput (256 KiB)
nconnect=16                 Open multiple TCP connections per mount for parallelism (Linux 5.3+)
noatime                     Disable access-time updates to reduce metadata write load
lookupcache=pos             Cache only positive dentry lookups; negative lookups are always revalidated, so newly created files are seen promptly
hard,intr                   Hard mounts retry I/O until the server responds, which prevents silent data loss on network interruption; intr is accepted but ignored on Linux kernels 2.6.25 and later
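To compare a mount's active options with the recommendations above, a small helper along these lines may be useful. It is a sketch only: the function name missing_mount_opts is hypothetical, and intr is left out of the checked list because current kernels ignore it.

```shell
# Print each recommended option that is absent from a comma-separated mount
# option string. The list mirrors the recommendations in this guide; adjust
# it to whatever your driver documentation requires.
missing_mount_opts() {
    active=",$1,"   # wrap in commas so every option can be matched as ",opt,"
    for want in vers=3 rsize=262144 wsize=262144 nconnect=16 noatime lookupcache=pos hard; do
        case "$active" in
            *",$want,"*) ;;          # option is present
            *) echo "$want" ;;       # option is missing
        esac
    done
}

# Usage:
#   missing_mount_opts "rw,vers=3,hard,noatime"
```

No output means every option in the list is present on that mount.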

NetApp ONTAP NFS — Tuning and Troubleshooting

Slow Volume Creation: FlexClone Not Available

NetApp NFS volume creation uses FlexClone when creating volumes from images or snapshots. If FlexClone is not licensed or not enabled on the SVM, the driver falls back to a full file copy, which is significantly slower.

Verify FlexClone availability:

A log line like FlexClone feature is not available confirms the fallback. Contact your NetApp administrator to enable the FlexClone license on the SVM.
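One way to look for that message is sketched below. It assumes the driver logs to the same file used for the RPC check earlier in this guide and that the wording matches the example above apart from letter case; the function name count_flexclone_fallbacks is illustrative.

```shell
# Count log lines that report the FlexClone fallback. Reads the log on stdin.
# The match is case-insensitive; "|| true" keeps the exit status at zero when
# grep finds nothing (it still prints 0).
count_flexclone_fallbacks() {
    grep -ci 'flexclone feature is not available' || true
}

# Usage:
#   sudo cat /var/log/pf9/cindervolume-base.log | count_flexclone_fallbacks
```

A count of 0 suggests the driver has not reported the fallback in that log file; rotated logs would need to be checked separately.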

NFS Share Capacity Imbalance

NetApp NFS backends are typically configured with multiple NFS shares (multiple ONTAP FlexVol exports). The driver distributes volumes across shares by free capacity. If shares become unevenly loaded, new volumes keep landing on the share with the most free space until its free capacity drops to the level of the others.

Detect imbalance:

Each share appears as a separate pool. A significant difference in free_capacity_gb across pools indicates imbalance.
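Once you have each pool's free_capacity_gb, the spread can be summarized with a short helper such as the one below. This is a sketch under assumptions: the input format (one "pool name, free GB" pair per line, separated by whitespace) and the function name pool_free_spread are hypothetical, so reshape the pool list output from your own tooling to match.

```shell
# Read "<pool name> <free_capacity_gb>" pairs on stdin and report the pools
# with the most and least free space, plus the gap between them in GB.
pool_free_spread() {
    awk '
        NF >= 2 {
            v = $2 + 0   # force numeric comparison
            if (!seen || v > max) { max = v; max_pool = $1 }
            if (!seen || v < min) { min = v; min_pool = $1 }
            seen = 1
        }
        END {
            if (seen)
                printf "most_free=%s least_free=%s gap_gb=%.1f\n", max_pool, min_pool, max - min
        }
    '
}

# Usage (pool names and values are made up for illustration):
#   printf 'host@netapp#/vol1 500\nhost@netapp#/vol2 120\n' | pool_free_spread
```

What counts as a significant gap depends on your volume sizes; a gap larger than several typical volumes is a reasonable starting point for investigation.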

Remediation options:

  1. Expand the underloaded shares. Increase the FlexVol quota on lightly loaded ONTAP volumes so the driver places more new volumes there.

  2. Migrate volumes off overloaded shares. Use volume migration, or a retype with an on-demand migration policy, to move volumes from a full share's pool to a less-loaded one.

  3. Add a new share. Add a new NFS export to nfs_shares_config and restart pf9-cindervolume-base. The driver will prefer it for new placements until it reaches parity.

NetApp NFS Mount Options

Explanation of the NetApp-specific additions to the recommended mount options:

  • write=eager — Enables eager write flushing, which reduces write latency on ONTAP NFS exports.

  • nconnect=16 — Opens 16 parallel TCP connections to the NFS server, improving throughput for concurrent I/O. Requires Linux kernel 5.3 or later.

Check Pool Name Filtering

If volumes are not being placed on the expected ONTAP FlexVols, verify the netapp_pool_name_search_pattern setting. An overly restrictive regex can exclude pools that should be eligible:

Test the regex against your pool names using the pool list:
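A quick offline check can be sketched with grep. The assumptions here are that the pattern must match the whole pool name and that the pattern is simple enough for grep's extended syntax to agree with the driver's regex engine; the function name match_pools and the pool names in the usage comment are made up for illustration.

```shell
# Print the pool names (one per line on stdin) that a candidate pattern accepts.
# -E selects extended regex syntax; -x requires the pattern to match the whole
# name. "|| true" keeps the exit status at zero when nothing matches.
match_pools() {
    grep -E -x -e "$1" || true
}

# Usage:
#   printf 'cinder_vol1\ncinder_vol2\nscratch\n' | match_pools 'cinder_.*'
```

Any pool you expected to see but that is missing from the output is being excluded by the pattern.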

Tintri NFS — Tuning and Troubleshooting

Volume Operations Timing Out

The Tintri driver communicates with the Tintri REST API for operations such as snapshot management and QoS reporting. If the REST API endpoint (vmstore_rest_address) is unreachable or slow, volume operations can time out.

Check REST API connectivity:

A successful response returns a JSON object with server version information. An error or timeout indicates a network or Tintri appliance issue.

Check for REST API errors in storage service logs:

Tintri NFS Mount Options

The Tintri driver requires NFSv3. The following options are recommended:

Do not add nconnect for Tintri unless your Tintri firmware version supports it. Check with your Tintri administrator before enabling parallel connections.

NFS Share Capacity: Tintri Single-Share Deployments

Unlike NetApp NFS, a Tintri backend is typically configured with a single NFS share (nas_share_path). Capacity exhaustion on that share means the entire backend is unavailable for new volumes.

Monitor share utilization:

When free_capacity_gb drops below the volume sizes you regularly create, add capacity on the Tintri appliance or migrate some volumes to another backend.

QCOW2 Volume Format Performance

Tintri stores volumes in QCOW2 format when vmstore_qcow2_volumes = true. QCOW2 enables thin provisioning and space-efficient snapshots. However, QCOW2 has slightly higher I/O overhead for random-write workloads compared to raw volumes.

If you observe higher-than-expected write latency on Tintri-backed volumes, benchmark in a non-production environment with vmstore_qcow2_volumes set to true and then to false to determine whether the overhead is significant for your workload. Changing this setting affects only newly created volumes; existing volumes retain their format.

General NFS Troubleshooting Steps

  1. Verify NFS connectivity from the block storage host to the NFS server:

  2. Check whether shares are mounted:

    If shares are not mounted, check that the NFS client package is installed (nfs-common on Debian and Ubuntu, nfs-utils on RHEL-family hosts) and that the NFS server is reachable.

  3. Check for stale file handles:

    A stale file handle means an NFS mount became invalid (server restarted or export removed and recreated). Restart pf9-cindervolume-base to force a re-mount.

  4. Verify available disk space on NFS exports:
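For step 4, a filter over df output can flag exports that are close to full. This is a minimal sketch: it assumes POSIX-format output from df -P and treats any filesystem whose source contains ":/" as an NFS mount; the function name nfs_mounts_over is hypothetical.

```shell
# Read `df -P` output on stdin and print "<mount point> <use%>" for NFS-style
# mounts (source contains ":/") whose usage is at or above THRESHOLD percent.
# THRESHOLD defaults to 90.
nfs_mounts_over() {
    awk -v limit="${1:-90}" '
        NR > 1 && $1 ~ /:\// {
            used = $5
            sub(/%/, "", used)       # strip the trailing percent sign
            if (used + 0 >= limit + 0) print $6, $5
        }
    '
}

# Usage:
#   df -P | nfs_mounts_over 85
```

Mount points that contain spaces are not handled by this sketch.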
