Setting up CTDB for Clustered NFS

Assumptions

This guide is aimed at the Linux kernel NFS daemon.

CTDB can be made to manage another NFS server by using the CTDB_NFS_CALLOUT script option to specify an NFS server-specific call-out.

Prerequisites

NFS configuration

Exports

Requirements:

  • NFS exports must be the same on all nodes
  • For each export, the fsid option must be set to the same value on all nodes.

For the Linux kernel NFS server, this is usually in /etc/exports.

Example:

 /clusterfs0/data *(rw,fsid=1235)
 /clusterfs0/misc *(rw,fsid=1237)

Daemon configuration

Clustering NFS has some extra requirements compared to running a regular NFS server, so some extra configuration is needed.

  • All NFS RPC services should run on fixed ports, which should be the same on all cluster nodes. Some clients can become confused if ports change during fail-over.
  • NFSv4 should be disabled.
  • statd should be configured to use CTDB's high-availability call-out.
  • The NFS_HOSTNAME variable must be set in the NFS system configuration; CTDB's high-availability call-out loads this configuration and uses NFS_HOSTNAME. NFS_HOSTNAME should be resolvable into the CTDB public IP addresses that are used by NFS clients (see the example below this list).
  • statd's hostname (passed via the -n option) must use the value of NFS_HOSTNAME.
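
For illustration, the name-resolution requirement above could be satisfied by a DNS record or by hosts-file entries on the clients. The name cluster1 and the address 10.0.0.31 are hypothetical; substitute the actual NFS_HOSTNAME value and a CTDB public IP address used by your clients:

 # /etc/hosts on an NFS client (or an equivalent DNS A record)
 10.0.0.31   cluster1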

Selecting fixed ports for NFS RPC services

Although mountd has a fixed port registered with IANA (20048), none of the other NFS RPC services have registered port numbers. To stop ports from changing across restarts and fail-overs, fixed ports need to be selected for all of the relevant services.

It is important to make sure that the configured ports:

  • Are not used by other services.
  • Cannot be used by the Linux kernel network stack as the local port for outgoing connections.

The local port range can be checked as follows:

 $ sysctl net.ipv4.ip_local_port_range
 net.ipv4.ip_local_port_range = 32768	60999

The values above are the defaults on recent Linux kernels.

It is possible to re-use a range of system ports that are not in use on cluster nodes (e.g. AppleTalk 201-205). However, some people may not consider this to be acceptable practice.

In the examples below, ports 61001-61005 are used as the fixed NFS RPC service ports. While these are in the dynamic/private port range specified by RFC 6335, the default Linux local port range shown above stops below them, so the kernel will not hand them out as local ports.
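
If a different local port range is in use and overlaps the chosen ports, the kernel can be told not to use those ports for outgoing connections. This is only a sketch using the example ports chosen above; adjust it to your own selection and make it persistent via sysctl configuration files:

 # Stop the kernel from handing out the chosen NFS ports as local ports
 sysctl -w net.ipv4.ip_local_reserved_ports=61001-61005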

Red Hat Linux variants

Newer Red Hat Linux variants use /etc/nfs.conf, which should look something like:

 [general]
 # defaults
 
 [exportfs]
 # defaults
 
 [gssd]
 use-gss-proxy=1
 
 [lockd]
 port = 61005
 udp-port = 61005
 
 [mountd]
 port = 61003
 
 [nfsdcltrack]
 # defaults
 
 [nfsd]
 vers4 = n
 threads = 8
 
 [statd]
 name = cluster1
 port = 61001
 outgoing-port = 61002
 ha-callout = /etc/ctdb/statd-callout
 
 [sm-notify]
 # defaults

On older variants, the configuration file will be /etc/sysconfig/nfs and it should look something like:

 NFS_HOSTNAME="cluster1"
 
 STATD_PORT=61001
 STATD_OUTGOING_PORT=61002
 MOUNTD_PORT=61003
 RQUOTAD_PORT=61004
 LOCKD_UDPPORT=61005
 LOCKD_TCPPORT=61005
 
 STATDARG="-n ${NFS_HOSTNAME}"
 STATD_HA_CALLOUT="/etc/ctdb/statd-callout"
 
 RPCNFSDARGS="-N 4"
 RPCNFSDCOUNT=8

This should work with both systemd and Sys-V init variants.

When using systemd, /etc/sysconfig/rpc-rquotad should also contain:

 RPCRQUOTADOPTS="-p 61004"

Debian GNU/Linux variants

The following configuration files should work for both systemd and Sys-V init:

/etc/default/nfs-kernel-server:

 RPCNFSDOPTS="-N 4"
 RPCNFSDCOUNT=8
 
 RPCMOUNTDOPTS="-p 61003"

/etc/default/nfs-common:

 NFS_HOSTNAME="cluster1"
 
 STATDOPTS="-n ${NFS_HOSTNAME} -p 61001 -o 61002 -H /etc/ctdb/statd-callout -T 61005 -U 61005"

/etc/default/quota:

 RPCRQUOTADOPTS="-p 61004"

Unfortunately, RPCNFSDOPTS isn't used by Debian Sys-V init, so there is no way to disable NFSv4 via the configuration file.

Configure CTDB to manage NFS

The NFS event scripts must be enabled:

 ctdb event script enable legacy 60.nfs
 ctdb event script enable legacy 06.nfs
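
The result can be confirmed by listing the legacy event scripts, which marks the enabled scripts in its output (this assumes a CTDB version that provides the event script commands used above):

 ctdb event script list legacy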

CTDB will manage, start, stop and restart the NFS services, so the operating system should be configured not to start or stop them automatically.

Red Hat Linux variants

On Red Hat Linux variants, the relevant services are nfs and nfslock. Starting them at boot time is not recommended; this can be disabled using chkconfig:

 chkconfig nfs off
 chkconfig nfslock off

The service names and the mechanism for disabling them vary across operating systems.
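
On systemd-based distributions the equivalent is typically something like the following, although unit names differ between distributions and versions (nfs-server is common; lock/statd related units vary):

 systemctl disable --now nfs-server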

Client configuration

IP addresses, rather than a DNS/host name, should be used when configuring client mounts. NFSv3 locking is heavily tied to IP addresses and can break if a client uses round-robin DNS. This means load balancing for NFS is achieved by hand-distributing public IP addresses across clients.
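
For example, a client mount using one of the public IP addresses directly might look like this (the address is illustrative and the export path re-uses the earlier exports example):

 mount -t nfs -o vers=3 10.0.0.31:/clusterfs0/data /mnt/data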

IMPORTANT

Never mount the same NFS directory on a client from two different nodes in the cluster at the same time. The client-side caching in NFS is very fragile and assumes that an object can only be accessed through a single path at a time.

Event scripts

CTDB clustering for NFS relies on two event scripts, 06.nfs and 60.nfs. These are provided as part of CTDB and do not usually need to be changed.

Using CTDB with other NFS servers

The NFS event scripts provide a generic framework for managing NFS from CTDB. These scripts also include infrastructure for flexible NFS RPC service monitoring. There are two configuration variables that may need to be changed when using an NFS server other than the default (the Linux kernel NFS server).

CTDB_NFS_CALLOUT

This variable is the absolute pathname of the desired NFS call-out used by CTDB's NFS event scripts.

If CTDB_NFS_CALLOUT is unset or null then CTDB will use the provided nfs-linux-kernel-callout.

An nfs-ganesha-callout is provided as an example as part of CTDB's documentation. This call-out has not been as extensively tested as nfs-linux-kernel-callout.
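
On recent CTDB versions, event script options such as this are set in the script options file, typically /etc/ctdb/script.options (older versions use /etc/sysconfig/ctdb or similar). The path below is only illustrative, since the location of the installed call-out depends on how CTDB was packaged:

 CTDB_NFS_CALLOUT="/usr/local/etc/ctdb/nfs-ganesha-callout"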

Writing a call-out

A call-out should implement any required methods. Available methods are:

startup, shutdown
Start up or shut down the entire NFS service.
start, stop
Start or stop a subset of services, as referenced from NFS checks (see below).
releaseip-pre, releaseip, takeip-pre, takeip
Take actions before or after an IP address is released or taken over during IP failover.
monitor-list-shares
List exported directories that should be monitored for existence. This can be used to ensure that cluster filesystems are mounted.
monitor-pre, monitor-post
Additional monitoring before or after the standard monitoring of RPC services (see below).
register
Should list the names of all implemented methods. This is an optimisation that stops the event scripts from calling unimplemented methods in the call-out.

See the existing call-outs for implementation details and suggested style.
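
As a minimal sketch only, a call-out receives the method name as its first argument; the service unit name used here is hypothetical, and real call-outs such as nfs-linux-kernel-callout do considerably more:

 #!/bin/sh
 # Hypothetical call-out skeleton: CTDB's NFS event scripts invoke it
 # with the method name as the first argument.
 case "$1" in
     startup)  systemctl start example-nfs-server ;;  # hypothetical unit
     shutdown) systemctl stop example-nfs-server ;;
     register)
         # Only the methods listed here will be called
         echo "startup"
         echo "shutdown"
         ;;
     *) exit 0 ;;
 esac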

CTDB_NFS_CHECKS_DIR

This is the absolute pathname of a directory containing files that describe how to monitor the desired NFS RPC services. The checks can also be configured to try to restart services if they remain unresponsive.

If CTDB_NFS_CHECKS_DIR is unset or null then CTDB uses a set of NFS RPC checks in the nfs-checks.d subdirectory of the CTDB configuration directory.

When providing a different set of NFS RPC checks, create a new subdirectory, such as nfs-checks-enabled.d or nfs-checks-ganesha.d, and set CTDB_NFS_CHECKS_DIR to point to this directory. Populate the directory with custom check files and/or symbolic links to the desired checks in nfs-checks.d. This method is upgrade-safe: if you remove certain checks, they will not be reinstated when you upgrade CTDB to a newer version.
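
A sketch of this approach, assuming the CTDB configuration directory is /etc/ctdb and that script options live in /etc/ctdb/script.options:

 mkdir /etc/ctdb/nfs-checks-enabled.d
 # Link in the shipped checks, then remove the links for unwanted checks
 ln -s /etc/ctdb/nfs-checks.d/* /etc/ctdb/nfs-checks-enabled.d/
 
 # In /etc/ctdb/script.options:
 CTDB_NFS_CHECKS_DIR="/etc/ctdb/nfs-checks-enabled.d"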

Writing NFS RPC check files

These files are described by the relevant README file. See the files shipped with CTDB for examples.

Troubleshooting

File handle consistency

If fail-over testing shows stale file handles or other unexpected issues, this may be due to a cluster filesystem providing inconsistent device numbers across the nodes of the cluster for an exported filesystem.

NFS implementations often use device numbers when constructing file handles. If file handles are constructed inconsistently across the cluster then this can result in stale file handles. In such cases you should test device and inode number uniformity of your cluster filesystem. If device numbers are inconsistent then it may or may not be possible to configure the NFS implementation to construct file handles using some other algorithm.
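
For example, device and inode numbers for a test file on the exported filesystem can be compared across nodes with something like the following (the test file path is illustrative):

 onnode all stat -c %d:%i /clusterfs0/data/testfile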

libnfs includes an example program called nfs-fh that can be used to check that file handles are constructed consistently across cluster nodes.

 # onnode -c all ./examples/nfs-fh nfs://127.0.0.1/testfile
 
 >> NODE: 10.1.1.1 <<
 43000be210dd17000000010000feffffffffffffff000000
 
 >> NODE: 10.1.1.2 <<
 43000be210dd17000000010000feffffffffffffff000000

In this case the file handles are consistent.