Setting up pCIFS using Samba and CTDB
As of April 2007 you can set up a simple Samba3 or Samba4 CTDB cluster, running either on loopback (with simulated nodes) or on a real cluster with TCP. This page tells you how to get started.
The setup instructions on this page are modelled on setting up a cluster of N nodes that function in nearly all respects as a single multi-homed node. So the cluster will export N IP interfaces, each of which is equivalent (same shares) and which offers coherent CIFS file access across all nodes.
The clustering model utilizes IP takeover techniques to ensure that the full set of public IP addresses assigned to services on the cluster will always be available to the clients even when some nodes have failed and become unavailable.
Binary packages for some operating systems are available on the CTDB home page at http://ctdb.samba.org/
Getting the source code
If you are building from source, you need two source trees: a copy of Samba3 and the ctdb code itself. CTDB source trees are stored in bzr repositories. See http://bazaar-vcs.org/ for more information on bzr.
Since Samba version 3.3 all cluster-relevant changes have been merged to the mainstream Samba code. Please refer to the Samba website for download details and current release information.
To get an initial checkout of the ctdb code do this:
    cd /usr/src/
    rsync -avz samba.org::ftp/unpacked/ctdb .
To update this tree when improvements are made in the upstream code do this:
    cd ctdb
    bzr merge http://samba.org/~tridge/ctdb
If you don't have bzr and can't easily install it, then you can instead use the following command to update your tree to the latest version:
    cd ctdb
    rsync -avz samba.org::ftp/unpacked/ctdb/ .
Building the CTDB tree
To build a copy of the CTDB code you should do this:
    cd ctdb
    ./autogen.sh
    ./configure
    make
    make install
You need to install ctdb on all nodes of your cluster.
Building the Samba3 tree
To build a copy of Samba3 with clustering and ctdb support you should do this:
    cd samba-3.3.x/source
    ./autogen.sh
    ./configure --with-ctdb=/usr/src/ctdb --with-cluster-support --enable-pie=no --with-shared-modules=idmap_tdb2
    make
Once compiled, you should install Samba on all cluster nodes.
Next you need to initialise the Samba password database, e.g.
smbpasswd -a root
Samba with clustering must use the tdbsam or ldap SAM passdb backends (it must not use the default smbpasswd backend), or must be configured to be a member of a domain. The rest of the configuration of Samba is exactly as it is done on a normal system. See the docs on http://samba.org/ for details.
Critical smb.conf parameters
A clustered Samba install must set some specific configuration parameters:
* clustering = yes
* idmap backend = tdb2
CTDB Cluster Configuration
These are the primary configuration files for CTDB. When CTDB is installed, it will install template versions of these files which you need to edit to suit your system. The current set of config files for CTDB is also available in the /usr/src/ctdb/config directory.
The /etc/sysconfig/ctdb file contains the startup parameters for ctdb. When you installed ctdb, a template config file should have been installed in /usr/src/ctdb/config/ctdb.sysconfig. Copy this to /etc/sysconfig/ctdb and edit it, following the instructions in the template.
The most important options are:
* CTDB_RECOVERY_LOCK
* CTDB_PUBLIC_ADDRESSES
* CTDB_PUBLIC_INTERFACE
* CTDB_MANAGES_SAMBA
Please check those carefully.
This file needs to be created as /etc/ctdb/nodes and contains a list of the private IP addresses that the CTDB daemons will use in your cluster. This should be a private non-routable subnet which is only used for internal cluster traffic. This file must be the same on all nodes in the cluster.
    10.1.1.1
    10.1.1.2
    10.1.1.3
    10.1.1.4
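Because the nodes file must be identical on every node, it can be worth comparing the copies before starting ctdb. The following is a minimal sketch, assuming each node's /etc/ctdb/nodes has already been fetched into a local file (for example with scp). The function name nodes_files_identical and the file names in the usage note are illustrative and not part of CTDB.

```shell
#!/bin/sh
# Hypothetical helper (not part of CTDB): succeed only if every copy of the
# nodes file listed after the first argument is byte-identical to the first.
nodes_files_identical() {
    reference="$1"
    shift
    for copy in "$@"; do
        if ! cmp -s "$reference" "$copy"; then
            echo "nodes file differs: $copy" >&2
            return 1
        fi
    done
    return 0
}
```

A possible invocation would be: nodes_files_identical nodes.node1 nodes.node2 nodes.node3 nodes.node4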
The public addresses file (the file named by the CTDB_PUBLIC_ADDRESSES option) is only required if you plan to use IP takeover. In order to use IP takeover you must specify which interface to use by setting the CTDB_PUBLIC_INTERFACE variable in /etc/sysconfig/ctdb. You must also specify the list of public IP addresses to use in this file.
This file contains a list of public cluster addresses, one for each node. These are the addresses that the smbd daemons will bind to. The file must have the same number of entries as the nodes file.
    192.168.1.1/24
    192.168.1.2/24
    192.168.2.1/24
    192.168.2.2/24
These are the IP addresses that you should configure in DNS for the name of the clustered samba server and are the addresses that CIFS clients will connect to. The CTDB cluster utilizes IP takeover techniques to ensure that as long as at least one node in the cluster is available, all the public IP addresses will always be available to clients.
A CTDB node will only take over IP addresses that are in the same subnet as its own public IP address. In the example above, nodes 0 and 1 can take over each other's public IP addresses, and likewise nodes 2 and 3, but nodes 0 and 1 can NOT take over the IP addresses of nodes 2 or 3 because they are on a different subnet.
Do not assign these addresses to any of the interfaces on the host. CTDB will add and remove these addresses automatically at runtime.
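The same-subnet rule can be checked by hand for any pair of entries in the public addresses file. The sketch below is only an illustration of the arithmetic, not CTDB's own implementation; the function names ip_to_int, network_of and same_subnet are made up for this example, and it handles IPv4 entries in address/prefix form only.

```shell
#!/bin/sh
# Convert dotted-quad "a.b.c.d" to a 32-bit integer.
# The body runs in a subshell so the IFS change does not leak to the caller.
ip_to_int() (
    IFS=.
    set -- $1
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
)

# Print the network number (as an integer) of an "address/prefix" entry.
network_of() {
    addr=${1%/*}
    prefix=${1#*/}
    mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
    echo $(( $(ip_to_int "$addr") & mask ))
}

# Succeed if two "address/prefix" entries have the same prefix length
# and the same network number, i.e. they are in the same subnet.
same_subnet() {
    [ "${1#*/}" = "${2#*/}" ] && [ "$(network_of "$1")" = "$(network_of "$2")" ]
}
```

With the example file above, same_subnet 192.168.1.1/24 192.168.1.2/24 succeeds, while same_subnet 192.168.1.1/24 192.168.2.1/24 fails.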
The /etc/ctdb/events.d directory holds a collection of scripts that CTDB calls when certain events occur, so that site-specific tasks can be performed.
The events currently implemented are:
1. when the cluster starts up
2. when a node takes over an IP address
3. when a node releases an IP address
4. when recovery has completed and the cluster is reconfigured
5. when the cluster performs a clean shutdown
6. during normal operations, to monitor the health of each managed service
Please see the service scripts that are installed by ctdb in /etc/ctdb/events.d for examples of how to configure other services to be aware of the HA features of CTDB.
Also see /etc/ctdb/events.d/README for additional documentation on how to write and modify event scripts.
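As a rough illustration of how an event script dispatches on the event it is called with, here is a skeleton. It assumes the event names startup, takeip, releaseip, recovered, shutdown and monitor for the six events listed above; check these against the README and the shipped scripts for your CTDB version. The function name handle_event and the messages are purely illustrative.

```shell
#!/bin/sh
# Hypothetical skeleton: CTDB passes the event name as the first argument.
# Event names are assumed; verify them in /etc/ctdb/events.d/README.
handle_event() {
    cmd="$1"
    case "$cmd" in
        startup)   echo "cluster starting" ;;
        takeip)    echo "took over an address" ;;
        releaseip) echo "released an address" ;;
        recovered) echo "recovery complete" ;;
        shutdown)  echo "clean shutdown" ;;
        monitor)   echo "health check" ;;
        *)         echo "ignoring event: $cmd" ;;
    esac
    return 0
}
```

A real script would replace each echo with the site-specific action and would normally end by calling the handler with "$@".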
By default CTDB uses the IANA-assigned TCP port 4379 for its traffic. To configure a different port, add a ctdb entry to the /etc/services file.
Example: to change CTDB to use port 9999, add the following line to /etc/services:
Note: all nodes in the cluster MUST use the same port or else CTDB will not start correctly.
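The example line itself does not appear on this page. Assuming the usual /etc/services layout of "name port/protocol", an entry for port 9999 would presumably read "ctdb 9999/tcp". The sketch below shows one way to confirm which port a node is configured for; the function name ctdb_port is hypothetical, and the fallback to 4379 simply mirrors the default stated above.

```shell
#!/bin/sh
# Hypothetical helper: print the TCP port configured for "ctdb" in a
# services-format file, or the default 4379 when there is no such entry.
ctdb_port() {
    services_file="${1:-/etc/services}"
    port=$(awk '$1 == "ctdb" && $2 ~ /\/tcp$/ { sub(/\/tcp$/, "", $2); print $2; exit }' "$services_file" 2>/dev/null)
    echo "${port:-4379}"
}
```

Running ctdb_port on every node and comparing the results is a quick way to catch a mismatch before starting the cluster.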
You need to set up some method for your Windows and NFS clients to find the nodes of the cluster, and automatically balance the load between the nodes. We recommend that you set up a round-robin DNS entry for your cluster, listing all the public IP addresses that CTDB will be managing as a single DNS A record.
You may also wish to set up a static WINS server entry listing all of your cluster nodes' IP addresses.
Managing Network Interfaces
The default install of CTDB is able to add/remove IP addresses from your network interfaces using the CTDB_PUBLIC_ADDRESSES option shown above.
For more sophisticated interface management you will need to add a new events script in /etc/ctdb/events.d/.
For example, say you wanted CTDB to add a route whenever it takes over a public IP address. You could have an event script called /etc/ctdb/events.d/11.route that looks like this:
    #!/bin/sh

    . /etc/ctdb/functions
    loadconfig ctdb

    cmd="$1"
    shift

    case $cmd in
        takeip)
            # we ignore errors from this, as the route might be up already when we're grabbing
            # a 2nd IP on this interface
            /sbin/ip route add $CTDB_PUBLIC_NETWORK via $CTDB_PUBLIC_GATEWAY dev $1 2> /dev/null
            ;;
    esac

    exit 0
Then you would put CTDB_PUBLIC_NETWORK and CTDB_PUBLIC_GATEWAY in /etc/sysconfig/ctdb like this:
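The settings themselves are missing from this page. As /etc/sysconfig/ctdb is read as shell variable assignments, the two entries would take roughly the following form. The network and gateway shown here are placeholder values chosen for illustration and must be replaced with the ones for your own site.

```shell
# Placeholder values only (assumed for illustration): substitute the
# public network and its gateway for your own site.
CTDB_PUBLIC_NETWORK="10.1.2.0/24"
CTDB_PUBLIC_GATEWAY="10.1.2.1"
```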
Filesystem specific configuration
The cluster filesystem you use with ctdb plays a critical role in ensuring that CTDB works seamlessly. Here are some filesystem-specific tips.
If you are interested in testing a new cluster filesystem with CTDB then I strongly recommend looking at the page on testing filesystems using ping_pong.
IBM's GPFS filesystem
The GPFS filesystem (see http://www-03.ibm.com/systems/clusters/software/gpfs.html) is a proprietary cluster filesystem that has been extensively tested with CTDB/Samba. When using GPFS, the following smb.conf settings are recommended:
    clustering = yes
    idmap backend = tdb2
    fileid:mapping = fsname
    vfs objects = gpfs fileid
    gpfs:sharemodes = No
    force unknown acl user = yes
    nfs4: mode = special
    nfs4: chown = yes
    nfs4: acedup = merge
The ACL-related options should only be enabled if you have NFSv4 ACLs enabled on your filesystem.
The most important of these options is "fileid:mapping". You risk data corruption if you use a different mapping backend with Samba and GPFS, because locking will break across nodes. NOTE: You must also load "fileid" as a vfs object in order for this to take effect.
RedHat GFS filesystem
Red Hat GFS is a native file system that interfaces directly with the Linux kernel file system interface (VFS layer).
The gfs_controld daemon manages mounting, unmounting, recovery and POSIX locks. Edit /etc/init.d/cman (if using Red Hat Cluster Suite) to start gfs_controld with the '-l 0 -o 1' flags to optimize POSIX locking performance. You'll notice the difference this makes by running the ping_pong test with and without these options.
A complete HowTo document for setting up clustered Samba with CTDB and GFS2 is here: GFS CTDB HowTo
Lustre® is a scalable, secure, robust, highly-available cluster file system. It is designed, developed and maintained by Cluster File Systems, Inc (see http://www.clusterfs.com).
Tests have been done with CTDB/Samba on Lustre releases 1.4.x and 1.6.x. When mounting Lustre, the "-o flock" option should be specified to enable cluster-wide byte-range locking among all Lustre clients.
These two versions have different configuration and startup mechanisms. More information is available at http://wiki.lustre.org.
Although Lustre configuration differs between the two versions, setting up CTDB/Samba is the same on both. The following settings are recommended:
    clustering = yes
    idmap backend = tdb2
    private dir=/mnt/lustre/ctdb
    fileid:mapping = fsname
    use mmap = no
    nt acl support = yes
    ea support = yes
The "fileid:mapping" and "use mmap" options must be specified to avoid possible data corruption. The sixth option, "nt acl support", maps POSIX ACLs to the Windows NT format. At the moment, Lustre only supports POSIX ACLs.
GlusterFS is a cluster file-system capable of scaling to several peta-bytes that is easy to configure. It aggregates various storage bricks over Infiniband RDMA or TCP/IP interconnect into one large parallel network file system. GlusterFS is based on a stackable user space design without compromising performance. It uses Linux File System in Userspace (FUSE) to achieve all this.
NOTE: GlusterFS has not yet had extensive testing but this is currently underway.
Currently from versions 2.0 to 2.0.4 of GlusterFS, it must be patched with:
This is to ensure GlusterFS passes the ping_pong test. This issue is being tracked at:
Update: As of GlusterFS 2.0.6 this has been fixed.
- OCFS2 - see http://oss.oracle.com/projects/ocfs2/
    fileid:mapping = fsid
    vfs objects = fileid
OCFS2 1.4 offers cluster-wide byte-range locking.
Starting the cluster
Just start the ctdb service on all nodes. A sample init script (works for RedHat) is located in /usr/src/ctdb/config/ctdb.init
If you have taken advantage of the ability of CTDB to start other services, then you should disable those other services with chkconfig, or your system's service configuration tool. Those services will instead be started by ctdb using the /etc/ctdb/events.d service scripts.
If you wish to cope with software faults in ctdb, or want ctdb to restart automatically when an administrator kills it, then you may wish to add a cron entry for root like this:
* * * * * /etc/init.d/ctdb cron > /dev/null 2>&1
Testing your cluster
Once your cluster is up and running, you may wish to test that it is functioning correctly. The following tests may help with that.
The ctdb package comes with a utility called ctdb that can be used to view the behaviour of the ctdb cluster. If you run it with no options it will provide some terse usage information. The most commonly used commands are:
- ctdb ping
- ctdb status
You can check for connectivity to the smbd daemons on each node using smbcontrol:
- smbcontrol smbd ping
Using Samba4 smbtorture
The Samba4 version of smbtorture has several tests that can be used to benchmark a CIFS cluster. You can download Samba4 like this:
    git clone git://git.samba.org/samba.git
    cd samba/source4
Then configure and compile it as usual. The particular tests that are helpful for cluster benchmarking are the RAW-BENCH-OPEN, RAW-BENCH-LOCK and BENCH-NBENCH tests. These tests take a unclist that allows you to spread the workload out over more than one node. For example:
smbtorture //localhost/data -Uuser%password RAW-BENCH-LOCK --unclist=unclist.txt --num-progs=32 -t60
The file unclist.txt should contain a list of shares in your cluster, in UNC format (//server/share). For example:
    //node1/data
    //node2/data
    //node3/data
    //node4/data
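For larger clusters the unclist can be generated rather than typed. This is a small sketch under the assumption that every node exports the share under the same name; make_unclist is a hypothetical helper name, and the node names are the ones from the example above.

```shell
#!/bin/sh
# Hypothetical helper: print one //node/share line per node.
# Usage: make_unclist SHARE NODE [NODE...]
make_unclist() {
    share="$1"
    shift
    for node in "$@"; do
        printf '//%s/%s\n' "$node" "$share"
    done
}
```

Redirecting the output, as in make_unclist data node1 node2 node3 node4 > unclist.txt, produces the file shown above.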
For NBENCH testing you need a client.txt file. A suitable file can be found in the dbench distribution at http://samba.org/ftp/tridge/dbench/
Setting up CTDB for clustered NFS
Configure CTDB as above and set it up to use public IP addresses. Verify that the CTDB cluster works.
Export the same directory from all nodes. Also make sure to specify the fsid export option so that all nodes present the same fsid to clients. Clients can get "upset" if the fsid on a mount suddenly changes.
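An export line with an explicit fsid might look like "/gpfs0/data *(rw,fsid=1235)", where the path and the fsid value are invented for illustration. To catch exports that lack the option, something like the following sketch could be run against each node's exports file; the function name exports_missing_fsid is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print every export line in the given file that has
# no explicit fsid= option. Comment lines and blank lines are skipped.
exports_missing_fsid() {
    awk '!/^[ \t]*(#|$)/ && !/fsid=/ { print }' "$1"
}
```

No output from exports_missing_fsid /etc/exports means every export carries an fsid. It does not check that the value is the same on all nodes, which still has to be compared separately.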
The NFS configuration file (/etc/sysconfig/nfs on Red Hat systems) must be edited so that statd keeps its state directory on shared storage instead of in a local directory. We must also make statd listen on a fixed port that is the same for all nodes in the cluster. If we don't specify a fixed port, the statd port will change during failover, which causes problems on some clients.
This file should look something like:
    CTDB_MANAGES_NFS=yes
    NFS_TICKLE_SHARED_DIRECTORY=/gpfs0/nfs-tickles
    STATD_PORT=595
    STATD_OUTGOING_PORT=596
    MOUNTD_PORT=597
    RQUOTAD_PORT=598
    LOCKD_UDPPORT=599
    LOCKD_TCPPORT=599
    STATD_SHARED_DIRECTORY=/gpfs0/nfs-state
    NFS_HOSTNAME="ctdb"
    STATD_HOSTNAME="$NFS_HOSTNAME -P "$STATD_SHARED_DIRECTORY/$PUBLIC_IP" -H /etc/ctdb/statd-callout -p 97"
    RPCNFSDARGS="-N 4"
The CTDB_MANAGES_NFS line tells the events scripts that CTDB is to manage startup and shutdown of the NFS and NFSLOCK services. With this set to yes, CTDB will start/stop/restart these services as required.
STATD_SHARED_DIRECTORY is the shared directory where statd and the statd-callout script expect to find the state variables and the lists of clients to notify.
The ip address specified should be the public address of this node.
The reason to specify the port used by the lockmanager is so that the port used by a public address will not change during address failover/failback since this can confuse some clients.
NFS_TICKLE_SHARED_DIRECTORY is where ctdb will store information about which clients have established tcp connections to the cluster. This information is used during failover of ip addresses. This allows the node that takes over an ip address to very quickly 'tickle' and reset any tcp connections for the ip address it took over. The reason to do this is to improve the speed at which a client will detect that the tcp connection for NFS needs to be reestablished and to speed up recovery in the client.
NFS_HOSTNAME is the name that the nfs server will use for the public addresses. This should be the same as the name samba uses. This name must be resolvable into the ip addresses used for public addresses.
The RPCNFSDARGS line is used to disable support for NFSv4 which is not yet supported by CTDB.
Since CTDB will manage and start/stop/restart the nfs and the nfslock services, you must disable them in chkconfig.
    chkconfig nfs off
    chkconfig nfslock off
Statd state directories
For each node, create a state directory on shared storage where that node's local statd daemon can keep its state information. The directory must be on shared storage because a node that takes over an IP address needs to find the list of monitored clients to notify.
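Creating these directories can be scripted. The sketch below assumes one subdirectory per public IP address under STATD_SHARED_DIRECTORY, a layout inferred from the "$STATD_SHARED_DIRECTORY/$PUBLIC_IP" path in the sample configuration above rather than stated outright. The function name create_statd_dirs is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: create one statd state directory per public address.
# Usage: create_statd_dirs BASE_DIR IP [IP...]
create_statd_dirs() {
    base="$1"
    shift
    for ip in "$@"; do
        mkdir -p "$base/$ip" || return 1
    done
    return 0
}
```

With the example addresses used earlier, this could be run once from any node as create_statd_dirs /gpfs0/nfs-state 192.168.1.1 192.168.1.2 192.168.2.1 192.168.2.2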
CTDB clustering for NFS relies on two event scripts /etc/ctdb/events.d/60.nfs and /etc/ctdb/events.d/61.nfstickle. These two scripts are provided by the RPM package and there should not be any need to change them.
Never mount the same NFS share on a client from two different nodes in the cluster at the same time. The client-side caching in NFS is very fragile and relies on an object being accessible through only one path at a time.