My-CTDB

Revision as of 14:13, 23 February 2020 by StefanKania

Introduction

CTDB is a clustered database component in clustered Samba that provides a high-availability load-sharing CIFS server cluster.

The main functions of CTDB are:

  • Provide a clustered version of the TDB database with automatic rebuild/recovery of the databases upon node failures.
  • Monitor nodes in the cluster and services running on each node.
  • Manage a pool of public IP addresses that are used to provide services to clients. Alternatively, CTDB can be used with LVS.

Combined with a cluster filesystem, CTDB provides a full high-availability (HA) environment for services such as clustered Samba and NFS.

Setting up CTDB

After setting up the cluster filesystem you can set up a CTDB-cluster. To use CTDB you have to install the ctdb-package for your distribution. After installing the package with all its dependencies you will find a directory /etc/ctdb. Inside this directory you need some configuration files for CTDB.

Let's take a look at the files needed for configuring CTDB.

File                          Content
/etc/ctdb/ctdb.conf           Basic configuration
/etc/ctdb/script.options      Setting options for event-scripts
/etc/ctdb/nodes               All IP-addresses of all nodes
/etc/ctdb/public_addresses    Dynamic IP-addresses for all nodes

The ctdb.conf file

The ctdb.conf file has changed a lot from the old configuration style (< Samba 4.9). It is no longer used to configure the different services managed by CTDB. At the moment the only setting you have to make in this file is the recovery lock file. All nodes use this file to check whether they can lock files inside the cluster for exclusive use. If you don't use a recovery lock file, your cluster can run into a split-brain situation.

By default the 'recovery lock' is NOT set. You should not use CTDB without a recovery lock unless you know what you are doing. The setting must point to a file inside your mounted gluster-volume. To use the recovery lock, enter the following line into the [cluster] section of /etc/ctdb/ctdb.conf on both nodes:

recovery lock = /glusterfs/ctdb.lock
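Before starting CTDB it can help to confirm that the lock path really sits on the mounted cluster filesystem. The following is only a minimal sketch and not part of CTDB: the helper name reclock_path is made up for illustration, and the sketch assumes the setting is written on a single line as shown above and that the lock file lies directly in the mount point (here /glusterfs).

```shell
#!/bin/sh
# Hypothetical helper (not shipped with CTDB): print the value of the
# "recovery lock" setting from a ctdb.conf-style file given as $1.
reclock_path() {
    sed -n 's/^[[:space:]]*recovery lock[[:space:]]*=[[:space:]]*//p' "$1"
}

# Location of the configuration file as used in this article.
conf=/etc/ctdb/ctdb.conf

if [ -r "$conf" ]; then
    lock=$(reclock_path "$conf")
    if [ -z "$lock" ]; then
        echo "WARNING: no recovery lock set in $conf" >&2
    elif mountpoint -q "$(dirname "$lock")"; then
        # mountpoint(1) from util-linux succeeds only for a mount point.
        echo "recovery lock $lock lies on a mounted filesystem"
    else
        echo "WARNING: $(dirname "$lock") is not a mount point" >&2
    fi
fi
```

Run it on both nodes. A warning usually means the gluster-volume is not mounted or the path in ctdb.conf contains a typo.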


The file script.options

All services that CTDB provides are started via special event scripts. In this file you can set options for those scripts. An example is shown in the file itself. For the service script 50.samba there is an option named CTDB_SAMBA_SKIP_SHARE_CHECK, which by default is set to no. This means that CTDB checks whether the path of every share exists; if a path is missing, the check fails and the node will not become healthy. But if you use the vfs-module glusterfs you will have no local path in the share-configuration. The share points to a directory on your gluster-volume, so CTDB can't check the path. So if you are going to use glusterfs you must set this option to yes, so that the check is skipped.

Because all options for all service-scripts can be set in this file, you don't have to change any of the service-scripts themselves. You will find more information on all options in the manpage (man ctdb-script.options).

The file nodes

CTDB must know all hosts belonging to its cluster. In this file you have to list the IPs from the heartbeat network of all nodes. The file must have the same content on all nodes. For the two-node example, just put the IPs of the two nodes into the file. Here you see the content of the file:

192.168.57.42
192.168.57.43

In most distributions the file doesn't exist, so you have to create it.
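Because the file is created by hand and must be identical on every node, a small check can catch typos early. This is a sketch under assumptions: the function name check_nodes_file is hypothetical, it expects exactly one IPv4 address per line, and it only checks the shape of each line, not whether the numbers are in range.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the file given as $1 is readable
# and every line looks like a dotted-quad IPv4 address.
check_nodes_file() {
    # grep -v selects lines that do NOT match the pattern;
    # -q only reports whether such a line exists.
    [ -r "$1" ] && ! grep -Eqv '^([0-9]{1,3}\.){3}[0-9]{1,3}$' "$1"
}

nodes=/etc/ctdb/nodes
if [ -e "$nodes" ]; then
    if check_nodes_file "$nodes"; then
        echo "$nodes: format looks fine"
        # The checksum makes it easy to compare the file between nodes.
        cksum "$nodes"
    else
        echo "$nodes: unexpected line found" >&2
    fi
fi
```

If the checksum printed on both nodes is the same, the files are identical.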


The file public_addresses

Every time CTDB starts it will assign an IP-address to each node in the CTDB-cluster. These must be IP-addresses from the production network.

After starting the cluster, CTDB takes care of those IP-addresses and gives an IP-address from this list to every CTDB-node. If a CTDB-node crashes, CTDB assigns the IP-address of the crashed node to another CTDB-node. So every IP-address from this file is always assigned to one of the nodes.

CTDB handles the failover for the services. If one node fails, its IP-address switches to one of the remaining nodes. All clients will then reconnect to this node. That's possible because all nodes have all session-information of all clients.

For each node you need a public_addresses-file. The files can differ between the nodes, depending on which subnet you would like to assign to the node. The example uses just one subnet, so both nodes have identical public_addresses-files. Here you see the content of the file:

192.168.56.101/24 enp0s8
192.168.56.102/24 enp0s8
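Each line of the file names an address with its netmask and the interface that should carry it, so a misspelled interface name is an easy mistake to make. The sketch below lists the pairs and checks that the interface exists on the local node. The function name list_public_addresses is hypothetical, and the sketch assumes one interface per line as in the example above.

```shell
#!/bin/sh
# Hypothetical helper: print "address interface" pairs from a
# public_addresses-style file ($1), skipping incomplete lines.
list_public_addresses() {
    awk 'NF >= 2 { print $1, $2 }' "$1"
}

pa=/etc/ctdb/public_addresses
if [ -r "$pa" ]; then
    list_public_addresses "$pa" | while read -r addr iface; do
        # On Linux every network interface appears under /sys/class/net.
        if [ -e "/sys/class/net/$iface" ]; then
            echo "$addr -> $iface (interface present)"
        else
            echo "$addr -> $iface (interface missing on this node)" >&2
        fi
    done
fi
```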

Starting CTDB the first time

Once you have configured the CTDB-service on both nodes, you are ready for the first start. To see what happens during the start, open another terminal and run tail -f /var/log/ctdb/ctdb.log to follow the messages. First start one node with systemctl restart ctdb and look at the log-messages, then start the second node while keeping an eye on the log.

2020/02/11 17:32:53.778637 ctdbd[1926]: monitor event OK - node re-enabled
2020/02/11 17:32:53.778831 ctdbd[1926]: Node became HEALTHY. Ask recovery master to reallocate IPs 
2020/02/11 17:32:53.779152 ctdb-recoverd[1966]: Node 0 has changed flags - now 0x0  was 0x2
2020/02/11 17:32:54.575970 ctdb-recoverd[1966]: Unassigned IP 192.168.56.102 can be served by this node
2020/02/11 17:32:54.576047 ctdb-recoverd[1966]: Unassigned IP 192.168.56.101 can be served by this node
2020/02/11 17:32:54.576254 ctdb-recoverd[1966]: Trigger takeoverrun
2020/02/11 17:32:54.576780 ctdb-recoverd[1966]: Takeover run starting
2020/02/11 17:32:54.594527 ctdbd[1926]: Takeover of IP 192.168.56.102/24 on interface enp0s8
2020/02/11 17:32:54.595551 ctdbd[1926]: Takeover of IP 192.168.56.101/24 on interface enp0s8
2020/02/11 17:32:54.843175 ctdb-recoverd[1966]: Takeover run completed successfully

Here you can see that the node has taken over both dynamic IP-addresses. You can check this with ip a l enp0s8.
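On a busy log it can be easier to look only at the lines that deal with IP takeover. The following sketch filters them out of the log file. The function name takeover_events is made up for illustration, and the pattern is based only on the message texts shown in the log excerpt above.

```shell
#!/bin/sh
# Hypothetical helper: print only the takeover-related lines of the
# CTDB log file given as $1.
takeover_events() {
    grep -E 'Takeover (run|of IP)|Release of IP' "$1"
}

log=/var/log/ctdb/ctdb.log
if [ -r "$log" ]; then
    # Show the five most recent takeover-related messages.
    takeover_events "$log" | tail -n 5
fi
```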

Before you start the second node, take a look at the CTDB-status with ctdb status. You will see that the node you have just started has the status OK, while the other node has the status DISCONNECTED|UNHEALTHY|INACTIVE.

root@cluster-01:~# ctdb status
Number of nodes:2
pnn:0 192.168.57.42    OK (THIS NODE)
pnn:1 192.168.57.43    DISCONNECTED|UNHEALTHY|INACTIVE
Generation:1636031659
Size:1
hash:0 lmaster:0
Recovery mode:NORMAL (0)
Recovery master:0
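If you want to check the node states from a script rather than by eye, the pnn lines of the status output can be counted. This is a rough sketch, assuming the output layout shown above, where the state is the third whitespace-separated field of each pnn line. The function name count_not_ok is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `ctdb status` output on stdin and print the
# number of nodes whose state is anything other than OK.
count_not_ok() {
    awk '/^pnn:/ && $3 != "OK" { n++ } END { print n + 0 }'
}

# Only call ctdb if it is installed on this machine.
if command -v ctdb >/dev/null 2>&1; then
    ctdb status | count_not_ok
fi
```

For the status shown above, with only the first node started, the sketch prints 1.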


Now you can start CTDB on the second node with systemctl restart ctdb. Inside the log on the first node you will see the message that the takeover was successful. Here you see the last lines from the log:

2020/02/11 17:51:49.964668 ctdb-recoverd[6598]: Takeover run starting
2020/02/11 17:51:50.004374 ctdb-recoverd[6598]: Takeover run completed successfully
2020/02/11 17:51:59.061780 ctdb-recoverd[6598]: Reenabling recoveries after timeout
2020/02/11 17:52:04.632267 ctdb-recoverd[6598]: Node 1 has changed flags - now 0x0  was 0x2
2020/02/11 17:52:04.989395 ctdb-recoverd[6598]: Takeover run starting
2020/02/11 17:52:05.008763 ctdbd[6554]: Release of IP 192.168.56.102/24 on interface enp0s8  node:1
2020/02/11 17:52:05.154588 ctdb-recoverd[6598]: Takeover run completed successfully

If you see a lot of messages like the following, check if the gluster-volume is mounted correctly and if the recovery lock option in /etc/ctdb/ctdb.conf is set correctly:

2020/02/11 17:51:00.883523 ctdbd[6554]: CTDB_WAIT_UNTIL_RECOVERED
2020/02/11 17:51:00.883630 ctdbd[6554]: ../../ctdb/server/ctdb_monitor.c:324 wait for pending recoveries to end. Wait one more second.

A look at the status will show both nodes OK:

root@cluster-01:~# ctdb status
Number of nodes:2
pnn:0 192.168.57.42    OK (THIS NODE)
pnn:1 192.168.57.43    OK
Generation:101877096
Size:2
hash:0 lmaster:0
hash:1 lmaster:1
Recovery mode:NORMAL (0)
Recovery master:0

Whenever you stop a node, the other node will take over the IP-address assigned by CTDB. CTDB is now running, but you still have no service configured. You should only continue configuring the services if both nodes are healthy.

As a test you can stop CTDB on one node, and you will see that the other node gets the IP from the stopped node. As soon as you restart the node, the IP-address will be assigned to it again and both nodes will have the status OK again.
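To follow this failover test, you can ask CTDB which node currently hosts which public address with ctdb ip all. The sketch below extracts the addresses of a single node from that output. The function name ips_on_node is hypothetical, and the sketch assumes that each data line has the form "address node-number", with any header line containing more than two fields.

```shell
#!/bin/sh
# Hypothetical helper: read `ctdb ip` output on stdin and print the
# public addresses currently hosted by node number $1.
ips_on_node() {
    awk -v pnn="$1" 'NF == 2 && $2 == pnn { print $1 }'
}

# Only call ctdb if it is installed on this machine.
if command -v ctdb >/dev/null 2>&1; then
    echo "Addresses on node 0:"
    ctdb ip all | ips_on_node 0
fi
```

Run it before stopping a node, after stopping it, and again after the restart. The list for the remaining node should grow and then shrink again.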

The next step will be setting up Samba.