GlusterFS

Revision as of 17:17, 22 February 2020 by StefanKania

Fundamentals

GlusterFS is a free and open source scalable filesystem. It can be used for cloud storage or to store data in a local network, and it lets you set up an active-active filesystem cluster with failover and load balancing via DNS round-robin. Together with CTDB you can build a fileserver for a network with the following advantages:

  • Expandable without downtime
  • Mount Gluster volumes via the network
  • Posix ACL support
  • Different configurations possible (depending on your needs)
  • Self-healing
  • Snapshot support if thinly provisioned LVM2 volumes are used for the bricks

The different configurations available are:

  • Replicated Volume
  • Distributed Volume
  • Striped Volume
  • Replicated-Distributed Volume
  • Dispersed Volume

To read more about the different configurations see:



What you need

  • Two hosts with two network cards
  • An empty partition of 2GB to create the volume on each host
  • Two IP addresses from your production network
  • Two IP addresses for the heartbeat network
  • The GlusterFS packages version 7.x

Hostnames and IPs

The following two tables list the IP addresses used on both hosts.

Production network

Clients connect to the Gluster cluster through an IP address from the production network.

Hostname    IP-address      Network name
cluster-01  192.168.56.101  example.net
cluster-02  192.168.56.102  example.net

Heartbeat network

The heartbeat network is used only for the communication between the Gluster nodes.

Hostname  IP-address      Network name
c-01      192.168.57.101  heartbeat.net
c-02      192.168.57.102  heartbeat.net

The mountpoints

You need two mountpoints: one for the physical brick and one for the volume.

Mountpoint  What to mount
/gluster    The brick on each node
/glusterfs  The volume on each node

Setting up the LVM-partition

The first step is to prepare the storage for the replicated Gluster volume with two nodes. A 2 GB partition is used as an example.

root@cluster-01:~# fdisk /dev/sdc

root@cluster-01:~# apt install lvm2 thin-provisioning-tools

root@cluster-01:~# pvcreate /dev/sdc1
  Physical volume "/dev/sdc1" successfully created.

root@cluster-01:~# vgcreate glustergroup /dev/sdc1
  Volume group "glustergroup" successfully created

root@cluster-01:~# lvcreate -L 1950M -T glustergroup/glusterpool
  Using default stripesize 64,00 KiB
  Rounding up size to full physical extent 1,91 GiB
  Logical volume "glusterpool" created.

root@cluster-01:~# lvcreate -V 1900M -T glustergroup/glusterpool -n glusterv1
  Using default stripesize 64,00 KiB.
  Logical volume "glusterv1" created.
 
root@cluster-01:~# mkfs.xfs /dev/glustergroup/glusterv1

root@cluster-01:~# mkdir /gluster

root@cluster-01:~# mount /dev/glustergroup/glusterv1 /gluster

root@cluster-01:~# echo /dev/glustergroup/glusterv1 /gluster xfs defaults 0 0 >> /etc/fstab

root@cluster-01:~# mkdir /gluster/brick

Do all the steps on both nodes.
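Because the same commands have to be repeated on the second node, it can help to collect them in a small shell function. The following is only a sketch under a few assumptions: the names run, prepare_brick and DRY_RUN are made up for this example, the partition has already been created with fdisk, and the packages are installed. By default the function only prints the commands from the listing above instead of executing them. The /etc/fstab entry is left out and still has to be added by hand.

```shell
# Print a command; execute it only when DRY_RUN is set to 0.
# (run and DRY_RUN are hypothetical helpers for this sketch.)
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# Repeat the brick setup from the listing above for one partition.
prepare_brick() {
    part="$1"    # for example /dev/sdc1
    run pvcreate "$part" &&
    run vgcreate glustergroup "$part" &&
    run lvcreate -L 1950M -T glustergroup/glusterpool &&
    run lvcreate -V 1900M -T glustergroup/glusterpool -n glusterv1 &&
    run mkfs.xfs /dev/glustergroup/glusterv1 &&
    run mkdir -p /gluster &&
    run mount /dev/glustergroup/glusterv1 /gluster &&
    run mkdir -p /gluster/brick
}
```

Calling prepare_brick /dev/sdc1 prints the eight commands for review. Running DRY_RUN=0 prepare_brick /dev/sdc1 as root executes them and stops at the first command that fails.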

Creating the peer pool

Before you can create the volume you have to set up a peer pool by adding the two hosts to it as peers. The next listing shows the command to add the second Gluster node to the pool. Run it on the first Gluster host:

root@cluster-01:~# gluster peer probe c-02
peer probe: success. 

If you try to add the peer and you get one of the following error messages:

root@cluster-01:~# gluster peer probe c-02
Connection failed. Please check if gluster daemon is operational.

root@cluster-01:~# gluster peer probe c-02
peer probe: failed: Probe returned with Transport endpoint is not connected

The first error message means that glusterd is not running on the host where you run the probe command. Restart the daemon:

systemctl restart glusterd

The second error message means that the daemon is not running on the peer you are trying to add to the pool. Restart glusterd on the other node.

Once you have added node c-02 from node c-01, add host c-01 to the trusted pool from node c-02:

root@cluster-02:~# gluster peer probe c-01
peer probe: success. 

Now you can check the status of each node and list all nodes with the gluster command:

root@cluster-01:~# gluster peer status
Number of Peers: 1

Hostname: c-02
Uuid: aca7d361-51df-4d1f-9b0f-4cf494029f21
State: Peer in Cluster (Connected) 
root@cluster-02:~# gluster peer status
Number of Peers: 1
Other names:
c-02

Hostname: c-01.heartbeat.net
Uuid: adafbf93-e716-4d99-bf89-e8044d57e3aa
State: Peer in Cluster (Connected)
Other names:
c-01
root@cluster-02:~# gluster pool list
UUID					Hostname          	State
adafbf93-e716-4d99-bf89-e8044d57e3aa	c-01.heartbeat.net Connected 
aca7d361-51df-4d1f-9b0f-4cf494029f21	localhost          Connected 

On each host you will find the information about the peers in /var/lib/glusterd/peers/<UUID>.

Now all the peers you need for the Gluster volume have been added to the pool.
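If you want to check the pool from a script, the output of gluster pool list can be parsed. The following is a minimal sketch, assuming the three-column layout shown above (header line, then UUID, hostname and state per node); the function name all_peers_connected is made up for this example.

```shell
# Read the output of "gluster pool list" on stdin and succeed only if
# at least one node is listed and every node is in state "Connected".
all_peers_connected() {
    awk 'NR > 1 && NF > 0 {
             nodes++                          # count every listed node
             if ($NF != "Connected") bad = 1  # last column is the state
         }
         END { if (bad || nodes == 0) exit 1 }'
}
```

A possible call is gluster pool list | all_peers_connected && echo "all peers connected".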

The Gluster volume

The next step is creating the volume. But before we create a volume of two bricks, let me explain a few things. If you create the volume with just two bricks as parameters, you will see a warning that it is not a good idea to create a replicated volume with only two bricks, because you will not be able to set up a quorum. In a production environment you should always build a replicated volume from an odd number of nodes, because of the quorum.

What is the quorum and why is it so important?

Suppose you set up three nodes and lose the connection between node-1 and the other two nodes (node-2 and node-3), while the clients from the production network can still reach all three nodes and glusterd is still running on all of them. One client can connect to node-1 and change a file. Another client can connect to the rest of the Gluster cluster (node-2 and node-3) and change the same file, because node-2 and node-3 can no longer communicate with node-1 about open files.

As soon as the connection is re-established you will have a split brain in your cluster. If you configure a quorum of 51%, the two nodes that can still communicate (node-2 and node-3) meet the quorum, but the other node (node-1) does not. The Gluster daemon on node-1 will then stop accepting changes from clients: the node will either stop the service or switch to read-only mode.

With two nodes you can't set up a quorum, because each node is exactly 50% of the cluster. Only with an odd number of nodes can you configure a working quorum. That's why you get the warning when creating the volume with two nodes. But it's just a warning.
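The arithmetic behind the quorum can be written down in a few lines. This is only an illustration of the "more than 50%" rule described above, not the check Gluster performs internally; the function name has_quorum is made up for this example.

```shell
# Succeed if a partition that still sees "reachable" out of "total" nodes
# holds a strict majority (more than 50%) of the cluster.
has_quorum() {
    reachable="$1"
    total="$2"
    [ $((reachable * 2)) -gt "$total" ]
}
```

With three nodes, has_quorum 2 3 succeeds for node-2 and node-3, while has_quorum 1 3 fails for the isolated node-1. With two nodes, has_quorum 1 2 fails on both sides, which is why no side can safely continue on its own.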

CTDB and quorum

This problem applies to CTDB too. In the future the developers plan to introduce an optional quorum where nodes will have to be connected to >50% of configured nodes before they can join the cluster.

With 2 nodes it is very easy to get a stupid form of split brain. Node A is shut down and node B is active, updating information in persistent databases (perhaps id-mapping info?). Node B is shut down and node A is restarted. Now node A's old database is in use for a while. When node B is restarted then some databases from A might be used and some from B - it depends on the sequence numbers.

Creating the volume

Now choose one of the nodes to create the volume on. It doesn't matter which node you choose:

root@cluster-01:~# gluster volume create gv0 replica 2 c-01:/gluster/brick \
    c-02:/gluster/brick
Replica 2 volumes are prone to split-brain. Use Arbiter or Replica 3 to avoid \
   this. See: http://docs.gluster.org/en/latest/Administrator%20Guide/Split%20brain%20and%20ways%20to%20deal%20with%20it/.
Do you still want to continue?
(y/n) y
volume create: gv0: success: please start the volume to access data

This is the warning I mentioned before. By typing "y" you can create the volume anyway.

Now let's take a look at the setup and the status of the volume. Here you can see the result:

root@cluster-01:~# gluster v info

Volume Name: gv0
Type: Replicate
Volume ID: 5d1e1031-5474-48e9-9451-1dbeb5ebb79e
Status: Created
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: c-01:/gluster/brick
Brick2: c-02:/gluster/brick
Options Reconfigured:
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on
performance.client-io-threads: off
root@cluster-01:~# gluster v status
Volume gv0 is not started

With gluster v info or gluster volume info you will see the setup of the volume and, at the end of the output, a list of the options that are set. The command gluster v status tells you that the volume is not started. You have to start the volume before you can use it. Here you see the command and the new status:

root@cluster-01:~# gluster v start gv0
volume start: gv0: success
root@cluster-01:~# gluster v status gv0
Status of volume: gv0
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick c-01:/gluster/brick                   49152     0          Y       1583 
Brick c-02:/gluster/brick                   49152     0          Y       9830 
Self-heal Daemon on localhost               N/A       N/A        Y       1604 
Self-heal Daemon on c-02                    N/A       N/A        Y       9851 
 
Task Status of Volume gv0
------------------------------------------------------------------------------
There are no active volume tasks

Now the volume is running and ready to use.
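The Online column of the status output can also be checked from a script. The sketch below assumes the column layout shown above, with every brick on a single line; very long brick paths may be wrapped by gluster and would need extra handling. The function name all_bricks_online is made up for this example.

```shell
# Read the output of "gluster volume status <volume>" on stdin and succeed
# only if at least one brick is listed and every brick shows "Y" in the
# Online column (the second-to-last field of each "Brick ..." line).
all_bricks_online() {
    awk '$1 == "Brick" {
             bricks++
             if ($(NF - 1) != "Y") bad = 1
         }
         END { if (bad || bricks == 0) exit 1 }'
}
```

A possible call is gluster v status gv0 | all_bricks_online || echo "at least one brick is offline".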

Setting some Samba-Options

To get better performance from Gluster when connecting via SMB, you can set some options on your Gluster volume. Starting with Gluster version 6, most of these options are collected in a group of options.

Starting with Gluster 7, the group-options file for Samba is no longer part of the Debian packages. So I created a new group-options file, my-samba:

cluster.self-heal-daemon=enable
performance.cache-invalidation=on
server.event-threads=4
client.event-threads=4
performance.parallel-readdir=on
performance.readdir-ahead=on
performance.nl-cache-timeout=600
performance.nl-cache=on
network.inode-lru-limit=200000
performance.md-cache-timeout=600
performance.stat-prefetch=on
performance.cache-samba-metadata=on
features.cache-invalidation-timeout=600
features.cache-invalidation=on
nfs.disable=on
cluster.data-self-heal=on
cluster.metadata-self-heal=on
cluster.entry-self-heal=on
cluster.force-migration=disable

You have to put the file in /var/lib/glusterd/groups/; the name of the file is my-samba. If you find a file named samba in this directory, then you have the original file from the glusterfs-server package.

You only have to set these options on one of your nodes. Here you see the command to set and list the new options:

root@cluster-02:~# gluster v set gv0 group my-samba
volume set: success 
root@cluster-02:~# gluster v info

Volume Name: gv0
Type: Replicate
Volume ID: 5d1e1031-5474-48e9-9451-1dbeb5ebb79e
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: c-01:/gluster/brick
Brick2: c-02:/gluster/brick
Options Reconfigured:
cluster.force-migration: disable
cluster.entry-self-heal: on
cluster.metadata-self-heal: on
cluster.data-self-heal: on
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.cache-samba-metadata: on
performance.stat-prefetch: on
performance.md-cache-timeout: 600
network.inode-lru-limit: 200000
performance.nl-cache: on
performance.nl-cache-timeout: 600
performance.readdir-ahead: on
performance.parallel-readdir: on
client.event-threads: 4
server.event-threads: 4
performance.cache-invalidation: on
cluster.self-heal-daemon: enable
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on
performance.client-io-threads: off

As you can see, all the options are set. If you would like to set a single option you can do it with gluster v set <volume-name> <option> <value>. To reset an option to its original value use gluster v reset <volume-name> <option>. To see all options use gluster v get <volume-name> all.
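If you prefer not to install a group file, the same key=value lines can be turned into individual gluster volume set calls, with the option name and the value passed as separate arguments. The following sketch only prints the commands so that you can review them first; the function name group_to_commands is made up for this example.

```shell
# Read "key=value" lines (the format of a group-options file) on stdin and
# print one "gluster volume set" command per option for the given volume.
group_to_commands() {
    volume="$1"
    while IFS='=' read -r key value; do
        [ -n "$key" ] || continue    # skip empty lines
        printf 'gluster volume set %s %s %s\n' "$volume" "$key" "$value"
    done
}
```

For example, group_to_commands gv0 < /var/lib/glusterd/groups/my-samba prints one command per option. Piping that output to sh would execute the commands.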

Mounting the gluster-volume

To use the volume you have to mount it, either on a local mountpoint or over the network on a host that has the glusterfs-client installed. To mount the volume on a local mountpoint on your hosts, the system uses FUSE. The first step is to mount the volume manually. Next you will see how to mount the volume on both nodes:

root@cluster-02:~# mount -t glusterfs cluster-02:/gv0 /glusterfs
root@cluster-02:~# mount | grep glusterfs
cluster-02:/gv0 on /glusterfs type fuse.glusterfs 
(rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)

root@cluster-01:~# mount -t glusterfs cluster-01:/gv0 /glusterfs
root@cluster-01:~# mount | grep glusterfs
cluster-01:/gv0 on /glusterfs type fuse.glusterfs 
(rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)

As you can see, each host uses its own name as the source. But what happens if you mount the volume over the network on a pure glusterfs-client and the host you used for mounting crashes? The client will fail over to the next node and reconnect, because during the mount process the client receives a list of all nodes holding the volume, so it knows all nodes of the volume.

The system uses fuse.mount to mount the volume. On most Debian-based distributions there is a problem using fuse.mount with network filesystems, because Debian first tries to mount the FUSE volume and only then starts the network, which cannot work. So you need a different way to mount the volume when your glusterfs-client starts. A good solution is to create a systemd-script for mounting the volume. Here you see a systemd-script to mount the volume:

[Unit]
Description=Data dir
# Mount only after the network is online and the Gluster service has started
After=network-online.target glusterfs-server.service
Requires=network-online.target

[Mount]
# A [Mount] section takes What=, Where=, Type= and Options=;
# Exec*= and Restart= settings belong to service units
What=cluster-01:/gv0
Where=/glusterfs
Type=glusterfs
Options=defaults,acl

[Install]
WantedBy=multi-user.target

It's very important that the name of the script matches the path of the mountpoint. If your mountpoint is not in the filesystem root but, for example, in /vol/glusterfs, you must use the filename vol-glusterfs.mount for your script.
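systemd derives the file name from the mountpoint path by dropping the leading slash and replacing the remaining slashes with dashes. The authoritative way to get the name is systemd-escape -p --suffix=mount /vol/glusterfs. The function below is only a simplified sketch of that rule for plain paths without dashes or other special characters; the name mount_unit_name is made up for this example.

```shell
# Derive the systemd mount unit name for a simple mountpoint path.
# Simplified: does not escape dashes or other special characters.
mount_unit_name() {
    path="${1#/}"       # drop the leading slash
    path="${path%/}"    # drop a trailing slash, if any
    if [ -z "$path" ]; then
        printf '%s\n' '-.mount'    # the root filesystem is a special case
    else
        printf '%s.mount\n' "$(printf '%s' "$path" | tr '/' '-')"
    fi
}
```

For the mountpoints used here, mount_unit_name /glusterfs prints glusterfs.mount and mount_unit_name /vol/glusterfs prints vol-glusterfs.mount.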

To try the script, unmount the Gluster volume with umount /glusterfs and run systemctl start glusterfs.mount; afterwards check with mount whether the volume is mounted. Then enable the script with systemctl enable glusterfs.mount, so that the volume is mounted every time you start your system.

Do all the steps on both nodes, so the volume will be mounted on both nodes every time you reboot the system.

Now you can use the volume.

What should you do next

Test the volume by writing files, deactivating a node and seeing what happens when the host comes back after you have written some files to the remaining node. To test it, change to the directory /glusterfs on one node, create some files and directories, and check that the same entries appear on the other node in /glusterfs. This will show you that the cluster is working.
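A simple replication test can be scripted with two small helpers. This is only a sketch: the function names write_probe and check_probe and the file name pattern probe-<nodename> are made up for this example.

```shell
# Write a small probe file named after this node into the given directory
# and print the path of the file.
write_probe() {
    dir="$1"
    file="$dir/probe-$(uname -n)"
    date > "$file" && printf '%s\n' "$file"
}

# Succeed if a non-empty probe file written by the given node exists
# in the given directory.
check_probe() {
    dir="$1"
    node="$2"
    [ -s "$dir/probe-$node" ]
}
```

Run write_probe /glusterfs on the first node and then check_probe /glusterfs cluster-01 on the second node. If the check succeeds, the file has been replicated through the volume.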