GlusterFS
Fundamentals
GlusterFS is a free and open-source scalable filesystem. It can be used for cloud storage or to store data in a local network, and it can be used to set up an active-active filesystem cluster with failover and load balancing via DNS round-robin. Together with CTDB it is possible to build a fileserver for a network with the following advantages:
- Expandable without downtime
- Mount Gluster volumes via the network
- Posix ACL support
- Different configurations possible (depending on your needs)
- Self-healing
- Support for snapshots if LVM2 thin provisioning is used for the bricks
The different configurations available are:
- Replicated Volume
- Distributed Volume
- Striped Volume
- Replicated-Distributed Volume
- Dispersed Volume
To read more about the different configurations, see the Gluster documentation at https://docs.gluster.org/.
This article is part of the CTDB setup, so it only shows how to set up a replicated volume to be used with CTDB. The setup will be a two-node replicated volume with 2 GB of disk space, so it is easy to reproduce.
What you need
- Two hosts with two network cards
- An empty partition of 2GB to create the volume on each host
- Two IP addresses from your production network
- Two IP addresses for the heartbeat network
- The GlusterFS packages, version 7.x (see the installation example below)
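The following is only a sketch of how the packages could be installed on a Debian-based system; the package and service names (glusterfs-server, glusterd) may differ depending on your distribution and on which repository you use.

# install the GlusterFS server packages (run as root on both nodes)
apt install glusterfs-server
# make sure the Gluster management daemon is running and starts at boot
systemctl enable --now glusterd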
Hostnames and IPs
The following two tables show the IP addresses used on both hosts.
Production network
When a client connects to the Gluster cluster, an IP address from the production network is used.
Hostname | IP-address | Network name |
---|---|---|
cluster-01 | 192.168.56.101 | example.net |
cluster-02 | 192.168.56.102 | example.net |
Heartbeat network
The heartbeat network is used only for the communication between the Gluster nodes. If you do not run DNS for this network, the names can be resolved locally, as shown in the example after the table.
Hostname | IP-address | Network name |
---|---|---|
c-01 | 192.168.57.101 | heartbeat.net |
c-02 | 192.168.57.102 | heartbeat.net |
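The following /etc/hosts entries are just a sketch using the addresses from the tables above; they have to be added on both nodes if you do not run a DNS zone for heartbeat.net.

# /etc/hosts entries for the heartbeat network (both nodes)
192.168.57.101   c-01.heartbeat.net   c-01
192.168.57.102   c-02.heartbeat.net   c-02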
The mountpoints
You need two mountpoints, one for the physical brick and one for the volume; they can be created as shown in the example after the table.
Mountpoint | What to mount |
---|---|
/gluster | The brick on each node |
/glusterfs | For the volume on each node |
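If the mountpoints do not exist yet, they can simply be created on both nodes. This is a minimal sketch; the brick mountpoint /gluster is also created during the LVM setup below.

# create the mountpoints for the brick and the volume (run on both nodes)
mkdir -p /gluster /glusterfs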
Setting up the LVM-partition
Be sure that you are working with the right partition; you will lose all data if you choose the wrong partition.
The first step is setting up the replicated Gluster volume with two nodes. As an example, a partition with 2 GB is used.
root@cluster-01:~# fdisk /dev/sdc
root@cluster-01:~# apt install lvm2 thin-provisioning-tools
root@cluster-01:~# pvcreate /dev/sdc1
  Physical volume "/dev/sdc1" successfully created.
root@cluster-01:~# vgcreate glustergroup /dev/sdc1
  Volume group "glustergroup" successfully created
root@cluster-01:~# lvcreate -L 1950M -T glustergroup/glusterpool
  Using default stripesize 64,00 KiB
  Rounding up size to full physical extent 1,91 GiB
  Logical volume "glusterpool" created.
root@cluster-01:~# lvcreate -V 1900M -T glustergroup/glusterpool -n glusterv1
  Using default stripesize 64,00 KiB.
  Logical volume "glusterv1" created.
root@cluster-01:~# mkfs.xfs /dev/glustergroup/glusterv1
root@cluster-01:~# mkdir /gluster
root@cluster-01:~# mount /dev/glustergroup/glusterv1 /gluster
root@cluster-01:~# echo /dev/glustergroup/glusterv1 /gluster xfs defaults 0 0 >> /etc/fstab
root@cluster-01:~# mkdir /gluster/brick
Do all the steps on both nodes.
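To check the result, you can, for example, look at the logical volumes and the mounted brick filesystem on each node; the commands below are only a quick verification sketch.

# show the thin pool and the thin volume in the volume group glustergroup
lvs glustergroup
# check that the brick filesystem is mounted with roughly 1.9 GB of space
df -h /gluster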
Creating the peer pool
Make sure that you are using the hostnames from the heartbeat network, so that the communication between the nodes uses the heartbeat network.
Before you can create the volume, you have to set up a peer pool by adding the two hosts as peers to the pool. The next listing shows the command to add the second Gluster node to the pool. You have to run this on the first Gluster host:
root@cluster-01:~# gluster peer probe c-02
peer probe: success.
If you try to add the peer, you may get one of the following error messages:
root@cluster-01:~# gluster peer probe c-02
Connection failed. Please check if gluster daemon is operational.

root@cluster-01:~# gluster peer probe c-02
peer probe: failed: Probe returned with Transport endpoint is not connected
The first error message indicates that glusterd is not running on the host on which you are trying to add the peer. Restart the daemon:
systemctl restart glusterd
The second error message indicates that the daemon is not running on the peer you are trying to add to the pool. Restart glusterd on the other node.
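A quick way to check this on each node is to query the status of glusterd before restarting it; this sketch uses the service name glusterd as in the rest of this article.

# check whether the Gluster management daemon is running
systemctl status glusterd
# if it is not active, restart it
systemctl restart glusterd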
Once you have added the node c-02 on the node c-01, add the host c-01 to the trusted pool on node c-02:
root@cluster-02:~# gluster peer probe c-01
peer probe: success.
Now you can check the status of each node and take a look at the list of all nodes with the gluster command:
root@cluster-01:~# gluster peer status
Number of Peers: 1

Hostname: c-02
Uuid: aca7d361-51df-4d1f-9b0f-4cf494029f21
State: Peer in Cluster (Connected)
root@cluster-02:~# gluster peer status
Number of Peers: 1

Hostname: c-01.heartbeat.net
Uuid: adafbf93-e716-4d99-bf89-e8044d57e3aa
State: Peer in Cluster (Connected)
Other names:
c-01
root@cluster-02:~# gluster pool list
UUID                                    Hostname            State
adafbf93-e716-4d99-bf89-e8044d57e3aa    c-01.heartbeat.net  Connected
aca7d361-51df-4d1f-9b0f-4cf494029f21    localhost           Connected
On each host you will find the information about the peers in /var/lib/glusterd/peers/<UUID>.
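For example, you can list and display one of these files; the output in the comments below is only a rough sketch (using the UUID from the listing above), and the exact keys can differ between GlusterFS versions.

# list the peer files and look at one of them (here on cluster-01)
ls /var/lib/glusterd/peers/
cat /var/lib/glusterd/peers/aca7d361-51df-4d1f-9b0f-4cf494029f21
# the file contains the peer's UUID, its state and its hostnames, roughly:
# uuid=aca7d361-51df-4d1f-9b0f-4cf494029f21
# state=3
# hostname1=c-02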
Now all the peers you need for the Gluster volume have been added to the pool.
The Gluster volume
The next step is creating the volume. But before we create the volume out of two bricks, let me explain some things. If you start creating the volume and give just two bricks as parameters, you will see a warning that it's not a good idea to create a replicated volume with only two bricks, because you will not be able to set up a quorum. In a production environment you should always create a replicated volume with an odd number of nodes, because of the quorum.
What is the quorum and why is it so important?
Suppose you set up three nodes and lose the connection between node-1 and the other nodes (node-2 and node-3), but the clients from the production network can still reach all three nodes and all three nodes are still running glusterd. Then one client can connect to node-1 and change a file, and another client can connect to the rest of the Gluster cluster (node-2 and node-3) and change the same file, because node-2 and node-3 can no longer communicate with node-1 about open files.
You will get a split brain in your cluster as soon as the connection is reestablished. If you configure a quorum of 51%, the two nodes that can still communicate (node-2 and node-3) will meet the quorum, but the other node (node-1) will not. The Gluster daemon will then stop accepting any changes from clients on node-1; the node will either stop the service or go into a read-only state.
With two nodes you can't set up a quorum, because each node is 50% of the cluster. Only with an odd number of nodes can you configure a well-working quorum. That's why you will get the warning when creating the volume with two nodes. But it's just a warning.
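As a hedged sketch only: on a cluster with three or more nodes, a server-side quorum could be enabled with the cluster.server-quorum-type and cluster.server-quorum-ratio options. The values below are examples and are not part of this two-node setup.

# enable server-side quorum for the volume (only useful with three or more nodes)
gluster volume set gv0 cluster.server-quorum-type server
# cluster-wide ratio: more than half of the nodes must be reachable
gluster volume set all cluster.server-quorum-ratio 51%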
CTDB and quorum
This problem applies to CTDB too. In the future the developers plan to introduce an optional quorum where nodes will have to be connected to more than 50% of the configured nodes before they can join the cluster.
With 2 nodes it is very easy to get a stupid form of split brain. Node A is shut down and node B is active, updating information in persistent databases (perhaps id-mapping info?). Node B is shut down and node A is restarted. Now node A's old database is in use for a while. When node B is restarted then some databases from A might be used and some from B - it depends on the sequence numbers.
Creating the volume
Now choose one of the nodes on which to create the volume. It doesn't matter which node you choose:
root@cluster-01:~# gluster volume create gv0 replica 2 c-01:/gluster/brick c-02:/gluster/brick
Replica 2 volumes are prone to split-brain. Use Arbiter or Replica 3 to avoid this. See: http://docs.gluster.org/en/latest/Administrator%20Guide/Split%20brain%20and%20ways%20to%20deal%20with%20it/.
Do you still want to continue?
 (y/n) y
volume create: gv0: success: please start the volume to access data
This is the warning I mentioned before. But by typing a "y" you can create the volume anyway.
Now let's take a look at the setup and the status of the volume. Here you can see the result:
root@cluster-01:~# gluster v info

Volume Name: gv0
Type: Replicate
Volume ID: 5d1e1031-5474-48e9-9451-1dbeb5ebb79e
Status: Created
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: c-01:/gluster/brick
Brick2: c-02:/gluster/brick
Options Reconfigured:
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on
performance.client-io-threads: off
root@cluster-01:~# gluster v status
Volume gv0 is not started
With gluster v info or gluster volume info you will see the setup of the volume and, at the end of the output, a list of the options that have been set. The command gluster v status tells you that the volume is not started yet. You have to start the volume before you can use it. Here you see the command and the new status:
root@cluster-01:~# gluster v start gv0
volume start: gv0: success
root@cluster-01:~# gluster v status gv0
Status of volume: gv0
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick c-01:/gluster/brick                   49152     0          Y       1583
Brick c-02:/gluster/brick                   49152     0          Y       9830
Self-heal Daemon on localhost               N/A       N/A        Y       1604
Self-heal Daemon on c-02                    N/A       N/A        Y       9851

Task Status of Volume gv0
------------------------------------------------------------------------------
There are no active volume tasks
Now the volume is running and ready to use.
Setting some Samba-Options
To get better performance from Gluster when connecting via SMB, it's possible to set some options on your Gluster volume. Starting with Gluster version 6, most of these options are put together in a group of options.
Starting with Gluster 7, the group option file for Samba is no longer part of the Debian packages, so I created a new group options file named my-samba:
cluster.self-heal-daemon=enable
performance.cache-invalidation=on
server.event-threads=4
client.event-threads=4
performance.parallel-readdir=on
performance.readdir-ahead=on
performance.nl-cache-timeout=600
performance.nl-cache=on
network.inode-lru-limit=200000
performance.md-cache-timeout=600
performance.stat-prefetch=on
performance.cache-samba-metadata=on
features.cache-invalidation-timeout=600
features.cache-invalidation=on
nfs.disable=on
cluster.data-self-heal=on
cluster.metadata-self-heal=on
cluster.entry-self-heal=on
cluster.force-migration=disable
You have to put the file in /var/lib/glusterd/groups/; the name of the file is my-samba. If you find a file named samba in this directory, you have the original file from the glusterfs-server package.
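As a small sketch, assuming you created the file my-samba in your current directory, you could copy it into place and check that it is available:

# copy the new group file to the groups directory and list the available groups
cp my-samba /var/lib/glusterd/groups/my-samba
ls /var/lib/glusterd/groups/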
You only have to set these options on one of your nodes. Here you see the commands to set and list the new options:
root@cluster-02:~# gluster v set gv0 group my-samba
volume set: success
root@cluster-02:~# gluster v info

Volume Name: gv0
Type: Replicate
Volume ID: 5d1e1031-5474-48e9-9451-1dbeb5ebb79e
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: c-01:/gluster/brick
Brick2: c-02:/gluster/brick
Options Reconfigured:
cluster.force-migration: disable
cluster.entry-self-heal: on
cluster.metadata-self-heal: on
cluster.data-self-heal: on
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.cache-samba-metadata: on
performance.stat-prefetch: on
performance.md-cache-timeout: 600
network.inode-lru-limit: 200000
performance.nl-cache: on
performance.nl-cache-timeout: 600
performance.readdir-ahead: on
performance.parallel-readdir: on
client.event-threads: 4
server.event-threads: 4
performance.cache-invalidation: on
cluster.self-heal-daemon: enable
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on
performance.client-io-threads: off
As you can see, all the options are set. If you would like to set a single option you can do it with gluster v set <volume-name> <option> <value>. To reset an option to its original value use gluster v reset <volume-name> <option>. To see all options use gluster v get <volume-name> all.
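A minimal sketch of this, using performance.cache-size purely as an example option:

# set a single option, show its current value, and reset it to the default again
gluster volume set gv0 performance.cache-size 256MB
gluster volume get gv0 performance.cache-size
gluster volume reset gv0 performance.cache-size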
Mounting the gluster-volume
To use the volume you have to mount it, either to a local mountpoint or over the network on a host that has the glusterfs-client package installed. To mount the volume to a local mountpoint on your hosts, the system uses FUSE. First, mount the volume manually. The following shows how to mount the volume on both nodes:
root@cluster-02:~# mount -t glusterfs cluster-02:/gv0 /glusterfs
root@cluster-02:~# mount | grep glusterfs
cluster-02:/gv0 on /glusterfs type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
root@cluster-01:~# mount -t glusterfs cluster-01:/gv0 /glusterfs
root@cluster-01:~# mount | grep glusterfs
cluster-01:/gv0 on /glusterfs type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
As you can see, the host's own name is used as the source on each host. But what happens if you mount the volume over the network on a pure GlusterFS client and the host you used for mounting crashes? The client will switch to the next node and reconnect, because during the mount process the client receives a list of all nodes holding the volume, so the client knows all nodes of the volume.
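On a pure GlusterFS client you can additionally pass fallback servers for the initial mount; the following is only a sketch with a hypothetical client host and the mountpoint /mnt/glusterfs.

# mount the volume from cluster-01, falling back to cluster-02 for the volfile
mkdir -p /mnt/glusterfs
mount -t glusterfs -o backup-volfile-servers=cluster-02 cluster-01:/gv0 /mnt/glusterfs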
The system uses FUSE to mount the volume. On most Debian-based distributions there is a problem using FUSE mounts for network filesystems, because Debian first tries to mount the FUSE volume and then starts the network, which will not work. So you need a different way to mount the volume during the start of your GlusterFS client. A good solution is to create a systemd mount unit for the volume. Here you see a systemd unit to mount the volume:
[Unit]
Description=Data dir
After=network.target glusterfs-server.service
Requires=network-online.target

[Mount]
What=cluster-01:/gv0
Where=/glusterfs
Type=glusterfs
Options=defaults,acl

[Install]
WantedBy=multi-user.target
It's very important that the name of the unit file matches the mountpoint. If your mountpoint is not directly in the filesystem root but, for example, in /vol/glusterfs, you must use the filename vol-glusterfs.mount for your unit.
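If you are unsure about the correct unit name, systemd can generate it for you; a small sketch for the /vol/glusterfs example:

# let systemd compute the unit name for a given mountpoint
systemd-escape -p --suffix=mount /vol/glusterfs
# prints: vol-glusterfs.mount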
To try the unit, unmount the Gluster volume with umount /glusterfs and run systemctl start glusterfs.mount; afterwards check with mount whether the volume is mounted. Then enable the unit with systemctl enable glusterfs.mount, so that the volume will be mounted every time you start your system.
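Putting the steps from the last paragraph together, a short test sequence could look like this:

# unmount the volume, start it via the mount unit, enable it for the next boot
umount /glusterfs
systemctl start glusterfs.mount
systemctl enable glusterfs.mount
# verify that the volume is mounted again
findmnt /glusterfs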
Do all the steps on both nodes, so that the volume will be mounted on both nodes every time you reboot the system.
Now you can use the volume.
What should you do next
Test the volume by writing files, deactivating a node, and seeing what happens when the host comes back after you have written some files to the remaining node. Also change to the directory /glusterfs on one node, create some files and directories, and check whether the same entries appear in /glusterfs on the other node; this shows you that the cluster is working.
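A small sketch of such a test, writing on one node and checking on the other:

# on cluster-01: create a directory and a file inside the mounted volume
mkdir /glusterfs/testdir
echo "hello gluster" > /glusterfs/testdir/testfile
# on cluster-02: the same entries should appear almost immediately
ls -R /glusterfs
cat /glusterfs/testdir/testfile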
Important! Never write or delete any files or directories directly on the brick itself (/gluster/brick). This will crash your volume!