6.0. DRBD

Replicated Failover Domain Controller and file server using LDAP


1.0. Configuring Samba

2.0. Configuring LDAP

3.0. Initialization LDAP Database

4.0. User Management

5.0. Heartbeat HA Configuration

6.0. DRBD

7.0. BIND DNS



6.1. Requirements

High availability and data replication do not replace traditional backups such as tape and external media, especially if you are new to this configuration and not yet familiar with how it works.

DRBD Configuration

Primary/Secondary

Primary/Primary

DRBD is a kernel module that networks two machines to provide RAID 1 over a LAN. It is assumed that both machines have identical drives; all data on the chosen device will be destroyed.

If you are updating your kernel or your version of DRBD, make sure DRBD is stopped on both machines. Never attempt to run different versions of DRBD; in practice this means both machines need the same kernel.
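Once DRBD is installed (see 6.2 below), a quick way to confirm that both machines really do match is to compare the kernel and DRBD versions on each node; note that /proc/drbd only exists once the drbd module is loaded, and the output should be identical on node1 and node2:

[root@node1 ~]# uname -r
[root@node1 ~]# cat /proc/drbd | head -1
[root@node2 ~]# uname -r
[root@node2 ~]# cat /proc/drbd | head -1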

You will need to install the DRBD kernel module. We will build our own RPM kernel modules so they are optimized for our architecture.

I have tested many different kernels with DRBD; some are not stable, so check that your kernel is compatible with the particular DRBD release. Most of the time this isn't an issue.

Please browse http://www.linbit.com/support/drbd-current/ and look for available packages.

If you are having problems compiling the software and are getting make errors, things can become complicated.

It is best to compile DRBD and its kernel module from source to suit your kernel, but if you get make errors you should have no trouble finding prebuilt packages for CentOS, RHEL, and all Fedora Core versions that work just fine.

Packages for Fedora Core 6 (x86 and x86-64) are available at http://atrpms.net/dist/fc6/drbd/

6.2. Installation

Step1.

Extract the latest stable version of DRBD.

[root@node1 stable]# tar zxvf drbd-0.7.20.tar.gz
[root@node1 stable]# cd drbd-0.7.20
[root@node1 drbd-0.7.20]#


Step2.

It is nice to build your own RPM for your distribution; it makes upgrades seamless.

This will give us an RPM built specifically for our kernel; it may take some time.

[root@node1 drbd-0.7.20]# make
[root@node1 drbd-0.7.20]# make rpm

If you get make errors, try to find a prebuilt RPM for your distribution.

Step3.

[root@node1 drbd-0.7.20]# cd dist/RPMS/i386/

[root@node1 i386]# ls
drbd-0.7.20-1.i386.rpm
drbd-debuginfo-0.7.20-1.i386.rpm
drbd-km-2.6.14_1.1656_FC4smp-0.7.20-1.i386.rpm

Step4.

We will now install DRBD and the kernel module which we built earlier.

[root@node1 i386]# rpm -Uvh drbd-0.7.20-1.i386.rpm drbd-debuginfo-0.7.20-1.i386.rpm \
 drbd-km-2.6.14_1.1656_FC4smp-0.7.20-1.i386.rpm
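Before moving on it is worth checking that the freshly installed module actually loads against the running kernel; a minimal check with standard commands:

[root@node1 i386]# modprobe drbd
[root@node1 i386]# lsmod | grep drbd
[root@node1 i386]# cat /proc/drbd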

Step5.

Log in to node2, the backup domain controller, and do the same.

6.3. Configuration

In the example throughout this document we have linked /dev/hdd1 to /dev/drbd0; however your configuration may use a different device (for example, it could be SCSI).

All data on the device /dev/hdd will be destroyed.


Step1.

We are going to create a partition on /dev/hdd1 using fdisk. Your actual device will most likely differ from /dev/hdd

[root@node1]# fdisk /dev/hdd1

Command (m for help): m
Command action

  a   toggle a bootable flag
  b   edit bsd disklabel
  c   toggle the dos compatibility flag
  d   delete a partition
  l   list known partition types
  m   print this menu
  n   add a new partition
  o   create a new empty DOS partition table
  p   print the partition table
  q   quit without saving changes
  s   create a new empty Sun disklabel
  t   change a partition's system id
  u   change display/entry units
  v   verify the partition table
  w   write table to disk and exit
  x   extra functionality (experts only)

Command (m for help): d

No partition is defined yet!

Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-8677, default 1):
Using default value 1
Last cylinder or +size or +sizeM or +sizeK (1-8677, default 8677):
Using default value 8677
Command (m for help): w
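After writing the table it is worth confirming the kernel sees the new partition; a quick check (your device will most likely not be /dev/hdd):

[root@node1]# fdisk -l /dev/hdd
[root@node1]# cat /proc/partitions | grep hdd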

Step2.

Now log in to node2, the backup domain controller, and fdisk /dev/hdd1 (or your chosen device) as above.

6.3.1. drbd.conf

Create this file on both your master and slave servers. It should be identical, although that is not a strict requirement; as long as the partition size is the same, any mount point can be used.


Step1.

The file below is fairly self-explanatory; you can see the real disk linked to the DRBD kernel module device.

Make sure you set your hostname as well; otherwise DRBD will not start.

[root@node1]# vi /etc/drbd.conf

# Datadrive (/data) /dev/hdd1 80GB

resource drbd1 {
 protocol C;
 disk {
   on-io-error panic;
 }
 net {
   max-buffers 2048;
   ko-count 4;
   on-disconnect reconnect;
 }
 syncer {
   rate 10000;
 }
 on node1.differentialdesign.org {
   device    /dev/drbd0;
   disk      /dev/hdd1;
   address   10.0.0.1:7789;
   meta-disk internal;
 }
 on node2.differentialdesign.org {
   device    /dev/drbd0;
   disk      /dev/hdd1;
   address   10.0.0.2:7789;
   meta-disk internal;
 }
}
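The names in the two "on" sections must match the hostname each machine reports, otherwise DRBD will refuse to start. A quick check on each node (the hostnames shown are the example names used in the file above):

[root@node1]# uname -n
node1.differentialdesign.org
[root@node2]# uname -n
node2.differentialdesign.org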

Step2.

[root@node1]# scp /etc/drbd.conf root@node2:/etc/

6.3.2. Initialization

In the following steps we will configure the disks to synchronize and choose a master node.

Step1.

On the Primary Domain Controller

[root@node1]# service drbd start

On the Backup Domain Controller

[root@node2]# service drbd start


Step2.

You can see both devices are ready and waiting for a primary to be activated, which will trigger an initial synchronization to the secondary device.

[root@node1]# service drbd status
drbd driver loaded OK; device status:
version: 0.7.17 (api:77/proto:74)
SVN Revision: 2093 build by root@node1, 2006-04-23 14:40:20
0: cs:Connected st:Secondary/Secondary ld:Inconsistent
   ns:25127936 nr:3416 dw:23988760 dr:4936449 al:19624 bm:1038 lo:0 pe:0 ua:0 ap:0


Step3.

Stop the heartbeat service on both nodes.


Step4.

We are now telling DRBD to make node1 the primary drive; this will overwrite all data on the secondary device.

[root@node1]#  drbdadm -- --do-what-I-say primary all
[root@node1 ~]# service drbd status
drbd driver loaded OK; device status:
version: 0.7.23 (api:79/proto:74)
SVN Revision: 2686 build by root@node1, 2007-01-23 20:26:13
0: cs:SyncSource st:Primary/Secondary ld:Consistent
   ns:67080 nr:85492 dw:91804 dr:72139 al:9 bm:268 lo:0 pe:30 ua:2019 ap:0
       [==>.................] sync'ed: 12.5% (458848/520196)K
       finish: 0:01:44 speed: 4,356 (4,088) K/sec
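The initial synchronization can be followed from either node; one simple way is to watch /proc/drbd refresh every second:

[root@node1 ~]# watch -n1 cat /proc/drbd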

Step5.

Create a filesystem on our RAID devices.

[root@node1]# mkfs.ext3 /dev/drbd0

6.4. Testing

We have a 2-node cluster replicating drive data; it's time to test a failover.


Step1.

Start the heartbeat service on both nodes.
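For reference, heartbeat decides where to mount the DRBD device from the haresources file covered in chapter 5.0. A minimal illustrative entry, assuming the resource name drbd1 from our drbd.conf and the /data mount point (your actual line may also carry a service IP address and other resources):

node1.differentialdesign.org drbddisk::drbd1 Filesystem::/dev/drbd0::/data::ext3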


Step2.

On node1 we can see the status of DRBD.

[root@node1 ~]# service drbd status
drbd driver loaded OK; device status:
version: 0.7.23 (api:79/proto:74)
0: cs:Connected st:Primary/Secondary ld:Consistent
   ns:1536 nr:0 dw:1372 dr:801 al:4 bm:6 lo:0 pe:0 ua:0 ap:0
[root@node1 ~]#

On node2 we can see the status of DRBD.

[root@node2 ~]# service drbd status
drbd driver loaded OK; device status:
version: 0.7.23 (api:79/proto:74)
SVN Revision: 2686 build by root@node2, 2007-01-23 20:26:03
0: cs:Connected st:Secondary/Primary ld:Consistent
   ns:0 nr:1484 dw:1484 dr:0 al:0 bm:6 lo:0 pe:0 ua:0 ap:0
[root@node2 ~]#

That all looks good; we can see the devices are consistent and ready for use.


Step3.

Now let’s check the mount point we created in the heartbeat haresources file.

We can see heartbeat has successfully mounted /dev/drbd0 to the /data directory; of course, your device will not have any data on it yet.

[root@node1 ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
                      35G   14G   20G  41% /
/dev/hdc1              99M   21M   74M  22% /boot
/dev/shm              506M     0  506M   0% /dev/shm
/dev/drbd0             74G   37G   33G  53% /data
[root@node1 ~]#


Step4.

Log in to node1 and execute the following command; once heartbeat is stopped it should take only a few seconds to migrate the services to node2.


[root@node1 ~]# service heartbeat stop
Stopping High-Availability services:
                                         [  OK  ]

We can see drbd change state to secondary on node1.

[root@node1 ~]# service drbd status
drbd driver loaded OK; device status:
version: 0.7.23 (api:79/proto:74)
SVN Revision: 2686 build by root@node1, 2007-01-23 20:26:13
0: cs:Connected st:Secondary/Primary ld:Consistent
   ns:5616 nr:85492 dw:90944 dr:2162 al:9 bm:260 lo:0 pe:0 ua:0 ap:0


Step5.

Now let's check the status of DRBD on node2; we can see it has changed state and become the primary.

[root@node2 ~]# service drbd status
drbd driver loaded OK; device status:
version: 0.7.23 (api:79/proto:74)
 SVN Revision: 2686 build by root@node2, 2007-01-23 20:26:03
0: cs:Connected st:Primary/Secondary ld:Consistent
   ns:4 nr:518132 dw:518136 dr:17 al:0 bm:220 lo:0 pe:0 ua:0 ap:0
1: cs:Connected st:Primary/Secondary ld:Consistent
   ns:28 nr:520252 dw:520280 dr:85 al:0 bm:199 lo:0 pe:0 ua:0 ap:0

Check that node2 has mounted the device.

[root@node2 ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
                      35G   12G   22G  35% /
/dev/hdc1              99M   17M   78M  18% /boot
/dev/shm              506M     0  506M   0% /dev/shm
/dev/hdh1             111G   97G  7.6G  93% /storage
/dev/drbd0             74G   37G   33G  53% /data
[root@node2 ~]#


Step6.

Finally start the heartbeat service on node1 and be sure that all processes migrate back.
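A quick way to confirm the failback worked, using the same commands as before; after a short delay node1 should report st:Primary/Secondary again and /dev/drbd0 should be mounted on /data:

[root@node1 ~]# service heartbeat start
[root@node1 ~]# service drbd status
[root@node1 ~]# df -h | grep drbd0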


6.5. DRBD 8.0 GFS2 Primary/Primary Clustered Filesystem

- GFS must be used for 8.0 primary/primary

Lose the SAN like a skirt.

Using DRBD we can create a clustered filesystem and avoid expensive SAN and filer devices. This also opens the door for those of us who wish to run CTDB clustered Samba on a 2-node cluster.

In my experience SANs themselves have been a single point of failure; changing anything from a cache battery to a firmware upgrade is supposed to be non-impacting, but this is very rarely the case.

Using DRBD in dual-primary mode with a clustered filesystem is far more tolerant of failures than any other configuration I have seen, and far less expensive. In a lot of cases disk performance will be better, as we are using local storage.

A few notes about RAID controllers: I have found them to be much slower than the onboard SATA controllers. There is no point setting up RAID 0 on a hardware controller.

No RAID configured, but the disk is running through the controller:

[root@core-02 ~]# hdparm -tT /dev/cciss/c0d1p1 

/dev/cciss/c0d1p1:
Timing cached reads:   9464 MB in  2.00 seconds = 4738.25 MB/sec
Timing buffered disk reads:   68 MB in  3.02 seconds =  22.50 MB/sec

RAID 0 configured, running through the hardware controller:

[root@core-02 ~]# hdparm -tT /dev/cciss/c0d2p1 

/dev/cciss/c0d2p1:
Timing cached reads:   8692 MB in  2.00 seconds = 4351.50 MB/sec
Timing buffered disk reads:  118 MB in  3.01 seconds =  39.19 MB/sec

Running RAID0 through my onboard SATA with software RAID I would expect ~200 MB/sec.
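For comparison, a software RAID 0 set on the onboard SATA ports can be built with mdadm and benchmarked the same way; the member devices below are only examples:

[root@core-01 ~]# mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sda1 /dev/sdb1
[root@core-01 ~]# hdparm -tT /dev/md0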

Step1.

Install GFS2 on the node. On x86-64, never install the i386 packages for GFS or you will receive the error "/usr/sbin/cman_tool: aisexec daemon didn't start".

[root@core-01 ~]# yum install gfs2-utils.x86_64
[root@core-01 ~]# yum install cman.x86_64
[root@core-01 ~]# yum install openais.x86_64

Step2.

In the example configuration file below we have called our two nodes core-01 and core-02; the cluster name is "hardcore".

Edit the GFS2 cluster configuration file; this file must be identical on both nodes.

"Ordinarily, the loss of quorum after one out of two nodes fails will prevent the remaining node from continuing (if both nodes have one vote.) Special configuration options can be set to allow the one remaining node to continue operating if the other fails. To do this only two nodes, each with one vote, can be defined in cluster.conf. The two_node and expected_votes values must then be set to 1 in the cman section as follows."

Note about fence_manual: when a node is rebooted you will need to acknowledge the failed node with fence_ack_manual. Manual fencing is only useful for testing; in production you will need some sort of hardware device available to power-cycle the machine.
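Acknowledging a manually fenced node looks something like the following; depending on your cluster version the node name is passed either directly or with -n:

[root@core-01 ~]# fence_ack_manual -n core-02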

[root@core-01 ~]# vi /etc/cluster/cluster.conf 
[root@core-01 ~]# scp /etc/cluster/cluster.conf root@core-02:/etc/cluster/
<?xml version="1.0"?>
<cluster name="hardcore" config_version="2">  
 <dlm plock_ownership="1" plock_rate_limit="0"/>
  <cman two_node="1" expected_votes="1">
   </cman>
   <clusternodes>
     <clusternode name="core-01" votes="1" nodeid="1">
      <fence>
       <method name="single">
        <device name="human" ipaddr="192.168.0.2"/>
      </method>
     </fence>
    </clusternode>
    <clusternode name="core-02" votes="1" nodeid="2">
     <fence>
      <method name="single">
        <device name="human" ipaddr="192.168.0.3"/>
      </method>
     </fence>
   </clusternode>
  </clusternodes>
  <fencedevices>
  <fencedevice name="human" agent="fence_manual"/> 
 </fencedevices>
</cluster>  
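cluster.conf is plain XML, so before copying it around you can at least confirm it is well-formed with xmllint from libxml2; this checks only the XML syntax, not the cluster schema:

[root@core-01 ~]# xmllint --noout /etc/cluster/cluster.conf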


Step3.

On both nodes edit /etc/drbd.conf. The configuration must be identical on both nodes.

Build the userland tools and kernel modules. When upgrading the kernel you will need to rebuild the kernel module with 'make km-rpm'.

./configure
make rpm
make km-rpm
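make rpm and make km-rpm leave their packages under dist/RPMS/<arch>/ inside the source tree, just as the 0.7 build did in 6.2; install both the userland and the km package on each node. The directory and architecture below are examples, use whatever your build produced:

[root@core-01 drbd-8.0]# rpm -Uvh dist/RPMS/x86_64/drbd-*.rpm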

Manual recovery is recommended so that you can isolate the root cause of the failure. In the configuration file below DRBD will attempt automatic recovery, which may not be desirable in some situations and could lead to inconsistencies and/or data loss.

As per the drbd.conf manual page (man 5 drbd.conf) there are several actions we can take to achieve an automatic recovery from a failed node.

I have included this here because some people do not read man pages. Please read the man page for drbd.conf, as these options may have an impact on your vital data.

The DRBD team have written a great document which goes into detail about DRBD and GFS: http://www.drbd.org/users-guide/ch-gfs.html

after-sb-0pri policy

possible policies are:

- disconnect
No automatic resynchronization, simply disconnect.

- discard-younger-primary
Auto sync from the node that was primary before the split-brain situation happened.

- discard-older-primary
Auto sync from the node that became primary as second during the split-brain situation.

- discard-zero-changes
In case one node did not write anything since the split brain became evident, sync from the node that wrote something to the node that did not write anything. In case none wrote anything this policy uses a random decision to perform a "resync" of 0 blocks. In case both have written something this policy disconnects the nodes.

- discard-least-changes
Auto sync from the node that touched more blocks during the split brain situation.

- discard-node-NODENAME
Auto sync to the named node.


after-sb-1pri policy

possible policies are:

- disconnect
No automatic resynchronization, simply disconnect.

- consensus
Discard the version of the secondary if the outcome of the after-sb-0pri algorithm would also destroy the current secondary's data. Otherwise disconnect.

- violently-as0p
Always take the decision of the after-sb-0pri algorithm. Even if that causes an erratic change of the primary's view of the data. This is only useful if you use a 1-node FS (i.e. not OCFS2 or GFS) with the allow-two-primaries flag, _AND_ if you really know what you are doing. This is DANGEROUS and MAY CRASH YOUR MACHINE if you have an FS mounted on the primary node.

- discard-secondary
Discard the secondary's version.

- call-pri-lost-after-sb
Always honor the outcome of the after-sb-0pri algorithm. In case it decides the current secondary has the right data, it calls the "pri-lost-after-sb" handler on the current primary.

after-sb-2pri policy

possible policies are:

- disconnect
No automatic resynchronization, simply disconnect.

- violently-as0p
Always take the decision of the after-sb-0pri algorithm. Even if that causes an erratic change of the primary's view of the data. This is only useful if you use a 1-node FS (i.e. not OCFS2 or GFS) with the allow-two-primaries flag, _AND_ if you really know what you are doing. This is DANGEROUS and MAY CRASH YOUR MACHINE if you have an FS mounted on the primary node.

- call-pri-lost-after-sb
Call the "pri-lost-after-sb" helper program on one of the machines. This program is expected to reboot the machine, i.e. make it secondary.

always-asbp

Normally the automatic after-split-brain policies are only used if the current states of the UUIDs do not indicate the presence of a third node. With this option you request that the automatic after-split-brain policies are used as long as the data sets of the nodes are somehow related. This might cause a full sync if the UUIDs indicate the presence of a third node. (Or double faults led to strange UUID sets.)

rr-conflict policy

This option handles the cases where the outcome of the resync decision is incompatible with the current role assignment in the cluster. Possible policies are:

- disconnect
No automatic resynchronization, simply disconnect.

- violently
Sync to the primary node is allowed, violating the assumption that data on a block device are stable for one of the nodes. Dangerous, do not use.

- call-pri-lost
Call the "pri-lost" helper program on one of the machines. This program is expected to reboot the machine, i.e. make it secondary.

[root@core-01 ~]# vi /etc/drbd.conf
[root@core-01 ~]# scp /etc/drbd.conf root@core-02:/etc/

# Resource r0 DRBD0 /dev/cciss/c0d1p1: 250.0 GB

resource r0 {
	protocol	C;
	device	/dev/drbd0;
	
	disk { 
               on-io-error detach;
       }


	startup {
		become-primary-on	both;
	}

	net {
		allow-two-primaries;
		cram-hmac-alg	sha1;
		shared-secret	123456;
		after-sb-0pri	discard-least-changes;
		after-sb-1pri	violently-as0p;
		after-sb-2pri	violently-as0p;
		rr-conflict	violently;
	}

	syncer {
		rate	100M;
	}


	on core-01 {
		device	/dev/drbd0;
		disk	/dev/cciss/c0d1p1;
		address	10.0.0.1:7788;
		flexible-meta-disk	internal;
	}

	on core-02 {
		device	/dev/drbd0;
		disk	/dev/cciss/c0d1p1;
		address	10.0.0.2:7788;
		flexible-meta-disk	internal;
	}
}

# Resource r1 DRBD0 /dev/cciss/c0d2p1: 500.0 GB

resource r1 {
	protocol	C;
	device	/dev/drbd1;
	
	disk {  
		on-io-error detach;
	}


	startup {
		become-primary-on	both;
	}

	net {
		allow-two-primaries;
		cram-hmac-alg	sha1;
		shared-secret	123456;
		after-sb-0pri	discard-least-changes;
		after-sb-1pri	violently-as0p;
		after-sb-2pri	violently-as0p;
		rr-conflict	violently;
	}

	syncer {
		rate	125M;
	}


	on core-01 {
		device	/dev/drbd1;
		disk	/dev/cciss/c0d2p1;
		address	10.0.1.1:7789;
		flexible-meta-disk	internal;
	}

	on core-02 {
		device	/dev/drbd1;
		disk	/dev/cciss/c0d2p1;
		address	10.0.1.2:7789;
		flexible-meta-disk	internal;
	}
}
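
Before initializing anything it is worth confirming that drbd.conf parses cleanly on both nodes; drbdadm dump prints the parsed configuration and will complain about syntax errors:

 [root@core-01 ~]# drbdadm dump all
 [root@core-02 ~]# drbdadm dump all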


Step4.

Now let's start up the cluster services (cman) that GFS2 depends on.

[root@core-01 ~]# cman_tool nodes
cman_tool: Cannot open connection to cman, is it running ?
[root@core-1 ~]# service cman start
Starting cluster: 
  Loading modules... done
  Mounting configfs... done
  Starting ccsd... done
  Starting cman... 

If cman hangs at this point check /var/log/messages for messages such as:

 core-1 openais[2942]: [TOTEM] The consensus timeout expired.
 core-1 openais[2942]: [TOTEM] entering GATHER state from 3.
[root@core-01 ~]# vi /etc/ais/openais.conf

Look for the following line and change bindnetaddr to match the network address your cluster interface is on.

               bindnetaddr: 192.168.0.0
              # bindnetaddr: 192.168.2.0

If you still receive the error, disable SELinux and stop iptables.
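
One way to do this temporarily while testing (repeat on core-02; to make it permanent set SELINUX=permissive in /etc/selinux/config and chkconfig iptables off):

 [root@core-01 ~]# setenforce 0
 [root@core-01 ~]# service iptables stop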

[root@core-01 ~]# service cman start
Starting cluster: 
  Loading modules... done
  Mounting configfs... done
  Starting ccsd... done
  Starting cman... done
  Starting daemons... done
  Starting fencing... 

At this point fencing will not start because it is waiting for core-02 to join.

[root@core-01 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M  34944   2008-02-16 02:08:14  core-01
   2   X      0                        core-02
[root@core-01 ~]# cman_tool status
Version: 6.0.1
Config Version: 2
Cluster Name: hardcore
Cluster Id: 26333
Cluster Member: Yes
Cluster Generation: 34944
Membership state: Cluster-Member
Nodes: 1
Expected votes: 1
Total votes: 2
Quorum: 1  
Active subsystems: 6
Flags: 2node 
Ports Bound: 0  
Node name: core-01
Node ID: 1
Multicast addresses: 239.192.102.68 
Node addresses: 192.168.0.2 

Time to start cman on core-02.

[root@core-02 ~]# service cman start
Starting cluster: 
  Loading modules... done
  Mounting configfs... done
  Starting ccsd... done
  Starting cman... done
  Starting daemons... done
  Starting fencing... done
                                                          [  OK  ]

Now let's check the status of the cluster.

[root@core-01 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M  34944   2008-02-16 02:08:14  core-01
   2   M  34948   2008-02-16 02:10:09  core-02
[root@core-01 ~]# cman_tool status
Version: 6.0.1
Config Version: 2
Cluster Name: hardcore
Cluster Id: 26333
Cluster Member: Yes
Cluster Generation: 34948
Membership state: Cluster-Member
Nodes: 2
Expected votes: 1
Total votes: 2
Quorum: 1  
Active subsystems: 6
Flags: 2node 
Ports Bound: 0  
Node name: core-01
Node ID: 1
Multicast addresses: 239.192.102.68 
Node addresses: 192.168.0.2 
[root@core-01 ~]# cman_tool services
type             level name     id       state       
fence            0     default  00010001 none        
[1 2]
dlm              1     gfs2-00  00030001 none        
[1 2]
dlm              1     gfs2-01  00050001 none        
[1 2]
gfs              2     gfs2-00  00020001 none        
[1 2]
gfs              2     gfs2-01  00040001 none        
[1 2]


Step5.

Start DRBD on both nodes (note: this assumes you have no data on your disks).

When creating the metadata, drbdadm will prompt "You want me to create a v08 style flexible-size internal meta data block." and ask for confirmation; answer yes to proceed.

[root@core-01 ~]# drbdadm create-md r0
[root@core-01 ~]# drbdadm create-md r1
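
The metadata must be created on core-02 as well before its devices can attach:

 [root@core-02 ~]# drbdadm create-md r0
 [root@core-02 ~]# drbdadm create-md r1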

DRBD will wait for core-02

[root@core-01 ~]# service drbd start
Starting DRBD resources:    [ d0 d1 s0 s1 n0 n1 ].

......

[root@core-01 ~]# service drbd status
drbd driver loaded OK; device status:
version: 8.2.4 (api:88/proto:86-88)
GIT-hash: fc00c6e00a1b6039bfcebe37afa3e7e28dbd92fa build by root@core-01, 2008-02-13 22:22:18
0: cs:WFConnection st:Secondary/Unknown ds:UpToDate/DUnknown C r---
   ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
       resync: used:0/31 hits:0 misses:0 starving:0 dirty:0 changed:0
       act_log: used:0/127 hits:0 misses:0 starving:0 dirty:0 changed:0
1: cs:WFConnection st:Secondary/Unknown ds:UpToDate/DUnknown C r---
   ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
       resync: used:0/31 hits:0 misses:0 starving:0 dirty:0 changed:0
       act_log: used:0/127 hits:0 misses:0 starving:0 dirty:0 changed:0

Start DRBD on core-02

[root@core-02 ~]# service drbd start
Starting DRBD resources:    [ d0 d1 s0 s1 n0 n1 ].

Now on core-01 we can start the initial full synchronization - this may take many hours. You can continue to use the filesystem on both nodes as per usual operation, but if you reboot a node during the initial sync you will have to start again. Be patient and leave it overnight, depending on your storage size.

[root@core-01 ~]# drbdadm invalidate-remote r0
[root@core-01 ~]# drbdadm invalidate-remote r1

If you receive the message "Refusing to be Primary without at least one UpToDate disk" you can try the following:

[root@core-01 ~]# drbdadm -- --overwrite-data-of-peer primary all

Now we can see that both nodes are Primary and the devices are synchronized.

[root@core-01 ~]# service drbd status
drbd driver loaded OK; device status:
version: 8.2.4 (api:88/proto:86-88)
GIT-hash: fc00c6e00a1b6039bfcebe37afa3e7e28dbd92fa build by root@core-01, 2008-02-13 22:22:18
0: cs:Connected st:Primary/Primary ds:UpToDate/UpToDate C r---
   ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
       resync: used:0/31 hits:0 misses:0 starving:0 dirty:0 changed:0
       act_log: used:0/127 hits:0 misses:0 starving:0 dirty:0 changed:0
1: cs:Connected st:Primary/Primary ds:UpToDate/UpToDate C r---
   ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
       resync: used:0/31 hits:0 misses:0 starving:0 dirty:0 changed:0
       act_log: used:0/127 hits:0 misses:0 starving:0 dirty:0 changed:0

Step6.

We need to specify 2 journals, as each cluster node requires its own journal. Because this configuration is not designed to scale beyond two nodes, 2 journals are enough.

Referring back to our cluster.conf file, we have chosen hardcore as our cluster name. We will call the clustered filesystem gfs2-00.

[root@core-01 ~]# mkfs.gfs2 -t hardcore:gfs2-00 -p lock_dlm -j 2 /dev/drbd0

Are you sure you want to proceed? [y/n] y

Device:                    /dev/drbd0
Blocksize:                 4096
Device Size                465.76 GB (122096000 blocks)
Filesystem Size:           465.76 GB (122095999 blocks)
Journals:                  3
Resource Groups:           1864
Locking Protocol:          "lock_dlm"
Lock Table:                "core-01:gfs2-00"

Now do the same for the second disk we have defined in drbd.conf.

[root@core-01 ~]# mkfs.gfs2 -t hardcore:gfs2-01 -p lock_dlm -j 2 /dev/drbd1

Are you sure you want to proceed? [y/n] y

Device:                    /dev/drbd1
Blocksize:                 4096
Device Size                465.76 GB (122096000 blocks)
Filesystem Size:           465.76 GB (122095999 blocks)
Journals:                  3
Resource Groups:           1864
Locking Protocol:          "lock_dlm"
Lock Table:                "core-01:gfs2-01"


Step7.

Now that we have created the filesystems we can go ahead and mount them.
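
If the mount points do not exist yet, create them on both nodes first (these are the paths used throughout this example):

 [root@core-01 ~]# mkdir /gfs2-00 /gfs2-01
 [root@core-02 ~]# mkdir /gfs2-00 /gfs2-01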

If you are not able to mount the file system, check that fencing is in a running state.

/sbin/mount.gfs2: lock_dlm_join: gfs_controld join error: -22
/sbin/mount.gfs2: error mounting lockproto lock_dlm

If you lose both nodes and only one comes up you can manually mount with no locking.

Do not allow multiple nodes to mount the same file system while LOCK_NOLOCK is used. Doing so causes one or more nodes to panic their kernels, and may cause file system corruption.

Use only in a disaster!

mount -t gfs2 -o lockproto=lock_nolock /dev/drbd1 /gfs2-01


[root@core-01 ~]# mount -t gfs2 /dev/drbd0 /gfs2-00 -v
/sbin/mount.gfs2: mount /dev/drbd0 /gfs2-00
/sbin/mount.gfs2: parse_opts: opts = "rw"
/sbin/mount.gfs2:   clear flag 1 for "rw", flags = 0
/sbin/mount.gfs2: parse_opts: flags = 0
/sbin/mount.gfs2: parse_opts: extra = ""
/sbin/mount.gfs2: parse_opts: hostdata = ""
/sbin/mount.gfs2: parse_opts: lockproto = ""
/sbin/mount.gfs2: parse_opts: locktable = ""
/sbin/mount.gfs2: message to gfs_controld: asking to join mountgroup:
/sbin/mount.gfs2: write "join /gfs2-00 gfs2 lock_dlm hardcore:gfs2-00 rw /dev/drbd0"
/sbin/mount.gfs2: message from gfs_controld: response to join request:
/sbin/mount.gfs2: lock_dlm_join: read "0"
/sbin/mount.gfs2: message from gfs_controld: mount options:
/sbin/mount.gfs2: lock_dlm_join: read "hostdata=jid=0:id=131073:first=0"
/sbin/mount.gfs2: lock_dlm_join: hostdata: "hostdata=jid=0:id=131073:first=0"
/sbin/mount.gfs2: lock_dlm_join: extra_plus: "hostdata=jid=0:id=131073:first=0"
/sbin/mount.gfs2: mount(2) ok
/sbin/mount.gfs2: lock_dlm_mount_result: write "mount_result /gfs2-00 gfs2 0"
/sbin/mount.gfs2: read_proc_mounts: device = "/dev/drbd0"
/sbin/mount.gfs2: read_proc_mounts: opts = "rw,relatime,hostdata=jid=0:id=131073:first=0"


Now let's add the mounts to fstab so they are mounted when the system boots.

[root@core-01 ~]# vi /etc/fstab 
#GFS DRBD MOUNT POINTS
/dev/drbd0              /gfs2-00                gfs2    defaults        1 1
/dev/drbd1              /gfs2-01                gfs2    defaults        1 1


Performance Plocks

Currently GFS2 is in a working state although performance seems to be lacking.

GFS2 + DRBD ping_pong.c test "./ping_pong /gfs2-00/test 3"

- One node plock test

[root@core-01 ~]# ./ping_pong /gfs2-00/test 3
   2159 locks/sec

- On both nodes

[root@core-01 ~]# ./ping_pong /gfs2-00/test 3
   1336 locks/sec
[root@core-02 ~]# ./ping_pong /gfs2-00/test 3
   1333 locks/sec

- One node plock rw test "./ping_pong -rw /gfs2-00/test 3"

[root@core-01 ~]# ./ping_pong -rw /gfs2-00/test 3
   2192 locks/sec

- Two node plock rw test "./ping_pong -rw /gfs2-00/test 3"

[root@core-01 ~]# ./ping_pong -rw /gfs2-00/test 3
      2 locks/sec
[root@core-02 ~]# ./ping_pong -rw /gfs2-00/test 3
      2 locks/sec
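
The cluster.conf used in the next section already includes the usual plock tuning options (disabling the plock rate limit and enabling plock ownership), which generally help ping_pong numbers:

 <dlm plock_ownership="1" plock_rate_limit="0"/>
 <gfs_controld plock_rate_limit="0"/>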

== [[6.6. Virtualization]] ==

Create this new cluster configuration file on both nodes. The last configuration example in 6.5 covered only a basic clustered filesystem. Here we are taking things a step further and configuring failover domains and resources.

For our resources we will be managing virtual machines using qemu-kvm. The cluster suite ships several pre-packaged resource agents; the one we are interested in is vm.sh. These resource rules are loaded by default from /usr/share/cluster when the cluster first starts.

In this configuration there are two failover domains configured, one for each node: core-01_domain & core-02_domain. The resources can fail over between these domains or be migrated manually. vm.sh taps directly into virsh and supports live migration.

In order for gfs2 to function correctly you need a fencing device configured; without one your mileage will vary, and in the event of a problem your node will need a manual override with fence_ack_manual.


[root@core-01 ~]# ccs_config_dump

/etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster config_version="3" name="hardcore">
	<dlm plock_ownership="1" plock_rate_limit="0"/>
	<gfs_controld plock_rate_limit="0"/>
	<cman cluster_id="26333" expected_votes="1" nodename="core-01" two_node="1"/>
	<clusternodes>
		<clusternode name="core-01" nodeid="1" votes="1">
			<fence>
				<method name="single">
					<device name="core-01_ipmi"/>
				</method>
			</fence>
		</clusternode>
		<clusternode name="core-02" nodeid="2" votes="1">
			<fence>
				<method name="single">
					<device name="core-02_ipmi"/>
				</method>
			</fence>
		</clusternode>
	</clusternodes>
	<fencedevices>
		<fencedevice action="reboot" agent="fence_ipmilan" ipaddr="xxx.xxx.xxx.xxx" login="admin" name="core-01_ipmi" passwd="xxxxxx"/>
		<fencedevice action="reboot" agent="fence_ipmilan" ipaddr="xxx.xxx.xxx.xxx" login="admin" name="core-02_ipmi" passwd="xxxxxx"/>
	</fencedevices>
	<rm>
		<failoverdomains>
			<failoverdomain name="core-01_domain" restricted="0">
				<failoverdomainnode name="core-01"/>
			</failoverdomain>
			<failoverdomain name="core-02_domain" restricted="0">
				<failoverdomainnode name="core-02"/>
			</failoverdomain>
		</failoverdomains>
		<vm domain="core-01_domain" name="blueonyx_01"/>
		<vm domain="core-01_domain" name="rhel5_01"/>
		<vm domain="core-01_domain" name="winxp_01"/>
		<vm domain="core-02_domain" name="rhel5_02"/>
		<vm domain="core-02_domain" name="winxp_02"/>
		<vm domain="core-02_domain" name="winxp_03"/>
	</rm>
</cluster>


We can use the ccs commands to verify the configuration of the cluster.

[root@core-01 ~]# ccs_config_validate
Configuration validates

We can check our fencing devices using ccs_tool. We are using ipmilan fence devices, which I have used with HP Advanced Lights Out; a weakness of this fencing method is that if a node loses power completely, fencing will not be able to verify and complete its action, so it will fail and your resources won't be migrated.

[root@core-01 ~]# ccs_tool lsfence
Name             Agent
core-01_ipmi     fence_ipmilan
core-02_ipmi     fence_ipmilan

We can override a failed fencing agent using manual intervention; however, the more you look into the fencing topic the more you will realise the importance of avoiding simultaneous read/writes from an unfenced node. So for production it would be best to add an additional fencing agent, such as an APC power rail.

[root@core-01 ~]# fence_ack_manual core-02

About to override fencing for core-02.
Improper use of this command can cause severe file system damage.

Continue [NO/absolutely]? 
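
If you do add a power rail, the extra fence device is declared in cluster.conf alongside the IPMI entries and referenced from each clusternode. fence_apc is the standard agent for APC units, but the address, credentials and outlet number below are placeholders:

 <fencedevice agent="fence_apc" ipaddr="xxx.xxx.xxx.xxx" login="apc" name="apc_pdu" passwd="xxxxxx"/>
 <!-- and inside each clusternode's <fence> block: -->
 <method name="power">
         <device name="apc_pdu" port="1"/>
 </method>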

We can list the nodes with the corresponding fence devices.

[root@core-01 ~]# ccs_tool lsnode

Cluster name: hardcore, config_version: 2

Nodename                        Votes Nodeid Fencetype
core-01                            1    1    core-01_ipmi
core-02                            1    2    core-02_ipmi


Now let's verify the cluster resource configuration has no errors by using the rg_test facility. Here we can see our resources, our failover domains and how the resources are presented in the cluster.

[root@core-01 ~]# rg_test test /etc/cluster/cluster.conf 
Loading resource rule from /usr/share/cluster/vm.sh
Loaded 23 resource rules
=== Resources List ===
Resource type: vm [INLINE]
Instances: 1/1
Agent: vm.sh
Attributes:
 name = blueonyx_01 [ primary ]
 domain = core-01_domain [ reconfig ]
 autostart = 1 [ reconfig ]
 hardrecovery = 0 [ reconfig ]
 exclusive = 0 [ reconfig ]
 use_virsh = 1
 migrate = live
 snapshot = 
 depend_mode = hard
 max_restarts = 0 [ reconfig ]
 restart_expire_time = 0 [ reconfig ]
 hypervisor = auto
 hypervisor_uri = auto
 migration_uri = auto

Resource type: vm [INLINE]
Instances: 1/1
Agent: vm.sh
Attributes:
 name = rhel5_01 [ primary ]
 domain = core-01_domain [ reconfig ]
 autostart = 1 [ reconfig ]
 hardrecovery = 0 [ reconfig ]
 exclusive = 0 [ reconfig ]
 use_virsh = 1
 migrate = live
 snapshot = 
 depend_mode = hard
 max_restarts = 0 [ reconfig ]
 restart_expire_time = 0 [ reconfig ]
 hypervisor = auto
 hypervisor_uri = auto
 migration_uri = auto

Resource type: vm [INLINE]
Instances: 1/1
Agent: vm.sh
Attributes:
 name = winxp_01 [ primary ]
 domain = core-01_domain [ reconfig ]
 autostart = 1 [ reconfig ]
 hardrecovery = 0 [ reconfig ]
 exclusive = 0 [ reconfig ]
 use_virsh = 1
 migrate = live
 snapshot = 
 depend_mode = hard
 max_restarts = 0 [ reconfig ]
 restart_expire_time = 0 [ reconfig ]
 hypervisor = auto
 hypervisor_uri = auto
 migration_uri = auto
Resource type: vm [INLINE]
Instances: 1/1
Agent: vm.sh
Attributes:
 name = rhel5_02 [ primary ]
 domain = core-02_domain [ reconfig ]
 autostart = 1 [ reconfig ]
 hardrecovery = 0 [ reconfig ]
 exclusive = 0 [ reconfig ]
 use_virsh = 1
 migrate = live
 snapshot = 
 depend_mode = hard
 max_restarts = 0 [ reconfig ]
 restart_expire_time = 0 [ reconfig ]
 hypervisor = auto
 hypervisor_uri = auto
 migration_uri = auto

Resource type: vm [INLINE]
Instances: 1/1
Agent: vm.sh
Attributes:
 name = winxp_02 [ primary ]
 domain = core-02_domain [ reconfig ]
 autostart = 1 [ reconfig ]
 hardrecovery = 0 [ reconfig ]
 exclusive = 0 [ reconfig ]
 use_virsh = 1
 migrate = live
 snapshot = 
 depend_mode = hard
 max_restarts = 0 [ reconfig ]
 restart_expire_time = 0 [ reconfig ]
 hypervisor = auto
 hypervisor_uri = auto
 migration_uri = auto

Resource type: vm [INLINE]
Instances: 1/1
Agent: vm.sh
Attributes:
 name = winxp_03 [ primary ]
 domain = core-02_domain [ reconfig ]
 autostart = 1 [ reconfig ]
 hardrecovery = 0 [ reconfig ]
 exclusive = 0 [ reconfig ]
 use_virsh = 1
 migrate = live
 snapshot = 
 depend_mode = hard
 max_restarts = 0 [ reconfig ]
 restart_expire_time = 0 [ reconfig ]
 hypervisor = auto
 hypervisor_uri = auto
 migration_uri = auto

=== Resource Tree ===
vm {
 name = "blueonyx_01";
 domain = "core-01_domain";
 autostart = "1";
 hardrecovery = "0";
 exclusive = "0";
 use_virsh = "1";
 migrate = "live";
 snapshot = "";
 depend_mode = "hard";
 max_restarts = "0";
 restart_expire_time = "0";
 hypervisor = "auto";
 hypervisor_uri = "auto";
 migration_uri = "auto";
}
vm {
 name = "rhel5_01";
 domain = "core-01_domain";
 autostart = "1";
 hardrecovery = "0";
 exclusive = "0";
 use_virsh = "1";
 migrate = "live";
 snapshot = "";
 depend_mode = "hard";
 max_restarts = "0";
 restart_expire_time = "0";
 hypervisor = "auto";
 hypervisor_uri = "auto";
 migration_uri = "auto";
}
vm {
 name = "winxp_01";
 domain = "core-01_domain";
 autostart = "1";
 hardrecovery = "0";
 exclusive = "0";
 use_virsh = "1";
 migrate = "live";
 snapshot = "";
 depend_mode = "hard";
 max_restarts = "0";
 restart_expire_time = "0";
 hypervisor = "auto";
 hypervisor_uri = "auto";
 migration_uri = "auto";
}
vm {
 name = "rhel5_02";
 domain = "core-02_domain";
 autostart = "1";
 hardrecovery = "0";
 exclusive = "0";
 use_virsh = "1";
 migrate = "live";
 snapshot = "";
 depend_mode = "hard";
 max_restarts = "0";
 restart_expire_time = "0";
 hypervisor = "auto";
 hypervisor_uri = "auto";
 migration_uri = "auto";
}
vm {
 name = "winxp_02";
 domain = "core-02_domain";
 autostart = "1";
 hardrecovery = "0";
 exclusive = "0";
 use_virsh = "1";
 migrate = "live";
 snapshot = "";
 depend_mode = "hard";
 max_restarts = "0";
 restart_expire_time = "0";
 hypervisor = "auto";
 hypervisor_uri = "auto";
 migration_uri = "auto";
}
vm {
 name = "winxp_03";
 domain = "core-02_domain";
 autostart = "1";
 hardrecovery = "0";
 exclusive = "0";
 use_virsh = "1";
 migrate = "live";
 snapshot = "";
 depend_mode = "hard";
 max_restarts = "0";
 restart_expire_time = "0";
 hypervisor = "auto";
 hypervisor_uri = "auto";
 migration_uri = "auto";
}
=== Failover Domains ===
Failover domain: core-01_domain
Flags: none
 Node core-01 (id 1, priority 0)
Failover domain: core-02_domain
Flags: none
 Node core-02 (id 2, priority 0)
=== Event Triggers ===
Event Priority Level 100:
 Name: Default
   (Any event)
   File: /usr/share/cluster/default_event_script.sl


Create a network bridge for our virtual machines to avoid NAT, so the virtual machines can be internet facing with public IP addresses. Notice I have the onboot option set to no; we will bring the bridge up manually after cman, as there seems to be an issue with bridging and cman.

[root@core-01 ~]# vi /etc/sysconfig/network-scripts/ifcfg-eth1 
# Networking Interface
DEVICE=eth1
HWADDR=00:23:7D:29:D2:7D
ONBOOT=no
TYPE=Ethernet
BRIDGE=br0


[root@core-01 ~]# vi /etc/sysconfig/network-scripts/ifcfg-br0 
DEVICE=br0
TYPE=Bridge
BOOTPROTO=static
DNS1=192.168.0.1
GATEWAY=192.168.0.1
IPADDR=192.168.0.20
NETMASK=255.255.255.0
ONBOOT=no

On our second node let's do the same.

[root@core-02 ~]# vi /etc/sysconfig/network-scripts/ifcfg-eth1 
# Networking Interface
DEVICE=eth1
HWADDR=00:21:5A:D4:0A:51
ONBOOT=no
TYPE=Ethernet
BRIDGE=br0
[root@core-02 ~]# cat /etc/sysconfig/network-scripts/ifcfg-br0 
DEVICE=br0
TYPE=Bridge
BOOTPROTO=static
DNS1=192.168.0.1
GATEWAY=192.168.0.1
IPADDR=192.168.0.30
NETMASK=255.255.255.0
ONBOOT=no
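
Once both bridge interfaces are up (the rc.local below brings them up at boot) you can confirm that eth1 is enslaved to br0 with brctl:

 [root@core-01 ~]# ifup eth1; ifup br0
 [root@core-01 ~]# brctl show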


On both nodes add the following to the rc.local file so that, after the machine has booted, we bring up the bridge, mount the clustered filesystems, and start libvirtd and the resource manager.

Also ensure the following on both cluster members:

- chkconfig cman on
- chkconfig drbd on
- chkconfig rgmanager off
- chkconfig libvirtd off


[root@core-01 ~]# cat /etc/rc.local

#!/bin/sh
#
# This script will be executed *after* all the other init scripts.
# You can put your own initialization stuff in here if you don't
# want to do the full Sys V style init stuff.

touch /var/lock/subsys/local

ifup eth1; ifup br0

mount -t gfs2 /dev/drbd0 /gfs2-00
mount -t gfs2 /dev/drbd1 /gfs2-01
#/usr/local/sbin/ctdbd --reclock /gfs2-00/cluster/ctdb/ctdb.lock --lvs
/etc/init.d/libvirtd start
/etc/init.d/rgmanager start


I build my virtual machines through virt-manager first. An example configuration file is as follows:

[root@core-02 ~]# vi /etc/libvirt/qemu/blueonyx_01.xml 
<domain type='kvm'>
 <name>blueonyx_01</name>
 <uuid>d42b866a-9f70-2faa-0e2a-4182125cf499</uuid>
 <memory>1048576</memory>
 <currentMemory>1048576</currentMemory>
 <vcpu>2</vcpu>
 <os>
   <type arch='x86_64' machine='pc'>hvm</type>
   <boot dev='hd'/>
 </os>
 <features>
   <acpi/>
   <apic/>
   <pae/>
 </features>
 <clock offset='utc'/>
 <on_poweroff>destroy</on_poweroff>
 <on_reboot>restart</on_reboot>
 <on_crash>restart</on_crash>
 <devices>
   <emulator>/usr/bin/qemu-kvm</emulator>
   <disk type='file' device='disk'>
     <source file='/gfs2-00/virtualization/BlueOnyx/blueonyx_01.img'/>
     <target dev='hda' bus='ide'/>
   </disk>
   <disk type='file' device='cdrom'>
     <target dev='hdc' bus='ide'/>
     <readonly/>
   </disk>
   <interface type='bridge'>
     <mac address='54:52:00:46:28:06'/>
     <source bridge='br0'/>
   </interface>
   <serial type='pty'>
     <target port='0'/>
   </serial>
   <console type='pty'>
     <target port='0'/>
   </console>
   <input type='mouse' bus='ps2'/>
   <graphics type='vnc' port='-1' autoport='yes'/>
   <sound model='es1370'/>
 </devices>
</domain>

Now we need to define the domain

[root@core-01 ~]# virsh define /etc/libvirt/qemu/blueonyx_01.xml 
Domain blueonyx_01 defined from /etc/libvirt/qemu/blueonyx_01.xml

Repeat this step for each virtual machine you create; remember to define them on both nodes.
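
One way to do this, assuming the domain XML files live under /etc/libvirt/qemu/ as in the example above, is to copy them across and define them in a loop:

 [root@core-01 ~]# scp /etc/libvirt/qemu/*.xml root@core-02:/etc/libvirt/qemu/
 [root@core-02 ~]# for vm in /etc/libvirt/qemu/*.xml; do virsh define $vm; done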

Reboot both nodes.

Now that everything is up and running let's verify a few things. clustat is a great way to check the status of your cluster and services; you can monitor services and their migration status when it is invoked.


[root@core-01 ~]# clustat 
Cluster Status for hardcore @ Tue Dec  1 11:06:28 2009
Member Status: Quorate
Member Name                             ID   Status
------ ----                             ---- ------
core-01                                     1 Online, Local, rgmanager
core-02                                     2 Online, rgmanager

Service Name                   Owner (Last)                   State         
------- ----                   ----- ------                   -----         
vm:blueonyx_01                 core-01                        started       
vm:rhel5_01                    core-01                        started       
vm:rhel5_02                    core-02                        started       
vm:winxp_01                    core-01                        started       
vm:winxp_02                    core-02                        started       
vm:winxp_03                    core-02                        started    

Great, everything is working as expected; we have six virtual machines, three running on each node.

In the logs you should see something like the following on each node:

[root@core-01 ~]# tail -f /var/log/cluster/rgmanager.log

Oct 29 06:21:14 rgmanager Service service:core-01_vms started

Oct 29 06:13:33 rgmanager Starting stopped service service:vm:winxp_02
Oct 29 06:13:33 bash virsh -c qemu:///system start winxp_02
Oct 29 06:13:35 rgmanager Service service:vm:winxp_02 started

[root@core-01 ~]# virsh list --all
Id Name                 State
----------------------------------
 2 rhel5_01             running
 5 winxp_01             running
 7 blueonyx_01          running
 - proxmox_01           shut off
 - rhel5_02             shut off
 - winxp_02             shut off
 - winxp_03             shut off


[root@core-02 ~]# virsh list --all

Id Name                 State
----------------------------------
 1 rhel5_02             running
 2 winxp_02             running
 3 winxp_03             running
 - blueonyx_01          shut off
 - rhel5_01             shut off
 - winxp_01             shut off


Never attempt to migrate a virtual machine outside of rgmanager. rgmanager will automatically respawn the vm and you will end up with two copies of the same virtual machine guest, one on each node. This is a very bad thing!

Core-01

top - 10:22:10 up  4:03,  1 user,  load average: 0.03, 0.05, 0.06
Tasks: 168 total,   3 running, 165 sleeping,   0 stopped,   0 zombie
Cpu(s):  1.5%us,  1.8%sy,  0.0%ni, 96.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8113232k total,  1548432k used,  6564800k free,    87892k buffers
Swap:  4095992k total,        0k used,  4095992k free,   421996k cached

 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
3688 root      20   0 1528m 283m 3112 S  8.0  3.6  14:39.56 qemu-kvm           
3882 root      20   0  939m 525m 3136 R  4.7  6.6  11:51.10 qemu-kvm   
Core-02

top - 10:15:05 up  4:04,  1 user,  load average: 0.15, 0.07, 0.01
Tasks: 165 total,   1 running, 164 sleeping,   0 stopped,   0 zombie
Cpu(s):  1.0%us,  1.5%sy,  0.0%ni, 97.4%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   4017328k total,  1480484k used,  2536844k free,    84128k buffers
Swap:  4095992k total,        0k used,  4095992k free,   133384k cached

 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
3795 root      20   0  939m 525m 3132 S  7.9 13.4  11:11.27 qemu-kvm           
3856 root      20   0  939m 525m 3132 S  5.9 13.4  11:40.59 qemu-kvm

Now let's attempt to live migrate a virtual machine from core-01 to core-02. We can monitor the rgmanager.log file or watch clustat to see the status of the migration. Be patient and expect it to take a minute or so.

[root@core-01 ~]# clusvcadm -M vm:winxp_01 -m core-02
Trying to migrate vm:winxp_01 to core-02...Success


[root@core-01 ~]# tail -f /var/log/cluster/messages
bash virsh migrate --live winxp_01 qemu+ssh://core-02/system
core-01 rgmanager[2234]: Migration of vm:blueonyx_01 to core-01 completed


Verify the status of the cluster.

[root@core-02 ~]#  clustat 
Cluster Status for hardcore @ Tue Dec  1 11:11:32 2009
Member Status: Quorate

Member Name                             ID   Status
------ ----                             ---- ------
core-01                                     1 Online, rgmanager
core-02                                     2 Online, Local, rgmanager

Service Name                   Owner (Last)                   State         
------- ----                   ----- ------                   -----         
vm:blueonyx_01                 core-01                        started       
vm:rhel5_01                    core-01                        started       
vm:rhel5_02                    core-02                        started       
vm:winxp_01                    core-02                        started       
vm:winxp_02                    core-02                        started       
vm:winxp_03                    core-02                        started   
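
Migrating the guest back is the same operation in the other direction:

 [root@core-01 ~]# clusvcadm -M vm:winxp_01 -m core-01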


Adding new resources

Check the current running cluster.conf

[root@core-01 ~]# cman_tool version
6.2.0 config 4

Add your new resource to the cluster.conf file. In this example we have chosen core-01 to be the primary domain.

<vm domain="core-01_domain" name="your-new-vm"/>

Increment your version number

<cluster config_version="5" name="hardcore">

Validate that the configuration file checks out.

[root@core-01 ~]# ccs_config_validate 
Configuration validates


Copy the new configuration to the second node.

[root@core-01 ~]# scp /etc/cluster/cluster.conf core-02:/etc/cluster/ 
cluster.conf                                  100% 1794     1.8KB/s   00:00   

Update the running cluster configuration: cman_tool version -r $newversion -S

[root@core-01 ~]# cman_tool version -r 5 -S
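
With the running configuration updated, the new virtual machine service can be enabled under rgmanager; vm:your-new-vm is the placeholder name used above:

 [root@core-01 ~]# clusvcadm -e vm:your-new-vm -m core-01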


DRBD Fault Tolerance

In this test I had virtual machines running on each server, both utilizing both GFS2 mount points in read/write mode (normal operation). I am able to physically remove the drives from one server (with the exception of the operating system disk) and all services on that node will continue to run!

DRBD in this state is known as diskless mode. All read/write operations will be carried out over the network to the other node. Over a dedicated gigabit LAN performance may be a problem depending on how high a load you are actually running. You can migrate your virtual machines to the good node and replace the disks on the failed node without any downtime.

From then onwards, DRBD is said to operate in diskless mode, and carries out all subsequent I/O operations, read and write, on the peer node. Performance in this mode is inevitably expected to suffer, but the service continues without interruption, and can be moved to the UpToDate node.

[root@core-01 ~]# /etc/init.d/drbd status
drbd driver loaded OK; device status:
version: 8.3.5 (api:88/proto:86-91)
GIT-hash: ded8cdf09b0efa1460e8ce7a72327c60ff2210fb build by root@core-01, 2009-10-29 19:01:29
m:res  cs         ro               ds                 p  mounted   fstype
0:r0   Connected  Primary/Primary  Diskless/UpToDate  C  /gfs2-00  gfs2
1:r1   Connected  Primary/Primary  Diskless/UpToDate  C  /gfs2-01  gfs2

Even though all disks have completely failed on one of the nodes (excluding the OS disk), all services and mount points remain available.
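
Once the failed disks have been replaced, recovery is, in outline, a matter of recreating the DRBD metadata on the new disks and re-attaching them; DRBD then resynchronizes from the healthy peer. A sketch for resource r0 on the node that lost its disks (repeat for r1), assuming the replacement disk appears under the same device name:

 [root@core-01 ~]# drbdadm create-md r0
 [root@core-01 ~]# drbdadm attach r0
 [root@core-01 ~]# service drbd status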