Testing a filesystem with the ping_pong tool
The ping_pong tool is a tiny piece of C code that can be used to tell you some very useful things about a cluster filesystem. If you are interested in seeing if your favourite cluster filesystem might be used for CTDB/Samba then I highly recommend starting by running ping_pong and making sure it passes.
ping_pong is distributed with CTDB or it can be downloaded from http://junkcode.samba.org/ftp/unpacked/junkcode/ping_pong.c
Compile it like this:
cc -o ping_pong ping_pong.c
What it tests
The ping_pong tool can test the following aspects of your cluster filesystem
- If it supports coherent byte range locks between cluster nodes
- How fast it handles lock contention
- If it supports coherent read/write I/O between nodes
- How fast it handles contended I/O between nodes
- If it supports coherent mmap between nodes
- How fast the mmap coherence works
All this in 176 lines of C! What a bargain.
I was also rather surprised to find that it isn't uncommon for this test to crash (or lock up) cluster filesystems that haven't been run against it before. I guess that just shows how much filesystem developers tend to neglect locking.
Testing lock coherence
Recent Samba versions
The following method can be used to verify lock coherence with recent Samba versions that support the -l option to ping_pong.
Log in to one node of your cluster and run ping_pong there like this:
ping_pong -l /path/to/clusterfs/file
You should see the following output:
Holding lock, press any key to continue...
You should run the same command on another node now.
Now log in to another node and run the same command. You should see the following on that node:
file already locked, calling check_lock to tell us who has it locked:
check_lock failed: lock held: pid='0', type='1', start='0', len='0'
Working POSIX byte range locks
If that's what you see on your system, that means byte-range locks are cluster coherent and you're good to go.
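The check above can be approximated with two fcntl() calls. The following is a minimal sketch under stated assumptions, not the ping_pong source: the helper name try_whole_file_lock is hypothetical, and it assumes the test takes a write lock over the whole file, which is consistent with the start='0', len='0' and type='1' values in the output (F_WRLCK is 1 on most Linux architectures).

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper, not taken from ping_pong.c.
 * Tries a non-blocking write lock covering the whole file.
 * Returns 0 if this process now holds the lock,
 *         1 if another process already holds a conflicting lock,
 *        -1 on any other error. */
int try_whole_file_lock(int fd)
{
    struct flock lock = {0};

    assert(fd >= 0);

    lock.l_type = F_WRLCK;    /* exclusive (write) lock */
    lock.l_whence = SEEK_SET;
    lock.l_start = 0;
    lock.l_len = 0;           /* 0 means "to the end of the file" */

    if (fcntl(fd, F_SETLK, &lock) == 0) {
        return 0;
    }
    if (errno != EACCES && errno != EAGAIN) {
        return -1;            /* not a lock conflict */
    }

    /* Ask the filesystem who holds the conflicting lock. */
    struct flock probe = {0};
    probe.l_type = F_WRLCK;
    probe.l_whence = SEEK_SET;
    if (fcntl(fd, F_GETLK, &probe) == -1) {
        return -1;
    }
    if (probe.l_type != F_UNLCK) {
        printf("lock held: pid=%ld type=%d start=%ld len=%ld\n",
               (long)probe.l_pid, (int)probe.l_type,
               (long)probe.l_start, (long)probe.l_len);
    }
    return 1;
}
```

On a coherent cluster filesystem, the first node to call this on the shared file would get 0 and every other node would get 1 for as long as the first node keeps the file open and locked.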
Older Samba versions
If your ping_pong command is lacking the -l option you can use the following method to verify lock coherence.
Log in to several nodes of your cluster. Start by running ping_pong on just one of the nodes like this:
ping_pong test.dat N
where N is at least 1 more than the number of nodes you will be testing on. The filename (test.dat in the above) should point at the same shared file on all cluster nodes.
You'll see ping_pong print out a lock rate once per second. As you are running only on one node, you should expect to get a very high rate, as you have no contention. So for a typical server style CPU you should expect to get a rate of perhaps 500k to 1M locks/second. If ping_pong doesn't print a locking rate once per second then you have a bug. Talk to your filesystem vendor.
Now start a second copy of ping_pong on another node in your cluster. Use exactly the same parameters. You should see that the locking rate drops dramatically. That is because the cluster filesystem now has to handle the contended case for every lock it grants. On a gigabit network you should hope to now get a locking rate of between 1k/sec and 10k/sec depending on how fast the lock coherence algorithms of your cluster filesystem are.
Again, if you don't see a lock rate printed once per second, or if the locking rates shown in the two instances are not almost equal, or if the locking rate did not drop when you ran the second copy, then you almost certainly have a buggy cluster filesystem. Talk to your vendor.
Now start a third copy of ping_pong, and keep going up one at a time, noting how the locking rate changes as you add nodes. That shows you how well the lock coherence algorithms scale with the number of nodes.
Finally, kill off the ping_pong test one node at a time. As you kill them, you should see the locking rate increase until you get back to the single node case. If it doesn't increase as expected, then you have a filesystem bug. Contact your friendly vendor.
Testing I/O coherence
OK, so you managed to pass the lock coherence test. Great! Now let's look at I/O coherence.
Kill all your copies of ping_pong, and start the whole process again (adding one at a time) but this time add the command line switch -rw. So you'll do this:
ping_pong -rw test.dat N
You'll probably see a much lower locking rate. This is because ping_pong is now doing a one byte read and a one byte write after each lock. It also prints a "data increment" value, which should be equal to the number of nodes that are running the ping_pong test (I'm afraid it only supports up to 256 nodes with this test).
If the "data increment" value doesn't equal the number of nodes currently running the ping_pong test, or if it doesn't print a lock rate once per second, or if the lock rate starts to approach zero, then you have a bug. Talk to your vendor.
The locking rate this prints is a simple measure of your I/O contention rate. Bigger numbers are better.
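One way to picture the "data increment" is as the amount a byte has changed since this node last looked at it. The sketch below is an illustration under that assumption, not the actual ping_pong code; the function name rw_step and the last_seen parameter are hypothetical. It would be called while the byte at offset is locked.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper, not taken from ping_pong.c.
 * Reads the byte at `offset`, works out how far it has moved since this
 * process last saw it, then writes back the value plus one.
 * Returns the increment (0..255), or -1 on an I/O error. */
int rw_step(int fd, off_t offset, unsigned char *last_seen)
{
    unsigned char c = 0;
    ssize_t n;
    int increment;

    assert(last_seen != NULL);

    n = pread(fd, &c, 1, offset);
    if (n < 0) {
        return -1;
    }
    /* n == 0 means we read past end of file; treat the byte as 0. */

    /* The subtraction wraps modulo 256 because the counter is one byte. */
    increment = (unsigned char)(c - *last_seen);
    *last_seen = c;

    c = (unsigned char)(c + 1);
    if (pwrite(fd, &c, 1, offset) != 1) {
        return -1;
    }
    return increment;
}
```

If every node bumps each byte once per lap and all writes are visible everywhere, each node sees the byte advance by the number of participating nodes. A one-byte counter wraps at 256, which fits the node limit mentioned above.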
Testing mmap coherence
If you add the -m switch to ping_pong along with -rw then it will do the I/O coherence test via mmap. It isn't absolutely essential that a cluster filesystem supports coherent mmap for CTDB/Samba, but it's nice for bragging points over other cluster filesystems. If your cluster filesystem doesn't pass this test then just use the "use mmap = no" option in smb.conf. Even if it does pass this test that option may be a good idea on most cluster filesystems.
Relevance to CTDB and Samba
- To test whether a cluster filesystem supports the CTDB recovery lock you need the lock coherence test to pass.
- To test whether a cluster filesystem supports Samba, with POSIX locking enabled, you need the I/O coherence test to pass.
How it works
Well, you could just read the code. Did I mention it's just 176 lines long?
Anyway, for those of you too lazy to read C or (gasp!) unable to read C, what ping_pong does is a "one foot on the ground" test, aimed at defeating any possible optimisations or shortcuts that cluster filesystems might use to prevent you measuring their coherence times.
So the test does this locking pattern:
lock byte 0
lock byte 1
unlock byte 0
lock byte 2
unlock byte 1
lock byte 3
...
...
lock byte N
unlock byte N-1
lock byte 0
unlock byte N
... etc etc
all done in a tight loop. If the filesystem is behaving correctly then two nodes can't lock the same byte at the same time. As each instance of the ping_pong program always has "one foot on the ground", meaning one byte locked, this means that two instances of ping_pong cannot overtake one another.
This means a lot of contention. The filesystem can't optimise away this contention with cache mechanisms, so we end up measuring the real contention times that the filesystem achieves.
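For those who would rather see the pattern as code, here is a minimal sketch of the "one foot on the ground" loop. It is not the ping_pong source: the names set_byte_lock and ping_pong_steps are hypothetical, the loop runs for a fixed number of steps instead of forever, and it assumes blocking fcntl() byte-range locks that wrap back to byte 0 after the last byte.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: lock or unlock one byte, waiting if it is busy. */
static int set_byte_lock(int fd, off_t offset, short type)
{
    struct flock lock = {0};

    lock.l_type = type;       /* F_WRLCK to lock, F_UNLCK to unlock */
    lock.l_whence = SEEK_SET;
    lock.l_start = offset;
    lock.l_len = 1;
    return fcntl(fd, F_SETLKW, &lock);
}

/* Hypothetical sketch of the locking pattern.
 * Always takes the next byte before releasing the current one, so one
 * byte stays locked at all times. Returns the number of completed
 * steps, or -1 if a lock call fails. */
long ping_pong_steps(int fd, int num_locks, long steps)
{
    long done = 0;
    int i = 0;

    assert(num_locks >= 2);   /* with one byte there is nothing to step to */

    if (set_byte_lock(fd, 0, F_WRLCK) != 0) {
        return -1;
    }
    while (done < steps) {
        int next = (i + 1) % num_locks;

        if (set_byte_lock(fd, next, F_WRLCK) != 0) {
            return -1;
        }
        if (set_byte_lock(fd, i, F_UNLCK) != 0) {
            return -1;
        }
        i = next;
        done++;
    }
    set_byte_lock(fd, i, F_UNLCK);   /* lift the last foot before returning */
    return done;
}
```

Dividing the number of completed steps by the elapsed time gives a locking rate of the kind ping_pong prints once per second.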