CTDB Performance

Record contention

When a file is accessed through SMB, Samba (smbd, to be more precise) keeps a database record for it in locking.tdb. Concurrent access is registered in the same record, so every time the file is opened or closed this record is updated. If the open and close operations happen on multiple cluster nodes, CTDB needs to transfer the record to the node performing the update.

This can result in poor performance due to contention for files and, therefore, for database records. Although contention is most often seen in locking.tdb, it can also occur in other databases, such as brlock.tdb. When records are contended they bounce between nodes, and network latency makes this expensive.

One major cause of contention is Windows clients keeping an open handle on the root directory of an SMB share while they access it. Many Windows clients accessing the same clustered Samba share can easily trigger this problem.
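To get an idea of which files clients currently have open and locked, and therefore which locking.tdb records are in use, smbstatus can help. A minimal sketch (run as root on a cluster node; the output format varies between Samba versions):

 # List currently locked files known to smbd
 smbstatus -L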

Exactly where is CTDB involved?

In this situation it is important to understand that CTDB is only involved in creating records and moving them between nodes. smbd looks for a record in the relevant TDB and, if it determines that the latest version of that record is present on the current node, it uses that record. There are two other cases:

  • The record is present but the current node does not have the latest copy
  • The record is not present

In both cases smbd will ask ctdbd to fetch the record.

Log messages indicating poor performance

Log messages like the following are an indicator of performance problems:

 db_ctdb_fetch_locked for /var/cache/dbdir/volatile/locking.tdb.N key ABCDEFBC2A66F9AD1C55142C290000000000000000000000, chain 62588 needed 1 attempts, X milliseconds, chainlock: Y ms, CTDB Z ms

If Z is large (multiple seconds, particularly tens of seconds) then CTDB took a long time to fetch the record from another node.
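When there is a lot of log traffic, it can help to filter these messages on the CTDB time at the end. A rough sketch, assuming smbd logs to /var/log/samba/log.smbd (the log location and line prefix depend on your logging configuration):

 # Show fetches where the CTDB component took at least 1000 ms
 grep 'db_ctdb_fetch_locked' /var/log/samba/log.smbd | \
     awk '$(NF-1) + 0 >= 1000'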

Aside: stuck smbd processes

If Z is even larger (hundreds or thousands of seconds) then this can indicate that an smbd process on a node is stuck in D state, probably in a cluster filesystem system call, while holding a TDB lock. In this case the above db_ctdb_fetch_locked messages may not even be seen, because a record is never successfully fetched. Instead, one or more repeated messages like the following may be seen:

 Unable to get RECORD lock on database locking.tdb for X seconds

A very large value of X (hundreds or thousands of seconds) indicates a serious problem.

This can be confirmed by finding a long-running smbd process in D state and obtaining a kernel stack trace (on Linux, via /proc/<pid>/stack). See the documentation for the ctdb.conf(5) [database] lock debug script option for an automated way of debugging this. Note that when robust mutexes are in use, which is the modern Samba default, this automated method only works on versions >= 4.15.
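A minimal sketch of doing this by hand (Linux-specific, run as root):

 # Find smbd processes stuck in uninterruptible sleep (D state)
 ps -eo pid,stat,comm | awk '$3 == "smbd" && $2 ~ /^D/'

 # For a suspect process, dump its kernel stack
 cat /proc/<pid>/stack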

As hinted at above, the usual reason for this type of problem is a cluster filesystem issue.

Hot keys

The hot keys section of the ctdb dbstatistics locking.tdb output lists the keys in locking.tdb that have been fetched to a node the most times. Substitute other database names as appropriate.
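For example, to look at the hot keys for locking.tdb and to find the names of other attached databases (the exact output layout varies between CTDB versions):

 # Statistics for a single database, including its hot keys
 ctdb dbstatistics locking.tdb

 # List the databases attached to CTDB (for example, brlock.tdb)
 ctdb getdbmap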

High hop count

If the local CTDB node does not have the latest copy of a record then it asks that record's location master node to fetch the record. If the location master does not have the record itself, it knows which node does, so it forwards the fetch request to this "last known node". However, the cluster state is quite dynamic, so the record may already have been fetched away from the "last known node". In that case the "last known node" forwards the request back to the location master... and so on. The record is "chased" around the cluster until it is found, and the hop count increases with each forward.

This behaviour is logged as follows:

 High hopcount 198 dbid:locking.tdb

To avoid flooding the logs, such a message is only logged when hopcount % 100 > 95, that is, for hop counts 96-99, 196-199, and so on.
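A rough way to see how often this is happening, and for which databases, is to count these messages. A sketch, assuming CTDB logs to /var/log/log.ctdb (the log location and exact message format vary with your version and logging configuration):

 # Count high-hopcount warnings per database
 grep 'High hopcount' /var/log/log.ctdb | grep -o 'dbid:[^ ]*' | sort | uniq -c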

Samba needs multiple fetch attempts

Returning to this log message:

 db_ctdb_fetch_locked for /var/cache/dbdir/volatile/locking.tdb.N key ABCDEFBC2A66F9AD1C55142C290000000000000000000000, chain 62588 needed 1 attempts, X milliseconds, chainlock: Y ms, CTDB Z ms

In this case it says needed 1 attempts. If this number is greater than 1 then smbd was told that the record had been fetched, but by the time it checked, the record had already been migrated away to another node. This can happen repeatedly under high contention.

Workarounds

Deliberately breaking lock coherency

Lock coherency can be deliberately, but carefully, broken using:

 fileid:algorithm = fsname_norootdir

or even:

 fileid:algorithm = fsname_nodirs

See vfs_fileid(8). This needs to be carefully considered and understood, because incorrectly breaking lock coherency can lead to data corruption.
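A sketch of how this might look in smb.conf; the share name and path here are only examples, and the fileid module must be loaded via vfs objects for the option to have any effect:

 [clustered]
     path = /clusterfs/data
     vfs objects = fileid
     fileid:algorithm = fsname_norootdir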

Read-only records

This feature allows read-only leases to be granted for records. This means that many nodes can have the latest copy of a record, which is useful if there is a lot of read-only access. The cost is that all of the read-only leases need to be cancelled when a node wishes to update the record.

See ctdb(1) setdbreadonly.

This feature is known to have been successfully used in production and is used by default on at least one database.
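A minimal example of enabling it for a particular database (the database name here is only illustrative):

 # Enable read-only record support for this database
 ctdb setdbreadonly locking.tdb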

Sticky records

This feature causes a contended record (with high hopcount) to be pinned to a node for a minimum amount of time before it can be migrated away again. This is particularly useful if multiple clients connected to a node have all requested the same record. They can all have their turn reading and updating the record without incurring a networking cost.

See ctdb(1) setdbsticky and ctdb-tunables(7) HopcountMakeSticky, StickyDuration and StickyPindown.

This feature is not known to have been used in production but it may provide useful performance benefits. However, like any heuristic it needs to be finely tuned to avoid the cost outweighing the benefit.
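A sketch of enabling sticky records and inspecting the related tunables; the database name and value shown are only illustrative, and tunables set with setvar apply to the node they are run on (see ctdb-tunables(7) for defaults and meanings):

 # Enable sticky record handling for a database
 ctdb setdbsticky locking.tdb

 # Inspect the tunables that control the behaviour
 ctdb getvar HopcountMakeSticky
 ctdb getvar StickyDuration
 ctdb getvar StickyPindown

 # For example, make records sticky at a lower hop count
 ctdb setvar HopcountMakeSticky 20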

Performance monitoring

Sometimes performance is bad but the logs do not say anything useful. In such cases, general performance monitoring tools may provide an explanation.

Running top(1) can provide some insight. For example, is the main ctdbd process spinning at 100% of a CPU? If so, it is probably busy migrating records.

A longer-running option is atop. It can be set up to regularly take a snapshot of various system information. You can then walk back through time, browse the state of the overall system, see where particular processes peak and observe I/O saturation; potentially bad values are highlighted in red. Nice! A 1 minute interval provides a useful amount of information. However, note that you need to consider the performance impact of taking such regular snapshots.
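A sketch of recording and replaying atop snapshots by hand (the raw file path is only an example; many distributions also ship a service that records to /var/log/atop/ automatically):

 # Record a snapshot every 60 seconds into a raw file
 atop -w /var/log/atop/atop_ctdb 60

 # Later, browse the recorded data interactively
 atop -r /var/log/atop/atop_ctdb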

There are obviously many such tools operating at many different levels.

Finally, CTDB ships with an event script called 05.system, which attempts to warn about high memory usage and full filesystems. This can be very useful when no other monitoring is in place.
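In recent CTDB versions, event scripts are managed with the ctdb event command; a sketch of checking that 05.system is enabled (the syntax differs between versions, so check ctdb(1) for yours):

 # List the legacy event scripts and whether they are enabled
 ctdb event script list legacy

 # Enable the system monitoring script if required
 ctdb event script enable legacy 05.system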