CTDB Performance

From SambaWiki
Revision as of 11:16, 8 October 2021

Record contention

CTDB's distributed volatile databases are subject to contention for database records. This can result in performance issues. Contention is most often seen in locking.tdb (and sometimes brlock.tdb). Records in these databases are directly associated with files, so when several nodes contend for access to metadata for a particular file or directory, the associated record(s) are contended and bounce between nodes.

In this situation it is important to understand that CTDB is only involved in creating records and moving them between nodes. smbd looks for a record in the desired TDB and if it determines that the latest version of that record is present on the current node then it uses that record. There are 2 other cases:

  • The record is present but the current node does not have the latest copy
  • The record is not present

In both cases smbd will ask ctdbd to fetch the record. So, when multiple nodes need to access the same record then that record will bounce. This can be expensive.

Log messages indicating poor performance

Log messages like the following are an indicator of performance problems:

 db_ctdb_fetch_locked for /var/cache/dbdir/volatile/locking.tdb.N key ABCDEFBC2A66F9AD1C55142C290000000000000000000000, chain 62588 needed 1 attempts, X milliseconds, chainlock: Y ms, CTDB Z ms

If Z is large (multiple seconds, particularly tens of seconds) then CTDB took a long time to fetch the record from another node.
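Slow fetches like these can be pulled out of a log with a simple filter. The following is a sketch assuming the message format shown above; the sample log file, its path and the 1000 ms threshold are all hypothetical:

```shell
# Create a hypothetical sample log (in practice, read the real Samba log).
cat > /tmp/sample-smbd.log <<'EOF'
db_ctdb_fetch_locked for /var/cache/dbdir/volatile/locking.tdb.2 key AA11, chain 62588 needed 1 attempts, 12045 milliseconds, chainlock: 3 ms, CTDB 12040 ms
db_ctdb_fetch_locked for /var/cache/dbdir/volatile/locking.tdb.2 key AA11, chain 62588 needed 1 attempts, 15 milliseconds, chainlock: 3 ms, CTDB 10 ms
EOF

# Print only fetches whose CTDB portion (Z) exceeded 1000 ms.
awk -F'CTDB ' '/db_ctdb_fetch_locked/ { split($2, z, " "); if (z[1] + 0 > 1000) print }' /tmp/sample-smbd.log
```

Only the first sample line (CTDB 12040 ms) is printed; the fast fetch is filtered out.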

Aside: stuck smbd processes

If Z is even larger (hundreds or thousands of seconds) then this can indicate that an smbd process on a node is stuck in D state, probably in a cluster filesystem system call, while holding a TDB lock. In this case the above db_ctdb_fetch_locked messages may not even be seen, because a record is never successfully fetched; instead, one or more repeated messages like the following may appear:

 Unable to get RECORD lock on database locking.tdb for X seconds

A very large value of X (hundreds or thousands of seconds) indicates a serious problem.

This can be confirmed by finding a long-running smbd process in D state and obtaining a kernel stack trace (on Linux, /proc/<pid>/stack). See the documentation for the ctdb.conf(5) [database] lock debug script option for an automated way of debugging this (note that when robust mutexes are in use, which is the modern Samba default, this automated method only works on Samba versions 4.15 and later).
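The manual check can be sketched as follows. This assumes Linux procfs; reading /proc/<pid>/stack typically requires root, and on a healthy system the filter prints nothing:

```shell
# Find smbd processes in D (uninterruptible sleep) state and dump their
# kernel stacks to see where in the kernel (e.g. a cluster filesystem
# system call) each one is stuck.
ps -eo pid=,stat=,comm= | awk '$2 ~ /^D/ && $3 == "smbd" { print $1 }' |
while read -r pid; do
    echo "=== smbd pid $pid ==="
    cat "/proc/$pid/stack" 2>/dev/null || echo "(kernel stack unavailable; run as root)"
done
```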

As hinted at above, the usual reason for this type of problem is a cluster filesystem issue.

Hot keys

The hot keys section of ctdb dbstatistics locking.tdb statistics output lists the keys in locking.tdb that have been fetched to a node the most times. Substitute other database names as appropriate.
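For example (sketch only; this requires a running CTDB cluster, and the exact wording of the hot keys heading in the output may vary between versions, so the grep pattern below is an assumption):

```shell
# Show per-database statistics for locking.tdb, including the
# most-fetched ("hot") keys.
ctdb dbstatistics locking.tdb

# Narrow the output to the lines around the hot keys section.
ctdb dbstatistics locking.tdb | grep -i -A 12 'hot'
```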

High hop count

If the local CTDB node does not have the latest copy of a record then it will ask that record's location master node to fetch the record. If the location master doesn't have the record it knows which node does have it, so it forwards the fetch request to this "last known node". However, the cluster state is quite dynamic, so the record may already have been fetched away from the "last known node". When that node receives the fetch request, it forwards it back to the location master... and so on. The record is "chased" around the cluster until it is found.

This behaviour is logged as follows:

 High hopcount 198 dbid:locking.tdb

To avoid flooding the logs, such logging occurs when hopcount % 100 > 95.
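In other words, only hopcounts whose last two digits are 96 through 99 trigger the message, roughly 4% of values. A quick sketch of the condition:

```shell
# Which hopcounts pass the logging condition (hopcount % 100 > 95)?
# 96, 99 and 198 print "logged"; 95, 100 and 200 print "suppressed".
for hc in 95 96 99 100 198 200; do
    if [ $((hc % 100)) -gt 95 ]; then
        echo "$hc: logged"
    else
        echo "$hc: suppressed"
    fi
done
```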

Samba needs multiple fetch attempts

Returning to this log message:

 db_ctdb_fetch_locked for /var/cache/dbdir/volatile/locking.tdb.N key ABCDEFBC2A66F9AD1C55142C290000000000000000000000, chain 62588 needed 1 attempts, X milliseconds, chainlock: Y ms, CTDB Z ms

In this case it says needed 1 attempts. If this number is greater than 1 then this smbd was informed that the record had been fetched, but by the time it checked, the record had already been migrated away to another node. This can happen repeatedly under high contention.

Workarounds

Deliberately breaking lock coherency

Lock coherency can be deliberately, but carefully, broken using:

 fileid:algorithm = fsname_norootdir

or even:

 fileid:algorithm = fsname_nodirs

See vfs_fileid(8). This needs to be carefully considered and understood to avoid filesystem corruption.
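A hypothetical smb.conf fragment is shown below. This is a sketch only: whether breaking lock coherency is safe depends entirely on the workload and filesystem, and vfs_fileid(8) must be consulted first.

```
[global]
    vfs objects = fileid
    ; Exclude the share root directory from cross-node lock coherency,
    ; avoiding contention on its (heavily accessed) record. fsname_nodirs
    ; would do this for all directories, at greater risk.
    fileid:algorithm = fsname_norootdir
```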

Read-only records

This feature allows read-only leases to be granted for records. This means that many nodes can have the latest copy of a record, which is useful if there is a lot of read-only access. The cost is that all of the read-only leases need to be cancelled when a node wishes to update the record.

See ctdb(1) setdbreadonly.
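Enabling this on a database is a one-line administrative action (sketch; requires a running CTDB cluster):

```shell
# Enable read-only record support for locking.tdb on the running cluster.
ctdb setdbreadonly locking.tdb
```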

This feature is known to have been successfully used in production and is used by default on at least one database.

Sticky records

This feature causes a contended record (with high hopcount) to be pinned to a node for a minimum amount of time before it can be migrated away again. This is particularly useful if multiple clients connected to a node have all requested the same record. They can all have their turn reading and updating the record without incurring a networking cost.

See ctdb(1) setdbsticky and ctdb-tunables(7) HopcountMakeSticky, StickyDuration and StickyPindown.
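A hypothetical tuning session might look like the following. This is a sketch; the values shown are illustrative, not recommendations, and the units and defaults should be checked against ctdb-tunables(7) for the version in use:

```shell
# Enable sticky records for locking.tdb.
ctdb setdbsticky locking.tdb

# Make a record sticky once its hopcount exceeds 50, keep stickiness
# active for 600 seconds, and pin a sticky record to a node for 200 ms
# before it can migrate away again.
ctdb setvar HopcountMakeSticky 50
ctdb setvar StickyDuration 600
ctdb setvar StickyPindown 200
```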

This feature is not known to have been used in production but it may provide useful performance benefits. However, like any heuristic it needs to be finely tuned to avoid the cost outweighing the benefit.