CTDB log file search patterns

With merged logs available, some issues can be understood more quickly if key search patterns are used. Some of these are covered below. Note that the debug messages used in the examples are subject to change, so this page comes without guarantees...  :-)

Failed recoveries

Failed recoveries sometimes need to be understood. Here are some useful search phrases:

do_recovery
    Included in a message logged at the start of each recovery
databases recovered
    Count of databases recovered during each attempt
database recovery failed
    Logged after all attempts in a round (default = 3) have failed
control .* failed for .*\.tdb on node
    Message about a failed control, which is often the cause of a recovery failure
ban
    Message about a node being banned

If recoveries go wrong, it can be useful to know where in the logs ctdbd has been restarted:

  • starting ctdbd
  • shutdown

Adding a ctdb prefix and allowing a choice of search patterns can provide a useful timeline:

 grep -Ei 'ctdb.*(starting ctdbd|shutdown|ban|control .* failed for .*\.tdb on node|databases recovered|database recovery failed|do_recovery)' messages.merged
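When a round of recoveries fails repeatedly, tallying how often each pattern fires can show which event dominates. A minimal sketch; the log lines in the here-document are hypothetical placeholders, since real ctdbd message text varies between versions:

```shell
# Tally how often each recovery-related pattern fires, to spot the
# dominant failure mode.  The sample lines below are hypothetical
# placeholders -- real ctdbd message text varies between versions.
cat > sample.merged <<'EOF'
node1 ctdbd[100]: do_recovery: starting recovery
node1 ctdbd[100]: 3 of 5 databases recovered
node1 ctdbd[100]: do_recovery: starting recovery
node1 ctdbd[100]: database recovery failed
EOF

grep -Eio 'do_recovery|databases recovered|database recovery failed' sample.merged \
    | sort | uniq -c | sort -rn
```

Against a real merged log, replace sample.merged with messages.merged and add the ctdb prefix as above.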

System resources

If a system is running out of memory and swapping, this can sometimes explain poor performance. Messages logged by the 05.system event script can be useful. The following will find messages about memory use in the 90-99% range:

 grep -E 'ctdb.*System (swap|memory) utilization 9' messages.merged
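Matching a literal 9 only catches 90-99%; if utilization is ever reported as 100%, a slightly wider pattern is needed. A sketch, using hypothetical sample lines (the real 05.system message wording may differ):

```shell
# Match utilization of 90-99% as well as 100%.  The sample lines are
# hypothetical; real 05.system messages may be worded differently.
cat > sample.merged <<'EOF'
node1 ctdbd[100]: System memory utilization 91% >= threshold 90%
node2 ctdbd[100]: System swap utilization 100% >= threshold 90%
node2 ctdbd[100]: System memory utilization 42%
EOF

grep -Ec 'ctdb.*System (swap|memory) utilization (9[0-9]|100)' sample.merged
```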

05.system also monitors "filesystem utilization".

ctdbd is a single-threaded process. If the main daemon starts using close to 100% of a CPU (e.g. due to high contention for files from multiple nodes, large directory listings, creation of lots of small files, ...) then this can be a performance bottleneck. Recent versions of ctdbd will log if CPU utilisation exceeds 90%:

 grep 'ctdb.*CPU utilisation' messages.merged

It would be nice if "utili[sz]ation" was spelled consistently...
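Until then, a case-insensitive pattern with a character class covers both spellings. A sketch; the sample lines are hypothetical placeholders:

```shell
# [sz] matches both the British and American spellings, and -i copes
# with any capitalisation.  Sample lines are hypothetical placeholders.
cat > sample.merged <<'EOF'
node1 ctdbd[100]: WARNING: CPU utilisation 95% >= threshold (90%)
node1 ctdbd[100]: System memory utilization 91% >= threshold 90%
EOF

grep -Eic 'ctdb.*utili[sz]ation' sample.merged
```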

Overload: "Handling event took"

Sometimes ctdbd or the node it is running on can stall due to overload (CPU, memory, disk I/O, ...). If too much time is spent in a tevent event handler then ctdbd may not process any packets. In this case "Handling event took <X> seconds" will be logged. If one of these stalls occurs on a node for more than ~25 seconds (see the KeepaliveInterval and KeepaliveLimit configuration tunables) then other nodes will decide that the node is dead. To trace this type of problem, the following can be useful:

 grep -E 'ctdb.*(Handling event took|dead count|connect)' messages.merged
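To pull out just the stall durations and find the worst ones, the matched text can be post-processed. A sketch; the sample lines are hypothetical and the exact wording may differ between versions:

```shell
# Extract just the stall durations and sort numerically, worst last.
# Sample lines are hypothetical; exact wording may vary by version.
cat > sample.merged <<'EOF'
node1 ctdbd[100]: Handling event took 7 seconds!
node2 ctdbd[100]: Handling event took 31 seconds!
node1 ctdbd[100]: Handling event took 3 seconds!
EOF

grep -Eo 'Handling event took [0-9]+ seconds' sample.merged \
    | awk '{print $4}' | sort -n
```

Any value near or above the keepalive window (~25 seconds by default) is a likely candidate for a node being declared dead.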