CTDB log file search patterns
Latest revision as of 05:02, 30 June 2020
With merged logs available, some issues can be understood more quickly if key search patterns are used. Some of these are covered below. Note that the debug messages used in the examples are subject to change, so this page comes without guarantees... :-)
Failed recoveries
Failed recoveries sometimes need to be understood. Here are some useful search phrases:
do_recovery
- Included in a message logged at the start of recovery
databases recovered
- Count of databases recovered during each attempt
database recovery failed
- Logged after all attempts in a round (default = 3) have failed
control .* failed for .*\.tdb on node
- Logged when a control fails; this is often the underlying cause of a recovery failure
ban
- Message about a node being banned
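The phrases above can also be combined into a quick per-pattern summary, which gives a feel for how often each event occurred before digging into the details. A minimal sketch, assuming a merged syslog-style file like the messages.merged used elsewhere on this page (the helper name is made up for illustration):

```shell
# Hypothetical helper: count how often each recovery-related search
# phrase from this page appears in a merged log file.
count_recovery_patterns() {
    log="$1"
    for pattern in \
        'do_recovery' \
        'databases recovered' \
        'database recovery failed' \
        'control .* failed for .*\.tdb on node' \
        'ban'
    do
        # grep -c prints 0 when there are no matches
        printf '%-45s %s\n' "$pattern" "$(grep -c -E "ctdb.*${pattern}" "$log")"
    done
}
```

Running count_recovery_patterns messages.merged then prints one count per pattern, making repeated failed rounds easy to spot.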
If recoveries go wrong it can be useful to know where ctdbd
has been restarted:
starting ctdbd
shutdown
Adding a ctdb
prefix and allowing a choice of search patterns can provide a useful timeline:
grep -Ei 'ctdb.*(starting ctdbd|shutdown|ban|control .* failed for .*\.tdb on node|databases recovered|database recovery failed|do_recovery)' messages.merged
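If the merged timeline is noisy, it can help to look at one node at a time. A hedged sketch, assuming each merged-log line contains the originating node's hostname (the helper name node_timeline is made up for illustration):

```shell
# Hypothetical helper: restrict the recovery timeline to one node by
# filtering on its hostname first.  Assumes the merged log lines
# contain the node's name somewhere on each line.
node_timeline() {
    host="$1"; log="$2"
    grep -F "$host" "$log" |
        grep -Ei 'ctdb.*(starting ctdbd|shutdown|ban|control .* failed for .*\.tdb on node|databases recovered|database recovery failed|do_recovery)'
}
```

So node_timeline node1 messages.merged would show only node1's slice of the timeline.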
System resources
If a system is running out of memory and swapping then this can sometimes explain poor performance. Messages logged by the 05.system
event script can be useful. The following will find messages reporting swap or memory utilization in the 90-99% range:
grep -E 'ctdb.*System (swap|memory) utilization 9' messages.merged
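As written, the trailing 9 only matches two-digit readings starting with 9 (90-99%). If the script can also report a 100% reading (an assumption about the message format, not verified here), a variant wrapped in a throwaway helper might look like:

```shell
# Hypothetical variant: also catch a 100% reading.  The exact message
# wording is extrapolated from the 90% pattern above.
high_utilization() {
    grep -E 'ctdb.*System (swap|memory) utilization (9[0-9]|100)' "$1"
}
```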
05.system
also monitors "filesystem utilization".
ctdbd
is a single, single-threaded process. If the main daemon starts using close to 100% of a CPU (e.g. due to high contention for files from multiple nodes, large directory listings, creation of lots of small files, ...) then this can be a performance bottleneck. Recent versions of ctdbd
will log if CPU utilisation exceeds 90%:
grep 'ctdb.*CPU utilisation' messages.merged
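To see which nodes are hitting the CPU bottleneck, the matches can be tallied per host. A sketch, assuming standard syslog framing where the hostname is the fourth whitespace-separated field ("Mon DD HH:MM:SS host ..."); the helper name is illustrative:

```shell
# Hypothetical summary helper: tally CPU-utilisation warnings per node.
# Assumes syslog-style lines with the hostname in field 4.
cpu_warnings_by_node() {
    grep 'ctdb.*CPU utilisation' "$1" |
        awk '{ count[$4]++ } END { for (n in count) print n, count[n] }'
}
```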
It would be nice if "utili[sz]ation" was spelled consistently...
Overload: "Handling event took"
Sometimes ctdbd
or the node it is running on can stall due to overload (CPU, memory, disk I/O, ...). If too much time is spent in a tevent
event handler then ctdbd
may not process any packets. In this case "Handling event took <X> seconds" will be logged. If one of these stalls occurs on a node for more than ~25 seconds (see the KeepaliveInterval
and KeepaliveLimit
configuration tunables) then other nodes will decide that the node is dead. To trace this type of problem, the following can be useful:
grep -E 'ctdb.*(Handling event took|dead count|connect)' messages.merged
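To focus on stalls long enough to get a node declared dead (~25 seconds with the default Keepalive* settings, as noted above), the reported duration can be extracted and filtered. A hedged sketch; only the "Handling event took <X> seconds" part of the message wording is taken from this page, the rest is assumed:

```shell
# Hypothetical filter: keep only "Handling event took <X> seconds"
# messages where the stall exceeds ~25 seconds, i.e. long enough for
# other nodes to declare this one dead.
long_stalls() {
    grep -E 'ctdb.*Handling event took [0-9]+ seconds' "$1" |
        awk 'match($0, /took [0-9]+ seconds/) {
                 split(substr($0, RSTART, RLENGTH), parts, " ")
                 if (parts[2] + 0 > 25) print
             }'
}
```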