LinuxCIFS troubleshooting

From SambaWiki

Asking for Help

The best place to ask for help with Linux CIFS is on the linux-cifs mailing list. When asking for help, it's best to provide some basic info:

  • The kernel version you're using (the output of uname -r)
  • The mount.cifs version you're using (mount.cifs -V)
  • A clear, concise description of the problem
  • A description of the CIFS server with which you're having trouble (the Windows version if it's Windows, the Samba version if it's Samba, the name of the appliance if it's something else)
  • If you're able to mount the share, the contents of /proc/fs/cifs/DebugData
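
The items above can be gathered in one go. The following is a minimal sketch rather than an official tool: the function name collect_cifs_info and its optional path argument are made up for illustration, and it only prints what it is able to read.

```shell
#!/bin/sh
# Hypothetical helper: print the details usually requested on linux-cifs.
collect_cifs_info() {
    # The path can be overridden; the default is the real procfs file.
    debug_data="${1:-/proc/fs/cifs/DebugData}"

    echo "kernel: $(uname -r)"

    # mount.cifs is often only on root's PATH.
    if command -v mount.cifs >/dev/null 2>&1; then
        mount.cifs -V 2>&1
    else
        echo "mount.cifs: not found"
    fi

    echo "--- DebugData ---"
    if [ -r "$debug_data" ]; then
        cat "$debug_data"
    else
        echo "(not readable: is cifs.ko loaded, and are you root?)"
    fi
}
```

Running "collect_cifs_info > /tmp/cifs-info.txt" as root gives a file that can be reviewed for sensitive details and then attached to a help request.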

Enabling Debugging

The CIFS code contains a number of debugging statements that can be enabled. If you ask for help on the list, one of the developers may ask you for this info. You can also turn it on on your own, but it's not generally helpful unless you're willing to dig into the code.

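To enable debugging, echo a non-zero value into /proc/fs/cifs/cifsFYI. On kernels built with dynamic debug support, also enable the cifs debug statements through the dynamic_debug control file (on recent kernels the client sources live under fs/smb/client, so adjust the file pattern accordingly). For example: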

# modprobe cifs
# echo 'module cifs +p' > /sys/kernel/debug/dynamic_debug/control
# echo 'file fs/cifs/* +p' > /sys/kernel/debug/dynamic_debug/control
# echo 7 > /proc/fs/cifs/cifsFYI

To disable it:

# echo 0 > /proc/fs/cifs/cifsFYI

These messages end up in the kernel ring buffer. You can view them using dmesg.

# dmesg

syslog will generally also pick up much of it, but if the rate of messages is rather large, syslog tends to drop some of them. Getting the info straight out of the ring buffer is generally preferred since that's lossless.

This debugging output can be rather chatty and can have a significant impact on performance, so it's best used with easily reproducible problems. That is:

  • turn on debugging
  • (optionally) clear the old information from the message buffer ("dmesg -c")
  • reproduce the issue
  • turn off debugging
  • save the debugging information
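
The steps above can be sketched as a small wrapper. This is an illustration under stated assumptions, not part of cifs-utils: the function name cifs_debug_run and the CIFS_FYI override variable are hypothetical, and the real control file requires root.

```shell
#!/bin/sh
# Hypothetical wrapper: enable cifsFYI, run a reproducer, save the ring buffer.
# Usage (as root): cifs_debug_run /tmp/cifs-debug.log ls /mnt/share
cifs_debug_run() {
    log="$1"; shift
    # CIFS_FYI is an assumed override so the function can be dry-run
    # against an ordinary file; the default is the real control file.
    fyi="${CIFS_FYI:-/proc/fs/cifs/cifsFYI}"

    echo 7 > "$fyi" || return 1            # turn on debugging
    dmesg -c > /dev/null 2>&1 || true      # optionally clear old messages
    status=0
    "$@" || status=$?                      # reproduce the issue
    echo 0 > "$fyi"                        # turn off debugging
    dmesg > "$log" 2>&1 || true            # save the debugging information
    return "$status"
}
```

The wrapper returns the exit status of the reproducer, so a failing command is still visible to the caller after debugging has been switched off again.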

Debugging info can contain sensitive data like IP addresses and filenames. Take care when sending this information.

Wire Captures

It's sometimes helpful to capture wire traffic between the client and server. The easiest way to do this is with Wireshark, a graphical network analysis tool. In many cases, however, it's not easy or even possible to run Wireshark directly on one of the hosts. In that case, it's often easier to capture the network traffic in binary format to a file and then feed that file into an analyzer. That also makes it possible to send the capture to someone else for analysis.

Here's an example of doing this:

# tcpdump -i eth0 -s0 -w /tmp/cifs-traffic.pcap host cifs_server.example.com and port 445

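Alternatively, if this is a large capture, you can limit how much of each packet is saved by setting the snapshot length to a reasonable maximum (200 bytes). Note that this truncates each packet, so file data beyond that length is not captured: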

# tcpdump -i eth0 -s200 -w /tmp/cifs-traffic.pcap host cifs_server.example.com and port 445

...of course, tcpdump has a lot of options, so these are just examples. In particular, you'll want to modify the capture filter depending on which machine you're running the capture on. An excellent overview presentation on using Wireshark to trace SMB workloads can be found at https://www.snia.org/sites/default/orig/sdc_archives/2008_presentations/monday/RonnieSahlberg_UsingWireshark.pdf

The captured traffic in the above examples will be in /tmp/cifs-traffic.pcap. Before sending captures around, it's a good idea to compress them, as they squash down fairly well.
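
As a rough sketch of the compression step, the hypothetical helper below (compress_capture is not an existing tool) writes a gzip copy next to the capture and lists both files so the sizes can be compared. The default path is the one used in the tcpdump examples.

```shell
#!/bin/sh
# Hypothetical helper: gzip a capture next to the original and show both sizes.
compress_capture() {
    pcap="${1:-/tmp/cifs-traffic.pcap}"
    # -c writes to stdout, so the original capture is left untouched.
    gzip -9 -c "$pcap" > "$pcap.gz" || return 1
    ls -l "$pcap" "$pcap.gz"
}
```

Wireshark can normally open a gzip-compressed capture directly, so the recipient usually does not need to decompress it first.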

In general, the SMB protocol can be fairly chatty so it's best to use this in a similar manner to the debugging above:

  • start the capture
  • reproduce the problem
  • stop the capture

Wire captures can also contain sensitive data like addresses, password hashes, filenames and data. Be careful to whom you send it. In general, don't send this to mailing lists unless you know that the data isn't sensitive.

Viewing Network Traces

Wireshark provides excellent support for viewing SMB (including the most recent SMB3 and SMB3.1.1 features) traffic. SMB3 and later traffic is often encrypted so in order to view the (decrypted) frames you need to enter in the keys for that session (e.g. "smbinfo keys ..."). See https://wiki.samba.org/index.php/Wireshark_Decryption for more details on how to view encrypted SMB traffic.

Oopses

Occasionally the kernel will panic. When it does, it's helpful to capture the entire message including the kernel messages leading up to the oops. There's a lot of info in an oops message but the main thing that helps debugging is determining where the machine panicked. Here's one way to do this:

Save off the oops message. The main thing that you see in there is a dump of the registers on the CPU that panicked. For instance, an oops on a 32-bit ix86 machine might look something like this:

BUG: unable to handle kernel NULL pointer dereference at 00000414
IP: [<c110d057>] cifs_writepages+0x35/0x60a

...the "IP:" line refers to the instruction pointer. It tells us which instruction the CPU was executing when it panicked. However, because of architecture and compiler differences, that address can't be read directly as a line of code. Here's how to translate it:

Open the kernel module with gdb:

$ gdb cifs.ko

...eventually it should come to a (gdb) prompt. If you're running a vendor kernel, then you may need debuginfo packages for this to work. Once you get a gdb prompt, run:

(gdb) list *(cifs_writepages+0x35)

...obviously, you should replace the stuff in the parentheses with whatever your oops message says. Pasting the list output can help developers help you.
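
The lookup can also be run non-interactively. The sketch below is illustrative only: the function names are made up, and the sed pattern assumes the "IP: [<address>] symbol+offset/length" layout shown in the example above. Oops formats differ between kernel versions and architectures, so the pattern may need adjusting.

```shell
#!/bin/sh
# Hypothetical helper: pull "symbol+offset" out of an oops "IP:" line (stdin).
oops_ip_expr() {
    sed -n 's|^IP: \[<[0-9a-f]*>] \([A-Za-z0-9_.]*+0x[0-9a-f]*\)/0x[0-9a-f]*.*|\1|p'
}

# Read an oops on stdin and print the matching source lines with gdb.
# The module passed as $1 (e.g. cifs.ko) must carry debug info.
oops_list_source() {
    module="$1"
    loc=$(oops_ip_expr) || return 1
    [ -n "$loc" ] || { echo "no IP: line found" >&2; return 1; }
    gdb -batch -ex "list *($loc)" "$module"
}
```

For example, "oops_list_source cifs.ko < oops.txt" runs the same "list" command as the interactive session and exits, which makes the output easy to paste into a report.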

Other

It can be helpful to know whether the client timed out and had to reconnect to the server. Even if it reconnected successfully, pending commands can fail when the "hard" mount option is not in use. Check the "session" and "share reconnects" values in /proc/fs/cifs/Stats ("grep reconnect /proc/fs/cifs/Stats") before the failure and again after it to see whether they have increased. If either value has increased, an operation has timed out (sometimes due to a network failure, a server or file system hang, or some other bug). In addition, "dmesg" (the kernel message buffer) will often show a message similar to the following: "CIFS VFS: Server 172.22.149.109 has not responded in 120 seconds. Reconnecting …"
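
The before-and-after comparison can be scripted. This is a minimal sketch with hypothetical function names. It compares the matching lines as text rather than parsing the numbers, so it does not depend on the exact layout of the Stats file.

```shell
#!/bin/sh
# Hypothetical helper: print only the reconnect counters from the Stats file.
reconnect_counts() {
    grep reconnect "${1:-/proc/fs/cifs/Stats}"
}

# Compare an earlier snapshot (file in $1) with the current counters.
reconnects_changed() {
    before="$1"
    stats="${2:-/proc/fs/cifs/Stats}"
    if [ "$(cat "$before")" != "$(reconnect_counts "$stats")" ]; then
        echo "reconnect counters changed"
    else
        echo "reconnect counters unchanged"
    fi
}
```

A typical use is "reconnect_counts > /tmp/reconnects.before" ahead of the test run, followed by "reconnects_changed /tmp/reconnects.before" once the failure has been reproduced.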

Additional Debugging Features

Recent kernels, starting with 4.18, support dynamic tracing of cifs.ko with tools such as trace-cmd (ftrace). This allows selective control of cifs tracepoints via trace-cmd ("trace-cmd record -e cifs") or /sys/kernel/debug/tracing/events/cifs, and is much easier to use in many scenarios. As of the 6.4 kernel there are 101 dynamic trace events for cifs.ko that can be selectively monitored (see the directory above).
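
The events directory can also be driven by hand. The helpers below are a sketch with made-up names. They rely only on the standard tracefs layout, in which each event is a subdirectory and writing 1 or 0 to the group's "enable" file switches every event in the group on or off.

```shell
#!/bin/sh
# Hypothetical helper: list the cifs tracepoints this kernel exposes.
# Each event is a subdirectory; "enable" and "filter" are plain files.
list_cifs_events() {
    dir="${1:-/sys/kernel/debug/tracing/events/cifs}"
    for d in "$dir"/*/; do
        [ -d "$d" ] && basename "$d"
    done
    return 0
}

# Hypothetical helper: $1 is 1 (enable all cifs events) or 0 (disable them).
set_cifs_events() {
    echo "$1" > "${2:-/sys/kernel/debug/tracing/events/cifs}/enable"
}
```

After "set_cifs_events 1" (as root), the live event stream can be read from /sys/kernel/debug/tracing/trace_pipe. Run "set_cifs_events 0" when finished.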


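# Trace the call graph of "mkdir foo" with the function graph tracer:
#   -g SyS_mkdir   function that triggers the start of tracing
#                  (syscall symbol names vary by kernel version and
#                  architecture, e.g. __x64_sys_mkdir on newer x86_64 kernels)
#   -F mkdir foo   command to trace
trace-cmd record -p function_graph -g SyS_mkdir -F mkdir foo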

# show call graph
trace-cmd report | less

There are also scripts that capture multiple types of SMB trace information on Linux, e.g. https://github.com/Azure-Samples/azure-files-samples/tree/master/SMBDiagnostics