New clustering features in SMB3 and Samba
The protocol flags
With SMB3, Windows Server 2012 introduces new cluster features for SMB shares. At the SMB protocol level, these are controlled by three share capability flags:
- Cluster (SMB2_SHARE_CAP_CLUSTER)
The share is based on a cluster resource, and its availability can be monitored through the witness service ([MS-SWN]).
- Scale-out (SMB2_SHARE_CAP_SCALEOUT)
The share is active on all nodes in the cluster at the same time (all-active characteristic). [MS-SMB2] says: The specified share is present on a server configuration which facilitates faster recovery of durable handles.
- Continuous Availability (CA) (SMB2_SHARE_CAP_CONTINUOUS_AVAILABILITY)
The share offers "SMB transparent failover", which is realized with the new concept of persistent file handles.
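As a quick reference, the three flags can be decoded from the capabilities field of a tree connect response roughly as follows. This is a minimal sketch: the bit values are the ones given in [MS-SMB2], while the `ShareCap` class and `describe_share_caps` function are names made up for this illustration.

```python
from enum import IntFlag


class ShareCap(IntFlag):
    """Cluster-related share capability bits (values as in [MS-SMB2])."""
    CONTINUOUS_AVAILABILITY = 0x00000010  # SMB2_SHARE_CAP_CONTINUOUS_AVAILABILITY
    SCALEOUT = 0x00000020                 # SMB2_SHARE_CAP_SCALEOUT
    CLUSTER = 0x00000040                  # SMB2_SHARE_CAP_CLUSTER


def describe_share_caps(capabilities: int) -> list[str]:
    """Return the names of the cluster-related bits set in `capabilities`."""
    return [cap.name for cap in ShareCap if capabilities & cap]


if __name__ == "__main__":
    # A share that announces all three cluster-related capabilities.
    print(describe_share_caps(0x70))
```

Other capability bits (for example the DFS bit) are ignored by this helper on purpose.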
The purpose of this text is to explain which combinations of the above flags are valid from the perspective of the Windows client, how the presence of these combinations on a share changes the client's behavior, and how these flags can be controlled through Windows. It also connects these protocol bits with high-level concepts like "SOFS cluster" found in various guides on the internet.
Windows clustering concepts and prerequisites
The starting point for setting up an SMB cluster with Windows is the installation of a failover cluster. This requires a storage device (for example iSCSI) attached to all nodes and formatted with NTFS (or ReFS). The cluster creation then creates an active/passive cluster volume from this file system, called the "witness disk". This disk holds the cluster configuration and is always active on exactly one node. (Not to be confused with the witness service.)
- For the creation of such a failover cluster, the Windows feature "Failover-Clustering" needs to be installed.
- During the creation of the failover cluster, a cluster network name (including an AD computer object) is created and a (new) static network address is associated with it. The failover cluster's resources are accessed from the outside via this name and IP address. The address is always configured on the one currently active node.
We call the type of active/passive clustered disk that is created for the witness disk at cluster creation a "cluster volume" or even "cluster failover volume" when we want to emphasize the failover characteristic. (What is the proper terminology here?) (These volumes are created by making file systems on shared storage known to the cluster as cluster resources.)
A cluster volume can be "upgraded" into a "cluster shared volume" (CSV), which is the equivalent of a clustered or shared file system: A "cluster shared volume" can be accessed on all nodes of the cluster simultaneously. (Detail: The process replaces the NTFS metadata operations by a mechanism that negotiates all metadata through a special node--the "coordinator node". Note that metadata operations between the nodes are performed using SMB.)
So we have two variants of cluster volumes, failover and shared. Clustered SMB shares can be created with SMB3 by sharing either of the two kinds of cluster volumes.
- In order to share a CSV with SMB, one first needs to activate the "Scale Out File Server" (SOFS) role and create a netname to use for the cluster. This SOFS network name is associated with those addresses of the cluster nodes that have previously been configured for external client access.
- (CSVs have been available since Windows Server 2008 R2, but prior to Windows Server 2012, one could only run a Hyper-V cluster off a CSV. Serving a CSV via SMB is new in Server 2012.)
So we have two basic kinds of cluster SMB shares:
- A shared failover cluster volume is only ever active on one node of the cluster at a given point in time. This setup is called a "traditional clustered file server", or "(clustered) file server for general use".
- A shared CSV requires a SOFS role on the cluster and is attached to the SOFS net name. Such a share is active on all nodes in the cluster simultaneously.
Controlling the capabilities through Windows
Let's explain how to control the three protocol share capabilities listed above with Windows. First, any of these flags can only ever be set on a share offered by a cluster (failover cluster).
- Cluster: A share on a cluster carries this flag if and only if the shared file system is a cluster volume (failover or shared).
- Scale-out: A share on a cluster carries this flag if and only if the shared file system is a CSV (cluster shared volume).
- Continuous Availability: The CA capability can be independently turned on or off at share creation or on an existing cluster share. (Non-cluster shares are not supported.)
Hence, cluster and scale-out capabilities are automatically set depending on the type of volume shared. CA is the only capability that can be set arbitrarily by a configuration switch. On Windows, scale-out and CA can only be set on a cluster share.
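The rules above can be summarized in a few lines of code. This is only a sketch of the behaviour described in this section, not of any Windows API; the function name `windows_share_caps` and the volume type labels `"failover"` and `"csv"` are invented for the illustration.

```python
from typing import Optional

SMB2_SHARE_CAP_CONTINUOUS_AVAILABILITY = 0x00000010
SMB2_SHARE_CAP_SCALEOUT = 0x00000020
SMB2_SHARE_CAP_CLUSTER = 0x00000040


def windows_share_caps(volume_type: Optional[str], ca_enabled: bool) -> int:
    """Capabilities a Windows share would announce, per the rules above.

    volume_type is None for a non-cluster share, "failover" for a cluster
    failover volume and "csv" for a cluster shared volume (labels are
    hypothetical).
    """
    if volume_type is None:
        if ca_enabled:
            raise ValueError("CA can only be enabled on a cluster share")
        return 0
    if volume_type not in ("failover", "csv"):
        raise ValueError(f"unknown volume type: {volume_type!r}")

    caps = SMB2_SHARE_CAP_CLUSTER          # every cluster volume
    if volume_type == "csv":
        caps |= SMB2_SHARE_CAP_SCALEOUT    # only cluster shared volumes
    if ca_enabled:
        caps |= SMB2_SHARE_CAP_CONTINUOUS_AVAILABILITY  # configuration switch
    return caps
```

For example, `windows_share_caps("csv", True)` yields all three bits, which corresponds to the "sofs + ca" share type used in the tests further down.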
Windows cluster shares offer the so-called witness service.
The witness service is an RPC service that allows a client to be actively notified about the state change of resources. The client asks the node it is connected to for a list of interfaces and registers itself on a different node with the witness service for notification about a resource, which might be a netname, or an interface group and an IP address. Afterward it can request notification.
There are two versions of Witness:
- V1: Windows 8 and Server 2012
- V2: Windows 8.1 and Server 2012 R2.
With version 1 there are two events:
- a resource being enabled or disabled and
- a request for a client to move to another resource.
Version 2 adds the following new events:
- the ownership of a share moving between resources and
- an IP address being added, removed, enabled, or disabled.
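The event sets of the two witness versions can be written down as a small lookup. This is a sketch for orientation only: the `WitnessEvent` member names are illustrative labels chosen here and are not meant to be the constant names used in [MS-SWN].

```python
from enum import Enum


class WitnessEvent(Enum):
    # Illustrative labels, not the [MS-SWN] constant names.
    RESOURCE_CHANGE = "a resource was enabled or disabled"
    CLIENT_MOVE = "the client is asked to move to another resource"
    SHARE_MOVE = "ownership of a share moved between resources"
    IP_CHANGE = "an IP address was added, removed, enabled, or disabled"


V1_EVENTS = frozenset({WitnessEvent.RESOURCE_CHANGE, WitnessEvent.CLIENT_MOVE})
V2_EVENTS = V1_EVENTS | {WitnessEvent.SHARE_MOVE, WitnessEvent.IP_CHANGE}


def supported_events(witness_version: int) -> frozenset:
    """Return the events a witness service of the given version can send."""
    if witness_version == 1:
        return V1_EVENTS
    if witness_version == 2:
        return V2_EVENTS
    raise ValueError(f"unknown witness version: {witness_version}")
```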
Scale-out shares have a very special behaviour: no batch oplocks and no write or handle leases are granted, see
- [MS-SMB2] 3.3.5.9 Receiving an SMB2 CREATE Request,
- [MS-SMB2] 3.3.5.9.11 Handling the SMB2_CREATE_REQUEST_LEASE_V2 Create Context
- One would expect exclusive oplocks also not to be granted since they correspond to RW leases. But you can get exclusive oplocks (smbtorture).
- According to [MS-SMB2], windows clients *never* use exclusive oplocks. So it is not very bad, but probably, this is just a bug/omission in the server and doc.
- The motivation behind this is probably that, with this restriction, cross-cluster lease and oplock breaks need to be implemented only for read leases and level 2 oplocks, which can be broken asynchronously in contrast to the other lease/oplock types.
- TODO: write a torture test with cross-node exclusive oplock break...
- possible bug (GB): No batch oplocks, no write or handle leases are granted when the scale-out feature is installed, even when the share in question is not scale-out. (really?)
Lack of batch oplocks and handle leases means that clients won't get plain durable handles from the server. They can only get them in the form of persistent handles, which are available when the share is also continuously available (see below).
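The lease restriction on scale-out shares amounts to masking the requested lease state down to read caching. The following is a minimal sketch of that rule; the lease bit values are those of [MS-SMB2], the function name `max_grantable_lease` is hypothetical, and the result is only an upper bound, since what a server really grants also depends on other opens of the file.

```python
SMB2_LEASE_READ_CACHING = 0x01
SMB2_LEASE_HANDLE_CACHING = 0x02
SMB2_LEASE_WRITE_CACHING = 0x04


def max_grantable_lease(requested_state: int, scale_out: bool) -> int:
    """Upper bound of the lease state a server may grant for a request.

    On a scale-out share, write and handle caching are never granted,
    so only the read caching bit of the request can survive.
    """
    if scale_out:
        return requested_state & SMB2_LEASE_READ_CACHING
    return requested_state


if __name__ == "__main__":
    rwh = (SMB2_LEASE_READ_CACHING
           | SMB2_LEASE_WRITE_CACHING
           | SMB2_LEASE_HANDLE_CACHING)
    # An RWH request on a scale-out share is reduced to R.
    print(hex(max_grantable_lease(rwh, scale_out=True)))
```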
A scale-out server has some limitations which might be specific to the implementation (Windows Server 2012).
- It is only accessible with SMB2 or higher, older clients get NT_STATUS_ACCESS_DENIED.
- It is mainly for data heavy operations because all meta data has to go through the one Coordinator Node of the CSV.
- It does not support e.g. BranchCache, DataDeduplication, DFS Namespaces, and DFS Replication.
CA shares offer persistent file handles:
- persistent handles are like durable handles with strong guarantees.
- persistent handles are requested through the persistent flag in the durable v2 create request blob.
- For persistent handles, the timeout in the request blob is honoured. (For durable handles, it is ignored and an implementation-specific constant value is taken.) If the timeout in the request is zero, an implementation-specific default is taken.
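The timeout rule for persistent versus durable handles can be sketched as below. The persistent flag value is the one defined for the durable v2 create request in [MS-SMB2]; the function name and the two server-side parameters (`durable_constant_ms`, `persistent_default_ms`) are assumptions for this illustration, since the actual values are implementation specific.

```python
SMB2_DHANDLE_FLAG_PERSISTENT = 0x00000002  # flag in the durable v2 request blob


def effective_timeout_ms(flags: int,
                         requested_ms: int,
                         durable_constant_ms: int,
                         persistent_default_ms: int) -> int:
    """Timeout the server applies to a durable v2 handle request."""
    persistent = bool(flags & SMB2_DHANDLE_FLAG_PERSISTENT)
    if not persistent:
        # Plain durable handle: the requested timeout is ignored.
        return durable_constant_ms
    if requested_ms == 0:
        # Persistent handle without an explicit timeout.
        return persistent_default_ms
    # Persistent handle: the requested timeout is honoured.
    return requested_ms
```

The sketch assumes the share is continuously available; on other shares a persistent handle would not be granted in the first place.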
According to [MS-SMB2], a cluster share must offer the witness service. But Windows 8 clients happily connect to the share when witness is not running (after having asked the endpoint mapper), and also when witness is running but not offering monitoring for any resource related to the share.
Question: Are there any different timings or retry characteristics for cluster shares when compared to non-clustered shares?
- On a non-CA scale-out share, clients won't get write and handle leases, batch oplocks, and durable handles. But Windows 8 still requests these.
- On Windows, a scale-out share is always a cluster share, but Windows 8 clients happily connect to a scale-out share without the cluster capability set.
- Are there any different timings or retry characteristics for scale-out shares when compared to non-scale-out shares?
- Different characteristics for cluster+scale-out vs scale-out?
- Windows 8 clients on CA shares typically request persistent handles with an RWH lease (or a batch oplock).
- What are the precise differences in retry characteristics of Windows 8 clients against a CA share?
Windows 8 client explorer operations against win2012 cluster. Entries of the form "requested/got":
|win8 req \ share type||sofs + ca||sofs||ca||cluster|
In our tests, Windows 8 (and newer? - TODO: test 8.1) sends SMB2_FLAGS_REPLAY_OPERATION in write requests and in read requests starting with the second read of a read that requires more than one read request, if (and only if) the server announces the persistent handles capability in the negprot response (SMB2_CAP_PERSISTENT_HANDLES).
This happens on *any* share on such a server, also on non-cluster-shares.
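The observed client behaviour can be expressed as a small predicate. This sketch only restates the observation above and is not derived from the specification; the flag and capability values are those of [MS-SMB2], while the function name and its parameters are hypothetical.

```python
SMB2_CAP_PERSISTENT_HANDLES = 0x00000010   # negprot response capability
SMB2_FLAGS_REPLAY_OPERATION = 0x20000000   # SMB2 header flag


def observed_replay_flag(server_capabilities: int,
                         command: str,
                         read_index: int = 0) -> int:
    """Header flag Windows 8 was seen to set on a read or write request.

    read_index counts the read requests belonging to one logical read,
    starting at 0 (a hypothetical parameter for this sketch).
    """
    if not server_capabilities & SMB2_CAP_PERSISTENT_HANDLES:
        return 0
    if command == "write":
        return SMB2_FLAGS_REPLAY_OPERATION
    if command == "read" and read_index >= 1:
        return SMB2_FLAGS_REPLAY_OPERATION
    return 0
```

Note that the share type does not appear as an input: in our tests only the server-wide capability mattered.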
When copying a file from the server, Windows 8 first tries an SMB2_IOCTL FSCTL_OFFLOAD_READ when the file exceeds a certain size threshold (e.g. roughly 2MB is enough).
Note: this is not cluster-specific at all, but mentioned anyways, since we just stumbled across it.
We tested with Windows 8 the following matrix of server configurations:
- persistent handles server capability announced: on/off
- durable handles (on share): on/off
- cluster share cap: on/off
- scale-out share cap: on/off
- continuous availability share cap: on/off
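With five independent on/off switches, the full matrix has 32 server configurations. A minimal sketch for enumerating them follows; the option names are labels chosen here and do not correspond to actual Samba configuration parameters.

```python
from itertools import product

# Hypothetical labels for the five switches listed above.
OPTIONS = (
    "persistent_handles_cap",
    "durable_handles",
    "cluster_cap",
    "scaleout_cap",
    "ca_cap",
)


def test_matrix() -> list[dict[str, bool]]:
    """Enumerate every on/off combination of the five switches."""
    return [
        dict(zip(OPTIONS, values))
        for values in product((False, True), repeat=len(OPTIONS))
    ]


if __name__ == "__main__":
    for config in test_matrix():
        enabled = [name for name, on in config.items() if on]
        print(", ".join(enabled) or "(all off)")
```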
The test server:
- a non-clustered Samba built from master, augmented with a few patches to be able to set the various share caps, fake persistent handles (using plain durable handles) on a CA share, and not fail on REPLAY_OPERATION.
- server joined into a Windows Server 2012 AD domain.
- Two "public" addresses registered in AD-DNS with the server name.
- connect to the share with the explorer
- and start copying a big (2G) file off the server.
- Shortly after the capture begins, kill smbd on the server
- Wait for the windows client to pop up an error dialog
- click on cancel
- Stop the capture.
The key observation is this:
There are essentially only two different retry characteristics: One is used when CA is set, the other when CA is not set. This is in contrast to what is written on the 2012 SDC slides.
- Without CA: the client makes three consecutive reconnect attempts before giving up.
- each attempt consists of:
- arp first IP (except for in the first attempt)
- three tcp syn attempts to first IP with ~0.5sec break (=> 1second)
- arp second IP
- three tcp syn attempts to second IP with ~0.5sec break (=> 1second)
- ==> overall ~2.1 seconds for 1 attempt
- between two attempts, client tries
- dns lookup and netbios name lookup for the network name
- pings addresses (succeeds in our setup)
- does arp requests.
- small breaks
- ==> overall this gap between two attempts lasts ~5.8seconds
- ==> overall ~18seconds
- if a different number of IP addresses were assigned, we would have (by theoretical extrapolation) other concrete numbers, e.g. ~15 seconds for 1 IP and ~21 seconds for 3 IPs, etc. (needs verification)
- With CA: the client makes consecutive retry attempts as above, but many more of them, sometimes adding a longer break when doing DNS lookups, pings and ARP (11.8 instead of 5.8 seconds).
- overall time is 13-14 minutes.
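The arithmetic behind the ~18 second figure for the case with three reconnect attempts, and the extrapolation to other numbers of IP addresses, can be reproduced as follows. This is a sketch under stated assumptions: the per-address cost of 1.05 seconds is derived here from the measured ~2.1 seconds for two addresses, and the function name is made up.

```python
def total_reconnect_seconds(num_ips: int,
                            attempts: int = 3,
                            per_ip_seconds: float = 1.05,
                            gap_seconds: float = 5.8) -> float:
    """Estimated time until the client gives up reconnecting.

    Each attempt costs per_ip_seconds for every address (ARP plus three
    TCP SYNs), and consecutive attempts are separated by a gap filled
    with name lookups, pings and ARP requests.
    """
    if num_ips < 1 or attempts < 1:
        raise ValueError("num_ips and attempts must be at least 1")
    return attempts * num_ips * per_ip_seconds + (attempts - 1) * gap_seconds


if __name__ == "__main__":
    for ips in (1, 2, 3):
        print(ips, "IP(s):", round(total_reconnect_seconds(ips)), "seconds")
```

With the defaults this gives about 15, 18 and 21 seconds for one, two and three addresses. Only the two-address case was actually measured.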
Note on different setup:
Making the server unavailable by adding firewall rules instead of only killing smbd (so that the client gets "ICMP destination/port unreachable" packets from the firewall instead of RESET/ACK packets as replies to the SYN attempts) yields different results. This has not been tested systematically, but it seems that in that case the client hangs much longer in the non-CA case as well, and tries again and again for many minutes.
Note on a different client:
It would be interesting to test, for example, Windows 7, to see whether it shows the same retry behaviour (3 reconnect attempts) with and without durable handles.
Considerations for Samba
- Windows connects to a SO and/or CA share without CLUSTER being set.
- Windows connects to a CLUSTER share without witness being available or doing anything useful. So it appears we can set any of Cluster, SO, CA without great danger.
- Windows clients still request all sorts of leases, oplocks and durable handles on a SO share. Since Samba/ctdb already offers cross-node oplocks and durable handles, it seems that Samba could well offer all of these on a SO share.
- Setting CA on a share makes the Windows 8 client try much longer to reconnect to a lost share/handle (in our case 13-14 minutes instead of 18 seconds). Samba might make use of this by setting only CA, without offering persistent handles.
- MS Cluster Feature Table
- [SMB3_kernel_status samba smb3 status]