Samba3/SMB2

From SambaWiki

Introduction

This page describes the plan, design and work in progress of the efforts to implement SMB2 in Samba3.

  • SMB 2.0 (SMB2.02 dialect) was introduced with Windows Vista/2008.
    • Samba 3.6 added basic support for SMB2.0. This support was essentially complete except for one big item:
      • durable file handles (DONE: Added in Samba 4.0)
  • SMB 2.1 was introduced with Windows 7/Windows 2008R2.
    • Basic support for SMB 2.1 was added in Samba 4.0
    • Features done:
      • multi credit/large MTU (Added in Samba 4.0)
      • dynamic reauthentication (Added in Samba 4.0)
      • writethrough (Added in Samba 4.0)
    • Features TODO:
      • leases (in progress)
      • resilient file handles
      • branch cache
  • SMB 3 (previously known as SMB2.2 dialect) was introduced with Windows 8 and Windows Server 2012. SMB3 dialect defines the following features:
    • Basic support for SMB3 is included in Samba 4.0 and later.
    • security improvements (faster and more secure packet signing, secure negotiate protection against downgrade attacks, and share-level encryption)
    • directory leases
    • persistent file handles
    • multi channel
    • witness notification protocol (a new RPC service)
    • interface discovery (a new FSCTL)
    • SMB direct (SMB 3 over RDMA)
    • Support for a misc. set of loosely related storage features for virtualization (new fsctls, T10 block copy offload, TRIM etc.)
    • remote shadow copy support
    • branch cache v2
  • SMB3.02 was introduced in Windows 8.1 (Windows 'Blue') and Windows Server 2012 R2. Among the new protocol features are those particularly useful for virtualization (HyperV):
    • SMB3.02 dialect is not yet negotiated by Samba servers
    • SMB3.02 dialect can be requested by the Linux cifs client ("vers=3.02" on mount) but the new optional features, unique to SMB3.02, are not requested.
    • Unbuffered I/O flags (i.e. a 'no cache' flag which may be sent on read or write)
    • New RDMA remote invalidate flag
    • MS-RSVD (a set of remoteable FSCTLs that improve "SCSI over SMB3")
    • Asymmetric Shares (extensions to the Witness protocol that allow moving the users of one share to a different server, e.g. for load balancing or maintenance; previously the Witness protocol could only do this per server rather than per share).

Prerequisite / accompanying work

VFS layering: introduce a NT-FSA-layer

Samba3's current VFS is a mixture of NT/SMB level calls (e.g. SMB_VFS_CREATE_FILE, SMB_VFS_GET_NT_ACL) and POSIX calls (e.g. SMB_VFS_OPEN, SMB_VFS_CHOWN). There are even lower level pluggable structures for specific POSIX ACL implementations. The implementations of the NT level VFS calls also call out into the POSIX level calls. The idea of this part is to split up the layers, so that the layering is clean: A NT-Layer on top that implements only the NT/SMB style calls. This should be guided by the FSA description from the Microsoft documentation ([MS-FSA]). Some of the NT/SMB-level calls are not present in the current SMB_VFS yet at all so these would have to be abstracted out of the smbd code. The current implementation of the SMB_VFS calls and some portion of smbd code would become the default "POSIX" backend to the FSA vfs layer.

This step is not strictly necessary technically, but it is a desired foundation for SMB2 and future changes. Since we touch the code anyway, we have a chance to improve the structure and untangle the layers. We don't need to do it in one step and we don't need to implement all of FSA right away, but we can try to improve the layering as we go along and touch calls.

dependence

  • does not depend on other work
  • accompanies work on the whole project
  • The splitting out of NTFSA calls can be made a prerequisite for further work on the corresponding calls in work on SMB 2.0 durable handles and SMB 2.1 (e.g. leases and resilient file handles).

steps

  • define VFS structures:
    • NTFSA layer
    • POSIX backend to call into the current SMB_VFS
  • first implement NTFSA by calling directly into current SMB_VFS code (or move code from smbd into the default NTFSA backend implementation) and have smbd call out into the FSA layer instead
  • start with one call after another, e.g. smb2_create and use NTFSA calls in the implementation.
  • Move logic from the smbd/ code to new NTFSA calls. These call the lower layer SMB_VFS calls.
  • Once the NTFSA calls are used everywhere, one can start to split up and fix the vfs layering underneath, i.e. remove the FSA-style calls from the SMB_VFS etc.
  • data structures: split up files / connections into smbXsrv layer and fsa layer, e.g.:
      smb level       |       ntfsa      |   ntfsa_posix level
      smbXsrv_session -->  ntfsa_context --> users_struct
      smbXsrv_tcon    -->  ntfsa_context --> connections_struct
      smbXsrv_open    -->  ntfsa_open    --> files_struct

SMB 2.0

durable handles

Note: Support for durable handles has been released with Samba 4.0

The following steps lead towards the implementation of durable handles, for now on a single, non-clustered Samba server. For details on durable handles in a CTDB+Samba cluster, see below.

dbwrap work

This is prerequisite work to avoid code duplication in record watching and so on:

  • clean up locking order
  • add dbwrap record watch mechanisms to abstract the mechanisms for waiting for lock records to become available

state: essentially done(?)

rewrite messaging

Note: this is done and will be included in Samba 4.2

For the implementation of durable handles, the smbd processes will need to communicate more than before: When a client reconnects to Samba after a network outage, it will end up at a different smbd. The new smbd will need to work on the files that had been in use as durable handles in the original client. There are two possible approaches: keep files open or reopen files. Depending on the approach, it might become necessary to pass open files from one smbd to another using fd passing. For this, we need to change our messaging. But also for the generally more demanding messaging, it would be extremely useful to get rid of the tdb+signal based messaging and replace it by an asynchronous mechanism based on sockets and in a second step have the messaging infrastructure IDL-generated.

add new tevent_req based API

dependence: This is independent of other tasks.

  • in order to simplify the higher layers a new tevent_req based messaging api is needed.

rewrite messaging with sockets

dependence: This is independent of other tasks.

  • raw messaging: unix domain datagram sockets.
  • if packets are too large for datagrams, then we need stream sockets in addition
  • if possible: keep s3 api messaging_send/receive for a start in order to reduce scope of change

implement messaging based on iRPC

dependence: Based on the two previous steps

  • do "irpc" over this raw messaging
  • rpc services defined by idl, generated by pidl
  • write rpc services for fd-passing
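
The fd-passing that these RPC services would wrap can be sketched with plain POSIX calls. The following is a minimal sketch, assuming a connected unix domain socket and using the standard SCM_RIGHTS ancillary message; the helper names send_fd() and recv_fd() are hypothetical and not part of Samba's messaging API.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical helper: send the open descriptor fd_to_send over the
 * connected unix domain socket sock. Returns 0 on success, -1 on error. */
static int send_fd(int sock, int fd_to_send)
{
	char dummy = 'F';	/* one byte of ordinary data carries the fd */
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {
		struct cmsghdr align;	/* forces correct alignment */
		char buf[CMSG_SPACE(sizeof(int))];
	} ctrl;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&ctrl, 0, sizeof(ctrl));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctrl.buf;
	msg.msg_controllen = sizeof(ctrl.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one descriptor from sock.
 * Returns the new descriptor, or -1 on error. */
static int recv_fd(int sock)
{
	char dummy;
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {
		struct cmsghdr align;
		char buf[CMSG_SPACE(sizeof(int))];
	} ctrl;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd = -1;

	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctrl.buf;
	msg.msg_controllen = sizeof(ctrl.buf);

	if (recvmsg(sock, &msg, 0) != 1) {
		return -1;
	}
	cmsg = CMSG_FIRSTHDR(&msg);
	if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int))) {
		return -1;
	}
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The receiving process gets a new descriptor number that refers to the same open file description, which is what a reconnecting smbd would need in the "keep files open" approach.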

Define New Data Structures

locking/open files (fs layer)

  • define data structures (idl)
  • identify various databases
  • design goal API for each such database or structure

state: essentially done?


sessions/tcons/opens (smb layer)

  • define data structures (idl):
    • struct smbXsrv_session*
    • struct smbXsrv_tcon*
    • struct smbXsrv_open*
  • identify various databases
  • design goal API for each such DB or structure

state: in progress/largely done


Use New Data Structures In The Server

use in FS layer

  • refactor locking code etc: create corresponding APIs with current backend code, use in server
  • extend current structures to match targeted structures
  • change code beneath APIs to use new marshalled databases
  • add logic to use new parts of the structures

state: essentially done?

use in smb layer

  • cleanup/simplify core smbd code
  • make use of new structures

state: essentially done?

Implement durable open and reconnect

Session reconnect with previous session id

  • if previous session exists, tear it down and thereby close tcons and (non-durable) open files
  • open new session

state: done

implement durable open

  • Interpret durable flag in smb2_create call
    • Mark the file handle durable in the database record.
    • confirm durable open in the response to the client
  • change cleanup routines to not delete open file entries for durable handles, even when the opening process does not exist any more

state: done

  • implement scavenger mechanism to clean durable handles without corresponding smbd process after the scavenger timeout (maybe simply as part of cleanup routine?)

state: done

implement durable reconnect with reopening files (CIFS only)

  • implement reconnect for durable handles at SMB2 level after session reconnect and tcon:
    • new smbd looks for file info by persistent ID.
    • smbd should reopen the file based on the information from the databases.
  • fine-tuning of lock/oplock(/lease) behaviour under durable reopen
  • fencing against conflicting opens (==> CIFS only?! - need to keep files open for shell / nfs interop)

state: in progress/largely done

improve nfs/shell interop for conflicting opens

Note: may be implemented later as an add-on

  • write tests to trigger the problem between a connection loss and a non-cifs open of the file that is still a durable handle
  • possibility: create an extra process that reopens the closed files to be able to catch opens from shell or nfs while the cifs client is disconnected (==> there is still a race condition here)


implement durable reconnect with fd-passing

Note: may be implemented later as an add-on

  • have smbd keep files open which are durable when the client is disconnected
  • implement reopen:
    • requests fd-handle (implemented by fd-passing for posix) via irpc messaging

Durable handle cross-node

(To be filled)

SMB 2.1

Unbuffered Write

Supported since 3.6.0.

Multi Credit / Large MTU

Supported since 4.0.0.

Reauthentication

state: done

Supported since 4.0.0.

Leases

new concepts

  • read lease (<=> lvl2 oplock)
  • read + handle lease (new)
  • read + write lease (write locally, do brlocks) (<=> exclusive oplock)
  • read + write + handle lease (<=> batch oplock)
  • new: multiple opens with same client ID without breaking the lease
  • new: client can upgrade a lease (not downgrade)
  • Linux kernel oplocks don't provide the needed features. (They don't even work correctly for oplocks...) ==> SMB-only feature.

analyze exact algorithms

  • object store semantics
  • smbX break semantics (share modes, oplocks, leases), e.g.:
    • oplock break batch --> lvl2 or none
    • lease break r+w+h --> r+w (only handle caching is broken)
  • documents: see e.g.
    • [MS-SMB2], "Algorithms for Leasing in an Object Store"
    • [MS-SMB2], "Object Store indicates a Lease Break"
    • [MS-SMB2], "Object Store indicates an Oplock Break"
  • write additional smbtorture tests
  • this also determines the exact details of the data structures below

vfs-layer: change data model and code

locking.tdb and the locking code (and part of open) is essentially Samba's implementation of the FSA layer aspect of oplocks. The FSA layer ([MS-FSA]) knows only oplocks and makes no separation between leases and oplocks. These FSA level oplocks are able to cover both SMB(2) oplocks and SMB2 leases. The [MS-SMB2] document describes how leases and oplocks are mapped down to the FSA level.

So the basic idea is to extend our FSA layer oplocks, i.e. the locking.tdb data model and the locking/open code so that it can cope with SMB2 leases as well.

  • data model:
    • lease in open_file (locking.tdb)
      • parallel to share_modes[] (opens)
      • we need an array for oplocks[] where opens may have a reference (index value) into the oplocks[] array, multiple opens can reference the same oplock element.
  • extract of new data structures from open_files.idl:
         typedef [public] struct {
             ...
             uint32 oplock_idx; // UINT32_MAX => none
         } share_mode_entry;
         typedef [public,bitmap8bit] bitmap {
             SHARE_MODE_NO_CACHING = 0x00,
             SHARE_MODE_READ_CACHING = 0x01,
             SHARE_MODE_WRITE_CACHING = 0x02,
             SHARE_MODE_HANDLE_CACHING = 0x04
         } share_mode_caching;
         typedef [public,flag(NDR_PAHEX)] struct {
             DATA_BLOB               oplock_key;
             share_mode_caching      current_state;
             /*
              * allowed_shared_state is the mask for the
              * cache level that can be held simultaneously
              * by multiple opens. On the SMB level, this
              * depends on the kind of caching that is in effect.
              *
              * allowed_shared_state is:
              *
              * - for SMB oplocks: READ Caching
              * - for SMB leases:  READ and READ+HANDLE Caching
              *
              * This means that:
              * - level2 oplocks are not granted if there
              *   is already a RH lease.
              * - A R lease is granted if a level2 oplock was
              *   present and a R or RH lease was requested.
              * - A batch oplock is broken to a level2
              *   oplock and a R lease is granted if a
              *   RH lease was requested.
              */
             share_mode_caching      allowed_shared_state;
             /*
              * breaking_to_state indicates to which level
              * the current state is broken when a conflicting
              * request is processed. The calculation is as follows:
              *
              *   breaking_to_state = current_state;
              *   breaking_to_state &= ~(remove_state)
              *   breaking_to_state &= allowed_shared_state
              */
             share_mode_caching      breaking_to_state;
             boolean8                breaking;
             timeval                 break_timeout;
         } share_mode_oplock;
         typedef [public] struct {
             ...
             uint32 num_oplocks;
             [size_is(num_oplocks)] share_mode_oplock oplocks[];
             ...
         } share_mode_data;
  • mapping SMB oplocks / leases --> locking.tdb share_mode_oplock.
       oplocks:
         level-2 oplocks:
            current_state = READ_CACHING
            allowed_shared_state = READ_CACHING
         exclusive oplock:
            current_state = READ_CACHING|WRITE_CACHING
            allowed_shared_state = READ_CACHING
         batch oplock:
            current_state = READ_CACHING|WRITE_CACHING|HANDLE_CACHING
            allowed_shared_state = READ_CACHING
       leases:
         R-lease:
            current_state = READ_CACHING
            allowed_shared_state = READ_CACHING
         RH-lease:
            current_state = READ_CACHING|HANDLE_CACHING
            allowed_shared_state = READ_CACHING|HANDLE_CACHING
         RW-lease:
            current_state = READ_CACHING|WRITE_CACHING
            allowed_shared_state = READ_CACHING
         RWH-lease:
            current_state = READ_CACHING|WRITE_CACHING|HANDLE_CACHING
            allowed_shared_state = READ_CACHING|HANDLE_CACHING
  • break table:
    • todo: verify / fix
    • todo: dependence on share modes
    • note: table cells in the form "granted\brokento"
       requested \ existing || lvl2      | excl      | batch     | r      | rh    | rw     | rwh
       ---------------------------------------------------------------------------------------------
                       lvl2 || lvl2\lvl2 | lvl2\lvl2 | lvl2\lvl2 | lvl2\r | 0\rh  | lvl2\r | 0\rh
                       excl || lvl2\lvl2 | lvl2\lvl2 | lvl2\lvl2 | lvl2\r | 0\rh  | lvl2\r | 0\rh
                       batch|| lvl2\lvl2 | lvl2\lvl2 | lvl2\lvl2 | lvl2\r | 0\rh  | lvl2\r | 0\rh
                       r    || r\lvl2    | r\lvl2    | r\lvl2    | r\r    | r\rh  | r\r    | r\rh
                       rh   || r\lvl2    | r\lvl2    | r\lvl2    | rh\r   | rh\rh | rh\r   | rh\rh
                       rwh  || r\lvl2    | r\lvl2    | r\lvl2    | rh\r   | rh\rh | rh\r   | rh\rh


  • server code:
    • adapt code to implement new/extended semantics
    • this includes restructuring and possibly fixing existing oplock code
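
The breaking_to_state calculation from the IDL comment above can be written out as a small C function. This is only a sketch of the formula as stated in the comment: the bit values are copied from the share_mode_caching bitmap, while the function name and the plain uint8_t representation are assumptions made for illustration.

```c
#include <stdint.h>

/* Caching bits, with the values from the share_mode_caching bitmap. */
#define SHARE_MODE_NO_CACHING     0x00
#define SHARE_MODE_READ_CACHING   0x01
#define SHARE_MODE_WRITE_CACHING  0x02
#define SHARE_MODE_HANDLE_CACHING 0x04

/*
 * Hypothetical helper: apply the three steps from the
 * breaking_to_state comment.
 *   remove_state:         caching bits the conflicting request takes away
 *   allowed_shared_state: bits that several opens may hold at once
 */
static uint8_t breaking_to_state(uint8_t current_state,
				 uint8_t remove_state,
				 uint8_t allowed_shared_state)
{
	uint8_t state = current_state;

	state &= (uint8_t)~remove_state;	/* drop what conflicts */
	state &= allowed_shared_state;		/* keep only shareable bits */
	return state;
}
```

With these definitions, removing WRITE caching from a batch oplock (allowed_shared_state = READ) leaves READ caching, i.e. a level2 oplock, while removing it from an RWH lease (allowed_shared_state = READ|HANDLE) leaves READ|HANDLE. The break table itself is still marked as to be verified, so the sketch should be read in the same spirit.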

SMB-layer: extend data model and add lease code

  • data model:
    • introduce lease
      • lease key
      • filename
      • lease state
      • break to lease state
      • lease break timeout
      • lease opens (--> smbXsrv_open)
      • breaking
      • epoch (SMB 3.0)
      • version
    • [MS-SMB2] list structures: (maybe not necessary for us)
      • lease table
        • client guid
        • lease list (indexed by lease key)
      • global lease table list
      • in samba possibly: one db indexed by the (pair ClientGUID,LeaseKey)
  • server code:
    • implement leasing capability
    • answer lease requesting variants (v1, v2) of create call
      • implement smb level break code
      • this includes restructuring and possibly fixing existing oplock code

Resilient File Handles

Branch Cache

SMB 3.0

Security Features

  • Encryption and improved packet signing on the server side (done in Samba 4.0)
  • Encryption and improved packet signing for the smb3 client tools (done in Samba 4.1)
  • Secure negotiate (complete)

Replay/Retry Detection

locks

  • "LockSequence" number (in SMB2 Lock request) uniquely identifies (un)lock request among all (un)lock requests to the same file
  • applies to SMB version >= 2.1
    • resilient handles, multi channel, persistent, ...
  • array of 64 lock requests per open on client and server
    • (client can only have 64 outstanding lock/unlock requests per open)
    • bucket index = index (0..63) into array of outstanding lock requests (bucket)
    • bucket number = bucket index + 1
    • client sends lock sequence = (bucket number << 4) + sequence number, where the sequence number is a 4-bit value that is incremented modulo 16
    • server stores the sequence number in LockSequenceArray by reversed calculation after successful lock processing:
      • index = (locksequence >> 4) - 1
      • sequence = least 4 bits of lock sequence
    • if the server receives a lock request with a sequence already existing in the array, it simply replies with success
  • server implementation: simple
  • TEST:
    • write tests in smbtorture and on windows to verify behaviour
  • TODO: ask dochelp:
    • about initialization with 0xFF instead of 0x00
    • about scope (resilient, leasing, ... + ?)
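
The encoding and the reversed calculation described above could look like this in C. It is a minimal sketch: all names are hypothetical, and the separate "valid" flags are an assumption that sidesteps the open question about initializing the array with 0xFF or 0x00.

```c
#include <stdbool.h>
#include <stdint.h>

#define LOCK_SEQUENCE_BUCKETS 64

/* Per-open replay state on the server (hypothetical layout). */
struct lock_sequence_array {
	uint8_t sequence[LOCK_SEQUENCE_BUCKETS];
	bool valid[LOCK_SEQUENCE_BUCKETS];	/* assumption of this sketch */
};

/* Client side: combine a bucket index (0..63) and a sequence number. */
static uint32_t lock_sequence_encode(uint32_t bucket_index, uint32_t sequence)
{
	uint32_t bucket_number = bucket_index + 1;

	return (bucket_number << 4) + (sequence % 16);
}

/* Server side: split a lock sequence into bucket index and sequence.
 * Returns false if the bucket number is outside 1..64. */
static bool lock_sequence_decode(uint32_t lock_sequence,
				 uint32_t *bucket_index, uint8_t *sequence)
{
	uint32_t bucket_number = lock_sequence >> 4;

	if (bucket_number < 1 || bucket_number > LOCK_SEQUENCE_BUCKETS) {
		return false;
	}
	*bucket_index = bucket_number - 1;
	*sequence = lock_sequence & 0x0F;	/* least 4 bits */
	return true;
}

/* True if this lock sequence was already processed successfully. */
static bool lock_sequence_is_replay(const struct lock_sequence_array *a,
				    uint32_t lock_sequence)
{
	uint32_t idx;
	uint8_t seq;

	if (!lock_sequence_decode(lock_sequence, &idx, &seq)) {
		return false;
	}
	return a->valid[idx] && a->sequence[idx] == seq;
}

/* Remember the sequence after successful lock processing. */
static void lock_sequence_store(struct lock_sequence_array *a,
				uint32_t lock_sequence)
{
	uint32_t idx;
	uint8_t seq;

	if (!lock_sequence_decode(lock_sequence, &idx, &seq)) {
		return;
	}
	a->sequence[idx] = seq;
	a->valid[idx] = true;
}
```

A server would call lock_sequence_is_replay() before processing a lock request and reply with success right away if it returns true; lock_sequence_store() runs only after the lock has been processed successfully.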

create replay

  • replay detection by CreateGUID (part of durable request v2)
    • client retries create on different channel (multichannel) with SMB2_FLAGS_REPLAY_OPERATION set (otherwise identical)
    • server detects replay by CreateGUID (see [MS-SMB2], 4.9)
  • We need to store most of the input parameters of the SMB2 Create in smbXsrv_open_global and verify on replay.
  • We need a smbXsrv_open_global_create_guid.tdb (as index, maybe prefixed by client_guid?)
  • server behaviour:
    • when create with durable_v2 request and REPLAY_OPERATION flag set:
      • look for open associated to create guid
      • if not found, proceed with open execution
      • if found, check parameters of handle against the replay request
        • durable, file attributes, create disp, persistent, oplock/lease state
      • if parameters don't match ==> fail with invalid parameter
      • if parameters match ==> return existing handle
      • Q: Do we have to check that the handle is disconnected or open in the same 'smbd' (--> multi channel)?
  • TEST:
    • smbtorture tests and possibly windows tests
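
The server behaviour listed above boils down to a three-way decision. The sketch below assumes that the lookup by CreateGUID has already happened and passes its result in as a pointer; the struct is a hypothetical subset of the parameters that would be stored in smbXsrv_open_global, and all names are illustrative.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Outcome of handling a durable v2 create with REPLAY_OPERATION set. */
enum create_replay_action {
	CREATE_REPLAY_PROCEED_WITH_OPEN,	/* no open found for the CreateGUID */
	CREATE_REPLAY_RETURN_EXISTING,		/* parameters match */
	CREATE_REPLAY_INVALID_PARAMETER		/* parameters differ */
};

/* Hypothetical subset of the create parameters to verify on replay. */
struct create_replay_params {
	bool durable;
	bool persistent;
	uint32_t file_attributes;
	uint32_t create_disposition;
	uint32_t oplock_or_lease_state;
};

static bool create_replay_params_equal(const struct create_replay_params *a,
				       const struct create_replay_params *b)
{
	return a->durable == b->durable &&
	       a->persistent == b->persistent &&
	       a->file_attributes == b->file_attributes &&
	       a->create_disposition == b->create_disposition &&
	       a->oplock_or_lease_state == b->oplock_or_lease_state;
}

/* existing is NULL when no open is associated with the CreateGUID. */
static enum create_replay_action
create_replay_decide(const struct create_replay_params *existing,
		     const struct create_replay_params *request)
{
	if (existing == NULL) {
		return CREATE_REPLAY_PROCEED_WITH_OPEN;
	}
	if (!create_replay_params_equal(existing, request)) {
		return CREATE_REPLAY_INVALID_PARAMETER;
	}
	return CREATE_REPLAY_RETURN_EXISTING;
}
```

The open question about whether the handle must be disconnected or open in the same smbd is not modelled here.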

application instance ID

  • 16-byte value that associates a handle with a calling application
  • handling the SMB2_APP_INSTANCE_ID create context
    • only together with SMB2_CREATE_DURABLE_HANDLE_REQUEST_V2 or SMB2_CREATE_DURABLE_RECONNECT_V2, according to [MS-SMB2]
  • We need a smbXsrv_open_global_app_instance_id.tdb (as index, maybe prefixed by persistent file id? )
  • server behaviour:
    • if smb2_create request contains SMB2_APP_INSTANCE_ID context and SMB2_CREATE_DURABLE_HANDLE_REQUEST_V2 or SMB2_CREATE_DURABLE_RECONNECT_V2:
      • server looks for an open file handle:
        • same AppInstanceId
        • same path name
        • same share
        • different ClientGUID
        • granted_access containing FILE_GENERIC_READ
      • if found: close the handle before proceeding with open request.
  • implementation similar to session reconnect (handling previous session-id)
  • TEST:
    • smbtorture tests and possibly windows tests
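
The matching rule for an existing handle can be expressed as a single predicate. This is a sketch under several assumptions: the struct and function names are hypothetical, share and path names are compared with a plain strcmp() (real code would need the share's case semantics), and "containing FILE_GENERIC_READ" is read as "all bits of FILE_GENERIC_READ are set".

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Windows access mask value for FILE_GENERIC_READ. */
#define FILE_GENERIC_READ 0x00120089u

/* Hypothetical view of the fields needed for the comparison. */
struct app_instance_open {
	uint8_t app_instance_id[16];
	uint8_t client_guid[16];
	const char *share;
	const char *path;
	uint32_t granted_access;	/* only meaningful for the existing open */
};

/* True if the existing open has to be closed before the new create
 * request is processed. */
static bool app_instance_must_close(const struct app_instance_open *existing,
				    const struct app_instance_open *request)
{
	if (memcmp(existing->app_instance_id, request->app_instance_id, 16) != 0) {
		return false;	/* different AppInstanceId */
	}
	if (strcmp(existing->share, request->share) != 0 ||
	    strcmp(existing->path, request->path) != 0) {
		return false;	/* different share or path name */
	}
	if (memcmp(existing->client_guid, request->client_guid, 16) == 0) {
		return false;	/* same ClientGUID: not a failover */
	}
	return (existing->granted_access & FILE_GENERIC_READ) == FILE_GENERIC_READ;
}
```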


write: channel sequence number (SMB 3.0)

  • incremented by client for each network (channel) failure:
    • client stores channel sequence number on session.
    • client sets channel sequence number in the SMB2 header of any request if SMB dialect is 3.0 or higher and if the connection supports multi channel or persistent handles
    • client increments channel sequence number on session if there is a disconnect on the transport of a channel (and if there is more than one channel associated to the session ??)
  • server behaviour (according to [MS-SMB2]):
    • server stores channel sequence on open instead of session
    • server compares channel sequence number of incoming packet with channel sequence number stored with the corresponding open
    • server acts differently based on the result of the comparison, the values of the counters OutstandingRequestCount and OutstandingPreRequestCount, and the presence of the REPLAY_OPERATION_FLAG.
  • possible improvement:
    • handle 16-bit wraparound correctly (or better?)
    • maybe use uint64_t channel_generation and uint16_t channel_sequence on the session and open: channel_generation counts the overflows and channel_sequence appears on the wire. If channel_generation of the session is higher than the channel_generation of the open, we know that the channel_sequence is newer.
  • additional failure conditions for stale channel sequence numbers
  • write replay for persistent handles with single channel?
  • TESTS
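
The suggested generation counter can be sketched as follows. The field names channel_generation and channel_sequence come from the idea above; the struct and function names are hypothetical. How the server learns about an overflow from the 16-bit values seen on the wire is left open here.

```c
#include <stdint.h>

/* A 16-bit wire counter extended by an overflow count. */
struct channel_counter {
	uint64_t channel_generation;	/* counts 16-bit overflows, not on the wire */
	uint16_t channel_sequence;	/* value carried in the SMB2 header */
};

/* Advance the counter by one, tracking the 16-bit wraparound. */
static void channel_counter_bump(struct channel_counter *c)
{
	c->channel_sequence++;
	if (c->channel_sequence == 0) {
		c->channel_generation++;
	}
}

/* Returns <0, 0 or >0 if a is older than, equal to or newer than b. */
static int channel_counter_cmp(const struct channel_counter *a,
			       const struct channel_counter *b)
{
	if (a->channel_generation != b->channel_generation) {
		return a->channel_generation < b->channel_generation ? -1 : 1;
	}
	if (a->channel_sequence != b->channel_sequence) {
		return a->channel_sequence < b->channel_sequence ? -1 : 1;
	}
	return 0;
}
```

Comparing the pair instead of the bare 16-bit value means that a session counter that wrapped from 0xFFFF to 0 still compares as newer than the value stored on the open.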

Directory Leases

Ideas/concepts

  • Directory leases are a mechanism for caching metadata read operations/directory listings of child objects of a directory (File leases are a mechanism for caching the data operations.)
  • The client maintains separate caches for each user context, but still using just one lease to invalidate the cache. This is needed because access based enumeration may cause different directory listing depending on the user context.
  • Only read or read+handle caching is granted, no write caching.
  • Meta data updates on a child object (file or directory) CHILD1 of a directory DIR1 trigger lease breaks for all directory leases on DIR1 with a lease key different from the parent lease key of CHILD1. (Also if CHILD1 is opened without a parent lease key.)
  • Explicit meta data updates are propagated at set time (TODO testing)
  • Implicit meta data updates (e.g. write time) are propagated at close time.
  • Meta data updates of children revoke read caching (including handle caching).
  • Revoking of read caching is triggered in the same code path where change notifications are triggered.
  • Handle caching is revoked on SHARING_VIOLATION errors (against the directory handle's share mode) on open. (Same behaviour as with handle caching on files).
  • Directory lease breaks do not block any meta data operations on child objects, but for RH leases the server requires lease break acknowledgements.
    • With RH lease: [[1]]
    • With R lease: [[2]]
  • Linux kernel oplocks don't know the concept of directory caching.
    • ==> SMB-only feature.

analyze exact algorithms

  • depends on leases exact algorithms
  • object store semantics
  • smbX break semantics
    • which operations trigger lease breaks.
  • documents: see e.g.
    • [MS-SMB2], "Algorithms for Leasing in an Object Store"
    • [MS-SMB2], "Object Store indicates a Lease Break"
    • [MS-SMB2], "Object Store indicates an Oplock Break"
  • meta data on hardlinks are only updated on open
  • write additional smbtorture tests
  • this also determines the exact details of the data structure in 9.2 and 9.3

vfs-layer: change data model and code

  • depends on lease data model
  • data model:
    • lease in open_file (locking.tdb)
      • parallel to share_modes[] (opens)
      • we need an array for parents[] where opens may have a reference (index value) into the parents[] array.
      • multiple opens most likely reference the same parent element (unless there are hardlinks).
  • extract of new data structures from open_files.idl:
         typedef [public] struct {
             ...
             uint32 oplock_idx;
             uint32 parent_idx;
             DATA_BLOB parent_oplock_key; // most likely only for debugging
         } share_mode_entry;
         
         typedef [public] struct {
             uint32          name_hash;
             file_id         parent_file_id;
         } share_mode_parent;
         
         typedef [public] struct {
             ...
             uint32 num_parents;
             [size_is(num_parents)] share_mode_parent parents[];
             ...
         } share_mode_data;
  • server code:
    • adapt code to implement new/extended semantics

SMB-layer: extend data model and extend lease code

  • depends on 3.3
  • data model:
    • introduce and pass down (to vfs) parent lease key
    • Q: do we need to store this at smb level at all?
  • server code:
    • implement directory leasing capability
    • answer lease v2 of create call for directories
    • interpret parent lease key as part of lease v2 request blob
  • constraint: no interop! ==> SMB-only shares

Persistent File Handles

Introduction

Persistent file handles are like durable file handles with strong guarantees. They are requested with the durable v2 create request blob with the persistent flag set to true. The server only grants persistent handles on shares that are marked CA (continuously available).

There is no finished design yet for the implementation of persistent handles. The foundations have been laid with the introduction of durable handles. The challenge is to implement the additional guarantees.

Requirements

  • replay/retry mechanisms
  • CA shares

Ideas

  • some dbs need to be made (at least partially) persistent:
    • smbXsrv_open_global
    • locking
    • brlock
    • index databases for smbXsrv_open (CreateGUID, AppInstanceID, LeaseKey)
  • using persistent copies is probably not an option, because persistent transactions are too expensive
  • maybe we could introduce an intermediate variant between volatile and persistent dbs, where individual records are made persistent.
  • Maybe we also need to make a parallel copy of these databases for persistent files / ca shares so that other shares / file handles can be served with the known performance.

Multi Channel

Note: see work in progress at https://git.samba.org/?p=metze/samba/wip.git;a=shortlog;h=refs/heads/master3-multi-channel

scope

  • CIFS-only
  • multiple channels on a single node, not on multiple nodes simultaneously (like with Windows 8)

Ideas

  • TCP-connect, session bind
  • ==> move tcp-socket-fd to the smbd already serving the existing session.
  • maybe move TCP-socket already in negprot to the smbd serving connection(s) with the same ClientGUID (this would reduce problems where the session bind is not the first call after negprot)
  • With this mode, only one process has the file open for multi-channel sessions, so we only need to do book-keeping on the smb level (replay/retry counters, channel sequence numbers, ....) and not on the posix/file system level

interface discovery

  • document: [MS-SMB2]
  • retrieve information about attached network interfaces from kernel
    • possible: ethtool ioctl interface (for ethernet devices)
  • translate kernel information to format used by windows:
    • interface index
    • capability (rss/rdma capable)
    • link speed
    • sockaddr_storage
  • implement FSCTL_QUERY_NETWORK_INTERFACE_INFO
  • TESTS (smbtorture)
  • NOTE:
    • in a ctdb cluster, we should make sure that we only return ip addresses local to the node. ctdb may need to be changed, or be configured to not handle public addresses.
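
As a starting point for retrieving interface information from the kernel, the standard getifaddrs() call already yields the interface name, index and socket address. The sketch below only enumerates IPv4/IPv6 addresses and prints them; link speed and RSS/RDMA capability would still have to come from elsewhere (e.g. the ethtool ioctl mentioned above), and the function name is hypothetical.

```c
#include <ifaddrs.h>
#include <net/if.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Print index, name and address family for each IPv4/IPv6 address.
 * Returns the number of entries printed, or -1 on error. */
static int list_network_interfaces(void)
{
	struct ifaddrs *ifap = NULL;
	struct ifaddrs *ifa;
	int count = 0;

	if (getifaddrs(&ifap) != 0) {
		return -1;
	}
	for (ifa = ifap; ifa != NULL; ifa = ifa->ifa_next) {
		int family;

		if (ifa->ifa_addr == NULL) {
			continue;	/* interface without an address */
		}
		family = ifa->ifa_addr->sa_family;
		if (family != AF_INET && family != AF_INET6) {
			continue;	/* skip e.g. packet sockets */
		}
		printf("if_index=%u name=%s family=%s\n",
		       if_nametoindex(ifa->ifa_name), ifa->ifa_name,
		       family == AF_INET ? "IPv4" : "IPv6");
		count++;
	}
	freeifaddrs(ifap);
	return count;
}
```

Each ifa_addr can be copied into a sockaddr_storage for the response; in a ctdb cluster the list would additionally need to be filtered to addresses local to the node.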

fd-passing to transfer tcp connection between smbds

  • EITHER: rewrite existing messaging:
    • essentially adapted from SMB2 plan
    • rewrite messaging with sockets:
      • raw messaging: unix domain datagram sockets.
      • if there are too large packets, then we need stream sockets in addition or implement fragmentation
      • unify messaging between source3 and source4
      • if possible: keep s3 api messaging_send/receive for a start in order to reduce scope of change
    • rewrite messaging, adding iRPC:
      • do "irpc" over this raw messaging
      • rpc services defined by idl, generated by pidl
      • write rpc services for fd-passing
  • OR: specialized mechanism

make sure one smbd process can serve multiple transport connections

  • determine global variables to be eliminated.
  • eliminate them.
  • change users of exit_server[_cleanly]() to use smbd_server_connection_terminate().
  • the server may only be terminated when the last connection has been terminated.
  • ==> possible: keep server running even without transport connections when there are disconnected durable opens.
  • TESTS


transfer tcp socket in negprot based on ClientGUID

  • when a negprot request is received (new TCP connection), and there is already an smbd process serving the same client, transfer the tcp socket fd, maybe some meta data and the negprot request (to that smbd process).
  • We need to provide a means for finding the server (smbd) based on the client GUID.
    • new index database
    • or smbd listening on unix domain socket with filename == ClientGUID
    • ...
  • The other smbd must receive the incoming socket fd and construct a smbXsrv_connection struct (and smbd_server_connection) from the given meta data. Then it has to inspect and process the transferred negprot.
  • TESTS (same/different client GUID)


implement channel bind session setup

  • special session setup binds a transport connection to an existing session.
  • this is performed in single smbd after negprot transferred the connection.
  • smbXsrv_session struct (and more..) must be shared by the smbXsrv_connection structs (and smbd_server_connection) for the multiple transport connections.
  • That should basically implement multi channel, modulo bugs to fix.


Server-Client retry

  • enable keepalives
  • retry to send Oplock/Lease Breaks on a different channel (if there is more than one channel)

Witness Notification Protocol

ideas

  • set SMB2_SHARE_CAP_CLUSTER
    • ([MS-SMB2]: The specified share is present on a server configuration which provides monitoring of the availability of share through the Witness service specified in [MS-SWN].)
    • seems to work independently of SMB2_SHARE_CAP_SCALEOUT and SMB2_SHARE_CAP_CONTINUOUS_AVAILABILITY, which means it could be used in the current Samba/CTDB design.
  • check how client behaves (fail over, etc)
    • initial research showed some file copy problems
      • the windows 2012 client was moved to a different ip each 15 seconds while copying a large file with the windows explorer.
      • On the network the durable (v2) reconnect looked good and the client continued to send write requests
      • Then the client stopped with setting delete-on-close followed by a close, without an obvious reason.
      • The error is reported in the GUI.
  • Simplifications:
    • the initial approach monitors all shares together; this is what ctdb is currently able to provide

DCERPC infrastructure

  • async dcerpc infrastructure needed
  • MGMT interface support needed
  • use single process model
  • see DCERPC

CTDB changes

  • ctdbd needs at least one fixed public address per node
  • ctdbd may need to allow clients to register for more events.
    • Tests will show what we really need


Witness process (maybe child of smbd)

  • ask ctdbd about all public addresses in the cluster (both fixed and dynamic)
  • maintain a global state of our current view of the cluster
  • register for a lot of relevant CTDB events (TAKE_IP,RELEASE_IP, TAKEOVER_RUN, REBALANCE, RECONFIGURE...).
  • listen for IP address changes in the kernel (AF_NETLINK)
  • witness_GetInterfaceList() should return the list of all public addresses
    • all fixed addresses get the INTERFACE_WITNESS flag
    • the node health state is used to set the state to AVAILABLE or UNAVAILABLE
  • witness_Register() should register the client in a global state
    • maybe store the registration in a non-persistent tdb
  • witness_UnRegister() should remove the state attached to a client
  • witness_AsyncNotify() should remember the dcerpc request on the client state and not respond to the client yet.
  • when we receive events from ctdb we need to create RESOURCE_CHANGE or MOVE_REQUEST messages and attach them to the registered client states
    • if there's a pending witness_AsyncNotify() we should send a dcerpc response
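
The parked-request behaviour described for witness_AsyncNotify() could look roughly like the sketch below. It models one registered client with a small message queue: a notify call is parked until a cluster event queues a message, and a queued message is delivered as soon as a notify call arrives. All `demo_*` names, the fixed queue size and the "answer immediately if something is already queued" rule are assumptions for illustration; no real dcerpc or ctdb API is used.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical message kinds, named after the messages in the list above. */
enum demo_msg_type { DEMO_RESOURCE_CHANGE, DEMO_MOVE_REQUEST };

#define DEMO_MAX_QUEUED 8

/* Hypothetical per-client registration state. */
struct demo_witness_client {
	bool notify_pending;	/* an AsyncNotify request is parked */
	enum demo_msg_type queue[DEMO_MAX_QUEUED];
	size_t queued;
	size_t responses_sent;
	enum demo_msg_type last_response;
};

/* Complete the parked request with the oldest queued message. */
static void demo_send_response(struct demo_witness_client *c)
{
	size_t i;

	assert(c->notify_pending && c->queued > 0);
	c->last_response = c->queue[0];
	for (i = 1; i < c->queued; i++) {
		c->queue[i - 1] = c->queue[i];
	}
	c->queued--;
	c->notify_pending = false;
	c->responses_sent++;
}

/* An AsyncNotify call arrives: answer at once if a message waits, else park. */
static void demo_async_notify(struct demo_witness_client *c)
{
	c->notify_pending = true;
	if (c->queued > 0) {
		demo_send_response(c);
	}
}

/* A cluster event arrives: queue it and complete a parked request, if any. */
static bool demo_cluster_event(struct demo_witness_client *c,
			       enum demo_msg_type type)
{
	if (c->queued == DEMO_MAX_QUEUED) {
		return false;	/* queue full, caller must decide what to drop */
	}
	c->queue[c->queued++] = type;
	if (c->notify_pending) {
		demo_send_response(c);
	}
	return true;
}
```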

Admin Tools

  • we need a tool to display the witness registrations
  • we need a tool to move a client to a different node


SMB Direct

aka SMB 3.0 over RDMA

Requires Multi-Channel

ideas

  • TODO: make libibverbs/librdmacm fork() safe
  • TODO: add support for "FD-passing" to libibverbs/librdmacm

Wireshark support

  • Available with wireshark-1.11.3 and above

buffer abstraction

  • we need an abstraction for buffers, which can be
    • a "memory buffer" represented as uint8_t array (the default),
    • a "file buffer" represented as (fd, offset, length),
    • a "rdma buffer" represented as SMB_DIRECT_BUFFER_DESCRIPTOR_1 array
    • or other things.
  • There needs to be a tevent_req based _send/_recv function to copy data between two buffers.
  • This needs to be used instead of the explicit SMB_VFS_SENDFILE/SMB_VFS_RECVFILE or SMB_VFS_PREAD_*/SMB_VFS_PWRITE_*
  • It should be just SMB_VFS_READ_BUFFER_SEND/RECV and SMB_VFS_WRITE_BUFFER_SEND/RECV, where the SMB layer provides an abstracted buffer and the SMB_VFS layer copies from/to the provided buffer.
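
One possible shape for such an abstracted buffer is a tagged union over the three representations listed above. This is only a sketch: the `demo_*` names are hypothetical, and the descriptor fields (offset, token, length) are an assumption based on the SMB_DIRECT_BUFFER_DESCRIPTOR_V1 layout in [MS-SMB2]. The async copy function itself is not shown; the helper only demonstrates how generic code can handle every buffer kind through one type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/* Hypothetical buffer kinds, one per representation in the list above. */
enum demo_buffer_type {
	DEMO_BUFFER_MEMORY,
	DEMO_BUFFER_FILE,
	DEMO_BUFFER_RDMA
};

/* Assumed descriptor layout: remote offset, steering token, length. */
struct demo_rdma_descriptor {
	uint64_t offset;
	uint32_t token;
	uint32_t length;
};

struct demo_buffer {
	enum demo_buffer_type type;
	union {
		struct { uint8_t *data; size_t length; } mem;
		struct { int fd; off_t offset; size_t length; } file;
		struct {
			const struct demo_rdma_descriptor *descs;
			size_t count;
		} rdma;
	} u;
};

/* Total number of bytes described by a buffer, whatever its kind. */
static size_t demo_buffer_length(const struct demo_buffer *b)
{
	size_t i;
	size_t total = 0;

	assert(b != NULL);
	switch (b->type) {
	case DEMO_BUFFER_MEMORY:
		return b->u.mem.length;
	case DEMO_BUFFER_FILE:
		return b->u.file.length;
	case DEMO_BUFFER_RDMA:
		for (i = 0; i < b->u.rdma.count; i++) {
			total += b->u.rdma.descs[i].length;
		}
		return total;
	}
	return 0;
}
```

A copy function between two such buffers can then pick the cheapest strategy per pair of types (for example sendfile for file to socket, or RDMA write for file to rdma) without the SMB layer knowing the details.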


smb_transport abstraction

  • The socket handling should be abstracted in a way that the SMB layer only receives and submits buffers with SMB1/2 PDUs.
  • The SMB layer never sees the NBT header
  • The abstraction layer should be used for client and server side
  • start with a simple design
   struct smb_transport_ops {
      /* name of the transport backend */
      const char *name;

      /* submit one SMB1/2 PDU, given as an iovec array, to the transport */
      struct tevent_req *(*write_pdu_send)(TALLOC_CTX *mem_ctx,
                                           struct tevent_context *ev,
                                           struct smb_transport *transport,
                                           struct iovec *vector,
                                           int count);
      NTSTATUS (*write_pdu_recv)(struct tevent_req *req);

      /* wait for one complete SMB1/2 PDU, delivered without the NBT header */
      struct tevent_req *(*read_pdu_send)(TALLOC_CTX *mem_ctx,
                                          struct tevent_context *ev,
                                          struct smb_transport *transport);
      NTSTATUS (*read_pdu_recv)(struct tevent_req *req,
                                TALLOC_CTX *mem_ctx,
                                struct iovec *vector);
   };
  • maybe some keepalive hooks are also needed
  • How can we add sendfile/recvfile support, using the buffer abstraction?
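
To illustrate the point that the SMB layer never sees the NBT header, here is a sketch of what a TCP backend could do before handing a PDU upwards. It assumes the direct TCP framing from [MS-SMB2] (one zero byte followed by a 3-byte big-endian length) and works synchronously on a byte array to stay self-contained; a real backend would be tevent based. The name `demo_tcp_unframe` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/*
 * Look for one complete PDU at the start of a received byte stream.
 *
 * Returns  1 and fills *pdu (pointing past the 4-byte header) and
 *            *consumed when a complete PDU is available,
 *          0 when more data is needed,
 *         -1 when the header is not a plain session message.
 */
static int demo_tcp_unframe(uint8_t *stream, size_t avail,
			    struct iovec *pdu, size_t *consumed)
{
	size_t len;

	assert(pdu != NULL && consumed != NULL);
	if (avail < 4) {
		return 0;
	}
	if (stream[0] != 0) {
		return -1;
	}
	/* 24-bit big-endian payload length */
	len = ((size_t)stream[1] << 16) | ((size_t)stream[2] << 8) | stream[3];
	if (avail - 4 < len) {
		return 0;
	}
	pdu->iov_base = stream + 4;
	pdu->iov_len = len;
	*consumed = 4 + len;
	return 1;
}
```

An SMB-Direct backend would have no such header to strip, which is one reason to keep the framing below the abstraction.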

SMB-Direct backend for smb_transport abstraction

  • Research regarding SMB-Direct credits
  • prototype available (based on current libibverbs/librdmacm), but cleanup and testing needed
  • add RDMA Read/Write using the buffer abstraction to hide the details of ibv_post_send(IBV_WR_RDMA_READ) and ibv_post_send(IBV_WR_RDMA_WRITE)
  • The buffer abstraction could try to mmap the file when copying a "rdma buffer" from/to a "file buffer", which hopefully provides zero copy.


Listen on RDMA interfaces in the server

  • this requires usable libraries (libibverbs/librdmacm):
    • fork-safety and support for "FD-passing" are needed because RDMA-connections are added to sessions via multi-channel
    • (see implementation details for multi-channel)
  • provide RDMA interface in FSCTL_QUERY_NETWORK_INTERFACE_INFO response


RDMA Read/Write support in the server

  • smb2_read/smb2_write need to understand SMB2_CHANNEL_RDMA_V1 and parse the SMB_DIRECT_BUFFER_DESCRIPTOR_1 array
  • create abstracted buffers out of the SMB_DIRECT_BUFFER_DESCRIPTOR_1 array.
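
Parsing the descriptor array could be sketched as follows. The layout used here (8-byte offset, 4-byte token, 4-byte length, all little-endian, 16 bytes per entry) is taken from the SMB_DIRECT_BUFFER_DESCRIPTOR_V1 definition in [MS-SMB2] as an assumption; the `demo_*` names are hypothetical and no Samba helper is used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_DESC_SIZE 16	/* assumed wire size of one descriptor */

struct demo_rdma_desc {
	uint64_t offset;	/* remote buffer offset */
	uint32_t token;		/* remote steering token */
	uint32_t length;	/* length of the remote buffer in bytes */
};

/* Read a little-endian 32-bit value from unaligned memory. */
static uint32_t demo_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Read a little-endian 64-bit value from unaligned memory. */
static uint64_t demo_le64(const uint8_t *p)
{
	return (uint64_t)demo_le32(p) | ((uint64_t)demo_le32(p + 4) << 32);
}

/*
 * Parse the channel info blob of a read/write request into descriptors.
 * Returns the number of descriptors, or -1 if the blob length is not a
 * multiple of the descriptor size or the output array is too small.
 */
static int demo_parse_descs(const uint8_t *blob, size_t blob_len,
			    struct demo_rdma_desc *out, size_t max_out)
{
	size_t i;
	size_t count;

	assert(blob != NULL || blob_len == 0);
	if (blob_len % DEMO_DESC_SIZE != 0) {
		return -1;
	}
	count = blob_len / DEMO_DESC_SIZE;
	if (count > max_out) {
		return -1;
	}
	for (i = 0; i < count; i++) {
		const uint8_t *p = blob + i * DEMO_DESC_SIZE;

		out[i].offset = demo_le64(p);
		out[i].token = demo_le32(p + 8);
		out[i].length = demo_le32(p + 12);
	}
	return (int)count;
}
```

The parsed entries are what an abstracted "rdma buffer" would wrap before the data is moved with RDMA Read/Write.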

Remote Shadow Copy (FSRVP)

Not an SMB 3.0 specific feature per se.

Branch Cache v2

Branch Cache is a wide area network caching protocol implemented in Windows 7 and later. It allows the server to return hashes of the data to the client, and the client can then use these hashes to request copies of the actual data from nearby systems, optimizing network bandwidth. Although Branch Cache is not SMB3 specific (it is also used with HTTP, for example), it is useful in conjunction with SMB2.1 and SMB3 file serving to improve WAN performance and bandwidth usage. See MS-PCCRC, MS-PCCRD, MS-PCCRR.

SMB3.02

See http://www.snia.org/sites/default/files2/SDC2013/presentations/SMB3/DavidKruse_SMB3_Update.pdf for details. SMB3.02 is very similar to SMB3, with some optional features added. Note that the Linux CIFS client can negotiate the SMB3.02 dialect (with these optional features disabled) by specifying vers=3.02 on mount. The Samba server cannot currently negotiate SMB3.02 because it does not support the new READ/WRITE flags. The RDMA and Witness protocol improvements for SMB3.02 are not possible until the optional SMB3.0 features they are based on have been added.

RDMA Improvements

SMB Direct Remote Invalidation. Improves performance.

New ReadWrite Flags

SMB2_READFLAG_UNBUFFERED and SMB2_WRITEFLAG_UNBUFFERED allow the client to indicate, for each individual I/O request (read or write), whether the server should cache the data. HyperV apparently uses this (for backup?) to avoid caching data that is not going to be requested again.

Asymmetric Shares

The Witness protocol can now signal Windows clients to 'move' from one share to another. This allows more flexible migration: a volume can be taken offline without taking the whole server down, and applications continue to run even while the storage they use is moved. Previous versions of the witness protocol allowed users of one server to be moved to another server. The new version allows more granular movement: clients using a particular share can now be redirected on the fly to another share.

Cluster-Wide Durable Handles

Work in progress branches

Note: this is really work in progress!!! And some branches might be outdated!

Talks

Demos
