Samba3/SMB2: Difference between revisions

From SambaWiki
Revision as of 19:11, 16 September 2013

Introduction

This page describes the plan, design and work in progress of the efforts to implement SMB2 in Samba3.

  • SMB 2.0 (SMB2.02 dialect) was introduced with Windows Vista/2008. Samba 3.6 added support for SMB2.0. This support is essentially complete except for one big item:
    • durable file handles
  • SMB 2.1 was introduced with Windows 7/Windows 2008R2. The major features that remain to be implemented are:
    • multi credit/large MTU
    • reauthentication
    • leases
    • resilient file handles
    • branch cache
    • unbuffered write
  • SMB 3 (previously known as SMB2.2 dialect) was introduced with Windows 8 and Windows Server 2012. The features include:
    • directory leases
    • persistent file handles
    • multi channel
    • witness notification protocol (a new RPC service)
    • interface discovery (a new FSCTL)
    • SMB direct (SMB 3 over RDMA)
    • remote shadow copy support
    • branch cache v2

Prerequisite / accompanying work

VFS layering: introduce a NT-FSA-layer

Samba3's current VFS is a mixture of NT/SMB level calls (e.g. SMB_VFS_CREATE_FILE, SMB_VFS_GET_NT_ACL) and POSIX calls (e.g. SMB_VFS_OPEN, SMB_VFS_CHOWN). There are even lower level pluggable structures for specific POSIX ACL implementations. The implementations of the NT level VFS calls also call out into the POSIX level calls. The idea of this part is to split up the layers so that the layering is clean: an NT layer on top that implements only the NT/SMB style calls. This should be guided by the FSA description from the Microsoft documentation ([MS-FSA]). Some of the NT/SMB-level calls are not present in the current SMB_VFS at all, so these would have to be abstracted out of the smbd code. The current implementation of the SMB_VFS calls and some portion of the smbd code would become the default "POSIX" backend to the FSA VFS layer.

This step is not strictly necessary from a technical point of view, but it is a desired foundation for the SMB2 work and future changes. When we touch the code anyway, we have a chance to improve the structure and untangle the layers. We don't need to do it in one step and we don't need to implement all of FSA right away, but we can try to improve the layering as we go along and touch calls.

dependence

  • does not depend on other work
  • accompanies work on the whole project
  • Splitting out the NTFSA calls can be made a prerequisite for further work on the corresponding calls for SMB 2.0 durable handles and SMB 2.1 features (e.g. leases and resilient file handles).

steps

  • define VFS structures:
    • NTFSA layer
    • POSIX backend to call into the current SMB_VFS
  • first implement NTFSA by calling directly into current SMB_VFS code (or move code from smbd into the default NTFSA backend implementation) and have smbd call out into the FSA layer instead
  • start with one call after another, e.g. smb2_create and use NTFSA calls in the implementation.
  • Move logic from the smbd/ code to new NTFSA calls. These call the lower layer SMB_VFS calls.
  • Once the NTFSA calls are used everywhere, one can start to split up and fix the vfs layering underneath, i.e. remove the FSA-style calls from the SMB_VFS etc.
  • data structures: split up files / connections into smbXsrv layer and fsa layer, e.g.:
      smb level       |       ntfsa      |   ntfsa_posix level
      smbXsrv_session -->  ntfsa_context --> users_struct
      smbXsrv_tcon    -->  ntfsa_context --> connections_struct
      smbXsrv_open    -->  ntfsa_open    --> files_struct
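
The layering in the table above can be pictured as a chain of pointers from the SMB level down to the POSIX backend. The following is only a minimal sketch: the struct names come from the table, but every field and the helper `smbXsrv_open_fd()` are hypothetical and chosen purely for illustration.

```c
#include <stddef.h>

/* ntfsa_posix level: stand-in for the per-file POSIX state (field is hypothetical). */
struct files_struct {
    int fd;
};

/* ntfsa level: the FSA view of an open (hypothetical layout). */
struct ntfsa_open {
    struct files_struct *posix; /* state owned by the POSIX backend */
};

/* smb level: protocol state of an open (hypothetical layout). */
struct smbXsrv_open {
    unsigned long long persistent_id;
    struct ntfsa_open *fsa;
};

/* Resolve the POSIX fd behind an SMB-level open, or -1 if a layer is missing. */
static int smbXsrv_open_fd(const struct smbXsrv_open *op)
{
    if (op == NULL || op->fsa == NULL || op->fsa->posix == NULL) {
        return -1;
    }
    return op->fsa->posix->fd;
}
```

With such a split, SMB-level code would only ever hold the smbXsrv_open and reach the file through the ntfsa layer, never touching files_struct directly.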

SMB 2.0

durable handles

Note: Support for durable handles has been released with Samba 4.0

This section describes the steps towards the implementation of durable handles, for now for a single, non-clustered Samba server. For details on durable handles in a CTDB+Samba cluster, see below.

dbwrap work

This is prerequisite work to avoid code duplication in record watching and so on:

  • clean up locking order
  • add dbwrap record watch mechanisms to abstract the mechanisms for waiting for lock records to become available

state: essentially done(?)

rewrite messaging

For the implementation of durable handles, the smbd processes will need to communicate more than before: When a client reconnects to Samba after a network outage, it will end up at a different smbd. The new smbd will need to work on the files that had been in use as durable handles in the original client. There are two possible approaches: keep files open or reopen files. Depending on the approach, it might become necessary to pass open files from one smbd to another using fd passing. For this, we need to change our messaging. Beyond that, messaging in general is becoming more demanding, so it would be extremely useful to get rid of the tdb+signal based messaging, replace it with an asynchronous mechanism based on sockets, and in a second step have the messaging infrastructure generated from IDL.

add new tevent_req based API

dependence: This is independent of other tasks.

  • in order to simplify the higher layers a new tevent_req based messaging api is needed.

rewrite messaging with sockets

dependence: This is independent of other tasks.

  • raw messaging: unix domain datagram sockets.
  • if packets are too large for datagrams, we additionally need stream sockets
  • if possible: keep s3 api messaging_send/receive for a start in order to reduce scope of change
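
As a rough illustration of the raw transport named above, the following sketch sends one message over a pair of connected unix domain datagram sockets and reads it back. It uses only standard POSIX calls; the helper name `dgram_roundtrip()` is made up for this example and is not part of any Samba API.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Send one message over a connected AF_UNIX datagram socket pair and
 * receive it on the other end. Returns the number of bytes received,
 * or -1 on error. Datagram sockets preserve message boundaries, so one
 * send() corresponds to exactly one recv().
 */
static ssize_t dgram_roundtrip(const void *msg, size_t len,
                               void *out, size_t outlen)
{
    int sv[2];
    ssize_t n = -1;

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) != 0) {
        return -1;
    }
    if (send(sv[0], msg, len, 0) == (ssize_t)len) {
        n = recv(sv[1], out, outlen, 0);
    }
    close(sv[0]);
    close(sv[1]);
    return n;
}
```

In a real messaging layer the two ends would live in different smbd processes and the sockets would be bound to filesystem paths; the socket pair here only keeps the example self-contained.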

implement messaging based on iRPC

dependence: Based on the two previous steps

  • do "irpc" over this raw messaging
  • rpc services defined by idl, generated by pidl
  • write rpc services for fd-passing
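
Independent of how the irpc services end up looking, fd-passing between processes on POSIX systems is done with an SCM_RIGHTS control message on a unix domain socket. The sketch below shows that underlying mechanism only; `send_fd()` and `recv_fd()` are hypothetical helper names, not existing Samba functions.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer large enough for one fd, aligned for struct cmsghdr. */
union fd_cmsg {
    struct cmsghdr align;
    char buf[CMSG_SPACE(sizeof(int))];
};

/* Send file descriptor fd over unix domain socket sock. Returns 0 or -1. */
static int send_fd(int sock, int fd)
{
    char dummy = 'F'; /* at least one byte of payload is sent */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive a file descriptor from sock. Returns the new fd or -1. */
static int recv_fd(int sock)
{
    char dummy;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd = -1;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1) {
        return -1;
    }
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int))) {
        return -1;
    }
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The received descriptor refers to the same open file description as the one sent, so share modes and byte range locks held through it stay in place, which is what the "keep files open" approach would rely on.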


Define New Data Structures

locking/open files (fs layer)

  • define data structures (idl)
  • identify various databases
  • design goal API for each such database or structure

state: essentially done?


sessions/tcons/opens (smb layer)

  • define data structures (idl):
    • struct smbXsrv_session*
    • struct smbXsrv_tcon*
    • struct smbXsrv_open*
  • identify various databases
  • design goal API for each such DB or structure

state: in progress/largely done


Use New Data Structures In The Server

use in FS layer

  • refactor locking code etc: create corresponding APIs with current backend code, use in server
  • extend current structures to match targeted structures
  • change code beneath APIs to use new marshalled databases
  • add logic to use new parts of the structures

state: essentially done?

use in smb layer

  • cleanup/simplify core smbd code
  • make use of new structures

state: essentially done?

Implement durable open and reconnect

Session reconnect with previous session id

  • if previous session exists, tear it down and thereby close tcons and (non-durable) open files
  • open new session

state: done

implement durable open

  • Interpret durable flag in smb2_create call
    • Mark the file handle durable in the database record.
    • confirm durable open in the response to the client
  • change cleanup routines to not delete open file entries for durable handles, even when the opening process does not exist any more

state: essentially done

  • implement scavenger mechanism to clean durable handles without corresponding smbd process after the scavenger timeout (maybe simply as part of cleanup routine?)

state: in progress
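
The cleanup rule described in this section (keep durable entries even when the opening process is gone, until the scavenger timeout expires) could be expressed as a single predicate. This is a sketch under assumptions: `struct open_entry`, its fields and `entry_can_be_deleted()` are hypothetical and do not mirror the actual database record layout.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical, reduced view of an open file entry in the database. */
struct open_entry {
    bool durable;           /* client was granted a durable handle */
    bool owner_alive;       /* the opening smbd process still exists */
    time_t disconnect_time; /* when the owning smbd went away */
};

/*
 * Decide whether a cleanup or scavenger pass may delete the entry:
 *  - entries with a live owner are never deleted,
 *  - non-durable entries without owner are deleted right away,
 *  - durable entries without owner survive until the timeout expires.
 */
static bool entry_can_be_deleted(const struct open_entry *e,
                                 time_t now, time_t scavenger_timeout)
{
    if (e->owner_alive) {
        return false;
    }
    if (!e->durable) {
        return true;
    }
    return (now - e->disconnect_time) >= scavenger_timeout;
}
```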

implement durable reconnect with reopening files (CIFS only)

  • implement reconnect for durable handles at SMB2 level after session reconnect and tcon:
    • new smbd looks for file info by persistent ID.
    • smbd should reopen the file based on the information from the databases.
  • fine-tuning of lock/oplock(/lease) behaviour under durable reopen
  • fencing against conflicting opens (==> CIFS only?! - need to keep files open for shell / nfs interop)

state: in progress/largely done

improve nfs/shell interop for conflicting opens

Note: may be implemented later as an add-on

  • write tests to trigger the problem between a connection loss and a non-cifs open of the file that is still a durable handle
  • possibility: create extra process that reopens the closed files to be able to catch opens from shell or nfs while cifs client is disconnected (==> there is still a race condition here)


implement durable reconnect with fd-passing

Note: may be implemented later as an add-on

  • have smbd keep files open which are durable when the client is disconnected
  • implement reopen:
    • requests fd-handle (implemented by fd-passing for posix) via irpc messaging

Durable handle cross-node

(To be filled)

SMB 2.1

Unbuffered Write

Supported since 3.6.0.

Multi Credit / Large MTU

Supported since 4.0.0.

Reauthentication

state: done

Supported since 4.0.0.

Leases

new concepts

  • read lease (<=> lvl2 oplock)
  • read + handle lease (new)
  • read + write lease (write locally, do brlocks) (<=> exclusive oplock)
  • read + write + handle lease (<=> batch oplock)
  • new: multiple opens with same client ID without breaking the lease
  • new: client can upgrade a lease (not downgrade)
  • Linux kernel oplocks don't provide the needed features. (They don't even work correctly for oplocks...) ==> SMB-only feature.

analyze exact algorithms

  • object store semantics
  • smbX break semantics (share modes, oplocks, leases), e.g.:
    • oplock break batch --> lvl2 or none
    • lease break r+w+h --> r+w (only handle caching is broken)
  • documents: see e.g.
    • [MS-SMB2], "Algorithms for Leasing in an Object Store"
    • [MS-SMB2], "Object Store indicates a Lease Break"
    • [MS-SMB2], "Object Store indicates an Oplock Break"
  • write additional smbtorture tests
  • this also determines the exact details of the data structures below

vfs-layer: change data model and code

locking.tdb and the locking code (and part of open) are essentially Samba's implementation of the FSA layer aspect of oplocks. The FSA layer ([MS-FSA]) knows only oplocks and makes no separation between leases and oplocks. These FSA level oplocks are able to cover both SMB(2) oplocks and SMB2 leases. The [MS-SMB2] document describes how leases and oplocks are mapped down to the FSA level.

So the basic idea is to extend our FSA layer oplocks, i.e. the locking.tdb data model and the locking/open code so that it can cope with SMB2 leases as well.

  • data model:
    • lease in open_file (locking.tdb)
      • parallel to share_modes[] (opens)
      • we need an array for oplocks[] where opens may have a reference (index value) into the oplocks[] array, multiple opens can reference the same oplock element.
  • extract of new data structures from open_files.idl:
         typedef [public] struct {
             ...
             uint32 oplock_idx; // UINT32_MAX => none
         } share_mode_entry;
         typedef [public,bitmap8bit] bitmap {
             SHARE_MODE_NO_CACHING = 0x00,
             SHARE_MODE_READ_CACHING = 0x01,
             SHARE_MODE_WRITE_CACHING = 0x02,
             SHARE_MODE_HANDLE_CACHING = 0x04
         } share_mode_caching;
         typedef [public,flag(NDR_PAHEX)] struct {
             DATA_BLOB               oplock_key;
             share_mode_caching      current_state;
             /*
              * allowed_shared_state is the mask for the
              * cache level that can be held simultaneously
              * by multiple opens. On the SMB level, this
              * depends on the kind of caching that is in effect.
              *
              * allowed_shared_state is:
              *
              * - for SMB oplocks: READ Caching
              * - for SMB leases:  READ and READ+HANDLE Caching
              *
              * This means that:
              * - level2 oplocks are not granted if there
              *   is already a RH lease.
              * - A R lease is granted if a level2 oplock was
              *   present and a R or RH lease was requested.
              * - A batch oplock is broken to a level2
              *   oplock and a R lease is granted if a
              *   RH lease was requested.
              */
             share_mode_caching      allowed_shared_state;
             /*
              * breaking_to_state indicates to which level
              * the current state is broken when a conflicting
              * request is processed. The calculation is as follows:
              *
              *   breaking_to_state = current_state;
              *   breaking_to_state &= ~(remove_state)
              *   breaking_to_state &= allowed_shared_state
              */
             share_mode_caching      breaking_to_state;
             boolean8                breaking;
             timeval                 break_timeout;
         } share_mode_oplock;
         typedef [public] struct {
             ...
             uint32 num_oplocks;
             [size_is(num_oplocks)] share_mode_oplock oplocks[];
             ...
         } share_mode_data;
  • mapping SMB oplocks / leases --> locking.tdb share_mode_oplock.
       oplocks:
         level-2 oplocks:
            current_state = READ_CACHING
            allowed_shared_state = READ_CACHING
         exclusive oplock:
            current_state = READ_CACHING|WRITE_CACHING
            allowed_shared_state = READ_CACHING
         batch oplock:
            current_state = READ_CACHING|WRITE_CACHING|HANDLE_CACHING
            allowed_shared_state = READ_CACHING
       leases:
         R-lease:
            current_state = READ_CACHING
            allowed_shared_state = READ_CACHING
         RH-lease:
            current_state = READ_CACHING|HANDLE_CACHING
            allowed_shared_state = READ_CACHING|HANDLE_CACHING
         RW-lease:
            current_state = READ_CACHING|WRITE_CACHING
            allowed_shared_state = READ_CACHING
         RWH-lease:
            current_state = READ_CACHING|WRITE_CACHING|HANDLE_CACHING
            allowed_shared_state = READ_CACHING|HANDLE_CACHING
  • break table:
    • todo: verify / fix
    • todo: dependence on share modes
    • note: table cells in the form "granted\brokento"
       requested \ existing || lvl2      | excl      | batch     | r      | rh    | rw     | rwh
       ---------------------------------------------------------------------------------------------
                       lvl2 || lvl2\lvl2 | lvl2\lvl2 | lvl2\lvl2 | lvl2\r | 0\rh  | lvl2\r | 0\rh
                       excl || lvl2\lvl2 | lvl2\lvl2 | lvl2\lvl2 | lvl2\r | 0\rh  | lvl2\r | 0\rh
                       batch|| lvl2\lvl2 | lvl2\lvl2 | lvl2\lvl2 | lvl2\r | 0\rh  | lvl2\r | 0\rh
                       r    || r\lvl2    | r\lvl2    | r\lvl2    | r\r    | r\rh  | r\r    | r\rh
                       rh   || r\lvl2    | r\lvl2    | r\lvl2    | rh\r   | rh\rh | rh\r   | rh\rh
                       rwh  || r\lvl2    | r\lvl2    | r\lvl2    | rh\r   | rh\rh | rh\r   | rh\rh


  • server code:
    • adapt code to implement new/extended semantics
    • this includes restructuring and possibly fixing existing oplock code
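
The mapping of oplocks and leases to allowed_shared_state and the breaking_to_state calculation from the IDL comment can be tried out in a few lines of C. This is a minimal sketch: `struct oplock_state` is a reduced, hypothetical stand-in for share_mode_oplock, and the three helper functions are invented for this example. The bit values are the ones from share_mode_caching.

```c
#include <stdint.h>

#define SHARE_MODE_NO_CACHING     0x00
#define SHARE_MODE_READ_CACHING   0x01
#define SHARE_MODE_WRITE_CACHING  0x02
#define SHARE_MODE_HANDLE_CACHING 0x04

/* Reduced stand-in for share_mode_oplock: only the caching state fields. */
struct oplock_state {
    uint8_t current_state;
    uint8_t allowed_shared_state;
};

/* SMB oplocks: only READ caching can be shared between opens. */
static struct oplock_state state_from_oplock(uint8_t caching)
{
    struct oplock_state s = { caching, SHARE_MODE_READ_CACHING };
    return s;
}

/* SMB leases: READ can be shared, plus HANDLE if the lease includes it. */
static struct oplock_state state_from_lease(uint8_t caching)
{
    struct oplock_state s;

    s.current_state = caching;
    s.allowed_shared_state = SHARE_MODE_READ_CACHING |
        (caching & SHARE_MODE_HANDLE_CACHING);
    return s;
}

/*
 * breaking_to_state as given in the IDL comment:
 *   breaking_to_state  = current_state
 *   breaking_to_state &= ~(remove_state)
 *   breaking_to_state &= allowed_shared_state
 */
static uint8_t compute_breaking_to(const struct oplock_state *s,
                                   uint8_t remove_state)
{
    uint8_t to = s->current_state;

    to &= (uint8_t)~remove_state;
    to &= s->allowed_shared_state;
    return to;
}
```

For example, a batch oplock that loses WRITE caching ends up with READ caching only (level 2), while an RWH lease that loses WRITE caching keeps READ and HANDLE caching.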

SMB-layer: extend data model and add lease code

  • data model:
    • introduce lease
      • lease key
      • filename
      • lease state
      • break to lease state
      • lease break timeout
      • lease opens (--> smbXsrv_open)
      • breaking
      • epoch (SMB 3.0)
      • version
    • [MS-SMB2] list structures: (maybe not necessary for us)
      • lease table
        • client guid
        • lease list (indexed by lease key)
      • global lease table list
      • in samba possibly: one db indexed by the (pair ClientGUID,LeaseKey)
  • server code:
    • implement leasing capability
    • answer lease requesting variants (v1, v2) of create call
      • implement smb level break code
      • this includes restructuring and possibly fixing existing oplock code
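
For the idea of one database indexed by the pair (ClientGUID, LeaseKey), a key could simply be the concatenation of both values. The sketch below assumes 16 bytes for each part (a GUID is 16 bytes and SMB2 lease keys are 16 bytes); `struct lease_db_key` and the helpers are hypothetical names for illustration.

```c
#include <stdint.h>
#include <string.h>

#define CLIENT_GUID_LEN 16
#define LEASE_KEY_LEN   16

/* Hypothetical database key: ClientGUID followed by LeaseKey. */
struct lease_db_key {
    uint8_t bytes[CLIENT_GUID_LEN + LEASE_KEY_LEN];
};

/* Build the database key from the two 16-byte identifiers. */
static struct lease_db_key make_lease_db_key(const uint8_t *client_guid,
                                             const uint8_t *lease_key)
{
    struct lease_db_key k;

    memcpy(k.bytes, client_guid, CLIENT_GUID_LEN);
    memcpy(k.bytes + CLIENT_GUID_LEN, lease_key, LEASE_KEY_LEN);
    return k;
}

/* Two opens share a lease exactly if their database keys are equal. */
static int lease_db_key_equal(const struct lease_db_key *a,
                              const struct lease_db_key *b)
{
    return memcmp(a->bytes, b->bytes, sizeof(a->bytes)) == 0;
}
```

Opens from the same client that present the same lease key then map to the same record, which is what allows multiple opens without breaking the lease.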

Resilient File Handles

Branch Cache

SMB 3.0

Directory Leases

Persistent File Handles

Multi Channel

Witness Notification Protocol

Interface Discovery

SMB Direct (SMB 3.0 over RDMA)

Remote Shadow Copy (FSRVP)

Not an SMB 3.0 specific feature per se.

Branch Cache v2

Cluster-Wide Durable Handles

Work in progress branches

Note: this is really work in progress!!! And some branches might be outdated!

Talks

Demos