Latest revision as of 09:41, 6 November 2013

CTDB Project

This project aims to produce an implementation of the CTDB protocol described in the Samba & Clustering page.

Project Members

  • Andrew Tridgell
  • Alexander Bokovoy
  • Aleksey Fedoseev
  • Jim McDonough
  • Peter Somogyi
  • Volker Lendecke
  • Ronnie Sahlberg
  • Sven Oehme
  • Michael Adam
  • Martin Schwenke
  • Amitay Isaacs

Setting Up CTDB

See the CTDB Setup page for instructions on setting up the current CTDB code, using either loopback networking or a TCP/IP network.

Status Updates

  • CTDB 2.x Release Notes
  • 29th October 2013: CTDB release 2.5
  • 30th October 2012: CTDB release 2.0
  • 27th January 2009: Samba 3.3.0 released - the first Samba version with full CTDB support in the vanilla sources
  • ...
  • 3rd June 2007: pCIFS using Samba and CTDB works reliably in tests. NFS clustering has been added and initial tests pass.
  • 27th April 2007: First usable version available

Project Outline

The initial work will focus on an implementation as part of tdb itself. Integration with the Samba source tree will happen at a later date. Work will probably happen in a bzr tree, but the details have not been worked out yet. Check back here for updates.

Project Tasks

Hardware acceleration

(note: Peter is looking at this one)

We want CTDB to be very fast on hardware that supports fast messaging. In particular we are interested in good use of infiniband adapters, where we expect to get messaging latencies of the order of 3 to 5 microseconds.

From discussions so far it looks like the 'verbs' API, perhaps with a modification to allow us to hook it into epoll(), will be the right choice. Basic information on this API is available at https://openib.org/tiki/tiki-index.php

The basic features we want from a messaging API are:

  • Low latency. We would like to get it down to just a few microseconds per message. Messages will vary in size, but will typically be small (say between 64 and 512 bytes).
  • Non-blocking. We would really like an API that hooks into poll, so we can use epoll(), poll() or select() (see the sketch after this list).
  • If we can't have an API that hooks into poll() or epoll(), then a callback or signal based API would do if the overheads are small enough. In the same code we also need to be working on a Unix domain socket (datagram socket), so we'd like the overhead of dealing with both the InfiniBand messages and the local datagrams to be low.
  • What we definitely don't want is an API that chews a lot of CPU. So we don't want to be spinning in userspace on a set of mapped registers in the hope that a message might come along. The CPU will be needed for other tasks. Using mapped registers for send would probably be fine, but we'd probably need some kernel mediated mechanism for receive unless you can suggest a way to avoid it.
  • Ideally we'd have reliable delivery, or at least be told when delivery has failed on a send, but if that is too expensive then we'll do our own reliable delivery mechanism.
  • We need to be able to add/remove nodes from the cluster. The Samba clustering code will have its own recovery protocol.
  • A 'message'-like API would suit us better than a 'remote DMA' style API, unless the remote DMA API is significantly more efficient. Ring buffers would be fine.
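
As a small illustration of the non-blocking requirement, the sketch below runs a single epoll() loop over a Unix datagram socket; a fast-path transport that exposes a pollable file descriptor (for example, a verbs completion channel) could be added to the same loop. The socket path and the overall structure are illustrative assumptions, not CTDB code.

 /* Minimal sketch: one epoll() loop serving a Unix datagram socket.
  * A fast-path transport that exposes a pollable fd (e.g. a verbs
  * completion channel) could be added to the same epoll set.
  * The socket path below is purely illustrative. */
 #include <stdio.h>
 #include <string.h>
 #include <unistd.h>
 #include <sys/epoll.h>
 #include <sys/socket.h>
 #include <sys/un.h>
 
 int main(void)
 {
     const char *path = "/tmp/ctdb-demo.dgram";   /* hypothetical path */
 
     int s = socket(AF_UNIX, SOCK_DGRAM, 0);
     if (s < 0) { perror("socket"); return 1; }
 
     struct sockaddr_un addr;
     memset(&addr, 0, sizeof(addr));
     addr.sun_family = AF_UNIX;
     strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
     unlink(path);
     if (bind(s, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
         perror("bind"); return 1;
     }
 
     int ep = epoll_create1(0);
     if (ep < 0) { perror("epoll_create1"); return 1; }
 
     struct epoll_event ev = { .events = EPOLLIN, .data.fd = s };
     if (epoll_ctl(ep, EPOLL_CTL_ADD, s, &ev) < 0) {
         perror("epoll_ctl"); return 1;
     }
     /* A second EPOLL_CTL_ADD here could register the fast-path fd. */
 
     for (;;) {
         struct epoll_event events[8];
         int n = epoll_wait(ep, events, 8, -1);
         if (n < 0) { perror("epoll_wait"); break; }
 
         for (int i = 0; i < n; i++) {
             if (events[i].data.fd == s) {
                 char buf[512];
                 ssize_t len = recv(s, buf, sizeof(buf), 0);
                 if (len > 0)
                     printf("got %zd byte datagram\n", len);
             }
         }
     }
     return 0;
 }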

An abstract interface can be found here: CTDB_Project_ibwrapper. Please note that this interface is intended to be general enough to cover other possible implementations as well.

TODOs regarding this interface:

  • verify implementability
  • reduction

CTDB API

Finished.
The CTDB API is now fairly stable. Communication between Samba and CTDB takes place over a Unix domain socket, /tmp/ctdb.socket. The API contains the following PDUs:

 CTDB_REQ_CALL           Fetch a remote record
 CTDB_REPLY_CALL
 CTDB_REQ_DMASTER        Transfer a record back to the LMASTER
 CTDB_REPLY_DMASTER      Transfer a record from the LMASTER to a new DMASTER
 CTDB_REPLY_ERROR       
 CTDB_REQ_MESSAGE        Send a message to another client attached to a local or remote CTDB daemon
 CTDB_REQ_CONTROL        Get/Set configuration or runtime status data
 CTDB_REPLY_CONTROL
 CTDB_REQ_KEEPALIVE

Of these, the only PDUs used by a client connecting to CTDB are:

 CTDB_REQ_CALL           Fetch a remote record
 CTDB_REQ_MESSAGE        Send a message to another client attached to a local or remote CTDB daemon
 CTDB_REQ_CONTROL        Get/Set configuration or runtime status data
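
As a rough illustration of what a client does with this socket, the sketch below connects to /tmp/ctdb.socket and writes a length-prefixed PDU. The header layout shown is a simplified, hypothetical one used only to illustrate the exchange; the authoritative PDU structures are defined in the CTDB sources.

 /* Sketch only: connect to the CTDB daemon's Unix domain socket and
  * write a length-prefixed PDU.  The header below is a simplified
  * illustration, NOT the real wire format. */
 #include <stdint.h>
 #include <stdio.h>
 #include <string.h>
 #include <sys/socket.h>
 #include <sys/un.h>
 #include <unistd.h>
 
 struct demo_pdu_header {           /* hypothetical, for illustration */
     uint32_t length;               /* total PDU length in bytes */
     uint32_t operation;            /* e.g. a CTDB_REQ_MESSAGE-style opcode */
     uint32_t destnode;             /* destination node number */
     uint32_t reqid;                /* request id chosen by the client */
 };
 
 static int ctdb_demo_connect(const char *path)
 {
     int fd = socket(AF_UNIX, SOCK_STREAM, 0);
     if (fd < 0) return -1;
 
     struct sockaddr_un addr;
     memset(&addr, 0, sizeof(addr));
     addr.sun_family = AF_UNIX;
     strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
 
     if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
         close(fd);
         return -1;
     }
     return fd;
 }
 
 int main(void)
 {
     int fd = ctdb_demo_connect("/tmp/ctdb.socket");
     if (fd < 0) { perror("connect"); return 1; }
 
     struct demo_pdu_header hdr = {
         .length    = sizeof(hdr),
         .operation = 1,            /* placeholder opcode */
         .destnode  = 0,
         .reqid     = 42,
     };
     if (write(fd, &hdr, sizeof(hdr)) != sizeof(hdr))
         perror("write");
 
     close(fd);
     return 0;
 }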

Code s3/s4 databases on top of ctdb api

Finished.
All important temporary databases have now been converted to CTDB and demonstrated.

Code client CTDB api on top of dumb tdb

Finished.
The ctdb branch for samba3 now implements a simple API on top of CTDB. The record header has been expanded to contain a "dmaster" field, which allows the Samba daemon to determine whether the current version of a record is held in the local tdb; if so, the Samba daemon accesses the record immediately without any involvement of the ctdb daemon.

If the record is not stored locally, Samba asks the ctdb daemon to locate the most current version of the record in the cluster and transfer it to the local tdb; the Samba daemon then accesses it in the normal way from the locally held tdb.


The process used in the client can be described as follows (an illustrative sketch in C follows the steps):

 1 Lock the record in the local TDB.
 2 Read the CTDB header from the record and check whether this node is the DMASTER.
 If this node is the DMASTER for the record:
 3 Operate on the record and unlock it when finished.
 If this node is NOT the DMASTER for the record:
 4 Unlock the record.
 5 Send a CTDB_REQ_CALL to the local daemon, requesting that the record be migrated onto this node.
 6 Wait for the local daemon to send back a CTDB_REPLY_CALL, indicating the record is now held locally.
 7 Go to step 1.
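
The sketch below expresses those steps against the plain tdb API. Only tdb_chainlock(), tdb_fetch() and tdb_chainunlock() are real tdb calls; the record header, THIS_NODE and ctdb_demo_migrate() are hypothetical placeholders for the real client library.

 /* Sketch of the client-side fetch loop described above.
  * ctdb_demo_header, THIS_NODE and ctdb_demo_migrate() are hypothetical
  * placeholders; tdb_chainlock()/tdb_fetch()/tdb_chainunlock() are the
  * real tdb calls. */
 #include <stdint.h>
 #include <stdlib.h>
 #include <string.h>
 #include <tdb.h>
 
 struct ctdb_demo_header {          /* illustrative record header */
     uint64_t rsn;                  /* record sequence number */
     uint32_t dmaster;              /* node currently holding the record */
 };
 
 #define THIS_NODE 0                /* our node number, hypothetical */
 
 /* Ask the local daemon to migrate the record here (placeholder). */
 static void ctdb_demo_migrate(TDB_DATA key)
 {
     (void)key;  /* would send CTDB_REQ_CALL and wait for CTDB_REPLY_CALL */
 }
 
 /* Fetch a record, looping until this node is the DMASTER for it. */
 static TDB_DATA fetch_locked(struct tdb_context *tdb, TDB_DATA key)
 {
     for (;;) {
         /* 1: lock the record chain in the local tdb */
         if (tdb_chainlock(tdb, key) != 0) {
             TDB_DATA empty = { NULL, 0 };
             return empty;
         }
 
         /* 2: read the header and check the dmaster field */
         TDB_DATA rec = tdb_fetch(tdb, key);
         if (rec.dsize >= sizeof(struct ctdb_demo_header)) {
             struct ctdb_demo_header hdr;
             memcpy(&hdr, rec.dptr, sizeof(hdr));
             if (hdr.dmaster == THIS_NODE) {
                 /* 3: we are DMASTER - caller operates on the record,
                  *    then calls tdb_chainunlock() when finished */
                 return rec;
             }
         }
 
         /* 4-6: not DMASTER - unlock, request migration, retry */
         free(rec.dptr);
         tdb_chainunlock(tdb, key);
         ctdb_demo_migrate(key);
         /* 7: go to step 1 */
     }
 }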

Prototype CTDB library on top of UDP/TCP

Finished.

Setup standalone test environment

This test environment is meant for non-clustered usage, instead emulating a cluster using IP on loopback. It will need to run multiple instances talking over 127.0.0.X interfaces. This will involve some shell scripting, plus some work on adding/removing nodes from the cluster. It might be easiest to add a CTDB protocol request asking a node to 'go quiet', then asking it to become active again later to simulate a node dying and coming back.

Code CTDB test suite

(note: jim is looking at this one)

This reflects the fact that I want this project to concentrate on building ctdb on tdb + messaging, and not concentrate on the "whole problem" involving Samba until later. We'll do a basic s3/s4 backend implementation to make sure the ideas can work, but I want the major testing effort to involve simple tests directly against the ctdb API. It will be so much easier to simulate exotic error conditions that way.

Recovery

Finished.
A recovery mechanism has been implemented in CTDB.

One of the nodes becomes the recovery master through an election process; this is the node that monitors the cluster and drives the recovery process when required. The election is currently based on the VNN number of the node: the node with the lowest VNN number becomes the recovery master.

The daemon that is designated the recovery master will continuously monitor the cluster and verify that the cluster information is consistent.

To ensure that only one recovery master can be active at any given time, a file held on shared storage is used. To become recovery master, a node must be able to acquire an exclusive lock on that file.
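
As an illustration of that mutual-exclusion scheme, the sketch below tries to take an exclusive fcntl() write lock on a file on shared storage; the path and error handling are assumptions, and CTDB's own locking code may differ.

 /* Sketch: try to become recovery master by taking an exclusive
  * write lock on a file held on shared storage.  The path is an
  * illustrative assumption; CTDB's own locking details may differ. */
 #include <fcntl.h>
 #include <stdio.h>
 #include <string.h>
 #include <unistd.h>
 
 /* Returns an open fd holding the lock, or -1 if another node has it. */
 int try_become_recovery_master(const char *lockfile)
 {
     int fd = open(lockfile, O_RDWR | O_CREAT, 0600);
     if (fd < 0) return -1;
 
     struct flock fl;
     memset(&fl, 0, sizeof(fl));
     fl.l_type   = F_WRLCK;        /* exclusive write lock */
     fl.l_whence = SEEK_SET;
     fl.l_start  = 0;
     fl.l_len    = 0;              /* lock the whole file */
 
     if (fcntl(fd, F_SETLK, &fl) != 0) {
         close(fd);                /* some other node holds the lock */
         return -1;
     }
     return fd;                    /* keep fd open to hold the lock */
 }
 
 int main(void)
 {
     int fd = try_become_recovery_master("/shared/ctdb.recovery.lock");
     if (fd < 0) {
         printf("another node is recovery master\n");
         return 0;
     }
     printf("this node is now recovery master\n");
     /* ... drive recovery; the lock is released when fd is closed ... */
     close(fd);
     return 0;
 }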

The recovery process consists of:

  • Freezing the cluster. This includes locking all local tdb databases to prevent any clients from accessing the databases while recovery is in progress.
  • Verifying that all active nodes have all databases created, and creating any missing databases if required.
  • Pulling all records from all databases on all remote nodes and merging them into the local tdb databases on the node that is the recovery master. Merging of records is based on RSN (record sequence number) values (see the sketch after this list).
  • After merging, pushing all records out to all remote nodes.
  • Cleaning up and deleting all old empty records in all databases.
  • Assigning nodes to take over the public IP addresses of failed nodes.
  • Building and distributing a new mapping of the lmaster role for all records (the vnn map).
  • Creating a new generation number for the cluster and distributing it to all nodes.
  • Updating all local and remote nodes to mark the recovery master as the current dmaster for all records.
  • Thawing the cluster.
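
The merge rule on the "pull and merge" step is essentially that the copy with the highest RSN wins. A small sketch of that comparison, using an illustrative record header, is shown below.

 /* Sketch of the record-merge rule used during recovery: when the same
  * key is pulled from several nodes, keep the copy with the highest RSN.
  * The header layout is illustrative, not CTDB's actual structure. */
 #include <stdint.h>
 #include <stddef.h>
 
 struct demo_record {
     uint64_t rsn;                 /* record sequence number */
     uint32_t dmaster;             /* current data master */
     size_t   datalen;
     const unsigned char *data;
 };
 
 /* Return the record that should survive the merge. */
 const struct demo_record *merge_pick(const struct demo_record *a,
                                      const struct demo_record *b)
 {
     if (a == NULL) return b;
     if (b == NULL) return a;
     return (b->rsn > a->rsn) ? b : a;   /* highest RSN wins */
 }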

IP Takeover

Finished.
Each CTDB node is assigned two IP addresses: a private address that is tied to a physical node and dedicated to inter-CTDB traffic only, and a second "public" IP address, which is the address to which clustered services such as SMB bind.

The CTDB cluster ensures that when physical nodes fail, the remaining nodes temporarily take over the public IP addresses of the failed nodes. This ensures that even when nodes are temporarily or permanently unavailable, the public IP addresses assigned to those nodes remain available to clients.

The private CTDB address is the primary IP address assigned to the interface used by the cluster and is the address that shows up in ifconfig. To view which public service addresses are served by a specific node, you can use

 ip addr show eth0

which will show all ip addresses assigned to the interface.

When a physical node takes over the public IP address of a failed node, it first sends out a few gratuitous ARPs to ensure that the ARP tables on all locally attached hosts are updated to reflect the new physical address serving that public IP address. The new node then also sends a few "tcp tickles" to ensure that all clients that had established TCP connections to the failed node immediately detect that those connections have terminated and need to be re-established.
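
To make the gratuitous ARP step concrete, the sketch below broadcasts a single gratuitous ARP reply on a Linux raw AF_PACKET socket (requires CAP_NET_RAW). It illustrates the technique only and is not CTDB's takeover code; the interface index, MAC address and IP address are supplied by the caller.

 /* Sketch: broadcast one gratuitous ARP reply for "ip" on the given
  * interface, announcing "mac" as its new owner.  Linux-specific
  * (AF_PACKET), needs CAP_NET_RAW; illustration only. */
 #include <arpa/inet.h>
 #include <net/ethernet.h>         /* struct ether_header, ETHERTYPE_* */
 #include <net/if_arp.h>           /* ARPHRD_ETHER, ARPOP_REPLY */
 #include <netinet/if_ether.h>     /* struct ether_arp */
 #include <netinet/in.h>
 #include <netpacket/packet.h>     /* struct sockaddr_ll */
 #include <string.h>
 #include <sys/socket.h>
 #include <unistd.h>
 
 int send_gratuitous_arp(int ifindex, const unsigned char mac[6], in_addr_t ip)
 {
     int s = socket(AF_PACKET, SOCK_RAW, htons(ETHERTYPE_ARP));
     if (s < 0) return -1;
 
     struct ether_header eh;
     struct ether_arp arp;
     memset(&eh, 0, sizeof(eh));
     memset(&arp, 0, sizeof(arp));
 
     memset(eh.ether_dhost, 0xff, 6);           /* Ethernet broadcast */
     memcpy(eh.ether_shost, mac, 6);
     eh.ether_type = htons(ETHERTYPE_ARP);
 
     arp.arp_hrd = htons(ARPHRD_ETHER);
     arp.arp_pro = htons(ETHERTYPE_IP);
     arp.arp_hln = 6;
     arp.arp_pln = 4;
     arp.arp_op  = htons(ARPOP_REPLY);
     memcpy(arp.arp_sha, mac, 6);               /* sender MAC = new owner */
     memcpy(arp.arp_spa, &ip, 4);               /* sender IP = taken-over IP */
     memset(arp.arp_tha, 0xff, 6);
     memcpy(arp.arp_tpa, &ip, 4);               /* target IP = same IP */
 
     unsigned char frame[sizeof(eh) + sizeof(arp)];
     memcpy(frame, &eh, sizeof(eh));
     memcpy(frame + sizeof(eh), &arp, sizeof(arp));
 
     struct sockaddr_ll sll;
     memset(&sll, 0, sizeof(sll));
     sll.sll_family  = AF_PACKET;
     sll.sll_ifindex = ifindex;
     sll.sll_halen   = 6;
     memset(sll.sll_addr, 0xff, 6);
 
     ssize_t n = sendto(s, frame, sizeof(frame), 0,
                        (struct sockaddr *)&sll, sizeof(sll));
     close(s);
     return (n == (ssize_t)sizeof(frame)) ? 0 : -1;
 }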

Work out details for persistent tdbs

This will need some more thought - it's not our top priority, but eventually the long-lived databases will matter.

Wireshark dissector

There is a basic dissector for CTDB in current Wireshark SVN. This dissector follows development and changes of the protocol.

Filter driver for Windows

A filter driver could be developed for Windows to monitor all calls and perform reconnection and reissuing of calls during/after recovery events have occurred. This would greatly enhance the ability of Windows applications to survive a cluster node failure and recovery.