This project aims to produce an implementation of the CTDB protocol described on the Samba & Clustering page.
Sven Oehme (project leader), Andrew Tridgell (technical lead), Alexander Bokovoy, Aleksey Fedoseev, Jim McDonough, Peter Somogyi
The initial work will focus on an implementation as part of tdb itself. Integration with the Samba source tree will happen at a later date. Work will probably happen in a bzr tree, but the details have not been worked out yet. Check back here for updates.
We want CTDB to be very fast on hardware that supports fast messaging. In particular we are interested in good use of infiniband adapters, where we expect to get messaging latencies of the order of 3 to 5 microseconds.
From discussions so far it looks like the 'verbs' API, perhaps with a modification to allow us to hook it into epoll(), will be the right choice. Basic information on this API is available at https://openib.org/tiki/tiki-index.php
The basic features we want from a messaging API are:
- low latency. We would like to get it down to just a few microseconds per message. Messages will vary in size, but typically be small (say between 64 and 512 bytes).
- non-blocking. We would really like an API that hooks into poll, so we can use epoll(), poll() or select().
- If we can't have an API that hooks into poll() or epoll(), then a callback or signal based API would do if the overheads are small enough. In the same code we also need to be working on a unix domain socket (datagram socket) so we'd like the overhead of dealing with both the infiniband messages and the local datagrams to be low.
- What we definitely don't want to use is an API that chews a lot of CPU. So we don't want to be spinning in userspace on a set of mapped registers in the hope that a message might come along. The CPU will be needed for other tasks. Using mapped registers for send would probably be fine, but we'd probably need some kernel mediated mechanism for receive unless you can suggest a way to avoid it.
- ideally we'd have reliable delivery, or at least be told when delivery has failed on a send, but if that is too expensive then we'll do our own reliable delivery mechanism.
- we need to be able to add/remove nodes from the cluster. The Samba clustering code will have its own recovery protocol.
- a 'message' like API would suit us better than a 'remote DMA' style API, unless the remote DMA API is significantly more efficient. Ring buffers would be fine.
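The non-blocking requirement above can be illustrated with a minimal sketch that receives a small message from a unix domain datagram socket through epoll(), without spinning. This is Linux-specific, and the helper names (watch_fd, wait_one_message, demo_local_datagram) are made up for this example; they are not part of tdb, CTDB or the verbs API. Any other transport that exposes a pollable file descriptor could be added to the same epoll set, but whether the infiniband side can do that is exactly the open question above.

```c
#include <assert.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: register fd with the epoll set for readability. */
static int watch_fd(int epfd, int fd)
{
    struct epoll_event ev;

    memset(&ev, 0, sizeof(ev));
    ev.events = EPOLLIN;
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* Hypothetical helper: sleep in the kernel until one watched fd is readable,
 * then read a single datagram from it.
 * Returns bytes received, 0 on timeout, or -1 on error. */
static ssize_t wait_one_message(int epfd, void *buf, size_t len, int timeout_ms)
{
    struct epoll_event ev;
    int n;

    assert(buf != NULL);
    n = epoll_wait(epfd, &ev, 1, timeout_ms);
    if (n <= 0) {
        return n; /* 0 = timeout, -1 = error */
    }
    /* MSG_DONTWAIT so that we never block, even on a spurious wakeup */
    return recv(ev.data.fd, buf, len, MSG_DONTWAIT);
}

/* Demo: send msg over a local datagram socketpair and receive it via epoll.
 * Returns the received length, or -1 on failure. */
static ssize_t demo_local_datagram(const char *msg, char *out, size_t outlen)
{
    int sv[2];
    int epfd;
    ssize_t got = -1;

    assert(msg != NULL && out != NULL);
    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) != 0) {
        return -1;
    }
    epfd = epoll_create1(0);
    if (epfd >= 0) {
        if (watch_fd(epfd, sv[0]) == 0 &&
            send(sv[1], msg, strlen(msg), 0) == (ssize_t)strlen(msg)) {
            got = wait_one_message(epfd, out, outlen, 1000);
        }
        close(epfd);
    }
    close(sv[0]);
    close(sv[1]);
    return got;
}
```

Calling demo_local_datagram("hello", buf, sizeof(buf)) should return 5 with the message copied into buf. The process sleeps in epoll_wait() until a datagram arrives, so no CPU is spent polling.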
Flesh out CTDB API
By this I mean the C API in the "Clustered TDB API" section of the wiki page. The API as given there now is missing some pieces, and I think it can be greatly improved.
This is likely to feed back into the CTDB protocol description as well. Ideally we'd get rid of these calls:
CTDB_REQ_FETCH_LOCKED, CTDB_REPLY_FETCH_LOCKED, CTDB_REQ_UNLOCK, CTDB_REPLY_UNLOCK
assuming we can demonstrate they aren't needed. I also think we can combine the CTDB_REQ_CONDITIONAL_APPEND and CTDB_REQ_FETCH calls into a single CTDB_REQ_REQUEST call which takes a key, a blob of data and a condition ID as input, and returns a blob of data and a status code as output. For a fetch call the input blob of data would be zero length.
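To make the shape of such a combined call concrete, here is a rough sketch against a toy in-memory store. Everything in it is an assumption made for illustration: the struct and function names, the status codes, and in particular the two condition IDs, since the text does not say what conditions exist or how they are evaluated. It only shows the "key + blob + condition ID in, blob + status out" signature, with a zero-length input blob meaning a plain fetch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_RECORDS 8
#define TOY_MAX_KEY     32
#define TOY_MAX_DATA    256

/* All names here are hypothetical; none are taken from the CTDB wiki page. */
struct blob { const unsigned char *data; size_t length; };

enum req_status { REQ_OK = 0, REQ_CONDITION_FAILED = 1, REQ_NO_SPACE = 2 };

/* Invented condition IDs, purely for illustration. */
enum cond_id { COND_ALWAYS = 0, COND_RECORD_EMPTY = 1 };

struct toy_record {
    unsigned char key[TOY_MAX_KEY];
    size_t key_len;
    unsigned char data[TOY_MAX_DATA];
    size_t data_len;
};

/* Toy store standing in for a tdb. Zero-initialise before use. */
struct toy_db { struct toy_record rec[TOY_MAX_RECORDS]; size_t count; };

static int condition_holds(const struct toy_record *r, enum cond_id cond)
{
    switch (cond) {
    case COND_ALWAYS:       return 1;
    case COND_RECORD_EMPTY: return r->data_len == 0;
    }
    return 0;
}

/* Linear lookup by key; optionally creates an empty record. */
static struct toy_record *toy_find(struct toy_db *db, struct blob key, int create)
{
    size_t i;
    struct toy_record *r;

    for (i = 0; i < db->count; i++) {
        if (db->rec[i].key_len == key.length &&
            memcmp(db->rec[i].key, key.data, key.length) == 0) {
            return &db->rec[i];
        }
    }
    if (!create || db->count == TOY_MAX_RECORDS || key.length > TOY_MAX_KEY) {
        return NULL;
    }
    r = &db->rec[db->count++];
    memcpy(r->key, key.data, key.length);
    r->key_len = key.length;
    r->data_len = 0;
    return r;
}

/* Single combined call: a zero-length `in` is a fetch, otherwise the data is
 * appended if the condition holds. `out` always reports the record contents. */
static enum req_status toy_request(struct toy_db *db, struct blob key,
                                   struct blob in, enum cond_id cond,
                                   struct blob *out)
{
    struct toy_record *r;
    enum req_status status = REQ_OK;

    assert(db != NULL && out != NULL && key.data != NULL);
    out->data = NULL;
    out->length = 0;

    r = toy_find(db, key, in.length > 0);
    if (r == NULL) {
        /* fetching a missing record yields an empty blob */
        return in.length == 0 ? REQ_OK : REQ_NO_SPACE;
    }
    if (in.length > 0) {
        if (!condition_holds(r, cond)) {
            status = REQ_CONDITION_FAILED;
        } else if (in.length > TOY_MAX_DATA - r->data_len) {
            status = REQ_NO_SPACE;
        } else {
            memcpy(r->data + r->data_len, in.data, in.length);
            r->data_len += in.length;
        }
    }
    out->data = r->data;
    out->length = r->data_len;
    return status;
}
```

With this shape a caller needs only one request type: passing an empty input blob reads the record, and passing data with a condition ID attempts the append and reports the resulting record contents either way.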
Code s3/s4 opendb and brlock on top of ctdb api
Whoever does this would pick either s3 or s4 initially. I don't think there is any point in doing them in parallel (we are bound to make the same sorts of mistakes on both if we did that).
This will also feed a lot into the previous line item, working out the API.
Code CTDB api on top of dumb tdb
This also feeds into the API discussion. It should be a very simple and dumb implementation, aiming to be used to allow the s3/s4 implementation to have something to test against.
Set up standalone test environment
This test environment is meant for non-clustered usage, instead emulating a cluster using IP on loopback. It will need to run multiple instances talking over 127.0.0.X interfaces. This will involve some shell scripting, plus some work on adding/removing nodes from the cluster. It might be easiest to add a CTDB protocol request asking a node to 'go quiet', then asking it to become active again later to simulate a node dying and coming back.
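As a rough illustration of the 'go quiet' idea, the sketch below models a node that is bound to a 127.0.0.X address and drops all traffic while quiet. The control request names, the sim_node structure and the port number are invented for this example; nothing here is an existing CTDB request or API.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical test-only control requests. */
enum test_ctrl { TEST_CTRL_GO_QUIET = 1, TEST_CTRL_GO_ACTIVE = 2 };

/* Hypothetical per-node state for a loopback-emulated cluster. */
struct sim_node {
    struct sockaddr_in addr; /* 127.0.0.X address this node would bind to */
    bool quiet;              /* true while the node pretends to be dead */
    unsigned long handled;   /* messages processed while active */
};

/* Node idx (0-based) gets address 127.0.0.(idx + 1). */
static int sim_node_init(struct sim_node *n, unsigned idx, unsigned short port)
{
    char ip[INET_ADDRSTRLEN];

    assert(n != NULL);
    if (idx >= 254) {
        return -1; /* only 127.0.0.1 .. 127.0.0.254 are used here */
    }
    memset(n, 0, sizeof(*n));
    snprintf(ip, sizeof(ip), "127.0.0.%u", idx + 1);
    n->addr.sin_family = AF_INET;
    n->addr.sin_port = htons(port);
    return inet_pton(AF_INET, ip, &n->addr.sin_addr) == 1 ? 0 : -1;
}

/* Apply a control request to the node. */
static void sim_node_control(struct sim_node *n, enum test_ctrl c)
{
    if (c == TEST_CTRL_GO_QUIET) {
        n->quiet = true;
    } else if (c == TEST_CTRL_GO_ACTIVE) {
        n->quiet = false;
    }
}

/* Returns true if the message was processed, false if it was dropped
 * because the node is currently simulating a failure. */
static bool sim_node_deliver(struct sim_node *n)
{
    if (n->quiet) {
        return false;
    }
    n->handled++;
    return true;
}
```

A test script could then send the quiet request to one node, check that the rest of the cluster notices the silence, and later send the active request to simulate the node coming back.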
Code CTDB test suite
This reflects the fact that I want this project to concentrate on building ctdb on tdb + messaging, and not concentrate on the "whole problem" involving Samba until later. We'll do a basic s3/s4 backend implementation to make sure the ideas can work, but I want the major testing effort to involve simple tests directly against the ctdb API. It will be so much easier to simulate exotic error conditions that way.
Flesh out recovery part of ctdb protocol
I explained on the phone that I think the simplest recovery process will be something like this:
- global sync and pick 'master' for recovery
- every node sends all records from its local tdb to the LMASTER
- master waits till all nodes say they are done
- global sync and restart
The recovery phase will need to very carefully cope with lots of corner cases, like when a node goes down during recovery.
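One small piece of this, the master waiting for every node to report that it is done, can be sketched as simple bookkeeping. The structure and function names are hypothetical, and the handling of a node that dies mid-recovery (the master just stops waiting for it) is an assumption made for the example, not a decision recorded here. Whether records already received from a dead node are kept, or recovery is restarted, is left open.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define REC_MAX_NODES 64

/* Hypothetical state kept by the recovery master. */
struct recovery_state {
    size_t num_nodes;
    bool alive[REC_MAX_NODES]; /* node still believed to be up */
    bool done[REC_MAX_NODES];  /* node has sent all its records */
};

/* Start a recovery run: every node is alive and none has finished. */
static void recovery_begin(struct recovery_state *st, size_t num_nodes)
{
    size_t i;

    assert(st != NULL && num_nodes <= REC_MAX_NODES);
    st->num_nodes = num_nodes;
    for (i = 0; i < num_nodes; i++) {
        st->alive[i] = true;
        st->done[i] = false;
    }
}

/* A node reports that it has pushed all of its records. */
static void recovery_node_done(struct recovery_state *st, size_t node)
{
    assert(node < st->num_nodes);
    st->done[node] = true;
}

/* A node went down during recovery (assumption: stop waiting for it). */
static void recovery_node_died(struct recovery_state *st, size_t node)
{
    assert(node < st->num_nodes);
    st->alive[node] = false;
}

/* True once every node that is still alive has reported done, which is
 * when the master could move on to the final global sync and restart. */
static bool recovery_can_restart(const struct recovery_state *st)
{
    size_t i;

    for (i = 0; i < st->num_nodes; i++) {
        if (st->alive[i] && !st->done[i]) {
            return false;
        }
    }
    return true;
}
```

For example, with three nodes where nodes 0 and 2 report done and node 1 then dies, recovery_can_restart() becomes true only after the death of node 1 is recorded.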
Work out details for persistent tdbs
This will need some more thought. It's not our top priority, but eventually the long lived databases will matter.
We'll need a wireshark dissector, but only once the protocol settles down a little.