This project aims to produce an implementation of the CTDB protocol described on the Samba & Clustering page.
Sven Oehme (project leader), Andrew Tridgell (technical lead), Alexander Bokovoy, Aleksey Fedoseev, Jim McDonough, Peter Somogyi
The initial work will focus on an implementation as part of tdb itself. Integration with the Samba source tree will happen at a later date. Work will probably happen in a bzr tree, but the details have not been worked out yet. Check back here for updates.
We want CTDB to be very fast on hardware that supports fast messaging. In particular we are interested in making good use of InfiniBand adapters, where we expect to get messaging latencies on the order of 3 to 5 microseconds.
From discussions so far it looks like the 'verbs' API, perhaps with a modification to allow us to hook it into epoll(), will be the right choice. Basic information on this API is available at https://openib.org/tiki/tiki-index.php
The basic features we want from a messaging API are:
- low latency. We would like to get it down to just a few microseconds per message. Messages will vary in size, but will typically be small (say between 64 and 512 bytes).
- non-blocking. We would really like an API that hooks into poll, so we can use epoll(), poll() or select().
- If we can't have an API that hooks into poll() or epoll(), then a callback or signal based API would do if the overheads are small enough. The same code also needs to service a Unix domain socket (datagram socket), so we'd like the overhead of dealing with both the InfiniBand messages and the local datagrams to be low.
- What we definitely don't want to use is an API that chews a lot of CPU. So we don't want to be spinning in userspace on a set of mapped registers in the hope that a message might come along. The CPU will be needed for other tasks. Using mapped registers for send would probably be fine, but we'd probably need some kernel mediated mechanism for receive unless you can suggest a way to avoid it.
- ideally we'd have reliable delivery, or at least be told when delivery has failed on a send, but if that is too expensive then we'll do our own reliable delivery mechanism.
- we need to be able to add/remove nodes from the cluster. The Samba clustering code will have its own recovery protocol.
- a 'message' like API would suit us better than a 'remote DMA' style API, unless the remote DMA API is significantly more efficient. Ring buffers would be fine.
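
As a rough illustration of the non-blocking requirement above, here is a minimal Linux-only sketch of a single epoll() wait that services both a local Unix datagram socket and a second descriptor belonging to the fast transport. It assumes the transport can expose a readable file descriptor at all, which is exactly the open question. Since no InfiniBand hardware is assumed, an eventfd stands in for that descriptor. The names `watch_fd`, `demo_dispatch_once`, `SRC_LOCAL` and `SRC_FABRIC` are hypothetical and are not part of CTDB or of any messaging library.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_EVENTS 4

/* Hypothetical tags telling the loop which source became readable. */
enum source { SRC_LOCAL = 1, SRC_FABRIC = 2 };

/* Register fd with the epoll instance for read readiness. */
static int watch_fd(int epfd, int fd, enum source tag)
{
    struct epoll_event ev;

    memset(&ev, 0, sizeof(ev));
    ev.events = EPOLLIN;
    ev.data.u32 = (uint32_t)tag;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/*
 * Set up both sources, inject one event into each, then dispatch once.
 * Returns a bitmask of the sources that were handled, or -1 on error.
 */
int demo_dispatch_once(void)
{
    int local[2] = { -1, -1 };   /* Unix datagram socket pair */
    int fabric_fd = -1;          /* stand-in for the transport's event fd */
    int epfd = -1, handled = -1, n, i;
    struct epoll_event events[MAX_EVENTS];
    const char msg[] = "small message";
    uint64_t one = 1;

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, local) != 0)
        goto out;
    fabric_fd = eventfd(0, EFD_NONBLOCK);
    epfd = epoll_create1(0);
    if (fabric_fd < 0 || epfd < 0)
        goto out;
    if (watch_fd(epfd, local[0], SRC_LOCAL) != 0 ||
        watch_fd(epfd, fabric_fd, SRC_FABRIC) != 0)
        goto out;

    /* Simulate one local datagram and one transport notification. */
    if (send(local[1], msg, sizeof(msg), 0) != (ssize_t)sizeof(msg))
        goto out;
    if (write(fabric_fd, &one, sizeof(one)) != (ssize_t)sizeof(one))
        goto out;

    /* Sleep in the kernel until something is readable; no busy-waiting. */
    n = epoll_wait(epfd, events, MAX_EVENTS, 1000);
    if (n < 0)
        goto out;
    assert(n <= MAX_EVENTS);

    handled = 0;
    for (i = 0; i < n; i++) {
        if (events[i].data.u32 == SRC_LOCAL) {
            char buf[512];
            if (recv(local[0], buf, sizeof(buf), MSG_DONTWAIT) > 0)
                handled |= SRC_LOCAL;
        } else if (events[i].data.u32 == SRC_FABRIC) {
            uint64_t count;
            if (read(fabric_fd, &count, sizeof(count)) == (ssize_t)sizeof(count))
                handled |= SRC_FABRIC;
        }
    }
out:
    if (epfd >= 0) close(epfd);
    if (fabric_fd >= 0) close(fabric_fd);
    if (local[0] >= 0) close(local[0]);
    if (local[1] >= 0) close(local[1]);
    return handled;
}
```

Calling `demo_dispatch_once()` from a small `main()` should report both sources as handled. The point of the design is that the process blocks in one kernel call for both kinds of traffic, so adding the fast transport costs no extra polling and no spinning in userspace.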