Running Samba AD Domain Controllers in large domains
- 1 DRS replication (and joining a DC)
- 2 Running samba-tool dbcheck
- 3 Subtree rename
- 4 wbinfo takes a long time (or doesn't work)
- 5 LDAP full scans (and internal scans)
- 6 LDAP bind
- 7 Samba multi-process model
DRS replication (and joining a DC)
The time it takes to DRS replicate is proportional to the size of the database, and replication is one of the longest running operations one may run against a domain controller. Returning all the data across the network in the correct format takes time, and reformatting the responses and writing the results to disk also takes a significant amount of time. Joining two domain controllers simultaneously does not reduce the time it takes to prepare them, because replication is served serially (one RPC process).
As replication generally triggers a number of writes, it is recommended that the fastest storage possible is used. Even in the case where no meaningful changes will be written (full synchronization on a synchronized database), faster storage has a notable effect on the overall synchronization time.
While a long replication is running, the internal queue of replications in the drepl_server process cannot be processed. This might have an effect on RID allocation, which uses the same flow of replication operations. Avoid attempting a full synchronization while adding users in bulk, otherwise the DC may run out of RIDs to allocate from its pool. Eventually the RID pool should refresh, but in the meantime, operations that consume RIDs should be done against a different domain controller.
Following a full synchronization of a large database, the drepl_server process may have accumulated a large number of pending notifications and pull requests. It may take some time to flush these operations, so user-triggered replications via samba-tool may not respond for a while. Using the --local option of samba-tool drs replicate is one way to avoid waiting. Alternatively, restarting the Samba process will flush the in-memory queue.
Linked attributes, like the member attribute for group membership, contribute a large portion of the overall synchronization time. Avoiding having too many links may reduce the time required to replicate a database.
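As a rough illustration of the --local approach, the sketch below builds the command line for samba-tool drs replicate. The host names and naming context are hypothetical placeholders, and the helper name drs_replicate_argv is invented for this example. The command is only printed, not executed.

```python
import shlex


def drs_replicate_argv(dest_dc, source_dc, naming_context, local=True):
    """Build the argv for `samba-tool drs replicate`.

    With local=True the --local option is appended, so the pull is done
    directly into the local database instead of being queued behind the
    pending operations in drepl_server.
    """
    argv = ["samba-tool", "drs", "replicate", dest_dc, source_dc, naming_context]
    if local:
        argv.append("--local")
    return argv


# Hypothetical host names and naming context, for illustration only.
argv = drs_replicate_argv("dc2.example.com", "dc1.example.com", "DC=example,DC=com")
print(shlex.join(argv))
```

To actually run the command, the list could be passed to subprocess.run on the destination DC.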
LMDB map size errors
Linked attributes processing in Samba 4.9 caused bad (transaction) memory behaviour with LMDB during a join, triggering a map size error with a large number of links. Samba 4.10 should address these issues, but increasing the map size limits may also be a sufficient workaround in some smaller cases.
Running samba-tool dbcheck
Running the standard dbcheck on a large domain can take a very long time (on the order of days, when only checking the consistency and not fixing any issues). The most significant contributor to this time is linked attributes. Regardless of the size of the database, checking consistency rules is important.
The safest way to dbcheck a database (both to check for errors and to fix errors) is while the Samba processes are all offline, because modifications on a live server may interfere with some checks. Local database modifications may also interfere with the dbcheck, so you should make sure that no other local accesses are being made. When running with --fix, the --yes option holds a transaction, which ensures that no other access to the database is possible. Using this against a live server would be extremely unwise, as it would disrupt normal operations for a long period of time.
Skipping checks associated with the member attribute
In Samba versions after 4.11, dbcheck will have a new option to allow a quick check of member linked attributes. In a large domain, member attributes may be quite common, and running the full list of checks consumes far too much time. Since Samba 4.7, a number of consistency issues associated with linked attributes should no longer be simple to trigger. This means that a noticeable number of the checks present in dbcheck are highly unlikely to find any issues, despite consuming a large amount of time.
Fixing a large number of dbcheck errors
As write operations can disrupt normal operations, it is possible to change the scope of what dbcheck inspects and restrict it to an LDAP subtree (base or one-level). This may even be used for single objects in the database, so that you can generate a list of distinguished names and then subsequently run a fix for each of them (ideally the code should generate an unmodified list, but it does not currently have this capability). This method may also be used to restrict the scope of checking the consistency rules without applying fixes yet. You could generate a list of all distinguished names in the database and then trigger dbcheck on each to determine if there might be an issue. Note, however, that discrepancies in the object list, caused by modifications and consistency fixes made externally while dbcheck is checking (rather than fixing), may cause unexpected results.
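The per-object approach described above could be scripted roughly as follows. This is a minimal sketch, not an official procedure: the helper names are invented, the example DNs are hypothetical, and it assumes that samba-tool dbcheck exits with a non-zero status when it finds errors, which should be verified on your version.

```python
import subprocess


def dbcheck_argv(dn, fix=False, sam_ldb=None):
    """Build the argv that restricts dbcheck to a single object (base scope)."""
    argv = ["samba-tool", "dbcheck", dn, "--scope=BASE"]
    if sam_ldb:
        # Optional explicit database URL or path (hypothetical value).
        argv += ["-H", sam_ldb]
    if fix:
        argv += ["--fix", "--yes"]
    return argv


def find_suspect_dns(dns, run=subprocess.call):
    """Check each DN without fixing and return those that report a problem.

    Assumption: a non-zero exit status means dbcheck found at least one error.
    The `run` parameter can be replaced for dry runs or testing.
    """
    return [dn for dn in dns if run(dbcheck_argv(dn)) != 0]


# Hypothetical DNs, for illustration only.
for dn in ["CN=alice,CN=Users,DC=example,DC=com", "CN=bob,CN=Users,DC=example,DC=com"]:
    print(" ".join(dbcheck_argv(dn, fix=True)))
```

The list returned by find_suspect_dns could then be fed back through dbcheck_argv with fix=True, one object at a time, to keep each write transaction short.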
wbinfo takes a long time (or doesn't work)
As the number of groups and users in the database increases, the time it takes to complete calls to wbinfo -g (groups) and wbinfo -u (users) will increase. Normally there should be far more users than groups, so wbinfo -u will be the call of most concern. Once these calls reach around 1 minute, they will start to fail and continue to fail. There is currently no workaround which uses winbind to retrieve this information, although reducing the number of users could help. Consider retrieving this information through some other mechanism, e.g. via the SAMR pipe and associated enumeration RPC calls, or via an LDAP (or local LDB) search query. The LDAP dirsync control (or samba-tool user syncpasswords, which is a user of this control) could also be used to maintain a constantly updating list of users which is reasonably close to the actual list at any point in time.
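As one possible shape for the LDAP or local LDB alternative, the sketch below extracts account names from LDIF text such as that produced by a search restricted to the sAMAccountName attribute, for example with a filter like (&(objectCategory=person)(objectClass=user)). The filter, the helper names, and the sample data are assumptions for illustration. How the LDIF is obtained (ldbsearch, ldapsearch, or another client) is left open.

```python
import base64


def unfold(ldif_text):
    """Join LDIF continuation lines (lines starting with a single space)."""
    lines = []
    for raw in ldif_text.splitlines():
        if raw.startswith(" ") and lines:
            lines[-1] += raw[1:]
        else:
            lines.append(raw)
    return lines


def account_names(ldif_text):
    """Return every sAMAccountName value found in the LDIF text."""
    names = []
    for line in unfold(ldif_text):
        if line.startswith("sAMAccountName:: "):
            # A double colon marks a base64-encoded value.
            encoded = line.split(":: ", 1)[1]
            names.append(base64.b64decode(encoded).decode("utf-8"))
        elif line.startswith("sAMAccountName: "):
            names.append(line.split(": ", 1)[1])
    return names


# Hypothetical sample output, for illustration only.
sample = (
    "dn: CN=alice,CN=Users,DC=example,DC=com\n"
    "sAMAccountName: alice\n"
    "\n"
    "dn: CN=bob,CN=Users,DC=example,DC=com\n"
    "sAMAccountName:: Ym9i\n"
)
print(account_names(sample))
```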
LDAP full scans (and internal scans)
As soon as the total size of the sam.ldb database starts to reach several gigabytes, a full retrieval of the database with default attributes might start taking a minute or more. These reads could block writes and so may bog down the server (particularly DNS updates and logon success or failure accounting). If possible, avoid triggering LDAP full scans of the entire database (or even just the domain partition), and consider restricting the visibility of objects and attributes for ordinary users.
A full scan currently exists in the DRSUAPI pipe of the RPC server. Because of the searching involved, the replication call has a maximum wait time of 10 seconds, which can delay non-NETLOGON RPC calls by up to roughly this amount. Under heavy replication load, expect the RPC server to have higher latencies.
A full scan also currently exists in the periodic check for tombstoned objects and linked attributes. This scan should not take more than 10 or 20 seconds even with a database of several gigabytes. It may impede operations while it runs, but it is not expected to run very frequently in the background.
256MB total limit for data returned
Due to a memory allocation limit within Samba, any LDAP search is restricted to returning less than roughly 256 MB of data. This is quite a lot of data, and it may be advisable to restrict visibility so that users cannot read this much data from the database in one go. There is no way to configure this limit. However, if you wish to return more than 256 MB of data, you can use the paged results LDAP control. Alternatively, you may wish to reduce the amount of data returned by a single LDAP query, by filtering out attributes or changing the default visibility of attributes, or by manually splitting data retrieval into more than one query, e.g. one search for one half of the attributes with another query for the other half, or dividing an OR search expression.
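The manual splitting described above can be sketched as follows. The helper names are invented for this example, and the attribute names and filter clauses are placeholders. Each resulting attribute list or filter would be sent as its own search, and the results merged by the client.

```python
def chunk(items, parts=2):
    """Split a list into at most `parts` consecutive, roughly equal pieces."""
    if not items:
        return []
    size = -(-len(items) // parts)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def split_or_filter(clauses, parts=2):
    """Divide the clauses of an OR expression into several smaller filters."""
    filters = []
    for group in chunk(clauses, parts):
        if len(group) == 1:
            filters.append(group[0])
        else:
            filters.append("(|" + "".join(group) + ")")
    return filters


# One search per attribute half, as described in the text.
print(chunk(["cn", "mail", "memberOf", "description"]))

# One search per half of an OR expression.
print(split_or_filter(["(cn=a*)", "(cn=b*)", "(cn=c*)"]))
```

When an OR expression is divided, an entry that matches clauses in more than one sub-filter will be returned more than once, so the client should de-duplicate results by distinguished name.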
LDAP bind
When binding as a user who belongs to a group (or a recursively inherited group) with many users, the bind time may be noticeably increased (2-3x as long for groups with 20,000+ users). This is because the database needs to load the entire group record and all of its user entries. Work is in progress to try to improve this time.
Samba multi-process model
Under excessive load, the standard process model quickly consumes large amounts of memory and resources, which often results in the out-of-memory killer taking out services. In Samba 4.9 and above, it is strongly recommended to use the prefork process model for starting the Samba DC. More about the prefork model can be found at https://wiki.samba.org/index.php/Samba_server_process_model. One of the advantages of the prefork model is that you can use an smb.conf option to change the number of processes that will be used (globally, and per service as well). On domain controllers with significant resources, this allows administrators to run one process per CPU, which allows significant throughput and minimizes latencies.
The prefork process model will be the default in Samba 4.11, with four worker children per service.
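To illustrate the one-process-per-CPU idea, the sketch below renders an smb.conf fragment sized to the local CPU count. It assumes the parameter is spelled "prefork children", with a per-service form "prefork children:SERVICE", as recalled from the smb.conf manpage. Verify the names against the documentation for your Samba version. The helper name and the example values are hypothetical.

```python
import os


def prefork_conf(default_children=None, per_service=None):
    """Render a [global] smb.conf fragment for prefork worker counts.

    default_children: workers per service; defaults to the CPU count.
    per_service: optional mapping of service name to worker count.
    """
    if default_children is None:
        # Fall back to 4 if the CPU count cannot be determined.
        default_children = os.cpu_count() or 4
    lines = ["[global]", f"    prefork children = {default_children}"]
    for service, count in sorted((per_service or {}).items()):
        lines.append(f"    prefork children:{service} = {count}")
    return "\n".join(lines) + "\n"


# Hypothetical values: 8 workers by default, 16 for the LDAP service.
print(prefork_conf(8, {"ldap": 16}))
```

The fragment only takes effect when Samba is started with the prefork process model.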
Automatic restarting of child processes
With Samba 4.10, there are a number of improvements so that errors affecting a single client connection do not affect the overall availability of the service. Refer to the Samba server process model wiki page and the associated smb.conf manpage documentation for the new parameters.