Running Samba AD Domain Controllers in large domains

From SambaWiki

DRS replication (and joining a DC)

The time DRS replication takes is proportional to the size of the database, and it is one of the longest-running operations that can be run against a domain controller. Returning all the data across the network in the correct format takes time, and so do reformatting the responses and writing the results to disk. Joining two domain controllers simultaneously does not help: the joins are handled serially (one RPC process), so the total time to prepare the domain controllers is not reduced.

As replication generally triggers a number of writes, the fastest storage possible is recommended. Even when no meaningful changes will be written (a full synchronization on an already synchronized database), faster storage has a notable effect on the overall synchronization time.

RID allocation

Because replication takes longer in a large domain, the internal queue of replications in the drepl_server process may go unprocessed for long periods. This can affect RID allocation, which uses the same flow of replication operations. Avoid running a full synchronization while adding users in bulk; otherwise the DC may run out of RIDs to allocate from its pool. The RID pool should eventually refresh, but in the meantime, operations that consume RIDs should be directed at a different domain controller.

Queued replications

Following a full synchronization of a large database, the drepl_server process may have accumulated a large number of pending notifications and pull requests. It may take some time to flush these operations, so user-triggered replications via samba-tool may not respond for a while. Using the --local option of samba-tool drs replicate is one way to avoid waiting. Alternatively, restarting the Samba process will flush the in-memory queue.
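As a minimal sketch of the --local approach described above, the following Python helper builds the samba-tool drs replicate command line. The host names and naming context are placeholders, not values from this page, and the helper name drs_replicate_cmd is hypothetical. The sketch assumes the command is run on the DC that should receive the changes, since --local is understood to pull directly into the local database.

```python
import subprocess


def drs_replicate_cmd(dest_dc, source_dc, naming_context, local=True):
    """Build the argv for `samba-tool drs replicate`.

    With local=True the --local option is appended, which bypasses the
    drepl_server queue instead of waiting for it to drain.
    """
    cmd = ["samba-tool", "drs", "replicate", dest_dc, source_dc, naming_context]
    if local:
        cmd.append("--local")
    return cmd


def run_replication(dest_dc, source_dc, naming_context):
    """Run the replication and raise if samba-tool reports a failure."""
    subprocess.run(drs_replicate_cmd(dest_dc, source_dc, naming_context), check=True)


# Placeholder names for illustration only.
example = drs_replicate_cmd(
    "dc2.samdom.example.com", "dc1.samdom.example.com", "DC=samdom,DC=example,DC=com"
)
print(" ".join(example))
```

Building the argument list separately from running it makes it easy to log or review the exact command before pointing it at a production DC.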

Linked attributes

Linked attributes, such as the member attribute for group membership, contribute a large portion of the overall synchronization time. Keeping the number of links down may reduce the time required to replicate a database.

Running samba-tool dbcheck

Running the standard dbcheck on a large domain can take a very long time (on the order of days, when only checking the consistency and not fixing any issues). The most significant contributor to this time is linked attributes. Regardless of the size of the database, checking consistency rules is important.

The safest way to dbcheck a database (both to check for errors and to fix them) is while all Samba processes are offline, because modifications on a live server may interfere with some checks. Local database modifications may also interfere with dbcheck, so make sure that no other local accesses are being made. When running with --fix, the --yes option uses a transaction to ensure that no other access to the database is possible. Using this against a live server would be extremely unwise, as it would disrupt normal operations for a long period of time.

Skipping checks associated with the member attribute

In Samba versions after 4.11, dbcheck is expected to gain a new option that allows a quick check of member linked attributes. In a large domain, member attributes may be quite common, and running the full list of checks consumes far too much time. Since Samba 4.7, a number of consistency issues associated with linked attributes should no longer be simple to trigger. This means that a noticeable number of the checks in dbcheck are highly unlikely to find any issues, despite consuming a large amount of time.

Fixing a large number of dbcheck errors

As write operations can disrupt normal operations, it is possible to narrow the scope of what dbcheck inspects and restrict it to an LDAP subtree (base or one-level). This can even be applied to single objects in the database, so that you can generate a list of distinguished names and then run a fix for each of them (ideally the code should generate an unmodified list, but it does not currently have this capability). The same method can be used to restrict the scope of consistency checking without applying fixes yet. You could generate a list of all distinguished names in the database and then run dbcheck on each one to determine whether there might be an issue. Note, however, that discrepancies in the object list caused by modifications and consistency fixes made externally while dbcheck is checking (rather than fixing) may cause unexpected results.
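The per-object approach could be sketched roughly as follows. This is an illustration under stated assumptions, not a tested procedure: the helper names dbcheck_cmd and find_suspect_dns are hypothetical, the list of distinguished names is assumed to have been generated separately, and a non-zero exit status from samba-tool dbcheck is assumed to mean that the check reported errors.

```python
import subprocess


def dbcheck_cmd(dn, scope="BASE", fix=False, sam_ldb=None):
    """Build the argv for a dbcheck restricted to one DN and search scope."""
    cmd = ["samba-tool", "dbcheck", dn, "--scope=" + scope]
    if sam_ldb is not None:
        cmd += ["-H", sam_ldb]  # check a specific database file or URL
    if fix:
        cmd += ["--fix", "--yes"]  # apply fixes without prompting
    return cmd


def run_cmd(cmd):
    """Run a command and return its exit status."""
    return subprocess.run(cmd).returncode


def find_suspect_dns(dns, run=run_cmd):
    """Run a check-only dbcheck per DN and collect the DNs that fail.

    Assumption: a non-zero exit status means dbcheck found a problem.
    """
    return [dn for dn in dns if run(dbcheck_cmd(dn)) != 0]
```

A second pass could then call dbcheck_cmd(dn, fix=True) for each suspect DN, ideally while the Samba processes are offline.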

Subtree rename

wbinfo takes a long time (or doesn't work)

As the number of groups and users in the database increases, the time it takes to complete calls to wbinfo -g (groups) and wbinfo -u (users) will increase. Normally there should be far more users than groups, so wbinfo -u will be the call of most concern. Once these calls reach around 1 minute, they will start to fail and will continue to fail. There is currently no workaround that uses winbind to retrieve this information, although reducing the number of users could help. Consider retrieving this information through some other mechanism, e.g. via the SAMR pipe and its associated enumeration RPC calls, or via an LDAP (or local LDB) search query.
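One possible shape for the local LDB alternative is sketched below. It builds an ldbsearch command that requests only sAMAccountName and extracts the names from the LDIF output. The helper names, the database path, and the base DN are placeholders, and the filter on sAMAccountType=805306368 (normal user accounts) is an assumption about which objects should be listed in place of wbinfo -u.

```python
import base64
import subprocess


def ldbsearch_users_cmd(sam_ldb, base_dn):
    """Build an ldbsearch argv that returns only sAMAccountName for users."""
    return [
        "ldbsearch", "-H", sam_ldb, "-b", base_dn, "-s", "sub",
        "(sAMAccountType=805306368)",  # assumption: normal user accounts only
        "sAMAccountName",
    ]


def parse_account_names(ldif_text):
    """Extract sAMAccountName values from LDIF text.

    Values written as 'attr:: ...' are base64-encoded in LDIF. Folded
    (continued) lines are not handled, on the assumption that account
    names are short enough not to be wrapped.
    """
    names = []
    for line in ldif_text.splitlines():
        if line.startswith("sAMAccountName:: "):
            encoded = line.split(":: ", 1)[1]
            names.append(base64.b64decode(encoded).decode("utf-8"))
        elif line.startswith("sAMAccountName: "):
            names.append(line.split(": ", 1)[1])
    return names


def list_users(sam_ldb, base_dn):
    """Run ldbsearch and return the account names it reports."""
    result = subprocess.run(
        ldbsearch_users_cmd(sam_ldb, base_dn),
        check=True, capture_output=True, text=True,
    )
    return parse_account_names(result.stdout)
```

Requesting a single attribute keeps the amount of returned data small, which matters for the scan and size limits discussed in the following sections.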

LDAP full scans (and internal scans)

Once the total size of the sam.ldb database reaches several gigabytes, a full scan of the database with default attributes may take a minute or more to return. These reads can block writes and so may bog down the server (particularly DNS updates and logon success or failure accounting). If possible, avoid triggering LDAP full scans of the entire database (or even just the domain partition), and consider restricting the visibility of objects and attributes for ordinary users.

DRS replication

A full scan currently exists in the DRSUAPI pipe of the RPC server. The replication call can also wait for up to 10 seconds while any searching takes place, which can delay non-NETLOGON RPC calls by up to roughly the same amount. Under heavy replication load, expect the RPC server to have higher latencies.

Tombstones expunge

256MB total limit for data returned

The only current workaround is to use the paged results LDAP control, or to reduce the amount of data returned by a single LDAP query (either by filtering down to what is absolutely necessary or by splitting the data retrieval into more than one query).
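The paged results control works by having the server return a cookie with each page; the client sends that cookie back to request the next page, and an empty cookie signals the end of the result set. The loop below is a minimal, library-independent sketch of that pattern. The fetch_page callable is hypothetical: it stands in for whatever your LDAP client library uses to issue one search with the paged results control attached, and is assumed to return the page's entries together with the next cookie.

```python
def paged_search(fetch_page, page_size=1000):
    """Yield all entries of a search, one page at a time.

    fetch_page(page_size, cookie) is a hypothetical callable that issues
    a single paged search request and returns (entries, next_cookie).
    """
    cookie = b""  # an empty cookie starts a new paged search
    while True:
        entries, cookie = fetch_page(page_size, cookie)
        yield from entries
        if not cookie:  # an empty cookie from the server means no more pages
            break
```

Because the function is a generator, the caller never holds more than one page in memory, which also keeps each individual response well below the total size limit.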

Associated with this bug: https://bugzilla.samba.org/show_bug.cgi?id=13674

LDAP bind