Camthompson/Migration Notes From win2k Server to Samba4
Preamble
This page may look scary to people who want to migrate from win2k to samba4, but the migration really isn't difficult anymore. Use this page as a reference where needed.
This is the wiki page for a 140-computer production environment being migrated from two windows Domain Controllers to two Samba4 Domain Controllers. It is by no means a howto. Everyone has their way of doing things, and this is the story of how we are going to do it.
Outstanding questions
- Currently we have one of our lab environments configured with a 2k3 primary controller (including 2k3 func level, forest level, dc level). Why is it that our win2k3 AD can accept direct edits and properly replicate those edits, yet our samba4 machine cannot be directly edited via dsa.msc (or any other snap-in)? pastebin here. When you try to edit any entry directly in samba, it fails with permission denied (authenticated as domain admin).
- We do not have tkey configured in dns as of now, just allow-update any - if that makes answering this easier.
UPDATE
With the help of Andrew Tridgell and Anatoly I can now directly edit the S4 DC and it will replicate properly. The code they implemented is now in the master branch.
dev-teadc1 being the samba machine:
C:\Documents and Settings\Administrator.WINTEAL>ntfrsutl ds dev-teadc1
ERROR - Cannot bind w/authentication to computer, dev-teadc1; 000006d9 (1753)
ERROR - Cannot bind w/o authentication to computer, dev-teadc1; 000006d9 (1753)
ERROR - Cannot RPC to computer, dev-teadc1; 000006d9 (1753)
- Why would a CN=RID Set object not be created upon vampiring (2003 domain) for the samba4 DC object in the ldap db? Does only the fsmo role holder have a RID Set?
Checkpoint log
Syntax problems with net vampire
[root@dev-teadc1 bin]# ./net vampire -Uadministrator -WWINTEAL --target-dir=/usr/local/samba winteal.tundraeng.com
Password for [WINTEAL\administrator]:
Become DC [(null)] of Domain[WINTEAL]/[winteal.tundraeng.com]
Promotion Partner is Server[tedc2.winteal.tundraeng.com] from Site[Default-First-Site-Name]
Options:crossRef behavior_version[0]
  schema object_version[13]
  domain behavior_version[0]
  domain w2k3_update_revision[0]
Failed to bind to uuid e3514235-4b06-11d1-ab04-00c04fc2dcd2 - NT_STATUS_INVALID_PARAMETER
libnet_BecomeDC() failed - NT_STATUS_INVALID_PARAMETER
Traceback (most recent call last):
  File "/usr/local/samba/lib/python2.6/site-packages/samba/netcmd/__init__.py", line 99, in _run
    return self.run(*args, **kwargs)
  File "/usr/local/samba/lib/python2.6/site-packages/samba/netcmd/vampire.py", line 51, in run
    (domain_name, domain_sid) = net.vampire(domain=domain, target_dir=target_dir)
RuntimeError: NT_STATUS_INVALID_PARAMETER
- The above is still an issue, here are additional snippets showing the syntax parsing problems ./net vampire is experiencing right now
[root@dev-teadc1 bin]# ./net vampire -Uadministrator -WWINTEAL winteal
Traceback (most recent call last):
  File "/usr/local/samba/lib/python2.6/site-packages/samba/netcmd/__init__.py", line 99, in _run
    return self.run(*args, **kwargs)
  File "/usr/local/samba/lib/python2.6/site-packages/samba/netcmd/vampire.py", line 51, in run
    (domain_name, domain_sid) = net.vampire(domain=domain, target_dir=target_dir)
TypeError: argument 2 must be string, not None
- Above is complaining that there is no "--target-dir" parameter defined
[root@dev-teadc1 bin]# ./net -Uadministrator -WWINTEAL --target-dir=/tmp vampire winteal
Invalid option --target-dir=/tmp: unknown option
Usage: net <command> [options]
Type 'net help' for all available commands
- And now it's complaining that --target-dir isn't a valid option
[root@dev-teadc1 bin]# ./net -Uadministrator -WWINTEAL vampire winteal
No command: vampire
Usage: net <command> [options]
Type 'net help' for all available commands
- I worked around the above issue (I guess it's not finding the domain properly) by specifying -Uadministrator@domain.example.com
Functionality problems
- At this point, aatanasov can replicate in his test environment (non-win2k windows domain)
- Now that I've gotten past initial syntactical problems with the net command, I am running into real errors:
Aquiring initiator credentials failed: Cannot allocate memory
Failed to start GENSEC client mech gssapi_krb5: NT_STATUS_UNSUCCESSFUL
Failed to start GENSEC client mechanism gssapi_krb5: NT_STATUS_UNSUCCESSFUL
Update: 2010-04-22
abartlett asked me to try with the new git yesterday, as tridge had gone bug-hunting the night before. I pulled the latest git, and ./net vampire produced the exact same error message as the one I posted on 2010-04-19 (./net vampire debug output).
Update: 2010-04-27
Status:
Abartlet provided me a new branch with better kerberos errors. I also found some cases where the PDC and S4 machine were trying to do lookups on a network that doesn't exist. I fixed those DNS problems and I have also e-mailed 2 .pcap wireshark captures for Andrew to examine at his leisure.
Observations:
- Fully qualifying the domain as the first argument of vampire doesn't make a difference vs relative domain. (./net vampire winteal vs ./net vampire winteal.tundraeng.com)
- Fully qualifying the user with "-Uadministrator@realm.example.com" allows the S4 machine to join the domain, just doesn't vampire or make a DC
- Capitalising the "realm.example.com" causes logon to fail completely - doesn't cause any entry to be created in audit log on win2k pdc or anything
- If you don't fully qualify the user as in the second observation above and just specify -Uadministrator, vampire fails differently:
Failed to get CCACHE for GSSAPI client: Cannot contact any KDC for requested realm
Cannot reach a KDC we require to contact ldap@TEDC2.WINTEAL.TUNDRAENG.COM : kinit for administrator@ failed (Cannot contact any KDC for requested realm: unable to reach any KDC in realm )
When I run:
/usr/local/samba/bin/net vampire winteal.tundraeng.com -Uadministrator@WINTEAL.TUNDRAENG.COM%PASS --target-dir=/tmp/samba4.s4 -d
It will bind the machine to the domain, but fail to vampire, as shown by this output:
GSS Update(krb5)(1) Update failed: Miscellaneous failure (see text): Decrypt integrity check failed
SPNEGO(gssapi_krb5) NEG_TOKEN_INIT failed: NT_STATUS_LOGON_FAILURE
Failed initial gensec_update with mechanism spnego: NT_STATUS_LOGON_FAILURE
Traceback (most recent call last):
  File "/usr/local/samba/lib/python2.6/site-packages/samba/netcmd/__init__.py", line 99, in _run
    return self.run(*args, **kwargs)
  File "/usr/local/samba/lib/python2.6/site-packages/samba/netcmd/vampire.py", line 51, in run
    (domain_name, domain_sid) = net.vampire(domain=domain, target_dir=target_dir)
RuntimeError: Connection to SAMR pipe of PDC for winteal.tundraeng.com failed: Connection to DC failed: NT_STATUS_LOGON_FAILURE
However, when I run:
/usr/local/samba/bin/net vampire winteal.tundraeng.com -Uadministrator@winteal.tundraeng.com%PASS --target-dir=/tmp/samba4.s4 -d5
(Notice the only difference is the second instance of winteal.tundraeng.com isn't capitalised), it will actually join/bind to the domain and create the domain account on the win2k DC, but won't vampire (as shown by this output):
Aquiring initiator credentials failed: gss_krb5_import_cred failed: Decrypt integrity check failed
Failed to start GENSEC client mech gssapi_krb5: NT_STATUS_UNSUCCESSFUL
Failed to start GENSEC client mechanism gssapi_krb5: NT_STATUS_UNSUCCESSFUL
Failed to bind to uuid e3514235-4b06-11d1-ab04-00c04fc2dcd2 - NT_STATUS_UNSUCCESSFUL
libnet_BecomeDC() failed - NT_STATUS_UNSUCCESSFUL
Traceback (most recent call last):
  File "/usr/local/samba/lib/python2.6/site-packages/samba/netcmd/__init__.py", line 99, in _run
    return self.run(*args, **kwargs)
  File "/usr/local/samba/lib/python2.6/site-packages/samba/netcmd/vampire.py", line 51, in run
    (domain_name, domain_sid) = net.vampire(domain=domain, target_dir=target_dir)
RuntimeError: NT_STATUS_UNSUCCESSFUL
The above error is confusing to me, because "Decrypt integrity check failed" essentially means logon failed, which is the same error as produced with an all upper-case realm. However, the all upper-case realm neither causes a Success Audit in the audit log on the win2k box nor binds the machine to the domain.
And, lastly, the output of "/usr/local/samba/bin/net vampire winteal -Uadministrator --target-dir=/tmp/samba4.s4 -d5":
Starting GENSEC mechanism gssapi_krb5
Failed to get CCACHE for GSSAPI client: Cannot contact any KDC for requested realm
Cannot reach a KDC we require to contact ldap@TEDC2.WINTEAL.TUNDRAENG.COM : kinit for administrator@ failed (Cannot contact any KDC for requested realm: unable to reach any KDC in realm )
Failed to start GENSEC client mech gssapi_krb5: NT_STATUS_INVALID_PARAMETER
Failed to start GENSEC client mechanism gssapi_krb5: NT_STATUS_INVALID_PARAMETER
Failed to bind to uuid e3514235-4b06-11d1-ab04-00c04fc2dcd2 - NT_STATUS_INVALID_PARAMETER
libnet_BecomeDC() failed - NT_STATUS_INVALID_PARAMETER
Traceback (most recent call last):
  File "/usr/local/samba/lib/python2.6/site-packages/samba/netcmd/__init__.py", line 99, in _run
    return self.run(*args, **kwargs)
  File "/usr/local/samba/lib/python2.6/site-packages/samba/netcmd/vampire.py", line 51, in run
    (domain_name, domain_sid) = net.vampire(domain=domain, target_dir=target_dir)
RuntimeError: NT_STATUS_INVALID_PARAMETER
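Since the runs above differ mainly in which NT_STATUS code they end with, it can help to pull those codes out of the debug output and compare them side by side. The following is a minimal sketch, not part of Samba: the helper name status_counts is hypothetical, and the two sample strings are shortened excerpts of the output shown above.

```python
import re
from collections import Counter

# Matches Samba status tokens such as NT_STATUS_LOGON_FAILURE.
STATUS_RE = re.compile(r"\bNT_STATUS_[A-Z0-9_]+\b")


def status_counts(debug_output):
    """Count how often each NT_STATUS code appears in a block of debug output."""
    return Counter(STATUS_RE.findall(debug_output))


# Shortened excerpts of the output from two of the runs above.
upper_realm_run = (
    "SPNEGO(gssapi_krb5) NEG_TOKEN_INIT failed: NT_STATUS_LOGON_FAILURE "
    "Failed initial gensec_update with mechanism spnego: NT_STATUS_LOGON_FAILURE"
)
unqualified_user_run = (
    "Failed to start GENSEC client mech gssapi_krb5: NT_STATUS_INVALID_PARAMETER "
    "libnet_BecomeDC() failed - NT_STATUS_INVALID_PARAMETER"
)

if __name__ == "__main__":
    runs = (("upper-case realm", upper_realm_run),
            ("unqualified user", unqualified_user_run))
    for label, text in runs:
        print(label, dict(status_counts(text)))
```

Feeding it the full -d5 or -d10 output of each variant would make it easier to see at a glance which invocation fails at logon and which fails before reaching a KDC.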
Notice that in the following wireshark capture the output is a lot different - notably the lack of AS-REQ, AS-REP, etc.
tshark of aforementioned command
Debug level 10 output of above cmd
Apr 29, 2010
Strange issue I noticed with net vampire:
The <domain> argument in ./net vampire <domain> is behaving weirdly. When I do a vampire, wireshark shows me this as the first few lines of traffic:
[root@dev-teadc1 bin]# /usr/local/samba/bin/net vampire winteal -Uadministrator%PASS --target-dir=/tmp/samba4.s4 -d10
[root@dev-teadc1 ~]# tshark not tcp port 22
Running as user "root" and group "root". This could be dangerous.
Capturing on eth0
0.000000 10.1.2.6 -> 10.1.2.1 DNS Standard query A winteal.winteal.tundraeng.com
0.001451 10.1.2.1 -> 10.1.2.6 DNS Standard query response, No such name
0.002676 10.1.2.6 -> 10.1.2.1 DNS Standard query A winteal.winteal.tundraeng.com
0.003697 10.1.2.1 -> 10.1.2.6 DNS Standard query response, No such name
0.004417 10.1.2.6 -> 10.1.2.1 DNS Standard query A winteal.winteal.tundraeng.com
0.005335 10.1.2.1 -> 10.1.2.6 DNS Standard query response, No such name
0.006165 10.1.2.6 -> 10.1.2.1 DNS Standard query A winteal.winteal.tundraeng.com
0.007146 10.1.2.1 -> 10.1.2.6 DNS Standard query response, No such name
0.007826 10.1.2.6 -> 10.1.2.1 DNS Standard query A winteal.winteal.tundraeng.com
0.008907 10.1.2.1 -> 10.1.2.6 DNS Standard query response, No such name
0.010006 10.1.2.6 -> 10.1.2.1 DNS Standard query A winteal.winteal.tundraeng.com
0.010935 10.1.2.1 -> 10.1.2.6 DNS Standard query response, No such name
0.011588 10.1.2.6 -> 10.1.2.1 DNS Standard query A winteal.winteal.tundraeng.com
0.012451 10.1.2.1 -> 10.1.2.6 DNS Standard query response, No such name
0.013155 10.1.2.6 -> 10.1.2.1 DNS Standard query A winteal.winteal.tundraeng.com
0.014072 10.1.2.1 -> 10.1.2.6 DNS Standard query response, No such name
So when I change <domain> to tedc2 (hostname), the first few lines of wireshark are this:
[root@dev-teadc1 bin]# /usr/local/samba/bin/net vampire tedc2 -Uadministrator%PASSWORD --target-dir=/tmp/samba4.s4 -d10
12.207277 10.1.2.6 -> 10.1.2.1 DNS Standard query A tedc2.winteal.tundraeng.com
12.208712 10.1.2.1 -> 10.1.2.6 DNS Standard query response A 10.1.2.3
12.233326 10.1.2.6 -> 10.1.2.3 NBNS Name query NBSTAT *<00><00><00><00><00><00><00><00><00><00><00><00><00><00><00>
12.234237 10.1.2.3 -> 10.1.2.6 NBNS Name query response NBSTAT
12.257014 10.1.2.6 -> 10.1.2.3 TCP 60328 > microsoft-ds [SYN] Seq=0 Win=5840 Len=0 MSS=1460 TSV=180145581 TSER=0 WS=4
12.257353 10.1.2.3 -> 10.1.2.6 TCP microsoft-ds > 60328 [SYN, ACK] Seq=0 Ack=1 Win=64240 Len=0 MSS=1460 WS=0 TSV=0 TSER=0
12.257473 10.1.2.6 -> 10.1.2.3 TCP 60328 > microsoft-ds [ACK] Seq=1 Ack=1 Win=5840 Len=0 TSV=180145582 TSER=0
12.260156 10.1.2.6 -> 10.1.2.3 SMB Negotiate Protocol Request
12.261681 10.1.2.3 -> 10.1.2.6 SMB Negotiate Protocol Response
12.261760 10.1.2.6 -> 10.1.2.3 TCP 60328 > microsoft-ds [ACK] Seq=224 Ack=190 Win=6912 Len=0 TSV=180145587 TSER=1789238
12.273305 10.1.2.6 -> 10.1.2.3 SMB Session Setup AndX Request, NTLMSSP_NEGOTIAT
Strange. It's trying to do dns lookups on <domain>.<domain>.tundraeng.com
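One possible explanation, which is an assumption and was not confirmed here, is that the short name is being treated as a relative hostname, so the resolver appends its search suffix before querying. The sketch below mimics a simplified stub-resolver search list. The function name candidate_queries is hypothetical, and the search domain winteal.tundraeng.com is assumed to be what /etc/resolv.conf contains on the S4 machine.

```python
def candidate_queries(name, search_domains, ndots=1):
    """Return the names a stub resolver would try, in order (simplified).

    Names with fewer than `ndots` dots get the search suffixes first;
    names with a trailing dot are treated as fully qualified.
    """
    if name.endswith("."):
        # A trailing dot marks the name as absolute: no search list is applied.
        return [name.rstrip(".")]
    with_suffix = [f"{name}.{domain}" for domain in search_domains]
    if name.count(".") >= ndots:
        return [name] + with_suffix
    return with_suffix + [name]


if __name__ == "__main__":
    search = ["winteal.tundraeng.com"]  # assumed resolv.conf search list
    for name in ("winteal", "tedc2", "winteal.tundraeng.com."):
        print(name, "->", candidate_queries(name, search))
```

Under that assumption, winteal expands to winteal.winteal.tundraeng.com (which does not exist) while tedc2 expands to tedc2.winteal.tundraeng.com (which resolves), matching both captures above.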
Been a while since I've done an update - we've been doing a lot of parallel-path work...
June 3, 2010
So, we have tested migration from windows 2000 (dc func level 0) to windows 2003 - it was more time consuming than expected. After raising forest functional level to 1 we are able to vampire against windows 2003 no problem. We still have the test environment to vampire against windows 2000, so once we have thoroughly tested 2003 we will continue testing the direct migration approach.
June 14, 2010
Replication working 90%. I had to symlink /etc/krb5.keytab to /usr/local/samba/private/secrets.keytab at one point. And I also used ./samba_dnsupdate --verbose to check which records I needed to enter. So now I can replicate from 2k3 to s4, but s4 can't replicate to 2k3.
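For reference, the keytab symlink step can be sketched as below. This assumes that /etc/krb5.keytab is the link and /usr/local/samba/private/secrets.keytab is the real file, which is one reading of the sentence above. The helper name link_keytab is hypothetical, and the script would need to run as root.

```python
import os


def link_keytab(target, link_path):
    """Create link_path as a symlink to target unless it already points there.

    Returns True if a link was created, False if it was already in place.
    Refuses to overwrite an existing file or a link pointing elsewhere.
    """
    if os.path.islink(link_path):
        current = os.readlink(link_path)
        if current == target:
            return False
        raise FileExistsError(f"{link_path} already points to {current}")
    if os.path.exists(link_path):
        raise FileExistsError(f"{link_path} exists and is not a symlink")
    os.symlink(target, link_path)
    return True


if __name__ == "__main__":
    # Assumed direction: the system keytab path points at Samba's keytab.
    link_keytab("/usr/local/samba/private/secrets.keytab", "/etc/krb5.keytab")
```

Refusing to replace an existing /etc/krb5.keytab is deliberate, since another service on the host may already depend on it.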
[root@dev-teadc1 bin]# ./net drs showrepl -Uadministrator%<password> dev-teadc1
Default-First-Site-Name\DEV-TEADC1
DSA Options: (none)
Site Options: (none)
DSA object GUID: afaf0e30-1375-40e6-8e46-9bdf99d483a3
DSA invocationID: f433cb88-84a6-4b34-8833-d3c909ed26e7
==== INBOUND NEIGHBORS ====
DsReplicaGetInfo failed - WERR_DS_DRA_ACCESS_DENIED.
return code = -1
DsReplicaGetInfo() failed for DRSUAPI_DS_REPLICA_INFO_KCC_DSA_CONNECT_FAILURES
Which seems to be the exact same problem as described here
June 24, 2010
Discovered segmentation fault during provisioning. I have submitted it to samba-technical@lists.samba.org. I have also sent a valgrind debug info and -d10 to Andrew Bartlett. I have since tried several clean builds and clean provision attempts and the segfault keeps happening. There's very little I can do until this is resolved.
June 30
With Andrew's patch re: Segfault while vampiring 4.0.0alpha12-GIT-1e897f6, we are now able to vampire without segfaulting. We can also now modify the group membership of a user on the S4 DC and it will replicate to the win2k3 machine. Before this, the memberOf= ldap attribute on the user wasn't set to anything. Now when I vampire, it's replicated properly to S4. However, I still can't create users or groups within the S4 DC using dsa.msc - I get the error message: The server is unwilling to process the request. I haven't yet captured the specific ldap, tshark or krb logging information that shows the cause of this windows error message.
note: ekacnet suggests named_config for debugging setup
July 6
Today I am going to try seizing the fsmo role for rid set master from dev-tedc3. This should clear the symptom of not being able to create new objects on the S4 server, although it doesn't fix the underlying problem. I am also going to test the windows DC re-seizing the fsmo role for the purpose of disaster recovery testing.
Samba4 Detailed Migration Plan
Plan for moving from testing environment to production environment
Config and Naming
For simplicity's sake, the main win2k AD DC with all 5 FSMO roles is referred to as PDC.
2nd win2k AD DC is BDC
Neither PDC nor BDC runs DNS or DHCP services; this is done on other linux nodes with dhcpd and bind.
Both PDC and BDC run WINS.
S4 intended replacement PDC is S4DC1
S4 intended replacement BDC is S4DC2
Config - DNS
Primarily a BIND environment on other Linux nodes. PDC is tertiary DNS and a slave, updating Primary DNS.
Additional Preparation before S4 Enters Production
TODO - remove DNS service from PDC completely and test
TODO - move user homes from PDC to primary file and print
TODO - virtualize PDC (BDC already virtualized)
Provisioning to Production
Clean Provision
TODO: provision command line
TODO: net rpc samsync command line
TODO: How to provision samba to avoid logins until in sync?
- firewall? our vlans could help here -- block all but ssh on all but vlan2 (server core)
Daily Tasks
- PDC and BDC log review at the beginning and end of the day.
Weekly Tasks
- update "The Architect" (Andrew Bartlett)
- consider git diff as seen in dev-lan, rebuild and upgrade or re-provision
Potential Scenarios
PDC Corruption - Minor
- domain remains active for logins
- perhaps replication stops
PDC Corruption - Disaster
- domain does not allow logins
- TODO: need to know very quickly which DC is directly being used for a given login test
- TODO: shorewall panic script to run on S4 nodes to block all comm except for ssh
Monitoring Plan:
- hourly test login script, failure SMS'ed
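The hourly login test could be sketched as follows. This is an illustration under stated assumptions, not the script we run: the function names are hypothetical, the login command is whatever test you choose to pass in, and notify stands in for the SMS gateway, which is not specified here.

```python
import subprocess


def login_succeeds(command, timeout=30):
    """Run a login test command; True only if it exits with status 0 in time."""
    try:
        result = subprocess.run(command, capture_output=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        # Missing binary, permission problem or a hung login all count as failure.
        return False
    return result.returncode == 0


def check_and_alert(command, notify, runner=login_succeeds):
    """Run one login test and call notify(message) if it fails."""
    ok = runner(command)
    if not ok:
        notify("Domain login test failed: " + " ".join(command))
    return ok


if __name__ == "__main__":
    # Hypothetical example: list shares on an assumed host with a test account.
    test_command = ["smbclient", "-L", "s4dc1", "-U", "testuser%PASS"]
    check_and_alert(test_command, notify=print)
```

Run from cron once an hour, with notify replaced by a function that sends the SMS. Passing the runner in as a parameter keeps the alerting logic testable without a live domain.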
Recovery Plan:
- quick assessment, revert to snapshots
- Note: Snapshot reversion will likely cause replication to fail. Depending on severity, we could attempt to revert memory-included snapshots for both PDC and BDC near simultaneously
References
Relevant port references gratefully taken from http://people.samba.org/people/2005/09/03
- udp 88 - kerberos
- udp 53 - dns
- udp 389 - cldap
- tcp 135 - rpc portmapper
- tcp 139 - SMB/CIFS
- tcp 389 - ldap
- tcp 445 - SMB/CIFS
- tcp 1024, 1025, 1026 - RPC
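A quick way to confirm that the TCP ports in this list are reachable through the firewall is a plain connect test. The sketch below is a minimal example using only the Python standard library. The function names are hypothetical, the host name in the usage section is an assumption, and the UDP ports are not covered because a connect test cannot verify them.

```python
import socket

# TCP ports taken from the list above; UDP 88, 53 and 389 are not checked here.
TCP_PORTS = {
    135: "rpc portmapper",
    139: "SMB/CIFS",
    389: "ldap",
    445: "SMB/CIFS",
    1024: "RPC",
    1025: "RPC",
    1026: "RPC",
}


def tcp_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def scan(host, ports=TCP_PORTS):
    """Map each port number to whether it accepted a connection."""
    return {port: tcp_port_open(host, port) for port in ports}


if __name__ == "__main__":
    host = "tedc2.winteal.tundraeng.com"  # assumed target; replace as needed
    for port, is_open in sorted(scan(host).items()):
        state = "open" if is_open else "closed/filtered"
        print(f"tcp {port:<5} {TCP_PORTS[port]:<15} {state}")
```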