Difference between revisions of "Camthompson/Migration Notes From win2k Server to Samba4"

Revision as of 22:29, 28 April 2010

Preamble

This is the wiki page for a 140-computer production environment being migrated from two Windows Domain Controllers to two Samba4 Domain Controllers. It is by no means a howto. Everyone has their own way of doing things, and this is the story of how we are going to do it.

Outstanding questions

  1. What needs to be done to be able to vampire a win2k AD?
  2. In an environment with multiple Windows AD domain controllers, how do you tell S4 which DC holds the FSMO roles?
  3. Does replication work in a mixed functional-level environment? (i.e. win2k DC, win2k8 DC)

Checkpoint log

Syntax problems with net vampire

[root@dev-teadc1 bin]# ./net vampire -Uadministrator  -WWINTEAL --target-dir=/usr/local/samba winteal.tundraeng.com
Password for [WINTEAL\administrator]:
Become DC [(null)] of Domain[WINTEAL]/[winteal.tundraeng.com]
Promotion Partner is Server[tedc2.winteal.tundraeng.com] from Site[Default-First-Site-Name]
Options:crossRef behavior_version[0]
        schema object_version[13]
        domain behavior_version[0]
        domain w2k3_update_revision[0]
Failed to bind to uuid e3514235-4b06-11d1-ab04-00c04fc2dcd2 - NT_STATUS_INVALID_PARAMETER
libnet_BecomeDC() failed - NT_STATUS_INVALID_PARAMETER
Traceback (most recent call last):
  File "/usr/local/samba/lib/python2.6/site-packages/samba/netcmd/__init__.py", line 99, in _run
    return self.run(*args, **kwargs)
  File "/usr/local/samba/lib/python2.6/site-packages/samba/netcmd/vampire.py", line 51, in run
    (domain_name, domain_sid) = net.vampire(domain=domain, target_dir=target_dir)
RuntimeError: NT_STATUS_INVALID_PARAMETER
  • The above is still an issue. Here are additional snippets showing the syntax parsing problems ./net vampire is experiencing right now:
[root@dev-teadc1 bin]# ./net vampire -Uadministrator  -WWINTEAL winteal
Traceback (most recent call last):
  File "/usr/local/samba/lib/python2.6/site-packages/samba/netcmd/__init__.py", line 99, in _run
    return self.run(*args, **kwargs)
  File "/usr/local/samba/lib/python2.6/site-packages/samba/netcmd/vampire.py", line 51, in run
    (domain_name, domain_sid) = net.vampire(domain=domain, target_dir=target_dir)
TypeError: argument 2 must be string, not None
  • The above complains because no "--target-dir" parameter was given
[root@dev-teadc1 bin]# ./net -Uadministrator  -WWINTEAL --target-dir=/tmp vampire winteal
Invalid option --target-dir=/tmp: unknown option
Usage:
net <command> [options]
Type 'net help' for all available commands
  • And now it's complaining that --target-dir isn't a valid option
[root@dev-teadc1 bin]# ./net -Uadministrator  -WWINTEAL vampire winteal
No command: vampire
Usage:
net <command> [options]
Type 'net help' for all available commands
  • I worked around the above issue (I guess it's not finding the domain properly) by specifying -Uadministrator@domain.example.com
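
The two parsing failures above (an option rejected when it is placed before the subcommand, and a None reaching the vampire call when --target-dir is omitted) can be reproduced in miniature. The following is only a sketch using Python's argparse; it is not Samba's actual parser, and build_parser, parse and check_vampire_args are hypothetical names introduced for illustration.

```python
import argparse


def build_parser():
    """Hypothetical stand-in for a 'net'-style CLI; not Samba's real parser."""
    parser = argparse.ArgumentParser(prog="net")
    # Global options, accepted before the subcommand.
    parser.add_argument("-U", dest="user")
    parser.add_argument("-W", dest="workgroup")
    commands = parser.add_subparsers(dest="command")
    # --target-dir belongs to the 'vampire' subcommand only.
    vampire = commands.add_parser("vampire")
    vampire.add_argument("domain")
    vampire.add_argument("--target-dir", dest="target_dir", default=None)
    return parser


def parse(argv):
    """Return (namespace, unknown_args) instead of exiting on unknown options."""
    return build_parser().parse_known_args(argv)


def check_vampire_args(args):
    """Fail early with a readable message instead of passing None further down."""
    if args.target_dir is None:
        raise ValueError("vampire requires --target-dir=DIR")
    return args.target_dir
```

In this sketch, placing --target-dir=/tmp before "vampire" leaves it in the unknown list, and omitting it leaves target_dir as None, which check_vampire_args turns into a clear error rather than a TypeError deep inside the call.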

Functionality problems

  • At this point, aatanasov can replicate in his test environment (non-win2k windows domain)
  • Now that I've gotten past initial syntactical problems with the net command, I am running into real errors:
Aquiring initiator credentials failed: Cannot allocate memory
Failed to start GENSEC client mech gssapi_krb5: NT_STATUS_UNSUCCESSFUL
Failed to start GENSEC client mechanism gssapi_krb5: NT_STATUS_UNSUCCESSFUL

./net vampire debug output

Update: 2010-04-22

abartlett asked me yesterday to try again with the latest git, as tridge had gone bug-hunting the night before. I updated from git, and ./net vampire produced exactly the same error message that I posted on 2010-04-19 (./net vampire debug output).

Update: 2010-04-27

Status:

Abartlet provided me with a new branch that has better Kerberos errors. I also found some cases where the PDC and the S4 machine were trying to do lookups on a network that doesn't exist. I fixed those DNS problems and have also e-mailed two Wireshark .pcap captures for Andrew to examine at his leisure.

Observations:

  1. Fully qualifying the domain as the first argument of vampire makes no difference compared with the short domain name (./net vampire winteal vs ./net vampire winteal.tundraeng.com).
  2. Fully qualifying the user with "-Uadministrator@realm.example.com" allows the S4 machine to join the domain, but it does not vampire or become a DC.
    1. Capitalising "realm.example.com" causes logon to fail completely; no entry is created in the audit log on the win2k PDC.
  3. If you don't fully qualify the user as shown in "2" and just specify -Uadministrator, vampire fails differently:

Failed to get CCACHE for GSSAPI client: Cannot contact any KDC for requested realm
Cannot reach a KDC we require to contact ldap@TEDC2.WINTEAL.TUNDRAENG.COM : kinit for administrator@ failed (Cannot contact any KDC for requested realm: unable to reach any KDC in realm )
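
To re-run the three variants from the observations above in a repeatable way, the command lines can be generated rather than typed by hand. This is a minimal sketch that only builds argument lists from the flags already used on this page (-U, --target-dir, -d5); the function names and labels are made up for illustration, and the password is deliberately left out so that net prompts for it.

```python
import shlex

NET = "/usr/local/samba/bin/net"


def vampire_command(domain, user, target_dir="/tmp/samba4.s4", debug_level=5):
    """Build the argv for one 'net vampire' attempt (no password on the command line)."""
    return [
        NET,
        "vampire",
        domain,
        "-U" + user,
        "--target-dir=" + target_dir,
        "-d%d" % debug_level,
    ]


def variants(short="winteal", dns="winteal.tundraeng.com"):
    """The three user/realm spellings compared in the observations."""
    return {
        "upper-case realm": vampire_command(dns, "administrator@" + dns.upper()),
        "lower-case realm": vampire_command(dns, "administrator@" + dns),
        "unqualified user": vampire_command(short, "administrator"),
    }


for label, argv in variants().items():
    # shlex.quote keeps the printed line safe to paste into a shell.
    print(label + ": " + " ".join(shlex.quote(a) for a in argv))
```

Each printed line can be pasted into a shell, or the argv lists can be handed to subprocess.run with the output captured to one log file per variant.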

When I run:

/usr/local/samba/bin/net vampire winteal.tundraeng.com -Uadministrator@WINTEAL.TUNDRAENG.COM%PASS --target-dir=/tmp/samba4.s4 -d

It will bind the machine to the domain, but fail to vampire, as shown by this output:

GSS Update(krb5)(1) Update failed:  Miscellaneous failure (see text): Decrypt integrity check failed
SPNEGO(gssapi_krb5) NEG_TOKEN_INIT failed: NT_STATUS_LOGON_FAILURE
Failed initial gensec_update with mechanism spnego: NT_STATUS_LOGON_FAILURE
Traceback (most recent call last):
  File "/usr/local/samba/lib/python2.6/site-packages/samba/netcmd/__init__.py", line 99, in _run
    return self.run(*args, **kwargs)
  File "/usr/local/samba/lib/python2.6/site-packages/samba/netcmd/vampire.py", line 51, in run
    (domain_name, domain_sid) = net.vampire(domain=domain, target_dir=target_dir)
RuntimeError: Connection to SAMR pipe of PDC for winteal.tundraeng.com failed: Connection to DC failed: NT_STATUS_LOGON_FAILURE

However, when I run:

/usr/local/samba/bin/net vampire winteal.tundraeng.com -Uadministrator@winteal.tundraeng.com%PASS --target-dir=/tmp/samba4.s4 -d5

(Notice the only difference is the second instance of winteal.tundraeng.com isn't capitalised), it will actually join/bind to the domain and create the domain account on the win2k DC, but won't vampire (as shown by this output):

Aquiring initiator credentials failed: gss_krb5_import_cred failed: Decrypt integrity check failed
Failed to start GENSEC client mech gssapi_krb5: NT_STATUS_UNSUCCESSFUL
Failed to start GENSEC client mechanism gssapi_krb5: NT_STATUS_UNSUCCESSFUL
Failed to bind to uuid e3514235-4b06-11d1-ab04-00c04fc2dcd2 - NT_STATUS_UNSUCCESSFUL
libnet_BecomeDC() failed - NT_STATUS_UNSUCCESSFUL
Traceback (most recent call last):
  File "/usr/local/samba/lib/python2.6/site-packages/samba/netcmd/__init__.py", line 99, in _run
    return self.run(*args, **kwargs)
  File "/usr/local/samba/lib/python2.6/site-packages/samba/netcmd/vampire.py", line 51, in run
    (domain_name, domain_sid) = net.vampire(domain=domain, target_dir=target_dir)
RuntimeError: NT_STATUS_UNSUCCESSFUL

The above error confuses me, because "Decrypt integrity check failed" essentially means the logon failed, which is the same error produced with an all upper-case realm. However, the all upper-case realm neither causes a Success Audit in the audit log on the win2k box nor binds the machine to the domain.

And lastly, the output of "/usr/local/samba/bin/net vampire winteal -Uadministrator --target-dir=/tmp/samba4.s4 -d5":

Starting GENSEC mechanism gssapi_krb5
Failed to get CCACHE for GSSAPI client: Cannot contact any KDC for requested realm
Cannot reach a KDC we require to contact ldap@TEDC2.WINTEAL.TUNDRAENG.COM : kinit for administrator@ failed (Cannot contact any KDC for requested realm: unable to reach any KDC in realm )

Failed to start GENSEC client mech gssapi_krb5: NT_STATUS_INVALID_PARAMETER
Failed to start GENSEC client mechanism gssapi_krb5: NT_STATUS_INVALID_PARAMETER
Failed to bind to uuid e3514235-4b06-11d1-ab04-00c04fc2dcd2 - NT_STATUS_INVALID_PARAMETER
libnet_BecomeDC() failed - NT_STATUS_INVALID_PARAMETER
Traceback (most recent call last):
  File "/usr/local/samba/lib/python2.6/site-packages/samba/netcmd/__init__.py", line 99, in _run
    return self.run(*args, **kwargs)
  File "/usr/local/samba/lib/python2.6/site-packages/samba/netcmd/vampire.py", line 51, in run
    (domain_name, domain_sid) = net.vampire(domain=domain, target_dir=target_dir)
RuntimeError: NT_STATUS_INVALID_PARAMETER

Notice that in the following Wireshark capture the output is quite different, notably in the lack of AS-REQ, AS-REP, etc.

tshark of aforementioned command: http://wiki.samba.org/index.php/Image:Tshark_output_apr_27_2010.txt

Debug level 10 output of above cmd: http://pastebin.com/ECeBdaSr
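
To compare captures without reading them line by line, the Kerberos message types in a tshark text dump can be counted. This is a rough sketch that assumes the dump contains one summary line per packet and that the Info column names the message type ("AS-REQ", "AS-REP", "TGS-REQ", "TGS-REP", "KRB Error"); the function names are illustrative only.

```python
import re
from collections import Counter

KRB_TYPES = ("AS-REQ", "AS-REP", "TGS-REQ", "TGS-REP", "KRB Error")
_PATTERN = re.compile("|".join(re.escape(t) for t in KRB_TYPES))


def count_kerberos_messages(lines):
    """Count Kerberos message types mentioned in tshark summary lines."""
    counts = Counter({t: 0 for t in KRB_TYPES})
    for line in lines:
        for match in _PATTERN.findall(line):
            counts[match] += 1
    return counts


def missing_exchange(counts):
    """Return the parts of the initial AS exchange that never appear."""
    return [t for t in ("AS-REQ", "AS-REP") if counts[t] == 0]
```

Running count_kerberos_messages over the lines of each saved dump and comparing the results shows at a glance whether a run ever attempted the AS exchange.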

Samba4 Detailed Migration Plan

Plan for moving from testing environment to production environment

Config and Naming

For simplicity's sake, the main win2k AD DC with all 5 FSMO roles is referred to as PDC.

The second win2k AD DC is referred to as BDC.

Neither PDC nor BDC runs DNS or DHCP services; these run on other Linux nodes with dhcpd and BIND.

Both PDC and BDC run WINS.


The intended S4 replacement for PDC is S4DC1.

The intended S4 replacement for BDC is S4DC2.


Config - DNS

Primarily a BIND environment on other Linux nodes. PDC is tertiary DNS and a slave, updating Primary DNS.


Additional Preparation before S4 Enters Production

TODO - remove DNS service from PDC completely and test 
TODO - move user homes from PDC to primary file and print
TODO - virtualize PDC (BDC already virtualized)


Provisioning to Production

Clean Provision

TODO: provision command line
TODO: net rpc samsync command line
TODO: How to provision Samba so that it does not accept logins until it is in sync?
  • firewall? our vlans could help here -- block all but ssh on all but vlan2 (server core)


Daily Tasks

  1. PDC and BDC log review at the beginning and end of the day.


Weekly Tasks

  1. update "The Architect" (Andrew Bartlett)
  2. consider git diff as seen in dev-lan, rebuild and upgrade or re-provision


Potential Scenarios

PDC Corruption - Minor

  • domain remains active for logins
  • perhaps replication stops

PDC Corruption - Disaster

  • domain does not allow logins
  • TODO: need to know very quickly which DC is directly being used for a given login test
  • TODO: shorewall panic script to run on S4 nodes to block all comm except for ssh

Monitoring Plan:

  • hourly test login script, failure SMS'ed
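
The monitoring idea above can be sketched as a small wrapper that retries a login test and raises an alert on failure. This is only an outline: try_login and notify are hypothetical callables to be supplied by the site (for example a script that attempts an authenticated connection, and an SMS gateway client), and the hourly scheduling is assumed to come from cron rather than from the script itself.

```python
import time


def run_login_check(try_login, notify, attempts=2, delay_seconds=0):
    """Run the login test; call notify(message) once if every attempt fails.

    try_login: callable returning True on a successful test login (hypothetical).
    notify:    callable taking a message string, e.g. an SMS sender (hypothetical).
    """
    last_error = None
    for attempt in range(attempts):
        try:
            if try_login():
                return True
            last_error = "login rejected"
        except Exception as exc:  # treat any error in the probe as a failure
            last_error = str(exc)
        if attempt + 1 < attempts and delay_seconds:
            time.sleep(delay_seconds)
    notify("test login failed: %s" % last_error)
    return False
```

Retrying once before alerting is a design choice to avoid an SMS for a single transient failure; set attempts=1 if every failure should be reported.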

Recovery Plan:

  • quick assessment, revert to snapshots
    • Note: Snapshot reversion will likely cause replication to fail. Depending on severity, we could attempt to revert memory-included snapshots for both PDC and BDC near simultaneously

References

Relevant port references gratefully taken from http://people.samba.org/people/2005/09/03

- udp 88  - kerberos
- udp 53  - dns
- udp 389 - cldap
- tcp 135 - rpc portmapper
- tcp 139 - SMB/CIFS
- tcp 389 - ldap
- tcp 445 - SMB/CIFS
- tcp 1024, 1025, 1026 - RPC
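
The TCP ports in the list above can be probed from another host before and after firewall changes. The following is a minimal sketch using only the Python standard library; it checks TCP reachability only (the UDP entries cannot be verified with a plain connect), and the function names are illustrative.

```python
import socket

# TCP ports taken from the reference list above.
DC_TCP_PORTS = {
    135: "rpc portmapper",
    139: "SMB/CIFS",
    389: "ldap",
    445: "SMB/CIFS",
    1024: "RPC",
    1025: "RPC",
    1026: "RPC",
}


def tcp_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_dc(host, ports=DC_TCP_PORTS, timeout=2.0):
    """Map each port number to True/False reachability for one host."""
    return {port: tcp_port_open(host, port, timeout) for port in sorted(ports)}
```

For example, check_dc("tedc2.winteal.tundraeng.com") returns a dictionary keyed by port number, which makes it easy to spot which services a firewall rule has cut off.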