If you manage Active Directory (AD), the health of your Domain Controllers (DCs) is crucial. DCs sit at the heart of user authentication, authorization, and network services. If DCs can't sync, authentication can fail and lock users out, or a DC working from stale data can keep granting access that has already been revoked elsewhere. Inconsistent data between DCs also leads to conflicting accounts and groups and to out-of-date settings.
So how can we know if there's a problem before the tickets start coming in? There are several aspects of a domain controller that can be checked to ensure smooth sailing!
Services
AD functionality depends on a few services to run. These should always be running!
- NTDS (Windows NT Directory Services) is the primary service for AD. Remember ntds.dit? That's the database file this service uses.
- Netlogon is the service that locates DCs and authenticates users against the domain. It establishes a secure connection between the computer and the DC so that the user can be authenticated onto the network.
- DNS is the service that makes it possible for clients to locate DCs and other AD resources. The DCs also use the AD DNS zones to help them communicate with each other.
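As a rough sketch of the service check described above (not the script discussed later in this article), the three services can be queried with `Get-Service` and turned into report lines. The helper name `Get-ServiceReportLine` and the exact line format are assumptions made for illustration.

```powershell
# Hypothetical helper: converts service objects into report lines.
# Accepts any objects with Name and Status properties, such as Get-Service output.
function Get-ServiceReportLine {
    param(
        [object[]] $Service
    )
    foreach ($svc in $Service) {
        if ("$($svc.Status)" -eq 'Running') {
            "$($svc.Name) Service: Success"
        }
        else {
            # Anything other than Running is flagged with the ALERT keyword.
            "$($svc.Name) Service: Failed ($($svc.Status)) <-------------------- ALERT"
        }
    }
}

# Sample objects so the helper can be tried anywhere (illustrative values only).
$sample = @(
    [pscustomobject]@{ Name = 'NTDS';     Status = 'Running' }
    [pscustomobject]@{ Name = 'Netlogon'; Status = 'Running' }
    [pscustomobject]@{ Name = 'DNS';      Status = 'Stopped' }
)
Get-ServiceReportLine -Service $sample
```

On a domain controller, the same helper could be fed live data with `Get-ServiceReportLine -Service (Get-Service -Name NTDS, Netlogon, DNS)`.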
DCDiag
DCDiag is a tool that runs a variety of tests against DCs and DNS servers and reports the results. These tests provide a high-level overview of the overall health of a domain controller. Specifically, we want to check the following:
- Advertising - Checks whether each domain controller advertises itself in the roles that it should be capable of performing.
- FSMOCheck - Checks that the domain controller can contact a Kerberos Key Distribution Center (KDC), a time server, a preferred time server, a primary domain controller (PDC), and a global catalog server.
- KnowsOfRoleHolders - Checks whether the domain controller can contact the servers that hold the five operations master roles.
- Replications - Checks for timely replication and any replication errors between domain controllers.
- Services - Checks whether the appropriate domain controller services are running.
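The five tests above can be run one at a time with `dcdiag /test:<name>` and the result read from the "passed test" or "failed test" line in the output. The following is a minimal sketch under a few assumptions: English-language dcdiag output, a hypothetical helper named `Get-DcDiagResult`, and a report line format chosen for illustration.

```powershell
# Hypothetical helper: reads dcdiag output and reports Passed, Failed, or Unknown.
# Assumes English output containing "passed test <name>" or "failed test <name>".
function Get-DcDiagResult {
    param(
        [string]   $TestName,
        [string[]] $Output
    )
    # Join the lines first, because dcdiag can wrap a long result line.
    $text    = $Output -join "`n"
    $escaped = [regex]::Escape($TestName)
    if ($text -match "failed\s+test\s+$escaped\b") { 'Failed' }
    elseif ($text -match "passed\s+test\s+$escaped\b") { 'Passed' }
    else { 'Unknown' }
}

$tests = 'Advertising', 'FsmoCheck', 'KnowsOfRoleHolders', 'Replications', 'Services'

# Only runs where dcdiag.exe is available (for example, on a domain controller).
if (Get-Command dcdiag.exe -ErrorAction SilentlyContinue) {
    foreach ($test in $tests) {
        $output = & dcdiag.exe "/s:$env:COMPUTERNAME" "/test:$test"
        $result = Get-DcDiagResult -TestName $test -Output $output
        $flag   = if ($result -eq 'Passed') { '' } else { ' <-------------------- ALERT' }
        "DCDIAG: ${test}: $result$flag"
    }
}
```

Treating anything other than an explicit pass as an alert is a deliberately cautious choice: if the output cannot be parsed, the check is surfaced rather than silently skipped.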
Replication
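Best practice is to run multiple domain controllers for high availability and redundancy. As mentioned earlier, if replication breaks, the domain goes sideways: changes made on one DC never reach the others, and the copies of the directory database become inconsistent. That makes both the number of replication errors and the time of the last successful replication worth watching.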
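One way to check replication from PowerShell is the `Get-ADReplicationPartnerMetadata` cmdlet from the ActiveDirectory module, which exposes `ConsecutiveReplicationFailures` and `LastReplicationSuccess` for each replication partner. The sketch below is an illustration only: the helper name `Get-ReplicationSummary` and the 24-hour staleness threshold are assumptions, not values taken from the script in this article.

```powershell
# Hypothetical helper: summarizes replication partner metadata.
# Expects objects with ConsecutiveReplicationFailures and LastReplicationSuccess.
function Get-ReplicationSummary {
    param(
        [object[]] $PartnerMetadata,
        [int]      $MaxAgeHours = 24,      # assumed threshold, tune for your environment
        [datetime] $Now = (Get-Date)
    )
    $errorCount      = 0
    $oldestSuccess   = $null
    $neverReplicated = $false
    foreach ($partner in $PartnerMetadata) {
        $errorCount += [int]$partner.ConsecutiveReplicationFailures
        if ($null -eq $partner.LastReplicationSuccess) {
            $neverReplicated = $true
            continue
        }
        $success = [datetime]$partner.LastReplicationSuccess
        if ($null -eq $oldestSuccess -or $success -lt $oldestSuccess) {
            $oldestSuccess = $success
        }
    }
    # Stale if a partner never replicated, no data exists, or the oldest success is too old.
    $isStale = $neverReplicated -or ($null -eq $oldestSuccess) -or
               (($Now - $oldestSuccess).TotalHours -gt $MaxAgeHours)
    [pscustomobject]@{
        ErrorCount    = $errorCount
        OldestSuccess = $oldestSuccess
        Healthy       = ($errorCount -eq 0) -and (-not $isStale)
    }
}

# Only runs where the ActiveDirectory module is installed.
if (Get-Command Get-ADReplicationPartnerMetadata -ErrorAction SilentlyContinue) {
    $metadata = Get-ADReplicationPartnerMetadata -Target $env:COMPUTERNAME -Scope Server
    $summary  = Get-ReplicationSummary -PartnerMetadata $metadata
    $state    = if ($summary.Healthy) { 'Passed' } else { 'Failed <-------------------- ALERT' }
    "Replication Errors: $($summary.ErrorCount) - $state"
}
```

Passing `-Now` explicitly is optional; it exists mainly so the helper can be exercised against fixed sample data.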
Disk Space
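This might seem obvious, but it's worth checking! The AD database is contained in a file called "ntds.dit". If the drive where that file resides were ever to fill up, AD would no longer be able to write changes to its database, and bad things would happen! The same goes for the drive that holds the operating system.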
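A free-space check can be sketched by reading the ntds.dit path from the NTDS service's registry parameters and querying the matching volume through `Win32_LogicalDisk`. The helper name `Get-FreeSpaceLine` and the 5% alert threshold below are assumptions for illustration, not values from the script in this article.

```powershell
# Hypothetical helper: formats a free-space report line and flags low space.
function Get-FreeSpaceLine {
    param(
        [string] $Label,
        [double] $FreeBytes,
        [double] $TotalBytes,
        [int]    $MinimumPercent = 5      # assumed threshold, tune for your environment
    )
    if ($TotalBytes -le 0) {
        return "${Label}: Unknown <-------------------- ALERT"
    }
    $percent = [math]::Floor($FreeBytes * 100 / $TotalBytes)
    if ($percent -lt $MinimumPercent) {
        "${Label}: $percent% <-------------------- ALERT"
    }
    else {
        "${Label}: $percent%"
    }
}

# Only runs on a Windows machine where the NTDS parameters key exists (a DC).
$ntdsKey = 'HKLM:\SYSTEM\CurrentControlSet\Services\NTDS\Parameters'
if ($env:OS -eq 'Windows_NT' -and (Test-Path $ntdsKey)) {
    # 'DSA Database file' holds the full path to ntds.dit.
    $ditPath = (Get-ItemProperty -Path $ntdsKey).'DSA Database file'
    $driveId = Split-Path -Path $ditPath -Qualifier      # for example 'C:'
    $disk    = Get-CimInstance -ClassName Win32_LogicalDisk -Filter "DeviceID='$driveId'"
    Get-FreeSpaceLine -Label 'DIT Free Space (%)' -FreeBytes $disk.FreeSpace -TotalBytes $disk.Size
}
```

The same helper can be pointed at the system drive (`$env:SystemDrive`) to produce the OS free-space line.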
Monitoring with PowerShell and Level
We've put together a script that will bring all these pieces together so that an alert can be generated when a problem arises.
When run, the script checks each of these health indicators on the server and generates a report like this:
Script started...
Server: DC1.MyDomain.com
Site: DataCenter
OS Version: Microsoft Windows Server 2019 Standard
Operation Master Roles: none
DNS: Success
Ping: Success
Uptime (hrs): 172
DIT Free Space (%): 9%
OS Free Space (%): 9%
DNS Service: Success
NTDS Service: Success
NetLogon Service: Success
DCDIAG: Advertising: Failed <-------------------- ALERT
DCDIAG: Replications: Passed
DCDIAG: FSMO KnowsOfRoleHolders: Passed
DCDIAG: FSMO Check: Passed
DCDIAG: Services: Passed
Replication Errors: 0 - Passed
Last Replication: 08/31/2023 17:12:30 - Passed
Processing Time: 1
Summary: 1 Error(s) Detected
In this case, the DCDiag Advertising check has failed and an ALERT message is generated. With the keyword ALERT in the output, we can use this script in Level to monitor any devices tagged as Domain Controllers. Add a new monitor to a Monitor Policy that is used for domain controllers. Here, the policy named Monitor DCs targets all devices with the tag DCs.
In the new monitor, provide a name that will show in the alert, in this case "Failed Domain Controller health check". The script type is Run Script, and the script used is the script linked above from GitHub, which in this case is named "Monitor - Domain Controller (DC) Health Check". The script output trigger is set to "Contains" and the value is set to "ALERT".
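To preview locally which lines a "Contains" / "ALERT" trigger would react to, the report can be filtered for the keyword. This is only a sketch: the helper name `Get-AlertLine` is hypothetical, and it assumes the trigger is a plain, case-sensitive substring match, which may not be exactly how Level evaluates it.

```powershell
# Hypothetical helper: returns the report lines that contain the ALERT keyword.
# Assumes a plain, case-sensitive substring match.
function Get-AlertLine {
    param(
        [string[]] $ReportLine
    )
    foreach ($line in $ReportLine) {
        if ($line -and $line.Contains('ALERT')) { $line }
    }
}

# Two lines taken from the sample report above.
$report = @(
    'DCDIAG: Advertising: Failed <-------------------- ALERT'
    'DCDIAG: Replications: Passed'
)

$alerts = @(Get-AlertLine -ReportLine $report)
$alerts
"Summary: $($alerts.Count) Error(s) Detected"
```

Counting the matching lines is also one simple way a summary line like the one at the end of the report could be produced.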
Once the monitor is in place, any problems will trigger an alert and show which portions of the health check failed. In this example we see three failed tests have generated alerts. Because the word "ALERT" is returned, the Level alert is generated.
Once these issues are fixed, the alert will auto-resolve. If there is ever a problem again, an alert will be generated before the flood of calls comes in to the helpdesk! 💪
Have an idea for a script? Please let us know, or contribute on our community script repo: https://github.com/levelsoftware/scripts