Lesson 19 of 60 intermediate

Domain Joins & Login Failures

Safe first actions under identity pressure

Real-world analogy

A domain join is like a device signing an employment contract. Once signed, it gets an ID, trusts the building, and plays by the rules. If the contract is torn or the building loses the signature page, the device is suddenly a stranger at the door.

What is it?

This lesson is the crisis-response layer on top of everything you learned about AD, DNS, DHCP, GPO, and permissions. It makes sure that when identity goes wrong, your first moves are safe, documented, and scoped.

Real-world relevance

Half of HQ can’t log in after a weekend change window. A junior immediately rejoins laptops and blames ‘AD.’ A senior runs gpresult, checks Event Viewer, and finds a new GPO that blocks network logons on Windows 11 machines through a misconfigured logon right. The senior reverts the GPO, and 200 users are back in within 10 minutes.

Key points

What actually happens in a domain join — The computer creates a machine account in AD, establishes a secure channel with a DC, and begins using Kerberos/NTLM for auth. The password on the machine account rotates automatically every 30 days by default.
The real pre-flight checklist — (1) The client uses the correct internal DNS servers, (2) it can resolve the domain and its SRV records, (3) its clock is within Kerberos tolerance (5 minutes of skew by default), (4) a DC is reachable on the required ports, and (5) you have a valid domain account with permission to join computers to the target OU.
‘The trust relationship between this workstation and the primary domain failed’ — Classic error. It usually means the machine account password has de-synced between the client and AD. Safest fix: confirm with Test-ComputerSecureChannel, then run Reset-ComputerMachinePassword with an admin account, and rejoin only if that does not restore the channel.
When rejoin is required — Occasionally — after long offline periods, image restores, or tampered machines. Rejoining should be a tracked action, not a reflex. Always capture evidence of what failed before you reset.
Multi-user login failures → think infra, not user — If 1 user can’t log in, suspect the user. If MANY users can’t log in, suspect shared infrastructure: DNS, DHCP, DC availability, time sync, firewall rule, or GPO rollout.
The account lockout dance — Too many bad passwords → account locks. Stale saved credentials (mapped drives, phones, scripts) can silently keep locking an account. Use the Account Lockout Status tools (LockoutStatus.exe) to find the source.
GPO-induced logon problems — A brand-new GPO can break logons fleet-wide (drive mappings, scripts, restricted paths). Always pilot with security filtering; never link a new GPO domain-wide without testing.
Cached credentials help but are not magic — Windows caches last-known domain credentials so users can log in when no DC is reachable. The cache only covers users who have previously logged on to that device. It does not replace a healthy AD path.
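
The pre-flight checklist above can be sketched as a small PowerShell function. This is a minimal sketch, not a complete join validator: the function name, the example domain, and the port list are illustrative assumptions, and item (5), join permissions on the target OU, is not checked here.

```powershell
# Sketch only: function name, example domain, and port list are assumptions.
function Test-DomainJoinPreflight {
    param(
        [Parameter(Mandatory)] [string] $Domain,    # e.g. 'corp.example.com' (hypothetical)
        [int[]] $Ports = @(88, 135, 389, 445)       # common AD ports, not an exhaustive list
    )

    # (1) DNS servers this client is using; they should be your internal AD DNS servers.
    Get-DnsClientServerAddress -AddressFamily IPv4 |
        Where-Object { $_.ServerAddresses } |
        Select-Object InterfaceAlias, ServerAddresses |
        Out-Host

    # (2) DC locator SRV records; -ErrorAction Stop aborts here if DNS cannot find the domain.
    $srv = @(Resolve-DnsName -Name "_ldap._tcp.dc._msdcs.$Domain" -Type SRV -ErrorAction Stop |
        Where-Object { $_.Type -eq 'SRV' })
    $dc = $srv[0].NameTarget    # first DC returned; any listed DC would do for this check

    # (3) Clock offset against that DC (Kerberos tolerates 5 minutes of skew by default).
    w32tm /stripchart "/computer:$dc" /samples:3 /dataonly | Out-Host

    # (4) TCP reachability of the DC on each port.
    foreach ($port in $Ports) {
        Test-NetConnection -ComputerName $dc -Port $port -WarningAction SilentlyContinue |
            Select-Object ComputerName, RemotePort, TcpTestSucceeded
    }
}

# Usage (on the client you intend to join):
#   Test-DomainJoinPreflight -Domain 'corp.example.com'
```

If step (2) throws, stop there: a client that cannot find SRV records will not find a DC, so the later checks add nothing until DNS is fixed.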

Code example

// Multi-user logon failure — scoped triage

1) Scope
   - 1 user or many?
   - 1 site or many?
   - 1 OS/image or mixed?

2) Identity path
   ipconfig /all                  # internal DNS?
   nslookup -type=SRV _ldap._tcp.<domain>   # SRV records resolve?
   Test-ComputerSecureChannel -Verbose
   w32tm /query /status           # clock skew

3) Policy
   gpresult /h report.html
   Recent GPO changes -> correlation with symptom time

4) Auth logs
   Event Viewer -> Security on DCs
   Event IDs 4768 (TGT), 4769 (service ticket), 4625 (logon failed)

5) Change control
   What changed in the last 48h?
   What can be reverted safely?

Line-by-line walkthrough

1. Scoped triage playbook
2. Step 1 — scope the blast radius
3. How many users
4. How many sites
5. Single or multi-image
6. Blank separator
7. Step 2 — identity path checks
8. Verify DNS
9. Verify SRV resolution
10. Check secure channel
11. Check clock
12. Blank separator
13. Step 3 — policy check
14. gpresult report
15. Correlate with recent GPO changes
16. Blank separator
17. Step 4 — auth logs on DCs
18. Security log
19. TGT event ID
20. Service ticket event ID
21. Logon failure event ID
22. Blank separator
23. Step 5 — change control
24. Recent changes
25. Safe revert candidates
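
Step 4 of the playbook (auth logs on DCs) can also be done from PowerShell instead of clicking through Event Viewer. The sketch below is one possible approach under stated assumptions: the function name is hypothetical, it must run elevated against a DC you can read the Security log on, and the 48-hour window simply mirrors the change-control question in step 5.

```powershell
# Sketch only: the function name and default window are assumptions.
function Get-AuthEventSummary {
    param(
        [string] $ComputerName = $env:COMPUTERNAME,   # point this at a domain controller
        [int]    $Hours = 48
    )

    $filter = @{
        LogName   = 'Security'
        Id        = 4625, 4768, 4769                  # logon failed, TGT, service ticket
        StartTime = (Get-Date).AddHours(-$Hours)
    }

    # SilentlyContinue: an empty result is not treated as an error.
    $events = Get-WinEvent -ComputerName $ComputerName -FilterHashtable $filter -ErrorAction SilentlyContinue

    # Count events per hour and event ID so the start of the failure is easy to spot.
    $events |
        Select-Object Id, @{ n = 'Hour'; e = { $_.TimeCreated.ToString('yyyy-MM-dd HH:00') } } |
        Group-Object Hour, Id |
        Sort-Object Name |
        Select-Object Name, Count
}

# Usage:
#   Get-AuthEventSummary -ComputerName 'dc01' -Hours 24    # 'dc01' is a placeholder
```

A sudden jump in 4625 counts at a specific hour gives you a timestamp to compare against the change log, which is usually more useful than reading individual events.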

Spot the bug

Monday morning: 40 users from HQ say ‘cannot connect to domain’.
A junior rejoins all 40 laptops one by one. Takes 6 hours.

Need a hint?

What cheaper and safer first step would have revealed the real root cause?

Show answer

Scope first. Run gpresult /h report.html on a couple of affected machines, check Event Viewer on the DC, and look for recent changes (DNS, DHCP, GPO, firewall). In most real cases a single misconfigured change is the cause and can be reverted in minutes, which avoids rejoining 40 devices.
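
Looking for recent GPO changes does not have to be manual. As a minimal sketch, assuming the GroupPolicy module (RSAT or a DC) is available and using a hypothetical function name, you could list every GPO modified inside the change window:

```powershell
# Sketch only: the function name and 48-hour default are assumptions.
function Get-RecentGpoChange {
    param([int] $Hours = 48)

    Import-Module GroupPolicy -ErrorAction Stop       # ships with RSAT and on DCs
    $since = (Get-Date).AddHours(-$Hours)

    # Every GPO in the domain whose last modification falls inside the window, newest first.
    Get-GPO -All |
        Where-Object { $_.ModificationTime -ge $since } |
        Sort-Object ModificationTime -Descending |
        Select-Object DisplayName, ModificationTime, GpoStatus, Id
}

# Usage:
#   Get-RecentGpoChange -Hours 72
```

This only shows that a GPO changed, not what changed inside it. Treat the output as a shortlist to compare against the gpresult report from an affected machine.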

Explain like I'm 5

When the building doesn’t recognize your ID anymore, you don’t knock down the front door. You check: is the nameplate right, is the badge still valid, is the building’s clock the same as yours, and did anyone change the rules last night?

Fun fact

Many real corporate outages are just phones. A salesperson changes their domain password on a laptop but forgets their iPhone’s Exchange profile. The phone keeps trying the old password until the account locks out — over and over — and the ‘attacker’ is the user themselves.
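
A repeating lockout like this one can often be traced from event ID 4740 (account locked out), which is usually recorded on the PDC emulator. The sketch below assumes the ActiveDirectory module is installed and that you can read the Security log on that DC; both function names are hypothetical. It relies on the common convention that the first two fields of a 4740 event hold the locked account and the caller computer.

```powershell
# Sketch only: function names are assumptions; requires the ActiveDirectory module (RSAT).
function Get-LockoutEvent {
    param([int] $Hours = 24)

    Import-Module ActiveDirectory -ErrorAction Stop
    $pdc = (Get-ADDomain).PDCEmulator                 # lockouts are usually visible here

    Get-WinEvent -ComputerName $pdc -FilterHashtable @{
        LogName   = 'Security'
        Id        = 4740                              # a user account was locked out
        StartTime = (Get-Date).AddHours(-$Hours)
    } | ForEach-Object {
        [pscustomobject]@{
            Time           = $_.TimeCreated
            User           = $_.Properties[0].Value   # locked-out account
            CallerComputer = $_.Properties[1].Value   # machine that sent the bad password
        }
    }
}

# Pure helper: counts lockouts per user and source, most frequent first.
function Get-LockoutSummary {
    param([Parameter(Mandatory)] [object[]] $Lockouts)

    $Lockouts |
        Group-Object User, CallerComputer |
        Sort-Object Count -Descending |
        Select-Object Count, Name
}

# Usage:
#   Get-LockoutSummary -Lockouts (Get-LockoutEvent -Hours 24)
```

The same user and caller pair appearing again and again points to a stale saved credential rather than an attacker, which is the pattern LockoutStatus.exe is designed to surface as well.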

Hands-on challenge

On a VM (or your lab), simulate the trust failure: disconnect a domain-joined client for a long time (or revert it to an old snapshot), then try to log in with a domain user. Run Test-ComputerSecureChannel. If you have admin credentials, run Reset-ComputerMachinePassword, then run Test-ComputerSecureChannel again to confirm the repair. Document each step.
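
One way to run the challenge while capturing evidence is sketched below. It assumes an elevated Windows PowerShell 5.1 session on the affected client (these two cmdlets may be missing in PowerShell 7) and a login with a local admin account. The function name and log path are hypothetical.

```powershell
# Sketch only: function name and log location are assumptions.
function Repair-SecureChannelWithEvidence {
    param(
        [string] $LogPath = (Join-Path $env:TEMP 'secure-channel-repair.txt')
    )

    Start-Transcript -Path $LogPath | Out-Null        # record everything for the ticket
    try {
        # 1) Evidence first: is the secure channel actually broken?
        $before = Test-ComputerSecureChannel -Verbose
        "Secure channel healthy before repair: $before"

        if (-not $before) {
            # 2) Reset the machine account password using a domain account that may do so.
            $cred = Get-Credential -Message 'Domain account allowed to reset this computer account'
            Reset-ComputerMachinePassword -Credential $cred

            # 3) Verify the repair before declaring success.
            $after = Test-ComputerSecureChannel
            "Secure channel healthy after repair: $after"
        }
    }
    finally {
        Stop-Transcript | Out-Null                    # always close the transcript
    }
}

# Usage:
#   Repair-SecureChannelWithEvidence
```

The function does nothing destructive when the channel is already healthy, so it is safe to run as a first check. If the reset fails, the transcript is the evidence to attach before anyone considers a rejoin.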

