Oh what a wonderful time it was, not having to deal with Server 2003, but thanks to a career change, I find myself once again with the joys of inconsistent SYSVOL replication thanks to FRS.

Everything is fine! Or is it?

So I start working for a company, and I've been tasked with leading the move to a modern, cloud-based infrastructure. First port of call is to get the whole domain upgraded to Server 2016. The previous guy who looked after the domain left months ago, and no-one really knows what state it's in. I hear various complaints from other members of IT that there have been unexplained difficulties with folder redirection and other policy-based settings, so I set out to discover the truth.

Oh dear...

At first everything seemed to be in order. An initial check through the event logs and the usual reports (dcdiag, repadmin, etc.) showed no issues. So I turned my focus to the actual policies, where I found lots of outdated or poorly implemented policies, to which I initially attributed the "unexplained difficulties". Then the first real clue arose.

The service desk manager approached me and asked if I knew why a desktop background from last Christmas had suddenly appeared on his machine. I didn't immediately realise it, but the background had appeared because he had visited another site, where the domain controller hadn't replicated in the 7 months since Christmas.

The real "Oh dear..." moment came a few days later, when one of the domain controllers at a site crashed and managed to corrupt the Default Domain Policy in the process. This corruption then got replicated out to many other sites, but not all of them.

Whilst investigating the corrupt policy, another member of IT asked me to look at why the latest desktop background hadn't appeared on anyone's machines. With suspicion already pointing to FRS issues, I went hunting and found that only 1 domain controller had the new image, and that none of the others had replicated it yet, despite many scheduled replications having occurred since the image was uploaded. I then noticed that one server had no modifications to any files for more than a year, even though I knew there should have been changes from the last few days.

PowerShell to the Rescue

Now I'm certain FRS isn't working, but having checked through all the logs again, and dug out FRSDiag and Ultrasound, I've still not found any "evidence" on the domain controllers I've checked, but given there are more than 200 sites, I decide it's time to break out PowerShell to build a report of the state for me.

I start by creating a script to compare the state of the Default Domain Policy, knowing that it is corrupt on at least a few domain controllers.

#Get the Id of the policy
$Policy = get-gpo -Name "Default Domain Policy" | Select -ExpandProperty Id
$PolicyGuid = $Policy.Guid.ToString()
#Get the Fully Qualified Domain Name of the primary domain controller, which has the authoritative replica.
$pdcFQDN = Get-ADForest | Select -ExpandProperty RootDomain | Get-ADDomain | Select -ExpandProperty PDCEmulator
#Split out the machine name from the FQDN
$pdc = ($pdcFQDN -split '\.')[0]
#Get the domain name
$Domain = Get-ADForest | Select -ExpandProperty Name
#Reference to the Default domain policy folder on the PDC
$policyToCheckPath = "\\$($pdc)\sysvol\$($Domain)\Policies\{$($PolicyGuid)}"
#Get the Default Domain Policy contents once for comparison against the other domain controllers
$pdcPolicy = Get-ChildItem -Path $policyToCheckPath -Recurse -Force
#Get a list of all the other Domain Controllers to check
$domainControllers = Get-ADDomainController -Filter {name -ne $pdc} | select -ExpandProperty Name
#Create an array for storing the results of any failures
$results = @()

foreach($name in $domainControllers){
    #Get a reference to the Default Domain Policy on the Domain Controller to compare
    $policyToComparePath = "\\$($name)\sysvol\$($Domain)\Policies\{$($PolicyGuid)}"
    #Check if the folder even exists, it didn't on some!
    if (Test-Path $policyToComparePath) {
        Write-Host "Policy: $($PolicyGuid) exists on Domain Controller: $($name)" -ForegroundColor Gray
        #As the policy folder exists, get its items for comparison
        $policyToCompare = Get-ChildItem $policyToComparePath -Recurse -Force
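        #Compare against the PDC copy by file name and size. Comparing the raw file objects is unreliable,
        #as their full paths always differ between servers
        $res = Compare-Object $pdcPolicy $policyToCompare -Property Name, Length
        if (-not $res) {
            #If $res was null/empty then the folders were identical, so we can ignore that server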
            Write-Host "Policy: $($PolicyGuid) contains the same files on Domain Controller: $($name)" -ForegroundColor Gray
        }else {
            #Differences were found, so store it in a PowerShell object for later reference
            Write-Host "Policy: $($PolicyGuid) does not contain the same files on Domain Controller: $($name)" -ForegroundColor Red
            $r = [PsCustomObject]@{
                DCName = $name;
                Results = $res;
            }
            $results += $r
        }
    }else{
        $r = [PsCustomObject]@{
            DCName = $name;
            Results = "Policy Missing";
        }
        $results += $r
    }
}
$results

The results were worse than I had expected. While at least 10 domain controllers had replicated the corruption, many others had old versions dating back over a year.
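To put a number on "old", one option is to look at the newest file timestamp under each domain controller's copy of the policies. This is only a minimal sketch: the function name Get-NewestWriteTime is my own invention for illustration, and the path in the usage comment is a placeholder rather than anything from the environment described here.

```powershell
# Hypothetical helper: returns the most recent LastWriteTimeUtc of any file under $Path.
# A value that is months or years old suggests the replica has stopped receiving changes.
function Get-NewestWriteTime {
    param(
        [Parameter(Mandatory = $true)]
        [string]$Path
    )
    Get-ChildItem -Path $Path -Recurse -Force -File -ErrorAction SilentlyContinue |
        Sort-Object -Property LastWriteTimeUtc -Descending |
        Select-Object -First 1 -ExpandProperty LastWriteTimeUtc
}

# Usage (placeholder server and domain names):
# Get-NewestWriteTime -Path '\\DC01\sysvol\example.local\Policies'
```

Running this against each name in $domainControllers and sorting by the result should surface the stalest replicas first. It returns nothing if the path is empty or unreachable.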

The real trouble begins

Then, to make the situation more fun, 2 different certificates that client machines required in order to function, and that were assigned via policy, happened to expire. The corruption and lack of replication of these policies left hundreds of client machines suddenly unable to operate correctly. I attempted to perform a non-authoritative restore on one of the corrupt domain controllers, but due to the replication topology, it just ended up with the corruption again. It quickly became apparent that there were widespread, serious FRS issues that I couldn't hope to resolve quickly enough to get the increasing number of client machines back in a working state.

Time for a temporary work around

Be Warned! THIS IS A BAD IDEA! It means you will have to perform a non-authoritative restore on all domain controllers, and you will still have to fix the underlying FRS issues. In my situation it was the only way to get our users back up and working promptly, while I tracked down the underlying issues that were preventing FRS working.

So in order to get our users back in a working state, I decided to copy the policy files onto all the domain controllers. This will cause more replication issues, but it will also allow our users to continue to work.

I simply changed the foreach statement from the script above to forcefully copy the policy from the PDC to all other domain controllers instead. I used robocopy instead of Copy-Item to ensure that all permissions were preserved.

foreach($name in $domainControllers){
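    #/E includes the subfolders of the policy; /copy:DATSOU preserves data, attributes, timestamps, security, owner and auditing info
    robocopy $policyToCheckPath "\\$($name)\sysvol\$($Domain)\Policies\{$($PolicyGuid)}" /E /copy:DATSOU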
}

This had the desired effect, allowing users to get the required settings and get back to work. However, as anticipated, it also resulted in some of the healthier domain controllers attempting to replicate the manually copied files and running into conflicts.

A more thorough report

Now, with the pressing issues temporarily resolved, I set about discovering how bad each domain controller's copy was, so that I could target the worst offenders for investigation.

    #path to all policies on PDC
    $d = "\\$($pdc)\SYSVOL\$($Domain)\Policies"
    #get all the policies from pdc
    $di = Get-ChildItem $d -Recurse -Force
    #store a list of file hashes
    $pdcHashes = @{}
    foreach ($item in $di) {
        if ($item.PsIsContainer -eq $false) {
            $pdcHashes.Add($item.FullName,(Get-FileHash $item.FullName))
        }
    }
    $all = @{}
    foreach ($server in $domainControllers) {
        $results = @()
        #compare every file in every policy against the hash list created earlier
        #skip FRS morphed folders (named *_NTFRS_*) and swap the PDC name in each path for the server being checked
        $paths = $di | where {$_.FullName -notlike "*_NTFRS_*"} |
            select @{Name="Name";Expression={$_.FullName -replace "$($pdc)","$($server)"}},
                @{Name="IsDirectory";Expression={$_.PsIsContainer}},
                @{Name="Exists";Expression={$false}},
                @{Name="OriginalFileHash";Expression={$pdcHashes[$_.FullName]}}
        foreach ($item in $paths) {
            
            if (test-path $item.Name) {
                $item.Exists = $true
                $hash = Get-FileHash -Path $($item.Name)
                $results += Get-ResultsObject $item ($hash.Hash -eq $item.OriginalFileHash.Hash)
            }
            else{
                $results += Get-ResultsObject $item $false
            }
        }
        $all.Add($server,$results)
    }
return $all

The Get-ResultsObject function simply converts the results into a PSCustomObject. It has to be defined before the loop above runs.

function Get-ResultsObject(){
    param([parameter(Mandatory=$true, Position=0)][ValidateNotNullOrEmpty()] [object]$Item,
    [parameter(Mandatory=$true, Position=1)][ValidateNotNullOrEmpty()] [bool]$HashCompareSuccess)
    return [PSCustomObject]@{Name = $Item.Name;IsDirectory = $Item.IsDirectory;Exists = $Item.Exists;HashCompareSuccess = $HashCompareSuccess}
}

The results were bad: almost every single domain controller had differences in the files. I picked some of the worst offenders and went digging.
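Picking the worst offenders by eye is painful with this many servers, so the per-server results can be collapsed into a ranked summary. The following is a rough sketch, assuming the hashtable returned by the report above (server name mapped to an array of result objects). The function name Get-WorstOffender and the summary property names are mine, for illustration only.

```powershell
# Hypothetical helper: turns the report hashtable into one row per server,
# sorted so the servers with the most missing or mismatched items come first.
function Get-WorstOffender {
    param(
        [Parameter(Mandatory = $true)]
        [hashtable]$Report
    )
    $summary = foreach ($server in $Report.Keys) {
        $rows = @($Report[$server])
        # Items that do not exist on the server at all
        $missing = @($rows | Where-Object { -not $_.Exists })
        # Files that exist but whose hash differs from the PDC copy
        $mismatched = @($rows | Where-Object {
            $_.Exists -and -not $_.IsDirectory -and -not $_.HashCompareSuccess
        })
        [PSCustomObject]@{
            Server     = $server
            Total      = $rows.Count
            Missing    = $missing.Count
            Mismatched = $mismatched.Count
        }
    }
    $summary | Sort-Object -Property @{
        Expression = { $_.Missing + $_.Mismatched }
        Descending = $true
    }
}

# Usage: Get-WorstOffender -Report $all | Select-Object -First 10
```

Sorting on the combined count is an arbitrary choice. Weighting missing files more heavily than mismatched ones may be more useful, depending on what you are chasing.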

The Causes

Suffice to say, there were a lot of issues. I won't go into details about them as they are all well documented:

  • ACL Corruption
  • Replica set members missing (I suspect this was due to servers being renamed or decommissioned)
  • Quest Change Auditor. Someone had misconfigured it, so it was causing files to become locked, preventing FRS from replicating
  • On-access scanning. Some of the servers had not had an exclusion set up, so the scanner was also causing file locks.
  • Network. Some of the sites have terrible network connections, which were contributing to the issues

Final Restore

Once I had resolved all the issues listed above, I followed the standard SYSVOL restoration procedure: stopping FRS on all domain controllers other than the PDC, deleting the contents of SYSVOL from those domain controllers, setting the BurFlags registry value to D2 (a non-authoritative restore) on each of them, and then starting the FRS service on the domain controllers in waves out from the PDC. This finally resulted in all of the domain controllers fully replicating the latest versions of the policies.
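For reference, the per-server part of that procedure can be written down as a reviewable list of commands. This is a sketch only, not the exact commands I ran: the function name Get-FrsRestoreStep is made up for illustration, and it deliberately returns the commands as text rather than executing anything. It assumes the usual BurFlags location under the NtFrs service key and uses reg.exe, because the "Backup/Restore" key name contains a forward slash that the PowerShell registry provider treats as a path separator.

```powershell
# Hypothetical helper: emits the commands for a non-authoritative (D2) restore
# on one domain controller, as strings to review. Nothing is executed here.
function Get-FrsRestoreStep {
    param(
        [Parameter(Mandatory = $true)]
        [string]$ComputerName
    )
    $key = "\\$ComputerName\HKLM\SYSTEM\CurrentControlSet\Services\NtFrs\Parameters\Backup/Restore\Process at Startup"
    @(
        # 1. Stop the File Replication Service
        "sc.exe \\$ComputerName stop NtFrs"
        # (clearing the SYSVOL contents would happen here; intentionally not scripted)
        # 2. Flag the next service start as a non-authoritative restore
        "reg.exe add `"$key`" /v BurFlags /t REG_DWORD /d 0xD2 /f"
        # 3. Start the service again so it re-syncs from its partners
        "sc.exe \\$ComputerName start NtFrs"
    )
}

# Usage (placeholder server name):
# Get-FrsRestoreStep -ComputerName 'DC01'
```

Keeping this as text makes it easy to check each wave before touching anything, which matters when a mistake means wiping SYSVOL on the wrong server.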