AD DS infrastructure failures and K2

I recently worked on a number of cases where clients reported errors on the K2 side caused by failures on the AD DS side. Specifically, there were suggestions that K2 cannot handle a partial AD DS outage, namely the failure of a single DC while other DCs remain available. Based on those cases I did some research, and the results are below. This is a fairly long write-up which may need some updates later, but I decided to post it to share the information with the wider community and to keep these notes for my own reference in case I need to revisit the findings.

DISCLAIMER: Some flaws in my interpretation of the system behavior described below are possible; they will be corrected when and if necessary.

Symptoms/What you may see in your K2 environment when there are issues with your AD DS infrastructure

Most applications used in Microsoft Active Directory networks depend to some degree on the availability of Active Directory Domain Services (AD DS) for authentication and for obtaining data about directory objects (users, groups, etc.).

If you have failures or other availability issues in your AD DS infrastructure, you may observe symptoms on the K2 side similar to those described below.

Example scenario 1 (WAN link outage/no DCs are reachable to serve queries against remote domain)

You may observe a growing queue of AD SmO queries on the IIS side, to the point at which all queries sent from K2 smartforms to AD DS fail or no longer return any information, and after a long delay the following error message is thrown:

A referral was returned from the server.

This error comes from AD DS (more specifically, from the DC which serves K2 app/K2 server queries to a specific domain) and is most likely caused by the fact that there is no DC available to serve this query at all.

Example scenario 2 (single DC failed, other DCs are available)

You receive the following error on the K2 server:

System.DirectoryServices.ActiveDirectory.ActiveDirectoryServerDownException: The server is not operational. Name: "DC-XYZ.domain.com" ---> System.Runtime.InteropServices.COMException: The server is not operational.

You have confirmed that the DC mentioned in the error message is down, but there are other DCs up and running in this domain.

Example scenario 3 (it could be 2b, but you see that only K2 smartforms are affected)

You may see the same error message as in scenario 2, i.e.:

System.DirectoryServices.ActiveDirectory.ActiveDirectoryServerDownException: The server is not operational. Name: "DC-XYZ.domain.com" ---> System.Runtime.InteropServices.COMException: The server is not operational.

But you also see that both K2 Workspace and the base OS work just fine and use an alternate DC, while K2 smartforms keep throwing an error which mentions the failed DC (which has indeed failed).

All the described scenarios are slightly different, but in all of them it may seem that K2 did not switch to an alternative available DC for a specific domain. The key question/requirement here is to switch to another available DC without any downtime or with minimal downtime (no K2 server or K2 service restart).

Research and general recommendations

First of all, it is necessary to understand what kind of dependency on AD DS we have on the K2 side. The most obvious ones are the AD Service SmOs and the User Role Manager (URM) service. Both depend on AD DS availability, but in different ways. The AD Service queries AD DS directly (so it is a good test of whether AD DS queries can be served without issues), whereas the URM service relies on the K2 identity cache and returns cached data from the K2 database. The URM service returns data from multiple security providers registered in K2 and stores cached data in the Identity.Identity table of the K2 database. The URM service depends on AD DS only when cached data is refreshed, so you may not notice an AD DS failure at all if your cached AD DS data has not expired yet.

At the beginning of this blog post we mentioned two major scenarios for AD DS failure (with a third type which can be qualified as a sub-case of (2)):

1) WAN link failure, when no DCs are available to serve K2 requests to a specific domain because all of its DCs are behind the WAN link. This applies to multi-domain environments with remote domains.

2) Failure of the specific DC to which the K2 server is connected for querying a specific domain.

Given AD DS design best practices, neither of those scenarios should present any problems for applications dependent on AD DS:

(1) It is best practice to place extra DCs on remote sites so that there is no dependency on the WAN link, both to preserve link bandwidth and to safeguard against availability issues. At the very least, an RODC should be present locally on site for any remote domain if for some reason you cannot place an RWDC locally on each remote site.

NOTE: in a link failure scenario where there are no locally available DCs, nothing can be done from the K2 side; it is a question of restoring the WAN link or placing a locally available RWDC/RODC to mitigate this scenario.

(2) The golden rule and requirement for any production AD DS deployment is to have at least two DCs per domain, so the failure of one domain controller should not present any issues.

Now separately on scenario (3), when you get the same error as in scenario (2) (System.DirectoryServices.ActiveDirectory.ActiveDirectoryServerDownException: The server is not operational. Name: "DC-XYZ.domain.com") but you clearly see that your base OS and K2 Workspace use an alternate available DC, whereas K2 smartforms keep throwing an error which mentions the failed DC. You are most likely to see this error with K2 4.6.8/4.6.9.

In this specific scenario you clearly see that K2 Workspace works fine while you have this issue with K2 smartforms. This is because the Designer, Runtime and ViewFlow web applications in K2 use the newer WindowsSTS redirect implementation (http://k2.denalilx.com/Identity/STS/Windows), which was introduced in 4.6.8, whereas K2 Workspace still uses "Windows Authentication".

That is, K2 Workspace uses Windows authentication, and in its web.config file ADConnectionString is configured as "LDAP://domain.com", whereas for WindowsSTS the K2 label is used, i.e. "LDAP://dc=domain,dc=com".
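Both connection string forms point at the same domain: one addresses it by DNS name, the other by distinguished name. As a minimal Python sketch of that relationship (the helper name is hypothetical and is not part of K2), the distinguished-name form can be derived from the DNS name like this:

```python
def dns_domain_to_ldap_path(dns_name: str, scheme: str = "LDAP") -> str:
    """Build a DN-style path from a DNS domain name (hypothetical helper)."""
    # Split "domain.com" into its labels and drop empty ones (e.g. a trailing dot).
    labels = [label for label in dns_name.strip().split(".") if label]
    if not labels:
        raise ValueError("empty domain name")
    # Each DNS label becomes one "dc=" component of the distinguished name.
    dn = ",".join(f"dc={label}" for label in labels)
    return f"{scheme}://{dn}"


print(dns_domain_to_ldap_path("domain.com"))  # LDAP://dc=domain,dc=com
```

This only illustrates how the two strings relate; it says nothing about how each K2 component resolves a DC from them.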

You may see the aforementioned error occurring on the redirect to "http://k2.domain.com/Identity/STS/Windows/".

There is also a known issue with the Windows STS implementation in K2, where an exception in GetGroups causes user authentication to fail on Windows STS. It was fixed in 4.6.10, but there is still an open request to improve error handling so that exceptions caused by temporary unavailability of a DC are caught and the STS retries. That way, in cases where the DC is inaccessible for a short interval for unknown reasons, the retry will connect successfully.

So in scenario (3) you will likely see that DC locator has switched to an alternate DC but Windows STS does not perform a switch/retry after a temporary DC failure. This is something I need to research more, but it seems that in this case you have to restart the K2 service to get K2 smartforms back to normal operation.
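The behavior described in the open request (catch the exception caused by a temporarily unavailable DC, then retry) could look roughly like the Python sketch below. It only illustrates the retry idea and is not K2 code: ServerDownError, with_retry and the default values are hypothetical stand-ins.

```python
import time


class ServerDownError(Exception):
    """Hypothetical stand-in for ActiveDirectoryServerDownException."""


def with_retry(operation, attempts=3, delay_seconds=1.0, sleep=time.sleep):
    """Call operation(); on ServerDownError wait and retry, up to `attempts` calls."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except ServerDownError:
            if attempt == attempts:
                raise  # out of retries: surface the original error
            sleep(delay_seconds)  # give DC locator time to pick another DC
```

A caller would wrap the directory lookup, for example with_retry(lambda: get_groups(user)), where get_groups is whatever function performs the query. A short outage is then absorbed by the retry, while a lasting one still surfaces the original error.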

Irrespective of the scenario (except perhaps scenario (3)), the first thing to do when you see any such issue is to work with your AD DS team to clarify which specific issue you have on the AD DS side and whether it has been fixed/addressed. There is no point in attempting to fix things from the K2 side if the AD DS issue is not addressed, unless it is an issue with a specific DC and there are other locally available DCs. The only other possible step is to temporarily remove the connection string for an extra domain if you can afford this (and if the affected domain is a less important/additional one).

You may get confirmation from your AD DS administration/support team that they have an issue with one specific DC which has failed or is down for maintenance (the latter should be a very rare/exceptional case of planned maintenance during business hours) and that there are other locally available DCs to serve requests from the K2 server. If this is the case you can try the following:

1) Use an AD Service SmO to check that you can query the affected domain. If it works you should not have any issues in K2; if not, proceed with further checks.

2) Use the following command to verify which DC is currently being used by the K2 server for a specific domain:

nltest /dsgetdc:DomainName

If this command returns the failed DC, then this is an issue with your DC locator service/AD DS infrastructure, or to put it another way, a problem external to K2.

In general AD DS, as a technology with decades of evolution and a high adoption rate, is very stable, and there are no well-known cases where DC locator fails to switch to an alternative available DC. But depending on the configuration and issues of a specific environment, as well as on the implementation of the application code which interacts with AD DS, there can be cases where DC locator switching does not work properly.

3) If step 2 returns a failed/unavailable DC, try the following command:

nltest /dsgetdc:DomainName /force

This forces a DC locator cache refresh and may help you switch to another DC. Note that sometimes it is necessary to run it a few times until another DC is selected.

4) If step 3 does not help you switch to another available DC, you may try restarting the Netlogon service, as the DC locator cache is implemented as part of this service. Here is an example of how to do it with PowerShell:

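# Restart the Netlogon service, which hosts the DC locator cache
Get-Service Netlogon | Restart-Service
# Verify the secure channel to the domain
nltest.exe /sc_verify:<fully.qualified.domain.name.here>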

Once this is done, verify whether you have switched to an available DC using the following command:

nltest /dsgetdc:DomainName

5) If you see that, after DC locator has switched to an available DC, the K2 AD Service SmOs still do not work, consider a K2 service restart or a server reboot. This is most likely scenario (3), where K2 Workspace/base OS work well but K2 smartforms are "stuck" with the server down exception.

Note that the only valid test here is using AD Service SmOs to query the domain: if it works, there is no need to do anything else from the K2 side. If you see an issue in areas depending on the URM User service, it may simply be that the cached data has expired and new data is still building up. Sometimes it may be necessary to force an identity cache refresh and wait until the cache builds up completely (this can take a very long time in large-scale production environments).
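Steps 2 and 3 above (check which DC is selected, then repeat the forced lookup until a different DC comes back) can be scripted. Below is a minimal Python sketch. It assumes the usual nltest output, in which the selected DC appears on a line of the form "DC: \\name"; the function names and the max_tries default are hypothetical.

```python
import re
import subprocess


def parse_dc_name(nltest_output: str):
    """Return the DC name from `nltest /dsgetdc` output, or None if absent."""
    # Assumes a line such as "DC: \\DC-ABC.domain.com" in the output.
    match = re.search(r"^\s*DC:\s*\\\\(\S+)", nltest_output, re.MULTILINE)
    return match.group(1) if match else None


def run_nltest(domain: str, force: bool = False) -> str:
    """Run nltest on a Windows host and return its text output."""
    args = ["nltest", f"/dsgetdc:{domain}"]
    if force:
        args.append("/force")
    return subprocess.run(args, capture_output=True, text=True).stdout


def switch_away_from(failed_dc: str, domain: str, max_tries: int = 5, run=run_nltest):
    """Repeat the forced lookup until DC locator returns a DC other than failed_dc."""
    for _ in range(max_tries):
        dc = parse_dc_name(run(domain, force=True))
        if dc and dc.lower() != failed_dc.lower():
            return dc  # DC locator now points at a different DC
    return None  # still on the failed DC (or no DC found) after max_tries
```

For example, switch_away_from("DC-XYZ.domain.com", "domain.com") returns the name of the newly selected DC, or None if DC locator keeps returning the failed one, in which case step 4 applies.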

Additional details and recommendations

K2 performs a bind with the DirectoryEntry class, e.g.:

new DirectoryEntry("LDAP://DC=Domain,DC=COM", "", "", AuthenticationTypes.ReadOnly);

This process relies on the Domain Controller Locator, an algorithm that runs in the context of the Net Logon service. Essentially, the Domain Controller Locator is the AD DS client-side component responsible for selecting a specific DC for a specific domain. The Domain Controller Locator has its own cache: the Net Logon service caches the domain controller information so that it is not necessary to repeat the discovery process for subsequent requests. Caching this information encourages consistent use of the same domain controller and, thus, a consistent view of Active Directory.

NOTE: as you may have noticed in the explanation of scenario 3, K2 Workspace and K2 smartforms bind to AD differently; at the very least, the connection strings they use are different.

Refer to the Microsoft documentation for details:

Domain Controller Location Process

Domain Controller Locator

Recommendations

1) Reconfigure K2 to use GC instead of LDAP.

The global catalog is a distributed data repository that contains a searchable, partial representation of every object in every domain in a multidomain Active Directory Domain Services (AD DS) forest. So essentially a GC placed in the local domain can serve part of the queries which would otherwise go to DCs in another domain, potentially over a WAN link.

From a purely AD DS perspective, GC has the following benefits:

– Forest-wide searches. The global catalog provides a resource for searching an AD DS forest.

– User logon. In a forest that has more than one domain, GC can be used during logon for universal group membership enumeration (Windows 2000 native DFL or higher) and for resolving the UPN when a UPN is used at logon.

– Universal Group Membership Caching: In a forest that has more than one domain, in sites that have domain users but no global catalog server, Universal Group Membership Caching can be used to enable caching of logon credentials so that the global catalog does not have to be contacted for subsequent user logons. This feature eliminates the need to retrieve universal group memberships across a WAN link from a global catalog server in a different site. Essentially you may enable this feature to make use of GC even more efficient.

To reconfigure K2 to use GC, edit the RoleInit XML field of the HostServer.SecurityLabel table, replacing "LDAP://" with "GC://", and then restart the K2 service.

From the K2 perspective, this should improve the responsiveness of AD SmartObjects and slightly decrease reliance on the WAN link/number of queries to DCs outside the local domain.

2) Try refreshing the DC locator cache for example scenario 2 (see details above, nltest /dsgetdc:DomainName /force) and verify whether it is a viable workaround. Use "nltest /dsgetdc:DomainName" to confirm which specific DC is being used by the K2 server, and verify the status and availability of this DC with your infrastructure team.

3) In scenario 3, try restarting the K2 service, but first confirm that DC locator is using a working DC.

4) There is also an existing feature request to investigate the possibility of building some DC failure detection/switching capabilities into K2 code in future versions of the product.
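For recommendation 1, the edit itself is a plain text substitution in the RoleInit value. The Python sketch below previews that substitution on a string before you change anything in the database. The XML fragment is a made-up placeholder, since the actual RoleInit content in your environment will look different, and the function name is hypothetical.

```python
def ldap_to_gc(role_init_xml: str) -> str:
    """Replace every "LDAP://" prefix with "GC://" (exact, case-sensitive match)."""
    return role_init_xml.replace("LDAP://", "GC://")


# Hypothetical fragment for illustration only; not the real RoleInit schema.
sample = '<init><connection path="LDAP://DC=DOMAIN,DC=COM" /></init>'
print(ldap_to_gc(sample))
# <init><connection path="GC://DC=DOMAIN,DC=COM" /></init>
```

The sketch only transforms a string. Applying the result to the HostServer.SecurityLabel table and restarting the K2 service remain manual steps.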
