Tag Archives: Troubleshooting

System.IO.IOException : The requested operation could not be completed due to a file system limitation

I recently had a support case thanks to which I discovered a rather neat way of checking for big files in a specific directory, which I will describe below.

Under certain conditions you may see the following issue in K2: very high CPU usage and, by extension, overall sluggishness of K2 applications, accompanied by “System.IO.IOException : The requested operation could not be completed due to a file system limitation.”

As in most cases, the error message itself indicates what is wrong here, and “The requested operation could not be completed due to a file system limitation” should ring a bell that some file or files have run amok and grown beyond file system limits, or something along these lines. If you read your logs even more closely, they may give away the specific culprit by indicating the name of the log file responsible for this.

K2 has broad logging capabilities for monitoring and troubleshooting purposes (quite a good overview of K2 logging can be found here), but in terms of logging volume the main suspects are: SmO logging (the only logging which can’t be capped in terms of file size), ADUM logs (very voluminous, especially at debug logging level; file size can be limited via configurable settings, meaning you have to go out of your way to allow an unhealthily big file size), and lastly debug assemblies you may receive from K2 support. Debug assemblies are usually quickly built ad hoc troubleshooting tools meant to investigate a specific issue; they may well have no log file size limit and write super detailed (= voluminous) log files. As such, they are supposed to be removed upon completion of your troubleshooting effort, but in reality they can be left in place for a while, which gradually evolves into forever…

Anyhow, the exception “System.IO.IOException : The requested operation could not be completed due to a file system limitation.” in the K2 host server log is in most cases caused by an abnormally large log file, which becomes so big that it exceeds RAM size, making it difficult to open and append to for writing. Then you have that slippery slope of degraded performance and high CPU, followed by the “aha, I forgot to disable/remove unneeded logging” moment.

Now, my takeaway from this case (though what is said above is also worth noting): how to quickly check for huge files in a specific directory. Just use this PS script:

Get-ChildItem -Path 'C:\Program Files (x86)\K2 blackpearl' -Recurse -Force -File |
    Select-Object -Property FullName,
        @{Name='SizeGB';Expression={$_.Length / 1GB}},
        @{Name='SizeMB';Expression={$_.Length / 1MB}} |
    Sort-Object -Property SizeGB -Descending |
    Out-GridView

You can limit the output to the largest files only, which is especially useful when you are primarily interested in identifying the biggest offenders. Note, though, that Select-Object -First 10 has to come after the sorting step; placed right after the first Select-Object, it would simply stop at the first ten files enumerated, regardless of size.
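For example, a minimal variant of the script above that returns only the ten largest files:

# Sort by raw length first, then take the top ten largest files
Get-ChildItem -Path 'C:\Program Files (x86)\K2 blackpearl' -Recurse -Force -File |
    Sort-Object -Property Length -Descending |
    Select-Object -First 10 -Property FullName,
        @{Name='SizeGB';Expression={$_.Length / 1GB}},
        @{Name='SizeMB';Expression={$_.Length / 1MB}} |
    Out-GridView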

Here is how the result looks for a healthy K2 folder (by healthy I mean one without strangely big log files):

Large files search

As you can see, normally you should not have anything 1 gigabyte or more in size, whereas the above-mentioned exception is usually caused by a 10-20 GB log file, which will feature prominently at the top of the output.

See also related K2 community KB: Exception – The requested operation could not be completed due to a file system limitation.

Memory Leaks Everywhere

K2 host server eats up my RAM! :) Oh, really?

One of the frequent types of issues I have to work on is high RAM usage by the K2 host server service (normally the description of such a problem is accompanied by the phrase “with no apparent reason”). Most of the time I try to create meaningful K2 community KB articles based on the support cases I work on, but not everything I want to say fits into the Click2KB format. So, to discuss the “K2 host server eats up my RAM/I think I see a memory leak here” issue in detail, I decided to write this blog post.

The common symptom and starting point here is that you notice abnormally high RAM usage by the K2 host server service, which may even lead to a service crash or total unresponsiveness of your K2 platform. What’s next, and what are the possibilities here?

Of course, it all depends on what exactly you see.

I think it is quite expected that immediately after a server reboot K2 service memory consumption is lower than after the server has been working for a while: once you reboot your server it starts clean, with all threads and allocated memory cleared, hence low RAM usage. But as the server warms up, it starts checking whether it has tasks to process, and it fires off other activities such as user and group resolution by the ADUM manager, recording data in the identity cache table, and so on. The more task processing threads are active, the more memory is required. And keep your host server threads configuration in mind: if you increased the default thread pool limits, you should realize that this allows the server to use more of the available resources.

An empty (no deployed processes, no active users) K2 host server service has a really tiny memory footprint:

K2 empty server with default thread pool settings

As you can see, it uses less than 300 MB of RAM. And even if you double the default thread pool settings (and I have heard that resources for those are allocated upfront), memory usage stays the same, at least on a box without any load.

Now we switch to the interesting stuff, i.e. what could it be if RAM usage of the K2 service is abnormally high?

And here comes an important point: if your process design or custom code has any design flaws, or your hardware is poorly sized for the intended workload, the processing queue starts growing, and that may lead to resource overuse. I.e. it is not a memory leak but a bottleneck caused by such things as (listed by probability of being the cause of your issue):

1) Custom code or process design. Easy proof that this is the cause is the fact that you are unable to reproduce this “memory leak” on an empty platform with no running processes, which tells you that, in a way, there is no memory leak in the K2 platform base code.

You can refer to process design best practices as a starting point here: K2 blackpearl Best Practices (last updated November 2008).

I have seen enough cases where high memory usage was caused by inefficient process design choices (something like mass uploads to a DB, or updating properties of 20 MS Word documents in a row designed so that the file is downloaded/uploaded from SharePoint 20 times instead of doing a batch update with one download/upload of the file).

Also, next time you see this high memory usage state, execute the following queries against the K2 database before doing a reboot:

A) Check how many processes are running at the same time right now, and whether any of them stay in the running state constantly:

SELECT * FROM [K2].[Server].[ProcInst] WHERE [Status] = 1

It will give you the number of running processes at a specific point in time. Constantly having 20 or more processes in status 1 may indicate a problem, but, more importantly, execute this query multiple times at 1-2 minute intervals and see whether some process instances with the same ID stay running constantly or for a very long time. Those will likely be your “offending” processes, and you will want to check at which step they are so slow, and so on.
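If you would rather automate these repeated checks, here is a minimal PowerShell sketch (assumptions: the SqlServer module providing Invoke-Sqlcmd is installed, and the K2 database is named “K2” on the local default instance; adjust both for your environment):

# Poll the running process instances every 2 minutes, 10 times,
# printing the IDs so you can spot instances that never leave status 1
for ($i = 0; $i -lt 10; $i++) {
    $running = Invoke-Sqlcmd -ServerInstance 'localhost' -Database 'K2' `
        -Query 'SELECT [ID], [StartDate], [Folio] FROM [Server].[ProcInst] WITH (NOLOCK) WHERE [Status] = 1'
    '{0:u} - {1} running instance(s): {2}' -f (Get-Date), @($running).Count, (@($running).ID -join ', ')
    Start-Sleep -Seconds 120
}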

B) Check for processes with abnormally high state size:

SELECT TOP 200
    ID,
    DATALENGTH(State)/1048576.0 AS StateSize,
    Version,
    StartDate,
    Originator,
    Folio,
    Status
FROM Server.ProcInst WITH (NOLOCK)
WHERE Status IN (1, 2) AND DATALENGTH(State)/1048576.0 >= 1
ORDER BY DATALENGTH(State) DESC

This query will return up to 200 processes with a state size of 1 MB or greater (if you have any). So if this query brings back results, those are the problematic processes causing abnormally high memory usage/performance problems (most likely due to use of looping within the process).

Just an illustrative example of what else can be wrong (and the possibilities are huge here 🙂): a colleague of mine ran into an issue where K2 service process memory usage suddenly started growing at a rate of ~16 GB per day. In the end, the reason was that every 10 seconds K2 smartactions tried to process an email which was sent to the K2 service account mailbox, which was the same account under which smartactions were configured; this led to a sort of cycle, and each sending attempt ate up a couple of MB of memory. It was only possible to see this with full logging level, and during the night, when there were no other activities on the server cluttering the log files.

2) Slow response/high latency of external systems or the network. Depending on the design of your workflows, they may have dependencies on external systems (SQL, SharePoint), and it could be the case that a slow response from their side causes the queue on the K2 side to grow along with memory usage (a sort of vicious circle, or something like a race condition, can be in play here, and it is often difficult to untangle and isolate the root cause).

In such a scenario it is better to:

A) At the time of the issue, review the K2 host server logs and ADUM logs to see whether there are any timeouts or communication-type errors/exceptions.

B) Check all the servers which comprise your environment (K2, SQL, SharePoint, IIS), watching out for resource usage spikes and errors in Event Viewer (leverage the “Administrative Events” view). K2 relies heavily on the SQL server where the K2 DB is hosted, and if it is undersized or overloaded (scheduled execution of SSIS packages, a scheduled antivirus scan or backup) and slow to respond, you may see memory usage growth/slowness on the K2 server side (a quick PowerShell shortcut for the Event Viewer part is sketched below).
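This sketch pulls the critical, error and warning events from the last 24 hours of the System log; the log name, levels and time window are illustrative choices, not fixed values:

# Recent critical (1), error (2) and warning (3) events from the System log
Get-WinEvent -FilterHashtable @{ LogName = 'System'; Level = 1, 2, 3; StartTime = (Get-Date).AddHours(-24) } |
    Select-Object -First 50 -Property TimeCreated, ProviderName, Id, LevelDisplayName |
    Format-Table -AutoSize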

If your servers are virtualized, confirm your K2 vServer placement with your virtualization platform admins: K2 and the K2 DB SQL instance should not coexist on the same vHost with I/O intensive apps (especially Exchange or SharePoint).

You should pay special attention to the ADUM logs: if there are loads of errors, those have to be addressed, as the K2 server may constantly waste resources on futile attempts to resolve some no longer existing SharePoint group provider (site collection deleted, but the group provider still registered in K2) or on resolving objects from a non-working domain (failed connectivity or trust configuration). These resolution attempts eat up resources and may prevent ADUM from refreshing, in a timely manner, the things needed by running processes, thereby making the situation worse (a growing queue).

IMPORTANT NOTE: In large organizations it never works to just ask your colleagues (SQL admins/virtualization admins) whether all is OK on their side – you will always get the response that all is OK 🙂 You have to ask specific questions and get explicit confirmation of things like VM placement and whether your K2 DB SQL instance is shared with any other I/O intensive apps. You want to have a list and go through it, eliminating possibilities.

I personally worked with one client who spent months troubleshooting performance, reviewing their K2 solutions inside out and searching for a leak, while in the end the problem was solved by moving the K2 DB to a dedicated SQL Server instance. In hindsight they realized that the K2 DB had previously coexisted with some obscure integration DB which was not heavily used, but which had an SSIS package that fired twice a day and maxed out SQL resources for a couple of hours, causing prolonged and varied disruptions to their K2 system. Checking SQL was suggested from the very beginning, and the answer was “we don’t have issues on the SQL side”, even after they asked their SQL admins twice.

3) Inadequate hardware sizing. To get an idea of how to size your K2 server, you can look at this table:

Scale out

This may look a bit controversial to you, but this table was at some point included in the Performance and Capacity Planning document from the K2 COE (the old version of that document has since been replaced with a new one dated 6/1/2017 which no longer contains this table). The table above illustrates how you have to scale out based on the total number of users and the number of concurrent users, starting from a base configuration of 1 server with 8 GB of RAM. Depending on your current hardware configuration, this may or may not support your idea of scaling up.

Also see these documents on sizing and performance: K2 blackpearl Performance Testing Results and K2 blackpearl Performance Testing White Paper

Also see this K2 community KB: K2 Host Service CPU usage close to 100%

4) Memory leak. This is rather unlikely, as K2 code (like the code of any other mature commercial software) goes through strict QA and testing; personally, I have seen no more than 3 cases where there was a memory-leak type of issue which had to be fixed in K2, all in old versions and in very specific, infrequent scenarios.

If what you observe is not prolonged memory usage spikes which do not go away by themselves, but rather your K2 service at times maxing out resource usage and then everything going back to normal with no intervention from your side (such as a K2 service/server restart), then it looks like an insufficient-hardware type of situation (though the other issues I mentioned previously may still have influence here). A memory leak rather implies that you need to stop the service, or something like that, to resolve it.

If, after checking all the points mentioned above, you still suspect that there could be a memory leak, I would recommend opening a K2 support case and preparing all K2 logs along with memory dumps collected in the low and high memory usage states (you can obtain instructions on collecting memory dumps from K2 support).


AD DS infrastructure failures and K2

I recently worked on a number of cases where clients complained about errors on the K2 side caused by failures on the AD DS side. Specifically, there were suggestions that K2 was unable to handle a partial outage of AD DS, namely the failure of a single DC while other DCs were available. So, based on the recent cases I saw, I did some research, and you may find the results below. It is a rather long-form write-up which may require some updates/edits afterwards, but I decided to post it to share this information with the wider community, as well as to keep all these notes for my own reference in case I need to revisit these findings.

DISCLAIMER: Some flaws in the interpretation of the system behavior described below are possible; those will be edited/corrected when/if necessary.

Symptoms/What you may see in your K2 environment when there are issues with your AD DS infrastructure

Most applications used in Microsoft Active Directory networks have a certain degree of dependency on the availability of Active Directory Domain Services (AD DS) for purposes of authentication and for obtaining required data about directory objects (users, groups, etc.).

If you have failures or other availability issues with your AD DS infrastructure, you may observe symptoms/problems on the K2 side similar to those described below.

Example scenario 1 (WAN link outage/no DCs reachable to serve queries against a remote domain)

You may observe a growing queue of AD SmO queries on the IIS side, to the point at which all queries sent from K2 smartforms to AD DS fail/no longer return any information, and after a long delay the following error message is thrown:

A referral was returned from the server.

This error comes from AD DS (more specifically, from the DC which serves K2 app/K2 server queries to a specific domain) and is most likely caused by the fact that there is no DC available to serve this query at all.

Example scenario 2 (single DC failed, other DCs are available)

You receive the following error on the K2 server:

System.DirectoryServices.ActiveDirectory.ActiveDirectoryServerDownException: The server is not operational. Name: “DC-XYZ.domain.com” —> System.Runtime.InteropServices.COMException: The server is not operational.

You have confirmed that the DC mentioned in the error message is down, but there are other DCs up and running in this domain.

Example scenario 3 (it could be called 2b, but here you see that only K2 smartforms are affected)

You may see the same error message as in scenario 2, i.e.:

System.DirectoryServices.ActiveDirectory.ActiveDirectoryServerDownException: The server is not operational. Name: “DC-XYZ.domain.com” —> System.Runtime.InteropServices.COMException: The server is not operational.

But you also see that both K2 workspace and the base OS are working just fine using an alternate DC, while K2 smartforms keep throwing an error which mentions the failed DC (which has indeed failed).

All the described scenarios are slightly different, but in all of these cases it may seem that K2 didn’t switch to an alternative available DC for a specific domain. The key question/requirement here is to switch to another available DC without any downtime, or with minimum downtime (no K2 server or K2 service restart).

Research and general recommendations

First of all, it is necessary to understand what kind of dependency on AD DS we have from the K2 side. The most obvious things are the AD Service SmOs and the User Role Manager (URM) service; both of them depend on AD DS availability, but in different ways. The AD Service SmOs query AD DS directly (so they are a good test to check whether AD DS queries can be served without issues), whereas the URM service relies on the K2 identity cache and returns cached data from the K2 database. The URM service returns data from multiple security providers registered in K2, and it stores cached data in the Identity.Identity table in the K2 database. The URM service depends on AD DS only at the time of cached data refresh; thus it may allow you not to notice an AD DS failure if your AD DS cache has not expired yet.
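If you want to peek at what is currently sitting in the identity cache, a read-only query like the one below is a harmless first step. This is only a sketch: it assumes the K2 database is named “K2” on the local default instance, and since the useful columns vary between K2 versions it simply selects whole rows:

# Read-only look at the K2 identity cache (inspect column names for your version)
Invoke-Sqlcmd -ServerInstance 'localhost' -Database 'K2' `
    -Query 'SELECT TOP (20) * FROM [Identity].[Identity] WITH (NOLOCK)'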

At the beginning of this blog post we mentioned two major scenarios for AD DS failure (with a third type which can be qualified as a sub-case of (2)):

1) WAN link failure, when no DCs are available to serve a K2 request to a specific domain because all the DCs are behind the WAN link. This is applicable to multi-domain environments with remote domains.

2) Failure of a specific DC to which the K2 server is connected for querying a specific domain.

Given AD DS design best practices, neither of these scenarios should present any problems for applications dependent on AD DS:

(1) It is best practice to place extra DCs on remote sites so that there is no dependency on the WAN link, both to preserve link bandwidth and to safeguard against availability issues. At the very least, an RODC should be present locally on the site for any remote domain if for some reason you cannot place an RWDC locally on each remote site.

NOTE: in a link failure scenario where there are no locally available DCs, there is nothing that can be done from the K2 side; it is a question of restoring the WAN link, or of placing a locally available RWDC/RODC to mitigate this scenario.

(2) The golden rule and requirement for any production AD DS deployment is to have no fewer than two DCs per domain, so the failure of one domain controller should not present any issues.

Now, separately, on scenario (3), where you get the same error as in scenario (2): “System.DirectoryServices.ActiveDirectory.ActiveDirectoryServerDownException: The server is not operational. Name: “DC-XYZ.domain.com”” but you clearly see that your base OS and K2 workspace are using an alternate available DC, whereas K2 smartforms keep throwing an error which mentions the failed DC. With high probability you may see this error with K2 4.6.8/4.6.9.

In this specific scenario you clearly see that K2 workspace works fine at the very time you have this issue with K2 smartforms. This is because the Designer, Runtime and ViewFlow web applications in K2 use the newer WindowsSTS redirect implementation (http://k2.denalilx.com/Identity/STS/Windows) introduced in 4.6.8, whereas K2 Workspace still uses “Windows Authentication”.

I.e. you may see that K2 workspace uses Windows authentication, and in its web.config file the ADConnectionString is configured as “LDAP://domain.com”, whereas for WindowsSTS the K2 label is used, i.e. “LDAP://dc=domain,dc=com”.

You may see the aforementioned error occurring on the redirect to “http://k2.domain.com/Identity/STS/Windows/”.

There is also a known issue with the Windows STS implementation in K2 where an exception in GetGroups causes user authentication to fail on Windows STS; this was fixed in 4.6.10, but there is still an open request to improve error handling with the aim of catching exceptions caused by temporary unavailability of a DC and then having STS retry, so that in cases where the DC is inaccessible for a short interval for unknown reasons, the retry will then connect successfully.

So in scenario (3) you will likely see that the DC locator has switched to an alternate DC, but Windows STS is not performing the switch/retry after the temporary DC failure. This is something I need to research more, but it seems that in this case you have to restart the K2 service to get K2 smartforms back to normal operation.

Irrespective of the scenario (maybe apart from scenario (3)), the first thing to do when you see any such issue is to work with your AD DS team to clarify which specific issue you have on the AD DS side and whether it has been fixed/addressed or not. There is no use attempting fixes from the K2 side if the AD DS issue is not addressed, unless it is an issue with a specific DC and there are other locally available DCs. The only possible thing is to temporarily remove the connection string to some extra domain if you can afford it (and if it is a less important/additional domain which has the issue).

You may get confirmation from your AD DS administration/support team that they have an issue with one specific DC which has failed or is down for maintenance (the latter should be a very rare/exceptional case of planned maintenance during business hours) and that there are other locally available DCs to serve requests from the K2 server. If this is the case, you can try the following things:

1) Use an AD Service SmO to check that you can query the affected domain; if it works, you should not have any issues in K2, and if not, proceed with further checks.

2) Use the following command to verify which DC is currently being used by the K2 server for the specific domain:

nltest /dsgetdc:DomainName

If this command returns the failed DC, then this is an issue with your DC locator service/AD DS infrastructure, or, to put it another way, a problem external to K2.

In general, AD DS as a technology with decades of evolution and a high adoption rate is very stable, and there are no well-known cases where the DC locator fails to switch to an alternative available DC. But depending on the configuration and issues of specific environments, as well as the implementation of application code which interacts with AD DS, there can be cases where DC locator switching does not work properly.

3) If on the 2nd step you get a failed/unavailable DC, try the following command:

nltest /dsgetdc:DomainName /force

This will force a DC locator cache refresh and may help you switch to another DC. Note that sometimes it is necessary to run this a few times until another DC is selected.

4) If step 3 does not help you switch to another available DC, you may try restarting the Netlogon service, as the DC locator cache is implemented as part of this service. Here is an example of how to do it with PowerShell:

# Restart Netlogon to flush the DC locator cache, then re-verify the secure channel
Get-Service netlogon | Restart-Service
nltest.exe /sc_verify:<fully.qualified.domain.name.here>

Once this is done, verify whether you have switched to an available DC by using the following command:

nltest /dsgetdc:DomainName

5) If you see that, even after the DC locator has switched to an available DC, the K2 AD Service SmOs still do not work, consider a K2 service restart or a server reboot. This is most likely scenario (3), where the K2 workspace/base OS work well but K2 smartforms are “stuck” with the server down exception.

Note that the only valid test here is the use of the AD Service SmOs to query the domain: if that works, there is no need to do anything else from the K2 side. If you see issues in areas depending on the URM User service, it may simply be the case that the cached data has expired and new data is still building up. Sometimes it may be necessary to force an identity cache refresh and wait until the cache builds up completely (this can take a very long time in large-scale production environments).

Additional details and recommendations

K2 performs the bind with the DirectoryEntry class, e.g.:

new DirectoryEntry("LDAP://DC=Domain,DC=COM", "", "", AuthenticationTypes.ReadOnly);

This process relies on the Domain Controller Locator, an algorithm that runs in the context of the Net Logon service. Essentially, the Domain Controller Locator is a sort of AD DS client component responsible for selecting a specific DC for a specific domain. The Domain Controller Locator has its own cache: the Net Logon service caches the domain controller information so that the discovery process does not have to be repeated for subsequent requests. Caching this information encourages consistent use of the same domain controller and, thus, a consistent view of Active Directory.
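If you want to reproduce from PowerShell both the kind of bind K2 performs and the DC the locator currently returns, a sketch like the following can help (the connection string is a placeholder for your own domain):

# Bind the way K2 does; RefreshCache() forces an actual connection
Add-Type -AssemblyName System.DirectoryServices
$entry = New-Object System.DirectoryServices.DirectoryEntry('LDAP://DC=domain,DC=com')
$entry.RefreshCache()   # throws if no DC can be reached for the domain

# Ask the DC locator which domain controller it currently returns
[System.DirectoryServices.ActiveDirectory.Domain]::GetCurrentDomain().FindDomainController().Name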

NOTE: as you may notice from the explanations for scenario 3, K2 Workspace and K2 smartforms perform the bind to AD differently; at the very least, the connection strings they use are different.

Refer to the Microsoft documentation for details:

Domain Controller Location Process

Domain Controller Locator

Recommendations

1) Reconfigure K2 to use GC instead of LDAP.

The global catalog is a distributed data repository that contains a searchable, partial representation of every object in every domain in a multidomain Active Directory Domain Services (AD DS) forest. So essentially a GC placed in your local domain can serve part of the queries which would otherwise have to go to DCs in another domain, potentially over a WAN link.

From a purely AD DS perspective, the GC has the following benefits:

– Forest-wide searches. The global catalog provides a resource for searching an AD DS forest.

– User logon. In a forest that has more than one domain, the GC can be used during logon for universal group membership enumeration (Windows 2000 native DFL or higher) and for resolving the UPN when a UPN is used at logon.

– Universal Group Membership Caching. In a forest that has more than one domain, in sites that have domain users but no global catalog server, Universal Group Membership Caching can be used to cache logon credentials so that the global catalog does not have to be contacted for subsequent user logons. This feature eliminates the need to retrieve universal group memberships across a WAN link from a global catalog server in a different site. Essentially, you may enable this feature to make use of the GC even more efficient.

To reconfigure K2 to use the GC, you have to edit the RoleInit XML field of the HostServer.SecurityLabel table, replacing “LDAP://” with “GC://”, and then restart the K2 service.
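For illustration only, the edit could look something like the sketch below. The database name, the service name pattern, and the assumption that RoleInit can be string-replaced in place are mine rather than official guidance; back up the table and try this outside production first:

# Replace LDAP:// with GC:// in the RoleInit XML, then restart the K2 service
Invoke-Sqlcmd -ServerInstance 'localhost' -Database 'K2' -Query @'
UPDATE [HostServer].[SecurityLabel]
SET [RoleInit] = REPLACE(CAST([RoleInit] AS NVARCHAR(MAX)), 'LDAP://', 'GC://');
'@
Get-Service -DisplayName '*K2*' | Restart-Service   # adjust the pattern to match your K2 host server service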

From the K2 perspective, this should improve the responsiveness of the AD SmartObjects, as well as slightly decrease the reliance on the WAN link/the number of queries to DCs outside of the local domain.

2) Try the Domain Controller Locator cache refresh for example scenario 2 (see the details above: nltest /dsgetdc:DomainName /force) and verify whether it is a viable workaround. Use “nltest /dsgetdc:DomainName” to confirm which specific DC is being used by the K2 server, and verify the status and availability of that specific DC with your infrastructure team.

3) In scenario 3, try restarting the K2 service, but first confirm that the DC locator is using a working DC.

4) There is also an existing feature request to investigate the possibility of building some DC failure detection/switching capabilities into K2 code in future versions of the product.


How to check whether the UPA properties are populated correctly for a specific user

Certain SharePoint 2013 features, as well as K2 for SharePoint, need the User Profile Application (UPA) to be working and its database populated with correct data.

Sometimes it is difficult to confirm whether or not UPA is correctly configured, as the SharePoint UI does not show you all the properties for the users. Moreover, even if UPA is not configured properly, users can still log in to SharePoint and successfully get an OAuth token, and this fact complicates troubleshooting.

As a quick way to confirm that UPA is populated correctly for a particular user, you may ask them to log in to SharePoint and navigate to the following page:

https://<siteurl>/_api/SP.UserProfiles.PeopleManager/GetMyProperties

It will return all UPA properties for the user. For OAuth tokens to work correctly, the following properties should be populated: SPS-ClaimID, SPS-ClaimProviderID, SPS-ClaimProviderType, and SPS-UserPrincipalName.
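If you prefer to run this check from PowerShell rather than the browser, a sketch along these lines works for on-premises SharePoint with Windows authentication (the site URL is a placeholder):

$uri = 'https://siteurl/_api/SP.UserProfiles.PeopleManager/GetMyProperties'
$resp = Invoke-RestMethod -Uri $uri -UseDefaultCredentials `
    -Headers @{ Accept = 'application/json;odata=verbose' }
# Show only the claim-related properties that OAuth depends on
$resp.d.UserProfileProperties.results |
    Where-Object { $_.Key -in 'SPS-ClaimID', 'SPS-ClaimProviderID', 'SPS-ClaimProviderType', 'SPS-UserPrincipalName' } |
    Select-Object Key, Value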
