One of the frequent type of the issues I have to work on is high RAM usage (normally description of such problem is accompanied by phrase “with no apparent reason”) by K2 host server service. Most of the time I’m trying to create meaningful K2 community KB articles based on support cases I work on but not always everything what I want to say fits in into Click2KB format. So to discuss “K2 host server eats up my RAM/I think I see memory leak here” issue in detail I decided to write this blog post.
Common symptom and starting point here is that you noticed abnormally high RAM usage by K2 host server service, which maybe even leads to service crash or total unresponsiveness of your K2 platform. What’s next and what the possibilities here?
Of course it is all depends on what exactly you see.
I think it is quite expected to see that immediately after server reboot K2 service memory consumption is lower than after server works for a while: Once you rebooted your server it starts clean – all threads and allocated memory is clear hence low RAM usage. But as server warms up it starts checking if it has tasks to process, and it fires other activities like users and group resolution by ADUM manager, recording data in identity cache table and so on. The more task processing threads are active the more memory is required. And keep in mind your host server treads configuration if you increased your default thread pool limits you should realize that it allows server to use more available resources on your server.
Empty (no deployed processes, no active users) K2 host server service has really tiny memory footprint:
As you can see it uses less than 300 MB of RAM. And even if double default thread pool settings (and I heard that for that resources allocated upfront) memory usage stays the same at least on the box without any load.
Now we switching to interesting stuff, i.e. what could it be if RAM usage of K2 service is abnormally high?
And here comes important point: if your process design or custom code has any design flaws or hardware is poorly sized for intended workload processing queue starts growing and it may lead to resource overuse. I.e. it is not a memory leak but bottleneck caused by such things as (and I’m listing them based on probability of being cause of your issue):
1) Custom code or process design. Easy proof that this is the cause is the fact that you unable to reproduce this “memory leak” on empty platform with no running processes. It does tell you that there is no memory leak in K2 platform base code in a way.
You can refer to process design best practices as a starting point here:
I seen enough cases when high memory usage was caused by inefficient process design choices (something like mass uploads to DB or updating 20 MS Word documents properties in a row designed so that file is being downloaded/uploaded 20 times from SharePoint instead of doing batch update with one download/upload of a file).
Also when next time you will see this high memory usage state before doing reboot execute the following queries against K2 database:
A) Check how many process running at the same time right now and if any of them constantly stays in running state:
SELECT * FROM [K2].[Server].[ProcInst] WHERE [Status] = 1
It will give you number of running processes at specific point in time. Constantly having 20 process or more in status 1 may indicate a problem, but more importantly to execute this query multiple times with 1-2 minutes interval and see if some of the process instances with the same ID stays running constantly or for a very long time. This will be likely your “offending” process and you will want to check at which step it is so slow and so on.
B) Check for processes with abnormally high state size:
SELECT TOP 200
DATALENGTH(State) AS StateSize,
Status IN (1, 2)
This query will return processes with largest state size in bytes. If any of the processes has state size more than 1 MB this is problematic process which causes memory over use most likely due to use of looping within the process.
Just some illustrative example of what else can be wrong (and possibilities are huge here 🙂 ): my colleague run into an issue where K2 service process memory usage suddenly started growing at ~16 GB per day rate, and in the end the reason was that every 10 seconds K2 smartactions tried to process an email which was sent to K2 service account mailbox and at is the same account under which smaractions were configured and it lead to sort of cycle and each sending attempt eat up couple of MB of memory. It was only possible to see this with full logging level and during the night where there was no other activities on the server cluttering log files.
2) Slow response/high latency of external systems or network. Depending on design of your workflows they may have dependencies on external systems (SQL, SharePoint) and it could be the case that slow response from their side causing growth of queue on K2 side with memory usage growth (sort of vicious circle or something like race condition can be in play here and it is often difficult to untangle this and isolate root cause).
In such scenario it is better to:
A) At the time of an issue verify K2 host server logs and ADUM logs to see if there are any time outs or communication type of errors/exceptions.
B) Check all servers which comprise your environment (K2, SQL, SharePoint, IIS) and watch out for resource usage spikes and errors in Event Viewer (leverage “Administrative Events” view). K2 relies heavily on SQL where K2 DB is hosted and if it is undersized or overloaded (scheduled execution of some SSIS packages, scheduled antivirus scan or backup) and if it is slow to respond you may see memory usage growth/slowness on K2 server side.
If your servers virtualized confirm your K2 vServers placement with virtualization platform admins – K2 and K2 DB SQL instance should not coexist on the same vHost with I/O intensive apps (especially Exchange, SharePoint).
You should pay special attention to ADUM logs – if there are loads of errors those have to be addressed as K2 server may constantly waste resources on futile attempts to resolve some no longer existing SharePoint group provider (site collection deleted, but group provider still in K2) or resolving objects from non working domain (failed connectivity or trusts config). These resolution attempts eat up resources and may prevent ADUM from timely refreshing things which are needed for running processes by this making situation worse (growing queue).
IMPORTANT NOTE: It never woks in large organizations if you just ask your colleagues (SQL admins/virtualization admins) whether all is OK on their side – you will always get response that all is OK 🙂 You have to ask specific questions and get explicit confirmation of things like VM placement, whether your K2 DB SQL instance is shared with any other I/O intensive apps. You want to have a list and go through it eliminating possibilities.
I personally worked with one client who spent months troubleshooting performance and reviewing their K2 solutions inside out and searching for a leak while it was solved in the end by moving K2 DB to a dedicated SQL server instance, and in a hindsight they realized that previously K2 DB coexisted with some obscure integration DB not heavily used but it had a SSIS package which was firing twice a day and maxed out SQL resources for couple of hours causing prolonged and different disruptions to their K2 system. Checking SQL was suggested from the very beginning and answer was we don’t have issues on SQL side, even after they asked twice their SQL admins.
3) Inadequate hardware sizing. To get an idea about how to size your K2 server you can look at this table:
This may look a bit controversial to you but this table is from Performance and Capacity Planning document from K2 COE document and it illustrates how you have to scale out based on total number of users and number of concurrent user with base configuration of 1 server with 8GB of RAM. Depending on your current hardware configuration this may or may not support your idea of scaling up.
Also see these documents on sizing and performance:
K2 blackpearl Performance Testing
Also see this K2 community KB:
4) Memory leak. This is rather unlikely as K2 code (like code of any other mature commercial software) goes through strict QA and testing, and, personally, I saw not more than 3 cases where there was memory leak type of an issue which had to be fixed in K2 – it was all in old versions and in very specific, not frequent scenarios.
If what you observe is not prolonged memory usage spikes which not going away by themselves, but your K2 service just at times maxing out resource usage but then all goes back to normal with no intervention from your side (such as K2 service/server restart) then it looks like insufficient hardware type of situation (though other issues I mentioned previously still may have influence here). Memory leak is rather imply that you need to stop service or something like this to resolve it.
If after checking all the points mentioned above you still suspect that there could be some memory leak I would recommend you to place K2 support case and prepare all K2 logs along with memory dumps collected in the low and high memory usage states (you can obtain instructions on collecting memory dumps from K2 support).