Tuesday, August 2, 2016

Diagnosing Content Management Server Memory Issues After a Large Publish

Standard

Background

My current project involved importing a fairly large number of items into Sitecore from an external data source. We were looking at roughly around 600k items that weren't volatile at all. We would have a handful of updates per week.

At the start of development, we debated between using a data provider or going with the import, but after doing a POC using the data provider, it was clear that an import was the best option.

The details of what we discovered would make a great post for another time.

NOTE: The version we were running was Sitecore 8.1 Update 2.

The Problem 

After running the import on the our Staging Content Management Server, we were able to successfully populate 594k items in the master database without any issues.

The problem reared its ugly head after we published the large number of items.

After the successful publish, we noticed that there was an instant memory spike on the Content Management Server after the application pool had recycled. Within about 10 seconds, memory usage would reach 90%, and would continue to climb until IIS simply gave up the ghost.

Mind you, our Staging server was pretty decent, an AWS EC2 Windows instance loaded with 15GB of RAM.

So what would cause this?


Troubleshooting 

I confirmed that my issue was in fact caused by the publish by restoring a backup of the web database from before the publish had occurred and recycling the application pool of my Sitecore instance. 

I decided to take a look at what objects were filling up the memory, and so I loaded and launched dotMemory from JetBrains and started my snapshot.

The snapshot revealed some QueuedEvent lists that were eating up the memory:



Next, I decided to fire up SQL Server Profiler to investigate what was happening on the database server.

Running Profiler for about 10 seconds while Sitecore was starting up, showed the following query being executed 186 times within the same process:

SELECT TOP(1) [EventType], [InstanceType], [InstanceData], [InstanceName], [UserName], [Stamp], [Created] FROM [EventQueue] ORDER BY [Stamp] DESC

Why would Sitecore be executing this query so many times, and then filling up the memory on our server?

I know that Content Management instances have a trigger to check the event queue periodically and collect all events to be processed. But, this seemed very strange.

For more info on how this works, you can check out this article by Patel Yogesh: http://sitecoreblog.patelyogesh.in/2013/07/sitecore-event-queue-scalability-king.html.
It's older but still applicable.

I shifted focus onto the EventQueue table to see what it looked like.

EventQueue Table 

A count on the items in my Web database's EventQueue table returned 1.2M.

99% of the items in the EventQueue table were the following remote event records: 

Sitecore.Data.Eventing.Remote.SavedItemRemoteEvent, Sitecore.Kernel, Version=8.1.0.0, Culture=neutral, PublicKeyToken=null 

Sitecore.Data.Eventing.Remote.CreatedItemRemoteEvent, Sitecore.Kernel, Version=8.1.0.0, Culture=neutral, PublicKeyToken=null 

I ran the following queries to tell me how many "SaveItem" event entries and how many "CreatedItem" event entries existed in the table, that were ultimately put there by my publish: 

SELECT *   FROM [Sitecore_Web].[dbo].[EventQueue]    
WHERE EventType LIKE '%SavedItem%'  AND UserName = 'sitecore\arke'  
ORDER BY Created DESC

SELECT *   FROM [Sitecore_Web].[dbo].[EventQueue]
WHERE EventType LIKE '%CreatedItem%'  AND UserName = 'sitecore\arke'  
ORDER BY Created DESC

Both the former and the later returned 594K items each. This seemed to line up correctly with the number of items that I had recently published, but the fact that we had two entries for each item was the obvious cause of the table having well over 1 million records.

The Solution 

There is a good post on the Sitecore Community site, where Vincent van Middendorp mentions a few Truncate queries to empty the EventQueue table along with the History table: https://community.sitecore.net/developers/f/8/t/1450

Truncating the table seemed a bit too evasive at first, so I went ahead and wrote up a quick query to delete the records from the EventQueue table that I knew I had put there (based on my username):

DELETE   FROM [Sitecore_Web].[dbo].[EventQueue] 
WHERE EventType LIKE '%CreatedItem%' OR EventType LIKE '%SavedItem%' 
AND UserName = 'sitecore\arke' 

Running another count on the records in my EventQueue table returned a count of 7.

So, I may well have just run a truncate :)

After firing up my the Sitecore instance again, I was happy to report that memory on the server was now stable.


The Moral of the Story 

Keep an eye on that EventQueue after a large publish!

Looking forward to seeing the publishing improvements coming in Sitecore 8.2.