Thursday, September 30, 2021

Understanding Sitecore's Self-Adjusting Thread Pool Size Monitor

Standard

Background

In a previous post, I focused on the inner workings of the .NET CLR and Thread Pool and how they can impact the stability of the Sitecore application.

I must admit, I have become mildly obsessed with the threading over the last couple years, mostly because a great deal of my work has involved stabilization and optimization practices on high-traffic Sitecore sites. 

In this post, I want to focus in the Thread Pool Size Monitor that comes baked into Sitecore from 9 onwards, because it is not widely known that it exists, the job it does, and how it can be tuned to optimize performance.

Thread Pool Size Monitor

To recap, the most important thread configuration settings are the minWorkerThreads and minIOThreads where you can specify the minimum number of threads that are available to your application's Thread Pool instead of relying on the default formula's based on processor count which is always too few.

Threads that are controlled by these values can be created at a much faster rate (because they are spawned from the Thread Pool), than worker threads that are created from the CLR's default "thread-tuning" capabilities. 

To summarize: 

  • Thread pool threads get thrown in faster to handle work. 
  • The CLR thread spin up algorithm is too slow and we can't rely on it to support high performance applications.

As previously mentioned, in Sitecore 9 and above, there is a pipeline processor that allows the application to adjust thread limits dynamically based on real-time thread availability (using the Thread Pool API).

This can be found in the following namespace: Sitecore.Analytics.ThreadPoolSizeMonitor.

By default, every 500 milliseconds, the processor will keep adding a value of 50 to the minWorkerThreads setting via the Thread Pool API until it determines that the minimum number of threads is adequate based on available threads.

How It Works

Since a picture is worth a thousand words, I put together a diagram of how the logic of the Thread Pool Size Monitor logic works, and provided an example with the default settings that are set on an Azure P3v2 App Service that has 4 cores.  




Custom Thread Pool Size Monitor Configuration

An enhancement that I have made on my past 9.1 PaaS implementation was to tune Sitecore’s dynamic thread processor using a more “aggressive” configuration. This helped me with those “bursty” web traffic situations where I needed to be sure that I had enough threads available to serve the current demands. 

Here is the configuration that I used:

Tuesday, September 21, 2021

Sitecore xDB - Troubleshooting xDB Index Rebuilds on Azure

Standard
In my previous post, I shared some important tips to help ensure that if you are faced with an xDB index rebuild, you can get it done successfully and as quickly as possible.

I mentioned a lot of things in the post, but now, I want to mention common reasons where and why things can go wrong, and highlight the most critical items that impact the rebuild speed and stability.


Causes of Need To Rebuild xDB Index

Your xDB relies on your shard database's SQL Server change tracking feature in order to ensure that it stays in sync. This basically determines how long changes are stored in SQL. As mentioned in Sitecore's docs, the Retention Period setting is set to 5 days for each collection shard. 

So, why would 5-day old data not be indexed in time?
  • The Search Indexer is shut down for too long
  • Live indexing is stuck for too long
  • Live indexing falls too far behind

Causes of Indexing Being Stuck or Falling Behind, and Rebuild Failures

High Resource Utilization: Collection Shards 
99% of the time, this is due to high resource utilization on your shard databases. Basically, if you see your shard databases hitting above 80% DTUs, you will run into this problem.

High Resource Utilization: Azure Search or Solr
If you have a lot of data, you need to scale your Azure Search Service or Solr instance.  Sharding is the answer, and I will touch in this further down.

What to check?

If you are on Azure, make sure your xConnect Search Indexer WebJob is running.
Most importantly, check your xConnect Search Indexer logs for SQL timeouts. 

On Azure, the Webjob logs are found in this location: D:\local\Temp\jobs\continuous\IndexWorker\{randomjobname}\App_data\Logs"

Key Ingredients For Rebuild Indexing Speed and Stability

SQL Collection Shards

Database Health 

Maintaining the database indexes and statistics is critically important. As I mentioned in my previous post:  "Optimize, optimize, optimize your shard databases!!!" 

If you are preparing for a rebuild, make sure that you run the AzureSQLMaintenance Stored Procedure on all of your shard databases.

Database Size

The amount of data and the number of collection shards is directly related to resource utilization and rebuild speed and stability. 

Unfortunately, there is no supported way to "reshard" your databases after the fact. We are hoping this will be a feature that is added to a future Sitecore release.

xDB Search Index

Similarly to the collection shards, the amount of data and the number of shards is directly related to resource utilization on both Azure Search and Solr. 

Specifically on Solr, you will see high JVM heap utilization.

If your rebuilds are slowing down or failing, or even if search performance on your xDB index is deteriorating, it's most likely due to the amount of data in your index, the number of shards and distribution amongst nodes that you have set up.  

Search index sharding strategies can be pretty complex, and I might touch on in these in a later post.

Reduce Your Indexer Batch Size

Another item that I mentioned in my previous post. If you drop this down from 1000 to 500 and you are still having trouble, reduce it even further. 

I have dropped the batch size to 250 on large databases to reduce the chance of timeouts (default is 30 seconds) when the indexer is reading contacts and interactions from the collection shards.