Saturday, August 31, 2019

Sitecore xDB - Optimizing Your xDB Index Rebuild For Speed

Having performed Sitecore xDB index rebuilds many times with large data sets, I wanted to share some key tips to ensure a successful and speedy rebuild.

My experience has been on Azure PaaS, with both Azure Search and SolrCloud, but these techniques can be applied to on-premises and IaaS deployments as well.

Change The Log Level

Before starting a large rebuild job, it's important to enable the proper logging in case you need to investigate any issues that may arise. By default, the log level is set to Warning. Change it to Information.

Navigate to the App_Data\jobs\continuous\IndexWorker\App_data\config\sitecore\CoreServices\sc.Serilog.xml file and change the MinimumLevel to Information:
<MinimumLevel>
  <Default>Information</Default>
</MinimumLevel>

Optimize Your Indexer Batch Size

Don't be overeager with your indexer's batch size setting. It determines how many contacts or interactions are loaded per parallel stream during an index rebuild, and it is found in the following location:
App_Data\jobs\continuous\IndexWorker\App_data\config\sitecore\SearchIndexer\sc.Xdb.Collection.IndexerSettings.xml file:
<BatchSize>1000</BatchSize>

I have had success reducing this size to 500, as it helps execution go faster and prevents you from hitting timeouts during your rebuild when your shard databases are under heavy load.

Reduce Your SplitRecordsThreshold

This is a good tip from a member of our Sitecore community. By default, this value is set to 25000. Reducing it can also make your rebuilds run faster and improve your live indexing.

Like the Indexer Batch size, this setting is found in the following location:
App_Data\jobs\continuous\IndexWorker\App_data\config\sitecore\SearchIndexer\sc.Xdb.Collection.IndexerSettings.xml
<SplitRecordsThreshold>25000</SplitRecordsThreshold>

I have had success reducing this size to a little less than half the original value, 12000.

Reduce Your Index Writer's ParallelizationDegree

To decrease the load on your search service, another good suggestion is to reduce the ParallelizationDegree setting, which is 4 by default.

You will see this in many of the other configs in your xConnect App Services. This setting determines how many parallel streams of data can be processed at the same time.

This setting is found in this location for Solr:
\App_Data\jobs\continuous\IndexWorker\App_data\config\sitecore\SearchIndexer\sc.Xdb.Collection.IndexWriter.SOLR.xml
<ParallelizationDegree>4</ParallelizationDegree>

This setting is found in this location for Azure Search:
\App_Data\jobs\continuous\IndexWorker\App_data\config\sitecore\SearchIndexer\sc.Xdb.Collection.IndexWriter.AzureSearch.xml
<ParallelizationDegree>4</ParallelizationDegree>

I have seen large improvements reducing this value from 4 to 1.
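If you prefer to script these three changes rather than edit the files by hand, something like the following Python sketch could work. It is only an illustration: the `set_setting` helper and the trimmed `SAMPLE` document are hypothetical, and the real config files have more surrounding structure than is shown here.

```python
import xml.etree.ElementTree as ET


def set_setting(xml_text: str, tag: str, value: int) -> str:
    """Return xml_text with the first <tag> element's text set to value."""
    root = ET.fromstring(xml_text)
    # iter() searches the whole tree, so the nesting depth does not matter.
    element = next(root.iter(tag), None)
    if element is None:
        raise KeyError(f"<{tag}> not found")
    element.text = str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical, heavily trimmed stand-in for the indexer settings file.
SAMPLE = (
    "<Settings><Options>"
    "<BatchSize>1000</BatchSize>"
    "<SplitRecordsThreshold>25000</SplitRecordsThreshold>"
    "</Options></Settings>"
)

# Apply the values suggested in this post.
patched = set_setting(SAMPLE, "BatchSize", 500)
patched = set_setting(patched, "SplitRecordsThreshold", 12000)
print(patched)
```

The same helper can be pointed at the index writer config with "ParallelizationDegree" and 1. Note that ElementTree drops comments and the XML declaration when it re-serializes, so keep a backup of the original file.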

Optimize Your Databases

Optimize, optimize, optimize your shard databases!!!

If you are not already using the AzureSQLMaintenance Stored Procedure on your Sitecore databases, do it today!

This is critically important not only for your shard databases but also for your other Sitecore databases like Core, Master and Web. The Marketing Automation and Reference databases also get hammered pretty hard, so make sure that the maintenance is applied and run regularly on these too.

Note that this maintenance is 100% necessary. As Grant Killian says: "The expectation that Azure SQL is a 'fully managed solution' is somewhat misleading, as rebuilding query stats and defragmentation are the user’s responsibility."

Sitecore recommends an approach like this: https://techcommunity.microsoft.com/t5/Azure-Database-Support-Blog/Automating-Azure-SQL-DB-index-and-statistics-maintenance-using/ba-p/368974

Schedule an Azure Automation “Runbook” to attend to this after hours for all Sitecore databases.

Run the Rebuild

After all of these things have been completed, run the xDB rebuild as Sitecore's docs describe. In Kudu, go to site\wwwroot\App_data\jobs\continuous\IndexWorker and execute this command:

.\XConnectSearchIndexer.exe -rr

After this, you will see the magic start: the document count in your inactive / rebuild index is first reset to 0, and then gradually starts increasing.

Unfortunately, there is no way to see exactly how far along you are. There is, however, a query that you can run against your Azure Search or Solr indexes to check the rebuild status.

Azure Search Query
$filter=id eq 'xdb-rebuild-status'&$select=id,rebuildstate

Solr Query
id:"xdb-rebuild-status"
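If you want to run these queries from a script instead of a portal, the Python sketch below builds the request URLs. Treat it as an assumption-laden example: the host, core, service and index names are placeholders, the api-version value is just one that existed at the time of writing, and the helper names are my own.

```python
from urllib.parse import quote, urlencode

STATUS_DOC_ID = "xdb-rebuild-status"


def solr_status_url(base_url: str, core: str) -> str:
    """Build a Solr /select URL for the rebuild status document."""
    params = {"q": f'id:"{STATUS_DOC_ID}"', "wt": "json"}
    return f"{base_url}/{core}/select?{urlencode(params, quote_via=quote)}"


def azure_search_status_url(service: str, index: str, api_version: str) -> str:
    """Build an Azure Search docs URL using the filter and select shown above."""
    params = {
        "api-version": api_version,
        "$filter": f"id eq '{STATUS_DOC_ID}'",
        "$select": "id,rebuildstate",
    }
    query = urlencode(params, quote_via=quote)
    return f"https://{service}.search.windows.net/indexes/{index}/docs?{query}"


# Placeholder names; substitute your own endpoints and index names.
solr_url = solr_status_url("https://solr.example.local:8983/solr", "xdb_rebuild")
azure_url = azure_search_status_url("my-search-service", "xdb-rebuild", "2019-05-06")
print(solr_url)
print(azure_url)
```

The Azure Search request also needs an api-key header, which is left out here on purpose.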

Returned Index Rebuild Status:
Default = 0
RebuildRequested = 1
Starting = 2
RebuildingExistingData = 3
RebuildingIncomingChanges = 4
Finishing = 5
Finished = 6

Rebuild Success

When you run the status query and see a "6" (sometimes you will see "5" returned, and that's ok too), you have successfully completed the rebuild, and new data will start flowing into your xDB index.
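For anyone polling the status from a script, the state values listed above translate naturally into an enum. This is a small Python sketch; the class and helper names are my own, and treating both 5 and 6 as "done" simply mirrors the observation above rather than anything documented.

```python
from enum import IntEnum


class RebuildState(IntEnum):
    """Rebuild states as listed in this post."""
    DEFAULT = 0
    REBUILD_REQUESTED = 1
    STARTING = 2
    REBUILDING_EXISTING_DATA = 3
    REBUILDING_INCOMING_CHANGES = 4
    FINISHING = 5
    FINISHED = 6


def describe(value) -> str:
    """Translate a raw rebuildstate value (int or numeric string) to its name."""
    return RebuildState(int(value)).name


def is_done(value) -> bool:
    """Treat Finishing (5) and Finished (6) as a completed rebuild."""
    return RebuildState(int(value)) in (RebuildState.FINISHING, RebuildState.FINISHED)


print(describe(3), is_done(3))
print(describe("6"), is_done("6"))
```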

Friday, August 9, 2019

Demystifying Pools and Threads to Optimize and Troubleshoot Your Sitecore Application


Background

If you are a .NET application developer, whether you work on Sitecore or not, it is important to understand how the Microsoft .NET Common Language Runtime (CLR) Thread Pool works. This understanding will help you determine how to configure your application for optimal performance and help you troubleshoot issues that may present themselves in high traffic production environments.

This topic has been of great interest to me, and understanding it has helped me troubleshoot and solve many difficult problems within the Sitecore realm.

I am hoping that this post gives fellow Sitecore developers who may not be as familiar with the inner workings of the .NET CLR and Thread Pool a starting point for understanding where potential threading issues may occur, if the application you support shows symptoms similar to what I intend to discuss.



Thread Pool and Threads 

To put it simply, a thread pool is a group of warmed up threads that are ready to be assigned work to process. 

The CLR Thread Pool contains 2 types of threads that have different roles.

1) Worker Threads 

Worker threads are threads that process HTTP requests that come into your web server - basically they handle and process your application's logic. 

2) Input/Output (I/O) Completion Port or IOCP Threads 

These threads handle communication from your application's code to a network type resource, like a database or web service.

There is really no technical difference between worker threads and IOCP threads. The CLR Thread Pool keeps separate pools of each simply to avoid a situation where high demand on worker threads exhausts all the threads available to dispatch native I/O callbacks, potentially leading to a deadlock. However, this can still occur under certain circumstances.

Out of the Box / Default Thread Pool Thread Counts 

Minimums 

By default, the number of Worker and IOCP threads that your Thread Pool will have ready for work is determined by the number of processors your server has.

Min Formula: Processor Count =  Thread Pool Worker Threads = Thread Pool IOCP Threads

Example: If you have a server with 8 CPUs, you will start with only 8 worker and 8 IOCP threads.

Maximums 

By default, the maximum number of Worker and IOCP threads is 20 per processor.

Max Formula: Processor Count * 20 =  Max Thread Pool Worker Threads = Max Thread Pool IOCP Threads

Example:  If you have a server with 8 CPUs, the default max worker and IOCP threads will be 20 x 8 = 160.
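The two formulas above are easy to sanity check for any server size. Here is a tiny Python sketch of the arithmetic; the function name is made up for illustration, and it only encodes the defaults described in this post, not anything read from a live .NET process.

```python
import os


def default_thread_limits(processor_count=None):
    """Default CLR Thread Pool limits using the formulas from this post."""
    cores = processor_count or os.cpu_count() or 1
    return {
        "min_worker": cores,       # Min formula: one per processor
        "min_iocp": cores,
        "max_worker": cores * 20,  # Max formula: 20 per processor
        "max_iocp": cores * 20,
    }


print(default_thread_limits(8))
```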

Safety Switch 

The Thread Pool WILL NOT inject new threads when the CPU usage is above 80%. This is a safety mechanism to prevent overloading the CPU.

The Thread Pool In Action

As requests come into your web server, the Thread Pool will inject new worker or I/O completion threads when all the other threads are busy until it reaches the "Minimum" number for each type of thread.

After this "Minimum" has been reached, the Thread Pool will throttle the rate at which it injects new threads and will only add or remove 1 thread per 500ms / 2 threads per second, or as a thread has completed work and becomes free, whichever comes first.
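To get a feel for what that throttle means, here is a rough Python sketch of the worst case, where no busy thread frees up and every extra thread has to be injected at the throttled rate. The function name is hypothetical and the model deliberately ignores hill climbing and the CPU safety switch.

```python
def ramp_up_seconds(min_threads: int, needed_threads: int, interval_ms: int = 500) -> float:
    """Worst-case seconds to grow the pool from min_threads to needed_threads."""
    extra_threads = max(0, needed_threads - min_threads)
    # One injected thread per interval once the minimum has been reached.
    return extra_threads * interval_ms / 1000


# 8 warm threads, 100 threads needed: 92 injections at 500 ms each.
print(ramp_up_seconds(8, 100))  # 46.0
```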

Through its "hill climbing" algorithm, the Thread Pool is self-tuning: it will stop adding threads, and remove them, if they are not actually helping improve throughput. The thread injection will continue while there is still work to be done until the "Maximum" number for each thread type has been reached.

As the number of requests is reduced, the threads in the Thread Pool start timing out waiting for new work (if an existing thread stays idle for 15 seconds), and will eventually retire themselves until the pool shrinks back to the minimum.

"Bursty" Web Traffic, Thread Starvation and 503 Service Unavailable

Let's say you have your Sitecore site running on an untuned, single Content Delivery server that has 8 processors with the default Thread Pool thread settings. For the sake of a simple example, let's assume we have an under-powered web service (perhaps used for looking up customer information from a backend CRM system) that under heavy load takes 5 seconds to provide a response to a request. Our developers have not implemented asynchronous programming in this example, and use the HttpWebRequest class.

We start out with 8 warmed up and ready worker and IOCP threads in our Thread Pool.

Now, let's say we have a burst of 100 visitors accessing different pages (pages that consume the web service) on our site at the same time. The Thread Pool will quickly assign the 8 threads to handle the first 8 requests, and those threads will be busy for the next 5 seconds while the other 92 requests sit in a queue. As you can see, it will take many 500ms intervals to catch up with the workload. IIS will wait some time for threads to become free so that the requests in the queue can be processed. If a thread becomes free within the waiting time, it will be used to process a queued request. Otherwise, IIS will return a 503 Service Unavailable error message. Together, the slow web service and the untuned Thread Pool will result in some unhappy visitors seeing the 503 error message.
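The burst scenario can be approximated with a toy simulation. The Python sketch below is my own simplification, not how the CLR actually schedules work: time moves in 500 ms ticks, every request blocks a thread for 5 seconds (10 ticks), and the pool injects one thread per tick only when work is queued and no thread is about to become free. Hill climbing, the CPU safety switch and the IIS queue timeout are all ignored.

```python
import heapq


def simulate_burst(requests=100, min_threads=8, service_ticks=10, max_threads=160):
    """Return (start tick of each request, final thread count); 1 tick = 500 ms."""
    free_at = [0] * min_threads  # heap: tick at which each thread becomes free
    heapq.heapify(free_at)
    start_ticks = []
    pending = requests
    tick = 0
    while pending:
        # Hand queued requests to every thread that is free at this tick.
        while pending and free_at[0] <= tick:
            heapq.heapreplace(free_at, tick + service_ticks)
            start_ticks.append(tick)
            pending -= 1
        tick += 1
        # Throttled injection: at most one new thread per tick, and only
        # when work is still queued and every existing thread is busy.
        if pending and free_at[0] > tick and len(free_at) < max_threads:
            heapq.heappush(free_at, tick)
    return start_ticks, len(free_at)


starts, threads = simulate_burst()
print(f"threads at the end: {threads}")
print(f"last request waited {starts[-1] * 0.5} seconds before it even started")
```

Even in this generous model, requests at the back of the queue wait far longer than the 5 seconds the web service itself needs, which is exactly the window in which visitors start seeing 503 errors.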

Looking at this a bit closer, a call to a web service uses one worker thread to execute the code that sends the request and one IOCP thread to receive the callback from the web service. In our case, the Thread Pool is completely saturated with work, and so the callback can never get executed because the items that were queued in the thread pool were blocked.

This problem is called Thread Pool Starvation - we have a "hungry" queue waiting to be served threads from the pool to perform some work, but none are available.

This example is a good reason for using asynchronous programming. With async programming, threads aren’t blocked while requests are being handled, so the threads would be freed up almost immediately.

Optimizing Thread Settings 

The ability to tune / manage thread settings has been available in the .NET framework for ages - since v1.1 actually.

Arguably, the most important settings are minWorkerThreads and minIOThreads, where you can specify the minimum number of threads that are available to your application's Thread Pool out of the gate (overriding the default formulas based on processor count as described above).

Threads that are controlled by these settings can be created at a much faster rate (because they are spawned from the Thread Pool) than worker threads that are created by the CLR's default "thread-tuning" capabilities - 1 thread per 500ms / 2 threads per second when all available threads in the pool are busy.

These and other important thread settings can be set in either your server's machine configuration file (in the \WINDOWS\Microsoft.Net\Framework\vXXXX\CONFIG directory) or with the Thread Pool API.

Beware: Out-of-Process Session State and Redis Client  

Out-of-Process Session State

If you are using Out-of-Process Session State in your Sitecore environment, you need to tune your Thread Pool!

Each of your Sitecore Content Delivery instances is individually configured to query expired sessions from your session store. This mechanism will add a ton of additional request overhead to your CD instances, and if your Thread Pools aren't tuned to handle this, you will find yourself in a Thread Starvation situation.

For more background on how and why this happens, please check out Ivan Sharamok's great post: http://blog.sharamok.com/2018-04-07/prepare-cd-for-experience-data-collection

Redis Client

If you are running your Sitecore environments on Microsoft Azure, you will be using Redis for session management. Sitecore makes use of the StackExchange.Redis client within the platform. Even though the client is built for high performance, it gets finicky if your Thread Pool threads are all busy, the "minimum" has been reached and thread injection slows down. You will start seeing Redis service request timeouts.

It is important for you to go through a Thread Pool tuning exercise to ensure that you don't run into Thread Starvation issues.

The nice thing is that the client prints Thread Pool statistics to your logs with details about worker and IOCP threads, to help you with your tuning exercise.

For more details, follow this Microsoft Redis FAQ link: https://docs.microsoft.com/en-us/azure/azure-cache-for-redis/cache-faq#important-details-about-threadpool-growth
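If you have a lot of log lines to go through, a short script can pull those statistics out for you. The Python sketch below assumes the "IOCP: (Busy=...,Free=...,Min=...,Max=...), WORKER: (...)" layout that the FAQ shows for timeout messages; the exact wording can differ between client versions, and the sample message is made up for illustration. The rule of thumb it applies is that Busy being greater than Min suggests the pool minimums need tuning.

```python
import re

# Assumed layout of the thread pool section of a timeout message.
POOL_RE = re.compile(r"(IOCP|WORKER): \(Busy=(\d+),Free=(\d+),Min=(\d+),Max=(\d+)\)")


def parse_pool_stats(message: str) -> dict:
    """Extract Busy/Free/Min/Max for the IOCP and WORKER pools."""
    stats = {}
    for name, busy, free, minimum, maximum in POOL_RE.findall(message):
        stats[name] = {
            "busy": int(busy),
            "free": int(free),
            "min": int(minimum),
            "max": int(maximum),
        }
    return stats


def pools_over_minimum(message: str) -> list:
    """Names of pools where Busy exceeds Min, a hint that tuning is needed."""
    return [name for name, s in parse_pool_stats(message).items() if s["busy"] > s["min"]]


# Illustrative message fragment, not copied from a real log.
sample = "Timeout performing GET ..., IOCP: (Busy=0,Free=1000,Min=8,Max=1000), WORKER: (Busy=12,Free=148,Min=8,Max=160)"
print(pools_over_minimum(sample))
```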

Self-adjusting Thread Settings 

Lucky for us on Sitecore 9 and above, there is a pipeline processor that allows the application to adjust thread limits dynamically based on real-time thread availability (using the Thread Pool API).

By default, every 500 milliseconds, the processor will keep adding 50 to the minWorkerThreads setting via the Thread Pool API until it determines that the minimum number of threads is adequate based on available threads.

In my next post, I intend to explore this processor in detail and provide information on its self-tuning abilities.