Tuesday, September 21, 2021

Sitecore xDB - Troubleshooting xDB Index Rebuilds on Azure

Standard
In my previous post, I shared some important tips to help ensure that if you are faced with an xDB index rebuild, you can get it done successfully and as quickly as possible.

I mentioned a lot of things in the post, but now, I want to mention common reasons where and why things can go wrong, and highlight the most critical items that impact the rebuild speed and stability.


Causes of Need To Rebuild xDB Index

Your xDB relies on your shard database's SQL Server change tracking feature in order to ensure that it stays in sync. This basically determines how long changes are stored in SQL. As mentioned in Sitecore's docs, the Retention Period setting is set to 5 days for each collection shard. 

So, why would 5-day old data not be indexed in time?
  • The Search Indexer is shut down for too long
  • Live indexing is stuck for too long
  • Live indexing falls too far behind

Causes of Indexing Being Stuck or Falling Behind, and Rebuild Failures

High Resource Utilization: Collection Shards 
99% of the time, this is due to high resource utilization on your shard databases. Basically, if you see your shard databases hitting above 80% DTUs, you will run into this problem.

High Resource Utilization: Azure Search or Solr
If you have a lot of data, you need to scale your Azure Search Service or Solr instance.  Sharding is the answer, and I will touch in this further down.

What to check?

If you are on Azure, make sure your xConnect Search Indexer WebJob is running.
Most importantly, check your xConnect Search Indexer logs for SQL timeouts. 

On Azure, the Webjob logs are found in this location: D:\local\Temp\jobs\continuous\IndexWorker\{randomjobname}\App_data\Logs"

Key Ingredients For Rebuild Indexing Speed and Stability

SQL Collection Shards

Database Health 

Maintaining the database indexes and statistics is critically important. As I mentioned in my previous post:  "Optimize, optimize, optimize your shard databases!!!" 

If you are preparing for a rebuild, make sure that you run the AzureSQLMaintenance Stored Procedure on all of your shard databases.

Database Size

The amount of data and the number of collection shards is directly related to resource utilization and rebuild speed and stability. 

Unfortunately, there is no supported way to "reshard" your databases after the fact. We are hoping this will be a feature that is added to a future Sitecore release.

xDB Search Index

Similarly to the collection shards, the amount of data and the number of shards is directly related to resource utilization on both Azure Search and Solr. 

Specifically on Solr, you will see high JVM heap utilization.

If your rebuilds are slowing down or failing, or even if search performance on your xDB index is deteriorating, it's most likely due to the amount of data in your index, the number of shards and distribution amongst nodes that you have set up.  

Search index sharding strategies can be pretty complex, and I might touch on in these in a later post.

Reduce Your Indexer Batch Size

Another item that I mentioned in my previous post. If you drop this down from 1000 to 500 and you are still having trouble, reduce it even further. 

I have dropped the batch size to 250 on large databases to reduce the chance of timeouts (default is 30 seconds) when the indexer is reading contacts and interactions from the collection shards.