Monday, June 22, 2020

Sitecore xDB - Extending xConnect To Reduce xDB Growth


Background

As I mentioned in a previous post, putting a Sitecore environment into Production with xDB enabled opens the floodgates to a massive amount of data, because the platform starts collecting interactions and events for both anonymous and identified contacts.

xDB Storage Smarts

Just because you can store everything in xDB, it doesn't necessarily mean that you should store everything.

xDB is a marketing database. As a rule of thumb, you should only store data in xDB if it's required for:
  • Personalization
  • Targeting
  • Measurement / Analytics

It is important to plan for the data requirements early, and again, only store the data that you really need to!

Remove xDB Analytics Data

In version 9.2 and later, xConnect allows you to delete contacts and all associated data in xDB if you choose to. So, you have the ability to remove the data that isn't useful to you anymore. https://doc.sitecore.com/developers/93/sitecore-experience-platform/en/deleting-contacts-and-interactions-from-the-xdb.html

But, you need to do this by writing code, and so you would need to work with your Sitecore architect and developers to carefully plan this out for your implementation. And again, this is only an option for the newer versions of the platform.

Extend xConnect To Filter xDB Analytics Data

One good option that I have implemented is to prevent analytics data that is not valuable to your business from getting into xDB in the first place. xConnect allows you to extend the XConnectDataAdapterProvider, and this allows you to hook into the sweet spot where the session data is being written to the shard databases for storage.



For my implementation, I extended the data adapter provider and added configuration so that only the goals and other events that I explicitly specify are recorded in my collection database.
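The allow-list idea itself can be sketched independently of the Sitecore API. The Python snippet below is only an illustration of the filtering logic, not xConnect code (a real implementation would be a C# extension of the XConnectDataAdapterProvider). The event dictionaries, the `definition_id` key, and the IDs in the allow-list are all hypothetical.

```python
from typing import Iterable

# Hypothetical allow-list of event definition IDs (goals, outcomes, and so on)
# that are considered valuable enough to store.
ALLOWED_DEFINITION_IDS = {
    "goal-newsletter-signup",
    "goal-profile-updated",
}


def filter_events(events: Iterable[dict], allowed: set) -> list:
    """Keep only the events whose definition ID is explicitly allowed."""
    return [e for e in events if e.get("definition_id") in allowed]


# Hypothetical events captured during one session.
session_events = [
    {"definition_id": "goal-newsletter-signup", "page": "/signup"},
    {"definition_id": "page-view", "page": "/"},
    {"definition_id": "goal-profile-updated", "page": "/profile"},
]

kept = filter_events(session_events, ALLOWED_DEFINITION_IDS)
print([e["definition_id"] for e in kept])
```

Keeping the list of allowed IDs in configuration rather than in code means it can be adjusted without a redeploy as your data requirements change.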

I hope that this helps with your implementation!

Sunday, June 14, 2020

Sitecore xDB - How To Fix Right To Be Forgotten Breaking EXM Campaigns


Background

Our email manager faced an Email Experience Manager (EXM) issue where, when starting or resuming an email campaign, the campaign would start initializing and then go directly to a paused state.

Errors

The logs of our Content Management Server revealed the following errors: 

Cause

After digging in, I discovered that this was caused by contacts in our lists that were missing an Alias identifier. This is a dependency for EXM in the way it pulls in the necessary contact data during dispatch.

What had happened was that our email manager had used the Anonymize feature in the Experience Profile dashboard that executes the right to be forgotten xConnect API under the covers. This process removes the Alias identifier from the contact record in our key xDB shard database tables. 



Checking the xDB Database

If you have SQL experience, you will be pleasantly surprised to find that the xDB databases are not that complicated to work with at all.

Executing the following SQL query helped me find the contacts that were missing contact identifiers:


If you compare what a normal contact looks like vs an anonymized contact, you can see that the records have been wiped from the key ContactIdentifiers and ContactIdentifiersIndex tables.

This is the SQL statement that I used:


Normal Contact


Anonymized Contact



Another important point to note is that when a contact has been anonymized, a new ConsentInformation facet is added to the contact.

 


The Fix

I simply wrote a SQL statement that would insert new Alias identifier records in the ContactIdentifiers and ContactIdentifiersIndex tables for the contacts in my lists that had been anonymized.


Running this SQL statement fixed the missing dependency that EXM required for its email dispatch, and our email campaigns started sending again.



Sunday, June 7, 2020

Sitecore xDB - Solr xDB Index Troubleshooting Postman Collection


Background

Over time, I have saved a series of Solr queries that I have found to be extremely useful in understanding, troubleshooting and maintaining xDB Solr indexes.  Some are especially useful when performing an xDB index rebuild.

Getting Started

After downloading the Sitecore xDB.postman_collection.json file, you can install it in your Postman application by clicking the "Import" button located in the top-left corner of the app and selecting the file downloaded on your computer.

After doing this, you will see the Sitecore xDB collection appear in your Postman app:



The core / index name, and Solr url are variables, so the next thing you want to do is set the values for your configuration.

Click on the Sitecore xDB collection, and then the more actions option (3 dots) will appear. Click on Edit, and then navigate to Variables.  




Set the solr_url variable to your Solr instance url, and the xdb_index to match the index you want to work with.  

There are two xConnect Solr cores: the live core (usually xdb) and the rebuild core, which has a _rebuild suffix (like xdb_rebuild). So, you can set this to either value, depending on the core you want to work with.


That's it! You should be ready to test.

Working with the collection

The collection has the following queries available to help with your troubleshooting or maintenance.

xDB Contacts Count

This will return the number of contacts in your index (shown in the numFound value that is returned).

When you start an index rebuild, depending on the core you have configured, you will see this number initially be 0, and then gradually start to increase as contacts get added to the core from your shard databases. 

Obviously, this is also useful to see how many total contacts you have in your index over time.



xDB Interactions Count

This returns the number of interactions in your index (shown in the numFound value that is returned).

xDB Docs Count

This returns the total number of documents in your xDB core. This is a combination of contacts and interactions.
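If you want to run the same kind of count outside Postman, a minimal Python sketch looks like this. Solr reports the match count in the `numFound` field of its JSON response, and `rows=0` asks for the count without returning any documents. The base URL and the `*:*` match-all query are assumptions for illustration; the filters that separate contacts from interactions are in the Postman collection and are not reproduced here.

```python
import json
from urllib.parse import urlencode


def build_count_url(solr_url: str, core: str, query: str = "*:*") -> str:
    """Build a Solr select URL that returns only the match count (rows=0)."""
    params = urlencode({"q": query, "rows": 0, "wt": "json"})
    return f"{solr_url.rstrip('/')}/{core}/select?{params}"


def read_num_found(response_body: str) -> int:
    """Extract response.numFound from a Solr JSON response body."""
    return json.loads(response_body)["response"]["numFound"]


# Assumed values: replace with your own Solr URL and core name.
url = build_count_url("https://localhost:8983/solr", "xdb")
print(url)

# Hand-written sample response, shaped like a Solr select result.
sample = '{"responseHeader": {"status": 0}, "response": {"numFound": 1250, "start": 0, "docs": []}}'
print(read_num_found(sample))
```

Pointing the same function at the xdb_rebuild core during a rebuild lets you watch the count climb from a script instead of from the Postman UI.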

xDB Sync Token

This returns the sync token which is a key component of the Sitecore xConnect / xDB optimistic concurrency model. See more information about it here: https://sitecore.stackexchange.com/questions/12453/xconnect-indexworker-error-tokens-are-incompatible-they-have-different-set-of#answer-12454

xDB Rebuild Status

As described in one of my previous posts, it will return the status of your index, which is especially useful if you are performing an index rebuild.

Default = 0
RebuildRequested = 1
Starting = 2
RebuildingExistingData = 3
RebuildingIncomingChanges = 4
Finishing = 5
Finished = 6
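If you are polling the status from a script, the numeric values above map naturally onto an enum. This is a small Python sketch; the `in_progress` helper reflects my own reading that values 1 through 5 mean a rebuild is underway, which is an assumption rather than something taken from Sitecore's documentation.

```python
from enum import IntEnum


class RebuildStatus(IntEnum):
    """Rebuild status values as listed above."""
    Default = 0
    RebuildRequested = 1
    Starting = 2
    RebuildingExistingData = 3
    RebuildingIncomingChanges = 4
    Finishing = 5
    Finished = 6


def in_progress(status: RebuildStatus) -> bool:
    """Assumption: anything between RebuildRequested and Finishing is an active rebuild."""
    return RebuildStatus.RebuildRequested <= status <= RebuildStatus.Finishing


status = RebuildStatus(4)
print(status.name, in_progress(status))
```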

xDB Schema Modifications

When setting up xDB cores for the first time, you are required to make specific schema modifications to both your xDB and xDB rebuild cores using the Solr Schema API. See the Sitecore docs site for more information: https://doc.sitecore.com/developers/90/platform-administration-and-architecture/en/walkthrough--using-solrcloud-for-xconnect-search.html

This post request contains the JSON schema modifications that need to be applied to each core.  The JSON that is used in the Body of this request can be found in the schema.json file located in the \App_Data\solrcommands folder of your xConnect instance.
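For reference, the equivalent request can be sketched in Python. The Solr Schema API accepts JSON commands via POST to the `/schema` endpoint of a core. The base URL is an assumption, and the `add-field` command below is only a placeholder; the real body must come from the schema.json file mentioned above.

```python
import json
import urllib.request


def build_schema_request(solr_url: str, core: str, commands: dict) -> urllib.request.Request:
    """Build (but do not send) a Schema API POST request for one core."""
    body = json.dumps(commands).encode("utf-8")
    return urllib.request.Request(
        url=f"{solr_url.rstrip('/')}/{core}/schema",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder only: load the real commands from your xConnect schema.json instead.
placeholder_commands = {
    "add-field": {"name": "example_field", "type": "string", "stored": True}
}

# The same modifications are applied to both the live and the rebuild core.
requests_to_send = [
    build_schema_request("https://localhost:8983/solr", core, placeholder_commands)
    for core in ("xdb", "xdb_rebuild")
]

for req in requests_to_send:
    print(req.get_method(), req.full_url)
    # To apply the change for real: urllib.request.urlopen(req)
```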





xDB Delete Interactions

Do not use this in your Production environment! Like the name implies, it will remove interactions from your index. This is useful for local dev environments, if you need to clean things up in your index. This can also be used as an example for working with your Solr cores via the API.

Monday, May 18, 2020

Sitecore xDB - Troubleshooting Contacts Moving Slowly Through Marketing Automation Plans


Background

We ran into an issue on our Sitecore 9.1 Azure PaaS deployment, where contacts were moving extremely slowly through the elements of several of our Marketing Automation plans. The contacts were being enrolled correctly, but when they got to a certain step, they could be stuck for several days.

Our instance includes several high traffic sites, but our plans were not complicated at all.  For example, one was set up like this:

  • Visitor updates their profile, and after clicking submit, a goal is triggered that enrolls the contact into the plan.
  • There is a 1 minute delay.
  • An email is sent.

In this post, I will describe the approach I took to getting to the bottom of this problem.

Getting More Information From Marketing Automation Logs

The first step in any troubleshooting process is to get as much detail from your logs as possible so that you can better understand what could be going wrong.

One of the key components of Sitecore Marketing Automation is the task that is responsible for processing contacts in the Automation Plans. In Azure, this is the Marketing Automation Operations App Service (ma-ops) WebJob; on a Windows Server, it is the Sitecore Marketing Automation Engine service. What you need to do is set the logging level to “Information” in the sc.Serilog.xml file.

The full path of the file location is the following (Azure):
wwwroot/App_Data/jobs/continuous/AutomationEngine/App_Data/Config/sitecore/CoreServices/sc.Serilog.xml

For more information, check the following article: https://kb.sitecore.net/articles/017744

After getting this in place, I was able to rule out the possibility of exceptions being thrown that could have been causing the problem.

Finding the clogged Automation Pool

After digging in, I discovered that the number of contacts in the UI corresponded to the data in the Marketing Automation (ma-db) database’s AutomationPool table.  

These are the other things I discovered:
  • Running a “row count” query against the AutomationPool table determined that there were hundreds of millions of records.
  • Checking Azure resource analytics showed large database utilization spikes (100% DTU) that corresponded to high CPU on my ma-ops service (averaging above 70%).
  • Running a profile on the maengine.exe job revealed that getting and processing the data from the SQL server seemed to be the bottleneck.

Fixing the clogged Automation Pool

Database Health

One key assumption here was that my ma-db was healthy, and that its indexes were not seriously fragmented.

In previous posts, I have spoken about how important it is to make sure that regular maintenance is performed on your databases. Same thing applies here. Make sure that you run the AzureSQLMaintenance Stored Procedure on your ma-db regularly.

Increase Pricing Tier If Possible

The analysis determined that both my ma-ops app service and ma-db database were struggling to keep up with the processing load, and so I increased the pricing tier on both to give them more horsepower.

Increase Low Priority Worker Batch Size

The AutomationPool table showed me that almost all of the items had a priority of 80, and would be processed by the LowPriorityWorker. I opted to increase the batch size of this worker, so that it would process more items in each batch.  My worker was initially set to 200, and so I tripled it to 600.

The config file location can be found here: App_Data/jobs/continuous/AutomationEngine/App_Data/Config/sitecore/MarketingAutomation/sc.MarketingAutomation.Workers.xml

<LowPriorityWorkerOptions>
    <!-- How long the worker sleeps when there is no work available -->
    <Schedule>00:00:20</Schedule>
    <!-- The minimum priority of work item to process. -->
    <MinimumPriority>0</MinimumPriority>
    <!-- The timeout period to use for work items. -->
    <WorkItemTimeout>00:00:45</WorkItemTimeout>
    <!-- The period after which the work item timeout is set again. -->
    <WorkItemTimeoutSchedule>00:00:30</WorkItemTimeoutSchedule>
    <!-- The batch size multiplier to use when checking out work items from the pool. -->
    <BatchHead>4</BatchHead>
    <!-- The batch size to use when checking items out from the pool. -->
    <BatchSize>150</BatchSize>
</LowPriorityWorkerOptions>
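If you manage this setting across several environments, the change can be scripted. The following Python sketch updates BatchSize in a trimmed copy of the fragment above. It assumes the fragment is parsed on its own; in the real sc.MarketingAutomation.Workers.xml the element sits deeper in the document, so the lookup would need adjusting.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the fragment shown above, used here as sample input.
FRAGMENT = """
<LowPriorityWorkerOptions>
  <Schedule>00:00:20</Schedule>
  <MinimumPriority>0</MinimumPriority>
  <BatchHead>4</BatchHead>
  <BatchSize>150</BatchSize>
</LowPriorityWorkerOptions>
"""


def set_batch_size(xml_text: str, new_size: int) -> str:
    """Return the XML with the BatchSize element set to new_size."""
    root = ET.fromstring(xml_text)
    node = root.find("BatchSize")
    if node is None:
        raise ValueError("BatchSize element not found")
    node.text = str(new_size)
    return ET.tostring(root, encoding="unicode")


updated = set_batch_size(FRAGMENT, 600)
print(ET.fromstring(updated).findtext("BatchSize"))
```

Note that ElementTree's default parser discards XML comments, so for a one-off change to the real file it is simpler to edit the value by hand.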

Clog Removed

After making these adjustments, I kept a close eye on the number of rows in the AutomationPool table.

I am happy to report that they were decreasing at a pleasing rate, and that the data started to appear in the reports in the UI.


Monday, December 9, 2019

Huge xDB and xConnect Improvements in Sitecore 9.3

With the release of Sitecore 9.3, there is a lot of hype in the community around the new bells and whistles that the updated version of the platform offers.

What I am most excited about are the huge performance and stability improvements in xConnect and xDB, and in this post, I will highlight those key areas.



Search Indexer Rebuild Enhancements

Performance

Sitecore made some very important improvements to the stability of rebuilds in systems under stress scenarios, allowing for faster rebuilds:

  • Implemented batching and parallelization during the rebuild synchronization stage
  • Reduced the default SplitRecordsThreshold from 25,000 to 1,000
  • Reduced the ParallelizationDegree of the rebuild to half
  • The sync stage of the rebuild can now finish if there is an indexing cycle where fewer than 25k changes are detected.


New Commands

  • Check status of rebuild: -rebuildmonitor / -rm
    • Before, you could only check the logs for the completion entries
  • Delete obsolete data from index: -cleanuprebuildindex / -cri


Rebuild Resume

If you have a lot of interaction data, rebuilds can take a very long time!

If you have a hiccup during the process, there is finally an option to resume without having to start from scratch! A major feature to say the least.


Sharding Deployment Tool Enhancements

With the 9.3 Database Deployment Tool, we finally have the ability to add and remove shards to the xDB collection store. Adding is so important, as I mentioned in my previous post.

To summarize, additional collection shards means additional SQL compute capacity to serve incoming queries in the distributed configuration, and thus faster query response times and index builds. Additional shards will increase total cluster storage capacity, speed up processing, and offer higher availability at a much lower cost than vertical scaling.

You will be required to run an upgrade tool that will modify your pre-9.3 collection databases so that shards can be added and removed.

Web Tracker Stability

The following improvements have been made to the Web Tracker's stability and performance:

  • Session expiration batching
  • Optimization of the Web Tracker and xConnect communication
  • Optimization of the Web Tracker and Reference Data Service communication


Reference Data Layer Redesign 

There have been a lot of issues in previous versions of 9.x that required patches and SQL database scripts to resolve. See my post on Critical Patches for 9.1.

In 9.3, this Reference Data layer has been completely redesigned from the ground up, with the main focus being on performance.

New TVP Provider

Sitecore uses table-valued parameters (TVP) to send aggregated information to SQL, because this is a very fast way to communicate with SQL server. Basically, you can use table-valued parameters to send multiple rows of data to a function or stored procedure without the overhead of creating a temporary table or using tons of parameters.

Sitecore has added a new TVP provider in 9.3, that provides major performance improvements.

Ability to Turn On and Off Performance Counters in Azure

If you are deploying to PaaS, performance counters are now disabled by default just like CMS counters.

With 9.3, you have the ability to switch performance counters off and on for xConnect, xDB and Cortex services.


Tuesday, October 15, 2019

Sitecore xDB Resharding Tool - Unlock xDB Storage and Performance Limitations by Increasing Collection Shards Without Data Loss


Background

There is currently no way for you to increase the number of xDB collection database shards for an existing deployment, without starting from scratch and losing all your data.

The release of my colleague Vitaly Taleyko's Sitecore xDB Resharding Tool solves this problem for all of us, as it provides us with a migration path from an old shard collection to a new one.

https://github.com/pblrok/Sitecore.XDB.ReshardingTool


The inability to increase shards after deployment is a major problem for enterprise customers using the platform, who may not be aware of how quickly the collection databases will grow over time.

If you are a Sitecore veteran, you have experienced this rapid collection growth in MongoDB. As soon as the platform is "turned on", it starts collecting interactions and events for both anonymous and identified contacts.

Putting a Sitecore environment into Production means opening up the floodgates for a massive amount of data that you don't have much control over.

xDB Search Index

On the xDB index side of the house, Sitecore has filtered the interactions and contact data in the xDB index to identified contacts only (by default). This is good for simple customers who aren't doing much with the platform.

However, if you have millions of contacts, you will have the same index problems that you may have faced when dealing with anonymous contact data in previous versions, as mentioned in this blog post from a while ago. There is a solution for this that I may touch on in a later post.

The 2 Shard SQL Problem

Out of the box, if you install the Sitecore platform using default scripts and SIF or if Sitecore Managed Cloud has set up your environment in Azure, you will have 2 SQL shard collection databases.

The value proposition of Sitecore xDB is the ability to store any events tied to an interaction from any channel in xDB. They have provided a robust set of APIs to allow this.

The problem is that storing hundreds of gigabytes or even terabytes of data requires a very carefully planned strategy, or else, you will fail. It is just a matter of when.

I always have the following scene from Evan Almighty in my head when I talk about this problem:




The bottom line - if you have an enterprise deployment, and are using xDB, the out of the box 2 shard collection database configuration is not enough!

New Deployments

If you are new to the Sitecore platform, you can fix this by using the Shard Map Manager Tool to increase the number of shards. This great post by Kelly Rusk explains how: http://thebitsthatbyte.com/what-is-and-how-to-use-the-sitecore-9-shard-map-manager-tool

Existing / Live Deployments

Bad news if you have an existing deployment.

You will hit bottlenecks as you store more contact, interaction and event data. With limited CPU, storage capacity and memory, database performance will start to suffer and query performance and routine maintenance will slow down.

When it comes to adding resources to support database operations, vertical scaling (aka scaling up which is very easy to do on Azure) has its own set of limits and eventually reaches a point of diminishing returns.

The Negative Ripple Effect on xDB

I have seen cases where due to the massive amount of data stored in the xDB collection shards over time, xConnect Search will fail to keep the xDB index in sync, and xConnect search will stop working.

After this happens, the only option is to rebuild your xDB index, but because of the poor xDB collection database performance, it will take days, if not weeks, if you are lucky.

Or, it will simply keep failing.

How Increasing Shards Helps

Adding additional collection shards to xDB means additional SQL compute capacity to serve incoming queries in the distributed configuration, and thus faster query response times and index builds.

Additional shards will increase total cluster storage capacity, speed up processing, and offer higher availability at a much lower cost than vertical scaling.

How the xDB Resharding Tool Helps

As I started working with Vitaly to architect this tool, our first idea was to use the Data Exchange Framework that is used to power the xDB Migration Tool. We had used a customized version of this tool when we migrated from our Sitecore 8.2 deployment to our current 9.1.

We decided to pivot, because we wanted a lightweight tool that could run on any Windows-based machine, and would run directly against SQL and as a result, be much more efficient!

The Beauty of the Tool Part 1: Migrating Your Data

This tool allows you to reshard your Sitecore xDB Collection Databases without any data loss.

What does this mean exactly?!

This tool allows you to migrate your current SQL xDB analytics collection database shards to a new set of xDB analytics collection database shards without losing any of your data.

So, for example if you have 2 shard databases, and want to move up to 4 shard databases, this tool will allow you to migrate over.

For this example, you would set up 4 new shards using SIF (as shown in Vitaly's GitHub readme doc), or use the Shard Map Manager Tool, and then point the tool at your old shards and the new shards and voila! Your data will get migrated over!
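To make the idea of moving from 2 shards to 4 more concrete, here is a toy Python sketch of redistributing records across a larger set of shards. It is only an illustration: it uses a simple hash-modulo assignment, which is my own stand-in and does not model Sitecore's shard map or the actual algorithm used by the tool. The record shape and the `contact_id` key are hypothetical.

```python
import hashlib
from collections import defaultdict


def shard_for(contact_id: str, shard_count: int) -> int:
    """Pick a shard by hashing the contact ID (illustrative scheme only)."""
    digest = hashlib.sha256(contact_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def reshard(old_shards: dict, new_shard_count: int) -> dict:
    """Read every record from the old shards and place it in its new shard."""
    new_shards = defaultdict(list)
    for records in old_shards.values():
        for record in records:
            target = shard_for(record["contact_id"], new_shard_count)
            new_shards[target].append(record)
    return dict(new_shards)


# Hypothetical data: 10 contacts spread over 2 old shards.
contacts = [{"contact_id": f"contact-{n}"} for n in range(10)]
old = {0: contacts[:5], 1: contacts[5:]}

new = reshard(old, 4)
print({shard: len(records) for shard, records in sorted(new.items())})
```

The key property is that every record in the old shards ends up in exactly one of the new shards, which is what "without data loss" means in practice.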

The Beauty of the Tool Part 2: Resume Mode

Another fantastic feature that Vitaly added to this tool is "resume mode". If there is a glitch in the migration process, or you need to stop it manually for some reason and resume later, it will remember where it left off and pick the migration right back up!
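The general pattern behind a resume feature like this is a persisted checkpoint. The Python sketch below shows that pattern in its simplest form; it is not taken from the tool's source code, and the checkpoint file format, the `fail_at` parameter, and the record names are all hypothetical.

```python
import json
import os
import tempfile


def migrate(records, checkpoint_path, copy_record, fail_at=None):
    """Copy records in order, persisting the index of the next record to copy."""
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_index"]
    for i in range(start, len(records)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated interruption")
        copy_record(records[i])
        # Save progress only after the record has been copied.
        with open(checkpoint_path, "w") as f:
            json.dump({"next_index": i + 1}, f)


records = [f"contact-{n}" for n in range(5)]
copied = []
checkpoint = os.path.join(tempfile.mkdtemp(), "checkpoint.json")

try:
    migrate(records, checkpoint, copied.append, fail_at=3)
except RuntimeError:
    pass  # the first run stops after three records

migrate(records, checkpoint, copied.append)  # the second run resumes at index 3
print(copied)
```

Writing the checkpoint after each copy means a crash can cause at most one record to be copied twice, so a real migration would also need the copy step to be safe to repeat.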

Battle-Tested and Ready For Download!

This tool has been tested, and I can say that it works, and works well!

You can check the full source code out on Vitaly's GitHub: https://github.com/pblrok/Sitecore.XDB.ReshardingTool

You can download your copy today using this link: https://github.com/pblrok/Sitecore.XDB.ReshardingTool/raw/master/ToolReleases/win-x64.zip


Saturday, September 21, 2019

Sitecore xDB - Troubleshooting xDB Index Rebuilds on Azure

In my previous post, I shared some important tips to help ensure that if you are faced with an xDB index rebuild, you can get it done successfully and as quickly as possible.

I mentioned a lot of things in the post, but now, I want to mention common reasons where and why things can go wrong, and highlight the most critical items that impact the rebuild speed and stability.


Causes of Need To Rebuild xDB Index

Your xDB index relies on the SQL Server change tracking feature of your shard databases in order to stay in sync. As mentioned in Sitecore's docs, the change tracking Retention Period setting, which determines how long changes are stored in SQL, is set to 5 days for each collection shard.

So, why would 5-day old data not be indexed in time?
  • The Search Indexer is shut down for too long
  • Live indexing is stuck for too long
  • Live indexing falls too far behind
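The underlying rule is simple: once the indexer is further behind than the retention period, the changes it needs are no longer available in SQL. A minimal Python sketch of that check follows; the function name and the example timestamps are hypothetical, and how you obtain the "last indexed" time depends on your own monitoring.

```python
from datetime import datetime, timedelta

# Retention period for change tracking on each collection shard, as noted above.
RETENTION = timedelta(days=5)


def needs_rebuild(last_indexed_at: datetime, now: datetime,
                  retention: timedelta = RETENTION) -> bool:
    """True when the indexer has fallen further behind than the retention period."""
    return now - last_indexed_at > retention


now = datetime(2019, 9, 21, 12, 0)
print(needs_rebuild(datetime(2019, 9, 18, 12, 0), now))  # 3 days behind
print(needs_rebuild(datetime(2019, 9, 14, 12, 0), now))  # 7 days behind
```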

Causes of Indexing Being Stuck or Falling Behind, and Rebuild Failures

High Resource Utilization: Collection Shards 
99% of the time, this is due to high resource utilization on your shard databases. Basically, if you see your shard databases hitting above 80% DTUs, you will run into this problem.

High Resource Utilization: Azure Search or Solr
If you have a lot of data, you need to scale your Azure Search Service or Solr instance.  Sharding is the answer, and I will touch on this further down.

What to check?

If you are on Azure, make sure your xConnect Search Indexer WebJob is running.
Most importantly, check your xConnect Search Indexer logs for SQL timeouts. 

On Azure, the WebJob logs are found in this location: D:\local\Temp\jobs\continuous\IndexWorker\{randomjobname}\App_data\Logs
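If the logs are large, a short script can pull out the timeout entries for you. This Python sketch searches for the "Timeout Expired" text that SQL Server client timeouts typically contain. The sample log lines and their layout are made up for illustration, so adjust the pattern to what you actually see in your Search Indexer logs.

```python
import re

# SQL client timeouts usually include the phrase "Timeout Expired".
TIMEOUT_PATTERN = re.compile(r"timeout expired", re.IGNORECASE)


def find_timeouts(lines):
    """Return (line_number, text) for every line that mentions a timeout."""
    return [
        (number, line.rstrip())
        for number, line in enumerate(lines, start=1)
        if TIMEOUT_PATTERN.search(line)
    ]


# Hypothetical log excerpt; real log formatting will differ.
sample_log = [
    "INFO  Indexing batch started",
    "ERROR Execution Timeout Expired.  The timeout period elapsed prior to "
    "completion of the operation or the server is not responding.",
    "INFO  Indexing batch finished",
]

hits = find_timeouts(sample_log)
for number, text in hits:
    print(number, text)
```

To run it against a real file, pass an open file handle instead of the sample list, since iterating a file yields its lines.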

Key Ingredients For Rebuild Indexing Speed and Stability

SQL Collection Shards

Database Health 

Maintaining the database indexes and statistics is critically important. As I mentioned in my previous post:  "Optimize, optimize, optimize your shard databases!!!" 

If you are preparing for a rebuild, make sure that you run the AzureSQLMaintenance Stored Procedure on all of your shard databases.

Database Size

The amount of data and the number of collection shards is directly related to resource utilization and rebuild speed and stability. 

Unfortunately, there is no supported way to "reshard" your databases after the fact. We are hoping this will be a feature that is added to a future Sitecore release.

xDB Search Index

Similarly to the collection shards, the amount of data and the number of shards is directly related to resource utilization on both Azure Search and Solr. 

Specifically on Solr, you will see high JVM heap utilization.

If your rebuilds are slowing down or failing, or even if search performance on your xDB index is deteriorating, it's most likely due to the amount of data in your index, the number of shards and distribution amongst nodes that you have set up.  

Search index sharding strategies can be pretty complex, and I might touch on these in a later post.

Reduce Your Indexer Batch Size

Another item that I mentioned in my previous post. If you drop this down from 1000 to 500 and you are still having trouble, reduce it even further. 

I have dropped the batch size to 250 on large databases to reduce the chance of timeouts (default is 30 seconds) when the indexer is reading contacts and interactions from the collection shards.