Monday, May 18, 2020

Sitecore xDB - Troubleshooting Contacts Moving Slowly Through Marketing Automation Plans

Standard

Background

We ran into an issue on our Sitecore 9.1 Azure PaaS deployment, where contacts were moving extremely slowly through the elements of several of our Marketing Automation plans. The contacts were being enrolled correctly, but when they got to a certain step, the could be stuck for several days.

Our instance includes several high traffic sites, but our plans were not complicated at all.  For example, one was set up like this:

  • Visitor updates their profile, and after clicking submit, a goal is triggered that enrolls the contact into the plan.
  • There is a 1 minute delay.
  • An email is sent.

In this post, I will describe the approach I took to getting to the bottom of this problem.

Getting More Information From Marketing Automation Logs

This first step in any troubleshooting process is to get as much detail from your logs as possible so that you can better understand what could be going wrong. 

One of the key components of Sitecore Marketing Automation, is the task that is responsible for the processing of contacts in the Automation Plans. In Azure, this is the Marketing Automation Operations App Service (ma-ops) WebJob, or the Sitecore Marketing Automation Engine service on a Windows Server. What you need to do is set the logging level to “Information” in the sc.Serilog.xml file. 

The full path of the file location is the following (Azure):
wwwroot/App_Data/jobs/continuous/AutomationEngine/App_Data/Config/sitecore/CoreServices/sc.Serilog.xml

For more information, check the following article: https://kb.sitecore.net/articles/017744

After getting this in place, I was able to rule out the possibility of exceptions being thrown that could have been causing the problem.

Finding the clogged Automation Pool

After digging in, I discovered that the number of contacts in the UI corresponded to the data in the Marketing Automation (ma-db) database’s AutomationPool table.  

These are the other things I discovered:
  • Running a “row count” query against the AutomationPool table determined that there were 100s of millions of records.
  • Checking Azure resource analytics showed large database utilization spikes (100% DTU) that corresponded to high CPU on my ma-ops service (averaging above 70%).
  • Running a profile on the maengine.exe job revealed that getting and processing the data from the SQL server seemed to be the bottleneck.

Fixing the clogged Automation Pool

Database Health

One key assumption here was that my ma-db was healthy, and that its indexes were not seriously fragmented.

In previous posts, I have spoken about how important it is to make sure that regular maintenance is performed on your databases. Same thing applies here. Make sure that you run the AzureSQLMaintenance Stored Procedure on your ma-db regularly.

Increase Pricing Tier If Possible

The analysis determined that both my ma-ops app service and ma-db database were struggling to keep up with the processing load, and so I increased the pricing tier on both to give them more horsepower.

Increase Low Priority Worker Batch Size

The AutomationPool table showed me that almost all of the items had a priority of 80, and would be processed by the LowPriorityWorker. I opted to increase the batch size of this worker, so that it would process more items in each batch.  My worker was initially set to 200, and so I tippled it to 600.

The config file location can be found here: App_Data/jobs/continuous/AutomationEngine/App_Data/Config/sitecore/MarketingAutomation/sc.MarketingAutomation.Workers.xml

<LowPriorityWorkerOptions>
<!-- How long the worker sleeps when there is no work available -->
<Schedule>00:00:20</Schedule>
<!-- The minimum priority of work item to process. -->
<MinimumPriority>0</MinimumPriority>
<!-- The timeout period to use for work items. -->
<WorkItemTimeout>00:00:45</WorkItemTimeout>
<!-- The period after which the work item timeout is set again. -->
<WorkItemTimeoutSchedule>00:00:30</WorkItemTimeoutSchedule>
<!-- The batch size multiplier to use when checking out work items from the pool. -->
<BatchHead>4</BatchHead>
<!-- The batch size to use when checking items out from the pool. -->
<BatchSize>150</BatchSize>
</LowPriorityWorkerOptions>

Clog Removed

After making these adjustments, I kept a close eye on the number of rows in the AutomationPool table.

I am happy to report that they were decreasing at a pleasing rate, and that the data started to appear in the reports in the UI.