Friday, October 21, 2016

Taming Your Sitecore Analytics Index by Filtering Anonymous Contact Data

With the release of Sitecore versions 8.1 U3 and 8.2, there is a new setting that will dramatically reduce the activity on your instance's analytics index by filtering out anonymous contact data from it.

To put this simply; you don't have to have all the anonymous visitor data added to your analytics index anymore.

 xDB will still capture and show the anonymous visitor data in the various reporting dashboards, but this data won't be added to your analytics index, and you won't see the anonymous contacts in the Experience Profile dashboard.

The new "ContentSearch.Analytics.IndexAnonymousContacts" setting can be found in the Sitecore.ContentSearch.Analytics.config file, and is set to "true" by default.

To quote the setting comments found in this file:

"This setting specifies whether anonymous contacts and their interactions are indexed.
If true, all contacts and all their interactions are indexed. If false, only identified contacts and their interactions are indexed. Default value: true".

One of the key changes to the core code can be seen in in the Sitecore.ContentSearch.Analytics assembly. The magic is on line 14:

1:  using Sitecore.Analytics.Model.Entities;  
2:  using Sitecore.ContentSearch.Analytics.Abstractions;  
3:  using Sitecore.Diagnostics;  
5:  namespace Sitecore.ContentSearch.Analytics.Extensions  
6:  {  
7:   public static class ContactExtensions  
8:   {  
9:    public static bool ShouldBeIndexed(this IContact contact)  
10:    {  
11:     Assert.ArgumentNotNull((object) contact, "contact");  
12:     ISettingsAnalytics instance = ContentSearchManager.Locator.GetInstance<ISettingsAnalytics>();  
13:     Assert.IsNotNull((object) instance, "Settings for contact segmentation index cannot be found.");  
14:     if (instance.IndexAnonymousContacts())  
15:      return true;  
16:     return !string.IsNullOrEmpty(contact.Identifiers.Identifier);  
17:    }  
18:   }  
19:  }  

Why does this matter? 

One of our clients started having some severe Apache Solr issues due to the JVM using a massive amount of memory after running xDB for several months. After our investigation, we discovered that the root cause of the memory usage was due to the analytics index being pounded during the aggregation process. 

The JVM memory usage was like a ticking time bomb. As we started collecting more and more analytics data, our java.exe process started using more and more memory. 

At launch, we gave 4GB to the Java heap size (for more info around this, you can Google Xms<size> -Xmx<size> values). After a few months of running the sites and discovering the memory issue, we felt as though perhaps we assigned our Xmx too low, and upped the memory limit to 8GB. A few weeks later, we outgrow this limit, and we bumped it up to 16GB. 

The high memory usage would eventually cause Solr not respond to query requests and the Sitecore instance to stop functioning. As we know, Sitecore is heavily dependent on an indexing technology (Solr or Lucene), and if it fails, chances are your instance will stop functioning unless you have a magically patch that I mentioned in my previous post:

Analytics Index Comparison 

After upgrading our instance from 8.1 U1 to 8.1 U3, and disabling this setting, we performed an index size comparison. Our analytics index went from 21,728,706 docs and 8GB in size to 0 docs and 101 bytes in size (empty). It's important to note that this is because we currently don’t have any identified contacts within xDB. I find it hard to believe that when we start our contact identification process using CRM system data, that we will be seeing sizes like this in the future.

Final Thoughts 

This setting has made a major different in the stability of our client's high traffic Sitecore sites. It's up to you and your team to decide how important it is to have those anonymous contact records show up in the Experience Profile dashboard. 

To us, it was a no-brainer.