Thursday, May 16, 2019

Going to Production with Sitecore 9.1 on Azure PaaS: Critical Patches Required For Stability

After spending several months upgrading our custom solution to Sitecore 9.1, and launching on Azure PaaS, I have learned a lot about what it takes to eventually see the sunshine between those stormy clouds.

This is the first of a series of posts intended to help you and your team make the transition as smooth as possible.

Critical Patches

There are several patches and configuration changes that you must deploy in order to run a stable Sitecore 9.1 environment on Azure PaaS.

High CPU - Excessive Thread Consumption

Sitecore's traditional server roles (Content Management, Content Delivery, etc.) operate in a synchronous context, while xConnect operations are asynchronous. Therefore, communication between your traditional Sitecore servers and xConnect is performed in a synchronous to asynchronous context.

This sync to async operation requires double the number of threads on the sync side in order to do the job. This can leave too few threads available to unblock the main thread.

Sitecore handled this excessive threading problem in their application code by building a custom thread scheduler. What this does is take advantage of a blocked thread to execute the operation, thus reducing the need for the additional thread, and making this synchronous to asynchronous context more efficient.
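To make the cost of the sync to async pattern concrete, here is a minimal C# sketch. It is not Sitecore's code: `FetchContactAsync` is a hypothetical stand-in for an xConnect call. The caller blocks its own thread on `GetResult()`, while a second thread pool thread is needed to finish the asynchronous work, which is the thread doubling described above.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class SyncOverAsyncDemo
{
    // Hypothetical stand-in for an asynchronous xConnect call.
    public static async Task<string> FetchContactAsync()
    {
        await Task.Delay(100).ConfigureAwait(false); // simulated network I/O
        // The code after the await resumes on a thread pool thread.
        return "contact fetched on thread " + Thread.CurrentThread.ManagedThreadId;
    }

    // Synchronous caller: its thread is blocked until the task completes.
    public static string FetchContactBlocking()
    {
        return FetchContactAsync().GetAwaiter().GetResult();
    }

    public static void Main()
    {
        Console.WriteLine("caller thread " + Thread.CurrentThread.ManagedThreadId);
        Console.WriteLine(FetchContactBlocking());
    }
}
```

Under load, every blocked caller holds one thread while waiting for another one to become free, which is why the pool can run dry.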

Great stuff right? Well, the problem that everyone will be faced with is that if you are not using an exact version of the System.Net.Http library, this thread scheduler simply doesn't work!

New versions of System.Net.Http don't respect the custom thread schedulers that Sitecore has built.

With the configurations that are shipped with Sitecore 9.x, the application uses the Global Assembly Cache to reference System.Net.Http, and 9 times out of 10, it will be a newer version of this library.

Without this thread scheduler working, you will end up with high CPU due to thread blocking, and your application will start failing to respond to incoming http requests.

In my case, I saw blocking appear in session end pipelines, and also in some calls on my Content Management server when working with EXM and contacts.

More detail about this issue, and the fix, is described in this article:

When you read the article, you would think that it doesn't apply to you because it is referring to .NET 4.7.2, and if you are working with Sitecore 9.x, the application ships using 4.7.1.

The truth is that it does! You need to perform the following actions in order to fix the threading problem:

1. Apply the binding redirect to your web.config to force Sitecore to use the System.Net.Http version mentioned in the article:

2. Deploy the System.Net.Http version to the bin folder on all your traditional Sitecore instances.

NOTE: Make sure you remove any duplicate System.Net.Http binding redirect entries in your web.config, and that you only have the one described above.

Reference Data

First Issue

The first patch you need adds the ability to configure cache sizes and expiration times for the UserAgentDictionaryCache, ReferringSitesDictionary, and GeoIpDataDictionary, and the size of the ReferenceDataClientDictionary cache. Without this patch, you will see high DTU usage (up to 100%) in your Reference Data database, because a bug allows the cache to grow enormously, which leads to performance issues and shutdowns.

In order to fix the issue, you need to review the following KB article:

In our 9.1 instance, I used the version of the patch.

Second Issue

The first patch is not enough to fix your Reference Data woes. There is also a set of stored procedure performance issues that surface when querying the Reference Data database.

You will need to download and execute the following SQL scripts in order to fix this issue:

Redis Session Provider

First Issue

If you are on Azure PaaS, you will most definitely be using Redis as your out-of-process session state provider.

Patch 210408 is critical for the stability of session state in your environment.

This patch limits the number of worker threads per CPU core and also reserves threads so they can handle session end requests with as little delay as possible. Reading between the lines, this patch simply handles the Redis timeout issue more gracefully.

Without this, you will see session end events using all the threads and leaving no room to handle incoming HTTP requests. After hanging for some time, those requests eventually fail with a 502 error due to a timeout.

After applying the patch, the timeout settings referenced in this KB article will need to be made in both your web.config and Sitecore.Analytics.Tracking.config. You also want to update your pollingInterval to 60 seconds to reduce the stress on your Redis instance as well.

Note: Depending on how much traffic your site takes on, you may need to adjust the patch settings in order to free up more threads.

So for example, you can take the original settings and multiply them by a factor of 3 or 4. As I mentioned before, it will be up to you to determine the right factor, based on the load you experience.

Example with multiplication factor of 3:

For my shared session tracker update, I created a patch file like the following:

Second Issue

Gabe Streza has a great post regarding the symptoms experienced when Redis instances powering your session state are under load:

It's important to read through his post, and also Sitecore's KB article:

What both are basically saying is that you will need to create a new Redis instance in Azure, so that you can split your private sessions and shared sessions. So, to be clear, you will have one Redis Instance to handle private sessions and another to handle shared sessions.

I decided to keep my existing Redis instance to handle shared sessions, and used the new Redis instance to handle private sessions.

Similar to Gabe's steps, I created a new redis.sessions.private entry in ConnectionStrings.config.

I then updated my Session State provider in my web.config to the following:

Final Thoughts 

These fixes have made a night and day difference on the stability of our high traffic 9.1 sites on Azure PaaS.

Feel free to reach out to me on Twitter or Sitecore Slack if you have any questions.

Monday, January 21, 2019

Improving the Sitecore Broken Links Removal Tool



While working through an upgrade to Sitecore 9.1, I ran into a broken links issue that couldn't be resolved using Sitecore's standard Broken Links Removal tool.

While searching the internet, I was able to determine that I wasn't the only one that faced these types of issues.

In this post, I intend to walk you through the link problems that I ran into, and why I decided to create an updated Broken Links Removal tool to overcome the issues that the standard links removal tool wasn't able to resolve.

NOTE: The issues that I present in this post are not specific to version 9.1.  They exist in Sitecore versions going back to 8.x.

Exceptions after Upgrade Package Installation

After installing the 9.1 upgrade package and completing the post installation steps of rebuilding the links database and publishing, I discovered that lots of my site's pages started throwing the following exceptions:

The model item passed into the dictionary is of type 'Castle.Proxies.IGlassBaseProxy', but this dictionary requires a model item of type 'My Custom Model'.

During the solution upgrade, I had upgraded Glass Mapper to version 5, so I thought that the issue could be related to this. After digging in, I noticed that the items / pages that were throwing exceptions had broken links. I determined this by turning on Broken Links using the Sitecore Gutter in the Content Editor.

Next, I attempted to run the Broken Links Removal tool located at http://{your-sitecore-url}/sitecore/admin/RemoveBrokenLinks.aspx.

After it had run for several minutes, it threw the following exception:

ERROR Error looking up template field. Field id: {00000000-0000-0000-0000-000000000000}. Template id: {128ADD89-E6BC-4C54-82B4-A0915A56B0BD}
Exception: System.ArgumentException
Message: Null ids are not allowed.
Parameter name: fieldID
Source: Sitecore.Kernel
   at Sitecore.Diagnostics.Assert.ArgumentNotNullOrEmpty(ID argument, String argumentName)
   at Sitecore.Data.Templates.Template.DoGetField(ID fieldID, String fieldName, Stack`1 stack)
   at Sitecore.Data.Templates.Template.GetField(ID fieldID)

Digging In

I needed to understand why this exception was being thrown, and started down the path of decompiling Sitecore's assemblies.  My starting point for reviewing the code was Sitecore.sitecore.admin.RemoveBrokenLinks.cs which is the code behind for the Broken Links Removal page.

I took all the code and pasted it into my own ASPX page so that I could throw in a break point and debug what was going on. After a lot of trial and error and a ton of logging, I discovered that the code that was throwing the error existed in the FixBrokenLinksInDatabase method on line 11 shown below:

If the Source Field ID / "itemLink.SourceFieldID" on line 11 is null (this is the field where it has determined that there is a broken link), the exception noted above will be thrown.

The Cause of the Null Source Field

During my investigation, I determined that the cause of this field being null was due to the item being created from a branch template that no longer existed.

To put this another way, the target item represented as the sourceItem in the code above (line 8) had a reference to a branch template that no longer existed, and the lookup for the item was returning a null source field.

Through my code logging and Content Editor validation, I found that we had a massive number of broken links caused by a developer deleting several EXM branch templates:

Stack Exchange and Sitecore Community uncovered some decent information regarding this type of issue, and how to solve it manually by running a SQL query:

Now, to fix this problem automatically using the tool, I just needed to add a null check in the code, and also create a way to clean up the references to the invalid branch templates.
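The null check can be sketched roughly as follows. This is a simplified, self-contained illustration and not the tool's actual code: `LinkRecord` and `BrokenLinkTriage` are hypothetical stand-ins for Sitecore's link types, and `Guid.Empty` models the all-zero field ID seen in the exception above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified stand-in for a Sitecore item link record.
public sealed class LinkRecord
{
    public Guid SourceItemId { get; set; }
    public Guid SourceFieldId { get; set; } // Guid.Empty models a null field ID
}

public static class BrokenLinkTriage
{
    // Returns the links that are safe to pass to a template field lookup and
    // collects the ones with a null source field for separate cleanup.
    public static List<LinkRecord> FilterFixable(
        IEnumerable<LinkRecord> brokenLinks, List<LinkRecord> needsBranchCleanup)
    {
        var fixable = new List<LinkRecord>();
        foreach (var link in brokenLinks)
        {
            if (link.SourceFieldId == Guid.Empty)
            {
                // A field lookup with this ID would throw "Null ids are not allowed".
                needsBranchCleanup.Add(link);
                continue;
            }
            fixable.Add(link);
        }
        return fixable;
    }

    public static void Main()
    {
        var links = new[]
        {
            new LinkRecord { SourceItemId = Guid.NewGuid(), SourceFieldId = Guid.NewGuid() },
            new LinkRecord { SourceItemId = Guid.NewGuid(), SourceFieldId = Guid.Empty }
        };
        var cleanup = new List<LinkRecord>();
        var fixable = FilterFixable(links, cleanup);
        Console.WriteLine(fixable.Count + " fixable, " + cleanup.Count + " need branch cleanup");
    }
}
```

Separating the two groups keeps the standard field-based fix untouched, while the links with a null source field go down their own cleanup path.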

Improved Broken Links Tool

The outcome of my work was an improved Broken Links Removal tool that I call the "Broken Links Eraser".

The tool does everything that the Sitecore Broken Links Removal tool does, with the following improvements:

  • Detects and removes item references to branch templates that no longer exist.
  • Removes all invalid item field references to other items (inspects all fields that contain an id).
  • Allows you to target broken links using a target path, so you don't have to run through every item in the target database. This is useful when working with large sets of content.
  • Has detailed logging while it is running and feedback after it has completed. 

The tool is built as a standalone ASPX page, so you can simply drop the file in your {webroot}/sitecore/admin folder to use it. No need to deploy assemblies and recycle app pools etc.

All updates were made using Sitecore's SqlDataApi, so the code is consistent with Sitecore's standards. The code is available on GitHub for you to download and modify as needed:

Final Thoughts

I hope that you find this tool useful in solving your broken link issues. Please feel free to add comments or contact me with any questions on either Sitecore Slack or Twitter.

Monday, December 3, 2018

Fix Email Campaign Pausing: Sitecore Email Experience Manager 3.x Retry Data Provider



My company uses Email Experience Manager (EXM) to send several million emails a day, and we have been facing issues where our large campaigns would pause mid-send.

We have a scaled EXM environment with 2 dedicated dispatch servers and a separate SQL Server, all with appropriate resources, so the hardware was not an issue. We also ensured that databases were kept in tiptop condition (proper maintenance plans with stats being updated), and that configurations were optimal for our environment.

The cause of the pausing

After digging in, I discovered that the pausing was caused by SQL deadlocks due to the massive number of records and CRUD activity on the EXM SQL databases.

Sample Exception:

 ERROR Transaction (Process ID 116) was deadlocked on lock | communication buffer resources with another process and has been chosen as the deadlock victim. Rerun the transaction.  
 Exception: System.Data.SqlClient.SqlException  
 Message: Transaction (Process ID 116) was deadlocked on lock | communication buffer resources with another process and has been chosen as the deadlock victim. Rerun the transaction.  
 Source: .Net SqlClient Data Provider  
   at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)  
   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)  
   at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)  
   at System.Data.SqlClient.SqlDataReader.TryHasMoreRows(Boolean& moreRows)  
   at System.Data.SqlClient.SqlDataReader.TryReadInternal(Boolean setTimeout, Boolean& more)  
   at System.Data.SqlClient.SqlDataReader.Read()  
   at System.Data.SqlClient.SqlCommand.CompleteExecuteScalar(SqlDataReader ds, Boolean returnSqlValue)  
   at System.Data.SqlClient.SqlCommand.ExecuteScalar()  
   at Sitecore.Modules.EmailCampaign.Core.Data.SqlDbEcmDataProvider.CountRecipientsInDispatchQueue(Guid messageId, RecipientQueue[] queueStates)  
   at Sitecore.Modules.EmailCampaign.Core.Gateways.DefaultEcmDataGateway.CountRecipientsInDispatchQueue(Guid messageId, RecipientQueue[] queueStates)  
   at Sitecore.Modules.EmailCampaign.Core.Analytics.MessageStatistics.get_Unprocessed()  
   at Sitecore.Modules.EmailCampaign.Core.Analytics.MessageStatistics.get_Processed()  
   at Sitecore.Modules.EmailCampaign.Core.MessageStateInfo.InitializeSendingState()  
   at Sitecore.Modules.EmailCampaign.Core.MessageStateInfo.InitializeMessageStateInfo()  
   at Sitecore.Modules.EmailCampaign.Factory.GetMessageStateInfo(String messageItemId, String contextLanguage)  
   at Sitecore.EmailCampaign.Server.Services.MessageInfoService.Get(String messageId, String contextLanguage)  
   at Sitecore.EmailCampaign.Server.Controllers.MessageInfo.MessageInfoController.MessageInfo(MessageInfoContext data)  

How does this new data provider fix the problem?

The new data provider introduces SQL deadlock handling. When a deadlock is detected, it waits and then retries the transaction.

By default, it waits 5 seconds before each retry and attempts a deadlocked transaction up to 3 times. The DelaySeconds and RetryCount settings can be modified to suit your needs.

 <configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
   <ecmDataProvider defaultProvider="sqlretry">
     <add name="sqlretry" type="Sitecore.EmailCampaign.RetryDataProvider.RetrySqlDbEcmDataProvider, Sitecore.EmailCampaign.RetryDataProvider" connectionStringName="exm.master">
       <Logger type="Sitecore.ExM.Framework.Diagnostics.Logger, Sitecore.ExM.Framework" factoryMethod="get_Instance"/>
     </add>
     <add name="sqlbase" type="Sitecore.Modules.EmailCampaign.Core.Data.SqlDbEcmDataProvider, Sitecore.EmailCampaign" connectionStringName="exm.master">
       <Logger type="Sitecore.ExM.Framework.Diagnostics.Logger, Sitecore.ExM.Framework" factoryMethod="get_Instance"/>
     </add>
   </ecmDataProvider>
 </configuration>
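
The retry behaviour described above can be sketched in a few lines of C#. This is an illustration of the pattern under stated assumptions, not the provider's actual source: the `DeadlockRetry` class and the `isDeadlock` predicate are hypothetical names, and the demo simulates the deadlock with an `InvalidOperationException` so that it runs without a database.

```csharp
using System;
using System.Threading;

public static class DeadlockRetry
{
    // Runs the operation and retries it when isDeadlock identifies the failure
    // as a deadlock. retryCount is the total number of attempts.
    // With System.Data.SqlClient, the predicate could be:
    //   ex => ex is SqlException sql && sql.Number == 1205  (deadlock victim)
    public static T Execute<T>(Func<T> operation, Func<Exception, bool> isDeadlock,
                               int retryCount = 3, int delaySeconds = 5)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return operation();
            }
            catch (Exception ex) when (isDeadlock(ex) && attempt < retryCount)
            {
                // Give the competing transaction time to finish, then retry.
                Thread.Sleep(TimeSpan.FromSeconds(delaySeconds));
            }
        }
    }

    public static void Main()
    {
        int calls = 0;
        int result = Execute(() =>
        {
            calls++;
            if (calls < 3) throw new InvalidOperationException("simulated deadlock");
            return 42;
        },
        ex => ex is InvalidOperationException, retryCount: 3, delaySeconds: 0);
        Console.WriteLine("succeeded after " + calls + " attempts, result " + result);
    }
}
```

Once the last attempt has failed, the exception is allowed to propagate, so a persistent deadlock still surfaces in the logs.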

Source Code and Documentation

Full source code, documentation and the package download are available from my GitHub repository:

Sunday, September 30, 2018

Sitecore Azure PaaS: Updating Your License File For All Deployed App Service Instances



If you are provisioning a new set of Sitecore environments on your own, or if the Sitecore Managed Cloud Hosting Team provisions your environments for you, you will most likely be using a temporary license file that is valid for 1 month while you are waiting for your permanent license file.

When the temporary license expires, your Sitecore instances will stop working. Therefore, it is important that you upload a valid permanent license.xml file as soon as it is available.

File Locations

In an XP Scaled environment, there are many different App Services and locations where the license.xml file will need to be updated.

I created a list of the App Service roles and the license file locations for your reference:

App Service Role    License File Location
xc-search           \App_data & \App_data\jobs\continuous\IndexWorker\App_data
ma-ops              \App_data & \App_data\jobs\continuous\AutomationEngine\App_Data
cd                  \App_data
cm                  \App_data
ma-rep              \App_data
prc                 \App_data
rep                 \App_data
xc-collect          \App_data
xc-refdata          \App_data

Updating the License File

The easiest way to update the file is to use the Debug console in the Kudu "Advanced Tools" in your App Service Instance, or an FTPS client to connect directly to the App's filesystem.

Thursday, September 20, 2018

Sitecore GeoIP - A Developer's Guide To What Has Changed In Sitecore 9

In my previous post, I took a dive into the 8.x version of the Sitecore GeoIP service from a developer's point of view. Sitecore 9 introduced great improvements to xDB, and GeoIP is one of the features that benefited.

In this post, I intend to help developers understand what has changed in Sitecore 9 GeoIP.  Like my previous post, the purpose is to arm developers with the necessary details to understand what is happening under the hood, so that they can successfully troubleshoot a problem if one arises.

Reference Data

One of the first things that I discovered when diving into version 9 is the use of a series of "ReferenceDataClientDictionaries" that are exposed to us as "KnownDataDictionaries".

As implied by the name, these are known collections of things that are used to store common data, one being IP Geolocation data. The data is ultimately stored in a SQL database, so that it can be referenced throughout the Experience Platform.

There is a new pipeline in Sitecore 9 that initializes these dictionaries, as shown here:


<initializeKnownDataDictionaries patch:source="Sitecore.Analytics.Tracking.config">
  <processor type="Sitecore.Analytics.DataAccess.Pipelines.InitializeKnownDataDictionaries.InitializeKnownDataDictionariesProcessor, Sitecore.Analytics.DataAccess"/>
  <processor type="Sitecore.Analytics.XConnect.DataAccess.Pipelines.InitializeKnownDataDictionaries.InitializeDeviceDataDictionaryProcessor, Sitecore.Analytics.XConnect" patch:source="Sitecore.Analytics.Tracking.Database.config"/>
</initializeKnownDataDictionaries>

Processor Code:

namespace Sitecore.Analytics.DataAccess.Pipelines.InitializeKnownDataDictionaries
{
  public class InitializeKnownDataDictionariesProcessor : InitializeKnownDataDictionariesProcessorBase
  {
    public override void Process(InitializeKnownDataDictionariesArgs args)
    {
      Condition.Requires<InitializeKnownDataDictionariesArgs>(args, nameof (args)).IsNotNull<InitializeKnownDataDictionariesArgs>();
      GetDictionaryDataPipelineArgs args1 = new GetDictionaryDataPipelineArgs();
      Condition.Ensures<DictionaryBase>(args1.Result).IsNotNull<DictionaryBase>("Check configuration, 'getDictionaryDataStorage' pipeline  must set args.Result property with instance of DictionaryBase type.");
      args.LocationsDictionary = new LocationsDictionary(args1.Result);
      args.ReferringSitesDictionary = new ReferringSitesDictionary(args1.Result);
      args.GeoIpDataDictionary = new GeoIpDataDictionary(args1.Result);
      args.UserAgentsDictionary = new UserAgentsDictionary(args1.Result);
      args.DeviceDictionary = new DeviceDictionary(args1.Result);
    }
  }
}

In the Process method above, the GeoIpDataDictionary object being created inherits from Sitecore's new ReferenceDataDictionary.

This is the glue between GeoIP and the new Reference Data "shared storage" mechanism.

Here is what the code looks like:

namespace Sitecore.Analytics.DataAccess.Dictionaries
{
  public class GeoIpDataDictionary : ReferenceDataDictionary<Guid, GeoIpData>
  {
    public GeoIpDataDictionary(DictionaryBase dictionary, int cacheSize)
      : base(dictionary, "GeoIpDataDictionaryCache", XdbSettings.GeoIps.CacheSize * cacheSize)
    {
      this.ReadCounter = AnalyticsDataAccessCount.DataDictionariesGeoIpsReads;
      this.WriteCounter = AnalyticsDataAccessCount.DataDictionariesGeoIpsWrites;
      this.CacheHitCounter = AnalyticsDataAccessCount.DataDictionariesGeoIpsCacheHits;
      this.DataStoreReadCounter = AnalyticsDataAccessCount.DataDictionariesGeoIpsDataStoreReads;
      this.DataStoreReadTimeCounter = AnalyticsDataAccessCount.DataDictionariesGeoIpsDataStoreReadTime;
      this.DataStoreWriteTimeCounter = AnalyticsDataAccessCount.DataDictionariesGeoIpsDataStoreWriteTime;
    }

    public GeoIpDataDictionary(DictionaryBase dictionary)
      : this(dictionary, XdbSettings.GeoIps.CacheSize)
    {
    }

    public override TimeSpan CacheExpirationTimeout
    {
      get
      {
        return TimeSpan.FromSeconds(600.0);
      }
    }

    public override Guid GetKey(GeoIpData value)
    {
      return value.Id;
    }

    public string GetKey(Guid id)
    {
      return id.ToString();
    }
  }
}

Notice that the CacheExpirationTimeout property caches this object for 10 minutes (600 seconds). More on this below.

Reference Data Storage and the GeoIP Lookup Flow

You may be wondering how this Reference Data feature changes what you know about the GeoIP flow from previous versions of the platform.

Let's review the steps:

  • Sitecore runs the CreateVisits pipeline. Within this pipeline, there is a processor called UpdateGeoIpData that fires a method called GeoIpManager.GetGeoIpData within Sitecore.Analytics.Tracking.CurrentVisitContext that initiates the GeoIP lookup for the visitor's interaction.

  • Sitecore performs a GeoIP data lookup in the GeoIP memory cache.
    • NOTE: Cache expiration is set to 10 seconds => TimeSpan.FromSeconds(10.0)


    public void Add(GeoIpHandle handle)
      Assert.ArgumentNotNull((object) handle, nameof (handle));
      if (this.cache.Count >= this.maxCount)
      this.cache.Add(handle.Id, (object) handle, TimeSpan.FromSeconds(10.0));
      AnalyticsTrackingCount.GeoIPCacheSize.Value = (long) this.cache.Count;

  • If the GeoIP data IS in the GeoIP memory cache, then it will attach it to the visitor's interaction.

  • If the GeoIP data IS NOT in the GeoIP memory cache, it performs a lookup in the Reference Data's GeoIpDataDictionary (KnownDictionaries) memory cache.
    • NOTE: Cache expiration is set to 10 minutes => TimeSpan.FromSeconds(600.0). See above for the 10 minute CacheExpirationTimeout property on the Sitecore.Analytics.DataAccess.Dictionaries.GeoIpDataDictionary class.

  • If the GeoIP data IS in the Reference Data's GeoIpDataDictionary memory cache, it attaches it to the visitor's interaction and adds it to the GeoIP memory cache.

  • If the GeoIP data IS NOT in the Reference Data's GeoIpDataDictionary memory cache, it performs a lookup in the SQL ReferenceData database and if found, stores the result in the Reference Data's GeoIpDataDictionary cache and GeoIP memory cache, and then attaches it to the visitor's interaction.

  • If the GeoIP data IS NOT in the SQL ReferenceData database, it performs a lookup using the Sitecore Geolocation service and stores the result in the SQL ReferenceData database, the Reference Data's GeoIpDataDictionary cache and GeoIP memory cache, and then attaches it to the visitor's interaction.
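
The lookup order in the steps above can be modelled with a small, self-contained C# sketch. It is an illustration only and does not use Sitecore's classes: `TieredGeoLookup`, `LastSource` and the in-memory `sqlStore` dictionary are hypothetical stand-ins, and the clock is passed in explicitly so the two expiration times (10 seconds and 10 minutes) are easy to follow.

```csharp
using System;
using System.Collections.Generic;

public sealed class TieredGeoLookup
{
    private sealed class Entry { public string Value; public DateTime ExpiresUtc; }

    private readonly Dictionary<Guid, Entry> geoIpCache = new Dictionary<Guid, Entry>();      // 10 seconds
    private readonly Dictionary<Guid, Entry> dictionaryCache = new Dictionary<Guid, Entry>(); // 10 minutes
    private readonly Dictionary<Guid, string> sqlStore = new Dictionary<Guid, string>();      // stand-in for SQL
    private readonly Func<Guid, string> geoService;                                           // stand-in for the service

    public string LastSource { get; private set; }

    public TieredGeoLookup(Func<Guid, string> geoService) { this.geoService = geoService; }

    public string Get(Guid id, DateTime nowUtc)
    {
        if (TryGet(geoIpCache, id, nowUtc, out var value)) { LastSource = "geoip-cache"; return value; }

        if (TryGet(dictionaryCache, id, nowUtc, out value))
        {
            LastSource = "dictionary-cache";
        }
        else
        {
            if (sqlStore.TryGetValue(id, out value)) { LastSource = "sql"; }
            else { value = geoService(id); sqlStore[id] = value; LastSource = "service"; }
            Put(dictionaryCache, id, value, nowUtc, TimeSpan.FromMinutes(10));
        }

        // Every miss in the first tier refreshes the short-lived GeoIP cache.
        Put(geoIpCache, id, value, nowUtc, TimeSpan.FromSeconds(10));
        return value;
    }

    private static bool TryGet(Dictionary<Guid, Entry> cache, Guid id, DateTime nowUtc, out string value)
    {
        if (cache.TryGetValue(id, out var entry) && entry.ExpiresUtc > nowUtc) { value = entry.Value; return true; }
        value = null;
        return false;
    }

    private static void Put(Dictionary<Guid, Entry> cache, Guid id, string value, DateTime nowUtc, TimeSpan ttl)
    {
        cache[id] = new Entry { Value = value, ExpiresUtc = nowUtc + ttl };
    }

    public static void Main()
    {
        var lookup = new TieredGeoLookup(id => "geo data for " + id);
        var visitor = Guid.NewGuid();
        var start = DateTime.UtcNow;
        foreach (var offset in new[] { 0, 0, 11, 660 }) // seconds after the first visit
        {
            lookup.Get(visitor, start.AddSeconds(offset));
            Console.WriteLine("+" + offset + "s served from " + lookup.LastSource);
        }
    }
}
```

With such a short first-tier expiration, most repeat lookups for the same visitor end up being served from the 10 minute Reference Data cache rather than from the GeoIP cache.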

Reference Data Storage in SQL

By using SQL Server Management Studio, and opening up the ReferenceData database's DefinitionTypes table, you can see the different types of reference data that are being stored. The GeoIP data type is named "Tracking Dictionary - GeoIpData".

By looking at the Definitions table, you can see that the data is stored as a Binary data type:

The following SQL Query will return the top 100 GeoIP reference data results:

SELECT TOP 100
    dt.Name, d.Data, d.IsActive, d.LastModified, d.Version
FROM [xdb_refdata].[Definitions] AS d
INNER JOIN [xdb_refdata].[DefinitionTypes] AS dt ON dt.ID = d.TypeID
WHERE dt.Name = 'Tracking Dictionary - GeoIpData'

Changes to the GeoIpManager class

Finally, I wanted to provide a glimpse of the changes in the GeoIpManager class that I referenced in my previous post.

By comparing the 8.x version of the GeoIpManager code to 9, you can see the usage of the KnownDataDictionaries.GeoIPs dictionary instead of the Tracker.Dictionaries.GeoIpData (ContactLocation class) from 8.x:

Final Words

I hope that this information helps developers understand more about Reference Data and the updated GeoIP Lookup Flow in Sitecore 9.

As always, feel free to comment or reach me on Slack or Twitter if you have any questions.

Thursday, August 16, 2018

Sitecore GeoIP - What Is Happening Under The Hood In 8.x?



Most posts explain how Sitecore's GeoIP service works from a high-level point of view.

In this post, I intend to take the explanation a few steps deeper, so that developers can understand all the pieces that make this process work. The goal is to arm developers with the necessary details to successfully troubleshoot a problem if one arises.

Visitor Interaction - Start of visitor's session

  • Visitor visits Sitecore website, and this is regarded as a new interaction. Sitecore's definition of an interaction is ".. any point at which a contact interfaces with a brand, either online or offline". In our case, this is a new visitor session on the website.

  • Sitecore runs the CreateVisits pipeline. Within this pipeline, there is a processor called UpdateGeoIpData that fires a method called GeoIpManager.GetGeoIpData within Sitecore.Analytics.Tracking.CurrentVisitContext that initiates the GeoIP lookup for the visitor's interaction.

  • Within the GeoIP lookup logic, Sitecore uses the visitor's IP address to generate a unique identifier (GUID). For example: fd747022-dd48-b1ca-1312-eb4ba55030b2.

NOTE: Sitecore performs all GeoIP lookups using this unique identifier. You can see this id by looking inside your MongoDB's GeoIPs collection. The field is named _id and this is the unique naming convention that MongoDB uses across all of its content. See my previous post for a snapshot.

  • Sitecore performs a GeoIP data lookup in memory cache.

  • If the GeoIP data IS in memory cache, then it will attach it to the visitor's interaction.

  • If the GeoIP data IS NOT in memory cache, it performs a GeoIP lookup in the MongoDB Analytics database's GeoIps collection.

  • If the GeoIP data IS in the MongoDB Analytics database's GeoIps collection, it attaches it to the visitor's interaction and stores the result in memory cache.

  • If the GeoIP data IS NOT in the GeoIps collection, it performs a lookup using the Sitecore Geolocation service and stores the result in memory cache and attaches it to the visitor's interaction.

NOTE: After a successful lookup, the GeoIP data is stored in the Tracker.Current.Interaction.GeoData (ContactLocation class)

GeoIP Data Cache

  • When the GeoIP data is obtained, it is added to a dictionary object that is part of the Sitecore Tracker so that it can be referenced via the Tracker.Current.Interaction.GeoData (shown above).

  • The odd thing that I noticed was that the cache expiration was set to 10 seconds (by default)
          Code reference:
          private readonly TimeSpan defaultCacheExpirationTimeout = TimeSpan.FromSeconds(10.0);

GeoIP Data - End of visitor's session

  • At the end of the visitor's interaction / session, Sitecore will run the CommitSession pipeline.

  • Like the CreateVisits pipeline, there is a processor called UpdateGeoIpData that fires a method called GeoIpManager.GetGeoIpData (with the exact same code as in the CreateVisits pipeline). This initiates the GeoIP lookup flow once again (Cache / MongoDB / GeoIP Service).

  • It seems that the intention here is to confirm the visitor's GeoData before storing the data in MongoDB, from where it will ultimately make its way to the reporting database.

More To Come

Next, I intend to dig into Sitecore's GeoIP code for the 9.x series, and talk about the differences identified in that implementation.

Friday, July 27, 2018

Sitecore xDB - GeoIP and Contention Dynamics in MongoDB



In my previous post, I discussed how our team has been diligently working to alleviate pressure on our servers and MongoDB, on a high-traffic client's Sitecore Commerce site.

We use mLab to host our Experience Database, and while monitoring the telemetry of the cluster, we noticed a series of contention indicators related to the increased number of queries and connections during high-traffic surges during the day.

In our scenario, our client's site has a lunchtime traffic surge between 11am and 3pm every day.

Contention Dynamics

Overall, our MongoDB was not being over-taxed in terms of overall capacity, as we were not using up all the RAM and CPU, but the telemetry charts did show what looked like pretty clear contention.

We noticed a certain pattern and volume in the site's traffic that led to contention dynamics on our MongoDB nodes. The contention would eventually start to affect the Sitecore Content Delivery servers, which were obviously also dealing with that day's peak load of lunchtime web traffic.

We were seeing a surge in connections with data reads, as reflected in MongoDB metrics such as the count of Queries (Operations Per Second) and the Returned documents count (Docs Affected Per Second). This was leading to a high degree of contention, as reflected in various other MongoDB metrics (CPU time, queues, disk I/O, page faults).

Our initial theory was that the contention in MongoDB was caused by the high volume of lunchtime traffic in Sitecore, but in an indirect way.

GeoIP and MongoDB

Having troubleshot Sitecore's GeoIP service before, I had a pretty good understanding of the flow.

If you need some insight, I suggest reading Grant Killian's post:

In summary, the flow looks like this:
  • Visitor visits Sitecore website
  • Sitecore performs a GeoIP information lookup from the memory cache using the visitor's IP address
  • If the GeoIP information IS in memory cache then it uses it in the visitor's interaction
  • If the GeoIP information IS NOT in memory cache, it performs a GeoIP lookup in the MongoDB Analytics database's GeoIps collection
  • If the GeoIP information IS in the MongoDB Analytics database's GeoIps collection, it uses it in the visitor's interaction and stores the result in memory cache
  • If the GeoIP information IS NOT in the GeoIps collection, it performs a lookup using the Sitecore Geolocation service and stores the result in memory and uses it in the visitor's interaction

Our high-traffic site makes heavy use of GeoIP, as the Home Page is personalized based on the visitor's location and local time. 

There had to be a correlation between the high-traffic, GeoIP and the activity we were seeing on our MongoDB cluster. 

The item that stood out to me was the step above that performs the GeoIP lookup against the MongoDB Analytics GeoIps collection.

Running a record count query against the GeoIps collection, we discovered that it contained 7.4 million records! This confirmed our theory that the MongoDB GeoIps collection was heavily populated and being used for the lookups to hydrate the visitor's interaction and memory cache.

As a side note, if you crack open the interaction collection, you can see how Sitecore ties the GeoIP data from the lookup to the visitor's interaction (this is old news):

GeoIP Cache Settings

After digging into the code, we discovered that Sitecore's GeoIP service uses the cache called LegacyLocationList to store the GeoIP lookup data after it has been returned from either MongoDB or the GeoLocation service.

The naming of the cache is what caught us by surprise. One would think that a "legacy" cache would no longer be used.

If you crack open the Sitecore.CES.GeoIp.LegacyLocation.dll with your favorite .NET decompiler, you will see the following:

We started monitoring this legacy location cache closely, and discovered that it was in fact hitting capacity and clearing frequently during our lunchtime traffic surge. This had a direct relationship with the contention we were seeing on our MongoDB nodes during that period of time.

It was obvious to us at this point, that the 12MB default size of this cache was not enough to handle all that GeoIP lookup data!
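To illustrate why an undersized cache translates into extra database reads, here is a small simulation in C#. It is a toy model resting on an assumption, namely a cache that is flushed whenever it reaches capacity; it does not reproduce Sitecore's cache implementation, and the capacities are counted in entries rather than megabytes. `CountStoreReads` is a hypothetical helper.

```csharp
using System;
using System.Collections.Generic;

public static class CacheCapacityDemo
{
    // Simulates repeat visits from a fixed set of visitors and counts how many
    // lookups miss the cache and fall through to the backing store.
    public static int CountStoreReads(int capacity, int distinctVisitors, int passes)
    {
        var cache = new HashSet<int>();
        int storeReads = 0;
        for (int pass = 0; pass < passes; pass++)
        {
            for (int visitor = 0; visitor < distinctVisitors; visitor++)
            {
                if (cache.Contains(visitor)) continue;      // cache hit
                storeReads++;                               // miss: read from the store
                if (cache.Count >= capacity) cache.Clear(); // cache is full: flush it
                cache.Add(visitor);
            }
        }
        return storeReads;
    }

    public static void Main()
    {
        Console.WriteLine("small cache: " + CountStoreReads(12, 20, 5) + " store reads");
        Console.WriteLine("large cache: " + CountStoreReads(20, 20, 5) + " store reads");
    }
}
```

When the working set fits in the cache, only the first visit per visitor reaches the store; when it does not, the repeated flushes turn return visits into store reads as well.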

GeoIP Cache Size Updates and Results

Our team decided to increase the LegacyLocationList cache size to 20MB via a simple patch update:

 <setting name="CES.GeoIp.LegacyLocation.Caching.LegacyLocationListCacheSize">
     <patch:attribute name="value">20MB</patch:attribute>
 </setting>

After our deployment, we monitored the cluster's telemetry closely. It was apparent by looking at the connection count, that there was an instant improvement resulting from the increased cache size.

Before the deployment of the cache setting change (LegacyLocationList cache default set to 12MB), we were averaging around 400 connections during the traffic surge.

After the deployment (LegacyLocationList cache size increased to 20MB), our connection count was only averaging around 200!

Over the course of several weeks, our team was happy to report that during our lunchtime traffic surges, there was a dramatic reduction in connections with Data Reads, Operations Per Second, Docs Affected Per Second, CPU time, queues, disk I/O, page faults on our MongoDB cluster.

Another positive step towards our overall goal of improving MongoDB connection management on our Content Delivery servers.

Final Note

Another special thanks to Dan Read (Arke) and Alex Mayle (Sogeti) for their contributions.