Friday, July 27, 2018

Sitecore xDB - GeoIP and Contention Dynamics in MongoDB


Background

In my previous post, I discussed how our team has been working diligently to alleviate pressure on our servers and MongoDB for a high-traffic client's Sitecore Commerce site.

We use mLab to host our Experience Database, and while monitoring the cluster's telemetry, we noticed a series of contention indicators tied to the increased number of queries and connections during the day's high-traffic surges.

In our scenario, our client's site has a lunchtime traffic surge between 11am and 3pm every day.

Contention Dynamics

Overall, our MongoDB cluster was not over-taxed in terms of capacity (we were not exhausting RAM or CPU), but the telemetry charts did show what looked like pretty clear contention.

We noticed a certain pattern and volume in the site's traffic that led to contention dynamics on our MongoDB nodes. The contention would eventually start to affect the Sitecore Content Delivery servers, which were of course also handling that day's peak lunchtime web traffic.

We were seeing a surge in connections with data reads, as reflected in MongoDB metrics such as the Queries count (Operations Per Second) and the Returned documents count (Docs Affected Per Second). This led to a high degree of contention, reflected in various other MongoDB metrics (CPU time, queues, disk I/O, page faults).

Our initial theory was that the root cause of this contention in MongoDB was the high volume of lunchtime traffic hitting Sitecore, but in an indirect way.

GeoIP and MongoDB

Having troubleshot Sitecore's GeoIP service before, I had a pretty good understanding of the flow.

If you need some insight, I suggest reading Grant Killian's post: https://grantkillian.wordpress.com/2015/03/09/geoip-resolution-for-sitecore-explained

In summary, the flow looks like this (sketched in code after the list):
  • A visitor hits the Sitecore website
  • Sitecore performs a GeoIP lookup in the memory cache using the visitor's IP address
  • If the GeoIP information IS in the memory cache, Sitecore uses it in the visitor's interaction
  • If the GeoIP information IS NOT in the memory cache, Sitecore performs a GeoIP lookup in the MongoDB Analytics database's GeoIps collection
  • If the GeoIP information IS in the GeoIps collection, Sitecore uses it in the visitor's interaction and stores the result in the memory cache
  • If the GeoIP information IS NOT in the GeoIps collection, Sitecore performs a lookup using the Sitecore Geolocation service, stores the result in the memory cache, and uses it in the visitor's interaction
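
To make the branching concrete, here is a minimal sketch of that lookup order in C#. The class, the GeoIpData type, and the two lookup helpers are illustrative stand-ins, not Sitecore's actual implementation:

 using System.Collections.Generic;

 // Illustrative sketch of the lookup order; not Sitecore's actual code.
 public class GeoIpResolver
 {
     private readonly Dictionary<string, GeoIpData> _memoryCache = new Dictionary<string, GeoIpData>();

     public GeoIpData Resolve(string ipAddress)
     {
         GeoIpData data;

         // 1. Check the memory cache first.
         if (_memoryCache.TryGetValue(ipAddress, out data))
             return data;

         // 2. Fall back to the GeoIps collection in the MongoDB Analytics database.
         data = LookupInMongoGeoIpsCollection(ipAddress);

         // 3. Last resort: the Sitecore Geolocation service.
         if (data == null)
             data = LookupInGeolocationService(ipAddress);

         // Hydrate the memory cache so later requests skip MongoDB entirely.
         if (data != null)
             _memoryCache[ipAddress] = data;

         return data;
     }

     // Hypothetical helpers standing in for the real MongoDB query and HTTP call.
     private GeoIpData LookupInMongoGeoIpsCollection(string ip) { return null; }
     private GeoIpData LookupInGeolocationService(string ip) { return null; }
 }

 public class GeoIpData { public string Country; public string City; }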

Our high-traffic site makes heavy use of GeoIP, as the Home Page is personalized based on the visitor's location and local time. 

There had to be a correlation between the high traffic, GeoIP lookups, and the activity we were seeing on our MongoDB cluster.

The step that stood out to me was the GeoIP lookup against the MongoDB Analytics database's GeoIps collection.

Running a record count query against the GeoIps collection, we discovered that it contained 7.4 million records! This confirmed our theory that the GeoIps collection was heavily populated and being used for lookups to hydrate visitors' interactions and the memory cache.
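
For reference, a record count like ours takes only a few lines with a recent 2.x C# driver; the connection string and database name below are placeholders for your own:

 using MongoDB.Bson;
 using MongoDB.Driver;

 class CountGeoIps
 {
     static void Main()
     {
         // Placeholder connection string and database name.
         var client = new MongoClient("mongodb://localhost:27017");
         var database = client.GetDatabase("sitecore_analytics");
         var geoIps = database.GetCollection<BsonDocument>("GeoIps");

         // An empty filter counts every document; ours returned ~7.4 million.
         System.Console.WriteLine(geoIps.CountDocuments(new BsonDocument()));
     }
 }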

As a side note, if you crack open the Interactions collection, you can see how Sitecore ties the GeoIP data from the lookup to the visitor's interaction (this is old news).

GeoIP Cache Settings

After digging into the code, we discovered that Sitecore's GeoIP service uses a cache called LegacyLocationList to store the GeoIP lookup data after it has been returned from either MongoDB or the Geolocation service.

The naming of the cache is what caught us by surprise. One would think that a "legacy" cache would no longer be used.

If you crack open Sitecore.CES.GeoIp.LegacyLocation.dll with your favorite .NET decompiler, you will see the cache definition and its default size.


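The decompiled source isn't reproduced verbatim here, but it boils down to something like the following rough approximation, assuming Sitecore's standard CustomCache pattern and the setting name shown in the patch further down:

 using Sitecore;
 using Sitecore.Caching;
 using Sitecore.Configuration;

 // Rough approximation, not the verbatim decompiled source: a CustomCache
 // named "LegacyLocationList" whose size comes from a setting defaulting to 12MB.
 public class LegacyLocationListCache : CustomCache
 {
     public LegacyLocationListCache()
         : base("LegacyLocationList",
             StringUtil.ParseSizeString(Settings.GetSetting(
                 "CES.GeoIp.LegacyLocation.Caching.LegacyLocationListCacheSize", "12MB")))
     {
     }
 }
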
We started monitoring this legacy location cache closely and discovered that it was in fact hitting capacity and clearing frequently during our lunchtime traffic surge. Each time the cache cleared, the next wave of visitors fell through to MongoDB lookups, which lined up directly with the contention we were seeing on our MongoDB nodes during that window.

At this point, it was obvious that the 12MB default size of this cache was not enough to hold all of that GeoIP lookup data!

GeoIP Cache Size Updates and Results

Our team decided to increase the LegacyLocationList cache size to 20MB via a simple patch update:

 <setting name="CES.GeoIp.LegacyLocation.Caching.LegacyLocationListCacheSize">
   <patch:attribute name="value">20MB</patch:attribute>
 </setting>
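
As a deployable include file, that snippet sits inside the standard patch wrapper; a minimal version (the file name you save it under is up to you) looks like this:

 <configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
   <sitecore>
     <settings>
       <setting name="CES.GeoIp.LegacyLocation.Caching.LegacyLocationListCacheSize">
         <patch:attribute name="value">20MB</patch:attribute>
       </setting>
     </settings>
   </sitecore>
 </configuration>
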
After our deployment, we monitored the cluster's telemetry closely. It was apparent from the connection count that the increased cache size brought an immediate improvement.

Before the deployment of the cache setting change (LegacyLocationList cache at its default 12MB), we were averaging around 400 connections during the traffic surge.

After the deployment (LegacyLocationList cache increased to 20MB), our connection count averaged only around 200!

Over the course of several weeks, our team was happy to report that during our lunchtime traffic surges there was a dramatic reduction in connections with data reads, Operations Per Second, Docs Affected Per Second, CPU time, queues, disk I/O, and page faults on our MongoDB cluster.

This was another positive step toward our overall goal of improving MongoDB connection management on our Content Delivery servers.

Final Note

Another special thanks to Dan Read (Arke) and Alex Mayle (Sogeti) for their contributions.

Sunday, July 8, 2018

Sitecore xDB: Performance Tuning your MongoDB Driver Configuration Settings


The Goal

Working with my team on a high-traffic client's Sitecore Commerce site, we were tasked with improving MongoDB connection management on the Content Delivery servers to help alleviate pressure on the servers and MongoDB, particularly during busy times of the day and during traffic surges caused by marketing campaigns or other real-world events.

The Key Settings

We confirmed that Sitecore ships with the MongoDB driver's default settings, which are set directly in the driver code. You can view those defaults by following this GitHub link:
https://github.com/mongodb/mongo-csharp-driver/blob/v2.0.x/src/MongoDB.Driver/MongoDefaults.cs
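
If you would rather confirm the defaults at runtime than read the source, the driver exposes them on the static MongoDefaults class; a quick check looks like this:

 using System;
 using MongoDB.Driver;

 class ShowDriverDefaults
 {
     static void Main()
     {
         // These mirror the values hard-coded in MongoDefaults.cs.
         Console.WriteLine(MongoDefaults.MinConnectionPoolSize);  // 0
         Console.WriteLine(MongoDefaults.MaxConnectionPoolSize);  // 100
         Console.WriteLine(MongoDefaults.MaxConnectionIdleTime);  // 00:10:00
         Console.WriteLine(MongoDefaults.MaxConnectionLifeTime);  // 00:30:00
     }
 }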

Working with mLab Support, we determined that our focus would be on the following:

Min Pool Size

We decided to increase the Min Pool Size from the default of 0 to 20. The mLab team approved this suggestion on the basis that we had observed the Content Delivery servers' connection pools maxing out due to the number of operations happening during the Sitecore startup process.

Max Pool Size

We increased our Max Pool Size from the default of 100 to 150 in order to better accommodate surges in connection demand. The purpose of this update was to lessen the chance of running out of connections altogether.

Connection Idle Time

We increased the Connection Idle Time from the default of 10 minutes to 25 minutes to reduce the need to create new connections during normal and high-traffic surges.

Connection Life Time

We dropped the default setting of 30 minutes down to 0 (no lifetime), since the default lifetime could also have been a contributing factor to the connection churn we observed.

Per this thread, a MongoDB engineer (driver author) suggested that this setting is likely not needed:
https://stackoverflow.com/questions/32816076/what-is-the-purpose-of-the-maxconnectionlifetime-setting

The How

As Kam explains in his post, Sitecore exposes an empty updateMongoDriverSettings pipeline that you can hook into to modify configurations that are not available in the connection string.

I created a processor to add to this pipeline that alters the MongoClientSettings:
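
In outline, the processor looks like this. The namespace and property names are simplified for illustration, and it assumes the pipeline's UpdateMongoDriverSettingsArgs exposes the MongoClientSettings via its MongoSettings property:

 using System;
 using Sitecore.Analytics.Data.DataAccess.MongoDb;

 namespace MySite.Analytics.Pipelines
 {
     // Sketch of a processor whose values are injected from the config patch below.
     public class UpdateMongoDriverSettings
     {
         public int MinConnectionPoolSize { get; set; }
         public int MaxConnectionPoolSize { get; set; }
         public int MaxConnectionIdleTimeMinutes { get; set; }
         public int MaxConnectionLifeTimeMinutes { get; set; }

         public void Process(UpdateMongoDriverSettingsArgs args)
         {
             args.MongoSettings.MinConnectionPoolSize = MinConnectionPoolSize;  // 20
             args.MongoSettings.MaxConnectionPoolSize = MaxConnectionPoolSize;  // 150
             args.MongoSettings.MaxConnectionIdleTime = TimeSpan.FromMinutes(MaxConnectionIdleTimeMinutes);  // 25 minutes
             args.MongoSettings.MaxConnectionLifeTime = TimeSpan.FromMinutes(MaxConnectionLifeTimeMinutes);  // 0 = no lifetime
         }
     }
 }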

Finally, I added the following patch to add the processor to the pipeline, allowing us to pass the updated MongoDB driver settings to the custom processor:
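
A patch along these lines registers the processor and feeds it the values (the type and assembly names are illustrative):

 <configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
   <sitecore>
     <pipelines>
       <updateMongoDriverSettings>
         <processor type="MySite.Analytics.Pipelines.UpdateMongoDriverSettings, MySite">
           <MinConnectionPoolSize>20</MinConnectionPoolSize>
           <MaxConnectionPoolSize>150</MaxConnectionPoolSize>
           <MaxConnectionIdleTimeMinutes>25</MaxConnectionIdleTimeMinutes>
           <MaxConnectionLifeTimeMinutes>0</MaxConnectionLifeTimeMinutes>
         </processor>
       </updateMongoDriverSettings>
     </pipelines>
   </sitecore>
 </configuration>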

Final Note

A special thanks to Dan Read (Arke), Alex Mayle (Sogeti) and the mLab Support Team for their contributions.