Friday, July 27, 2018

Sitecore xDB - GeoIP and Contention Dynamics in MongoDB



In my previous post, I discussed how our team has been working diligently to alleviate pressure on our servers and MongoDB for a high-traffic client's Sitecore Commerce site.

We use mLab to host our Experience Database, and while monitoring the cluster's telemetry, we noticed a series of contention indicators related to the increased number of queries and connections during daytime traffic surges.

In our scenario, our client's site has a lunchtime traffic surge between 11am and 3pm every day.

Contention Dynamics

Our MongoDB cluster was not being over-taxed in terms of overall capacity, as we were not using up all the RAM and CPU, but the telemetry charts did show what looked like pretty clear contention.

We noticed a certain pattern and volume in the site's traffic that led to contention dynamics on our MongoDB nodes. The contention would eventually start to affect the Sitecore Content Delivery servers, which were also dealing with that day's peak load of lunchtime web traffic.

We were seeing a surge in connections with data reads, as reflected in MongoDB metrics such as the count of Queries (Operations Per Second) and the Returned documents count (Docs Affected Per Second). This was leading to a high degree of contention, as reflected in various other MongoDB metrics (CPU time, queues, disk I/O, page faults).

Our initial theory was that the root cause of this contention in MongoDB was the high volume of lunchtime traffic in Sitecore, but in an indirect way.

GeoIP and MongoDB

Having troubleshot Sitecore's GeoIP service before, I had a pretty good understanding of the flow.

If you need some insight, I suggest reading Grant Killian's post:

In summary, the flow looks like this:
  • Visitor visits Sitecore website
  • Sitecore performs a GeoIP information lookup from the memory cache using the visitor's IP address
  • If the GeoIP information IS in memory cache then it uses it in the visitor's interaction
  • If the GeoIP information IS NOT in memory cache, it performs a GeoIP lookup in the MongoDB Analytics database's GeoIps collection
  • If the GeoIP information IS in the MongoDB Analytics database's GeoIps collection, it uses it in the visitor's interaction and stores the result in memory cache
  • If the GeoIP information IS NOT in the GeoIps collection, it performs a lookup using the Sitecore Geolocation service and stores the result in memory and uses it in the visitor's interaction
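The flow above can be sketched as a small tiered lookup. This is only an illustration in Python, not Sitecore's actual implementation: the dictionaries stand in for the memory cache and the GeoIps collection, and the names resolve_geoip and geolocation_service, along with the sample IP addresses and placeholder values, are hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical stand-ins for the first two tiers described in the list above.
memory_cache: Dict[str, dict] = {}
geoips_collection: Dict[str, dict] = {
    "203.0.113.7": {"country": "XX"},  # made-up sample record
}


def geolocation_service(ip: str) -> dict:
    """Placeholder for the remote Sitecore Geolocation service call."""
    return {"country": "unknown"}


def resolve_geoip(ip: str) -> Tuple[dict, str]:
    """Return (geo_data, source), trying memory, then MongoDB, then the service."""
    # 1. Memory cache hit: no database work at all.
    if ip in memory_cache:
        return memory_cache[ip], "memory"

    # 2. Memory miss: read from the GeoIps collection and cache the result.
    if ip in geoips_collection:
        geo = geoips_collection[ip]
        memory_cache[ip] = geo
        return geo, "mongodb"

    # 3. Not in the collection either: call the service and cache in memory.
    #    (The list above only mentions the memory store for this step, so
    #    persisting the result to the collection is not modelled here.)
    geo = geolocation_service(ip)
    memory_cache[ip] = geo
    return geo, "service"
```

Calling resolve_geoip("203.0.113.7") twice reports "mongodb" as the source the first time and "memory" the second time. The point of the sketch is that every memory cache miss turns into at least one read against MongoDB.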

Our high-traffic site makes heavy use of GeoIP, as the Home Page is personalized based on the visitor's location and local time. 

There had to be a correlation between the high traffic, GeoIP, and the activity we were seeing on our MongoDB cluster.

The step that stood out to me in the flow above was the GeoIP lookup against the MongoDB Analytics GeoIps collection.

Running a record count query against the GeoIps collection, we discovered that it contained 7.4 million records! This confirmed our theory that the MongoDB GeoIps collection was heavily populated and being used for the lookups to hydrate the visitor's interaction and memory cache.

As a side note, if you crack open the interaction collection, you can see how Sitecore ties the GeoIP data from the lookup to the visitor's interaction (this is old news):

GeoIP Cache Settings

After digging into the code, we discovered that Sitecore's GeoIP service uses a cache called LegacyLocationList to store the GeoIP lookup data after it has been returned from either MongoDB or the Geolocation service.

The naming of the cache is what caught us by surprise. One would think that a "legacy" cache would no longer be used.

If you crack open Sitecore.CES.GeoIp.LegacyLocation.dll with your favorite .NET decompiler, you will see the following:

We started monitoring this legacy location cache closely, and discovered that it was in fact hitting capacity and clearing frequently during our lunchtime traffic surge. This had a direct relationship with the contention we were seeing on our MongoDB nodes during that period of time.

It was obvious to us at this point that the 12MB default size of this cache was not enough to handle all that GeoIP lookup data!
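To make the relationship between cache capacity and MongoDB reads concrete, here is a toy simulation in Python. It rests on loud assumptions: the cache is modelled as clearing itself completely when it reaches capacity (a simplification, not necessarily how Sitecore's cache behaves), capacity is counted in entries rather than megabytes, and the cyclic request pattern is a deliberately harsh worst case. ClearOnFullCache and count_fallback_reads are hypothetical names.

```python
from itertools import cycle, islice


class ClearOnFullCache:
    """Toy cache that empties itself entirely once it reaches capacity."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries = {}
        self.clears = 0

    def get(self, key):
        return self.entries.get(key)

    def put(self, key, value) -> None:
        # When the cache is full, drop everything before storing the new entry.
        if len(self.entries) >= self.capacity:
            self.entries.clear()
            self.clears += 1
        self.entries[key] = value


def count_fallback_reads(capacity: int, distinct_ips: int, requests: int):
    """Replay a cyclic stream of IPs; return (fallback_reads, cache_clears)."""
    cache = ClearOnFullCache(capacity)
    fallback_reads = 0
    for ip in islice(cycle(range(distinct_ips)), requests):
        if cache.get(ip) is None:
            # Cache miss: in the real flow this is where MongoDB gets queried.
            fallback_reads += 1
            cache.put(ip, {"ip": ip})
    return fallback_reads, cache.clears


if __name__ == "__main__":
    for capacity in (60, 100):
        print(capacity, count_fallback_reads(capacity, distinct_ips=100, requests=1000))
```

With 100 distinct IPs and 1000 requests, a capacity of 100 needs only the initial 100 fallback reads, while a capacity of 60 misses on every single request because the cache keeps clearing before an IP comes around again. Real traffic is far less regular than this, but the sketch shows why a cache that is slightly too small for its working set can push a disproportionate number of reads to the database.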

GeoIP Cache Size Updates and Results

Our team decided to increase the LegacyLocationList cache size to 20MB via a simple patch update:

 <setting name="CES.GeoIp.LegacyLocation.Caching.LegacyLocationListCacheSize">  
     <patch:attribute name="value">20MB</patch:attribute>  
 </setting>  

After our deployment, we monitored the cluster's telemetry closely. Looking at the connection count, it was apparent that there was an immediate improvement resulting from the increased cache size.

Before the deployment of the cache setting change (LegacyLocationList cache default set to 12MB), we were averaging around 400 connections during the traffic surge.

After the deployment (LegacyLocationList cache size increased to 20MB), our connection count was only averaging around 200!

Over the course of several weeks, our team was happy to report that during our lunchtime traffic surges, there was a dramatic reduction in connections with Data Reads, Operations Per Second, Docs Affected Per Second, CPU time, queues, disk I/O, and page faults on our MongoDB cluster.

This was another positive step towards our overall goal of improving MongoDB connection management on our Content Delivery servers.

Final Note

Another special thanks to Dan Read (Arke) and Alex Mayle (Sogeti) for their contributions.

