Tuesday, September 6, 2016

Bulletproofing your Sitecore Solr and SolrCloud Configurations

Solr and SolrCloud 

As we know, Sitecore supports both Lucene and Solr search engines. However, there are some compelling reasons to use Solr instead of Lucene that are covered in this article: https://doc.sitecore.net/sitecore_experience_platform/setting_up__maintaining/search_and_indexing/indexing/using_solr_or_lucene

Solr has been my search engine of choice for all of my 8.x projects over the last few years, and I recently configured SolrCloud for a client for whom fault tolerance and high availability were critical requirements.

Although I am a big fan of SolrCloud, it is important to note that Sitecore doesn't officially support SolrCloud yet. For more details, see this KB article: https://kb.sitecore.net/articles/227897.

So, should SolrCloud still be considered in your architecture?

My answer to this question is YES!

My reasoning is that members of Sitecore's Technical and Professional Services team have implemented a very stable patch to support SolrCloud, and it has been tested and used in production by extremely large-scale SolrCloud implementations. More about this later.

In addition, if you are running xDB, your Analytics index will get very large over time, and the only way to handle this is to break it up into multiple shards. SolrCloud is needed to handle this.
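To make the sharding idea concrete, here is a minimal sketch of how a sharded collection could be requested through the Solr Collections API. The base URL, the shard and replica counts, and the config set name `sitecore_analytics_conf` are assumptions chosen for illustration; they are not values from this project.

```python
from urllib.parse import urlencode


def build_create_collection_url(base_url, name, num_shards, replication_factor, config_name):
    """Build a Solr Collections API CREATE URL for a sharded collection."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,                  # how many pieces the index is split into
        "replicationFactor": replication_factor,  # copies of each shard
        "collection.configName": config_name,     # config set already uploaded to ZooKeeper
    }
    return f"{base_url.rstrip('/')}/admin/collections?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical values: adjust host, counts, and config set name to your cluster.
    print(build_create_collection_url(
        "http://localhost:8983/solr",
        "sitecore_analytics_index",
        num_shards=2,
        replication_factor=2,
        config_name="sitecore_analytics_conf",
    ))
```

The function only builds the URL. Sending the request, and choosing shard and replica counts that fit your index size, is left to you.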

The Quest to Keep Solr Online 

One of our high-traffic clients running xDB recently started having Solr issues. This sparked my research, and my work with the Sitecore Technical Services team, to obtain a patch that keeps Sitecore running when Solr is having issues.

As a side note, the issues that we started seeing were related to the Analytics index getting pounded. The most common error that we saw was the following:

 ERROR <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">  
 <html><head>  
 <title>502 Proxy Error</title>  
 </head><body>  
 <h1>Proxy Error</h1>  
 <p>The proxy server received an invalid  
 response from an upstream server.<br />  
 The proxy server could not handle the request <em><a href="/solr/sitecore_analytics_index/select">GET&nbsp;/solr/sitecore_analytics_index/select</a></em>.<p>  
 Reason: <strong>Error reading from remote server</strong></p></p>  
 </body></html>  

This only popped up after running xDB for several months, as our analytics index started getting fairly large. It is definitely something to keep in mind when you are planning for growth and, as mentioned above, it is why SolrCloud is the best option for a large-scale, enterprise Sitecore search configuration.

Giving the Java Virtual Machine (JVM) running Apache more memory seemed to help, but this error would continue to rear its nasty head every so often during periods of high traffic.

Sitecore is very sensitive to Solr connection issues; it will be brought to its knees and throw an exception if it has any trouble connecting!

The Bulletproof Solr Patches 


Single Instance Solr Configuration - Patch #391039 

My research to keep Sitecore online if there are Solr issues led me to this post by Brijesh Patel that was published back in March. After reading through it, I decided to contact Sitecore Support about patch #391039, as it seemed to be just what I wanted for my client's single Solr server configuration.

I worked with Andrew Chumachenko from Support, and our tests revealed that the patch published here didn't handle index "SwitchOnRebuilds". To me, this was a deal breaker.

Andrew discovered that there were several versions of patch #391039 (early versions of the patch were implemented for Sitecore 7.2), and found at least three different variations.

We found that the most recent version of the patch did in fact support "SwitchOnRebuilds", and Andrew made this available to everyone in the community on GitHub: https://github.com/andrew-at-sitecore/Sitecore.Support.391039

This is a quote from Brijesh's post to explain how it works:

"...it checks if Solr is up on Sitecore start. If no, it skips indexes initializing. However, it may lead to exceptions in log files and inconsistencies while working with Sitecore when Solr is down.

Also, there is an agent defined in the ‘Sitecore.Support.391039.config’ that checks and logs the status of Solr connection every minute (interval value should be changed if needed).

If the Solr connection is restored — indexes will be initialized, the corresponding message will be logged and the search and indexing related functionality will work fine."
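The behavior described in the quote can be sketched as a small state machine. This is not the patch's actual code (the patch is a .NET assembly); it is a Python illustration under stated assumptions. The names `SolrAvailabilityMonitor`, `ping_solr`, and `run_agent` are hypothetical, and so is the choice to leave indexes initialized if Solr drops again later.

```python
import time
import urllib.request


def ping_solr(ping_url, timeout=5.0):
    """Return True if the given Solr ping URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(ping_url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers URLError, HTTP error statuses, timeouts, and refused connections.
        return False


class SolrAvailabilityMonitor:
    """Skip index initialization while Solr is down; initialize once it answers."""

    def __init__(self, ping, initialize_indexes, log=print):
        self._ping = ping                              # callable returning True/False
        self._initialize_indexes = initialize_indexes  # callable run once Solr is reachable
        self._log = log
        self.indexes_initialized = False

    def check(self):
        """Run once at startup and again on every agent interval."""
        if not self._ping():
            self._log("Solr is unavailable; index initialization skipped")
            return False
        if not self.indexes_initialized:
            self._initialize_indexes()
            self.indexes_initialized = True
            self._log("Solr is available; indexes initialized")
        return True


def run_agent(monitor, interval_seconds=60, max_checks=None, sleep=time.sleep):
    """Call monitor.check() every interval_seconds (one minute by default)."""
    checks = 0
    while max_checks is None or checks < max_checks:
        monitor.check()
        checks += 1
        if max_checks is None or checks < max_checks:
            sleep(interval_seconds)
```

As a usage sketch, you could construct the monitor with `lambda: ping_solr("http://localhost:8983/solr/sitecore_analytics_index/admin/ping")` and your own initialization callback, then call `run_agent(monitor)`. The host, port, and the availability of the ping handler on your core are assumptions to verify against your own Solr setup.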

SolrCloud Solr Configuration - Patch #449298 

This patch works the same way as patch #391039 described above, but supports SolrCloud.

You may be asking yourself, "isn't the point of having a highly available Solr configuration to ensure that my Solr search doesn’t have issues?"

Well, of course. But due to the way SolrCloud operates, this patch acts as a fail-safe if something goes wrong, for example while ZooKeeper is determining the new leader after you lose an instance. If Sitecore tries to query Solr and has trouble, even for a mere second, it will throw an exception.

So, patch #449298 accounts for this and also allows index "SwitchOnRebuilds" just like the common, single instance Solr server configurations.
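To see the kind of window the patch guards against, you can inspect the output of the Collections API `CLUSTERSTATUS` action and look for shards that currently have no active leader. The sketch below is an illustration only, not part of the patch. It assumes the response has already been fetched and parsed into a dictionary, and that it follows the `cluster` / `collections` / `shards` / `replicas` layout with `leader` reported as the string `"true"`. The sample data is made up.

```python
def shards_without_active_leader(cluster_status, collection):
    """Return the sorted names of shards that have no replica that is both leader and active."""
    shards = cluster_status["cluster"]["collections"][collection]["shards"]
    missing = []
    for shard_name, shard in shards.items():
        has_leader = any(
            replica.get("leader") == "true" and replica.get("state") == "active"
            for replica in shard.get("replicas", {}).values()
        )
        if not has_leader:
            missing.append(shard_name)
    return sorted(missing)


if __name__ == "__main__":
    # Made-up, trimmed sample in the assumed CLUSTERSTATUS layout.
    sample = {
        "cluster": {
            "collections": {
                "sitecore_analytics_index": {
                    "shards": {
                        "shard1": {"replicas": {
                            "core_node1": {"state": "active", "leader": "true"},
                            "core_node2": {"state": "active"},
                        }},
                        "shard2": {"replicas": {
                            "core_node3": {"state": "down", "leader": "true"},
                            "core_node4": {"state": "recovering"},
                        }},
                    }
                }
            }
        }
    }
    print(shards_without_active_leader(sample, "sitecore_analytics_index"))
```

A shard that shows up in this list is one where queries may fail until a new leader is elected, which is the situation the patch is meant to ride out.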

GitHub for this patch: https://github.com/SitecoreSupport/Sitecore.Support.449298 

It is important to note that this patch requires an IoC container that injects proper implementations for SolrNet interfaces. It depends on patch Sitecore.Support.405677. You can download the assemblies based on your IoC container from this direct link: https://github.com/SitecoreSupport/Sitecore.Support.405677/releases

Looking Ahead 

Out-of-the-box support for Solr (taking these patches into account) is to be added in the upcoming Sitecore 8.2 U1, so it is definitely something to look forward to in this release.

A special thanks to Paul Stupka, who is the mastermind behind these patches, and rockstar Andrew Chumachenko for all his help.