In this post I’ll talk about an issue we recently encountered in one of our customers’ SharePoint 2010 environments. I’ll describe the issue first, then how we fixed it, and finally the lessons we learned from it.
Issue we faced:
A couple of days back we received a huge number of support incidents from users across the globe stating that they were not able to access the SharePoint portal. For some reason the issue was intermittent: the site would load fine at times and then all of a sudden stop loading. We decided to log in to each WFE server in the farm and identify which one was throwing the bad request. Listed below are the steps we took initially to identify the root cause.
Troubleshooting steps done by us initially:
- Tried loading the problematic site from our end to check whether we could reproduce the issue.
- Once we confirmed that we could reproduce the issue, we changed the hosts file entry to point to each WFE in the farm in turn and tried to load the site. This was done to identify which WFE server threw the bad request.
- During this process we noticed some abnormal behavior on two servers (WFE1 and WFE2) of the SharePoint farm. CPU and RAM utilization on these two servers was continuously hitting 100%, and because of that all the requests going to them were failing. The servers were almost unresponsive.
- We took a look at the Event Viewer and found many entries about the McAfee antivirus software update process failing. Then I opened the McAfee console to understand what was happening and, as expected, found many update failures. I pulled the McAfee logs and found many related entries there as well.
- In addition, we noticed entries in the ULS logs showing the SharePoint farm trying to run a configuration change by itself and also invoking an upgrade process. The w3wp.exe SharePoint worker process was also consuming a large amount of RAM.
- Given that we had noticed so many odd entries in the logs, we planned to reboot the server and see if that helped. And yes, it helped: the issue went away.
- However, the server reboot was just a workaround, and we wanted to identify what exactly had triggered the problem, since we had noticed some odd entries in the SharePoint logs about an automatic configuration change and upgrade process. Hence we decided to open a support case with Microsoft for a detailed RCA.
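The hosts file trick from the steps above can also be approximated without editing the hosts file, by sending a request straight to each WFE's address while presenting the portal's host name. The following is only a minimal sketch: the function names, the server names, the addresses and the host name are all hypothetical, and it assumes the site is reachable over plain HTTP.

```python
import http.client


def probe_wfe(address, host_header, port=80, path="/", timeout=10.0):
    """Request `path` from one WFE directly, presenting the portal's host name.

    Returns the HTTP status code, or a short error string if the request failed.
    """
    conn = http.client.HTTPConnection(address, port, timeout=timeout)
    try:
        # Supplying our own Host header mimics what a hosts file entry achieves:
        # the request goes to this server but is addressed to the portal's site.
        conn.request("GET", path, headers={"Host": host_header})
        return conn.getresponse().status
    except (OSError, http.client.HTTPException) as exc:
        return f"error: {exc.__class__.__name__}"
    finally:
        conn.close()


def probe_farm(wfes, host_header, **kwargs):
    """Probe every WFE in `wfes` (a name -> address mapping) and collect the results."""
    return {name: probe_wfe(addr, host_header, **kwargs) for name, addr in wfes.items()}
```

A call such as `probe_farm({"WFE1": "10.0.0.11", "WFE2": "10.0.0.12"}, "portal.example.com")` (made-up addresses and host name) returns one result per server. The sketch does not authenticate, so on a farm that requires Windows authentication a 401 from a server still shows that it is answering, whereas a timeout or connection error points to the kind of unresponsive server described above.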
Now let’s take a look at what Microsoft had to say to us about this issue.
Troubleshooting steps done by Microsoft:
We captured the ULS logs from the exact time the issue was reported and shared them with Microsoft. Please note that this issue was not reproducible: it happened only once, and after the server reboot everything looked normal, so we could not reproduce the behavior for Microsoft. Microsoft analyzed the logs, and this is what they found.
A huge performance issue was identified as you can see in the logs image below:
AppDomain recycling was happening very frequently as you can see in the screenshot below:
The AppDomain recycling was happening on both WFEs, as shown in the ULS logs screenshot below:
What we identified after analyzing the logs:
Based on the above analysis, we identified the root cause of this issue: AppDomain recycling was happening very frequently. An AppDomain is an isolation boundary inside the w3wp.exe worker process of the web application, not a separate process. The AppDomain kept recycling, and that caused the performance problems in the environment.
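To get a feel for how often something like this happens, the ULS logs can be scanned for a keyword and the hits counted per minute. This is a rough sketch, not what Microsoft ran: it assumes the usual tab-separated ULS text format with the timestamp in the first column, and the search keyword is whatever text you see in your own recycle entries.

```python
from collections import Counter


def count_matches_per_minute(lines, needle):
    """Count log lines containing `needle` (case-insensitive), bucketed by minute.

    Assumes tab-separated lines whose first field is a timestamp such as
    "MM/DD/YYYY HH:MM:SS.ff"; lines that do not start with a digit are skipped.
    """
    counts = Counter()
    wanted = needle.lower()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        timestamp = fields[0].strip().rstrip("*")  # "*" marks a continued entry
        if not timestamp[:1].isdigit():
            continue  # header line or stray text
        if wanted not in line.lower():
            continue
        # Keep "MM/DD/YYYY HH:MM" and drop the seconds and fractions.
        counts[timestamp[:16]] += 1
    return counts
```

Feeding it an open log file, for example `count_matches_per_minute(open(path, encoding="utf-8", errors="replace"), "AppDomain")`, and then calling `.most_common(10)` on the result lists the busiest minutes. Both the path and the keyword are placeholders to adapt, and the right encoding may differ for your log files.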
There are two possible causes for this AppDomain recycling:
a) AV exclusions are not implemented in your SharePoint farm as per the article below. Certain folders may have to be excluded from antivirus scanning when you use file-level antivirus software in SharePoint.
b) Application restarts may occur in some situations if any process accesses the Web.config file in the root of the application, the Machine.config file, the Bin folder, or the Global.asax file.
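One rough way to look into the second possibility is to check whether any of those files changed around the time of the recycles. The sketch below only compares last-modified timestamps, so a process that merely reads the files will not show up. The function name and the paths you pass in are assumptions for illustration.

```python
from datetime import datetime
from pathlib import Path


def modified_since(paths, cutoff):
    """Return (path, modified_time) pairs for watched files changed at or after `cutoff`.

    `paths` may mix files (Web.config, Global.asax, ...) and folders (Bin);
    for a folder, the files directly inside it are checked.
    """
    hits = []
    for raw in paths:
        p = Path(raw)
        candidates = [c for c in p.iterdir() if c.is_file()] if p.is_dir() else [p]
        for candidate in candidates:
            if not candidate.exists():
                continue  # ignore paths that are not present on this server
            mtime = datetime.fromtimestamp(candidate.stat().st_mtime)
            if mtime >= cutoff:
                hits.append((str(candidate), mtime))
    # Oldest change first, so the output reads like a timeline.
    return sorted(hits, key=lambda hit: hit[1])
```

Calling it with the web application's Web.config, Global.asax and Bin folder, plus a cutoff shortly before the first recycle, gives a quick list of anything that was touched in that window.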
In our case it was the first one: we had not excluded the necessary files and folders from AV scanning, so we decided to exclude them as described in the aforementioned article. These are SharePoint system files and folders, and they have to be excluded from AV scanning; otherwise, when a scheduled full scan kicks off in your SharePoint farm, it will scan these files too, and this will hurt the performance of the farm.
If you’re planning to install antivirus software in your SharePoint farm, please make sure that all the folders mentioned in this article are excluded from scanning. These are SharePoint system files and folders, and every time the AV scan engine tries to scan them it puts the farm at risk, because the scanning process interferes with SharePoint’s operation.
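Once you have the folder list from the article and the exclusion list exported from your antivirus console, comparing the two is easy to script. Here is a minimal sketch, assuming both lists are plain Windows folder paths with no wildcards; the function name is made up, and the actual paths must come from the article and your own AV configuration.

```python
from pathlib import PureWindowsPath


def missing_exclusions(required, configured):
    """Return the required folders that no configured exclusion covers.

    A folder counts as covered when an exclusion is the same folder or one of
    its parents. Windows path comparison here is case-insensitive.
    """
    configured_paths = [PureWindowsPath(c) for c in configured]
    missing = []
    for folder in required:
        target = PureWindowsPath(folder)
        covered = any(c == target or c in target.parents for c in configured_paths)
        if not covered:
            missing.append(folder)
    return missing
```

An empty result means every required folder is covered. Anything returned is a folder that a scheduled full scan would still touch.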
Thanks for reading this post!!! Happy SharePointing.