19 April 2010

SharePoint, IIS, w3wp.exe, threads and app pools… Sometimes, it IS the end user after all!

I’ve been helping a good client of mine troubleshoot some performance issues with their SharePoint environment.  They have a single 32 bit MOSS 2007 server, so with 1,000+ active users (not concurrent, though) it is stretched about as thin as it will go.  Recently, the server started having issues where the app pool would lock up and take all the users down.  Now, IIS app pools are designed to recycle when certain limits are reached, in a way that is seamless to the end user.  The app pool was set to recycle when the memory consumption of the worker process (w3wp.exe) reached 1 GB or the virtual memory consumption for the app pool reached 1.9 GB.  We were not seeing the overlapped recycle take place automatically because the app pool would lock up when memory reached around 940 MB.  It was not consistent, though, so it couldn’t be identified readily.  We ended up trimming the limits back, eventually settling on 800 MB physical and 1.5 GB virtual memory as the recycle triggers.
Once the app pool reached either of those limits in its memory consumption, IIS would spin up a new w3wp.exe worker process with a fresh app pool and all new SharePoint requests would be directed to that process instead.  All existing pending requests on the current worker process/app pool would complete or be terminated once the timeout configured in IIS was reached.  After all requests completed and released their execution threads, the worker process would terminate and release its memory back to the pool for IIS to use.
If you are seeing similar behavior in your SharePoint environment, there are a few things you need to pay attention to:
  1. IIS Timeout setting.
  2. Runaway/locked up threads.
  3. Time between recycles.
  4. Physical server memory.
  5. Bit architecture of the server and SharePoint.
Once your server isn’t crashing for end users any more, it’s time to tune its health more closely.  Identify what the IIS timeout is set to for your server/app.  If your server is still on IIS6, you will want to ensure that the LogEventOnRecycle property in the IIS metabase is set to true.  Next, look in the Event Log under System for message 1077, which indicates that an overlapped recycle took place for the app pool.  Make sure to note the time between these messages.  It’s best to use the smallest interval, which should correspond to the peak volume time of day for the given server.  Lastly, confirm how much physical memory the server has and what the bit architecture of the server, the OS and SharePoint is, i.e. are you running 32 bit or 64 bit.
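Finding the smallest gap between recycle messages is easy to get wrong by eye on a busy log. Here is a minimal Python sketch, assuming you have already copied the timestamps of the 1077 messages out of the Event Log by hand. The function name and the sample times are made up for illustration and do not come from the client’s server.

```python
from datetime import datetime, timedelta

def shortest_recycle_interval(timestamps):
    """Return the smallest gap between consecutive recycle events."""
    ordered = sorted(timestamps)
    if len(ordered) < 2:
        raise ValueError("need at least two recycle events")
    # Pair each event with the one that follows it and keep the smallest gap.
    return min(later - earlier for earlier, later in zip(ordered, ordered[1:]))

# Hypothetical recycle times, in no particular order.
recycle_events = [
    datetime(2010, 4, 12, 8, 5),
    datetime(2010, 4, 12, 10, 32),
    datetime(2010, 4, 12, 9, 50),
    datetime(2010, 4, 12, 13, 15),
]

print(shortest_recycle_interval(recycle_events))  # 0:42:00
```

With these sample times the tightest window is the 42 minutes between 9:50 and 10:32, which is the number you would carry forward into the rest of the tuning.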
Now it’s time for some math.  If you have a 32 bit server running 32 bit Windows Server and 32 bit SharePoint, this is a much more crucial issue than if you were running all 64 bit.  The issue deals with memory.  You have to figure that the server will not realistically have available to your worker processes more than half of its actual physical ADDRESSABLE memory.  I highlight addressable here because, remember, under a typical 32 bit configuration (without PAE) your server cannot address more than about 3.2 GB of memory, even if it has 8 GB of physical memory!
Thus in our all 32 bit example, even though the server has 4 GB of memory physically, the OS can only address 3.2 GB which means by my math, about 1.6 GB would be available to our worker processes in IIS.  You may be tempted to use something just below that as your recycle point, but remember that we have OVERLAPPED RECYCLE going on which means that IIS is managing two worker processes at the same time, so each would require its own memory in order to function properly.
That was the problem we ran into when the recycle threshold was set at 1 GB.  The worker process would trip the limit and then IIS would attempt to spin up an overlapping worker process, but since there wasn’t enough memory available to do so, it took no time at all to completely lock up IIS and bring down end users.  Only a forced recycle of the app pool, which forcibly releases all threads and memory pages, thus also dropping users, before spinning up a new worker process, could restore the server to a working state.
Dropping the memory recycle trigger down to 800 MB instead, we consumed half of our available memory, or 25% of the addressable memory.  When the worker process triggered the overlapped recycle, it would spin up a second worker process and direct traffic to it while finishing up requests in the first worker process.  Provided none of these requests had runaway threads, the worker process would typically shut down and release memory within a minute or two.
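The arithmetic above can be sketched in a few lines of Python. This is only my rule of thumb written down, not anything IIS computes for you. The function names are made up, and the 50% usable fraction and the two overlapping processes are the assumptions from the text.

```python
def recycle_threshold_mb(addressable_mb, usable_fraction=0.5, overlapping_processes=2):
    """Per-process memory ceiling that lets all overlapping worker processes fit."""
    budget_mb = addressable_mb * usable_fraction
    return budget_mb / overlapping_processes

def threshold_fits(threshold_mb, addressable_mb, usable_fraction=0.5, overlapping_processes=2):
    """True if every overlapping worker process can reach the threshold at once."""
    budget_mb = addressable_mb * usable_fraction
    return threshold_mb * overlapping_processes <= budget_mb

# 3.2 GB addressable, treated as 3200 MB to match the round numbers in the post.
print(recycle_threshold_mb(3200))   # 800.0
print(threshold_fits(1000, 3200))   # False: the original 1 GB trigger
print(threshold_fits(800, 3200))    # True: the trimmed 800 MB trigger
```

Two processes at 1 GB each need 2 GB against a 1.6 GB budget, which is why the overlapped recycle had nowhere to go.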
This gets the server into a usable state as far as the end user is concerned because they no longer see crashes or get locked up.  On the server side, you will see the app pools recycle much more frequently, and you run the risk that a runaway thread will lock up the first worker process until the IIS timeout is reached.  That setting is 15 minutes under IIS by default, but most SharePoint shops have upped that to 30 minutes, especially where low bandwidth or VPN users are in play.  As a result, a runaway thread would keep the first worker process alive for 30 minutes.  You can see how the time between recycles now becomes super CRITICAL!  If your overlapped recycles happen more frequently than your IIS timeout value, change something.
RECOMMENDATION:  Ensure that your IIS timeout value is always LESS than your overlapped recycle time at its shortest interval.
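To see why this matters, here is a rough Python sketch of the worst case. It assumes a deliberately pessimistic model of my own: every recycled worker process has a runaway thread and stays alive for the full IIS timeout. The function names are made up for illustration.

```python
import math
from datetime import timedelta

def timeout_is_safe(iis_timeout, shortest_interval):
    """The recommendation: the timeout must be strictly less than the shortest interval."""
    return iis_timeout < shortest_interval

def worst_case_worker_processes(iis_timeout, shortest_interval):
    """Processes alive at once if every recycled process hangs until the timeout.

    Assumption: one active process plus every earlier process that is
    still draining when the next recycle fires.
    """
    draining = math.ceil(iis_timeout / shortest_interval)
    return draining + 1

timeout = timedelta(minutes=30)

print(timeout_is_safe(timeout, timedelta(minutes=45)))              # True
print(worst_case_worker_processes(timeout, timedelta(minutes=45)))  # 2
print(timeout_is_safe(timeout, timedelta(minutes=15)))              # False
print(worst_case_worker_processes(timeout, timedelta(minutes=15)))  # 3
```

Under this model, a 30 minute timeout with recycles every 15 minutes can leave three worker processes alive at the same time, on a server whose memory budget only has room for two.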
Of course the answer is to solve the memory leak problems so that the app pools don’t have to recycle, but if you’ve ever tried to track down memory leaks, you know it’s HELL!    If you’ve never had the misfortune of having to do so, consider yourself truly blessed.
It’s also not always realistic to bring the IIS timeout value down.  If your server is recycling worker processes every 15 minutes, it’s certainly not likely to be doable.  That’s when it becomes mission critical to hunt down any runaway threads and determine their cause.  Anything that may cause the worker process to remain alive needs to be addressed in order to keep your server up and running.  At my client’s site, we were still getting runaway threads that could potentially put us in a state where a third worker process needed to be spun up, which would bring the whole thing to a screeching halt.
As an Enterprise Architect I get to see all sides of the fence.  I work with and talk to everyone involved.  When talking to developers, the feeling is usually that Ops people must have done “something” to the servers which is causing the instability.  When talking to operations personnel, the feeling is usually that Devs are writing bad code that’s causing the instability.  I’ve been in many SharePoint shops and have seen both sides of this argument be true, but not this time.
We had an awesome traffic profiling tool available for the job and that’s where we discovered two items that would cause runaway threads.
  1. SQL Server Reporting Services Integrated Mode.  If you’re a SharePoint Architect, you probably just had a cold shiver go over your entire body as you read that line.  Yes, every SharePoint shop dabbles with SSRS Integrated Mode (IM) at some point.  Most come to the conclusion that performance is a problem and usually deploy a dedicated server to run SSRS.  That was also the case here.  Unfortunately, there were a couple of IM reports that could not be moved over to the dedicated server, so IM was left active.  What we discovered was a series of reports developed and built by end users (as SSRS empowers them to do).  Of course end users are not going to know how to write optimized queries, so these reports performed poorly.  Some reports took upward of 30 seconds to load, and that was while local to the servers on a 1 Gb Ethernet connection.  The reports contain very large amounts of tabular data, and we know how well IE renders tables.  Imagine being a user on a remote VPN connection.  Your wait time on the report could easily go over 2 minutes.  The problem is that the thread servicing the request is locked up while all this data is transferred and interpreted for rendering in the browser.  Additionally, a user could easily lose patience and simply close the browser, fearing that it may be “locked up”.  When a user does that, the thread remains alive in the background until the download is complete, and the loss of the end point on the client side could very well turn it into a runaway thread that never releases its resources.  No matter how you slice or dice it, it’s bad.
  2. Image Rotating Banner.  We have a nifty little web part that adds pizzazz to user created pages by rotating through images chosen by the designer/user.  Now as I said, any time we empower end users to design content delivery, a LOT of thought has to go into it.  In this case, the web part was designed for ease of use: all the designer/user had to do was drop it on the page, set its Title and point it to an Image Library on the site.  Then when the page is loaded, the web part starts rotating images using JavaScript.  Nice.  But wait, there’s more.  Using our awesome traffic profiling tool, we discovered that pages, like main departmental home pages, were loading literally dozens of images.  Taking a closer look at the pages, they appeared to load rather slowly.  If you’ve ever dissected the loading sequence of a SharePoint page, you’d know that, even if you set it to display partially while downloading content, the JavaScript is usually the last part to be downloaded.  As a SharePoint developer would understand, a SharePoint page isn’t really functional until that JavaScript has loaded.  None of the dropdowns work etc.  But I digress.  Needless to say, until the page is completely loaded, you can’t really do too much.  What we saw was all these images loading with the page before the script would load, making the page load times very slow.  To make matters worse, we looked at some of the pictures being loaded and most were not resized to the 100 pixel banner size they displayed in.  On the contrary, the images were in their original 9 megapixel JPG format!  Cracking open the code for the web part, we discovered that it did exactly what I just described.  It showed ALL of the pictures in the picture library regardless of SIZE.
Though that design is OK where experienced developers or web designers would be using the web part, it unfortunately does not work well for end users or inexperienced designers.  It’s not realistic to expect an end user to think about the number of pictures being displayed, the fact that they are all preloaded on the page, and the size of those images.  With some images up to 5 MB in size and libraries easily containing in excess of 20 images, you can see how a 1 MB page, now having to preload all these images, suddenly became a 100 MB+ page.  That’s never good for performance.  Now granted, the web part should probably use AJAX to load its images and not preload them on the page, but this was the design that was available.  We implemented a hot fix to the code whereby we simply leveraged SharePoint’s built-in thumbnails for image libraries, since it’s just a banner anyway.  In addition, we display only a random 10 images from the library each time.  That meant no more than 100 KB in extra page size.  Again, you can see how a user could easily give up and close the browser, leaving a thread locked as it processes.
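The before and after numbers are easy to check with a small Python sketch. This is not the web part’s actual code, just an illustration of the cap-and-randomize idea and the page weight arithmetic. The function names are made up, and the 10 KB per thumbnail figure is an assumption derived from the 100 KB total for 10 images.

```python
import random

def pick_banner_images(image_names, limit=10):
    """Return at most `limit` images chosen at random, with no repeats."""
    names = list(image_names)
    return random.sample(names, min(limit, len(names)))

def page_weight_kb(base_page_kb, image_count, image_kb):
    """Total page size when every banner image is preloaded with the page."""
    return base_page_kb + image_count * image_kb

# Before the fix: 1 MB page plus 20 full-size images of roughly 5 MB each.
print(page_weight_kb(1000, 20, 5000))  # 101000

# After the fix: the same page plus 10 thumbnails of roughly 10 KB each (assumed).
print(page_weight_kb(1000, 10, 10))    # 1100

# Hypothetical library of 25 file names; only 10 are picked per page load.
library = [f"photo_{n}.jpg" for n in range(25)]
print(len(pick_banner_images(library)))  # 10
```

Capping the count matters as much as shrinking the images, because the cap is what stops the page weight from growing every time a user uploads another picture to the library.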
As we’ve seen in this case, as developers and architects, we always have to be conscious of our end users.  Tools we provide them in order to empower them can often come back to haunt us at the most inopportune times.

