Wednesday, November 27, 2013

MAXDOP or MAXDOOP?

     On my short list of SQL Server configurations that need to be set specifically to your server's needs is MAXDOP. It sets the maximum number of CPUs a single query can use once the optimizer decides the query is expensive enough to go parallel.
      The degree of CPU parallelism allowed by the server is often looked at like a turbo button for some of your system's heavy lifting. It can also be perceived as a freeway with mile-long backups at rush hour.
   While I have faith in the concept, many of my specific experiences have eventually whittled down to a MAXDOP of 1, which is effectively the off switch for the feature. Out of the box, SQL Server sets MAXDOP to zero, the full-throttle setting: it allows an expensive query to use every CPU available.
   It needs to be noted that SQL Server only uses the configured number of CPUs once the cost threshold is passed. The default cost threshold for parallelism is 5, and despite the folklore, that is an estimated cost unit, not five literal seconds. You need to consider what percentage of your queries run long enough to benefit from the extra CPUs.
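   For reference, both knobs live in sp_configure. A minimal sketch of checking and changing them (the example values are placeholders, not recommendations):

    -- Expose the advanced options so the parallelism settings are visible
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;

    -- Current values
    EXEC sp_configure 'max degree of parallelism';
    EXEC sp_configure 'cost threshold for parallelism';

    -- Example change: cap parallel queries at 4 CPUs and only go
    -- parallel for plans with an estimated cost above 25
    EXEC sp_configure 'max degree of parallelism', 4;
    EXEC sp_configure 'cost threshold for parallelism', 25;
    RECONFIGURE;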
   You have the tools to determine how long any query in your system takes to run. The surgeons out there will figure out exactly what percentage of their queries take how many seconds. The cowboys will simply estimate how many small transactions are happening versus the total time spent running large queries.
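   If you would rather have numbers than instinct, the plan cache offers a rough cut. A quick sketch against sys.dm_exec_query_stats (times are in microseconds, and it only sees what is still cached):

    -- Average elapsed time per cached query, longest first
    SELECT TOP (20)
           qs.execution_count,
           qs.total_elapsed_time / qs.execution_count AS avg_elapsed_us,
           SUBSTRING(st.text, 1, 100) AS query_text_start
    FROM sys.dm_exec_query_stats AS qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
    ORDER BY avg_elapsed_us DESC;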
   You may read that MAXDOP should be set to half the number of CPUs, up to 8. The question here is how likely it is that multiple long-running queries will hit the server at the same time. Too many simultaneous long-running queries will cause excessive CXPACKET or SOS_SCHEDULER_YIELD waits. You will then have to decide whether you can raise the cost threshold or have to lower the MAXDOP setting. Let the trial and error begin.
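   Those waits are easy enough to keep an eye on. A small sketch against the cumulative wait statistics (the counters accumulate from the last restart, so watch the trend, not the raw totals):

    -- Parallelism-related waits since the last restart
    SELECT wait_type, waiting_tasks_count, wait_time_ms
    FROM sys.dm_os_wait_stats
    WHERE wait_type IN ('CXPACKET', 'SOS_SCHEDULER_YIELD')
    ORDER BY wait_time_ms DESC;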
   The easiest place to put MAXDOP to its best use is on your data warehouses and report servers. There, the heavy processes are more likely scheduled or run in sequence, so the lifting won't collide with other long-running queries.
    The best advice when making any configuration change is to make small changes and carefully monitor the outcome. Best of luck.

Wednesday, November 20, 2013

Database Troubleshooting: The playbook for outside the box

     Some time back, a developer from one of our products was brought in to help resolve a locking issue we were experiencing. I remember him perusing much of the server settings and muttering that they were just about all default.
     It needs to be said that we ran six or so databases, only one of which was larger than 100 GB. That database held fifteen years of collected data from one version of the application or another. The server had 128 GB of RAM and 32 CPUs. I can confidently, and a little callously, say that even just out of the box, my server was more than able to handle anything that was asked of it.
      Most of the time we think about database and application issues as long-running queries. Occasionally a query seems long-running, or an application freezes up, because another call is blocking it; it seems logical that the offending query is itself long-running. We all have our ways of identifying the offender, from the simple Activity Monitor to the more involved DMV queries. My preferred method is something I modified from a Pinal Dave SQLAuthority article. I am a big fan of his articles, and many of my searches refer back to them.
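      My exact script isn't reproduced here, but a minimal DMV sketch in the same spirit would walk sys.dm_exec_requests for anything sitting behind a blocker:

    -- Who is blocked, who is blocking them, and what the blocked session is running
    SELECT r.session_id,
           r.blocking_session_id,
           r.wait_type,
           r.wait_time,
           st.text AS blocked_sql
    FROM sys.dm_exec_requests AS r
    CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS st
    WHERE r.blocking_session_id <> 0;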
      On this particular occasion the block was showing a query completely unrelated to the growing number of blocked users. All of the users were attempting to remove purchase order request lines created by orders that were no longer needed, while the blocker sat inactive in the application. Another tool within immediate reach is sp_lock. I created a script that passed the lead blocker's SPID to the sp_lock call. That call showed me, in no uncertain terms, that the inactive lead blocker was holding a lock on the entire PO Request table, which was preventing the other users from completing their order modifications, since the application was trying to delete records from that table.
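      That script is long gone, but the core of it was no more complicated than this (54 standing in for whatever SPID turned up as the lead blocker):

    -- Inspect the locks held by the lead blocker
    EXEC sp_lock 54;

    -- The DMV equivalent, if you prefer it over the old procedure
    SELECT resource_type, resource_associated_entity_id,
           request_mode, request_status
    FROM sys.dm_tran_locks
    WHERE request_session_id = 54;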
       I was convinced there was a connection being left open or a transaction not being committed. The developer did not agree. We were the only client complaining about the issue. Maybe we were the only client mismanaging our P.O. requests.
       "Maybe" isn't an I.T. word. That is perhaps the most memorably sentiment of the worst I.T. Manager I ever had the displeasure of reporting to.
        Some months later, the software company would quietly, in the back pages of a release note, mention an open dataset issue in one of the forms. That was as close to a confession as I was going to get.
   For the interim I had to lessen the impact of the blocking. I achieved this by removing the P.O. request lines proactively, based on other criteria. The less often someone confirming an order had request lines to delete, the less likely they were to be blocked by another user who had gone inactive.
   I considered trace flags that would stop the server from escalating locks on memory contention or by number of locks. In the end I did not; I felt I hadn't justified that it would have a real impact on the problem. I did change my scheduled restart of the server from a bi-weekly event to a weekly one.
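   For the record, the levers I was weighing are documented ones; sketched below with the usual caveat that I never pulled the trigger, and neither should you without testing:

    -- Trace flag 1224: don't escalate locks by lock count
    -- (escalation under memory pressure can still happen)
    DBCC TRACEON (1224, -1);

    -- Trace flag 1211: disable lock escalation entirely (the heavier hammer)
    -- DBCC TRACEON (1211, -1);

    -- On SQL Server 2008 and later there is also a per-table option;
    -- the table name here is hypothetical
    ALTER TABLE dbo.PORequest SET (LOCK_ESCALATION = DISABLE);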
   A restart to me is like punting: you haven't identified the problem, you have just thrown the good away with the bad. If there is a point to this pseudo-rant, it is that even when your resources have pointed you in pretty certain terms to the cause, if the other parties involved, outside vendors or not, plead innocence, you still have to MacGyver the situation.
 
     

Wednesday, November 13, 2013

Is it the Hardware or the Application?

     As I am currently job hunting, I am being asked a small collection of baseline questions, the kind of technical questions that separate the real from the would-be real. One of my favorites is how to tell whether a bottleneck is in the application or in the hardware resources.
      The interviewer is looking for the specific strategies you would use to determine the source. How do you prove to a developer that a slowness or locking issue is caused by the application? How do you show your system administrator that the server memory or drive specifications are not cutting it for the high-transaction order entry and billing systems?
     I assume the process in question is some sort of data call, and you need to find the latency. I like to break it down: either something is doing more than it should, or something is taking longer than it should to do it. The client makes the request. The application talks to the database. The database responds to the application. The application notifies the client. Hopefully your setup is comparable, or this is merely a simplification of it.
      My network guys are infamous for saying "It's never the network." I usually take that at face value, looking to disprove it only after all other possibilities have been exhausted.
      The hardware guys have one of two opinions. The more common is denial: they'll tell you the hardware is more than sufficient to handle everything that is asked of it. The second is an admission of guilt: of course the hardware is insufficient; it's old and wasn't spec'd out right.
      The database guy, that's my normal role, is the most suspicious. If he can acquit his hardware of guilt, he still has a few other possible culprits. Server settings such as memory allocation, file sizes and growth settings, database maintenance, and poor indexing can all cause performance issues. He needs to make sure the data call is only asking for what is truly needed at the time. He may have a lot of ground to cover to find the true source of the issue.
      The developer will also normally deny any responsibility. The internal developer is trying to protect the perceived quality of his work, and therefore his job. The vendor's developers are protecting the company's product: you will be the only customer with this issue, and they will not be able to recreate it.
      Getting back to the fact that this is an interview question, you can assume that the issue is happening under close to ideal conditions. You may want to clarify that with the interviewer, but the question makes more sense when the obvious has been confirmed.
      The first step is to find where the wait (or waits) sits in the process. Running the process alongside a profiler trace will separate the application waits from the database waits.
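      In 2013 "a profiler" usually means SQL Server Profiler, but an Extended Events session gives the same separation with less overhead. A minimal sketch for SQL Server 2012 or later (the session name and the one-second filter are my assumptions):

    -- Capture statements running longer than one second (duration is in microseconds)
    CREATE EVENT SESSION LongStatements ON SERVER
    ADD EVENT sqlserver.sql_statement_completed (
        ACTION (sqlserver.sql_text, sqlserver.client_app_name)
        WHERE duration > 1000000)
    ADD TARGET package0.event_file (SET filename = N'LongStatements.xel');

    ALTER EVENT SESSION LongStatements ON SERVER STATE = START;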
      If the wait is determined to be in the database, then Activity Monitor will show the kinds of waits that are impacting the process. The wait type will narrow down whether the issue is blocking, file contention, poorly written SQL, or a number of other causes.
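      If you want the live view rather than Activity Monitor's summary, the same information is one DMV away. A quick sketch of what each stalled session is waiting on right now:

    -- Current waits, per session, with any blocker attached
    SELECT session_id, wait_type, wait_duration_ms,
           blocking_session_id, resource_description
    FROM sys.dm_os_waiting_tasks
    WHERE session_id > 50;  -- skips most system sessions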
      If the wait appears to be in the application, you need to decide whether it is hardware or code. The simplest way to look at it: if the hardware resources are running above recommended percentages, then the issue can be addressed there. If there is no such elevated use of CPU or memory, then the code is the likely issue.
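      Memory pressure, at least, can be sanity-checked without leaving the database. A sketch using the counters SQL Server exposes to PerfMon (what counts as a healthy page life expectancy depends on your RAM, so treat any threshold as an assumption):

    -- How long pages survive in the buffer pool, in seconds
    SELECT object_name, counter_name, cntr_value
    FROM sys.dm_os_performance_counters
    WHERE counter_name = 'Page life expectancy'
      AND object_name LIKE '%Buffer Manager%';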
      This may be too general to quote as an answer in an interview. Hopefully you can expand on it some and make the point. I hope to revisit some of these ideas in the future.