Wednesday, November 20, 2013

Database Troubleshooting: The playbook for outside the box

     Some time back  a developer from one of our products was brought into help resolve a locking issue we were experiencing. I remember him perusing through much of the server settings and muttering they were just about all default.
     It needs to be said for the 6 or so databases. Only one of which that was larger than 100 GB. This database was the collected data of 15 years of using one version of the application or another. The server had 128GB of RAM and 32 CPU.   I can confidently and callously say that even just out of the box , my server was more than able to handle anything that was asked of it.
      Most of the times we think about database and application issues as long running queries. Occasionally a query seems long running or an application freezes up because another call is blocking it. It seems logical that the offending query is itself long running. We all have our ways of identifying the offender. From the simplest Activity Monitor to the more involved D.M.V queries. My preferred method is something I modified from a Pinal Dave SqlAuthority article. I am a big fan of his articles and many of my searches often refer back to his articles.
      On this particular occasion the block was showing a query completely unrelated to the growing number of blocked users.  All of the users were attempting to remove purchase order request lines created by orders that were no longer needed. The blocker was inactive in the application. Another tool in your immediate reach must be sp_lock.  I created a script that passed the parent blocker to the sp_lock call. This call showed me in no uncertain terms that the inactive lead blocker was holding a complete lock of the PO Request  table. This is what was preventing the other users from completing their order modifications. The application was trying to delete records from the table.
       I was convinced there was some connection being left open or some transactions not being committed. The developer did not agree.We were the only client complaining about this issue. Maybe we were the only client mismanaging our p.o. requests.
       "Maybe" isn't an I.T. word. That is perhaps the most memorably sentiment of the worst I.T. Manager I ever had the displeasure of reporting to.
        Some months later the software company would quietly in the back pages of a release note mention an issue with an open dataset issue in one of the forms. This was as close to a confession as I was going to get.
   For the interim I had to lessen the impact of the blocking . I achieved this by removing the P.O. request lines proactively using other considerations. The less often someone confirming an order had request lines to delete the less likely it was to be blocked by another user who was inactive .
   I considered changing flags so the server would not escalate locks on memory contention and by number of locks. In the end I did not , I felt that I hadn't justified that this would have a real impact on the problem. I did change my scheduled restart of the server from a bi-weekly event to a weekly event.
   A restart to me is like punting. You haven't identified the problem. You have just thrown the good away with the bad. If there is a point to this pseudo-rant, it is that even when your resources have pointed you in pretty certain terms to the cause. If the other parties involved,  outside vendors or not, plead innocence you still have to MacGyver the situation.
 
     

No comments:

Post a Comment