Saturday, 10 March 2012

Increasing transaction performance and getting rid of deadlocks caused by high transaction count

If you're seriously into database applications, you are sure to have bumped into a customer that was really big. Like so big that their server reached a point where it was no longer capable of serving all the transactions requested by the software / users. Personally reaching this point is a genuine nightmare for me: monday everything is fine, everybody works normally and we are happy. Then they install that damned robot / new warehouse / whatever which increases the transaction count for the system. So tuesday all hell breaks loose: everything people try to do takes ages and more often than not reports the deadlock victim error. Naturally, we are no longer happy.

Solving such situations requires expensive performance tuning software, lots of analysis and optimizing query by query so that you can lift at least some burden off the poor server which is of course still struggling, but *just* not so much that everything would halt any more. To quickly recap your performance tuning options (the real purpose of this article lies beyond this list :-) ):
  1. Indexes
    1. Adding a good clustered index for that most used aggregate query will work wonders usually.
    2. For less important queries add normal indexes. 
    3. But beware: adding too many indexes kills your insert / update performance because the SQL server needs to update too many things for one addition or correction in the database.
    4. Also bear in mind that having single - column indexes will not really help much with multi-column where or group by statements. Sure - it helps a little, but nowhere near what a proper multi - segment index would do for you here
    5. Recent SQL servers also have the capability to extract all the data for a query just from an index. So if you have an (aggregate) query which only uses a few of the table columns, usually creating an index just for this query is not such a bad idea, especially if you're running said query all the time
  2. Locking
    1. You should be very careful what you lock and when
    2. The most common culprit are usually configuration tables - if you store application config in a table. This table is accessed for pretty much anything all the time and moving all these accesses out of transactions or using with(nolock) hints helps tremendously
    3. The next most common lock targets are link tables - the tables that establish data relations. Like this delivery item was originally requested in order #xxx. Depending on application design you will most likely have at least one such table in your system. Also needs special care, but unfortunately can't really be optimized since it must be kept in the transaction for data consistency.
    4. Anyway, you should let the SQL server manage locks at a level as fine as possible so that there are as few as possible lock races
  3. Query optimization
    1. Can't really say much about this since it's query - dependant, but I have seen some really yucky queries in my day
    2. Usually a simple rewrite of a nasty query will decrease query execution time multiple - fold
    3. Take care not to use multiple (basically equal) subqueries - rather use them in a join (and a group by)
    4. Joins are (almost) always better than subqueries
    5. Dynamic views (select from select) are also better to be avoided if possible though sometimes they are the basis for your query
    6. Once you detect a query that needs optimization, you optimize it and make sure indexes in the respective tables cover your query's needs
    7. General rule of thumb is that there should be absolutely no table scans executing the query and possibly no index scans. Scans take time, seeks do not!
    8. If you can't think of a way to optimize a really bad query, drop me a mail. For a modest fee I will do this for ya }:-)) (Seriously though: see the donate button at the bottom of the page and don't write without clicking it)
This is about it. The above procedures will usually give you good results but they take many hours of analysis and optimization and thus take valuable time of your programmers who could be better spending their time creating new (bad) queries for new software functionality.

There are also two programmatic options you have at your disposal which will do a lot more general good with less effort for your programmers - since they affect all transactions ever issued throughout your system. The only prerequisite here is that your application is well designed and you actually have access to code that starts / finishes / aborts transactions:

The first option is to simply catch deadlock errors and restart the transaction automatically. Aside from taking even more time, your user will never know that a deadlock even occurred. However, the SQL server will know since this will put even more stress on it - and don't forget: you're having the deadlocks because you placed too much stress on the server in the first place.

The second option is to serialize your transactions. This one is the purpose of this article. You see, until yesterday we at the company I work for always went for performance tuning / optimization. However yesterday I developed a really simple solution which I didn't really believe would solve anything. I just hoped it would do *some* good for the poor server at our client's. Boy was I in for a treat :-)
I was at the client's two days ago and I can testify that their server was just way too loaded. I have personally started 10 transactions. Of those, 5 went straight through in less than a second (normal time), 3 deadlocked after more than 30 seconds (definitely bad) and 2 went through in more than 30 seconds (meaning somebody else probably got the deadlock). Disk time was at 100% all the time and CPU was only low because the CPU was waiting for the disks. All that on one of the most powerful database servers I have ever seen deployed at our customers. The situation was just horrible. All because we installed some automatic data transformations to support their new huge automatic warehouse and a few of packaging robots.
Once I installed the serialization code, I first thought the entire system crashed. There was no activity for two minutes or so. Only after this initial period where server was so stumped that it simply couldn't do anything, things started rolling:Transactions started to come in in hundreds per minute as all the automated jobs started doing their work that failed in their previous attempts due to deadlocks. For the first fifteen minutes all I could look at was that flood of transactions being mostly nicely serialized. But then everything just stopped again - revealing that there are actually only about 20 transactions per minute. Once I reached this state, the server was at practically 0% CPU usage and 0% disk time - just smooth sailing from now on. Total, entire and unyielding success!!!!
So it turns out that it's not the server that is too weak. It's just that SQL server tries to accomodate all users by running multiple transactions in parallel, but that leads to locking and lock waiting and transactions that should take milliseconds start taking seconds. During that time new transactions come worsening the situation further - until the entire thing slows down to a deadlock infested crawl.

OK, enough of blabbing, here's what I did:
The first prerequisite was that I needed to add some code just before calls to start / stop transactions. All that code does is to call a stored procedure on the SQL server just before it initiates a transaction and again calls the same procedure just after it finishes / aborts the transaction. This is super important: doing this inside a transaction would just kill the purpose because the locking would be done on the scheduling table as well.
Anyway, really simple, really easy. The stored procedure is designed so that it will return a "transaction ID" as well as "permission" to start transaction. The "permission" part of the result is actually number of transactions that will be allowed to start before this one so you can even notify the user while this particular transaction is waiting to be started.
The sample function itself has three possible input value forms:
  1. "Announcement" null - I want to start a new transaction, give me an ID and queue #. When Queue # reaches 0 I can start the transaction
  2. "Is it my turn yet?" transaction ID (returned by previous anouncement call) - I'm waiting to be allowed to start. What's my queue #?
  3. "I'm done" negative transaction ID - I'm finishing / aborting transaction

Of course you can do this any way you like, my particular implementation is done like described, but it's in no way perfect. That's the beauty of this method: you can make your transaction scheduler any way you want as long as you serialize at least some transactions.

The general concept of this solution is this:
First we assume that all transactions are equal and that they should be executed in order (as opposed to simultaneously) to lessen the locking stress on the server. Then we eliminate any transactions that appear dead and all transactions that are classified as long taking. The method assumes that most transactions are very short and attempts to serialize those while it leaves the user-prompt infested ones take as long as they want. As explained above, it turns out that the locking stress really is stressful on the server. I could not believe the results when I saw what was going on, but the final results show that serializing the fast transactions helps more than just a lot.

A more detailed description:
Before starting a transaction, we add its announcement into a special table. Since we're doing this outside the transaction, the table is available to all at all times. A stored procedure makes sure that the entire thing runs on the server and is fast because of this. I'm running the second form (is it my turn yet?) every 100 milliseconds (on each client) and have observed no ill effects whatsoever. In order to keep the entire thing fast, I remove all records older than 15 minutes so as to keep the table size as small as possible.
The stored procedure makes sure the transaction is inserted into the queue (announcement), checks current queue # (announcement and is it my turn yet) and marks transaction as still alive and also marks the transaction as finished when called with negative transaction ID (I'm done).
Just in case some transactions take too long, the procedure will ignore them and allow other transactions to run in parallel and also if one transaction is no longer checked upon, it will also be ignored. This approach makes sure that a single transaction can't be blocking the entire system just because it's slow - small transactions just run and hope that the large transaction won't block them (turns out this is extremely rarely the case).
The client registers the transaction using the first call form and then waits in a loop until the Queue # drops to 0 for the transaction. Then transaction is ran as it always was and when it is finished, the client also reports that it is finished.

So if you now already modified your database access library to include stored procedure call as described above, all that is missing is the stored procedure and related call. You should now be running the new application just to see that nothing changed. Once the stored procedure is added, the magic should begin. I have included the two statistics selects at the bottom just for the nice side-benefit (transaction statistics) :-)
This solution is in MS SQL T-SQL language, but I suppose a conversion to any other database should be easy.

The helper table:
create table transaction_serializer (
  id int identity(1,1) primary key,
  time_inserted datetime null,
  time_queried  datetime null,
  time_executed datetime null,
  time_finished datetime null
create index transaction_serializer_i1 on transaction_serializer (time_finished, id)
create index transaction_serializer_i2 on transaction_serializer (time_queried)

The stored procedure:
create procedure serialize_transaction(@transid int)
as begin 
  set nocount on
  set ansi_warnings off
  declare @transok int

  --Remove 15 minute old records to keep this scheduler table small and fast
  delete from transaction_serializer where time_inserted < dateadd(mi, -15, getdate())
  if @transid < 0 begin
    --Transaction done
    update transaction_serializer set time_finished = GETDATE() where ID = -@transid
    set @transok = 0
  else begin
    if @transid is null
      -New transaction
      insert into transaction_serializer (time_inserted) values(GETDATE())
      set @transid = @@IDENTITY
    -Determine queue# for transaction
    select @transok = COUNT(*)
      from transaction_serializer
     where time_finished is null and id < @transid --How many transactions in queue before ours
       and isnull(datediff(ms, time_executed, GETDATE()),0) < 2000 --Eliminate long taking transactions
       and isnull(datediff(ms, time_queried,  GETDATE()),0) < 2000 --Eliminate seemingly dead transactions

    if @transok = 0 begin
      --This transaction first, let's start it
      update transaction_serializer set time_executed = GETDATE(), time_queried  = GETDATE() where ID = @transid
    else begin
      --Just mark the transaction still alive
      update transaction_serializer set time_queried  = GETDATE() where ID = @transid
  --transid contains transaction id which the client then uses for querying queue # and marking transaction as done
  --transok is 0 - if the transaction can be started or positive number which tells us that our transaction is still waiting
  select @transid, @transok

The additional selects to retrieve statistics:
select *, datediff(ms, time_executed, time_finished), datediff(ms, time_inserted, time_executed) from transaction_serializer
select count(*) as NumTrans, 
       max(datediff(ms, time_executed, time_finished)) as MaxTransTime, 
       avg(datediff(ms, time_executed, time_finished)) as AvgTransTime, 
       sum(case when datediff(ms, time_executed, time_finished) > 2000 then 1 else 0 end) as NumOfLongTransactions,
       max(datediff(ms, time_inserted, time_executed)) as MaxWaitTime,
       max(datediff(ms, time_inserted, time_executed)) as AvgWaitTime
  from transaction_serializer

Now you know that implementing this just saved your ass with the client. So for your convenience you can express your gratitude by clicking the following button. Don't be shy, you know I just saved your company thousands in costs of performance tuning :-) In case you're feeling stingy at least click a link in the banner on the right side of this page.

  For possible further inquiries drop me a mail at jure dot erznoznik at gmail dot com.


No comments: