Tuesday 24 March 2009

Faster data fetching from multiple SQL tables with multiple base elements to analyze

What this method gives you:
  1. This method doesn't require you to massively rewrite the code.
  2. The code is relatively easy to understand.
  3. The speed improvements are simply massive (30-50 times faster code, up to 200 times in some scenarios).
  4. And best of all: you can implement this method step by step, one subquery at a time, and observe the speed improvements as you go.
Note that I don't read an insane amount of books. That said, I haven't yet seen this method in a book and therefore don't know its "official" name. But since my fellow programmers never know what I mean when I say its name, I'll state for the record that I call this method "synchronized queries". The name stems from the fact that at fetch time, some synchronization must be performed among the queries involved.

Also, this method should be considered intermediate level at best. It's not rocket science. But since I've implemented it in a fair number of different code segments, I guess it isn't the typical programmer's first line of thought.

On to the method then:

If you've ever programmed a business-type application, like MRP, CRM or some such, you've most certainly had to implement an analysis that took multiple base documents as input and then fetched the data about those documents from a series of database tables.

The most common scenario that comes to mind is order analysis.
To analyze an order, one has to fetch data from multiple tables, such as the following:
Purchases (materials, services)
Shipping documents (finished products)
Invoices (finished products)
And then a whole myriad of production data that you entered to keep track of how the ordered product was made:
Work orders
Material issues
Semi-finished and finished product completion
Worktime + tools usage
etc, etc.

In such an analysis, all these tables have the requested order(s) in common, meaning that for each table you can by some means get to the order id: either it's in the table already, or some other table creates the connection between the two (material issues, for example, may only reference a work order, which in turn carries the order id).

The most common solution for such an analysis that I have seen goes something like this:

for each order do
    analyze purchases
    analyze shipping
    analyze invoices
    analyze work orders
    analyze material issues
    ...
next

As you can see, this solution is OK for situations where you only have to analyze one order and its respective details. You will have one compile for the order and one compile for each detail.
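To make the cost concrete, the naive inner step usually looks something like this in client code. This is only a sketch: the SQL wrapper class and its Exec/ForEach/asInt methods are simplified stand-ins (modelled loosely on the snippets further down) for whatever database layer you actually use.

#include <string>

// Hypothetical minimal DB wrapper, standing in for your real database
// layer: Exec() compiles and executes a statement, ForEach() fetches
// the next row, asInt() reads a column of the current row.
struct SQL
{
    bool Exec(const std::string& statement);
    bool ForEach();
    int asInt(const char* column);
};

void AnalyzePurchases(SQL& sqlPurchases, int orderId)
{
    // The statement text changes with every order, so the server has
    // to compile (prepare + execute) it anew on every single call.
    sqlPurchases.Exec(
        "select f1, f2, f3, f4 from purchases"
        " where purchases.orderid = " + std::to_string(orderId));

    while (sqlPurchases.ForEach())
    {
        // Do your stuff with the fetched purchase record
    }
}

Multiply that by five or ten detail tables, and the number of compiles grows linearly with the number of orders.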

However, when you are doing, for example, a yearly analysis, it quickly becomes obvious that we have potentially thousands of orders and therefore thousands of compiles for the poor SQL server.

A compile (statement execute) is often the most time-consuming part of fetching data from an SQL database. Especially if the query itself is complex and returns relatively few records. Having no indexes on the filter and order fields helps a lot too :)

So the above example behaves optimally when there is one order to analyze and decreases in efficiency with each following order, since all subqueries have to be recompiled for every subsequent order.

Knowing that a compile is very expensive, the best way to go should be quite obvious: just decrease the number of compiles. But how can one do that? If I make the queries return data for all requested orders at once, that data will just be mixed all together, making the analysis very hard to do, especially if we consider the volume of all that data. There's no way it can all fit into memory to be retrieved when needed.

Well, the solution is quite simple:
In order to keep the actual analysis routine as close to the original algorithm as possible and still reap the benefits of fewer compiles, one needs to fulfill only one additional condition: all queries must be sortable by order number.
Since SQL has an adequate clause for that, namely "order by", this condition is easy to meet.

There is one additional issue that typically needs solving in such situations: normally you won't get a list of orders to analyze. Rather, the filter will be more like: orders from - to (date), order state, seller, etc.
So, in order to solve this problem, we have to somehow get the order numbers, since these filter fields will not be stored in every detail table.
The solution is quite simple: we just create a temporary table from the original order query with a select into statement. We can then use this temporary table as a join in all subsequent detail queries.

So, the example for one such detail table will be:

Create the temp table:
select orderid, f1, f2, f3, f4, ...
into #myTempOrders
from Orders
where c1 = xxx and c2 = yyy...

Make the detail query for purchases:
select f1, f2, f3, f4, ...
from purchases
join #myTempOrders on purchases.orderid = #myTempOrders.orderid
order by purchases.orderid

This purchases query will return data for all orders being analyzed if data for them exists.
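One thing the pseudocode glosses over: the outer "for each order" loop needs a master cursor of its own, and the temp table doubles nicely as that master query, again sorted by order id. Here is a sketch of the one-time setup, reusing the hypothetical SQL wrapper from the earlier sketch (the xxx/yyy placeholders mirror the ones above; note that #temp tables are per-connection on SQL Server, so all of these statements must run on the same connection):

SQL sqlOrder, sqlPurchases;

// One compile: fill the temp table from the original order filter
sqlOrder.Exec(
    "select orderid, f1, f2, f3"
    " into #myTempOrders"
    " from Orders"
    " where c1 = xxx and c2 = yyy");

// One compile: the master cursor over the orders being analyzed
sqlOrder.Exec(
    "select orderid"
    " from #myTempOrders"
    " order by orderid");

// One compile per detail table, no matter how many orders there are
sqlPurchases.Exec(
    "select f1, f2, f3, f4 from purchases"
    " join #myTempOrders on purchases.orderid = #myTempOrders.orderid"
    " order by purchases.orderid");

After this point, not a single further compile is needed for the whole analysis.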

So to complete the analysis, our new code looks like this:
Create the temp table (if needed)
Compile the master and all detail queries
for each order do
    analyze purchases
    analyze shipping
    analyze invoices
    analyze work orders
    analyze material issues
    ...
next

Surprising how the code is actually the same, isn't it? :)
The only additions are the initial compiles, and those are exactly what speeds our analysis up.

But there is one significant difference:
In our initial code "analyze" meant: compile the detail query, fetch the records, do what you have to do with them.
In our new code "analyze" means: synchronize the master and detail query, fetch the records, do what you have to do with them.

What has changed is "compile" versus "synchronize".
A compile would look like this (the query from before, slightly remodelled):
select f1, f2, f3, f4, ...
from purchases
where purchases.orderid = xxx

A synchronize, on the other hand, looks like this:

int CmpRec(SQL& sqlOrder, SQL& sqlAny)
{
    // Compares the "key", that is the order number, in both datasets
    if (sqlOrder.asInt("orderid") > sqlAny.asInt("orderid")) return -1;
    else if (sqlOrder.asInt("orderid") == sqlAny.asInt("orderid")) return 0;
    else return 1;
}

// Advances the detail dataset until an equal or greater order id is fetched
while (CmpRec(sqlOrder, sqlPurchases) < 0)
    if (!sqlPurchases.ForEach()) break;

while (CmpRec(sqlOrder, sqlPurchases) == 0)
{
    // Do your stuff
    if (!sqlPurchases.ForEach()) break;
}
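Putting it all together, here is how the whole analysis loop might look for one detail table, continuing the setup sketch from earlier. Still a sketch over the same hypothetical wrapper; I'm assuming ForEach() positions the cursor on its first call and returns false once the data runs out:

// sqlOrder and sqlPurchases were each compiled once, as shown above,
// both sorted by orderid and running on the same connection.
sqlPurchases.ForEach(); // position the detail cursor on its first row

while (sqlOrder.ForEach()) // for each order do
{
    // Synchronize: advance the detail cursor up to the current order
    while (CmpRec(sqlOrder, sqlPurchases) < 0)
        if (!sqlPurchases.ForEach()) break;

    // Fetch: consume exactly this order's purchase records
    while (CmpRec(sqlOrder, sqlPurchases) == 0)
    {
        // Do your stuff
        if (!sqlPurchases.ForEach()) break;
    }

    // ...repeat the synchronize + fetch pair for shipping, invoices,
    // work orders, material issues, and so on
}

Each detail query is compiled exactly once, whether the filter catches ten orders or ten thousand; the per-order work shrinks to moving cursors over already sorted data.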

Have fun creating faster programs :)