lunes, 11 de enero de 2010

Converting rows in columns with T-SQL: PIVOT() and things that will brighten the day :)

Versión en castellano

Sometimes things that seem very complex have a easy solution, but this isn’t the first one that we think (yes, sound like Occam’s razor, but isn’t the same).

Well, the case is that I had a logging table in which some external process inserted the results of passing for different states, something like this:

Process State Timestamp

Process1 10 2008/12/21 12:22:05.320
Process2 10 2008/12/21 12:22:05.358
Process1 11 2008/12/21 12:22:06.210
Process1 12 2008/12/21 12:22:06.588
Process2 11 2008/12/21 12:22:10.100
Process2 12 2008/12/21 12:22:10.048
Process3 10 2008/12/21 14:30:05.358
Process3 11 2008/12/21 14:30:20.052

To monitor processes and identify performance problems at certain times: many processes at once, backups were began, the server runs out of memory, ... I get a query grouping by process and hour.

SELECT process, count(*) as requests, datepart(hh,Timestamp) as hour
FROM MiLOG with (nolock)
GROUP BY process, datepart(hh,Timestamp)

With this query we retrieve a result like this:

Process Requests Hour

Process1 10 12
Process1 2 13
Process2 5 12
Process2 3 13
Process3 25 14
Process3 4 15

Of course, this output is a bit complicated to treat because even though they are grouped by the process and the time, when you have lots of data becomes too cumbersome to check them, and just bringing the results to a spreadsheet you can see more clearly. Then I wondered, is there not a way to do it from T-SQL convert rows into columns, with a column for each hour and a single row for each process?

As always the first solution to a complicated query is a sub-query (I don't know why but it's true, someone would have to study it :P) but in this case was not a sub-query, but rather 24 sub-queries, all attacking the same table, and all using the function datepart (hh, TimeStamp). Surely you would get the expected result, but at the expense of a very poor performance.

As this situation is very common, I think that should have an easy way to fix it, so in an initial Google search did not surprise me the first results show the same resolution of sub-queries (for the principle that the first solution in a complicated query is a sub-query: P) and I was doing a huge query with a lot of sub-queries, JOINs to the same table. But on the first page of results I found a much simpler solution: the PIVOT clause (that is new SQL Server 2005, so don't try in earlier versions).

So, with a query like that:

WITH Processes(process, requests, hour)
SELECT process, count(*) as requests, datepart(hh,Timestamp) as hour
FROM MiLOG with (nolock)
GROUP BY process, datepart(hh,Timestamp)
SELECT * FROM Processes PIVOT(SUM(requests) FOR Hour IN ([0],[1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11],[12],[13],[14],[15],[16],[17],[18],[19],[20],[21],[22],[23])) AS Result
ORDER BY process

we have the expected result:

Process 0 1 2 ... 21 22 23

Process1 0 10 23 4 5 21
Process2 12 11 15 22 17 8
Process3 9 7 2 19 9 13

Note: PIVOT() statement works only with agregate functions (like COUNT, SUM, MAX, AVG, …).

Another interesting thing, is:

WITH Processes( process, requests, hour) AS(...)

that Specifies a temporary named result set, known as a common table expression (CTE).

Happy codding ;)

Extra ball: for this case I also thought of using CUBE, that is other summarizing tool provided by T-SQL.

No hay comentarios: