Wednesday 19 March 2014

Working with Ranking and Over in SQL Server 2012

In SQL Server 2005/2008 Microsoft introduced lots of new function like Ranking of results has many benefits and it is used in several scenarios of T-SQL. 

Most common example of ranking is to ranking Sales person by their sales. This type of ranking requires you to assign same rank to all those Sales persons who generated the same sales amount, this is called tie.
Ranking Function: Ranking functions return a ranking value for each row in a partition. Depending on the function that is used, some rows might receive the same value as other rows. Ranking functions are non-deterministic. 
SQL Server 2008 provides the following ranking functions:
·       ROW_NUMBER: Returns sequential number of a row within a partition of a result set, starting at 1 for first row in each partition.
·         RANK: Returns the rank (bigint) of each row within the partition of a result set.
·         DENSE_RANK: Same as RANK but without any gap in the ranking number.
·     NTILE: Distributes the rows in an ordered partition into a specified number of groups. Or we can say NTILE is actually used to assign a batch numbers or group numbers to the given result.
Over Clause: Determines the partitioning and ordering of the rowset before the associated ranking function is applied. Ranking or aggregate functions can be used the OVER clause for partitioning and ordering. The OVER clause cannot be used with the CHECKSUM aggregate function. 
Each ranking function, ROW_NUMBER, DENSE_RANK, RANK, NTILE use the OVER clause. 
You can now add aggregate functions to any SELECT (even without a GROUP BY clause) by specifying an OVER() partition for each function.
SELECT *, ROW_NUMBER() OVER (ORDER BY CarName) AS RowNumber FROM [dbo].[CarInfo]
The above query orders the rows in the table by column CarName and assigns sequential row numbers (much like an identity column) to these rows.  In actual this plan scans the table, sorts it on column CarName, and then the sequence project operator assigns sequential numbers to each row.
--Now if you want row number on grouping.
SELECT *, ROW_NUMBER() OVER (PARTITION BY EngineType ORDER BY CarName) AS RowNum FROM [dbo].[CarInfo]
Since column EngineType has two values (Petrol and Diesel), this query breaks the rows in the table into two groups and assigns row numbers separately to each group and then sort the result by CarName.  Execute the query and see the result.
RANK and DENSE_RANK functions are similar - both in functionality and implementation as ROW_NUMBER.
The difference is that while the ROW_NUMBER function assigns a unique (and ascending) value to each row without regard for ties in the ORDER BY values, the RANK and DENSE_RANK functions assign the same value to rows that have the same ORDER BY value. 

SELECT CarBodyType, CarCompany ,ROW_NUMBER() OVER (ORDER BY EngineType) AS 'Row Number'
,RANK() OVER (ORDER BY EngineType) AS 'Rank'
,DENSE_RANK() OVER (ORDER BY EngineType) AS 'DenRank'
,NTILE(4) OVER (ORDER BY EngineType) AS 'Quartile'
,EngineType FROM [dbo].[CarInfo]


Here you can notice how the ROW_NUMBER function ignores the duplicate values for column EngineType and assigns the unique integers from 1 to 11 rows while the RANK and DENSE_RANK functions assigns the same value to each of the pairs of duplicate rows. 
In above result have a close look how the RANK function counts the duplicate rows even while it assigns the same value to each duplicate row whereas the DENSE_RANK function does not count the duplicate rows
For example, both the RANK and DENSE_RANK functions assign a rank of 1 to the first 4 rows, but the RANK function assigns a rank of 5 to the fifth row - as it is the fifth row - while the DENSE_RANK function assigns a rank of 2 to the fifth row - as it contains the second distinct value for column EngineType
Note that the maximum value returned by the DENSE_RANK function is exactly equal to the number of distinct values in column B.
NTILE function breaks an input set down into N equal sized groups.  To determine how many rows belong in each group, SQL Server must first determine the total number of rows in the input set.  
If the NTILE function includes a PARTITION BY clause, SQL Server must compute the number of rows in each partition separately.  Once we know the number of rows in each partition, we can write the NTILE function.
SELECT *, COUNT(*) OVER (PARTITION BY EngineType) AS Cnt FROM [dbo].[CarInfo]
SELECT *, NTILE(2) OVER (PARTITION BY EngineType ORDER BY CarCompany) AS NTile FROM [CarInfo]
SELECT *, NTILE(4) OVER (ORDER BY CarCompany) AS NTile FROM [CarInfo]
Over [Enhancement with SQL Server 2012]:
You saw the uses of Over Clause in the above examples. Microsoft introduced the OVER clause in SQL Server 2005, its use in aggregate functions was pretty limited – you could use the aggregate function with just the PARTITION BY clause.
Generally when we apply aggregate functions in over clause, it will apply to all rows in that group/partition. Suppose, if we wants to restrict the number of rows, it involves some complexity. 
Assume, to calculate Next 5yrs total revenue for a company, the number of rows needs to be considered are always that row and next 4 rows.
In SQL Server 2012, with the enhancements to the OVER clause following the ANSI standard, the new functionality allow us to use the ORDER BY clause in the OVER clause with aggregate functions and also new ROWS and RANGE clauses were introduced to limit rows. 
The ORDER BY allow us define the order of rows processing and the ROWS/RANGE clauses put limits on the rows being processed in partition.

The row limit can be specified by several methods:
·         <unsigned integer> PRECEDING - fixed number of preceding rows
·         CURRENT ROW - representing current row being processed
·         UNBOUNDED PRECEDING - all previous records
·         <unsigned integer> FOLLOWING - fixed number of following rows
·         UNBOUNDED FOLLOWING - all rows following current row

--In the following example, it will calculate the sales for Current year + next 4 year
SELECT BusinessEntityID, TerritoryID
    ,CONVERT(varchar(20),SalesYTD,1) AS  SalesYTD
    ,DATEPART(yy,ModifiedDate) AS SalesYear
    ,CONVERT(varchar(20),SUM(SalesYTD)
OVER (PARTITION BY TerritoryID ORDER BY DATEPART(yy, ModifiedDate)
        ROWS BETWEEN CURRENT ROW AND 4 FOLLOWING ),1) AS CumulativeTotal
FROM Sales.SalesPerson
WHERE TerritoryID IS NULL OR TerritoryID < 5;

--Suppose, to calculate sum of last 5 yrs revenue for every row, replace line 7 code.
ROWS BETWEEN 4 PRECEDING AND CURRENT ROW

--Suppose to calculate sum of all preceding years(Cumulative sum), replace line 7 code.
--With UNBOUNDED PRECEDING the result is that the window starts at the first row of the partition.
ROWS UNBOUNDED PRECEDING

--This is a static window which aligned to the partition 
Rows Between UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING

--We can also use the Range as follows 
RANGE BETWEEN UNBOUNDED PRECEEDING AND CURRENT ROW
RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
RANGE CURRENT ROW
The RANGE clause can be only used with the UNBOUNDED limit and CURRENT ROW.
The difference between ROWS and RANGE clause is - ROWS works with physical rows and RANGE works with range of rows based on the current row value in the terms of ORDER BY clause.
This means that for ROWS clause the CURRENT ROW represents the only current row being processed.
For RANGE the CURRENT ROW represents all the rows with the same value in the fields specified in the ORDER BY clause within current partition as the current row being processed.
So if we use RANGE and multiple rows have the same rank in the terms of order within the partition, then all those rows will represent current row.
When there is no ROWS/RANGE clause specified after the ORDER BY clause, then the default RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW is used by SQL Server.
ROWS or RANGE what should I use? 
Here little confusion to use either ROWS or RANGE, the answer comes from the definition how the ROWS and RANGE clauses works. As described, ROWS works with each unique rows and RANGE handles multiple rows with the same order position as current row.

So in case the combination of fields specified in the ORDER BY clause does not uniquely specify the order of rows, then you should use RANGE, as the processing order of rows with the same order position is not guaranteed. 
In case the rows are uniquely identified, then ROWS should be used as there are no rows with the same order in the partition.

No comments:

Post a Comment