In
SQL Server 2005/2008 Microsoft introduced lots of new function like Ranking of results has many benefits
and it is used in several scenarios of T-SQL.
Most common example of ranking is
to ranking Sales person by their sales. This type of ranking requires you to
assign same rank to all those Sales persons who generated the same sales amount, this is called tie.
Ranking
Function: Ranking
functions return a ranking value for each row in a partition. Depending on the
function that is used, some rows might receive the same value as other rows.
Ranking functions are non-deterministic.
SQL Server 2008 provides the following
ranking functions:
· ROW_NUMBER: Returns sequential
number of a row within a partition of a result set, starting at 1 for first row
in each partition.
·
RANK: Returns the rank
(bigint) of each row within the partition of a result set.
·
DENSE_RANK: Same as RANK but
without any gap in the ranking number.
· NTILE: Distributes the
rows in an ordered partition into a specified number of groups. Or we can say
NTILE is actually used to assign a batch numbers or group numbers to the given
result.
Over
Clause: Determines the partitioning and ordering
of the rowset before the associated
ranking function is applied. Ranking or aggregate functions can be used the
OVER clause for partitioning and
ordering. The OVER clause cannot be used with the CHECKSUM aggregate function.
Each ranking function, ROW_NUMBER, DENSE_RANK, RANK, NTILE use the OVER clause.
You can now add aggregate functions to any SELECT (even without a GROUP BY
clause) by specifying an OVER() partition for each function.
SELECT *, ROW_NUMBER() OVER (ORDER BY CarName) AS RowNumber FROM [dbo].[CarInfo]
The above query orders the rows in the table by column CarName
and assigns sequential row numbers
(much like an identity column) to these rows.
In actual this plan scans the table, sorts it on column CarName, and then the sequence project operator assigns sequential numbers to each row.
--Now if you want row number on
grouping.
SELECT *, ROW_NUMBER() OVER (PARTITION BY EngineType ORDER BY CarName) AS RowNum FROM [dbo].[CarInfo]
Since column EngineType has two values
(Petrol and Diesel), this query breaks
the rows in the table into two groups and assigns row numbers separately to
each group and then sort the result by CarName.
Execute the query and see the
result.
RANK and
DENSE_RANK functions are similar - both in functionality and implementation as ROW_NUMBER.
The
difference is that while the ROW_NUMBER function assigns a unique (and
ascending) value to each row without regard for ties in the ORDER BY values,
the RANK and DENSE_RANK functions assign
the same value to rows that have the same ORDER BY value.
SELECT CarBodyType, CarCompany ,ROW_NUMBER() OVER (ORDER BY EngineType) AS 'Row Number'
,RANK() OVER (ORDER BY EngineType) AS 'Rank'
,DENSE_RANK() OVER (ORDER BY EngineType) AS 'DenRank'
,NTILE(4) OVER (ORDER BY EngineType) AS 'Quartile'
,EngineType FROM
[dbo].[CarInfo]
|
|
Here you can notice how
the ROW_NUMBER
function ignores the duplicate
values for column EngineType and assigns the unique integers from 1 to 11 rows while
the RANK and
DENSE_RANK functions assigns
the same value to each of the pairs of duplicate rows.
In above result have a
close look how the RANK function counts the duplicate rows even while it assigns the same value to each
duplicate row whereas the DENSE_RANK function does not count the duplicate rows.
For example, both the
RANK and DENSE_RANK functions assign a rank of 1 to the first 4 rows, but the RANK function assigns a rank of 5 to the fifth row - as it is the fifth row - while the DENSE_RANK function
assigns a rank of 2 to the fifth row
- as it contains the second distinct value for column EngineType.
Note that the
maximum value returned by the DENSE_RANK function is exactly equal to the
number of distinct values in column B.
NTILE function breaks an input set down into N equal sized groups. To determine how
many rows belong in each group, SQL Server must first determine the total number of rows in the input set.
If the NTILE function includes a PARTITION BY clause, SQL Server must compute the number of rows in each
partition separately. Once we know the
number of rows in each partition, we can write the NTILE function.
SELECT *, COUNT(*) OVER (PARTITION BY EngineType) AS Cnt FROM [dbo].[CarInfo]
SELECT *, NTILE(2) OVER (PARTITION BY EngineType ORDER BY CarCompany) AS NTile FROM [CarInfo]
SELECT *, NTILE(4) OVER (ORDER BY CarCompany) AS NTile FROM [CarInfo]
Over [Enhancement with SQL Server 2012]:
You saw the uses of Over Clause in the above examples. Microsoft introduced the OVER clause
in SQL Server 2005, its use in aggregate functions was pretty limited – you
could use the aggregate function with just the PARTITION BY clause.
Generally when we apply aggregate functions in over clause,
it will apply to all rows in that group/partition. Suppose, if we wants to restrict the number of rows, it
involves some complexity.
Assume, to calculate Next 5yrs total revenue for a company, the number of rows needs to be considered
are always
that row and next 4 rows.
In SQL Server 2012, with the enhancements to the OVER
clause following the ANSI standard, the new functionality allow us to use the ORDER BY clause in the OVER clause with aggregate functions and
also new ROWS and RANGE clauses were
introduced to limit rows.
The ORDER BY allow us define the order of rows
processing and the ROWS/RANGE
clauses put limits on the rows being processed in partition.
The row limit can be specified by several methods:
·
<unsigned integer> PRECEDING - fixed number of
preceding rows
·
CURRENT
ROW - representing current row being processed
·
UNBOUNDED PRECEDING
- all previous records
·
<unsigned integer> FOLLOWING - fixed number of
following rows
·
UNBOUNDED FOLLOWING
- all rows following current row
--In the following example, it will calculate the sales for Current year
+ next 4 year
SELECT BusinessEntityID, TerritoryID
,CONVERT(varchar(20),SalesYTD,1) AS SalesYTD
,DATEPART(yy,ModifiedDate) AS SalesYear
,CONVERT(varchar(20),SUM(SalesYTD)
OVER (PARTITION BY TerritoryID ORDER BY DATEPART(yy, ModifiedDate)
ROWS BETWEEN CURRENT ROW AND 4 FOLLOWING ),1) AS CumulativeTotal
FROM Sales.SalesPerson
WHERE TerritoryID IS NULL OR TerritoryID < 5;
--Suppose, to calculate sum of last 5 yrs revenue for every row, replace
line 7 code.
ROWS BETWEEN 4 PRECEDING
AND CURRENT ROW
--Suppose to calculate sum of all preceding years(Cumulative sum),
replace line 7 code.
--With UNBOUNDED PRECEDING the result is that the window starts at the
first row of the partition.
ROWS UNBOUNDED PRECEDING
--This is a static window which aligned to the partition
Rows Between UNBOUNDED
PRECEDING and UNBOUNDED FOLLOWING
--We can also use the Range as follows
RANGE BETWEEN UNBOUNDED PRECEEDING AND CURRENT ROW
RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
RANGE CURRENT ROW
The RANGE
clause can be only used with the UNBOUNDED
limit and CURRENT ROW.
The difference between ROWS and RANGE clause is - ROWS works with physical rows and RANGE works with range of rows based on the current row value in the terms of ORDER
BY clause.
This means that for ROWS clause the CURRENT
ROW represents the only current row being processed.
For RANGE the CURRENT ROW represents all
the rows with the same value in the fields specified in the ORDER BY clause
within current partition as the current row being processed.
So if we use RANGE and multiple rows have
the same rank in the terms of order within the partition, then all those rows
will represent current row.
When there is no ROWS/RANGE clause specified after the ORDER BY clause, then the
default RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW is used by SQL Server.
ROWS
or RANGE what should I use?
Here little
confusion to use either ROWS or RANGE, the answer comes from the definition how
the ROWS and RANGE clauses works. As described, ROWS works with each unique rows
and RANGE handles multiple rows with the same order position as current row.
So in case the combination of fields specified in the ORDER BY clause does not uniquely specify the order of
rows, then you should use RANGE, as the processing order of rows with the same
order position is not guaranteed.
In case the rows are uniquely identified, then ROWS should be used as there are
no rows with the same order in the partition.
No comments:
Post a Comment