
hen you need to analyze numerical data, percentiles are handy for understanding the distribution. Oracle has been providing analytical functions such as
NTILE,
PERCENTILE_CONT, and
RANK since version 8, while SQL Server has been playing catch up. With SQL Server 2005, Microsoft finally added
NTILE,
RANK,
DENSE_RANK, and
ROW_NUMBER to SQL Server's T-SQL functions. Unfortunately, the SQL Server 2005 still does not provide
PERCENTILE_CONT, and neither will SQL Server 2008. They do, however, feature the new SQL Server Common Table Expression (CTE), which you can use to calculate percentiles.
This article presents a solution for calculating percentiles in SQL Server 2005 using SQL Server CTE. Based on a solution described in Joe Celko's SQL For Smarties, the idea is to calculate the percentile (percent rank) of each row; if one of the rows has the percentile value that you are looking for, you return it. Otherwise, you perform an interpolation between a value that is slightly more than the desired percentile and one that is slightly less (see Figure 1).
Figure 1. Calculating the Percentiles Based on Ranking: You interpolate between a value that is slightly more than the desired percentile and one that is slightly less. |
Calculating Percentiles with SQL Server CTE
This article's percentiles calculations are based on this fictitious table of employee salaries:
CREATE TABLE employee
(
ID INT IDENTITY(1,1) PRIMARY KEY,
dept VARCHAR(30),
emp_name VARCHAR(80),
salary FLOAT
)
INSERT INTO employee(dept, emp_name, salary) values('PRODUCTION', 'Janice', 2000)
INSERT INTO employee(dept, emp_name, salary) values('PRODUCTION', 'Allan', 3500)
INSERT INTO employee(dept, emp_name, salary) values('PRODUCTION', 'Peter', 3300)
INSERT INTO employee(dept, emp_name, salary) values('PRODUCTION', 'Sam', 1100)
INSERT INTO employee(dept, emp_name, salary) values('SALES', 'Jerry', 2300)
INSERT INTO employee(dept, emp_name, salary) values('SALES', 'Robert', 5540)
INSERT INTO employee(dept, emp_name, salary) values('SALES', 'Larry', 4200)
INSERT INTO employee(dept, emp_name, salary) values('SALES', 'Gordon', 2200)
INSERT INTO employee(dept, emp_name, salary) values('SALES', 'Larry', 1120)
INSERT INTO employee(dept, emp_name, salary) values('SALES', 'Jane', 4120)
INSERT INTO employee(dept, emp_name, salary) values('MARKETING', 'Holly', 1300)
INSERT INTO employee(dept, emp_name, salary) values('MARKETING', 'John', 1000)
INSERT INTO employee(dept, emp_name, salary) values('MARKETING', 'Hunter', 6330)
INSERT INTO employee(dept, emp_name, salary) values('MARKETING', 'Yoda', 5500)
INSERT INTO employee(dept, emp_name, salary) values('MARKETING', 'Jackie', 2280)
INSERT INTO employee(dept, emp_name, salary) values('MARKETING', 'Rachel', 1490)
INSERT INTO employee(dept, emp_name, salary) values('ACCOUNTING', 'Seymour', 990)
INSERT INTO employee(dept, emp_name, salary) values('ACCOUNTING', 'Yoda', 2930)
INSERT INTO employee(dept, emp_name, salary) values('ACCOUNTING', 'Dave', 5400)
GO
Here is the code to calculate the 75th percentile of employee salaries using the above table data:
DECLARE @percentile FLOAT
SELECT @percentile = .75;
WITH emp_sal(salary, prev_rank, curr_rank, next_rank) AS
(
SELECT salary,
(ROW_NUMBER() OVER ( ORDER BY salary ) - 2.0) / ((SELECT COUNT(*) FROM employee) - 1) [prev_rank],
(ROW_NUMBER() OVER ( ORDER BY salary ) - 1.0) / ((SELECT COUNT(*) FROM employee) - 1) [curr_rank],
(ROW_NUMBER() OVER ( ORDER BY salary ) + 0.0) / ((SELECT COUNT(*) FROM employee) - 1) [next_rank]
FROM employee
)
SELECT
CASE
WHEN t1.salary = t2.salary THEN t1.salary
ELSE t1.salary + (t2.salary - t1.salary) * ((@percentile - t1.curr_rank) / (t2.curr_rank - t1.curr_rank))
END
FROM emp_sal t1, emp_sal t2
WHERE (t1.curr_rank = @percentile OR (t1.curr_rank < @percentile AND t1.next_rank > @percentile))
AND (t2.curr_rank = @percentile OR (t2.curr_rank > @percentile AND t2.prev_rank < @percentile))
This query uses SQL Server CTE along with the new ROW_NUMBER() function. The emp_sal CTE at the top produces a table that ranks each row. The prev_rank and next_rank columns help locate a row for interpolation when the desired percentile does not fall onto one of the rows (see Figure 2).
Figure 2. Salary Values and Ranking: The two highlighted rows will be used to calculate the 75th percentile. |
The SELECT in the main body of the query extracts the two closest values and performs an interpolation. In this case, the formula that calculates the 75th percentile (using the values 4120 and 4200) is as follows:
75th percentile = 4120 + (4200 - 4120) * (0.75 - 0.7222) / (0.7778 - 0.7222) = 4160