Login | Register   
LinkedIn
Google+
Twitter
RSS Feed
Download our iPhone app
TODAY'S HEADLINES  |   ARTICLE ARCHIVE  |   FORUMS  |   TIP BANK
Browse DevX
Sign up for e-mail newsletters from DevX


advertisement
 

Calculate Percentiles with SQL Server 2005

Learn one of the techniques for calculating percentiles with SQL Server 2005 using the new SQL Server Common Table Expression and the latest ROW_NUMBER function.


advertisement
hen you need to analyze numerical data, percentiles are handy for understanding the distribution. Oracle has been providing analytical functions such as NTILE, PERCENTILE_CONT, and RANK since version 8, while SQL Server has been playing catch up. With SQL Server 2005, Microsoft finally added NTILE, RANK, DENSE_RANK, and ROW_NUMBER to SQL Server's T-SQL functions. Unfortunately, the SQL Server 2005 still does not provide PERCENTILE_CONT, and neither will SQL Server 2008. They do, however, feature the new SQL Server Common Table Expression (CTE), which you can use to calculate percentiles.

This article presents a solution for calculating percentiles in SQL Server 2005 using SQL Server CTE. Based on a solution described in Joe Celko's SQL For Smarties, the idea is to calculate the percentile (percent rank) of each row; if one of the rows has the percentile value that you are looking for, you return it. Otherwise, you perform an interpolation between a value that is slightly more than the desired percentile and one that is slightly less (see Figure 1).

Click to enlarge

Figure 1. Calculating the Percentiles Based on Ranking:
You interpolate between a value that is slightly more than the desired percentile and one that is slightly less.



Calculating Percentiles with SQL Server CTE
This article's percentiles calculations are based on this fictitious table of employee salaries:

CREATE TABLE employee ( ID INT IDENTITY(1,1) PRIMARY KEY, dept VARCHAR(30), emp_name VARCHAR(80), salary FLOAT ) INSERT INTO employee(dept, emp_name, salary) values('PRODUCTION', 'Janice', 2000) INSERT INTO employee(dept, emp_name, salary) values('PRODUCTION', 'Allan', 3500) INSERT INTO employee(dept, emp_name, salary) values('PRODUCTION', 'Peter', 3300) INSERT INTO employee(dept, emp_name, salary) values('PRODUCTION', 'Sam', 1100) INSERT INTO employee(dept, emp_name, salary) values('SALES', 'Jerry', 2300) INSERT INTO employee(dept, emp_name, salary) values('SALES', 'Robert', 5540) INSERT INTO employee(dept, emp_name, salary) values('SALES', 'Larry', 4200) INSERT INTO employee(dept, emp_name, salary) values('SALES', 'Gordon', 2200) INSERT INTO employee(dept, emp_name, salary) values('SALES', 'Larry', 1120) INSERT INTO employee(dept, emp_name, salary) values('SALES', 'Jane', 4120) INSERT INTO employee(dept, emp_name, salary) values('MARKETING', 'Holly', 1300) INSERT INTO employee(dept, emp_name, salary) values('MARKETING', 'John', 1000) INSERT INTO employee(dept, emp_name, salary) values('MARKETING', 'Hunter', 6330) INSERT INTO employee(dept, emp_name, salary) values('MARKETING', 'Yoda', 5500) INSERT INTO employee(dept, emp_name, salary) values('MARKETING', 'Jackie', 2280) INSERT INTO employee(dept, emp_name, salary) values('MARKETING', 'Rachel', 1490) INSERT INTO employee(dept, emp_name, salary) values('ACCOUNTING', 'Seymour', 990) INSERT INTO employee(dept, emp_name, salary) values('ACCOUNTING', 'Yoda', 2930) INSERT INTO employee(dept, emp_name, salary) values('ACCOUNTING', 'Dave', 5400) GO

Here is the code to calculate the 75th percentile of employee salaries using the above table data:

DECLARE @percentile FLOAT SELECT @percentile = .75; WITH emp_sal(salary, prev_rank, curr_rank, next_rank) AS ( SELECT salary, (ROW_NUMBER() OVER ( ORDER BY salary ) - 2.0) / ((SELECT COUNT(*) FROM employee) - 1) [prev_rank], (ROW_NUMBER() OVER ( ORDER BY salary ) - 1.0) / ((SELECT COUNT(*) FROM employee) - 1) [curr_rank], (ROW_NUMBER() OVER ( ORDER BY salary ) + 0.0) / ((SELECT COUNT(*) FROM employee) - 1) [next_rank] FROM employee ) SELECT CASE WHEN t1.salary = t2.salary THEN t1.salary ELSE t1.salary + (t2.salary - t1.salary) * ((@percentile - t1.curr_rank) / (t2.curr_rank - t1.curr_rank)) END FROM emp_sal t1, emp_sal t2 WHERE (t1.curr_rank = @percentile OR (t1.curr_rank < @percentile AND t1.next_rank > @percentile)) AND (t2.curr_rank = @percentile OR (t2.curr_rank > @percentile AND t2.prev_rank < @percentile))

This query uses SQL Server CTE along with the new ROW_NUMBER() function. The emp_sal CTE at the top produces a table that ranks each row. The prev_rank and next_rank columns help locate a row for interpolation when the desired percentile does not fall onto one of the rows (see Figure 2).

Click to enlarge

Figure 2. Salary Values and Ranking:
The two highlighted rows will be used to calculate the 75th percentile.

The SELECT in the main body of the query extracts the two closest values and performs an interpolation. In this case, the formula that calculates the 75th percentile (using the values 4120 and 4200) is as follows:

75th percentile = 4120 + (4200 - 4120) * (0.75 - 0.7222) / (0.7778 - 0.7222) = 4160



Comment and Contribute

 

 

 

 

 


(Maximum characters: 1200). You have 1200 characters left.

 

 

Sitemap
Thanks for your registration, follow us on our social networks to keep up-to-date