|SQL wasn't designed primarily for calculating statistics, but aggregate (GROUP BY) queries often demand them. Let's start with seven basic statistical terms:
POPULATION: the total number of values of interest.
SAMPLE: When working with large populations, it's often impractical to use the whole population; therefore, statisticians use a sample extracted from the entire population. The sample must be large enough to accurately represent the population.
MEAN: the average, which is calculated by dividing the sum of values by the number of values.
MODE: the value that occurs most frequently in the set that's being analyzed.
MEDIAN: the value that is halfway between the lowest and highest values in the set; that is, the value with an equal number of values above and below it.
VARIANCE: the amount of scatter in the sample, calculated as the sum of squares of differences between the mean and each value, divided by the value count.
STANDARD DEVIATION: the square root of the variance. When the standard deviation is small, the values lie very close to the mean; the larger it gets, the more values you'll find that are far away from the mean. To put it another way, with a large deviation the bell curve is low and wide; with a small deviation the bell curve is high and narrow.
Here's a list of SQL Server statistical functions, plus some standard arithmetical functions that are useful in calculating statistics.
AVG(): returns the mean
COUNT(): returns the population (or sample, depending on the row source)
SQUARE(n): returns the square of the value specified
POWER(a,n): returns the value of a to the nth power
SQRT(n): returns the square root of n
SUM(): returns the sum of the values in a set
STDDEV(): returns the standard deviation of a sample
STDDEVP(): returns the standard deviation of a population
VAR(): returns the variance of a sample
VARP(): returns the variance of a population
Notice that there aren't any built-in functions for calculating mode and median. Nevertheless, you can derive these values with a little effort.
Using the Northwind sample database, suppose that you want to determine the mode of quantities ordered. Here's how you do it:
SELECT TOP 1 quantity, COUNT(*) Count FROM [order details] GROUP BY quantity ORDER BY Count DESCCalculating the median isn't much more difficult (although there is a wrinkle). The basic strategy is to join the table to itself on the column of interest, using one table to count the rows that are less than the given value, and using the other table to count the rows that are greater than that value. This is the query that calculates the median:
SELECT a.quantity median FROM [order details] a, [order details] b GROUP BY a.quantity HAVING SUM( CASE WHEN b.quantity <= a.quantity THEN 1 ELSE 0 END) >=(COUNT(*) )/2 AND SUM( CASE WHEN b.quantity >= a.quantity THEN 1 ELSE 0 END)>=(COUNT(*)/2)Here's the wrinkle: The code above assumes that the number of rows is odd, so there is guaranteed to be a middle row. When the number of rows is even, you have a choice of which row to return. In my experience, most statisticians return the lower of the two middle values. I'm not recommending that you write an entire statistics package using SQL. But, if your needs are relatively simple, you can use the built-in functions, build upon them, and derive the basic statistical values.
Last updated 22 May 2009