How To Calculate The Percentiles

How To Calculate The Percentiles

Terminology

Order Statistics

For a series of measurements X₁, ..., X_N, denote the data ordered in increasing order of magnitude by X₍₁₎, ..., X_(N). These ordered data are called order statistics:

i	X_i	X_(i)
1	50	10
2	40	12
3	40	14
4	30	16
5	20	18
6	18	20
7	16	30
8	14	40
9	12	40
10	10	50

The k-th order statistic is equal k-th smallest value. Special cases include:

minimum: X₁
maximum: X_N
range: X_N - X₁
midrange: (X_N - X₁) × 1/2
median: refer below

Rank

If X_(j) is the order statistic that corresponds to the measurement X_i then the rank for X_i is j, i.e.

X_(j) ~ X_i => r_i = j

Ranking is the data transformation in which original values are replaced by their rank.

i	X_i	X_(i)	r_i
1	50	10	10
2	40	12	9
3	40	14	8
4	30	16	7
5	20	18	6
6	18	20	5
7	16	30	4
8	14	40	3
9	12	40	2
10	10	50	1

Percentiles

In Human Terms

Non-scientific definition:

the p-th percentile is a value below which a p% of observations fall.

In Scientific Terms

Below is the definition of percentile proposed by NIST Engineering Statistics Handbook:

the p-th percentile is a value, P_p, such that at most p% of the measurements are less than this value and at most (100−p)% are greater,

For the example above, in 20% of measurements the amount of available memory is less than 427.232, that means that P₂₀=427.232.

Percentile Rank

A percentile rank is the proportion defined in percentile: for p-th percentile, rank is p. For instance, in the above example for 20-th percentile the rank is 20.

Quantile vs Percentile

It is more common in statistics to refer to quantiles. These are the same as percentiles, but are indexed by sample fractions rather than by sample percentages.

In general, the concepts of quantile and percentile are interchangeable, as well as the scales of probability calculation — absolute and percentage.

Statistics use the term q-quantiles. It stands for values that divide the order statistic into q subsets of equal sizes.

This means that the term percentiles is the name for 100-quantiles.

Also some other q-quantiles have special names:

decile ~ 10-quantile is any of the nine values that divide the order statistic into ten equal parts, and each part represents 1/10 of the sample or population.
median ~ 2-quantile is the value that divide the order statistic to the two equal parts.
quartile ~ 4-quantile is any of the three values that divide the order statistic to the four equal parts, named quarters.

The difference between upper and lower quartiles is also called the interquartile range → IQR = Q3 − Q1.

There are relations between different types of quantiles:

0 quartile = 0.00 quantile = 0 percentile

1 quartile = 0.25 quantile = 25 percentile

2 quartile = 0.50 quantile = 50 percentile = median

3 quartile = 0.75 quantile = 75 percentile

4 quartile = 1.00 quantile = 100 percentile

Estimation Of Percentiles

When there is a small sample of measurements, the CDF of the underlying population is unknown, that is why the percentile can not be calculated, it can be only estimated instead.

Often the percentile of interest is not correspond to a specific data point. In this case, interpolation between points is required. There is no a standard universally accepted way to perform this interpolation. Hyndman, R. J. and Fan, Y. (1986) described nine different methods for computing percentiles, most of statistical software use one of them.

Example Data

Consider these methods in the following example:

i	X_i	X_(i)
1	50	10
2	40	12
3	40	14
4	30	16
5	20	18
6	18	20
7	16	30
8	14	40
9	12	40
10	10	50

N = 10

Notation:

q = p/100 - the percentile rank divided by 100

h - a computed real valued index

X_j - the j-th element of the order statistics, X₃ = 16

⌈⌉ - ceil function, for example ⌈3.2⌉ = 4

⌊⌋ - floor function, for example ⌊3.2⌋ = 3

⌊⌉ - rounding to the nearest even integer, for example ⌊3.2⌉ = 4

Discontinuous Sample

R1. Inverse of EDF

h = N × q
P_p = X_⌈h⌉
if q = 0, P₀ = X₁

Percentile	Calculations^data
25-th	• `h = 10 × 0.25 = 2.5 => ⌈h⌉ = ⌈2.5⌉ = 3` • `P₂₅ = X₃ = 14`

The approach is used by¹:

R2. Inverse of EDF with averaging at discontinuities

h = N × q + 1/2
P_p = (X_{⌈h – 1/2⌉} + X_{⌊h + 1/2⌋}) / 2
if q = 0, P₀ = X₁
if q = 1, P₁ = X_N

Percentile	Calculations^data
25-th	• `h = 10 × 0.25 + 0.5 = 3 => ⌈h – 1/2⌉ = ⌈2.5⌉ = 3, ⌊h + 1/2⌋ = ⌊3.5⌋ = 3` • `P₂₅ = (X₃ + X₃) / 2 = 14`

The approach is used by¹:

R3. SAS definition: nearest even

h = N × q
P_p = X_⌈h⌋
if q ≤ (1/2)/N, P_p = X₁

Percentile	Calculations^data
25-th	• `h = 10 × 0.25 = 2.5 => ⌈h⌋ = 2` • `P₂₅ = X₂ = 12`

The approach is used by¹:

The graph below shows a comparison of the first three methods:

All subsequent methods use linear interpolation:

P_p = X_⌊h⌋ + (h − ⌊h⌋) × (X_{⌊h⌋ + 1} - X_⌊h⌋)

Continuous Sample

R4. Linear interpolation of the EDF

h = N × q
P_p = X_⌊h⌋ + (h − ⌊h⌋) × (X_{⌊h⌋ + 1} - X_⌊h⌋)
if q < 1/N, P_p = X₁
if q = 1, P₁ = X_N

Percentile	Calculations^data
25-th	• `h = 10 × 0.25 = 2.5 => ⌊h⌋ = ⌊2⌋ = 2` • `P₂₅ = X₂ + (2.5 - 2) × (X₃ - X₂) = 13`

The approach is used by¹:

R5. Piecewise linear function

h = N × q + 1/2
P_p = X_⌊h⌋ + (h − ⌊h⌋) × (X_{⌊h⌋ + 1} - X_⌊h⌋)
if q ≤ (1/2)/N, P_p = X₁
if q ≥ (N - 1/2)/N, P_p = X_N

Percentile	Calculations^data
25-th	• `h = 10 × 0.25 + 0.5 = 3 => ⌊h⌋ = ⌊3⌋ = 3` • `P₂₅ = X₃ + (3 - 3) × (X₄ - X₃) = 14`

The approach is used by¹:

Matlab R2018b prctile()^*

^*Note for the percentiles corresponding to the probabilities outside the range, prctile assigns the minimum or maximum values of the elements in X.

The graph below shows a comparison of the 4-th and 5-th methods:

R6. Linear interpolation of the mathematical expectations

h = (N + 1) × q
P_p = X_⌊h⌋ + (h − ⌊h⌋) × (X_{⌊h⌋ + 1} - X_⌊h⌋)
if q ≤ 1/(N + 1), P_p = X₁
if q ≥ N/(N + 1), P_p = X_N

Percentile	Calculations^data
25-th	• `h = (10 + 1) × 0.25 = 2.75 => ⌊h⌋ = ⌊2.75⌋ = 2` • `P₂₅ = X₂ + (2.75 - 2) × (X₃ - X₂) = 13.5`

The approach is used by¹:

R7. Linear interpolation

h = (N - 1) × q + 1
P_p = X_⌊h⌋ + (h − ⌊h⌋) × (X_{⌊h⌋ + 1} - X_⌊h⌋)
if q = 1, P₁ = X_N

Percentile	Calculations^data
25-th	• `h = (10 - 1) × 0.25 + 1 = 3.25 => ⌊h⌋ = ⌊3.25⌋ = 3` • `P₂₅ = X₃ + (3.25 - 3) × (X₄ - X₃) = 14.5`

The approach is used by¹:

Excel PERCENTILE.INC
Guava: Google Core Libraries for Java 23.0 API
NumPy v1.15 with interpolation : 'linear'
Pandas 0.23.4 with interpolation : 'linear'
Oracle DB 10.2

This method is default in R.

The graph below shows a comparison of the 6-th and 7-th methods:

R8. Linear interpolation of the approximate medians

The resulting quantile estimates are approximately median-unbiased regardless of the distribution of X.

The bias of an estimator is the difference between this estimator's expected value and the true value of the parameter being estimated. An estimator or decision rule with zero bias is called unbiased. Otherwise the estimator is said to be biased.

h = (N + 1/3) × q + 1/3
P_p = X_⌊h⌋ + (h − ⌊h⌋) × (X_{⌊h⌋ + 1} - X_⌊h⌋)
if q ≤ (2/3)/(N + 1/3), P_p = X₁
if q ≥ (N - 1/3)/(N + 1/3), P_p = X_N

Percentile	Calculations^data
25-th	• `h = (10 + 0.33) × 0.25 + 0.33 = 2.91 => ⌊h⌋ = ⌊2.91⌋ = 2` • `P₂₅ = X₂ + (2.91 - 2) × (X₃ - X₂) = 13.83`

This method was recommended by Hyndman and Fan.

R9. Approximately unbiased estimates (if `X` is normally distributed)

h = (N + 1/4) × q + 3/8
P_p = X_⌊h⌋ + (h − ⌊h⌋) × (X_{⌊h⌋ + 1} - X_⌊h⌋)
if q < (5/8)/(N + 1/4), P_p = X₁
if q ≥ (N - 3/8)/(N + 1/4), P_p = X_N

Percentile	Calculations^data
25-th	• `h = (10 + 0.25) × 0.25 + 0.375 = 2.93 => ⌊h⌋ = ⌊2.94⌋ = 2` • `P₂₅ = X₂ + (2.94 - 2) × (X₃ - X₂) = 13.875`

The graph below shows a comparison of the 6-th and 7-th methods:

Summaries

Key Percentiles Summary

Percentile	R1	R2	R3	R4	R5	R6	R7	R8	R9
0	10	10	10	10	10	10	10	10	10
25	14	14	12	13	14	13.5	14.5	13.83	13.875
50	18	19	18	18	19	19	19	19	19
75	40	40	40	35	40	40	37.5	40	40
90	40	45	40	40	45	49	41	46.33	46
99	50	50	50	49	50	50	49.1	50	50
100	50	50	50	50	50	50	50	50	50

Methods Differences

Method	Index `h`	Interpolation	Limits Selection
R1	`N × q`	`P_p = X_⌈h⌉`	• `q = 0` ⇒ `P₀ = X₁`
R2	`N × q + 1/2`	`P_p = (X_{⌈h – 1/2⌉} + X_{⌊h + 1/2⌋}) / 2`	• `q = 0` ⇒ `P₀ = X₁` • `q = 1` ⇒ `P₁ = X_N`
R3	`N × q`	`P_p = X_⌈h⌋`	• `q ≤ (1/2)/N` ⇒ `P_p = X₁`
R4	`N × q`	`X_⌊h⌋ + (h − ⌊h⌋) × (X_{⌊h⌋ + 1} - X_⌊h⌋)`	• `q < 1/N` ⇒ `P_p = X₁` • `q = 1` ⇒ `P₁ = X_N`
R5	`N × q + 1/2`	`X_⌊h⌋ + (h − ⌊h⌋) × (X_{⌊h⌋ + 1} - X_⌊h⌋)`	• `q ≤ (1/2)/N` ⇒ `P_p = X₁` • `q ≥ (N - 1/2)/N` ⇒ `P_p = X_N`
R6	`(N + 1) × q`	`X_⌊h⌋ + (h − ⌊h⌋) × (X_{⌊h⌋ + 1} - X_⌊h⌋)`	• `q ≤ 1/(N + 1)` ⇒ `P_p = X₁` • `q ≥ N/(N + 1)` ⇒ `P_p = X_N`
R7	`(N - 1) × q + 1`	`X_⌊h⌋ + (h − ⌊h⌋) × (X_{⌊h⌋ + 1} - X_⌊h⌋)`	• `q = 1` ⇒ `P₁ = X_N`
R8	`(N + 1/3) × q + 1/3`	`X_⌊h⌋ + (h − ⌊h⌋) × (X_{⌊h⌋ + 1} - X_⌊h⌋)`	• `q ≤ (2/3)/(N + 1/3)` ⇒ `P_p = X₁` • `q ≥ (N - 1/3)/(N + 1/3)` ⇒ `P_p = X_N`
R9	`(N + 1/4) × q + 3/8`	`X_⌊h⌋ + (h − ⌊h⌋) × (X_{⌊h⌋ + 1} - X_⌊h⌋)`	• `q < (5/8)/(N + 1/4)` ⇒ `P_p = X₁` • `q ≥ (N - 3/8)/(N + 1/4)` ⇒ `P_p = X_N`

Tools Summary

The following software provides functionality to use any of R1-R9:

Apache Commons Math 3.6 Percentile, refer to Percentile.EstimationType

q must be in interval q ∈ (0, 1], otherwise org.apache.commons.math3.exception.OutOfRangeException is thrown.

R quantile

Method	Tools
R1	• Azure Kusto • Apache Commons Math 3.6 `EstimationType.R_1` • R `type=1` • SAS `PCTLDEF=3`
R2	• Apache Commons Math 3.6 `EstimationType.R_2` • R `type=2` • SAS `PCTLDEF=5`
R3	• Apache Commons Math 3.6 `EstimationType.R_3` • R `type=3` • SAS `PCTLDEF=2`
R4	• Apache Commons Math 3.6 `EstimationType.R_4` • R `type=4` • SAS `PCTLDEF=1` • SciPy v1.1.0 `alphap=0, betap=1`
R5	• Apache Commons Math 3.6 `EstimationType.R_5` • Matlab R2018b `prctile()` • R `type=5` • SciPy v1.1.0 `alphap=0.5, betap=0.5`
R6	• Apache Commons Math 3.6 `EstimationType.R_6` • ATSD SQL PERCENTILE • ATSD Rule Engine percentile() • ATSD Data API PERCENTILE • Charts PERCENTILE • Excel PERCENTILE.EXC • NIST Dataplot `R6` • R `type=6` • SAS `PCTLDEF=4` • SciPy v1.1.0 `alphap=0, betap=0`
R7	• Apache Commons Math 3.6 `EstimationType.R_7` • Excel PERCENTILE.INC • Guava: Google Core Libraries for Java 23.0 API • NIST Dataplot `R7` • NumPy v1.15 `interpolation : 'linear'` • Oracle DB 10.2 • Pandas 0.23.4 `interpolation : 'linear'` • R `type=7` • SciPy v1.1.0 `alphap=1, betap=1`
R8	• Apache Commons Math 3.6 `EstimationType.R_8` • NIST Dataplot `R8` • R `type=8` • SciPy v1.1.0 `alphap=1/3, betap=1/3`
R9	• Apache Commons Math 3.6 `EstimationType.R_9` • R `type=9` • SciPy v1.1.0 `alphap=3/8, betap=3/8`

NaN Strategy

Apache Commons Math 3.6

Apache Commons Math 3.6 NaNStrategy:

MINIMAL - NaNs are treated as minimal in the ordering, equivalent to (that is, tied with) Double.NEGATIVE_INFINITY
MAXIMAL - NaNs are treated as maximal in the ordering, equivalent to Double.POSITIVE_INFINITY
REMOVED - NaNs are removed before the rank transform is applied
FIXED - NaNs are left "in place," that is the rank transformation is applied to the other elements in the input array, but the NaN elements are returned unchanged.
FAILED - If any NaN is encountered in the input array, an appropriate exception is thrown.

R quantile

na.rm - if true, any NA and NaN's are removed from x before the quantiles are computed
if false NA and NaN values are not allowed

ATSD
NaN values are removed before the percentiles are estimated.

Graphical Representation Of Percentiles

Box-And-Whiskers Diagram or Box Plot is the visual representation of the several percentiles of a given data set.

Outliers are described below.

Below is the Box Plot for the example data.

It is also convenient to display the percentiles in the Histogram Chart.

Percentiles are often used for thresholds checking, below values that are greater than P₇₀ = 37 are colored in red.

Additional Examples

Robust Statistics

The term "robust statistics" is closely related to the term "outliers".

Outlier is an observation (or subset of observations) which appears to be inconsistent with the remainder of that set of data.

This definition is not precise, the decision whether an observation is an outlier, is left to the subjective judgement of the researcher.

Robust statistics are resistant to outliers. In other words, if data set contains very high or very low values, then some statistics will be good estimators for population parameters, and some statistics will be poor estimators. For example, the arithmetic mean is very susceptible to outliers (it is non-robust), while the median is not affected by outliers (it is robust).

Median vs Average

Average or arithmetic mean is the sum of the numbers divided by how many numbers are being averaged:

The median is a robust measure of central tendency, while the mean is not.

For the symmetric distributions (normal in particular) the mean is also the median.

For the example sample mean is greater than median:

Median Absolute Deviation vs IQR vs Standard Deviation

Median Absolute Deviation, IQR and Standard Deviation are measures of spread (also called measures of dispersion), that means that they show something about how wide the set of data is.

The standard deviation is a measure of how spread out data is around center of the distribution. It also gives an idea of where, percentage wise, a certain value falls.

The SD is affected by extremely high or extremely low values and non normality, in other words it is not robust.

The median absolute deviation is a robust measure of how spread out a set of data is.

If data is normally distributed, the SD is usually the best choice for assessing spread, otherwise, the MAD is preferred, but it must be multiplied with scale factor k - a constant linked to the assumption of normality of the data, read additional information in [3] and [6].

The IQR is similar to the MAD, but it is less robust. The IQR is often used to find outliers in sample: observations that fall below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR are marked as outliers. In a box plot, the highest and lowest occurring value within this limit are indicated by whiskers of the box and any outliers as individual points.

IQR can also be used as a threshold.

How To Calculate The Percentiles

Table of Contents

Terminology

Order Statistics

Rank

Percentiles

In Human Terms

In Scientific Terms

Percentile Rank

Quantile vs Percentile

Estimation Of Percentiles

Example Data

Discontinuous Sample

R1. Inverse of EDF

R2. Inverse of EDF with averaging at discontinuities

R3. SAS definition: nearest even

Continuous Sample

R4. Linear interpolation of the EDF

R5. Piecewise linear function

R6. Linear interpolation of the mathematical expectations

R7. Linear interpolation

R8. Linear interpolation of the approximate medians

R9. Approximately unbiased estimates (if `X` is normally distributed)

Summaries

Key Percentiles Summary

Methods Differences

Tools Summary

NaN Strategy

Graphical Representation Of Percentiles

Additional Examples

Robust Statistics

Median vs Average

Median Absolute Deviation vs IQR vs Standard Deviation

Sources

i	X_i	X_(i)	r_i
1	50	10	10
2	40	12	9
3	40	14	8
4	30	16	7
5	20	18	6
6	18	20	5
7	16	30	4
8	14	40	3
9	12	40	2
10	10	50	1

i	X_i	X_(i)	r_i
1	50	10	10
2	40	12	9
3	40	14	8
4	30	16	7
5	20	18	6
6	18	20	5
7	16	30	4
8	14	40	3
9	12	40	2
10	10	50	1

# How To Calculate The Percentiles

# Table of Contents

# Terminology

# Order Statistics

# Rank

# Percentiles

# In Human Terms

# In Scientific Terms

# Percentile Rank

# Quantile vs Percentile

# Estimation Of Percentiles

# Example Data

# Discontinuous Sample

# R1. Inverse of EDF

# R2. Inverse of EDF with averaging at discontinuities

# R3. SAS definition: nearest even

# Continuous Sample

# R4. Linear interpolation of the EDF

# R5. Piecewise linear function

# R6. Linear interpolation of the mathematical expectations

# R7. Linear interpolation

# R8. Linear interpolation of the approximate medians

# R9. Approximately unbiased estimates (if X is normally distributed)

# Summaries

# Key Percentiles Summary

# Methods Differences

# Tools Summary

# NaN Strategy

# Graphical Representation Of Percentiles

# Additional Examples

# Robust Statistics

# Median vs Average

# Median Absolute Deviation vs IQR vs Standard Deviation

# Sources

How To Calculate The Percentiles

Table of Contents

Terminology

Order Statistics

Rank

Percentiles

In Human Terms

In Scientific Terms

Percentile Rank

Quantile vs Percentile

Estimation Of Percentiles

Example Data

Discontinuous Sample

R1. Inverse of EDF

R2. Inverse of EDF with averaging at discontinuities

R3. SAS definition: nearest even

Continuous Sample

R4. Linear interpolation of the EDF

R5. Piecewise linear function

R6. Linear interpolation of the mathematical expectations

R7. Linear interpolation

R8. Linear interpolation of the approximate medians

R9. Approximately unbiased estimates (if `X` is normally distributed)

Summaries

Key Percentiles Summary

Methods Differences

Tools Summary

NaN Strategy

Graphical Representation Of Percentiles

Additional Examples

Robust Statistics

Median vs Average

Median Absolute Deviation vs IQR vs Standard Deviation

Sources

i	X_i	X_(i)	r_i
1	50	10	10
2	40	12	9
3	40	14	8
4	30	16	7
5	20	18	6
6	18	20	5
7	16	30	4
8	14	40	3
9	12	40	2
10	10	50	1