How To Calculate The Percentiles

Table of Contents

Terminology

Order Statistics

For a series of measurements X1, ..., XN, denote the data ordered in increasing order of magnitude by X(1), ..., X(N). These ordered data are called order statistics:

i Xi X(i)
1 50 10
2 40 12
3 40 14
4 30 16
5 20 18
6 18 20
7 16 30
8 14 40
9 12 40
10 10 50

The k-th order statistic is equal k-th smallest value. Special cases include:

  • minimum: X1
  • maximum: XN
  • range: XN - X1
  • midrange: (XN - X1) × 1/2
  • median: refer below

Rank

If X(j) is the order statistic that corresponds to the measurement Xi then the rank for Xi is j, i.e.

X(j) ~ Xi => ri = j

Ranking is the data transformation in which original values are replaced by their rank.

i Xi X(i) ri
1 50 10 10
2 40 12 9
3 40 14 8
4 30 16 7
5 20 18 6
6 18 20 5
7 16 30 4
8 14 40 3
9 12 40 2
10 10 50 1

Percentiles

In Human Terms

Non-scientific definition:

the p-th percentile is a value below which a p% of observations fall.

In Scientific Terms

Below is the definition of percentile proposed by NIST Engineering Statistics Handbook:

the p-th percentile is a value, Pp, such that at most p% of the measurements are less than this value and at most (100−p)% are greater,

For the example above, in 20% of measurements the amount of available memory is less than 427.232, that means that P20=427.232.

Percentile Rank

A percentile rank is the proportion defined in percentile: for p-th percentile, rank is p. For instance, in the above example for 20-th percentile the rank is 20.

Quantile vs Percentile

It is more common in statistics to refer to quantiles. These are the same as percentiles, but are indexed by sample fractions rather than by sample percentages.

In general, the concepts of quantile and percentile are interchangeable, as well as the scales of probability calculation — absolute and percentage.

Statistics use the term q-quantiles. It stands for values that divide the order statistic into q subsets of equal sizes.

This means that the term percentiles is the name for 100-quantiles.

Also some other q-quantiles have special names:

  • decile ~ 10-quantile is any of the nine values that divide the order statistic into ten equal parts, and each part represents 1/10 of the sample or population.

  • median ~ 2-quantile is the value that divide the order statistic to the two equal parts.

  • quartile ~ 4-quantile is any of the three values that divide the order statistic to the four equal parts, named quarters.

The difference between upper and lower quartiles is also called the interquartile rangeIQR = Q3 − Q1.

There are relations between different types of quantiles:

0 quartile = 0.00 quantile = 0 percentile

1 quartile = 0.25 quantile = 25 percentile

2 quartile = 0.50 quantile = 50 percentile = median

3 quartile = 0.75 quantile = 75 percentile

4 quartile = 1.00 quantile = 100 percentile

Estimation Of Percentiles

When there is a small sample of measurements, the CDF of the underlying population is unknown, that is why the percentile can not be calculated, it can be only estimated instead.

Often the percentile of interest is not correspond to a specific data point. In this case, interpolation between points is required. There is no a standard universally accepted way to perform this interpolation. Hyndman, R. J. and Fan, Y. (1986) described nine different methods for computing percentiles, most of statistical software use one of them.

Example Data

Consider these methods in the following example:

i Xi X(i)
1 50 10
2 40 12
3 40 14
4 30 16
5 20 18
6 18 20
7 16 30
8 14 40
9 12 40
10 10 50

N = 10

Notation:

q = p/100 - the percentile rank divided by 100

h - a computed real valued index

Xj - the j-th element of the order statistics, X3 = 16

⌈⌉ - ceil function, for example ⌈3.2⌉ = 4

⌊⌋ - floor function, for example ⌊3.2⌋ = 3

⌊⌉ - rounding to the nearest even integer, for example ⌊3.2⌉ = 4

Discontinuous Sample

R1. Inverse of EDF

  • h = N × q

  • Pp = X⌈h⌉

  • if q = 0, P0 = X1

Percentile Calculationsdata
25-th h = 10 × 0.25 = 2.5 => ⌈h⌉ = ⌈2.5⌉ = 3
P25 = X3 = 14

The approach is used by1:

  1. Azure Kusto
  2. SAS

R2. Inverse of EDF with averaging at discontinuities

  • h = N × q + 1/2

  • Pp = (X⌈h – 1/2⌉ + X⌊h + 1/2⌋) / 2

  • if q = 0, P0 = X1

  • if q = 1, P1 = XN

Percentile Calculationsdata
25-th h = 10 × 0.25 + 0.5 = 3 => ⌈h – 1/2⌉ = ⌈2.5⌉ = 3, ⌊h + 1/2⌋ = ⌊3.5⌋ = 3
P25 = (X3 + X3) / 2 = 14

The approach is used by1:

  1. SAS

R3. SAS definition: nearest even

  • h = N × q

  • Pp = X⌈h⌋

  • if q ≤ (1/2)/N, Pp = X1

Percentile Calculationsdata
25-th h = 10 × 0.25 = 2.5 => ⌈h⌋ = 2
P25 = X2 = 12

The approach is used by1:

  1. SAS

The graph below shows a comparison of the first three methods:

All subsequent methods use linear interpolation:

  • Pp = X⌊h⌋ + (h − ⌊h⌋) × (X⌊h⌋ + 1 - X⌊h⌋)

Continuous Sample

R4. Linear interpolation of the EDF

  • h = N × q

  • Pp = X⌊h⌋ + (h − ⌊h⌋) × (X⌊h⌋ + 1 - X⌊h⌋)

  • if q < 1/N, Pp = X1

  • if q = 1, P1 = XN

Percentile Calculationsdata
25-th h = 10 × 0.25 = 2.5 => ⌊h⌋ = ⌊2⌋ = 2
P25 = X2 + (2.5 - 2) × (X3 - X2) = 13

The approach is used by1:

  1. SAS

R5. Piecewise linear function

  • h = N × q + 1/2

  • Pp = X⌊h⌋ + (h − ⌊h⌋) × (X⌊h⌋ + 1 - X⌊h⌋)

  • if q ≤ (1/2)/N, Pp = X1

  • if q ≥ (N - 1/2)/N, Pp = XN

Percentile Calculationsdata
25-th h = 10 × 0.25 + 0.5 = 3 => ⌊h⌋ = ⌊3⌋ = 3
P25 = X3 + (3 - 3) × (X4 - X3) = 14

The approach is used by1:

  1. Matlab R2018b prctile()*

*Note for the percentiles corresponding to the probabilities outside the range, prctile assigns the minimum or maximum values of the elements in X.

The graph below shows a comparison of the 4-th and 5-th methods:

R6. Linear interpolation of the mathematical expectations

  • h = (N + 1) × q

  • Pp = X⌊h⌋ + (h − ⌊h⌋) × (X⌊h⌋ + 1 - X⌊h⌋)

  • if q ≤ 1/(N + 1), Pp = X1

  • if q ≥ N/(N + 1), Pp = XN

Percentile Calculationsdata
25-th h = (10 + 1) × 0.25 = 2.75 => ⌊h⌋ = ⌊2.75⌋ = 2
P25 = X2 + (2.75 - 2) × (X3 - X2) = 13.5

The approach is used by1:

  1. ATSD SQL PERCENTILE
  2. ATSD Rule Engine percentile()
  3. ATSD Data API PERCENTILE
  4. Charts PERCENTILE
  5. Excel PERCENTILE.EXC
  6. SAS
  7. SciPy v1.1.0

R7. Linear interpolation

  • h = (N - 1) × q + 1

  • Pp = X⌊h⌋ + (h − ⌊h⌋) × (X⌊h⌋ + 1 - X⌊h⌋)

  • if q = 1, P1 = XN

Percentile Calculationsdata
25-th h = (10 - 1) × 0.25 + 1 = 3.25 => ⌊h⌋ = ⌊3.25⌋ = 3
P25 = X3 + (3.25 - 3) × (X4 - X3) = 14.5

The approach is used by1:

  1. Excel PERCENTILE.INC
  2. Guava: Google Core Libraries for Java 23.0 API
  3. NumPy v1.15 with interpolation : 'linear'
  4. Pandas 0.23.4 with interpolation : 'linear'
  5. Oracle DB 10.2

This method is default in R.

The graph below shows a comparison of the 6-th and 7-th methods:

R8. Linear interpolation of the approximate medians

The resulting quantile estimates are approximately median-unbiased regardless of the distribution of X.

The bias of an estimator is the difference between this estimator's expected value and the true value of the parameter being estimated. An estimator or decision rule with zero bias is called unbiased. Otherwise the estimator is said to be biased.

  • h = (N + 1/3) × q + 1/3

  • Pp = X⌊h⌋ + (h − ⌊h⌋) × (X⌊h⌋ + 1 - X⌊h⌋)

  • if q ≤ (2/3)/(N + 1/3), Pp = X1

  • if q ≥ (N - 1/3)/(N + 1/3), Pp = XN

Percentile Calculationsdata
25-th h = (10 + 0.33) × 0.25 + 0.33 = 2.91 => ⌊h⌋ = ⌊2.91⌋ = 2
P25 = X2 + (2.91 - 2) × (X3 - X2) = 13.83

This method was recommended by Hyndman and Fan.

R9. Approximately unbiased estimates (if X is normally distributed)

  • h = (N + 1/4) × q + 3/8

  • Pp = X⌊h⌋ + (h − ⌊h⌋) × (X⌊h⌋ + 1 - X⌊h⌋)

  • if q < (5/8)/(N + 1/4), Pp = X1

  • if q ≥ (N - 3/8)/(N + 1/4), Pp = XN

Percentile Calculationsdata
25-th h = (10 + 0.25) × 0.25 + 0.375 = 2.93 => ⌊h⌋ = ⌊2.94⌋ = 2
P25 = X2 + (2.94 - 2) × (X3 - X2) = 13.875

The graph below shows a comparison of the 6-th and 7-th methods:

Summaries

Key Percentiles Summary

Percentile R1 R2 R3 R4 R5 R6 R7 R8 R9
0 10 10 10 10 10 10 10 10 10
25 14 14 12 13 14 13.5 14.5 13.83 13.875
50 18 19 18 18 19 19 19 19 19
75 40 40 40 35 40 40 37.5 40 40
90 40 45 40 40 45 49 41 46.33 46
99 50 50 50 49 50 50 49.1 50 50
100 50 50 50 50 50 50 50 50 50

Methods Differences

Method Index h Interpolation Limits Selection
R1 N × q Pp = X⌈h⌉ q = 0P0 = X1
R2 N × q + 1/2 Pp = (X⌈h – 1/2⌉ + X⌊h + 1/2⌋) / 2 q = 0P0 = X1
q = 1P1 = XN
R3 N × q Pp = X⌈h⌋ q ≤ (1/2)/NPp = X1
R4 N × q X⌊h⌋ + (h − ⌊h⌋) × (X⌊h⌋ + 1 - X⌊h⌋) q < 1/NPp = X1
q = 1P1 = XN
R5 N × q + 1/2 X⌊h⌋ + (h − ⌊h⌋) × (X⌊h⌋ + 1 - X⌊h⌋) q ≤ (1/2)/NPp = X1
q ≥ (N - 1/2)/NPp = XN
R6 (N + 1) × q X⌊h⌋ + (h − ⌊h⌋) × (X⌊h⌋ + 1 - X⌊h⌋) q ≤ 1/(N + 1)Pp = X1
q ≥ N/(N + 1)Pp = XN
R7 (N - 1) × q + 1 X⌊h⌋ + (h − ⌊h⌋) × (X⌊h⌋ + 1 - X⌊h⌋) q = 1P1 = XN
R8 (N + 1/3) × q + 1/3 X⌊h⌋ + (h − ⌊h⌋) × (X⌊h⌋ + 1 - X⌊h⌋) q ≤ (2/3)/(N + 1/3)Pp = X1
q ≥ (N - 1/3)/(N + 1/3)Pp = XN
R9 (N + 1/4) × q + 3/8 X⌊h⌋ + (h − ⌊h⌋) × (X⌊h⌋ + 1 - X⌊h⌋) q < (5/8)/(N + 1/4)Pp = X1
q ≥ (N - 3/8)/(N + 1/4)Pp = XN

Tools Summary

The following software provides functionality to use any of R1-R9:

q must be in interval q ∈ (0, 1], otherwise org.apache.commons.math3.exception.OutOfRangeException is thrown.

Method Tools
R1 Azure Kusto
Apache Commons Math 3.6 EstimationType.R_1
R type=1
SAS PCTLDEF=3
R2
Apache Commons Math 3.6 EstimationType.R_2
R type=2
SAS PCTLDEF=5
R3
Apache Commons Math 3.6 EstimationType.R_3
R type=3
SAS PCTLDEF=2
R4
Apache Commons Math 3.6 EstimationType.R_4
R type=4
SAS PCTLDEF=1
SciPy v1.1.0 alphap=0, betap=1
R5
Apache Commons Math 3.6 EstimationType.R_5
Matlab R2018b prctile()
R type=5
SciPy v1.1.0 alphap=0.5, betap=0.5
R6
Apache Commons Math 3.6 EstimationType.R_6
ATSD SQL PERCENTILE
ATSD Rule Engine percentile()
ATSD Data API PERCENTILE
Charts PERCENTILE
Excel PERCENTILE.EXC
NIST Dataplot R6
R type=6
SAS PCTLDEF=4
SciPy v1.1.0 alphap=0, betap=0
R7
Apache Commons Math 3.6 EstimationType.R_7
Excel PERCENTILE.INC
Guava: Google Core Libraries for Java 23.0 API
NIST Dataplot R7
NumPy v1.15 interpolation : 'linear'
Oracle DB 10.2
Pandas 0.23.4 interpolation : 'linear'
R type=7
SciPy v1.1.0 alphap=1, betap=1
R8
Apache Commons Math 3.6 EstimationType.R_8
NIST Dataplot R8
R type=8
SciPy v1.1.0 alphap=1/3, betap=1/3
R9
Apache Commons Math 3.6 EstimationType.R_9
R type=9
SciPy v1.1.0 alphap=3/8, betap=3/8

NaN Strategy

  • Apache Commons Math 3.6

    Apache Commons Math 3.6 NaNStrategy:

    MINIMAL - NaNs are treated as minimal in the ordering, equivalent to (that is, tied with) Double.NEGATIVE_INFINITY
    MAXIMAL - NaNs are treated as maximal in the ordering, equivalent to Double.POSITIVE_INFINITY
    REMOVED - NaNs are removed before the rank transform is applied
    FIXED - NaNs are left "in place," that is the rank transformation is applied to the other elements in the input array, but the NaN elements are returned unchanged.
    FAILED - If any NaN is encountered in the input array, an appropriate exception is thrown.
    
  • R quantile

    na.rm - if true, any NA and NaN's are removed from x before the quantiles are computed
    if false NA and NaN values are not allowed
    
  • ATSD

    NaN values are removed before the percentiles are estimated.

Graphical Representation Of Percentiles

Box-And-Whiskers Diagram or Box Plot is the visual representation of the several percentiles of a given data set.

Outliers are described below.

Below is the Box Plot for the example data.

It is also convenient to display the percentiles in the Histogram Chart.

Percentiles are often used for thresholds checking, below values that are greater than P70 = 37 are colored in red.

Additional Examples

Robust Statistics

The term "robust statistics" is closely related to the term "outliers".

Outlier is an observation (or subset of observations) which appears to be inconsistent with the remainder of that set of data.

This definition is not precise, the decision whether an observation is an outlier, is left to the subjective judgement of the researcher.

Robust statistics are resistant to outliers. In other words, if data set contains very high or very low values, then some statistics will be good estimators for population parameters, and some statistics will be poor estimators. For example, the arithmetic mean is very susceptible to outliers (it is non-robust), while the median is not affected by outliers (it is robust).

Median vs Average

Average or arithmetic mean is the sum of the numbers divided by how many numbers are being averaged:

The median is a robust measure of central tendency, while the mean is not.

For the symmetric distributions (normal in particular) the mean is also the median.

For the example sample mean is greater than median:

Median Absolute Deviation vs IQR vs Standard Deviation

Median Absolute Deviation, IQR and Standard Deviation are measures of spread (also called measures of dispersion), that means that they show something about how wide the set of data is.

The standard deviation is a measure of how spread out data is around center of the distribution. It also gives an idea of where, percentage wise, a certain value falls.

The SD is affected by extremely high or extremely low values and non normality, in other words it is not robust.

The median absolute deviation is a robust measure of how spread out a set of data is.

If data is normally distributed, the SD is usually the best choice for assessing spread, otherwise, the MAD is preferred, but it must be multiplied with scale factor k - a constant linked to the assumption of normality of the data, read additional information in [3] and [6].

The IQR is similar to the MAD, but it is less robust. The IQR is often used to find outliers in sample: observations that fall below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR are marked as outliers. In a box plot, the highest and lowest occurring value within this limit are indicated by whiskers of the box and any outliers as individual points.

IQR can also be used as a threshold.

Sources

  1. "A Brief Introduction to Robust Statistics", André Lucas
  2. Apache Commons Math 3.6 Percentile
  3. "Detecting outliers: Do not use standard deviation around the mean, use absolute deviation around the median", Leys, C., Journal of Experimental Social Psychology (2013)
  4. "Introduction to Statistics", Stepik course
  5. NIST Engineering Statistics Handbook
  6. R mad
  7. R quantile
  8. "Statistical Computing", Hyndman, R. J. and Fan, Y. (1986)
  9. Statistics How To
  10. Wolfram MathWorld Order Statistic
  11. Wiki Quantile