Measures of Dispersion (Univariate)

A measure of dispersion (also called variability, scatter, or spread) is the extent to which a distribution is stretched or squeezed. It indicates the scattering of the data.

It can also be defined as the extent to which the values in a distribution differ from the average of the distribution.

It also helps describe the shape of the data distribution, for example how narrow or wide it is.

Common measures of dispersion are the range, interquartile range, variance and standard deviation.

Why Measure Dispersion

Dispersion or spread is measured because central tendency measures (like the mean, median and mode) alone do not provide a complete picture of the data.

Consider this:

In the above example, we see that although the central tendency measures of the three series are the same, the series are very different in nature.

Thus, to get a complete picture of the data, we also measure its spread.
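To make the point concrete, here is a minimal sketch with three made-up series. The values are hypothetical and chosen only for illustration; they are not the series from the original example. All three share the same mean, median and mode, yet their ranges differ widely.

```python
import statistics

# Hypothetical series, invented for illustration only.
series = {
    "A": [50, 50, 50, 50, 50],
    "B": [40, 50, 50, 50, 60],
    "C": [0, 50, 50, 50, 100],
}

summary = {}
for name, values in series.items():
    summary[name] = {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "range": max(values) - min(values),  # a simple measure of spread
    }
    print(name, summary[name])
```

The central tendency columns are identical for A, B and C; only the range column tells the series apart.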

Range

The range is the difference between the maximum value and the minimum value in the dataset.
Consider this dataset:

1, 2, 4, 5, 6, 8, 9

The range is 9 - 1 = 8.
The range is meaningful for quantitative data, but because it depends only on the two extreme values, a single extreme value can distort the analysis.
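The calculation can be sketched in a few lines of Python, reading the example dataset as the seven values 1, 2, 4, 5, 6, 8, 9. The extra value 100 is a hypothetical outlier added only to show how strongly one extreme value affects the range.

```python
data = [1, 2, 4, 5, 6, 8, 9]

# Range = maximum value - minimum value
data_range = max(data) - min(data)
print(data_range)

# Hypothetical extreme value, added for illustration only.
with_outlier = data + [100]
outlier_range = max(with_outlier) - min(with_outlier)
print(outlier_range)
```

One additional point moves the range from 8 to 99, even though the other seven values are unchanged.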

Interquartile Range (IQR)

Quartiles are the points that divide the data set into four equal parts.

Q1, Q2 and Q3 are the first, second and third quartiles of the data set.

  • 25% of the data points lie below Q1 and 75% lie above it.
  • 50% of the data points lie below Q2 and 50% lie above it. Q2 is the median.
  • 75% of the data points lie below Q3 and 25% lie above it.

The interquartile range is the difference between the third and first quartiles, IQR = Q3 - Q1, so it covers the middle 50% of the data.
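As a rough sketch, the quartiles can be computed with `statistics.quantiles` from the Python standard library, and the interquartile range then follows as Q3 minus Q1. The dataset is assumed to be the seven values 1, 2, 4, 5, 6, 8, 9 from the range example. Several conventions exist for locating quartiles in small datasets, so other tools may report slightly different values.

```python
import statistics

data = [1, 2, 4, 5, 6, 8, 9]

# n=4 splits the data into four parts and returns the three cut points.
# The default method is "exclusive"; other conventions can give other values.
q1, q2, q3 = statistics.quantiles(data, n=4)

iqr = q3 - q1  # spread of the middle 50% of the data
print(q1, q2, q3, iqr)
```

Here Q2 coincides with the median of the dataset, as described above.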

    Variance

    Variance is a numerical value that shows how widely the individual values in a dataset are spread about the mean. It is the average of the squared differences between each value and the mean.

    So if a dataset has zero variance, all the values in it are identical.

    Population variance (σ^2)

    It describes how the data points are spread out across the whole population. It is denoted by sigma squared (σ^2).

    Formula for the Population Variance:

    $$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$$

    Where

    N is the population size

    x_i are the data points

    μ is the population mean
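The population variance can be computed step by step as in the sketch below. Treating the small dataset 1, 2, 4, 5, 6, 8, 9 as an entire population is an assumption made purely for illustration. The result is compared with `statistics.pvariance` from the standard library.

```python
import statistics

# Assumption: this small dataset is treated as the whole population.
population = [1, 2, 4, 5, 6, 8, 9]

N = len(population)                 # population size
mu = sum(population) / N            # population mean

# Squared difference of each data point from the mean.
squared_deviations = [(x - mu) ** 2 for x in population]

# Population variance: average of the squared deviations (divide by N).
population_variance = sum(squared_deviations) / N

print(mu, population_variance)
print(statistics.pvariance(population))  # same quantity from the standard library
```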

    Sample variance (s^2)

    Sample variance is calculated in the same manner as population variance, except that it uses only a sample of values drawn from the population, and the sum of squared deviations is divided by n - 1 rather than n. It is denoted by s squared (s^2).

    Formula for the Sample Variance:

    $$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

    Where

    n is the sample size

    x_i are the data points

    x̅ is the sample mean
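A minimal sketch of the sample variance follows, using the usual n - 1 denominator. The values are assumed to be a sample drawn from some larger population, which is a hypothetical framing for illustration. The result is checked against `statistics.variance`.

```python
import statistics

# Assumption: these values are a sample drawn from a larger population.
sample = [1, 2, 4, 5, 6, 8, 9]

n = len(sample)                     # sample size
x_bar = sum(sample) / n             # sample mean

# Sum of squared differences from the sample mean.
sum_of_squares = sum((x - x_bar) ** 2 for x in sample)

sample_variance = sum_of_squares / (n - 1)   # sample formula: divide by n - 1
divided_by_n = sum_of_squares / n            # population-style formula, for comparison

print(sample_variance, divided_by_n)
print(statistics.variance(sample))           # same as sample_variance
```

Dividing by n - 1 gives a slightly larger value than dividing by n; the gap shrinks as the sample grows.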

    Standard Deviation

    Standard Deviation (σ)

    The square root of the variance is called the standard deviation.

    It measures how far the data points typically lie from the mean. If the data points are close to the mean, the standard deviation is small, whereas if they are widely spread out from the mean, it is large.

    Standard deviation is very important because, for a dataset that follows a Normal (or Gaussian) distribution, it tells us how likely a data point is to fall within a given distance of the mean: about 68% of values lie within 1σ of the mean, about 95% within 2σ, and about 99.7% within 3σ.
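The sketch below computes the standard deviation as the square root of the variance, and then uses `statistics.NormalDist` to show how much of a Normal distribution lies within 1, 2 and 3 standard deviations of the mean. The dataset is the illustrative 1, 2, 4, 5, 6, 8, 9 used earlier; the standard Normal distribution (mean 0, standard deviation 1) is assumed for the coverage calculation.

```python
import math
import statistics

data = [1, 2, 4, 5, 6, 8, 9]

pop_std = statistics.pstdev(data)     # square root of the population variance
sample_std = statistics.stdev(data)   # square root of the sample variance
print(pop_std, sample_std)

# The standard deviation is the square root of the variance.
print(math.isclose(pop_std, math.sqrt(statistics.pvariance(data))))

# Share of a Normal distribution within k standard deviations of the mean.
standard_normal = statistics.NormalDist(mu=0.0, sigma=1.0)
coverage = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}
for k, share in coverage.items():
    print(f"within {k} sigma: {share:.4f}")
```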

    3σ check for Standard Deviation

    Data values that lie more than 3σ from the mean (μ) are treated as outliers.

    In other words, x is an outlier if

    • x < μ - 3σ, or
    • x > μ + 3σ

    Practical implication of 3σ:

    Banks can use it to identify high-net-worth individuals in their portfolio by running a simple analysis that flags customers whose spending is more than 3σ above the mean spending.
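The 3σ check can be sketched as follows. The spending figures are hypothetical, invented only to illustrate the rule, and the list is treated as the whole population (hence `statistics.pstdev`); neither choice comes from a real bank's process.

```python
import statistics

# Hypothetical customer spending figures, invented for illustration.
spending = [100, 110, 95, 105, 98, 102, 97, 103, 99, 101,
            104, 96, 108, 92, 100, 107, 93, 106, 94, 1000]

mu = statistics.mean(spending)
sigma = statistics.pstdev(spending)   # assumption: the list is the whole population

lower = mu - 3 * sigma
upper = mu + 3 * sigma

# x is flagged if x < mu - 3*sigma or x > mu + 3*sigma
outliers = [x for x in spending if x < lower or x > upper]
print(lower, upper, outliers)
```

Only the value 1000 is flagged here. Note that an extreme value also inflates σ itself, so the check is more reliable on larger datasets than on a handful of points.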