Quantiles, Boxplots & Outliers

There is another way to describe the distribution, or spread, of the data points of a variable: binning or partitioning the data along a number line.


Quantiles: Quantiles are cut points that divide the data, once it is sorted in ascending order, into groups that each hold an equal share of the observations. The data can be partitioned into any number of groups. The most commonly used schemes are defined below.

Quartiles: The data is split into 4 bins or subgroups, with partitions at the 25th, 50th and 75th percentile values.

Deciles: The data is split into 10 bins, with partitions at the 10th, 20th, 30th, ..., 80th and 90th percentile values.

Percentiles: The data is split into 100 parts, with partitions at the 1st, 2nd, 3rd, ..., 97th, 98th and 99th percentile values.

Some other, less commonly used types of quantiles are:
Quintiles: 5 groups
Sextiles: 6 groups
Septiles: 7 groups
Octiles: 8 groups
Ventiles: 20 groups
Permilles: 1000 groups
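All of the schemes above follow the same recipe: sort the data and find the cut points that split it into the chosen number of groups. Here is a minimal sketch using Python's standard `statistics.quantiles` function. The `heights` values and the `cut_points` helper are made up for illustration, and the "inclusive" interpolation method is just one of several conventions, so other tools may return slightly different cut points.

```python
import statistics

# Hypothetical heights in cm, made up for illustration only.
heights = [128, 130, 131, 133, 135, 136, 138, 140, 145]


def cut_points(data, groups):
    """Return the (groups - 1) cut points that split data into `groups` parts."""
    # method="inclusive" treats the data as the whole population and
    # interpolates linearly between neighbouring sorted values.
    return statistics.quantiles(data, n=groups, method="inclusive")


quartiles = cut_points(heights, 4)      # 3 cut points: 25th, 50th, 75th
deciles = cut_points(heights, 10)       # 9 cut points: 10th ... 90th
percentiles = cut_points(heights, 100)  # 99 cut points: 1st ... 99th

print("Quartiles:", quartiles)
print("Number of decile cut points:", len(deciles))
print("Number of percentile cut points:", len(percentiles))
```

Note that splitting into n groups always needs n - 1 cut points, and that the middle quartile is simply the median.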


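Boxplots: A boxplot is a graphical representation of the distribution of data based on its quartiles, central tendency and outliers. It consists of a box bordered at the 25th and 75th percentiles (the 1st and 3rd quartiles; remember the IQR), a line inside the box marking the median, whiskers extending from the box edges to the smallest and largest values that are not outliers (commonly, those within 1.5 times the IQR of the box), and individual points for the outliers.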
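The numbers behind a boxplot can be computed by hand before anything is drawn. The sketch below assumes the common convention in which the whiskers stop at the most extreme data points lying within 1.5 times the IQR of the box, and anything beyond is drawn as an outlier. That multiplier, the `boxplot_stats` helper and the sample values are assumptions for illustration, not something fixed by the definition above.

```python
import statistics


def boxplot_stats(values, k=1.5):
    """Compute the pieces of a boxplot, assuming whiskers at k * IQR."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    # Whiskers end at the most extreme observations still inside the fences.
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "lower_whisker": min(inside),
        "upper_whisker": max(inside),
        "outliers": sorted(v for v in values if v < low_fence or v > high_fence),
    }


# Hypothetical heights in cm with one unusually tall value, made up for illustration.
heights = [128, 130, 131, 133, 135, 136, 138, 140, 165]
stats = boxplot_stats(heights)
for name, value in stats.items():
    print(name, "=", value)
```

With these made-up values the box runs from 131 to 138, the whiskers reach 128 and 140, and 165 is left outside as an outlier.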



Boxplots are frequently used to compare the distributions of two samples based on their medians and quartiles.

Remember the Class-V students data that I introduced in previous posts? We have the heights and weights of the students. Let's compare the distributions of the male and female students' heights.


Pic: Boxplot of Height of Class-V students

Outliers: Outliers are data points that lie far from their peer data points. They are abnormal values or spikes in the data, often caused by a random unexpected event or a human error in data gathering. It is important to be careful before labelling a value as an "outlier", because a very high or low value may carry important information and represent genuine variation in the data. There are various methods for treating outliers that reduce their impact on the data as a whole.

Coming back to the employee salary data that we introduced in the previous posts on measures of central tendency, let's plot all the salaries in a boxplot.



In the boxplot above, the director's salary is a highly abnormal value and fits the definition of an outlier. But we cannot ignore or delete this observation, as it is a true value and represents genuine variation in the data. We will focus more on outliers and their treatment in forthcoming posts.
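One way to respect a true but extreme value is to flag it rather than drop it. The sketch below uses invented salaries (not the figures from the earlier posts) and the common 1.5 times IQR rule, which is an assumption here. The `flag_outliers` helper is hypothetical. It keeps every record and only adds a marker, so the director's row stays in the data.

```python
import statistics

# Hypothetical (role, salary) records, made up for illustration only.
salaries = [
    ("analyst", 30000), ("tester", 31000), ("analyst", 32000),
    ("developer", 35000), ("developer", 36000),
    ("manager", 50000), ("manager", 52000),
    ("director", 200000),
]


def flag_outliers(records, k=1.5):
    """Return (role, salary, is_outlier) for every record; nothing is removed."""
    values = [salary for _, salary in records]
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    return [
        (role, salary, not (low_fence <= salary <= high_fence))
        for role, salary in records
    ]


flagged = flag_outliers(salaries)
outliers = [(role, salary) for role, salary, is_out in flagged if is_out]
values = [salary for _, salary in salaries]

print("Flagged as outliers:", outliers)
print("Records kept:", len(flagged))
print("Mean salary:", statistics.mean(values))
print("Median salary:", statistics.median(values))
```

In this made-up example the single extreme salary pulls the mean well above the median, which is why the median is often the safer summary when a genuine outlier has to stay in the data.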
