Measure of Central Tendency & Dispersion

Central tendency and dispersion are important concepts in statistics, because most statistical techniques, and the assumptions behind them, are based on these measures. A measure of central tendency is a single number that represents all the data points gathered for a particular variable, while a measure of dispersion describes how widely those data points are spread.

Let's understand these concepts & learn how to apply them.

Measure of Central Tendency: It is a value that lies at the center of the data and serves as a representative of all the other data points collected during a research study. It is also called a measure of central location or the center of the distribution. The mean is the most frequently used measure; the others are the median and the mode.

The mean, median and mode each have their own properties and applications. If the data is normally distributed, all three measures of central tendency coincide.
This is the right time to introduce the well-known "bell-shaped" curve of the normal distribution. I will elaborate on the various types of distributions and their probability functions in my forthcoming posts.


Consider the distribution of heights of Class-5 students shown above; it looks approximately normal. One visible property of a normal distribution is symmetry: a vertical line drawn through the center divides it into 2 equal halves. That central point is the central tendency, and the mean, median and mode are all equal for such a data set. Life is rarely that easy for a data scientist, though, and a data distribution can take many other shapes.
We will describe these in detail in later posts.

Mean: This generally refers to the arithmetic mean of the data, which is the sum of all the data points divided by the number of observations. There are other types of mean, such as the harmonic mean and the geometric mean, which are used as alternative representations of the central tendency of data.

Arithmetic mean can be calculated as

A.M. =  (∑x )/N



Example: The monthly salaries of the employees of the HR department are given below.

        20, 25, 50, 42, 38, 35, 45, 25 (all figures in '000)

The mean salary of the employees can be calculated as

        Mean salary = (20 + 25 + 50 + 42 + 38 + 35 + 45 + 25) / 8
                    = 280 / 8
                    = 35k
               

The above plot shows each employee's salary against the mean value. From this graph we can clearly see that employees 1, 2 and 8 earn less than the average salary.

An outlier is an abnormally high or low value in the data, typically recorded because of an unusual result or a human error. The mean is highly susceptible to outliers, because the presence of even one outlier can pull the mean upward or downward from its true value.

Let's again consider the salaries of the employees, but this time let's include the salary of the company director, which is 140k.

        Mean salary = (20 + 25 + 50 + 42 + 38 + 35 + 45 + 25 + 140) / 9
                    = 420 / 9
                    = 46.67k


After the introduction of the director's salary, the mean salary increases by about 33% (from 35k to 46.67k). Is this new average a true representation of the typical salary? Think about it. I will leave this topic with a thought about outliers and their impact on the mean. I will explain outliers in the forthcoming posts on quartiles, boxplots and outliers.
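The effect of the director's salary on the mean can be checked with a few lines of Python. This is only a minimal sketch using the standard library; the variable names are made up for illustration and the salaries are the ones from the example above.

```python
from statistics import mean

# Monthly salaries in thousands, taken from the example above.
staff = [20, 25, 50, 42, 38, 35, 45, 25]
with_director = staff + [140]  # add the director's 140k salary

mean_staff = mean(staff)        # 280 / 8
mean_all = mean(with_director)  # 420 / 9

# Relative change in the mean caused by one extra observation.
pct_change = (mean_all - mean_staff) / mean_staff * 100

print(f"mean without director: {mean_staff:.2f}k")
print(f"mean with director:    {mean_all:.2f}k")
print(f"increase in mean:      {pct_change:.1f}%")
```

A single extra observation moves the mean by roughly one third, which is why the mean on its own can be a poor summary when outliers are present.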


Median: The median is the middle number, i.e. the value that is centrally located when all the data points are arranged in ascending or descending order. If the number of observations n is odd, the median is the observation at position (n + 1)/2. If n is even, there are two middle observations, at positions n/2 and (n/2) + 1, and the median is the average of these two values. Remember that these formulas give the position of the data point that is taken as the median, not the median itself.

For an odd number of observations, the median will be

        Median = ((n + 1) / 2)th observation

For an even number of observations, the median will be

        Median = [ (n / 2)th obs. + ((n / 2) + 1)th obs. ] / 2

Example: Again going back to the employee salary example.
                                       
                                           20, 25, 50, 42, 38, 35, 45, 25, 140  (All fig. in '000)

  Let's arrange them in ascending order.

                                           20, 25, 25, 35, 38, 42, 45, 50, 140

        No. of data points = 9

        Median of observations = ((9 + 1) / 2)th obs. = 5th obs., i.e. 38k

As you can see, the outlier has minimal impact on the median: the observed median of 38k is still close to the mean salary of 35k that we had before the outlier was introduced.
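The position rule for the median can be sketched in Python as follows. The code is illustrative only: it applies the odd-n rule by hand to the nine salaries and then compares the result with `statistics.median` from the standard library.

```python
from statistics import median

# Salaries in thousands, director included (9 observations).
salaries = [20, 25, 50, 42, 38, 35, 45, 25, 140]

ordered = sorted(salaries)
n = len(ordered)

# For odd n, the median sits at 1-based position (n + 1) / 2.
position = (n + 1) // 2
manual_median = ordered[position - 1]  # convert to 0-based index

print(ordered)
print(f"position {position} -> median {manual_median}k")

# Library results for comparison.
print(median(salaries))       # with the director
print(median(salaries[:-1]))  # without the director: average of the two middle values
```

Without the director there are 8 observations, so the median is the average of the 4th and 5th values (35 and 38), i.e. 36.5k. Adding the outlier moves the median only from 36.5k to 38k.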


Mode: The mode is related to the frequency of the data points occurring in the data. It is the data value that is observed the maximum number of times. A series of numbers can have more than one mode; such data is said to be multimodal.

The mode is useful for data that has highly repetitive values, such as whole numbers. It is found from a frequency count of the data: the data points with the highest frequency are the modes.

Example: Take the weights (in kg) of class-V students, introduced in earlier posts.
33 31 35 37 28 34 30 28 38 40 29 28 40 36 30 45 36 41 42 32 33 30 43 42 42 42 37 34 42 35
38 38 37 40 43 28 29 40 40 33 44 35 30 37 42 40 28 33 31 44 42 35 30 37 30 36 44 39 43 28

Below is the frequency bar chart of various weights.



The modal weight for students of class-V is 42 kg, which occurs 7 times in the data. This data is unimodal, as only one data value (42) has the highest frequency. Outliers have minimal or no impact on the mode, because the mode depends on the counts of the values in the data, and outliers are by definition values that are 'rarely' observed and therefore have low counts.
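The frequency count behind the mode can be reproduced with `collections.Counter`. The sketch below uses the 60 weights listed above; collecting every value that shares the highest count (rather than just one) is a choice made here so that multimodal data would also be handled.

```python
from collections import Counter

# Weights (kg) of the 60 class-V students, copied from the two rows above.
weights = [
    33, 31, 35, 37, 28, 34, 30, 28, 38, 40, 29, 28, 40, 36, 30,
    45, 36, 41, 42, 32, 33, 30, 43, 42, 42, 42, 37, 34, 42, 35,
    38, 38, 37, 40, 43, 28, 29, 40, 40, 33, 44, 35, 30, 37, 42,
    40, 28, 33, 31, 44, 42, 35, 30, 37, 30, 36, 44, 39, 43, 28,
]

counts = Counter(weights)          # value -> number of occurrences
top_count = max(counts.values())   # highest frequency in the data

# Every value that reaches the highest frequency is a mode.
modes = sorted(w for w, c in counts.items() if c == top_count)

print(f"mode(s): {modes}, frequency: {top_count}")
```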





Measure of Dispersion: It is a measure of the spread of the data points. The measured value gives an idea of how far the data points lie from the mean line or central tendency. The different statistical measures of dispersion are
1. Variance
2. Standard Deviation 
3. Range
4. Inter-Quartile Range 
5. Mean Absolute Deviation

We will discuss all of them in detail one by one, with examples, and contrast them based on their applications.



Variance: The variance is the average of the squared differences between each data point and the mean. Now understand this! Why do we need the square of the difference? Why an average?

OK! Let's consider a small sample of the heights of Class-V students comprising 10 observations. Let's tabulate this data, find the mean, and then find the difference between each data point and the mean. The mean line is the line that passes as the best fit among all the data points. The distance between the mean line and a data point is referred to as an error, and the sum of all these errors is zero because the positive and negative errors cancel out. This is the reason we need to square the errors before adding them. Taking the average of these squared errors gives us the variance.

Height (in cm)    mean    (x - mean)    (x - mean)²
      78          86.2       -8.2          67.24
      78          86.2       -8.2          67.24
     103          86.2       16.8         282.24
      71          86.2      -15.2         231.04
      84          86.2       -2.2           4.84
     104          86.2       17.8         316.84
      85          86.2       -1.2           1.44
     103          86.2       16.8         282.24
      66          86.2      -20.2         408.04
      90          86.2        3.8          14.44
Sum: 862          862     -2.84E-14      1675.6


        Variance = (sum of squared errors) / (no. of observations)

   or

        Variance = ∑(x - mean)² / N

   i.e.,

        Variance of the heights of students = 1675.6 / 10 = 167.56
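The tabulated calculation can be reproduced step by step in Python. This is a minimal sketch that divides by the number of observations, as the post does; `statistics.pvariance` is the standard-library function for that version. The library also provides `statistics.variance`, which divides by n - 1 and would therefore give a larger value for the same data.

```python
from statistics import mean, pvariance

# Heights (cm) of the 10 students in the table above.
heights = [78, 78, 103, 71, 84, 104, 85, 103, 66, 90]

m = mean(heights)                        # 862 / 10
deviations = [x - m for x in heights]    # errors: (x - mean)
squared = [d ** 2 for d in deviations]   # squared errors: (x - mean)^2

# Average of the squared errors, dividing by the number of observations.
variance = sum(squared) / len(heights)

print(f"mean              = {m}")
print(f"sum of errors     = {sum(deviations):.2e}")  # zero, up to rounding
print(f"sum of sq. errors = {sum(squared):.2f}")
print(f"variance          = {variance:.2f}")
print(f"pvariance check   = {pvariance(heights):.2f}")
```

The sum of the raw errors comes out as a tiny number such as the -2.84E-14 in the table rather than an exact 0, because of floating-point rounding.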

It is important to mention here that if we draw many such samples from a normally distributed population, the means of those samples will themselves be normally distributed, and the mean of this distribution of sample means equals the population mean. The population mean can therefore be estimated from the sample means, and the estimate improves as more samples are drawn. The variance is also a function of the sampled data, so it too varies from sample to sample; for a normally distributed population, however, the sample variance follows a scaled chi-squared distribution rather than a normal one.

A large value of variance suggests that the observed data points are, on average, located far from the mean line, while a small variance suggests that they lie close to it.



Standard Deviation: Standard deviation is another measure of the deviation of data from its mean. It is calculated as the square root of the variance.

If the variance was sufficient, why do we need another measure? OK, let's take our previous example of height (measured in cm). If we work out the unit of the variance, it will be cm², not cm. That means if a data point deviates from the mean by 2 units, it contributes the square of 2, i.e. 4 squared units, to the variance, which is hard to interpret on the original scale. Hence the standard deviation is introduced: taking the square root brings the measure back to the unit of the data.


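Standard deviation can be calculated as

        S.D. = √Variance = √( ∑(x - mean)² / N )

For the heights example above, S.D. = √167.56, which is about 12.94 cm.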
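Since the standard deviation is just the square root of the variance, it can be sketched in Python like this. The code reuses the ten heights from the variance example and checks the result against `statistics.pstdev`, the standard-library function that divides by N.

```python
from math import sqrt
from statistics import mean, pstdev

# Heights (cm) from the variance example.
heights = [78, 78, 103, 71, 84, 104, 85, 103, 66, 90]

m = mean(heights)
variance = sum((x - m) ** 2 for x in heights) / len(heights)  # in cm^2
sd = sqrt(variance)                                           # back in cm

print(f"variance = {variance:.2f} cm^2")
print(f"S.D.     = {sd:.2f} cm")
print(f"pstdev   = {pstdev(heights):.2f} cm")
```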
                                                     

Range: The range of any series of data is defined as the difference between the largest and smallest values in the series. This is a simple but important statistic, as it gives an idea of the spread of the data on a linear scale. The range is highly susceptible to outliers and can misrepresent the spread of the data.

Let's go back again to the employee salary example used previously.

                            20, 25, 50, 42, 38, 35, 45, 25, 140  (All fig. in '000)

Arrange all the salaries in ascending order on a number line.



With the director's salary included, the range is 140 - 20 = 120. Since the director's salary is more like an outlier for this series of data, we will remove it, and the range of the data becomes

        R = (50 - 20) = 30
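How strongly one value affects the range can be seen by computing it with and without the director's salary. A minimal sketch; the helper name `data_range` is made up for this illustration.

```python
def data_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)

# Salaries in thousands; the last entry is the director's salary.
salaries = [20, 25, 50, 42, 38, 35, 45, 25, 140]
staff_only = salaries[:-1]  # drop the director, treated as an outlier

range_all = data_range(salaries)
range_staff = data_range(staff_only)

print(f"range with director:    {range_all}k")
print(f"range without director: {range_staff}k")
```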


Interquartile Range: The interquartile range is the difference between the 3rd quartile and the 1st quartile. It is not affected by very low or very high values, which are treated as outliers. The 1st and 3rd quartiles are defined as the 25th percentile and 75th percentile values of the data, so this range basically covers the middle 50% of the data.

The Inter-Quartile Range (IQR) can be calculated as
                                           
                                                                    IQR = (Q3 - Q1)

We generally plot this data in a boxplot in order to see outlier values. We will discuss Boxplots & outliers in our next post.
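As an illustration, the IQR of the salary data (director included) can be computed with `statistics.quantiles` from the Python standard library (Python 3.8 or later). Note that there are several conventions for locating quartiles in small samples; the sketch below uses the function's default "exclusive" method, so other tools may report slightly different Q1 and Q3 values for the same data.

```python
from statistics import quantiles

# Salaries in thousands, director included.
salaries = [20, 25, 50, 42, 38, 35, 45, 25, 140]

# n=4 splits the data into quartiles and returns the three cut points.
q1, q2, q3 = quantiles(salaries, n=4)
iqr = q3 - q1

print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}")
print(f"IQR = {iqr}")
```

Unlike the range, the IQR here is not stretched by the 140k salary, because that value lies outside the middle 50% of the data.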

Mean absolute deviation: The mean absolute deviation, sometimes abbreviated as MAD, is defined as the average of the absolute distances of the data points from the mean. It has the same unit as the variable itself. It is also a measure of the variability or spread of the data values around the mean.
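Using the ten heights from the variance example, the mean absolute deviation can be sketched as follows. The code simply averages the absolute errors instead of the squared ones.

```python
from statistics import mean

# Heights (cm) from the variance example.
heights = [78, 78, 103, 71, 84, 104, 85, 103, 66, 90]

m = mean(heights)

# Average absolute distance of each point from the mean.
mad = sum(abs(x - m) for x in heights) / len(heights)

print(f"mean = {m} cm")
print(f"MAD  = {mad:.2f} cm")
```

The result is in cm, the same unit as the data, and is somewhat smaller than the standard deviation of the same heights because large deviations are not squared.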

Scale of Measurement

A measured variable in a data set can be classified into one of 4 measurement scales.

Nominal Scale: As the name suggests, this scale is related to the names of categories. It is used to define the non-ordered levels or types of a categorical variable. No level is preferred over another.

Example: Gender, Vehicle-type, Blood Group etc. 

Ordinal Scale: This scale has a set preference of one level or category over another, but the difference between levels cannot be expressed as a number, i.e. we cannot say by what value one level differs from another. The ordered levels have a definite pattern.

Example: Grades, Travel Class, etc. 

Interval Scale: On this scale every observed value can be expressed as a number. This is a big relief, as until now we were just talking of levels and orders. The scale has the property of equal intervals, but it has no reference point, i.e. no 'true zero'. This means you can add or subtract values, but ratios obtained by multiplication or division are not meaningful.

Let's elaborate with an example. Consider a compressed air application: if the air is at 4 bar g (gauge pressure), you can increase the pressure by 5 bar to make it 9 bar g. This is a simple increment and the numbers add up. But we can never say that a pressure of 8 bar g is twice as high as 4 bar g. So this scale has an additive property only.

Also, a reading of zero on this scale does not mean an absence of pressure. Zero gauge pressure is simply atmospheric pressure, and a vacuum is recorded as a negative gauge pressure.

Example: Gauge pressure, temperature (in °C or °F), etc.

Ratio Scale: This is a measurement which has equal intervals as well as the reference of a 'true zero'. This scale is placed at the top of the measurement scales, as it possesses all the properties of measurement. The best example is a weight scale: a measurement of 0 lbs is a true zero, and we can't have negative weights.

Example: Height, weight.
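One way to see the difference between the interval and ratio scales is to change the unit of measurement and check whether a ratio survives. The sketch below is an illustration only: it uses temperature (interval) and weight (ratio) with made-up readings, and the function names and the approximate kg-to-lb factor are assumptions of this example rather than anything from the data set.

```python
def celsius_to_fahrenheit(c):
    """Interval scale: the conversion shifts the zero point (+32)."""
    return c * 9 / 5 + 32

def kg_to_lb(kg):
    """Ratio scale: the conversion only rescales, zero stays zero."""
    return kg * 2.20462  # approximate conversion factor

# Two temperature readings and two weights (illustrative values).
t1, t2 = 10.0, 20.0
w1, w2 = 30.0, 60.0

ratio_c = t2 / t1
ratio_f = celsius_to_fahrenheit(t2) / celsius_to_fahrenheit(t1)
ratio_kg = w2 / w1
ratio_lb = kg_to_lb(w2) / kg_to_lb(w1)

print(f"temperature ratio: {ratio_c:.2f} in C, {ratio_f:.2f} in F")
print(f"weight ratio:      {ratio_kg:.2f} in kg, {ratio_lb:.2f} in lb")
```

The temperature ratio changes from 2.00 to 1.36 when the unit changes, so "twice as hot" has no meaning on an interval scale. The weight ratio stays at 2.00 in both units, which is what the true zero of a ratio scale guarantees.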

Below is an example of all the scales in data collection process.



Types of Data

After conducting research or gathering data from a legitimate source, a data scientist comes across a slew of data collected from different sources and measured on different scales.

This data can be classified..

..Based on Data Source:
The data can be classified as primary or secondary data based on the type of source. 

Primary Data: Primary data is collected first-hand from a survey, focus group interviews, POS (point of sale) records, etc. The survey or questionnaire is designed by the researcher, hence the data collected is very relevant and to the point for the research objectives.

Secondary Data: Secondary data is collected from legitimate existing sources such as internet sources (like Plunkett Research, Business Reporter, etc.), government census, S&P's industry surveys, white books, etc. This data is prepared for general purposes, so it is sometimes not directly relevant to the research objectives, but it gives an idea of the macro-environment of the research study. It is used as a reference in further study.

..Based on Data Type:
When looking at type of data, it can be classified as numeric or categorical data. 

Numeric Data: As the name suggests, it comprises all types of numbers. This data is generally further sub-classified as continuous data and discrete data.
A continuous variable is one which can assume any value within a range, like 3.45, 12, 6.57e-2, etc. Some examples are temperature, the Sensex, etc.
A discrete variable can take only particular values, usually whole numbers. Most of the time it is count or frequency data.

Categorical Data: A categorical variable takes values which are categories. As an example, "type of vehicle" is a categorical variable and can take values such as 2-wheeler, car, heavy vehicle, etc. These attributes are further used to sub-classify the data. Sometimes this data also carries a preference or order attached to each category, like graduation results defined on grades such as A+, A, B+, B, etc., where a definite order can be assumed: A+ is better than A, A is better than B+, and so on.
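An ordered category has to be given its order explicitly when it is handled in code, because the labels themselves do not carry it. The following Python sketch is one possible way to do this with a rank lookup; the full grade order (including where C and F sit) is an assumption made for the illustration, and the ten grades are sample values.

```python
# Assumed order of grades, best to worst (an assumption for this example).
grade_order = ["A+", "A", "B+", "B", "C", "F"]
rank = {grade: position for position, grade in enumerate(grade_order)}

# Grades of ten students (illustrative sample).
grades = ["C", "A+", "C", "C", "A", "C", "B+", "A+", "C", "F"]

# A plain alphabetical sort treats grades as unordered labels
# and puts "A" ahead of "A+".
alphabetical = sorted(grades)

# Sorting by rank respects the ordinal meaning of the categories.
by_rank = sorted(grades, key=lambda g: rank[g])

print("alphabetical:", alphabetical)
print("by rank:     ", by_rank)
```

For a purely nominal variable such as type of vehicle, no such rank exists, and only counts per category are meaningful.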

Statistical Analysis - An introduction

Statistics is a science that helps in interpreting, understanding and presenting a large volume of numerical and categorical data. It calculates various parameters and statistics which represent the data as a whole. Now why is this required? Why not use the complete series of numbers, whether collected by researchers through a survey or as terabytes of data gathered by machines?

The answer lies within the question. This data is humongous and it is difficult to analyze the whole of it simultaneously. Hence a researcher or data scientist requires the best representation of this very large collection of numbers. This is where statistics comes into the picture: it gives an overall idea of the data, describing its central tendency by the mean, median and mode, and its spread or distribution by the variance or standard deviation.

These calculated values are information generated out of the data. How is this information going to benefit a data scientist, and does it have any "significance" with respect to the statistics of the complete data? Enter the analysis part of the process. Analysis means taking this information and turning it into useful insights which are not visible otherwise. It is generally supplemented by the data scientist's own knowledge of the subject and by hypothesis tests.

Let's take an Example!!


We have a small data set of class-V students, which comprises

No. of students : 60
Male students : 38
Female students : 22



Student ID   Gender   Height (in cm)   Weight (in kg)   Grade
1   Male 78 33   C
2   Female 78 31   A+
3   Female 103 35   C
4   Female 71 37   C
5   Female 84 28   A
6   Male 104 34   C
7   Male 85 30   B+
8   Male 103 28   A+
9   Male 66 38   C
10   Female 90 40   F
11   Male 108 29   B
12   Female 110 28   A+
 13   Female 108 40   A
14   Male 78 36   B+
15   Male 106 30   A
16   Male 91 45   F
17   Male 85 36   A+
18   Male 104 41   F
19   Male 73 42   A+
20   Male 89 32   B
21   Female 74 33   F
22   Male 91 30   A
23   Male 66 43   F
24   Male 104 42   A
25   Male 113 42   B
26   Male 89 42   B+
27   Male 98 37   B+
28   Female 99 34   B+
29   Male 76 42   B
30   Female 71 35   A
31   Male 82 38   B
32   Female 69 38   B
33   Female 85 37   A+
34   Female 69 40   A
35   Female 75 43   C
36   Male 109 28   B
37   Female 73 29   B
38   Male 72 40   B
39   Male 111 40   A+
40   Male 106 33   F
41   Female 115 44   A+
42   Female 68 35   F
43   Female 73 30   A
44   Male 90 37   B
45   Female 66 42   A
46   Male 114 40   B
47   Male 108 28   A
48   Female 107 33   A
49   Male 69 31   A+
50   Male 117 44   B+
51   Male 94 42   F
52   Male 109 35   A
53   Male 92 30   A
54   Male 82 37   A+
55   Female 82 30   B
56   Male 66 36   B
57   Female 105 44   A+
58   Male 88 39   B+
59   Male 83 43   A
60   Male 77 28   C
             


This data covers each student's height in cm, weight in kg, and last year's grade.



Now if we say that the mean height of female students is 85.2 cm and that of male students is 91.5 cm (as computed from the table above), it means that these mean values are single-number representations of the heights of the female and male populations respectively.

If we compare the mean heights of female and male students, we find that the mean height of female students is lower than that of male students in class-V. This is an insight that we could draw by looking at the mean values of the data. Let's keep it this way until we reach the forthcoming topic of hypothesis testing.
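The group means can be recomputed directly from the table. The sketch below is illustrative: it encodes the Gender column as a string of M/F letters and the Height column as a list, both in student ID order, and then averages each group. With the table as reproduced above, the means work out to roughly 85.2 cm for female and 91.5 cm for male students.

```python
from statistics import mean

# Gender (M/F) for student IDs 1..60, ten students per chunk,
# copied from the table above.
genders = (
    "MFFFFMMMMF" "MFFMMMMMMM" "FMMMMMMFMF"
    "MFFFFMFMMM" "FFFMFMMFMM" "MMMMFMFMMM"
)

# Height in cm for student IDs 1..60, in the same order.
heights = [
    78, 78, 103, 71, 84, 104, 85, 103, 66, 90,
    108, 110, 108, 78, 106, 91, 85, 104, 73, 89,
    74, 91, 66, 104, 113, 89, 98, 99, 76, 71,
    82, 69, 85, 69, 75, 109, 73, 72, 111, 106,
    115, 68, 73, 90, 66, 114, 108, 107, 69, 117,
    94, 109, 92, 82, 82, 66, 105, 88, 83, 77,
]

# Split the heights by gender.
female = [h for g, h in zip(genders, heights) if g == "F"]
male = [h for g, h in zip(genders, heights) if g == "M"]

mean_female = mean(female)
mean_male = mean(male)

print(f"female: n = {len(female)}, mean height = {mean_female:.1f} cm")
print(f"male:   n = {len(male)}, mean height = {mean_male:.1f} cm")
```

Two numbers now stand in for 60 rows, which is exactly the kind of summary that makes a comparison between groups practical.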


A question may arise: why can't we go for a one-to-one comparison of the data? The answer is simple and straightforward: it can be done by creating a cross table for male and female students and then analyzing the outcomes further. But this is easier said than done for a large data set with millions of observations over numerous variables.

This sets the stage for a complete series on data analysis and the statistical approach. We will next discuss the various components of data.



>>This series of blog posts is an attempt to demystify the various concepts of statistical analysis in a simpler way. We hope it is of some help to you in getting things right.

Ethics and marketing go hand in hand

Today I was traveling in a humble but jam-packed BEST bus when the advertisement placards pasted above the windows caught my attention. There is nothing new in their format, but this time they were carrying messages which were subtle but important. One message read "Please check carefully beneath your seat before sitting" and carried a toll-free number. The important thing to note is that this message was from a mobile phone company and the other one from an FMCG company.

Welcome to the era of ethical marketing!!

This trend is not new, but it is being used extensively these days by marketers. Whether it is the Save Tiger campaign of Aircel or the "Save trees, use mobile phone" campaign of Idea, everyone wants to associate itself with one social cause or another. A few days back Idea came out with an Earth Hour campaign in which it appealed to people to turn off their lights for an hour to cut the world's carbon emissions. Then came its campaign related to public security, where they talked about donating all the money from usage to a charity for the victims of the 26/11 attack.
The same thing was tried by the Times of India when they came up with the Lead India campaign and talked about the corrupt political structure of the country. HT was not left behind and came up with a campaign on ethical reporting.
Way back in 2003 Surf Excel started something new by saying "Do bucket Pani Ab Rojana hai Bachana" (roughly, "save two buckets of water every day"), and then many companies started making ads with social messages. Idea gave it a new direction with its "What an Idea Sirji" campaign.

Every brand has started to make its way into the minds of customers by portraying itself as their well-wisher, someone who cares for them in the true sense. HT came up with No TV Day in Mumbai to give human relationships more importance, and Earth Day doesn't need any introduction.











Hoping to see a lot more in coming years!!

My First SAS Program