Steps involved in Code processing in SAS

This post attempts to answer questions such as: what exactly happens when SAS runs a program, and what happens at the back end when a user writes code in the SAS programming window and submits it? Understanding this back-end processing helps a programmer know how SAS runs, and can make life easier by taking advantage of the execution process.

SAS processes any submitted code or data in two stages.

Compile Stage: In this stage SAS performs the following tasks:
1. SAS assigns an area in memory, called the input buffer, to hold each raw record of input data.
2. SAS checks the input file and determines the attributes of each variable (i.e. data type, length, etc.).
3. SAS scans the code for invalid syntax and errors and determines the names of the variables.
4. A descriptor portion is formed, which stores all information related to the variables, such as variable name, data type, length, label, default format and informat, etc.

During the compile stage, SAS does not read any data from the input file and does not evaluate any logical or conditional loops or statements. A reserved area of memory called the program data vector (PDV) is also created to hold the current values of all variables, along with automatic variables such as _N_ and _ERROR_. SAS then starts checking the input data code, and if any variable is assigned in between, it checks its data type and name and assigns it a place in the descriptor portion.

Tip: We can declare the lengths of variables before the INPUT statement. These lengths are then stored in the descriptor portion as the default lengths.

Let's consider a small program for employee salaries.

DATA EMP_SAL;
INPUT EMP_ID EMP_NAM$ AGE GENDER$ SALARY;
SALPERAGE = SALARY/AGE;
DATALINES;
101 AJAY 30 M 30000
102 MANI 28 F 28500
103 SAHIL 32 M 35000
;
RUN;

After we submit the above program, SAS will first create a descriptor portion, which stores all the attributes of the variables. The descriptor portion will look like this:

Descriptor Portion:

Variable    Type       Length    Format   Informat
EMP_ID      Numeric    8 bytes   12.      12.
EMP_NAM     Character  8 bytes   $8.      $8.
AGE         Numeric    8 bytes   12.      12.
GENDER      Character  8 bytes   $8.      $8.
SALARY      Numeric    8 bytes   12.      12.
SALPERAGE   Numeric    8 bytes   12.      12.

As you can see, SAS has allocated default storage to each variable.

Now SAS is ready to run its second stage of processing the code.

Execution Stage: In the execution stage, all variable values are first set to missing. In SAS a numeric missing value is denoted by a period ('.') and a character missing value by a blank space (' '). SAS resets all variable values to missing every time it reads a new line of data. An internal pointer keeps track of the current record being executed, and SAS keeps running until it reaches an end-of-file marker.

Program Data Vector: _N_ = 1

EMP_ID   EMP_NAM   AGE   GENDER   SALARY   SALPERAGE
.        (blank)   .     (blank)  .        .

When SAS executes the first observation (line of data) in the above program:

Program Data Vector: _N_ = 1

EMP_ID   EMP_NAM   AGE   GENDER   SALARY   SALPERAGE
101      AJAY      30    M        30000    .

Then it calculates the SALPERAGE variable and puts this value in the Program Data Vector:

Program Data Vector: _N_ = 1

EMP_ID   EMP_NAM   AGE   GENDER   SALARY   SALPERAGE
101      AJAY      30    M        30000    1000

Once SAS reaches the end of the DATA step, it writes the current observation to the output dataset and returns to the top of the step. The internal pointer is set to 2, all variable values in the Program Data Vector are reset to missing, and SAS is ready to execute the second line of data.

Program Data Vector: _N_ = 2

EMP_ID   EMP_NAM   AGE   GENDER   SALARY   SALPERAGE
.        (blank)   .     (blank)  .        .

SAS will keep on executing until it reaches an end of file marker.
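The compile-and-execute cycle described above can be sketched outside SAS. The following Python sketch (the names and structure are illustrative, not SAS internals) mimics the DATA step's implicit loop over the EMP_SAL datalines: reset the PDV to missing, read a record, run the executable statements, then output the observation.

```python
# Illustrative sketch of the DATA step's implicit loop (not real SAS internals).
records = [
    "101 AJAY 30 M 30000",
    "102 MANI 28 F 28500",
    "103 SAHIL 32 M 35000",
]

dataset = []
for n, line in enumerate(records, start=1):          # n plays the role of _N_
    # Reset the "PDV" to missing at the top of each iteration
    pdv = dict.fromkeys(
        ["EMP_ID", "EMP_NAM", "AGE", "GENDER", "SALARY", "SALPERAGE"])
    # INPUT: read raw values from the current record
    emp_id, emp_nam, age, gender, salary = line.split()
    pdv.update(EMP_ID=int(emp_id), EMP_NAM=emp_nam, AGE=int(age),
               GENDER=gender, SALARY=int(salary))
    # Executable statement: SALPERAGE = SALARY / AGE
    pdv["SALPERAGE"] = pdv["SALARY"] / pdv["AGE"]
    # Implicit OUTPUT at the bottom of the step
    dataset.append(dict(pdv))

print(dataset[0]["SALPERAGE"])  # 1000.0
```

The key idea this illustrates is that the DATA step is an implicit loop: missing-value reset, read, compute, output, once per record, until end of file.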

Inputting / Importing a file into SAS.


SAS has various methods of inputting a file as a SAS dataset. These methods depend upon the type and format of the data being brought into SAS. The following are some of the methods to bring in large data files as SAS datasets:

1. Infile statement
2. Column input
3. Formatted input
4. Using library engine
5. Proc import
6. Using menu options.

Infile statement: An INFILE statement is the most common and most frequently used data input method. It has various options depending upon the format of the data in the input file. Below is the syntax for a file in CSV format; I will explain the usage of the options in detail.

DATA FILE1;
INFILE "<FILE PATH>.../CLASS5.CSV" DSD TRUNCOVER;
INPUT ID GENDER$ HEIGHT WEIGHT GRADES$;
RUN;

The INPUT statement declares all the variable names; this creates a field for each variable in the descriptor portion at the compile stage. The INFILE statement must appear before the INPUT statement and takes a path or file reference in quotes (" "); this tells SAS where the data file we want to fetch is located.
The DSD option tells SAS that the data is "delimiter-sensitive data", and if no delimiter is mentioned with the DLM= option, SAS treats the file as comma-separated.

TRUNCOVER, MISSOVER, STOPOVER, SCANOVER and FLOWOVER are some of the important options we may need to mention, after having a look at the CSV file.

Truncover: This option is used with column input and formatted input. When a data line ends before a variable's full field width has been read, TRUNCOVER assigns whatever is available in the input buffer to the variable instead of moving to the next line.

Missover: If a data line ends before all variables have received values, MISSOVER sets the remaining variables to missing rather than moving to the next line.

Flowover: This is the default behavior. If SAS does not find values for all variables in the current input line, it continues reading from the next data record.

Stopover: Causes the DATA step to stop processing if an INPUT statement reaches the end of the current line without finding values for all variables. It sets _ERROR_ to 1, stops inputting values into the dataset, and prints the incomplete data line in the log.

Column input: Column input is a simpler way of reading a .txt file into SAS. It has drawbacks, though: it cannot be used if a data value contains commas or dollar signs, and it can read dates only as character values.

Consider a .txt file containing the data below.

Col. No.:  1-10        12-13   15-19   21
           AJAY SINGH  35      35000   M
           NEERJA      28      32000   F

Here is a program showing column input:

DATA SAL_DATA;
INFILE "<FILE-PATH.TXT>";
INPUT NAME $ 1-10 AGE 12-13 SALARY 15-19 GENDER $ 21;
RUN;

The NAME column runs from column 1 to column 10, hence its position is given as 1-10 in the INPUT statement. Similarly, AGE occupies columns 12-13, SALARY columns 15-19, and GENDER column 21.


Formatted Input:
In this method we mention informats to let SAS know what kind of value we are inputting. This method can read standard and non-standard formats, date formats, etc. We have to mention only the starting position of each value, not the complete range, unlike the column input method where we mention both the starting and finishing positions. Each starting position is prefixed by an "@", which is called the column pointer.


Now consider the data in the above file, but this time in a more complex form.



Col. No.:  1-15             16-18   21-29       31
           MR. AJAY SINGH   35.5    $35000.00   M
           NEERJA           28      $32000.00   F


DATA SAL_DATA;
INFILE "<FILE-PATH>";
INPUT @1  NAME   $15.
      @16 AGE    3.1
      @21 SALARY DOLLAR9.2
      @31 GENDER $1.;
FORMAT NAME $15. AGE 3.1 SALARY DOLLAR9.2 GENDER $1.;
RUN;

Looking at the INPUT statement: it could be written on one line, but to make it more understandable I have placed each variable declaration on a new line. This is valid because SAS only needs a semicolon (";") at the end of every statement, and we close our INPUT statement after declaring the GENDER variable. Notice the "@" sign before each starting position, which points to where the value begins. An informat is declared after every variable name; this tells SAS how to read the data, and it is stored in the descriptor portion.

In order to store or display a data value in the SAS dataset the same way it is written in the data file, we mention a FORMAT statement after the INPUT statement. Here it is necessary to introduce formats and informats, which will be described at greater length in an upcoming post. For now, just remember that a SAS informat is used to read data into the input buffer, while a SAS format is used to write the data into the SAS dataset.

My First SAS Program


SAS (Statistical Analysis System) is a software tool used to implement different statistical techniques and analyses on large data. SAS is one of the market leaders in the analytics tools industry. Some of its major features are its wide function library, the acceptability of its results, and data security.
This page will introduce you to SAS coding and guide you through its various aspects.
I will keep updating this page with new functions from time to time.

Let's begin: open your SAS window.
Courtesy: SAS Institute; this edition is for educational purposes only.

Once you click on the SAS software icon, a window like the one shown above will open. This window consists of four main sub-windows.

Program Editor window: A SAS user writes all program code in this window.

Log window: Here, a log of all compile- or run-stage errors and warnings is displayed. It is imperative for a user to understand these messages and change the program accordingly.

Explorer window: This acts as navigation, letting the user move from one folder to another. All datasets, formats and macros are saved in libraries, which can be explored from this window.

Output window: All program outputs are shown in this window. There are many configuration options to optimise the output of any procedure.

My First SAS program:

Let's write a simple program for Class-V students. The data consists of Student ID, Height, Weight, Gender and Grades.

Every SAS program starts with a DATA step or a PROC step.
A DATA step is used when creating, editing or sub-setting a dataset.

A PROC step applies a procedure to a dataset. The procedure can invoke a function such as calculating the mean of the data, creating a cross tabulation, or running an analysis.

Since we are creating a dataset, we will begin with a DATA step.

DATA CLASS5;
INPUT ID GENDER$ HEIGHT WEIGHT GRADES$;
DATALINES;
1 Male 78 33 C
2 Female 78 31 A+
3 Female 103 35 C
4 Female 71 37 C
5 Female 84 28 A
6 Male 104 34 C
7 Male 85 30 B+
...
..
.
57 Female 105 44 A+
58 Male 88 39 B+
59 Male 83 43 A
60 Male 77 28 C
;
RUN;

PROC PRINT DATA = CLASS5;
VAR HEIGHT WEIGHT;
RUN;

This is the right time to introduce the rules for naming a variable or a dataset.
1. A variable or dataset name must start with a letter or an underscore ( _ ), not a number.
2. The length of a name cannot be more than 32 characters.
3. Spaces or dashes are not allowed.

Names like Class5, ID, Gender, Height, Weight and Grades are all valid and can be used in SAS programming.
Class5 is a dataset name, which is why it is mentioned in the DATA statement.

SAS has simplified data types by classifying them into two categories.

First, numeric, which represents data consisting of numbers. In our dataset, Height and Weight are the two numeric variables. Their storage length in SAS is 8 bytes.

Character data consists of letters, special characters and numbers. To tell SAS that a variable is of character type, we place a dollar sign '$' after the variable name. The default storage length is 8 bytes; a character variable can hold up to 32,767 characters.

DATALINES is a keyword that tells SAS that the lines that follow are data for the variables already declared in the INPUT statement.

A RUN statement is placed at the end of every SAS step. In its absence, SAS keeps scanning for one.

Pic: 2 Showing Dataset CLASS5



PROC PRINT is a procedure that prints output to the Output window. Here we are printing the Height and Weight variables. Below is the output of the program.

Pic:3 Showing Output on Output window

Arithmetic Functions

MS-Excel is an important and commonly used software tool for performing simple mathematical functions and visualizations like plots and graphs, as well as advanced statistical techniques for data analysis. It has come a long way since Microsoft first introduced it in 1987 as a pioneering spreadsheet software. Since then, many new functions have been introduced and the software's capabilities have grown.

On this page we will discuss some important functions used by statisticians and researchers every day. We will also look into some advanced Excel features like pivot tables, data validation and other analysis functions recently introduced in Excel 2016.

I assume that you are aware of the simple drag & drop methods and shortcut keys, so I am skipping those for now.

An Excel formula can be written in the cell where it is applied, or, after selecting the cell, in the formula bar after an '=' sign.

                                     
First let's begin with simple mathematical functions.

Arithmetic functions: All arithmetic functions can be performed in Excel by just selecting the values. Some of the functions are given below.

SUM function: This is used to sum all the selected cells. The first and last selected cells appear in the formula, i.e.

                                              =SUM(Cm:Cn)



AVERAGE function: It is used to find the arithmetic average of the selected cells' values. For other kinds of means, such as the geometric or harmonic mean, Excel 2016 has separate formulas.



There are other functions, like AVERAGEA and AVERAGEIF, that combine logic with calculation. AVERAGEA is similar to AVERAGE, except that it also evaluates text and logical values in the selected cells (text counts as 0, TRUE as 1). AVERAGEIF and AVERAGEIFS test logical conditions on particular cells, and only where the test is TRUE is the corresponding Average_range cell included in the calculation.


COUNT function: As the name suggests, it counts the number of cells, based on some criteria attached to it. There are several functions in this family: COUNT() counts cells that have numbers in them, while COUNTA() counts cells that are not empty. Other COUNT functions are listed in the table below with their usage.

Function      Explanation
COUNT         Counts cells that contain numbers
COUNTA        Counts all non-empty cells
COUNTBLANK    Counts all blank cells
COUNTIF       Counts cells within a range that meet a condition
COUNTIFS      Counts cells within a range that meet a set of conditions
DCOUNT        Counts cells that contain numbers in a database field, matching given conditions
DCOUNTA       Counts non-empty cells in a database field, matching given conditions
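To make the semantics of the COUNT family concrete, here is a small Python analogy (the cell values are made up for illustration); each line mirrors the Excel formula shown in its comment.

```python
# A hypothetical "range" of cells; "" stands for an empty cell.
cells = [10, "", 25, "N/A", 0, "", 7]

count      = sum(1 for c in cells if isinstance(c, (int, float)))  # =COUNT(range)
counta     = sum(1 for c in cells if c != "")                      # =COUNTA(range)
countblank = sum(1 for c in cells if c == "")                      # =COUNTBLANK(range)
countif    = sum(1 for c in cells                                  # =COUNTIF(range, ">5")
                 if isinstance(c, (int, float)) and c > 5)

print(count, counta, countblank, countif)  # 4 5 2 3
```

Note how COUNT skips the text value "N/A" but still counts 0, while COUNTA counts everything that is not empty.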



Statistical functions: Excel has many functions used extensively by data scientists when analysing data, applying various hypothesis tests, and calculating statistics. Excel 2016 has a special section under the Data tab which covers almost all of these statistical techniques. I will explain each of them in detail, with their utility and approach.

Path: Data -> Analyze -> Data Analysis


Pic3: F-Test- 2 sample variance Analysis for Height & Weight of 5 student

Concept of Probability & Probability distributions

Probability plays an important role in the estimation of various statistics and parameters, and hence it is imperative to understand probability concepts before moving forward. Whenever you work on any problem related to statistics, there is always a "chance" that your results are acceptable or within an acceptable range.
Hypothesis testing is based solely on the chance that a statement is supported by the results or not. Here, too, different probability distributions come into the picture.

Probability: Probability is a measure of the chance of a favorable outcome; it can be calculated as the ratio of the number of favorable outcomes to the total number of outcomes. This is an important concept, as it helps in predicting the chances of a future event, which is at the core of any predictive or prescriptive data analysis.
Many events do not fall neatly within this definition of probability but can still be reasoned about using Bayes' theorem and other advanced probability concepts.

    Probability = (No. of favorable outcomes) / (Total no. of outcomes)

It can be represented as
    P(X) ∈ [0, 1], meaning the probability of any event X lies between 0 and 1, inclusive.

To understand the concept of probability more simply, take the example of a railway station from which a passenger can go in only two directions, north or south. You are standing near the ticket counter and a passenger comes to the ticket window. What is the probability that he boards a northbound train?

OK, let's count. Favorable outcomes = 1 (a northbound train), and total outcomes = 2 (a northbound or a southbound train).

So the probability that this passenger takes a northbound train is

    Probability = (No. of favorable outcomes) / (Total no. of outcomes) = 1/2 = 0.5, or 50%

So there is a 50% chance, or a probability of 0.5, of him boarding a northbound train.

Assume that a passenger, after taking a ticket, will always board a train going in his chosen direction.

Now, does that mean that if 2,000 passengers come to the station, then 1,000 will board a northbound train and the other 1,000 a southbound train?
Not exactly! Probability is just a measure of the chance of a favorable outcome.

It also does not mean that, out of 100 passengers, if 50 take a northbound train the other 50 are bound to take a southbound one; that never happens. But repeating this experiment a large number of times will return a ratio close to 0.5, and asking an ever larger number of passengers about their direction (northbound or southbound) drives the ratio toward 0.5 in the limit.
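A quick way to see this convergence is to simulate it. The sketch below (Python, with a fixed random seed for reproducibility; the passenger counts are arbitrary) draws north/south choices at random and prints the observed northbound proportion, which settles near 0.5 as the number of passengers grows.

```python
import random

random.seed(42)  # fixed seed so the run is reproducible

# Simulate n passengers, each choosing north or south with equal chance;
# the observed proportion of northbound choices approaches 0.5 as n grows.
for n in (10, 1_000, 100_000):
    north = sum(random.choice(("north", "south")) == "north" for _ in range(n))
    print(n, north / n)
```

With only 10 passengers the proportion can be far from 0.5; with 100,000 it is typically within a fraction of a percent.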

There are certain terminologies one should be aware of when working with probability.

Experiment: An experiment is the occurrence of a random event.
Sample space: The set of all possible outcomes of an experiment.
Sample point: Sometimes referred to as a sample, this is one possible outcome of an experiment. It is a unit subset of the sample space.

Event: A set of one or more sample points (possible outcomes).

Mutually exclusive events: Events are mutually exclusive if the occurrence of one causes the non-occurrence of the others.

Non-mutually exclusive events: Events are non-mutually exclusive if they can occur together; the occurrence of one does not prevent the occurrence of the others.

Exhaustive events: A set of events is exhaustive if together the events cover the entire sample space, so that at least one of them must occur.

Quantiles, Boxplots & Outliers

There is another way of explaining the distribution, or spread, of the data points of a variable: a method that describes the distribution by binning or partitioning the data on a number line.


Quantiles: This is a method of dividing or partitioning data after arranging it in ascending order. The partitioning, or binning, can be carried out in various ways; some commonly used ones are defined below.

Quartiles: The data is parted into 4 bins, with partitions created at the 25th, 50th and 75th percentile values.

Deciles: The data is parted into 10 bins by creating partitions at the 10th, 20th, 30th, ... 80th and 90th percentile values.

Percentiles: The data is divided into 100 parts by drawing partitions at the 1st, 2nd, 3rd, ... 97th, 98th and 99th percentile values.
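As a quick check of these definitions, Python's statistics module can compute quantile cut points directly. The dataset here is just the integers 1-20 for illustration; note that different software uses slightly different interpolation conventions, so cut points can vary a little between tools.

```python
import statistics

data = list(range(1, 21))  # 1, 2, ..., 20

# n=4 -> the three quartile cut points (25th, 50th, 75th percentiles)
quartiles = statistics.quantiles(data, n=4)
print(quartiles)        # [5.25, 10.5, 15.75]

# n=10 -> the nine decile cut points
deciles = statistics.quantiles(data, n=10)
print(len(deciles))     # 9
```

In general, cutting data into n quantile bins requires n - 1 cut points, which is why quartiles give 3 values and deciles give 9.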

Some other, lesser used types of quantiles are
Quintiles: 5 quantiles
Sextiles: 6 quantiles
Septiles: 7 quantiles
Octiles: 8 quantiles
Ventiles: 20 quantiles
Permilles: 1000 quantiles


Boxplots: A boxplot is a graphical representation of the distribution of data based on its quartiles, central tendency and outliers. A boxplot comprises a box bordered at the 25th and 75th percentiles (remember the IQR), a line representing the median, whiskers extending from the box to the most extreme non-outlying points, and outliers plotted individually beyond the whiskers.



Boxplots are used very frequently to compare the distributions of two samples based on their medians and quartiles.

Remember the Class-V students' data introduced in previous posts: we have the heights and weights of the students. Let's compare the distributions of the male and female students' heights.


Pic: Boxplot of Height of Class-V students

Outliers: Outliers are data points that lie far from their peer data points. They are abnormal values or spikes in the data, occurring due to some random unexpected event or human error in data gathering. It is very important to identify a value as an "outlier" carefully, as there is a chance that this very high or low value carries important information and, in turn, real variation in the data. There are various methods for treating outliers, which reduce their impact on the data as a whole.

Coming back to the employee salary data introduced in the previous posts on measures of central tendency, let's plot all the salaries in a boxplot.



In the above boxplot, the director's salary is a very abnormal value and fits every definition of an outlier. But we cannot simply ignore or delete this observation, as it is a true value and real variation in the data is attached to it. We will focus more on outliers and their treatment in forthcoming posts.

Measure of Central Tendency & Dispersion

The central tendency and measures of dispersion of data are important concepts in statistics, as most statistical techniques, and the assumptions behind those methodologies, are based upon them. A measure of central tendency is a single-number representation of all the data points gathered for a particular variable, while a measure of dispersion describes the spread of these data points.

Let's understand these concepts & learn how to apply them.

Measure of central tendency: This is a measurement of the point in the data that lies centrally among all the data points. This centrally located point represents all the other data collected during a research study. It is also sometimes called a measure of central location, or the center of the distribution of the data. The mean is the most frequently used such measure, alongside the median and the mode.

Mean, median and mode each have their own properties and applications, but if data is normally distributed and follows all the other assumptions of a normal distribution, then all its measures of central tendency coincide.
This is the right time to introduce the well-known "bell-shaped" normal distribution curve, although I will elaborate on the various types of distributions and their probability functions in forthcoming posts.


Now, considering the above distribution of the heights of Class-V students, it appears normally distributed. By normally distributed, I mean the distribution can be divided into two equal halves by a line drawn from the center point. This centrally located point is the central tendency, and all measures of central tendency (mean, median and mode) will be equal for such a dataset. Life is not that easy for a data scientist, though, and data distributions can take any shape; we will describe this in detail in later posts.

Mean: This generally refers to the arithmetic mean of the data, which is the sum of all the data points divided by the number of observations. There are other types of means, like the harmonic mean and the geometric mean, which are used as alternative representations of the central tendency of data.

The arithmetic mean can be calculated as

    A.M. = (∑x) / N



Example: The monthly salary of employees of HR department are given below.
                                         20, 25, 50, 42, 38, 35, 45, 25 (All fig. in '000)
           
The mean salary of the employees can be calculated as

    Mean = (20 + 25 + 50 + 42 + 38 + 35 + 45 + 25) / 8 = 280 / 8 = 35k
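The same calculation in Python, using the salary figures above (in '000):

```python
import statistics

# HR department salaries (in '000), as in the worked example above
salaries = [20, 25, 50, 42, 38, 35, 45, 25]

mean_manual = sum(salaries) / len(salaries)   # 280 / 8 = 35.0
mean_lib = statistics.mean(salaries)          # same result via the library

print(mean_manual)  # 35.0
```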
               

The above plot shows employee salaries with respect to the mean value. From this graph we can clearly see that employees 1, 2 and 8 earn less than the average salary.

An outlier is an abruptly high or low data value, recorded due to an abnormal result or human error. The mean is highly susceptible to outliers, as the presence of an outlier can pull the mean up or down from its true value.

Let's again consider the employees' salaries, but this time include the salary of the company director, which is 140k.

Mean salary of employees = (20 + 25 + 50 + 42 + 38 + 35 + 45 + 25 + 140) / 9 = 420 / 9 ≈ 46.67k


After the introduction of the director's salary, the mean increases by about a third. Is this new average a true representation of the typical salary? Think about it. I will leave this topic with the thought of outliers and their impact on the mean; I will explain outliers further in the post on quantiles, boxplots and outliers.


Median: The median is the middle number, the centrally located value when all the data points are arranged in ascending or descending order. If the number of observations n is odd, the median is the ((n + 1)/2)th observation. If n is even, there are two middle values, the (n/2)th and the (n/2 + 1)th observations, and the median is taken as their average. Remember that these calculations give the position of the data point deemed the median, not the median itself.

For an odd number of observations,

    Median = ((n + 1) / 2)th observation

For an even number of observations,

    Median = average of the (n/2)th and (n/2 + 1)th observations

Example: Again going back to the employee salary example.
                                       
                                           20, 25, 50, 42, 38, 35, 45, 25, 140  (All fig. in '000)

  Let's arrange them in ascending order.

                                           20, 25, 25, 35, 38, 42, 45, 50, 140

Number of data points = 9

    Median position = (9 + 1) / 2 = 5, so the median is the 5th observation, i.e. 38k

As you can see, the median is minimally affected by outliers: the observed median is still close to the "true mean" of the employee salaries from before the outlier was introduced.
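The position rules above are exactly what a library median function implements; here is a short Python check with the salary data.

```python
import statistics

# Odd count (9): the median is the 5th value after sorting
salaries = [20, 25, 50, 42, 38, 35, 45, 25, 140]
med_odd = statistics.median(salaries)
print(med_odd)   # 38

# Even count (8, director removed): average of the 4th and 5th sorted values
without_director = [20, 25, 50, 42, 38, 35, 45, 25]
med_even = statistics.median(without_director)
print(med_even)  # 36.5
```

Note that removing the 140k outlier barely moves the median (38 to 36.5), while it moved the mean from about 46.67 down to 35.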


Mode: The mode relates to the frequency with which data points occur in the data. It is the value observed the maximum number of times. A series of numbers can have more than one mode; such data is said to be multimodal.

The mode is useful for data with highly repetitive values, especially whole numbers. It is a frequency count of the data: the data point(s) with the highest frequency are the modes.

Example: Take the weights of the Class-V students, introduced in earlier posts.
33 31 35 37 28 34 30 28 38 40 29 28 40 36 30 45 36 41 42 32 33 30 43 42 42 42 37 34 42 35
38 38 37 40 43 28 29 40 40 33 44 35 30 37 42 40 28 33 31 44 42 35 30 37 30 36 44 39 43 28

Below is the frequency bar chart of various weights.



The modal weight for the students of Class-V is 42 kg; its frequency in the data is 7. This data is unimodal, as only one value (42) has the highest frequency. Outliers have minimal or no impact on the mode, since it depends on the counts of values in the data, and by definition outliers are spikes that are "rarely" observed, which in turn generates low counts for them.
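The frequency count behind the mode is easy to reproduce; the Python sketch below tallies the sixty weights listed above with collections.Counter.

```python
from collections import Counter

# The 60 Class-V student weights from the example above
weights = [33,31,35,37,28,34,30,28,38,40,29,28,40,36,30,45,36,41,42,32,
           33,30,43,42,42,42,37,34,42,35,38,38,37,40,43,28,29,40,40,33,
           44,35,30,37,42,40,28,33,31,44,42,35,30,37,30,36,44,39,43,28]

# most_common(1) returns the (value, count) pair with the highest frequency
mode_value, mode_count = Counter(weights).most_common(1)[0]
print(mode_value, mode_count)  # 42 7
```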





Measure of dispersion: This is a measure of the distribution and spread of data points. It gives an idea of the distance of the data points from the mean line, or central tendency. There are several such statistical measures:
1. Variance
2. Standard deviation
3. Range
4. Inter-quartile range
5. Mean absolute deviation

We will discuss all of them in detail one by one, with example & contrast them based on their applications.



Variance: The variance is the average of the squared differences of the data points from their mean. Now, why do we need to square the differences, and why take an average?

Consider a small sample of the heights of Class-V students comprising 10 observations. Let's tabulate the data, find the mean, and take the difference of each data point from the mean. The mean line is the one that passes as the best fit among all the data points; the distance between the mean line and a data point is referred to as an error, and the sum of all these errors is zero. This is why we square the differences (errors). Taking the average of these squared errors gives us the variance.

Height (cm)   Mean    (x − mean)   (x − mean)²
78            86.2      −8.2         67.24
78            86.2      −8.2         67.24
103           86.2      16.8        282.24
71            86.2     −15.2        231.04
84            86.2      −2.2          4.84
104           86.2      17.8        316.84
85            86.2      −1.2          1.44
103           86.2      16.8        282.24
66            86.2     −20.2        408.04
90            86.2       3.8         14.44
Sum: 862                ≈ 0        1675.6


    Variance = (sum of squared errors) / (no. of observations)

i.e.,

    Variance for the height of students = 1675.6 / 10 = 167.56
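The table's arithmetic can be checked in Python. This computes the population variance (dividing by N, as above) both by hand and via the standard library, and takes the square root to preview the standard deviation discussed next.

```python
import math
import statistics

heights = [78, 78, 103, 71, 84, 104, 85, 103, 66, 90]

mean = sum(heights) / len(heights)                 # 862 / 10 = 86.2
sq_errors = sum((x - mean) ** 2 for x in heights)  # ≈ 1675.6
variance = sq_errors / len(heights)                # ≈ 167.56

pvar = statistics.pvariance(heights)               # same, via the library
print(variance, pvar)
print(math.sqrt(pvar))                             # standard deviation, ≈ 12.94
```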

It is important to mention here that the distribution of all such samples' means will also be normal, and the population mean can be estimated as the mean of this sampling distribution, provided the population is normally distributed and very large and we draw a large number of samples. Since the variance is a function of the sampled data, for a normally distributed population it too has a sampling distribution.

A large variance suggests that the observed data points lie far from the mean line; a small variance suggests the opposite.



Standard deviation: The standard deviation is another measure of the deviation of data from its mean; it is calculated as the square root of the variance.

If the variance was sufficient, why do we need another measure? Take our previous example of height, measured in cm: the unit of the variance is cm². A data point that deviates from the mean by 2 cm contributes the square of 2, i.e. 4 cm², to the variance, which is hard to interpret on the original scale; hence the standard deviation is introduced.

The standard deviation can be calculated as

    SD = √Variance, so for the heights above, SD = √167.56 ≈ 12.94 cm

Range: The range of any series of data is defined as the difference between the largest and smallest values in the series. This is a simple but important statistic, as it gives an idea of the spread of the data on a linear scale. The range is highly susceptible to outliers and can misrepresent the spread of the data.

Let's again go back to the employee salary example used previously.

                            20, 25, 50, 42, 38, 35, 45, 25, 140  (All fig. in '000)

Arrange all the salaries in ascending order on a number line.



Since the director's salary behaves like an outlier in this series, we remove it, and the range of the data is

    R = (50 − 20) = 30
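For completeness, the same range computed in Python with the salaries above (director removed):

```python
salaries = [20, 25, 50, 42, 38, 35, 45, 25]  # in '000, director's 140 excluded

rng = max(salaries) - min(salaries)
print(rng)  # 30
```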


Interquartile range: The interquartile range is the difference between the 3rd quartile and the 1st quartile. This takes care of very low or high values, which may be outliers. The 1st and 3rd quartiles are the 25th and 75th percentile values of the data, so this range covers the middle 50% of the data.

The Inter-Quartile Range (IQR) can be calculated as
                                           
                                                                    IQR = (Q3 - Q1)

We generally plot this data in a boxplot in order to see outlier values. We will discuss Boxplots & outliers in our next post.
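A Python sketch of the IQR, together with the common 1.5 × IQR fence rule often used with boxplots. The fence rule is a standard convention rather than something defined above, and quartile values depend on the interpolation method your software uses.

```python
import statistics

# Employee salaries (in '000), including the director's 140
salaries = [20, 25, 25, 35, 38, 42, 45, 50, 140]

q1, _, q3 = statistics.quantiles(salaries, n=4)  # default 'exclusive' method
iqr = q3 - q1

# Common convention: points beyond 1.5 * IQR from the quartiles are flagged
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in salaries if x < low_fence or x > high_fence]

print(iqr, outliers)
```

With this convention the director's salary of 140 is the only value flagged, matching what the boxplot showed visually.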

Mean absolute deviation: The mean absolute deviation (sometimes abbreviated MAD) is defined as the average of the absolute distances of the data points from the mean. It has the same unit as the variable, and it is another measure of the variability, or spread, of the data values around the mean.
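And a final Python check, computing the MAD of the salary data (without the director) around its mean of 35:

```python
import statistics

salaries = [20, 25, 50, 42, 38, 35, 45, 25]  # in '000

mean = statistics.mean(salaries)  # 35
mad = sum(abs(x - mean) for x in salaries) / len(salaries)
print(mad)  # 8.75
```

Note that, unlike the variance (which here would be in squared units), the MAD of 8.75 is directly interpretable as "salaries deviate from the mean by 8.75k on average".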
