Statistics is a science that helps in interpreting, understanding, and presenting large volumes of numerical and categorical data. It provides ways to calculate parameters and statistics that represent the data as a whole. But why is it needed? Why not use the complete series of numbers, whether collected by researchers through a survey or gathered by machines in terabytes?
The answer lies within the question. Such data is huge, and it is difficult to analyze all of it at once. A researcher or data scientist therefore needs a compact representation of this very large collection of numbers. This is where statistics comes into the picture: it gives an overall idea of the data through its central tendency (mean, median, mode) and its spread or distribution (variance, standard deviation).
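To make these measures concrete, here is a minimal Python sketch using the standard library's `statistics` module. The sample values and the `summarize` helper are made up for illustration; they are not part of any dataset in this post. The sketch also assumes the population versions of variance and standard deviation (`pvariance`, `pstdev`); if the numbers were a sample from a larger population, `variance` and `stdev` would be the usual choice.

```python
import statistics


def summarize(values):
    """Return common measures of central tendency and spread for a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        # Population versions: treat `values` as the complete data, not a sample.
        "variance": statistics.pvariance(values),
        "std_dev": statistics.pstdev(values),
    }


# Hypothetical sample, chosen only to keep the arithmetic easy to follow.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

for name, value in summarize(sample).items():
    print(f"{name}: {value}")
```

Five numbers now stand in for the whole list, which is the kind of compact representation described above.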
These calculated values are information generated from the data. How will this information benefit a data scientist, and does it carry any "significance" with respect to the statistics of the complete data? Enter the analysis part of the process. Analysis means taking the information drawn from data and turning it into useful insights that are not visible otherwise. It is generally supplemented by the data scientist's own knowledge of the subject and by hypothesis tests.
Let's take an example!
Number of students: 60
Female students: 22
Student ID | Gender | Height (in cm) | Weight (in kg) | Grade |
1 | Male | 78 | 33 | C |
2 | Female | 78 | 31 | A+ |
3 | Female | 103 | 35 | C |
4 | Female | 71 | 37 | C |
5 | Female | 84 | 28 | A |
6 | Male | 104 | 34 | C |
7 | Male | 85 | 30 | B+ |
8 | Male | 103 | 28 | A+ |
9 | Male | 66 | 38 | C |
10 | Female | 90 | 40 | F |
11 | Male | 108 | 29 | B |
12 | Female | 110 | 28 | A+ |
13 | Female | 108 | 40 | A |
14 | Male | 78 | 36 | B+ |
15 | Male | 106 | 30 | A |
16 | Male | 91 | 45 | F |
17 | Male | 85 | 36 | A+ |
18 | Male | 104 | 41 | F |
19 | Male | 73 | 42 | A+ |
20 | Male | 89 | 32 | B |
21 | Female | 74 | 33 | F |
22 | Male | 91 | 30 | A |
23 | Male | 66 | 43 | F |
24 | Male | 104 | 42 | A |
25 | Male | 113 | 42 | B |
26 | Male | 89 | 42 | B+ |
27 | Male | 98 | 37 | B+ |
28 | Female | 99 | 34 | B+ |
29 | Male | 76 | 42 | B |
30 | Female | 71 | 35 | A |
31 | Male | 82 | 38 | B |
32 | Female | 69 | 38 | B |
33 | Female | 85 | 37 | A+ |
34 | Female | 69 | 40 | A |
35 | Female | 75 | 43 | C |
36 | Male | 109 | 28 | B |
37 | Female | 73 | 29 | B |
38 | Male | 72 | 40 | B |
39 | Male | 111 | 40 | A+ |
40 | Male | 106 | 33 | F |
41 | Female | 115 | 44 | A+ |
42 | Female | 68 | 35 | F |
43 | Female | 73 | 30 | A |
44 | Male | 90 | 37 | B |
45 | Female | 66 | 42 | A |
46 | Male | 114 | 40 | B |
47 | Male | 108 | 28 | A |
48 | Female | 107 | 33 | A |
49 | Male | 69 | 31 | A+ |
50 | Male | 117 | 44 | B+ |
51 | Male | 94 | 42 | F |
52 | Male | 109 | 35 | A |
53 | Male | 92 | 30 | A |
54 | Male | 82 | 37 | A+ |
55 | Female | 82 | 30 | B |
56 | Male | 66 | 36 | B |
57 | Female | 105 | 44 | A+ |
58 | Male | 88 | 39 | B+ |
59 | Male | 83 | 43 | A |
60 | Male | 77 | 28 | C |
This data covers each student's height in cm, weight in kg, and last year's grade.
Now, if we say that the mean height of female students is 85.2 cm and that of male students is about 91.5 cm, it means that each of these mean values is a single-number representation of height for the female and male populations respectively.
Comparing the two means, we find that female students are, on average, shorter than male students in class V. This is an insight we could draw just by looking at the mean values of the data. Let's leave it at that until we reach the forthcoming topic of hypothesis testing.
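The group means quoted above can be recomputed from the table with a few lines of Python. This is only a sketch: the `GENDERS` string and `HEIGHTS` list are a hand transcription of the Gender and Height columns (student IDs 1 to 60, in order), and `mean_height_by_gender` is an illustrative helper name, not something from the original analysis.

```python
from collections import defaultdict
from statistics import mean

# Gender (M = Male, F = Female) for student IDs 1-60, ten students per chunk,
# transcribed by hand from the table above.
GENDERS = (
    "MFFFFMMMMF" "MFFMMMMMMM" "FMMMMMMFMF"
    "MFFFFMFMMM" "FFFMFMMFMM" "MMMMFMFMMM"
)

# Height in cm for student IDs 1-60, in the same order.
HEIGHTS = [
    78, 78, 103, 71, 84, 104, 85, 103, 66, 90,
    108, 110, 108, 78, 106, 91, 85, 104, 73, 89,
    74, 91, 66, 104, 113, 89, 98, 99, 76, 71,
    82, 69, 85, 69, 75, 109, 73, 72, 111, 106,
    115, 68, 73, 90, 66, 114, 108, 107, 69, 117,
    94, 109, 92, 82, 82, 66, 105, 88, 83, 77,
]


def mean_height_by_gender(genders, heights):
    """Group heights by gender code and return the mean of each group."""
    groups = defaultdict(list)
    for gender, height in zip(genders, heights):
        groups[gender].append(height)
    return {gender: mean(values) for gender, values in groups.items()}


means = mean_height_by_gender(GENDERS, HEIGHTS)
for gender, value in sorted(means.items()):
    print(f"{gender}: {value:.1f} cm")
```

Each gender is reduced to one number, so the comparison becomes a comparison of two values instead of sixty rows.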
A question may arise: why can't we go for a one-to-one comparison of the data? The answer is simple and straightforward: it can be done by creating a cross table for male and female students and then analyzing the outcomes further. But this is easier said than done for large data with millions of observations over numerous variables.
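As a rough idea of what such a cross table looks like, the sketch below counts students by gender and grade using Python's `collections.Counter`. It uses only the first ten rows of the table to stay short, and choosing grade as the second variable is an assumption made for illustration; `cross_table` is a hypothetical helper name.

```python
from collections import Counter

# (gender, grade) for student IDs 1-10, copied from the table above.
rows = [
    ("Male", "C"), ("Female", "A+"), ("Female", "C"), ("Female", "C"),
    ("Female", "A"), ("Male", "C"), ("Male", "B+"), ("Male", "A+"),
    ("Male", "C"), ("Female", "F"),
]


def cross_table(pairs):
    """Count how many times each (gender, grade) combination occurs."""
    return Counter(pairs)


table = cross_table(rows)
genders = sorted({gender for gender, _ in rows})
grades = sorted({grade for _, grade in rows})

# Print one row per gender and one column per grade.
print("Gender".ljust(8) + "".join(grade.rjust(4) for grade in grades))
for gender in genders:
    counts = "".join(str(table[(gender, grade)]).rjust(4) for grade in grades)
    print(gender.ljust(8) + counts)
```

With two genders and a handful of grades the table stays readable. With millions of rows and many variables the number of cells grows quickly, which is why summary statistics are the more practical starting point.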
This sets the stage for a complete series on data analysis and the statistical approach. Next, we will discuss the various components of data.
>>This series of blog posts is an attempt to demystify the various concepts of statistical analysis in a simpler way. We hope it helps you get things right.