Types of Data
Nominal Data: These are data which classify or categorise some attribute. They may be coded as numbers, but the numbers have no real meaning; they are just labels and have no default or natural order.
Examples: town of residence, colour of car, male or female (this last one is an example of a dichotomous variable: it can take only two mutually exclusive values).
Ordinal Data: These are data that can be put in an order but don't have a numerical meaning beyond the order. So, for instance, the difference between 2 and 4 in the example of a Likert scale below might not be the same as the difference between 2 and 5.
Examples:
Questionnaire responses coded: 1 = strongly disagree, 2 = disagree, 3
= indifferent, 4 = agree, 5 = strongly agree. Level of pain felt in joint
rated on a scale from 0 (comfortable) to 10 (extremely painful).
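As an aside (my own illustration, not part of the original notes), here is a minimal Python/pandas sketch of how nominal and ordinal codings might be represented; the column names and values are invented for the example.

import pandas as pd

# Hypothetical questionnaire data: 'town' is nominal, 'agreement' is ordinal (Likert).
df = pd.DataFrame({
    "town": ["Leeds", "York", "Leeds"],   # labels only, no natural order
    "agreement": [2, 5, 4],               # 1 = strongly disagree ... 5 = strongly agree
})

# Tell pandas that 'agreement' is ordered but 'town' is not.
df["town"] = pd.Categorical(df["town"])
df["agreement"] = pd.Categorical(df["agreement"], categories=[1, 2, 3, 4, 5], ordered=True)

print(df["agreement"].min())   # ordering is meaningful, so a minimum makes sense
# df["town"].min() would raise an error: nominal categories have no order.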
Interval Data: These are numerical data where the distances between numbers have meaning, but the zero has no real meaning. With interval data it is not meaningful to say that one measurement is twice another, and such a statement would not stay true if the units were changed.
Example: Temperature measured in Centigrade; a cup of coffee at 80°C isn't twice as hot as one at 40°C.
Ratio Data: These are numerical data where the distances between
data and the zero point have real meaning. With such data it is
meaningful to say that one value is twice as much as another, and this
would still be true if the units were changed.
Examples: Heights, weights, salaries, ages. If someone is twice as heavy as someone else in pounds, this will still be true in kilograms.
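A quick worked sketch (my own illustration, with made-up numbers) makes the interval/ratio distinction concrete: changing units preserves ratios for ratio data such as weight, but not for interval data such as Celsius temperature.

# Ratio data: weight. A 2:1 ratio in kilograms stays 2:1 in pounds.
kg = [40, 80]
lb = [w * 2.20462 for w in kg]
print(lb[1] / lb[0])        # 2.0, still twice as heavy

# Interval data: temperature. 80 degrees C is "twice" 40 degrees C only on paper.
c = [40, 80]
f = [t * 9 / 5 + 32 for t in c]
print(f[1] / f[0])          # about 1.69 in Fahrenheit, so the "twice" claim disappears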
(The four types run from nominal, the most restricted in how they can be analysed, to ratio, the least restricted.)
Typically only data of the last two types might be suitable for parametric methods, although, as we'll see later, it isn't always a completely straightforward decision. When documenting research it is reasonable to justify the choice of analysis, to prevent the reader believing that the analysis that best supported the hypothesis was chosen rather than the one most appropriate to the data.
The important thing in this decision, as I hope we'll see, is not to make unsupported assumptions about the data and apply methods assuming "better" data than you have.
Are your data paired?
Paired data are often the result of before and after situations, e.g. before and after treatment. In such a scenario each research subject would have a pair of measurements and it might be that you look for a difference in these measurements to show an improvement due to the treatment.
In SPSS the data would be coded into two columns; each row would hold the before and after measurements for the same individual.
We might, for example, measure the balance performance of 10 subjects with a Balance Performance Monitor (BPM) before and after a month-long course of exercise designed to improve balance.
Each subject would have a pair of balance readings; this would be paired data. In this simple form we could do several things with the data: we could find the average balance reading (mean or median), and we could graph the data on a boxplot, which would be useful to show both level and spread, let us get a feel for the data, and reveal any outliers.
In the example as stated above the data are paired, each subject has a pair of numbers.
What if you made your subjects do another month of exercise and measured their balance again? Each subject would then have three numbers. The data would still be paired, but rather than stretch the English language by talking about a pair of three we call this repeated measures.
This would be stored in three columns in SPSS.
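To make the layout concrete, here is a small sketch (my illustration, using Python/pandas rather than SPSS; the column names and readings are invented) of how paired and repeated-measures data sit in rows and columns:

import pandas as pd

# One row per subject; each column is one measurement occasion.
balance = pd.DataFrame({
    "subject":       range(1, 6),
    "before":        [4.1, 5.2, 3.8, 4.7, 5.0],   # BPM reading before exercise
    "after_month1":  [4.6, 5.5, 4.0, 5.1, 5.3],   # paired with 'before'
    "after_month2":  [4.9, 5.6, 4.3, 5.4, 5.5],   # a third column makes it repeated measures
})

print(balance.describe())                                       # level and spread of each occasion
print((balance["after_month1"] - balance["before"]).median())   # typical change for an individual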
A word of warning: sometimes you might gather paired data (as above, before we pretended there was a third column of data) but end up with independent groups. Say, for example, you decided that the design above was flawed (which it is) because it doesn't take into account the fact that people might simply get better at balancing on the Balance Performance Monitor due to having had their first go a month before, i.e. we might see an increase in balance just from using the balance monitor. To counter this possible effect we could recruit another group of similar subjects who would be assessed on the BPM but not undertake the exercise sessions; we could then assess the effect of measurement without exercise on this control group.
We then have a dilemma about how to treat the two sets of data. We could analyse them separately and hope to find a significant increase in balance in our treatment group but not in the non-exercise group. A better method would be to calculate the change in balance for each individual and see if there is a significant difference in that change between the groups.
This latter method ends with the analysis actually being carried out on non-paired data. (An alternative analysis would be to use a two factor mixed factorial ANOVA - but that sounds a bit too hard just now! - maybe later.)
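As a sketch of that "better method" (my own illustration with invented numbers, using scipy rather than SPSS): compute each individual's change, then compare the two sets of changes with an unpaired test, since the exercise and control groups are independent.

from scipy import stats

# Invented before/after BPM readings for an exercise group and a control group.
exercise_before = [4.1, 5.2, 3.8, 4.7, 5.0]
exercise_after  = [4.8, 5.9, 4.4, 5.5, 5.6]
control_before  = [4.3, 5.0, 4.0, 4.9, 5.1]
control_after   = [4.4, 5.1, 4.1, 4.9, 5.3]

# Change score for each individual (after minus before).
exercise_change = [a - b for a, b in zip(exercise_after, exercise_before)]
control_change  = [a - b for a, b in zip(control_after, control_before)]

# The two lists of changes are independent samples, so use an unpaired test,
# e.g. Mann-Whitney if we don't want to assume normality.
u, p = stats.mannwhitneyu(exercise_change, control_change, alternative="two-sided")
print(u, p)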
If you are not sure whether two columns of data are paired or not, consider whether rearranging the order of one of the columns would affect your data. If it would, they are paired. Paired data often occur in ‘before and after’ situations. They are also known as ‘related samples’. Non-paired data can also be referred to as ‘independent samples’.
Scatterplots (also called scattergrams) are only meaningful for paired data.
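For example (a rough sketch with invented data, using matplotlib), a scatterplot of the before/after readings puts each subject at one point, which only makes sense because the two values belong together:

import matplotlib.pyplot as plt

before = [4.1, 5.2, 3.8, 4.7, 5.0]
after  = [4.6, 5.5, 4.0, 5.1, 5.3]

# Each point is one subject: x = before, y = after. Unpaired columns could not
# be plotted this way, because the pairing is what defines each point.
plt.scatter(before, after)
plt.xlabel("Balance before")
plt.ylabel("Balance after")
plt.show()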
Parametric or Nonparametric data
Before choosing a statistical test to apply to your data you should address the issue of whether your data are parametric or not.
This is quite a subtle and convoluted decision, but the guidelines here should help start you thinking. Remember the important rule: don't make unsupported assumptions about the data, and don't just assume the data are parametric. You can use academic precedent to share the blame ("Bloggs et al. 2001 used a t-test so I will"), or you might test the data for normality (we'll try this later), or you might decide that, given a small sample, it is sensible to opt for nonparametric methods to avoid making assumptions.
• Ranks, scores, or categories are generally non-parametric data.
• Measurements that come from a population that is normally distributed can usually be treated as parametric.
If in doubt, treat your data as non-parametric, especially if you have a relatively small sample.
Generally speaking, parametric data are assumed to be normally distributed. The normal distribution (described mathematically by the Gaussian function) is a data distribution with more data values near the mean and gradually fewer further away, falling off symmetrically on either side. A lot of biological data fit this pattern closely. To sensibly justify applying parametric tests the data should be normally distributed.
If you are unsure about the distribution of the data in your target population then it is safest to assume the data are non-parametric. The cost of this is that non-parametric tests are generally less sensitive, so you would stand a greater chance of missing a small effect that does exist.
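As a sketch of what "test the data for normality" might look like in practice (my own illustration with invented data, using scipy; the author works in SPSS, which offers equivalent checks):

from scipy import stats

# Invented sample of measurements.
sample = [4.1, 5.2, 3.8, 4.7, 5.0, 4.6, 5.5, 4.0, 5.1, 5.3]

# Shapiro-Wilk test: the null hypothesis is that the sample comes from
# a normally distributed population.
w, p = stats.shapiro(sample)
print(w, p)   # a small p (e.g. < 0.05) is evidence against normality

# With a sample this small, a non-significant result is only weak evidence
# for normality, which is one reason to prefer nonparametric methods anyway.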
Tests that depend on an assumption about the distribution of the underlying population data (e.g. t-tests) are parametric because they assume that the data being tested come from a normally distributed population, i.e. a population whose distribution can be described by parameters such as its mean and standard deviation. Tests for the significance of correlation involving Pearson's product moment correlation coefficient involve similar assumptions.
Tests that do not depend on many assumptions about the underlying distribution of the data are called non-parametric tests. These include the Wilcoxon signed rank test, the Mann-Whitney test and Spearman's rank correlation coefficient. They are widely used to test small samples of ordinal data. There is more on this later.
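To illustrate the parallel (a hedged sketch with invented paired data, using scipy): for the same before/after question, the paired t-test is the parametric option and the Wilcoxon signed rank test the nonparametric one.

from scipy import stats

before = [4.1, 5.2, 3.8, 4.7, 5.0, 4.6, 5.5, 4.0, 5.1, 5.3]
after  = [4.6, 5.5, 4.0, 5.1, 5.3, 5.0, 5.9, 4.4, 5.5, 5.6]

# Parametric: paired t-test, assumes the differences are normally distributed.
t, p_t = stats.ttest_rel(before, after)

# Nonparametric: Wilcoxon signed rank test, works on the ranks of the differences.
w, p_w = stats.wilcoxon(before, after)

print(p_t, p_w)   # same question, different assumptions

# For two independent groups the analogous pair would be
# stats.ttest_ind (parametric) and stats.mannwhitneyu (nonparametric).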
Are you looking for differences or correlation?
• You can look for differences whenever you have two sets of data. (It might not always be a sensible thing to do, but you can do it!)
• You can only look for correlation when you have a set of paired data, i.e. two sets of data where each data point in the first set has a partner in the second. If you aren't sure whether your data are paired, review the section on paired data.
• You might therefore look for the difference in some attribute before and after some intervention.
Ponder these two questions...
1. Does paracetamol lower temperature?
2. Does the number of exercises performed affect the amount of increase in muscle strength?
Which of these is about a difference and which is addressing correlation? Well, they aren't all that well described, but I reckon the first one is about seeing a difference and the second is about correlation, i.e. does the amount of exercise correlate with the increase in muscle strength, whereas the first is about "does this drug make a difference?".
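A rough sketch of the two analyses side by side (my own illustration with invented data, using scipy): question 1 calls for a test of difference between paired temperatures, question 2 for a correlation between exercise count and strength gain.

from scipy import stats

# Question 1 (difference): temperature before and after paracetamol, paired by patient.
temp_before = [38.9, 39.2, 38.5, 39.0, 38.7]
temp_after  = [37.8, 38.4, 37.9, 38.1, 38.0]
w, p_diff = stats.wilcoxon(temp_before, temp_after)        # nonparametric paired test

# Question 2 (correlation): number of exercises vs increase in muscle strength.
n_exercises   = [5, 10, 15, 20, 25]
strength_gain = [1.0, 2.2, 2.9, 4.1, 4.8]
rho, p_corr = stats.spearmanr(n_exercises, strength_gain)  # rank correlation

print(p_diff, rho, p_corr)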
A variant on this arises when conducting a reliability study: in many respects the data structure is similar to a correlational experiment, but the technique used to analyse the data is different.