How to Reduce the Number of Variables and Detect Relationships: Principal Components and Factor Analysis

General Purpose
The main applications of factor analytic techniques are: (1) to reduce the number of variables, and (2) to detect structure in the relationships between variables, that is, to classify variables. Factor analysis is therefore applied as a data reduction or structure detection method (the term factor analysis was first introduced by Thurstone, 1931). The topics listed below describe the principles of factor analysis and how it can be applied towards these two purposes.
There are many excellent books on factor analysis. For example, a hands-on how-to approach can be found in Stevens (1986); more detailed technical descriptions are provided in Cooley and Lohnes (1971), Harman (1976), Kim and Mueller (1978a, 1978b), Lawley and Maxwell (1971), Lindeman, Merenda, and Gold (1980), Morrison (1967), and Mulaik (1972). The interpretation of secondary factors in hierarchical factor analysis, as an alternative to traditional oblique rotational strategies, is explained in detail by Wherry (1984).
Confirmatory factor
analysis. Structural Equation Modeling (SEPATH) allows
you to test specific hypotheses about the factor structure for a set of
variables, in one or several samples (e.g., you can compare factor structures
across samples).
Correspondence analysis. Correspondence analysis is a descriptive/exploratory technique designed to analyze two-way and multi-way tables containing some measure of correspondence between the rows and columns. The results provide information similar in nature to that produced by factor analysis techniques, and they allow you to explore the structure of the categorical variables included in the table.
Suppose we conducted a (rather "silly") study in which we measured 100 people's height in inches and centimeters. Thus, we would have two variables that measure height. If, in future studies, we want to research, for example, the effect of different nutritional food supplements on height, would we continue to use both measures? Probably not; height is one characteristic of a person, regardless of how it is measured.
Let's now extrapolate
from this "silly" study to something that you might actually do
as a researcher. Suppose we want to measure people's satisfaction with their
lives. We design a satisfaction questionnaire with various items; among other
things we ask our subjects how satisfied they are with their hobbies (item 1)
and how intensely they are pursuing a hobby (item 2). Most likely, the responses to the two items are highly correlated with each other. (If you are not familiar with the correlation coefficient, we recommend that you read the description in Basic Statistics - Correlations.) Given a high correlation between the two items, we can conclude that they are quite redundant.
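To make this concrete, here is a minimal sketch in Python (the response data are simulated and the variable names are our own invention) showing how strongly two items driven by the same underlying attribute correlate:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical responses from 100 subjects: both items largely
# reflect the same underlying satisfaction, plus some noise.
satisfaction = rng.normal(size=100)
item1 = satisfaction + 0.3 * rng.normal(size=100)
item2 = satisfaction + 0.3 * rng.normal(size=100)

# Pearson correlation between the two items.
r = np.corrcoef(item1, item2)[0, 1]
print(f"correlation between item 1 and item 2: {r:.2f}")  # ~0.9
```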
Combining Two Variables
into a Single Factor. You can summarize the
correlation between two variables in a scatterplot. A regression line can then be
fitted that represents the "best" summary of the linear relationship
between the variables. If we could define a variable that would approximate the
regression line in such a plot, then that variable would capture most of the
"essence" of the two items. Subjects' single scores on that new
factor, represented by the regression line, could then be used in future data
analyses to represent that essence of the two items. In a sense we have reduced
the two variables to one factor. Note that the new factor is actually a linear
combination of the two variables.
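As a sketch of this idea (simulated data again; the new factor is computed here as the first principal component via a singular value decomposition, one of several equivalent ways to obtain it):

```python
import numpy as np

rng = np.random.default_rng(1)
latent = rng.normal(size=100)
item1 = latent + 0.3 * rng.normal(size=100)
item2 = latent + 0.3 * rng.normal(size=100)

# Standardize the two items, then extract a single factor.
X = np.column_stack([item1, item2])
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# The first right-singular vector gives the weights of the linear
# combination of the two items that has maximal variance.
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
weights = Vt[0]
scores = Z @ weights  # one factor score per subject

print("weights:", np.round(weights, 2))  # roughly equal for both items
print("first 5 factor scores:", np.round(scores[:5], 2))
```

Each subject's single score on this new factor could then stand in for the two original items in subsequent analyses.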
Principal Components
Analysis. The example described above, combining two correlated variables
into one factor, illustrates the basic idea of factor analysis, or of principal
components analysis to be precise (we will return to this later). If we extend
the two-variable example to multiple variables, then the computations become
more involved, but the basic principle of expressing two or more variables by a
single factor remains the same.
Extracting Principal Components. We do not want to go into the details of the computational aspects of principal components analysis here; they can be found elsewhere (references were provided at the beginning of this section). Basically, however, the extraction of principal components amounts to a variance maximizing (varimax) rotation of the original variable space. For example, in a two-variable scatterplot we can think of the new factor as the original X axis rotated so that it approximates the regression line. This type of rotation is called variance maximizing because the criterion for (goal of) the rotation is to maximize the variance (variability) of the "new" variable (factor), while minimizing the variance around the new variable (see Rotational Strategies).
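This variance-maximizing idea can be demonstrated directly. The following sketch (simulated, standardized data) scans rotation angles of the X axis and reports the orientation along which the projected scores have the greatest variance:

```python
import numpy as np

rng = np.random.default_rng(2)
latent = rng.normal(size=100)
Z = np.column_stack([latent + 0.3 * rng.normal(size=100),
                     latent + 0.3 * rng.normal(size=100)])
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)

def axis_variance(theta):
    """Variance of the data projected onto an axis rotated by theta."""
    direction = np.array([np.cos(theta), np.sin(theta)])
    return (Z @ direction).var()

# Scan rotations: for two equally weighted standardized variables the
# maximum sits near 45 degrees, approximating the regression line.
thetas = np.linspace(0, np.pi, 181)
best = thetas[np.argmax([axis_variance(t) for t in thetas])]
print(f"variance-maximizing rotation: {np.degrees(best):.0f} degrees")
print(f"variance along that axis:     {axis_variance(best):.2f}")
print(f"variance along original X:    {axis_variance(0):.2f}")
```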
Generalizing to the Case of Multiple Variables. When there are more than two variables, we can think of them as defining a "space," just as two variables defined a plane. Thus, when we have three variables, we could plot a three-dimensional scatterplot, and, again, we could fit a plane through the data.
With more than three variables it becomes impossible to illustrate the points in a scatterplot; however, the logic of rotating the axes so as to maximize the variance of the new factor remains the same.
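A short sketch of the three-variable case (simulated data): the first two principal axes span the plane that best fits the point cloud, in the least-squares sense.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical data: 100 cases, 3 variables driven by 2 latent dimensions.
latent = rng.normal(size=(100, 2))
Z = latent @ rng.normal(size=(2, 3)) + 0.2 * rng.normal(size=(100, 3))
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)

# The first two right-singular vectors span the best-fitting plane
# through the three-dimensional point cloud.
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
plane_basis = Vt[:2]
explained = (s[:2] ** 2).sum() / (s ** 2).sum()
print(f"variance captured by the fitted plane: {explained:.0%}")
```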
How Many Factors to Extract? Remember that, so far, we are considering principal components analysis as a data reduction method, that is, as a method for reducing the number of variables. The question then is, how many factors do we want to extract? Note that as we extract consecutive factors, they account for less and less variability. The decision of when to stop extracting factors basically depends on when there is only very little "random" variability left. The nature of this decision is arbitrary; however, various guidelines have been developed, and they are reviewed in Reviewing the Results of a Principal Components Analysis under Eigenvalues and the Number-of-Factors Problem.
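One widely used guideline, the Kaiser criterion, retains only factors with eigenvalues greater than 1. A sketch (with simulated data built from two latent factors, so the right answer is known in advance):

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical data: 100 subjects, 6 items driven by 2 latent factors.
F = rng.normal(size=(100, 2))
loadings = np.array([[.8, .1], [.7, .2], [.8, .0],
                     [.1, .8], [.0, .7], [.2, .8]])
X = F @ loadings.T + 0.5 * rng.normal(size=(100, 6))

# Each eigenvalue of the correlation matrix is the variance accounted
# for by one principal component (they sum to the number of variables).
R = np.corrcoef(X, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]
print("eigenvalues:", np.round(eigenvalues, 2))

# Kaiser criterion: keep components explaining more variance than a
# single original variable, i.e. eigenvalue > 1; here that yields 2.
print("factors to extract:", int((eigenvalues > 1).sum()))
```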
Factor
Analysis as a Classification Method
Let us now return to the
interpretation of the standard results from a factor analysis. We will
henceforth use the term factor analysis generically to encompass both
principal components and principal factors analysis. Let us assume that we are
at the point in our analysis where we basically know how many factors to
extract. We may now want to know the meaning of the factors, that is, whether
and how we can interpret them in a meaningful manner. To illustrate how this
can be accomplished, let us work "backwards," that is, begin with a
meaningful structure and then see how it is reflected in the results of a
factor analysis. Let us return to our satisfaction example; shown below is the
correlation matrix for items pertaining to satisfaction at work and items pertaining
to satisfaction at home.
STATISTICA FACTOR ANALYSIS
Correlations (factor.sta)
Casewise deletion of MD, n=100

Variable   WORK_1   WORK_2   WORK_3   HOME_1   HOME_2   HOME_3
WORK_1       1.00      .65      .65      .14      .15      .14
WORK_2        .65     1.00      .73      .14      .18      .24
WORK_3        .65      .73     1.00      .16      .24      .25
HOME_1        .14      .14      .16     1.00      .66      .59
HOME_2        .15      .18      .24      .66     1.00      .73
HOME_3        .14      .24      .25      .59      .73     1.00
The work satisfaction items are highly correlated amongst themselves, and the home satisfaction items are highly correlated amongst themselves. The correlations across the two types of items (work satisfaction items with home satisfaction items) are comparatively small. It thus seems that there are two relatively independent factors reflected in the correlation matrix: one related to satisfaction at work, the other related to satisfaction at home.
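This block structure shows up directly in the eigenvalues of the matrix. A quick check in Python, using the correlations from the table above:

```python
import numpy as np

# Correlation matrix from the table above (WORK_1..WORK_3, HOME_1..HOME_3).
R = np.array([
    [1.00, 0.65, 0.65, 0.14, 0.15, 0.14],
    [0.65, 1.00, 0.73, 0.14, 0.18, 0.24],
    [0.65, 0.73, 1.00, 0.16, 0.24, 0.25],
    [0.14, 0.14, 0.16, 1.00, 0.66, 0.59],
    [0.15, 0.18, 0.24, 0.66, 1.00, 0.73],
    [0.14, 0.24, 0.25, 0.59, 0.73, 1.00],
])

# Two eigenvalues clearly dominate, one per block of correlated items:
# the numerical signature of the two satisfaction factors.
eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]
print(np.round(eigenvalues, 2))
```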
When two factors are extracted from this correlation matrix, the first factor is generally more highly correlated with the variables than the second factor. This is to be expected because, as previously described, these factors are extracted successively and will account for less and less variance overall.
Rotating the Factor Structure. We could plot the unrotated factor loadings in a scatterplot, with each variable represented as a point. In this plot we could rotate the axes in any direction without changing the relative locations of the points to each other; however, the actual coordinates of the points, that is, the factor loadings, would of course change. In this example, if you produce the plot, it will be evident that rotating the axes by about 45 degrees yields a clear pattern of loadings identifying the work satisfaction items and the home satisfaction items.
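The effect of such a rotation is easy to verify numerically. In this sketch the unrotated loadings are hypothetical stand-ins (the actual unrotated loadings are not reproduced here), but the mechanics of an orthogonal 45-degree rotation are general:

```python
import numpy as np

# Hypothetical unrotated loadings: every item loads on factor 1,
# while factor 2 contrasts the two kinds of items.
loadings = np.array([
    [0.70,  0.50],   # work-type items
    [0.72,  0.48],
    [0.68, -0.52],   # home-type items
    [0.66, -0.50],
])

# Orthogonal rotation by 45 degrees: the coordinates (loadings) change,
# but the configuration of points, and each row's communality, do not.
theta = np.radians(45)
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
print(np.round(loadings @ rotation, 2))
# After rotation each item loads strongly on just one factor (up to
# sign): the "simple structure" discussed next.
```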
Rotational strategies. Various rotational strategies have been proposed. The goal of all of these strategies is to obtain a clear pattern of loadings, that is, factors that are somehow clearly marked by high loadings for some variables and low loadings for others. This general pattern is also sometimes referred to as simple structure (a more formalized definition can be found in most standard textbooks). Typical rotational strategies are varimax, quartimax, and equamax.
We have described the
idea of the varimax rotation before (see Extracting Principal Components), and it
can be applied to this problem as well. As before, we want to find a rotation
that maximizes the variance on the new axes; put another way, we want to obtain
a pattern of loadings on each factor that is as diverse as possible, lending
itself to easier interpretation. Below is the table of rotated factor loadings.
STATISTICA FACTOR ANALYSIS
Factor Loadings (Varimax normalized)
Extraction: Principal components

Variable   Factor 1   Factor 2
WORK_1      .862443    .051643
WORK_2      .890267    .110351
WORK_3      .886055    .152603
HOME_1      .062145    .845786
HOME_2      .107230    .902913
HOME_3      .140876    .869995
Expl.Var   2.356684   2.325629
Prp.Totl    .392781    .387605
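For the curious, this rotated solution can be reproduced approximately from the correlation matrix alone. The sketch below extracts two principal component loadings and applies a standard SVD-based varimax algorithm; note that the "Varimax normalized" option above also applies Kaiser row normalization, which this sketch omits, so the numbers will match the table only approximately (and only up to sign and column order):

```python
import numpy as np

# Correlation matrix from the earlier table (WORK_1..HOME_3).
R = np.array([
    [1.00, 0.65, 0.65, 0.14, 0.15, 0.14],
    [0.65, 1.00, 0.73, 0.14, 0.18, 0.24],
    [0.65, 0.73, 1.00, 0.16, 0.24, 0.25],
    [0.14, 0.14, 0.16, 1.00, 0.66, 0.59],
    [0.15, 0.18, 0.24, 0.66, 1.00, 0.73],
    [0.14, 0.24, 0.25, 0.59, 0.73, 1.00],
])

# Unrotated principal component loadings: eigenvectors scaled by the
# square roots of their eigenvalues; keep the two largest components.
vals, vecs = np.linalg.eigh(R)
top = np.argsort(vals)[::-1][:2]
L = vecs[:, top] * np.sqrt(vals[top])

def varimax(L, max_iter=100, tol=1e-6):
    """Orthogonal varimax rotation via iterative SVD updates."""
    p, k = L.shape
    rot = np.eye(k)
    obj = 0.0
    for _ in range(max_iter):
        Lr = L @ rot
        u, s, vt = np.linalg.svd(
            L.T @ (Lr ** 3 - Lr @ np.diag((Lr ** 2).sum(axis=0)) / p))
        rot = u @ vt
        if s.sum() < obj * (1 + tol):
            break
        obj = s.sum()
    return L @ rot

print(np.round(varimax(L), 3))  # compare with the table above
```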
Interpreting the Factor Structure. Now the pattern is much clearer. As expected, the first factor is marked by high loadings on the work satisfaction items, and the second factor is marked by high loadings on the home satisfaction items. We would thus conclude that satisfaction, as measured by our questionnaire, is composed of those two aspects; hence we have arrived at a classification of the variables.
Consider another example, this time with four additional Hobby/Misc variables added to the satisfaction items analyzed above.
In a plot of the resulting factor loadings, the 10 variables would be reduced to three specific factors: a work factor, a home factor, and a hobby/misc factor. Note that the loadings of the variables defining each factor are spread out over the values of the other two factors but are high on their own factor. For example, the factor loadings for the four hobby/misc variables have both high and low "work" and "home" values, but all four of these variables have high loadings on the "hobby/misc" factor.