Saturday, 9 May 2015

Discriminant Analysis

Discriminant analysis is a statistical method that is used by researchers to help them understand the relationship between a "dependent variable" and one or more "independent variables." A dependent variable is the variable that a researcher is trying to explain or predict from the values of the independent variables. Discriminant analysis is similar to regression analysis and analysis of variance (ANOVA); the principal difference between discriminant analysis and the other two methods concerns the nature of the dependent variable.

Discriminant analysis requires the researcher to have measures of the dependent variable and all of the independent variables for a large number of cases. In regression analysis and ANOVA, the dependent variable must be a "continuous variable," that is, a numeric variable that indicates the degree to which a subject possesses some characteristic, so that the higher the value of the variable, the greater the level of the characteristic. A good example of a continuous variable is a person's income.

In discriminant analysis, the dependent variable must be a "categorical variable." The values of a categorical variable serve only to name groups and do not necessarily indicate the degree to which some characteristic is present. An example of a categorical variable is a measure indicating to which one of several different market segments a customer belongs; another example is a measure indicating whether or not a particular employee is a "high potential" worker. The categories must be mutually exclusive; that is, a subject can belong to one and only one of the groups indicated by the categorical variable. While a categorical variable must have at least two values (as in the "high potential" case), it may have numerous values (as in the case of the market segmentation measure).

There are two basic steps in discriminant analysis. The first involves estimating coefficients, or weighting factors, that can be applied to the known characteristics of job candidates (i.e., the independent variables) to calculate some measure of their tendency or propensity to become high performers. This measure is called a "discriminant function." Second, this information can then be used to develop a decision rule that specifies some cut-off value for predicting which job candidates are likely to become high performers.

The tendency of an individual to become a high performer can be written as a linear equation. The values of the various predictors of high-performer status (i.e., the independent variables) are multiplied by "discriminant function coefficients" and these products are added together to obtain a predicted discriminant function score. This score is used in the second step to predict a job candidate's likelihood of becoming a high performer. Suppose that you were to use three different independent variables in the discriminant analysis. Then the discriminant function has the following form:

D = B₁X₁ + B₂X₂ + B₃X₃

where D = discriminant function score,

Bᵢ = discriminant function coefficient relating independent variable i to the discriminant function score,

Xᵢ = value of independent variable i.
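To make the arithmetic concrete, here is a minimal sketch in Python of how such a score could be computed for one candidate. The coefficient and predictor values are purely hypothetical and are not taken from the analysis described here.

```python
import numpy as np

# Hypothetical (standardized) discriminant function coefficients B1..B3
# for three predictors, e.g. years of education, motivation, stress tolerance.
coefficients = np.array([0.72, 0.55, 0.10])

# Hypothetical standardized predictor values X1..X3 for one job candidate.
candidate = np.array([1.2, 0.8, -0.3])

# D = B1*X1 + B2*X2 + B3*X3
discriminant_score = coefficients @ candidate
print(f"Discriminant score D = {discriminant_score:.2f}")
```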

The equation is quite similar to a regression equation. However, conventional regression analysis should not be used in place of discriminant analysis: the dependent variable here would have only two values (high performer and low performer) and would thus violate important assumptions of the regression model. Discriminant analysis does not have these limitations with respect to the dependent variable.

There are various tests of significance that can be used in discriminant analysis. One widely used test statistic is based on Wilks' lambda, which provides an assessment of the discriminating power of the function derived from the analysis. If this value is found to be statistically significant, then the set of independent variables can be assumed to differentiate among the groups of the categorical variable. This test, which is analogous to the F-ratio test in ANOVA and regression, is useful in evaluating the overall adequacy of the analysis.
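As a rough illustration of how this statistic could be obtained (this is not code from the original article), the sketch below computes Wilks' lambda from the within-group and total sums-of-squares-and-cross-products matrices and applies Bartlett's chi-square approximation for the significance test. The data are simulated.

```python
import numpy as np
from scipy import stats

def wilks_lambda(X, y):
    """Wilks' lambda with Bartlett's chi-square approximation.

    X: (n, p) array of independent variables; y: (n,) group labels.
    Returns (lambda, chi-square, degrees of freedom, p-value).
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    groups = np.unique(y)
    n, p = X.shape
    g = len(groups)

    centered = X - X.mean(axis=0)
    T = centered.T @ centered                                  # total SSCP matrix
    W = sum((X[y == grp] - X[y == grp].mean(axis=0)).T
            @ (X[y == grp] - X[y == grp].mean(axis=0)) for grp in groups)

    lam = np.linalg.det(W) / np.linalg.det(T)                  # Wilks' lambda
    chi2 = -(n - 1 - (p + g) / 2) * np.log(lam)                # Bartlett's approximation
    df = p * (g - 1)
    return lam, chi2, df, stats.chi2.sf(chi2, df)

# Toy example: 60 cases, 3 predictors, 2 groups (simulated data).
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=60)
X = rng.normal(size=(60, 3)) + y[:, None]
print(wilks_lambda(X, y))
```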

Once the analysis is completed, the discriminant function coefficients can be used to assess the contributions of the various independent variables to the tendency of an employee to be a high performer. The discriminant function coefficients are analogous to regression coefficients, and they range between values of -1.0 and 1.0. The first box in Figure 1 provides hypothetical results of the discriminant analysis. The second box provides the within-group averages of the discriminant function for the two categories of the dependent variable. Note that the high performers have an average score of 1.45 on the discriminant function, while the low performers have an average score of -.89. The discriminant function is treated as a standardized variable, so it has a mean of zero and a standard deviation of one. The average values of the discriminant function scores are meaningful only in that they help us interpret the coefficients. Since the high performers are at the upper end of the scale, the positive coefficients indicate that the greater the value of those variables, the greater the likelihood of a worker being a high performer (e.g., education, motivation).
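The sketch below uses scikit-learn's LinearDiscriminantAnalysis on simulated data to show where quantities of this kind come from. The simulated group means will not match the 1.45 and -.89 reported above, and scikit-learn's scalings_ are not rescaled to lie between -1.0 and 1.0, so treat it as an illustration rather than a reproduction of Figure 1.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Simulated data: three predictors for 200 workers, coded 1 = high performer.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = rng.normal(size=(200, 3)) + y[:, None] * np.array([1.0, 0.8, 0.1])

lda = LinearDiscriminantAnalysis()
scores = lda.fit(X, y).transform(X).ravel()   # discriminant function scores

# Within-group averages of the discriminant scores for each category.
for grp, label in [(1, "high performers"), (0, "low performers")]:
    print(f"Mean discriminant score, {label}: {scores[y == grp].mean():.2f}")

# The scalings play the role of the discriminant function coefficients.
print("Discriminant function coefficients (scalings):", lda.scalings_.ravel())
```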

The magnitudes of the coefficients also tell us something about the relative contributions of the independent variables. The closer the value of a coefficient is to zero, the weaker it is as a predictor of the dependent variable. On the other hand, the closer the value of a coefficient is to either 1.0 or -1.0, the stronger it is as a predictor of the dependent variable. In this example, then, years of education and ability to handle stress both have positive coefficients, though the latter is quite weak. Finally, individuals who place high importance on family life are less likely to be high performers than those who do not.

The second step in discriminant analysis involves predicting to which group of the dependent variable a particular case belongs. A subject's discriminant score can be translated into a probability of being in a particular group by means of Bayes' rule. Separate probabilities are computed for each group and the subject is assigned to the group with the highest probability. Another test of the adequacy of a model is the degree to which known cases are correctly classified. As in other statistical procedures, it is generally preferable to test the model on a set of cases that were not used to estimate the model's parameters, since this provides a more conservative test of the model. Thus, a set of cases should, if possible, be held back for this purpose. Once the analysis is complete, the results can be used to predict the work potential of job candidates and, ideally, improve the selection process.
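As a sketch of this second step (again with simulated data, not the article's), scikit-learn's LDA classifier computes posterior group probabilities via Bayes' rule, and a held-out test set provides the more conservative check described above.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Simulated candidate data: three predictors, coded 1 = high performer.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=400)
X = rng.normal(size=(400, 3)) + y[:, None] * np.array([1.0, 0.8, 0.1])

# Hold back cases that are not used to estimate the model's parameters.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)

# Posterior probability of each group (Bayes' rule); each case is assigned
# to the group with the highest probability.
posterior = lda.predict_proba(X_test)
predicted = lda.predict(X_test)

print("Proportion of held-out cases correctly classified:",
      accuracy_score(y_test, predicted))
print("Group probabilities for the first held-out case:", posterior[0].round(3))
```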
