Tuesday, 12 May 2015

Discriminant Analysis


Discriminant function analysis is used to classify individuals into predetermined groups. It is a multivariate analogue of analysis of variance and can be considered an a posteriori procedure of multivariate analysis of variance.

If discriminant function analysis is effective for a set of data, the classification table of correct and incorrect assignments will show a high percentage of correct classifications. Multiple discriminant function analysis (sometimes called canonical variates analysis) is used when there are three or more groups.

Discriminant analysis (in the broad sense) is a very powerful statistical tool for many types of analyses. The IRS uses discriminant analysis to identify the people it wants to audit. The threshold discriminant function coefficients and scores are probably more secure than nuclear technology and Bill Clinton’s black book.


Multiple Group Discriminant Analysis (=Canonical Discriminant Analysis) 



Canonical discriminant analysis is used as a means of distinguishing among a group of samples from potentially different populations. The goals are to:

(1) find the axis of greatest discrimination between groups identified a priori.
(2) test whether the means of those groups along that axis are significantly different.
(3) attempt to assign individual specimens to groups. 

When there are two groups to be separated, the technique is known as discriminant function analysis (DFA); with more than two groups, the same question is addressed through canonical variates analysis (CVA). Often these terms are used interchangeably. 

The key assumption of canonical discriminant analysis is that all individuals can be assigned to one and only one group in advance, through some means external to the data being analysed. In this way it differs from PCA and factor analysis, which assume that any sub-structuring in the data is unknown prior to the analysis.
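
As a rough sketch of goals (1)-(3) in Python, assuming scikit-learn's LinearDiscriminantAnalysis and the built-in Iris data (three species known a priori, four measurements), the fitted model gives the discriminant axes, how much between-group separation each captures, and a classification table:

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix

X, y = load_iris(return_X_y=True)          # 3 groups, 4 discriminating variables

lda = LinearDiscriminantAnalysis()
scores = lda.fit(X, y).transform(X)        # canonical variates (discriminant scores)

print(scores.shape)                        # (150, 2): min(g - 1, p) = 2 functions
print(lda.explained_variance_ratio_)       # share of between-group variance per axis

# Classification table of correct and incorrect assignments (goal 3);
# a formal test of group separation (goal 2) would use Wilks' lambda,
# e.g. via a MANOVA routine.
print(confusion_matrix(y, lda.predict(X)))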


Key Terms and Concepts 


Discriminant function: 
A discriminant function, also called a canonical vector, is a latent variable created as a linear combination of discriminating (independent) variables, such that Z = b1x1 + b2x2 + ... + bnxn + a, where the b's are discriminant coefficients, the x's are discriminating variables, and a is a constant. This is analogous to multiple regression, but the b's are discriminant coefficients chosen to maximize the separation between the group means on the discriminant score. 
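
For instance, a tiny sketch of this linear combination, with made-up (purely illustrative) coefficients b and constant a:

import numpy as np

b = np.array([0.8, -0.3, 1.2])      # discriminant coefficients b1..bn (hypothetical)
a = -2.5                            # constant term (hypothetical)
x = np.array([5.1, 3.5, 1.4])       # one observation's discriminating variables

Z = b @ x + a                       # discriminant score Z = b1*x1 + ... + bn*xn + a
print(Z)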

Number of discriminant functions:
There is one discriminant function for 2-group discriminant analysis, but for higher-order DA the number of functions is the lesser of (g - 1), where g is the number of groups, and p, the number of discriminating (independent) variables. Each discriminant function is orthogonal to the others. 
Multiple-group discriminant analysis is equivalent to canonical correlation analysis.
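
A quick check of the min(g - 1, p) rule, assuming scikit-learn and the Iris data (g = 3 groups, p = 4 variables):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
g, p = np.unique(y).size, X.shape[1]

scores = LinearDiscriminantAnalysis().fit(X, y).transform(X)
assert scores.shape[1] == min(g - 1, p)    # 2 orthogonal discriminant functions here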

Standardized discriminant coefficients:
Also termed standardized canonical discriminant function coefficients, these are used to compare the relative importance of the independent variables, much as beta weights are used in regression. 
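
One way to obtain them, sketched here under the assumption that scikit-learn's scalings_ attribute holds the raw (unstandardized) canonical coefficients, is to multiply the raw coefficients by the pooled within-group standard deviations:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis().fit(X, y)

# pooled within-group variance of each variable (weighted by group df)
groups = [X[y == k] for k in np.unique(y)]
pooled_var = sum((len(grp) - 1) * grp.var(axis=0, ddof=1)
                 for grp in groups) / (len(X) - len(groups))

# standardized coefficients: raw canonical coefficients times within-group SDs
std_coefs = lda.scalings_ * np.sqrt(pooled_var)[:, None]
print(np.round(std_coefs, 2))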

Functions at group centroids:
These are the mean discriminant scores for each group on each function. The group centroid is the mean value of the discriminant scores for a given group. Two-group discriminant analysis has two centroids, one for each group. 
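
Computing them is straightforward once the discriminant scores are in hand; a sketch with scikit-learn and the Iris data:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
scores = LinearDiscriminantAnalysis().fit(X, y).transform(X)

# one row per group, one column per discriminant function
centroids = np.array([scores[y == k].mean(axis=0) for k in np.unique(y)])
print(np.round(centroids, 2))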

Discriminant function plot:
This is the plot of the functions at group centroids, where the axes are two of the discriminant dimensions and any given point represents a group's mean on those two dimensions. The farther apart one point is from another on the plot, the more the dimensions represented by those axes differentiate those two groups. 
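
A plotting sketch with matplotlib, showing individual scores and group centroids on the first two discriminant functions:

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
scores = LinearDiscriminantAnalysis().fit(X, y).transform(X)

for k in np.unique(y):
    pts = scores[y == k]
    plt.scatter(pts[:, 0], pts[:, 1], alpha=0.4, label=f"group {k}")
    plt.scatter(*pts.mean(axis=0), marker="x", s=120, color="black")  # centroid

plt.xlabel("Discriminant function 1")
plt.ylabel("Discriminant function 2")
plt.legend()
plt.show()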


Assumptions: 


Discriminant function analysis is computationally very similar to MANOVA (Multivariate Analysis of Variance), and all assumptions for MANOVA apply. 

Sample size: The sample size of the smallest group needs to exceed the number of predictor variables. The maximum number of independent variables is n - 2, where n is the sample size. It is best to have 4 or 5 times as many observations as independent variables. 
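
A quick sanity check of these guidelines, using hypothetical group sizes and a hypothetical number of predictors:

import numpy as np

group_sizes = np.array([30, 25, 28])   # hypothetical group sizes
p = 5                                  # number of predictor (independent) variables
n = group_sizes.sum()

print(group_sizes.min() > p)           # smallest group exceeds number of predictors
print(p <= n - 2)                      # no more than n - 2 predictors
print(n >= 4 * p)                      # roughly 4-5 observations per predictor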

Normal distribution: It is assumed that the data (for the variables) represent a sample from a multivariate normal distribution. 
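
A common informal screen (not a true multivariate normality test) is to run a univariate test such as Shapiro-Wilk on each variable within each group; small p-values flag departures from normality:

import numpy as np
from scipy import stats
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
for k in np.unique(y):
    pvals = [stats.shapiro(X[y == k][:, j])[1] for j in range(X.shape[1])]
    print(k, np.round(pvals, 3))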

Homogeneity of variances/covariances: DA is very sensitive to heterogeneity of the variance-covariance matrices. 
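
An informal way to eyeball this is to compare the log-determinants of the group covariance matrices, the quantities a formal Box's M test is built from; large discrepancies suggest heterogeneity:

import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
for k in np.unique(y):
    S_k = np.cov(X[y == k], rowvar=False)      # group covariance matrix
    print(k, round(float(np.linalg.slogdet(S_k)[1]), 2))   # log|S_k|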

Outliers: DA is highly sensitive to the inclusion of outliers.
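
One common screening approach is to flag observations whose squared Mahalanobis distance from their group centroid exceeds a chi-square cutoff (the 0.999 quantile here is an arbitrary but common choice):

import numpy as np
from scipy import stats
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
cutoff = stats.chi2.ppf(0.999, df=X.shape[1])

for k in np.unique(y):
    G = X[y == k]
    diff = G - G.mean(axis=0)
    # squared Mahalanobis distance of each observation from its group centroid
    d2 = np.sum(diff @ np.linalg.inv(np.cov(G, rowvar=False)) * diff, axis=1)
    print(k, np.where(d2 > cutoff)[0])    # indices of flagged observations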



