Thursday 14 May 2015

Linear discriminant analysis



Introduction


Linear Discriminant Analysis (LDA) is most commonly used as a dimensionality reduction technique in the pre-processing step for pattern-classification and machine learning applications. The goal is to project a dataset onto a lower-dimensional space with good class-separability in order to avoid overfitting (the "curse of dimensionality") and to reduce computational costs.
Ronald A. Fisher formulated the Linear Discriminant in 1936 (The Use of Multiple Measurements in Taxonomic Problems), and it also has some practical uses as a classifier. The original linear discriminant was described for a 2-class problem, and it was later generalized as "multi-class Linear Discriminant Analysis" or "Multiple Discriminant Analysis" by C. R. Rao in 1948 (The utilization of multiple measurements in problems of biological classification).
The general LDA approach is very similar to a Principal Component Analysis (for more information about the PCA, see the previous article Implementing a Principal Component Analysis (PCA) in Python step by step), but in addition to finding the component axes that maximize the variance of our data (PCA), we are additionally interested in the axes that maximize the separation between multiple classes (LDA).
So, in a nutshell, often the goal of an LDA is to project a feature space (a dataset of n-dimensional samples) onto a smaller subspace k (where k ≤ n-1) while maintaining the class-discriminatory information.
In general, dimensionality reduction not only helps reduce computational costs for a given classification task, but it can also help avoid overfitting by minimizing the error in parameter estimation (the "curse of dimensionality").
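As a concrete illustration, a projection like this can be computed with scikit-learn in a few lines. This is only a minimal sketch, assuming scikit-learn is installed and using the Iris dataset (4 features, 3 classes) as a convenient example:

# Minimal sketch: project the Iris data onto k = 2 linear discriminants.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)            # 150 samples, 4 dimensions, 3 classes
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)              # LDA needs the class labels y
print(X_lda.shape)                           # (150, 2)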


Principal Component Analysis vs. Linear Discriminant Analysis


Both Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA) are linear transformation techniques that are commonly used for dimensionality reduction. PCA can be described as an "unsupervised" algorithm, since it "ignores" class labels and its goal is to find the directions (the so-called principal components) that maximize the variance in a dataset. In contrast to PCA, LDA is "supervised" and computes the directions ("linear discriminants") that will represent the axes that maximize the separation between multiple classes.
Although it might sound intuitive that LDA is superior to PCA for a multi-class classification task where the class labels are known, this might not always be the case.
For example, comparisons between classification accuracies for image recognition after using PCA or LDA show that PCA tends to outperform LDA if the number of samples per class is relatively small (PCA vs. LDA, A.M. Martinez et al., 2001). In practice, it is also not uncommon to use both LDA and PCA in combination: E.g., PCA for dimensionality reduction followed by an LDA.
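The PCA-followed-by-LDA combination mentioned above could be sketched as a scikit-learn pipeline like the one below. This is a hedged example, not a recipe: the digits dataset and the choice of 10 principal components are arbitrary and only for illustration.

# Sketch: PCA for dimensionality reduction, followed by an LDA.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)
pca_lda = make_pipeline(PCA(n_components=10),
                        LinearDiscriminantAnalysis(n_components=2))
X_reduced = pca_lda.fit_transform(X, y)      # PCA ignores y, the LDA step uses it
print(X_reduced.shape)                       # (1797, 2)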

What is a "good" feature subspace?


Let's assume that our goal is to reduce the dimensions of a d-dimensional dataset by projecting it onto a k-dimensional subspace (where k ≤ d). So, how do we know what size we should choose for k (k = the number of dimensions of the new feature subspace), and how do we know if we have a feature space that represents our data "well"?
Later, we will compute so-called scatter matrices from our dataset (i.e., the between-class scatter matrix and the within-class scatter matrix) and obtain eigenvectors (the components) from them.
Each of these eigenvectors is associated with an eigenvalue, which tells us about the "length" or "magnitude" of that eigenvector.
If we observe that all eigenvalues have a similar magnitude, this may be a good indicator that our data is already projected onto a "good" feature space.
In the other scenario, if some of the eigenvalues are much larger than others, we might be interested in keeping only those eigenvectors with the highest eigenvalues, since they contain more information about our data distribution. Conversely, eigenvalues that are close to 0 are less informative, and we might consider dropping them when constructing the new feature subspace.
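To make the eigenvalue discussion above a little more concrete, here is a rough NumPy sketch of that step (the function name and variable names are only illustrative):

# Rough sketch: scatter matrices and the eigenpairs of S_W^-1 * S_B.
import numpy as np

def lda_eigenpairs(X, y):
    # X is an (n_samples, d) data matrix, y holds the class labels
    overall_mean = X.mean(axis=0)
    d = X.shape[1]
    S_W = np.zeros((d, d))                   # within-class scatter matrix
    S_B = np.zeros((d, d))                   # between-class scatter matrix
    for c in np.unique(y):
        X_c = X[y == c]
        mean_c = X_c.mean(axis=0)
        S_W += (X_c - mean_c).T @ (X_c - mean_c)
        diff = (mean_c - overall_mean).reshape(-1, 1)
        S_B += len(X_c) * diff @ diff.T
    # large eigenvalues mark the directions that carry most of the
    # class-discriminatory information; near-zero ones can be dropped
    eig_vals, eig_vecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
    order = np.argsort(eig_vals.real)[::-1]
    return eig_vals.real[order], eig_vecs.real[:, order]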







Choosing the Right Type of Rotation in PCA and EFA



What Is Rotation?
In the PCA/EFA literature, definitions of rotation abound. For example, McDonald (1985, p. 40) 
defines rotation as “performing arithmetic to obtain a new set of factor loadings (v-ƒ regression 
weights) from a given set,” and Bryant and Yarnold (1995, p. 132) define it as “a procedure in which 
the eigenvectors (factors) are rotated in an attempt to achieve simple structure.” Perhaps a bit more 
helpful is the definition supplied in Vogt (1993, p. 91): “Any of several methods in factor analysis by 
which the researcher attempts to relate the calculated factors to theoretical entities. This is done 
differently depending upon whether the factors are believed to be correlated (oblique) or uncorrelated 
(orthogonal).” And even more helpful is Yaremko, Harari, Harrison, and Lynn (1986), who define 
factor rotation as follows: “In factor or principal-components analysis, rotation of the factor axes 
(dimensions) identified in the initial extraction of factors, in order to obtain simple and interpretable 
factors.” They then go on to explain and list some of the types of orthogonal and oblique procedures. 
 How can a concept with a goal of simplification be so complicated? Let me try defining rotation
from the perspective of a language researcher, while trying to keep it simple. I think of rotation as any 
of a variety of methods (explained below) used to further analyze initial PCA or EFA results with the 
goal of making the pattern of loadings clearer, or more pronounced. This process is designed to reveal 
the simple structure. 
The choices that researchers make among the orthogonal and oblique varieties of these rotation 
methods and the notion of simple structure will be the main topics in the rest of this column. 
What Are the Different Types of Rotation?
As mentioned earlier, rotation methods are either orthogonal or oblique. Simply put, orthogonal 
rotation methods assume that the factors in the analysis are uncorrelated. Gorsuch (1983, pp. 203-204) 
lists four different orthogonal methods: equamax, orthomax, quartimax, and varimax. In contrast, 
oblique rotation methods assume that the factors are correlated. Gorsuch (1983, pp. 203-204) lists 15 
different oblique methods.
Version 16 of SPSS offers five rotation methods: varimax, direct oblimin, quartimax, equamax, and 
promax, in that order. Three of those are orthogonal (varimax, quartimax, & equamax), and two are 
oblique (direct oblimin & promax). Factor analysis is not the focus of my life, nor am I eager to learn 
how to use a new statistical program or calculate rotations by hand (though I’m sure I could do it if I 
had a couple of spare weeks), so those five SPSS options serve as boundaries for the choices I make. 
But how should I choose which one to use? 
Tabachnick and Fiddell (2007, p. 646) argue that “Perhaps the best way to decide between 
orthogonal and oblique rotation is to request oblique rotation [e.g., direct oblimin or promax from 
SPSS] with the desired number of factors [see Brown, 2009b] and look at the correlations among 
factors…if factor correlations are not driven by the data, the solution remains nearly orthogonal. Look 
at the factor correlation matrix for correlations around .32 and above. If correlations exceed .32, then 
there is 10% (or more) overlap in variance among factors, enough variance to warrant oblique rotation 
unless there are compelling reasons for orthogonal rotation.” 
 For example, using the same Brazilian data I used for examples in Brown 2009a and b (based on 
the 12 subtests of the Y/G Personality Inventory from Guilford & Yatabe, 1957), I ran a three-factor 
EFA followed by a direct oblimin rotation. The resulting correlation matrix for the factors that the 
analysis produced is shown in Table 1. Notice that the highest correlation is .084. Since none of the 
correlations exceeds the Tabachnick and Fiddell threshold of .32 described in the previous paragraph, 
“the solution remains nearly orthogonal.” Thus, I could just as well run an orthogonal rotation. 
Table 1. Correlation Matrix for the Three Factors in an EFA with Direct Oblimin Rotation for the Brazilian Y/GPI Data
Factor        1          2          3
1          1.000     -0.082      0.084
2         -0.082      1.000     -0.001
3          0.084     -0.001      1.000
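As a quick illustration of the .32 rule of thumb, the Table 1 correlations can be checked in a few lines of NumPy (this is only a sketch of the arithmetic, not SPSS output):

# Check the Table 1 factor correlations against the .32 rule of thumb.
import numpy as np

phi = np.array([[ 1.000, -0.082,  0.084],
                [-0.082,  1.000, -0.001],
                [ 0.084, -0.001,  1.000]])

off_diag = phi[~np.eye(3, dtype=bool)]
print(round(0.32 ** 2, 3))                   # 0.102 -> .32 corresponds to ~10% shared variance
print(np.any(np.abs(off_diag) > 0.32))       # False -> the solution remains nearly orthogonal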
Moreover, as Kim and Mueller (1978, p. 50) put it, “Even the issue of whether factors are 
correlated or not may not make much difference in the exploratory stages of analysis. It even can be 
argued that employing a method of orthogonal rotation (or maintaining the arbitrary imposition that the 
factors remain orthogonal) may be preferred over oblique rotation, if for no other reason than that the 
former is much simpler to understand and interpret.” 

How Do Researchers Decide Which Particular Type of Rotation to Use?
We can think of the goal of rotation and of choosing a particular type of rotation as seeking 
something called simple structure, or put another way, one way we know if we have selected an 
adequate rotation method is if the results achieve simple structure. But what is simple structure? 
Bryant and Yarnold (1995, p. 132-133) define simple structure as:

A condition in which variables load at near 1 (in absolute value) or at near 0 on an eigenvector (factor). Variables 
that load near 1 are clearly important in the interpretation of the factor, and variables that load near 0 are clearly 
unimportant. Simple structure thus simplifies the task of interpreting the factors.
Using logic like that in the preceding quote, Thurstone (1947) first proposed and argued for five 
criteria that needed to be met for simple structure to be achieved: 
1. Each variable should produce at least one zero loading on some factor. 
2. Each factor should have at least as many zero loadings as there are factors.
3. Each pair of factors should have variables with significant loadings on one 
and zero loadings on the other. 
4. Each pair of factors should have a large proportion of zero loadings on both factors 
(if there are, say, four or more factors in total).
5. Each pair of factors should have only a few complex variables.




Get Smarter about Big Data

  • Data is often imperfect – and that’s usually a good thing! You don’t need perfect information to find interesting relationships in the data – in fact, counter-intuitively, “dirty” data is sometimes better for finding relationships, because cleansing may remove the very attribute that enables matching. On the other hand, some information is a lie, as “bad guys” will intentionally try to fool you, or to separate their interactions with your firm into different channels (web, mobile, store) to avoid detection. You should assign a trust level to “known” information, and it rarely approaches 100%.
  • Your data can make you smarter as time passes. As new observations continue to accumulate, they enable you to refine your understanding, and even to reverse earlier assertions of your analysis based on what you knew at the time. Therefore, be sure to rerun earlier analyses over the full dataset, and don’t assume the conclusions of your previous analysis were correct.
  • Partial information is often enough. It’s surprising how soon you can start to see a picture emerge – with puzzles, the picture can often be identified with only 50% of the pieces, and this aspect of human cognition often applies to machine learning, too. Once the picture starts to emerge, you can more quickly understand each new puzzle piece (observation) by seeing it in the context it occupies among those around it.
This emerging picture should inform your collection efforts – you might need to obtain a new information source to follow up a lead from an earlier analysis, or to discard an information source (and the cost of collecting and analyzing it) once you realize it’s not helping.
  • More data is always good. The case for accumulating more data – Big Data – is strong: not only does it bring deeper insights, it also can reduce your compute workload – Jeff’s experience shows that the length of time it takes to link a new observation into a large information network actually goes down as the total number of observations goes up, beyond a certain threshold.
One of the most interesting new sources of Big Data insights is data about the interactions of people with systems – even their mistakes! That’s how Google knows to ask “did you mean this?”
  • Can you count? Good! Accurate counting of entities (people, cases, transactions), a.k.a. Entity Resolution, is critical to deeper analysis – if you can’t count, you can’t determine a vector or velocity, and without those, you can’t make predictions. Many interesting analyses in fraud detection involve detecting identities – accurately counting people, knowing when two identities are the same person, or when one identity is actually being used by more than one person, or even when an identity is not a real person at all… Identity matching is also the source of analyses that identify dead people voting and other such fraud.
  • Privacy matters, but it’s not an obstacle. Once identity comes into play, then privacy concerns (and regulations) must of course be taken into consideration. There are advanced techniques such as one-way hashes that can be used to anonymize a data set without reducing its usefulness for analytical purposes (see the brief sketch after this list).
  • Bad guys can be smart, too. Skilled adversaries present unique problems, but they can be overcome: to catch them, you must collect observations the adversary doesn’t know you have (e.g. a camera on a route, that they don’t know you have), or compute over your observations in a way the adversary can’t imagine (e.g. recognizing faces or license plates, and correlating that with other location information).
So as adversaries get smarter and more capable of avoiding detection all the time, savvy analysts must continually push the edge of the envelope of applying new techniques and technology to the game.
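As a hedged illustration of the one-way hashing idea from the privacy bullet above, the snippet below anonymizes an identifier with a keyed hash. The field name and secret key are made up for the example; a real deployment would need careful key management.

# Sketch: anonymize an identifier with a keyed one-way hash (HMAC-SHA256).
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-key-kept-out-of-the-dataset"   # hypothetical key

def anonymize(customer_id: str) -> str:
    # The same input always maps to the same token, so records can still be
    # linked for analysis, but the original identifier cannot be recovered.
    return hmac.new(SECRET_KEY, customer_id.encode(), hashlib.sha256).hexdigest()

print(anonymize("customer-12345"))           # hypothetical identifier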

Pseudo-Random Number Generators (PRNGs)

As the word ‘pseudo’ suggests, pseudo-random numbers are not random in the way you might expect, at least not if you're used to dice rolls or lottery tickets. Essentially, PRNGs are algorithms that use mathematical formulae or simply precalculated tables to produce sequences of numbers that appear random. A good example of a PRNG is the linear congruential method. A good deal of research has gone into pseudo-random number theory, and modern algorithms for generating pseudo-random numbers are so good that the numbers look exactly like they were really random.
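To see how simple such an algorithm can be, here is a minimal sketch of a linear congruential generator. The constants below are one commonly used choice; many others exist, and real applications should rely on a well-tested library generator instead.

# Minimal sketch of the linear congruential method: x_{n+1} = (a*x_n + c) mod m.
def lcg(seed, a=1664525, c=1013904223, m=2 ** 32):
    state = seed
    while True:
        state = (a * state + c) % m
        yield state / m                      # scale to a float in [0, 1)

gen = lcg(seed=42)
print([round(next(gen), 4) for _ in range(5)])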
The basic difference between PRNGs and TRNGs is easy to understand if you compare computer-generated random numbers to rolls of a die. Because PRNGs generate random numbers by using mathematical formulae or precalculated lists, using one corresponds to someone rolling a die many times and writing down the results. Whenever you ask for a die roll, you get the next on the list. Effectively, the numbers appear random, but they are really predetermined. TRNGs work by getting a computer to actually roll the die — or, more commonly, use some other physical phenomenon that is easier to connect to a computer than a die is.
PRNGs are efficient, meaning they can produce many numbers in a short time, and deterministic, meaning that a given sequence of numbers can be reproduced at a later date if the starting point in the sequence is known. Efficiency is a nice characteristic if your application needs many numbers, and determinism is handy if you need to replay the same sequence of numbers again at a later stage. PRNGs are typically also periodic, which means that the sequence will eventually repeat itself. While periodicity is hardly ever a desirable characteristic, modern PRNGs have a period that is so long that it can be ignored for most practical purposes.
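The determinism described above is easy to demonstrate with Python's built-in generator (a PRNG based on the Mersenne Twister): reusing the seed reproduces the sequence exactly.

# Same seed, same sequence: the deterministic side of a PRNG.
import random

random.seed(2015)
first_run = [random.random() for _ in range(3)]

random.seed(2015)
second_run = [random.random() for _ in range(3)]

print(first_run == second_run)               # True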
These characteristics make PRNGs suitable for applications where many numbers are required and where it is useful that the same sequence can be replayed easily. Popular examples of such applications are simulation and modeling applications. PRNGs are not suitable for applications where it is important that the numbers are really unpredictable, such as data encryption and gambling.
It should be noted that even though good PRNG algorithms exist, they aren't always used, and it's easy to get nasty surprises. Take the example of the popular web programming language PHP. If you use PHP for GNU/Linux, chances are you will be perfectly happy with your random numbers. However, if you use PHP for Microsoft Windows, you will probably find that your random numbers aren't quite up to scratch as shown in this visual analysis from 2008. Another example dates back to 2002 when one researcher reported that the PRNG on MacOS was not good enough for scientific simulation of virus infections. The bottom line is that even if a PRNG will serve your application's needs, you still need to be careful about which one you use.

True Random Number Generators (TRNGs)

In comparison with PRNGs, TRNGs extract randomness from physical phenomena and introduce it into a computer. You can imagine this as a die connected to a computer, but typically people use a physical phenomenon that is easier to connect to a computer than a die is. The physical phenomenon can be very simple, like the little variations in somebody's mouse movements or in the amount of time between keystrokes. In practice, however, you have to be careful about which source you choose. For example, it can be tricky to use keystrokes in this fashion, because keystrokes are often buffered by the computer's operating system, meaning that several keystrokes are collected before they are sent to the program waiting for them. To a program waiting for the keystrokes, it will seem as though the keys were pressed almost simultaneously, and there may not be a lot of randomness there after all.
However, there are many other ways to get true randomness into your computer. A really good physical phenomenon to use is a radioactive source. The points in time at which a radioactive source decays are completely unpredictable, and they can quite easily be detected and fed into a computer, avoiding any buffering mechanisms in the operating system. The HotBits service at Fourmilab in Switzerland is an excellent example of a random number generator that uses this technique. Another suitable physical phenomenon is atmospheric noise, which is quite easy to pick up with a normal radio. This is the approach used by RANDOM.ORG. You could also use background noise from an office or laboratory, but you'll have to watch out for patterns. The fan from your computer might contribute to the background noise, and since the fan is a rotating device, chances are the noise it produces won't be as random as atmospheric noise.



As long as you are careful, the possibilities are endless. Undoubtedly the visually coolest approach was the lavarand generator, which was built by Silicon Graphics and used snapshots of lava lamps to generate true random numbers. Unfortunately, lavarand is no longer operational, but one of its inventors is carrying on the work (without the lava lamps) at the LavaRnd web site. Yet another approach is the Java EntropyPool, which gathers random bits from a variety of sources including HotBits and RANDOM.ORG, but also from web page hits received by the EntropyPool's own web server.
Regardless of which physical phenomenon is used, the process of generating true random numbers involves identifying little, unpredictable changes in the data. For example, HotBits uses little variations in the delay between occurrences of radioactive decay, and RANDOM.ORG uses little variations in the amplitude of atmospheric noise.
The characteristics of TRNGs are quite different from PRNGs. First, TRNGs are generally rather inefficient compared to PRNGs, taking considerably longer time to produce numbers. They are also nondeterministic, meaning that a given sequence of numbers cannot be reproduced, although the same sequence may of course occur several times by chance. TRNGs have no period.

Comparison of PRNGs and TRNGs

The table below sums up the characteristics of the two types of random number generators.
Characteristic      Pseudo-Random Number Generators      True Random Number Generators
Efficiency          Excellent                            Poor
Determinism         Deterministic                        Nondeterministic
Periodicity         Periodic                             Aperiodic
These characteristics make TRNGs suitable for roughly the set of applications that PRNGs are unsuitable for, such as data encryption, games and gambling. Conversely, the poor efficiency and nondeterministic nature of TRNGs make them less suitable for simulation and modeling applications, which often require more data than it's feasible to generate with a TRNG. The following table contains a summary of which applications are best served by which type of generator:
Application                                            Most Suitable Generator
Lotteries and Draws                                    TRNG
Games and Gambling                                     TRNG
Random Sampling (e.g., drug screening)                 TRNG
Simulation and Modelling                               PRNG
Security (e.g., generation of data encryption keys)    TRNG
The Arts                                               Varies


Data Scientist vs Data Engineer

Data Scientist
A data scientist is responsible for pulling insights from data. It is the data scientist's job to pull data, create models, create data products, and tell a story. A data scientist should typically have interactions with customers and/or executives. A data scientist should love scrubbing a dataset for more and more understanding.

The main goal of a data scientist is to produce data products and tell the stories of the data. A data scientist would typically have stronger statistics and presentation skills than a data engineer.


Data Engineer
Data Engineering is more focused on the systems that store and retrieve data. A data engineer will be responsible for building and deploying storage systems that can adequately handle the needs. Sometimes the needs are fast real-time incoming data streams. Other times the needs are massive amounts of large video files. Still other times the needs are many, many reads of the data. In other words, a data engineer needs to build systems that can handle the 3 Vs of big data.

The main goal of a data engineer is to make sure the data is properly stored and available to the data scientist and others that need access. A data engineer would typically have stronger software engineering and programming skills than a data scientist.


Conclusion
It is too early to tell if these 2 roles will ever have a clear distinction of responsibilities, but it is nice to see a little separation of responsibilities for the mythical all-in-one data scientist. Both of these roles are important to a properly functioning data science team.

Chebyshev’s Inequality

Chebyshev’s theorem refers to several theorems, all proven by Russian mathematician Pafnuty Chebyshev. They include: Chebyshev’s inequality, Bertrand’s postulate, Chebyshev’s sum inequality and Chebyshev’s equioscillation theorem. Chebyshev’s inequality is the theorem most often used in stats. 
It states that no more than 1/k² of a distribution’s values can be more than k standard deviations away from the mean. With a normal distribution, standard deviations tell you how much of that distribution’s data lies within k standard deviations of the mean. If you have a distribution that isn’t normal, you can use Chebyshev’s inequality to find out what percentage of the data is clustered around the mean.
Chebyshev’s Inequality relates to the distribution of numbers in a set. In layman’s terms, the formula helps you figure out the number of values that are inside and outside the standard deviation. The standard deviation tells you how far away values are from the average of the set. Roughly two-thirds of the values should fall within one standard deviation either side of the mean in a normal distribution. In statistics, it’s often referred to as Chebyshev’s Theorem (as opposed to Chebyshev’s Inequality). 
Chebyshev’s Inequality formula lets you bound (with little information required on your part) the probability of outliers existing beyond a certain interval. Given that X is a random variable, A stands for the mean of the set, K is the number of standard deviations, and Y is the value of the standard deviation, the formula reads as follows:

 Pr(|X - A| ≥ KY) ≤ 1/K²

In words: the probability that the absolute value of the difference between X and A is greater than or equal to K times Y is at most one divided by K squared.
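A quick empirical check of the bound is easy to run in NumPy. The exponential sample below is just an arbitrary non-normal example; the estimated tail probabilities should always stay at or below 1/K².

# Empirical check of Chebyshev's inequality on a skewed, non-normal sample.
import numpy as np

rng = np.random.default_rng(0)
X = rng.exponential(scale=2.0, size=100_000)
A, Y = X.mean(), X.std()                     # mean and standard deviation

for K in (1.5, 2, 3):
    tail = np.mean(np.abs(X - A) >= K * Y)   # estimated Pr(|X - A| >= K*Y)
    print(K, round(tail, 4), "<=", round(1 / K ** 2, 4))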


Applications of Chebyshev's Inequality

The formula was used, together with calculus, to develop the weak version of the law of large numbers. This law states that as a sample grows in size, its average should get closer to the theoretical mean. A simple example: when rolling a six-sided die, the expected average is 3.5. A sample of 5 rolls may give drastically different results, but roll the die 20 times and the average should begin approaching 3.5. As you add more and more rolls, the average should continue to move toward 3.5 until it reaches it, or becomes so close that the two are practically equal.
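A small simulation of the dice example looks like this (a sketch; the exact averages will vary from run to run, but they drift toward 3.5 as the number of rolls grows):

# Weak law of large numbers: running average of die rolls approaches 3.5.
import random

random.seed(1)
for n in (5, 20, 1_000, 100_000):
    rolls = [random.randint(1, 6) for _ in range(n)]
    print(n, round(sum(rolls) / n, 3))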
Another application is in finding the difference between the mean and median of a set of numbers. Using a one-sided version of Chebyshev’s Inequality theorem, also known as Cantelli’s theorem, you can prove the absolute value of the difference between the median and the mean will always be less than or equal to the standard deviation. This is handy in determining if a median you derived is plausible.

Tuesday 12 May 2015

Shivanker Kumar for the class of May 9th

z-score and its application 


Z-scores are expressed in terms of standard deviations from their means. As a result, these z-scores have a distribution with a mean of 0 and a standard deviation of 1. The formula for calculating the standard score is given below:
Standard Score Calculation

Z = (X-µ)/s


As the formula shows, the standard score is simply the score, minus the mean score, divided by the standard deviation. Let’s see the application of z-score.
Application –
1. How well did Sarah perform in her English Literature coursework compared to the other 50 students?
To answer this question, we can re-phrase it as: What percentage (or number) of students scored higher than Sarah and what percentage (or number) of students scored lower than Sarah? First, let's reiterate that Sarah scored 70 out of 100, the mean score was 60, and the standard deviation was 15 (see below).
                Score      Mean      Standard Deviation
                (X)           µ              s
                70           60           15
In terms of z-scores, this gives us:
Z = (X-µ)/s = (70-60)/15 = .6667
The z-score is 0.67 (to 2 decimal places), but now we need to work out the percentage (or number) of students that scored higher and lower than Sarah. To do this, we need to refer to the standard normal distribution table.
This table helps us to identify the probability that a score is greater or less than our z-score. To use the table, which is easier than it might look at first sight, we start with our z-score, 0.67 (if our z-score had more than two decimal places, for example, ours was 0.6667, we would round it up or down accordingly; hence, 0.6667 would become 0.67). The y-axis in the table highlights the first two digits of our z-score and the x-axis the second decimal place. Therefore, we start with the y-axis, finding 0.6, and then move along the x-axis until we find 0.07, before finally reading off the appropriate number; in this case, 0.2514. This means that the probability of a score being greater than 0.67 is 0.2514. If we look at this as a percentage, we simply multiply by 100; hence 0.2514 x 100 = 25.14%. In other words, around 25% of the class got a better mark than Sarah (roughly 13 students, since there is no such thing as part of a student!).
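The same table lookup can be reproduced in code, for example with SciPy (a sketch, assuming SciPy is installed):

# Proportion of scores above Sarah's, via the standard normal upper tail.
from scipy.stats import norm

z = (70 - 60) / 15
print(round(norm.sf(0.67), 4))               # 0.2514, the rounded-table value
print(round(norm.sf(z), 4))                  # ~0.2525 with the unrounded z of 0.6667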

Going back to our question, "How well did Sarah perform in her English Literature coursework compared to the other 50 students?", clearly we can see that Sarah did better than a large proportion of students, with 74.86% of the class scoring lower than her (100% - 25.14% = 74.86%). We can also see how well she performed relative to the mean score: the proportion of scores between the mean and Sarah's score is 0.5 - 0.2514 = 0.2486. Hence, 24.86% of the scores (0.2486 x 100 = 24.86%) were lower than Sarah's, but above the mean score. However, the key finding is that Sarah's score was not one of the best marks. It wasn't even in the top 10% of scores in the class, even though at first sight we may have expected it to be. This leads us onto the second question.
2. Which students came in the top 10% of the class?
A better way of phrasing this would be to ask: What mark would a student have to achieve to be in the top 10% of the class and qualify for the advanced English Literature class?
To answer this question, we need to find the mark (which we call "X") on our frequency distribution that reflects the top 10% of marks. Since the mean score was 60 out of 100, we immediately know that the mark will be greater than 60. After all, if we refer to our frequency distribution below, we are interested in the area to the right of the mean score of 60 that reflects the top 10% of marks (shaded in red). As a decimal, the top 10% of marks would be those marks above 0.9 (i.e., 100% - 90% = 10% or 1 - 0.9 = 0.1).



First, we should convert our frequency distribution into a standard normal distribution. As such, our mean score of 60 becomes 0, and the score (X) we are looking for becomes our z-score, which is currently unknown (the top 10% of marks corresponds to the proportion 0.9 falling below the cut-off).
The next step involves finding out the value for our z-score. To do this, we refer back to the standard normal distribution table.


In answering the first question in this guide, we already knew the z-score, 0.67, which we used to find the appropriate percentage (or number) of students that scored higher than Sarah, 0.2514 (i.e., 25.14%, or roughly 13 students achieved a higher mark than Sarah). Using the z-score, 0.67, and the y-axis and x-axis of the standard normal distribution table, this guided us to the appropriate value, 0.2514. In this case, we need to do the exact reverse to find our z-score.
We know the percentage we are trying to find, the top 10% of students, corresponds to 0.9. As such, we first need to find the value 0.9 in standard normal distribution table. When looking at the table, you may notice that the closest value to 0.9 is 0.8997. If we take the 0.8997 value as our starting point and then follow this row across to the left, we are presented with the first part of the z-score. You will notice that the value on the y-axis for 0.8997 is 1.2. We now need to do the same for the x-axis, using the 0.8997 value as our starting point and following the column up. This time, the value on the x-axis for 0.8997 is 0.08. This forms the second part of the z-score. Putting these two values together, the z-score for 0.8997 is 1.28 (i.e., 1.2 + 0.08 = 1.28).
There is only one problem with this z-score; that is, it is based on a value of 0.8997 rather than the 0.9 value we are interested in. This is one of the difficulties of referring to the standard normal distribution table: it cannot give every possible z-score value (that would require a quite enormous table!). Therefore, you can either take the closest two values, 0.8997 and 0.9015, to your desired value, 0.9, which reflect the z-scores of 1.28 and 1.29, and then calculate the exact value of "z" for 0.9, or you can use a z-score calculator. If we use a z-score calculator, our value of 0.9 corresponds with a z-score of 1.282. In other words, P ( z > 1.282 ) = 0.1.


Now that we have the key information (that is, the mean score, µ, the standard deviation, s , and z-score, z), we can answer our question directly, namely: What mark would a student have to achieve to be in the top 10% of the class and qualify for the advanced English Literature class? First, let us reiterate the facts:
                Score      Mean      Standard Deviation      z-score
                (X)           µ              s                   z
                 ?            60             15                1.282
To find out the relevant score, we rearrange the z-score formula to solve for X:

X = µ + (z × s) = 60 + (1.282 × 15) = 79.23

Therefore, students that scored above 79.23 marks out of 100 came in the top 10% of the English Literature class, qualifying for the advanced English Literature class as a result.
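For completeness, the same cut-off can be computed directly instead of reading the table backwards (a sketch, assuming SciPy is installed):

# Top-10% cut-off: invert the standard normal CDF and rescale.
from scipy.stats import norm

mu, s = 60, 15
z = norm.ppf(0.90)                           # z-score with 90% of the area below it
print(round(z, 3))                           # 1.282
print(round(mu + z * s, 2))                  # ~79.22 (79.23 when z is rounded to 1.282)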

Hope you find it important and relevant!!