Thursday, 14 May 2015

Linear discriminant analysis



Introduction


Linear Discriminant Analysis (LDA) is most commonly used as a dimensionality reduction technique in the pre-processing step for pattern-classification and machine learning applications. The goal is to project a dataset onto a lower-dimensional space with good class separability in order to avoid overfitting ("curse of dimensionality") and also to reduce computational costs.
Ronald A. Fisher formulated the Linear Discriminant in 1936 (The Use of Multiple Measurements in Taxonomic Problems), and it also has some practical uses as a classifier. The original linear discriminant was described for a 2-class problem; it was later generalized as "multi-class Linear Discriminant Analysis" or "Multiple Discriminant Analysis" by C. R. Rao in 1948 (The utilization of multiple measurements in problems of biological classification).
The general LDA approach is very similar to a Principal Component Analysis (for more information about the PCA, see the previous article Implementing a Principal Component Analysis (PCA) in Python step by step), but in addition to finding the component axes that maximize the variance of our data (PCA), we are additionally interested in the axes that maximize the separation between multiple classes (LDA).
So, in a nutshell, often the goal of an LDA is to project a feature space (a dataset of n-dimensional samples) onto a smaller subspace k (where k ≤ n-1) while maintaining the class-discriminatory information.
In general, dimensionality reduction not only helps reduce computational costs for a given classification task, but it can also help avoid overfitting by minimizing the error in parameter estimation ("curse of dimensionality").
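To make this concrete, here is a minimal sketch of LDA as a dimensionality reduction step, assuming scikit-learn is available and using its built-in Iris data purely as an illustration:

    from sklearn.datasets import load_iris
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    X, y = load_iris(return_X_y=True)            # 150 samples, 4 features, 3 classes

    # Project the 4-dimensional samples onto k = 2 linear discriminants
    lda = LinearDiscriminantAnalysis(n_components=2)
    X_lda = lda.fit_transform(X, y)

    print(X.shape, '->', X_lda.shape)            # (150, 4) -> (150, 2)

The projected data can then be fed to any classifier in place of the original features.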


Principal Component Analysis vs. Linear Discriminant Analysis


Both Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA) are linear transformation techniques that are commonly used for dimensionality reduction. PCA can be described as an "unsupervised" algorithm, since it "ignores" class labels and its goal is to find the directions (the so-called principal components) that maximize the variance in a dataset. In contrast to PCA, LDA is "supervised" and computes the directions ("linear discriminants") that will represent the axes that maximize the separation between multiple classes.
Although it might sound intuitive that LDA is superior to PCA for a multi-class classification task where the class labels are known, this might not always be the case.
For example, comparisons between classification accuracies for image recognition after using PCA or LDA show that PCA tends to outperform LDA if the number of samples per class is relatively small (PCA vs. LDA, A.M. Martinez et al., 2001). In practice, it is also not uncommon to use both LDA and PCA in combination: E.g., PCA for dimensionality reduction followed by an LDA.
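As a rough sketch of that difference (again assuming scikit-learn and the Iris data), note that PCA never sees the class labels while LDA requires them, and that a PCA step can also be chained in front of an LDA, as mentioned above:

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.pipeline import make_pipeline

    X, y = load_iris(return_X_y=True)

    X_pca = PCA(n_components=2).fit_transform(X)                            # unsupervised: y is ignored
    X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)  # supervised: y is required

    # PCA for dimensionality reduction followed by an LDA
    combo = make_pipeline(PCA(n_components=3), LinearDiscriminantAnalysis(n_components=2))
    X_combo = combo.fit_transform(X, y)
    print(X_pca.shape, X_lda.shape, X_combo.shape)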

What is a "good" feature subspace?


Let's assume that our goal is to reduce the dimensions of a d-dimensional dataset by projecting it onto a k-dimensional subspace (where k ≤ d). So, how do we know what size we should choose for k (k = the number of dimensions of the new feature subspace), and how do we know if we have a feature space that represents our data "well"?
Later, we will compute eigenvectors (the components) from our dataset and collect them in the so-called scatter matrices (i.e., the between-class scatter matrix and the within-class scatter matrix).
Each of these eigenvectors is associated with an eigenvalue, which tells us about the "length" or "magnitude" of the eigenvectors.
If we observed that all eigenvalues have a similar magnitude, this may be a good indicator that our data is already projected onto a "good" feature space.
In the other scenario, if some of the eigenvalues are much larger than others, we might be interested in keeping only those eigenvectors with the highest eigenvalues, since they contain more information about our data distribution. Vice versa, eigenvalues that are close to 0 are less informative, and we might consider dropping them when constructing the new feature subspace.
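A rough numpy sketch of that idea, using the Iris data as a stand-in dataset, computes the two scatter matrices, solves the eigenproblem, and sorts the eigenvalues by magnitude:

    import numpy as np
    from sklearn.datasets import load_iris

    X, y = load_iris(return_X_y=True)
    n_features = X.shape[1]
    overall_mean = X.mean(axis=0)

    # Within-class (S_W) and between-class (S_B) scatter matrices
    S_W = np.zeros((n_features, n_features))
    S_B = np.zeros((n_features, n_features))
    for label in np.unique(y):
        X_c = X[y == label]
        mean_c = X_c.mean(axis=0)
        S_W += (X_c - mean_c).T @ (X_c - mean_c)
        diff = (mean_c - overall_mean).reshape(-1, 1)
        S_B += X_c.shape[0] * (diff @ diff.T)

    # Eigen-decomposition of S_W^-1 S_B; large eigenvalues carry most of the
    # class-discriminatory information, eigenvalues near 0 carry very little
    eig_vals, eig_vecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
    order = np.argsort(eig_vals.real)[::-1]
    print(eig_vals.real[order])

Typically the eigenvalues drop off sharply, and only the leading eigenvectors are worth keeping for the new subspace.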







Choosing the Right Type of Rotation in PCA and EFA



What Is Rotation?
In the PCA/EFA literature, definitions of rotation abound. For example, McDonald (1985, p. 40) 
defines rotation as “performing arithmetic to obtain a new set of factor loadings (v-ƒ regression 
weights) from a given set,” and Bryant and Yarnold (1995, p. 132) define it as “a procedure in which 
the eigenvectors (factors) are rotated in an attempt to achieve simple structure.” Perhaps a bit more 
helpful is the definition supplied in Vogt (1993, p. 91): “Any of several methods in factor analysis by 
which the researcher attempts to relate the calculated factors to theoretical entities. This is done 
differently depending upon whether the factors are believed to be correlated (oblique) or uncorrelated 
(orthogonal).” And even more helpful is Yaremko, Harari, Harrison, and Lynn (1986), who define 
factor rotation as follows: “In factor or principal-components analysis, rotation of the factor axes 
(dimensions) identified in the initial extraction of factors, in order to obtain simple and interpretable 
factors.” They then go on to explain and list some of the types of orthogonal and oblique procedures. 
 How can a concept with a goal of simplification be so complicated? Let me try defining rotation
from the perspective of a language researcher, while trying to keep it simple. I think of rotation as any 
of a variety of methods (explained below) used to further analyze initial PCA or EFA results with the 
goal of making the pattern of loadings clearer, or more pronounced. This process is designed to reveal 
the simple structure. 
The choices that researchers make among the orthogonal and oblique varieties of these rotation 
methods and the notion of simple structure will be the main topics in the rest of this column.
What Are the Different Types of Rotation?
As mentioned earlier, rotation methods are either orthogonal or oblique. Simply put, orthogonal 
rotation methods assume that the factors in the analysis are uncorrelated. Gorsuch (1983, pp. 203-204) 
lists four different orthogonal methods: equamax, orthomax, quartimax, and varimax. In contrast, 
oblique rotation methods assume that the factors are correlated. Gorsuch (1983, pp. 203-204) lists 15 
different oblique methods.
Version 16 of SPSS offers five rotation methods: varimax, direct oblimin, quartimax, equamax, and 
promax, in that order. Three of those are orthogonal (varimax, quartimax, & equamax), and two are 
oblique (direct oblimin & promax). Factor analysis is not the focus of my life, nor am I eager to learn 
how to use a new statistical program or calculate rotations by hand (though I’m sure I could do it if I 
had a couple of spare weeks), so those five SPSS options serve as boundaries for the choices I make. 
But how should I choose which one to use? 
Tabachnick and Fidell (2007, p. 646) argue that “Perhaps the best way to decide between 
orthogonal and oblique rotation is to request oblique rotation [e.g., direct oblimin or promax from 
SPSS] with the desired number of factors [see Brown, 2009b] and look at the correlations among 
factors…if factor correlations are not driven by the data, the solution remains nearly orthogonal. Look 
at the factor correlation matrix for correlations around .32 and above. If correlations exceed .32, then 
there is 10% (or more) overlap in variance among factors, enough variance to warrant oblique rotation 
unless there are compelling reasons for orthogonal rotation.” 
 For example, using the same Brazilian data I used for examples in Brown 2009a and b (based on 
the 12 subtests of the Y/G Personality Inventory from Guilford & Yatabe, 1957), I ran a three-factor 
EFA followed by a direct oblimin rotation. The resulting correlation matrix for the factors that the 
analysis produced is shown in Table 1. Notice that the highest correlation is .084. Since none of the 
correlations exceeds the Tabachnick and Fidell threshold of .32 described in the previous paragraph, 
“the solution remains nearly orthogonal.” Thus, I could just as well run an orthogonal rotation. 
Table 1. Correlation Matrix for the Three Factors in an EFA with Direct Oblimin Rotation for the Brazilian Y/GPI Data
Factor        1        2        3
1         1.000   -0.082    0.084
2        -0.082    1.000   -0.001
3         0.084   -0.001    1.000
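A small numpy check of the .32 rule of thumb against the correlations in Table 1 (numpy assumed) might look like this:

    import numpy as np

    # Factor correlation matrix from Table 1 (three-factor EFA, direct oblimin rotation)
    phi = np.array([[ 1.000, -0.082,  0.084],
                    [-0.082,  1.000, -0.001],
                    [ 0.084, -0.001,  1.000]])

    # Tabachnick and Fidell's rule of thumb: correlations of about .32 or more
    # (roughly 10% shared variance) warrant an oblique rotation
    off_diag = phi[np.triu_indices_from(phi, k=1)]
    print(np.abs(off_diag).max())               # 0.084
    print((np.abs(off_diag) >= 0.32).any())     # False, so the solution is "nearly orthogonal"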
Moreover, as Kim and Mueller (1978, p. 50) put it, “Even the issue of whether factors are 
correlated or not may not make much difference in the exploratory stages of analysis. It even can be 
argued that employing a method of orthogonal rotation (or maintaining the arbitrary imposition that the 
factors remain orthogonal) may be preferred over oblique rotation, if for no other reason than that the 
former is much simpler to understand and interpret.” 

How Do Researchers Decide Which Particular Type of Rotation to Use?
We can think of the goal of rotation and of choosing a particular type of rotation as seeking 
something called simple structure, or put another way, one way we know if we have selected an 
adequate rotation method is if the results achieve simple structure. But what is simple structure? 
Bryant and Yarnold (1995, p. 132-133) define simple structure as:

A condition in which variables load at near 1 (in absolute value) or at near 0 on an eigenvector (factor). Variables 
that load near 1 are clearly important in the interpretation of the factor, and variables that load near 0 are clearly 
unimportant. Simple structure thus simplifies the task of interpreting the factors.
Using logic like that in the preceding quote, Thurstone (1947) first proposed and argued for five 
criteria that needed to be met for simple structure to be achieved: 
1. Each variable should produce at least one zero loading on some factor. 
2. Each factor should have at least as many zero loadings as there are factors.
3. Each pair of factors should have variables with significant loadings on one 
and zero loadings on the other. 
4. Each pair of factors should have a large proportion of zero loadings on both factors 
(if there are say four or more factors total).
5. Each pair of factors should have only a few complex variables.
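Several of these criteria amount to counting near-zero and salient loadings, which is easy to sketch numerically. The loading matrix below is entirely made up, and the .10 and .40 cut-offs are common but arbitrary conventions:

    import numpy as np

    # Hypothetical loading matrix (6 variables x 2 factors), just to illustrate the checks
    loadings = np.array([[ 0.78,  0.05],
                         [ 0.71, -0.03],
                         [ 0.66,  0.10],
                         [ 0.04,  0.81],
                         [-0.02,  0.74],
                         [ 0.08,  0.69]])

    near_zero = np.abs(loadings) < 0.10      # treat |loading| < .10 as a "zero" loading
    salient   = np.abs(loadings) > 0.40      # a common cut-off for a salient loading

    print(near_zero.sum(axis=0))             # zero loadings per factor (criterion 2)
    print(salient.sum(axis=1))               # salient loadings per variable
    print((salient.sum(axis=1) > 1).sum())   # number of "complex" variables (criterion 5)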




Get Smarter about Big Data

  • Data is often imperfect – and that’s usually a good thing! You don’t need perfect information to find interesting relationships in the data – in fact, counter-intuitively, “dirty” data is sometimes better for finding relationships, because cleansing may remove the very attribute that enables matching. On the other hand, some information is a lie, as “bad guys” will intentionally try to fool you, or to separate their interactions with your firm into different channels (web, mobile, store) to avoid detection. You should assign a trust level to “known” information, and it rarely approaches 100%.
  • Your data can make you smarter as time passes. As new observations continue to accumulate, they enable you to refine your understanding, and even to reverse earlier assertions of your analysis based on what you knew at the time. Therefore, be sure to rerun earlier analyses over the full dataset, and don’t assume the conclusions of your previous analysis were correct.
  • Partial information is often enough. It’s surprising how soon you can start to see a picture emerge – with puzzles, the picture can often be identified with only 50% of the pieces, and this aspect of human cognition often applies to machine learning, too. Once the picture starts to emerge, you can more quickly understand each new puzzle piece (observation) by seeing it in the context it occupies among those around it.
This emerging picture should inform your collection efforts – you might need to obtain a new information source to follow up a lead from an earlier analysis, or to discard an information source (and the cost of collecting and analyzing it) once you realize it’s not helping.
  • More data is always good. The case for accumulating more data – Big Data – is strong: not only does it bring deeper insights, it also can reduce your compute workload – Jeff’s experience shows that the length of time it takes to link a new observation into a large information network actually goes down as the total number of observations goes up, beyond a certain threshold.
One of the most interesting new sources of Big Data insights is data about the interactions of people with systems – even their mistakes! That’s how Google knows to ask “did you mean this?”
  • Can you count? Good! Accurate counting of entities (people, cases, transactions), a.k.a. Entity Resolution, is critical to deeper analysis – if you can’t count, you can’t determine a vector or velocity, and without those, you can’t make predictions. Many interesting analyses in fraud detection involve detecting identities – accurately counting people, knowing when two identities are the same person, or when one identity is actually being used by more than one person, or even when an identity is not a real person at all… Identity matching is also the source of analyses that identify dead people voting and other such fraud.
  • Privacy matters, but it’s not an obstacle. Once identity comes into play, then privacy concerns (and regulations) must of course be taken into consideration. There are advanced techniques such as one-way hashes that can be used to anonymize a data set without reducing its usefulness for analytical purposes (a minimal sketch follows after this list).
  • Bad guys can be smart, too. Skilled adversaries present unique problems, but they can be overcome: to catch them, you must collect observations the adversary doesn’t know you have (e.g. a camera on a route, that they don’t know you have), or compute over your observations in a way the adversary can’t imagine (e.g. recognizing faces or license plates, and correlating that with other location information).
So as adversaries get smarter and more capable of avoiding detection all the time, savvy analysts must continually push the edge of the envelope of applying new techniques and technology to the game.
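Here is the minimal sketch of the one-way-hash idea promised above; it uses Python's hashlib, and the salt and e-mail address are invented for illustration:

    import hashlib

    SALT = b"replace-with-a-secret-salt"   # hypothetical salt; keep it secret and constant per dataset

    def anonymize(identifier: str) -> str:
        # One-way hash of an identifier: the same input always maps to the same token,
        # so records can still be linked, but the original value cannot be recovered.
        return hashlib.sha256(SALT + identifier.encode("utf-8")).hexdigest()

    print(anonymize("jane.doe@example.com"))
    print(anonymize("jane.doe@example.com") == anonymize("jane.doe@example.com"))  # True: still linkable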

Pseudo-Random Number Generators (PRNGs)

As the word ‘pseudo’ suggests, pseudo-random numbers are not random in the way you might expect, at least not if you're used to dice rolls or lottery tickets. Essentially, PRNGs are algorithms that use mathematical formulae or simply precalculated tables to produce sequences of numbers that appear random. A good example of a PRNG is the linear congruential method. A good deal of research has gone into pseudo-random number theory, and modern algorithms for generating pseudo-random numbers are so good that the numbers look exactly like they were really random.
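As a small illustration of the linear congruential method mentioned above, the sketch below uses the well-known "Numerical Recipes" constants; it is a toy, not a generator you should rely on:

    # A minimal linear congruential generator (LCG)
    def lcg(seed, a=1664525, c=1013904223, m=2**32):
        state = seed
        while True:
            state = (a * state + c) % m     # entirely deterministic update
            yield state / m                 # scale to [0, 1)

    gen = lcg(seed=42)
    print([round(next(gen), 4) for _ in range(5)])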
The basic difference between PRNGs and TRNGs is easy to understand if you compare computer-generated random numbers to rolls of a die. Because PRNGs generate random numbers by using mathematical formulae or precalculated lists, using one corresponds to someone rolling a die many times and writing down the results. Whenever you ask for a die roll, you get the next on the list. Effectively, the numbers appear random, but they are really predetermined. TRNGs work by getting a computer to actually roll the die — or, more commonly, use some other physical phenomenon that is easier to connect to a computer than a die is.
PRNGs are efficient, meaning they can produce many numbers in a short time, and deterministic, meaning that a given sequence of numbers can be reproduced at a later date if the starting point in the sequence is known. Efficiency is a nice characteristic if your application needs many numbers, and determinism is handy if you need to replay the same sequence of numbers again at a later stage. PRNGs are typically also periodic, which means that the sequence will eventually repeat itself. While periodicity is hardly ever a desirable characteristic, modern PRNGs have a period that is so long that it can be ignored for most practical purposes.
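The determinism is easy to see with Python's built-in random module (itself a PRNG, the Mersenne Twister): reseeding with the same value replays exactly the same sequence.

    import random

    random.seed(2015)
    first_run = [random.random() for _ in range(3)]

    random.seed(2015)                 # same seed, same starting point in the sequence...
    second_run = [random.random() for _ in range(3)]

    print(first_run == second_run)    # ...so the sequence is reproduced exactly: True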
These characteristics make PRNGs suitable for applications where many numbers are required and where it is useful that the same sequence can be replayed easily. Popular examples of such applications are simulation and modeling applications. PRNGs are not suitable for applications where it is important that the numbers are really unpredictable, such as data encryption and gambling.
It should be noted that even though good PRNG algorithms exist, they aren't always used, and it's easy to get nasty surprises. Take the example of the popular web programming language PHP. If you use PHP for GNU/Linux, chances are you will be perfectly happy with your random numbers. However, if you use PHP for Microsoft Windows, you will probably find that your random numbers aren't quite up to scratch as shown in this visual analysis from 2008. Another example dates back to 2002 when one researcher reported that the PRNG on MacOS was not good enough for scientific simulation of virus infections. The bottom line is that even if a PRNG will serve your application's needs, you still need to be careful about which one you use.

True Random Number Generators (TRNGs)

In comparison with PRNGs, TRNGs extract randomness from physical phenomena and introduce it into a computer. You can imagine this as a die connected to a computer, but typically people use a physical phenomenon that is easier to connect to a computer than a die is. The physical phenomenon can be very simple, like the little variations in somebody's mouse movements or in the amount of time between keystrokes. In practice, however, you have to be careful about which source you choose. For example, it can be tricky to use keystrokes in this fashion, because keystrokes are often buffered by the computer's operating system, meaning that several keystrokes are collected before they are sent to the program waiting for them. To a program waiting for the keystrokes, it will seem as though the keys were pressed almost simultaneously, and there may not be a lot of randomness there after all.
However, there are many other ways to get true randomness into your computer. A really good physical phenomenon to use is a radioactive source. The points in time at which a radioactive source decays are completely unpredictable, and they can quite easily be detected and fed into a computer, avoiding any buffering mechanisms in the operating system. The HotBits service at Fourmilab in Switzerland is an excellent example of a random number generator that uses this technique. Another suitable physical phenomenon is atmospheric noise, which is quite easy to pick up with a normal radio. This is the approach used by RANDOM.ORG. You could also use background noise from an office or laboratory, but you'll have to watch out for patterns. The fan from your computer might contribute to the background noise, and since the fan is a rotating device, chances are the noise it produces won't be as random as atmospheric noise.
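On most machines the closest readily available thing to this is the operating system's entropy pool, which is fed by hardware and timing noise. Strictly speaking, Python's os.urandom returns output of the OS's cryptographic generator seeded by such events rather than raw TRNG output, but it illustrates the contrast with a seeded PRNG:

    import os

    # Bytes drawn from the OS entropy pool: no seed, no way to replay the sequence
    print(os.urandom(8).hex())
    print(os.urandom(8).hex())   # a fresh, unpredictable value on every call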



As long as you are careful, the possibilities are endless. Undoubtedly the visually coolest approach was the lavarand generator, which was built by Silicon Graphics and used snapshots of lava lamps to generate true random numbers. Unfortunately, lavarand is no longer operational, but one of its inventors is carrying on the work (without the lava lamps) at the LavaRnd web site. Yet another approach is the Java EntropyPool, which gathers random bits from a variety of sources including HotBits and RANDOM.ORG, but also from web page hits received by the EntropyPool's own web server.
Regardless of which physical phenomenon is used, the process of generating true random numbers involves identifying little, unpredictable changes in the data. For example, HotBits uses little variations in the delay between occurrences of radioactive decay, and RANDOM.ORG uses little variations in the amplitude of atmospheric noise.
The characteristics of TRNGs are quite different from PRNGs. First, TRNGs are generally rather inefficient compared to PRNGs, taking considerably longer time to produce numbers. They are also nondeterministic, meaning that a given sequence of numbers cannot be reproduced, although the same sequence may of course occur several times by chance. TRNGs have no period.

Comparison of PRNGs and TRNGs

The table below sums up the characteristics of the two types of random number generators.
Characteristic     Pseudo-Random Number Generators     True Random Number Generators
Efficiency         Excellent                           Poor
Determinism        Deterministic                       Nondeterministic
Periodicity        Periodic                            Aperiodic
These characteristics make TRNGs suitable for roughly the set of applications that PRNGs are unsuitable for, such as data encryption, games and gambling. Conversely, the poor efficiency and nondeterministic nature of TRNGs make them less suitable for simulation and modeling applications, which often require more data than it's feasible to generate with a TRNG. The following table contains a summary of which applications are best served by which type of generator:
Application                                             Most Suitable Generator
Lotteries and Draws                                     TRNG
Games and Gambling                                      TRNG
Random Sampling (e.g., drug screening)                  TRNG
Simulation and Modelling                                PRNG
Security (e.g., generation of data encryption keys)     TRNG
The Arts                                                Varies


Data Scientist vs Data Engineer

Data Scientist
A data scientist is responsible for pulling insights from data. It is the data scientist's job to pull data, create models, create data products, and tell a story. A data scientist should typically have interactions with customers and/or executives. A data scientist should love scrubbing a dataset for more and more understanding.

The main goal of a data scientist is to produce data products and tell the stories of the data. A data scientist would typically have stronger statistics and presentation skills than a data engineer.


Data Engineer
Data Engineering is more focused on the systems that store and retrieve data. A data engineer will be responsible for building and deploying storage systems that can adequately handle the needs. Sometimes the need is fast real-time incoming data streams; other times it is massive amounts of large video files; still other times it is a very large number of reads of the data. In other words, a data engineer needs to build systems that can handle the 3 Vs of big data.

The main goal of a data engineer is to make sure the data is properly stored and available to the data scientist and others who need access. A data engineer would typically have stronger software engineering and programming skills than a data scientist.


Conclusion
It is too early to tell if these 2 roles will ever have a clear distinction of responsibilities, but it is nice to see a little separation of responsibilities for the mythical all-in-one data scientist. Both of these roles are important to a properly functioning data science team.

Chebyshev’s Inequality

Chebyshev’s theorem refers to several theorems, all proven by Russian mathematician Pafnuty Chebyshev. They include: Chebyshev’s inequality, Bertrand’s postulate, Chebyshev’s sum inequality and Chebyshev’s equioscillation theorem. Chebyshev’s inequality is the theorem most often used in stats. 
It states that no more than 1/k² of a distribution's values can be more than k standard deviations away from the mean. With a normal distribution, you already know how much of the distribution's data lies within k standard deviations of the mean. If you have a distribution that isn't normal, you can use Chebyshev's inequality to find out what percentage of the data is clustered around the mean.
Chebyshev's Inequality relates to the distribution of numbers in a set. In layman's terms, the formula helps you figure out the number of values that are inside and outside the standard deviation. The standard deviation tells you how far away values are from the average of the set. Roughly two-thirds of the values should fall within one standard deviation either side of the mean in a normal distribution. In statistics, it's often referred to as Chebyshev's Theorem (as opposed to Chebyshev's Inequality).
Chebyshev's Inequality formula is able to prove (with little information given on your part) the probability of outliers existing at a certain interval. Given that X is a random variable, A stands for the mean of the set, K is the number of standard deviations, and Y is the value of the standard deviation, the formula reads as follows:

 Pr(|X − A| ≥ KY) ≤ 1/K²

In words: the probability that the absolute difference between X and A is greater than or equal to K times Y is less than or equal to one divided by K squared.
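For example, with K = 2 the bound says that at most 1/4 of the values can lie two or more standard deviations from the mean, whatever the distribution looks like. A quick empirical check on deliberately non-normal data (numpy assumed; the exponential sample is chosen arbitrarily):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.exponential(scale=1.0, size=100_000)   # deliberately non-normal data
    mean, sd = x.mean(), x.std()

    for k in (2, 3, 4):
        observed = np.mean(np.abs(x - mean) >= k * sd)
        print(f"k={k}: observed {observed:.4f} <= bound {1 / k**2:.4f}")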


Applications of Chebyshev's Inequality

The formula was used, together with calculus, to develop the weak version of the law of large numbers. This law states that as a sample increases in size, its average should get closer to the theoretical mean. A simple example: when rolling a six-sided die, the expected average is 3.5. A sample of 5 rolls may give drastically different results, but roll the die 20 times and the average should begin approaching 3.5. As you add more and more rolls, the average should continue to near 3.5 until it reaches it, or becomes so close that the two are practically equal.
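A tiny simulation with Python's random module shows the running average of die rolls drifting toward 3.5 as the sample grows (the sample sizes below are arbitrary):

    import random

    random.seed(1)
    totals, rolls = 0, 0
    for n in (5, 20, 1_000, 100_000):
        while rolls < n:
            totals += random.randint(1, 6)
            rolls += 1
        print(n, round(totals / rolls, 3))   # the running average drifts toward 3.5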
Another application is in finding the difference between the mean and median of a set of numbers. Using a one-sided version of Chebyshev’s Inequality theorem, also known as Cantelli’s theorem, you can prove the absolute value of the difference between the median and the mean will always be less than or equal to the standard deviation. This is handy in determining if a median you derived is plausible.

Tuesday, 12 May 2015

Shivanker kumar for the class of May 9th

z-score and its application 


Z-scores are expressed in terms of standard deviations from the mean. As a result, z-scores have a distribution with a mean of 0 and a standard deviation of 1. The formula for calculating the standard score is given below:
Standard Score Calculation

Z = (X-µ)/s


As the formula shows, the standard score is simply the score, minus the mean score, divided by the standard deviation. Let’s see the application of z-score.
Application –
1. How well did Sarah perform in her English Literature coursework compared to the other 50 students?
To answer this question, we can re-phrase it as: What percentage (or number) of students scored higher than Sarah and what percentage (or number) of students scored lower than Sarah? First, let's reiterate that Sarah scored 70 out of 100, the mean score was 60, and the standard deviation was 15 (see below).
                Score      Mean      Standard Deviation
                (X)           µ              s
                70           60           15
In terms of z-scores, this gives us:
Z = (X-µ)/s = (70-60)/15 = .6667
The z-score is 0.67 (to 2 decimal places), but now we need to work out the percentage (or number) of students that scored higher and lower than Sarah. To do this, we need to refer to the standard normal distribution table.
This table helps us to identify the probability that a score is greater or less than our z-score. To use the table, which is easier than it might look at first sight, we start with our z-score, 0.67 (if our z-score had more than two decimal places, for example 0.6667, we would round it up or down accordingly; hence, 0.6667 becomes 0.67). The y-axis in the table gives the first two digits of our z-score and the x-axis the second decimal place. Therefore, we start with the y-axis, finding 0.6, and then move along the x-axis until we find 0.07, before finally reading off the appropriate number; in this case, 0.2514. This means that the probability of a score being greater than z = 0.67 is 0.2514. To express this as a percentage, we simply multiply it by 100; hence 0.2514 x 100 = 25.14%. In other words, around 25% of the class got a better mark than Sarah (roughly 13 students, since there is no such thing as part of a student!).

Going back to our question, "How well did Sarah perform in her English Literature coursework compared to the other 50 students?", clearly we can see that Sarah did better than a large proportion of students, with 74.86% of the class scoring lower than her (100% - 25.14% = 74.86%). We can also see how well she performed relative to the mean score by subtracting the proportion scoring above her from 0.5 (0.5 - 0.2514 = 0.2486). Hence, 24.86% of the scores (0.2486 x 100 = 24.86%) were lower than Sarah's but above the mean score. However, the key finding is that Sarah's score was not one of the best marks. It wasn't even in the top 10% of scores in the class, even though at first sight we may have expected it to be. This leads us onto the second question.
2. Which students came in the top 10% of the class?
A better way of phrasing this would be to ask: What mark would a student have to achieve to be in the top 10% of the class and qualify for the advanced English Literature class?
To answer this question, we need to find the mark (which we call "X") on our frequency distribution that reflects the top 10% of marks. Since the mean score was 60 out of 100, we immediately know that the mark will be greater than 60. After all, we are interested in the area to the right of the mean score of 60 that reflects the top 10% of marks. Expressed as a proportion, the cut-off we want is the mark below which 90% of the distribution falls (i.e., 100% - 10% = 90%, or 1 - 0.1 = 0.9).



First, we should convert our frequency distribution into a standard normal distribution. As such, our mean score of 60 becomes 0, and the mark (X) we are looking for corresponds to the cumulative proportion 0.9 and to a z-score that is currently unknown.
The next step involves finding out the value for our z-score. To do this, we refer back to the standard normal distribution table.


In answering the first question in this guide, we already knew the z-score, 0.67, which we used to find the appropriate proportion (and number) of students that scored higher than Sarah, 0.2514 (i.e., 25.14%, or roughly 13 students achieved a higher mark than Sarah). Using the z-score, 0.67, and the y-axis and x-axis of the standard normal distribution table, this guided us to the appropriate value, 0.2514. In this case, we need to do the exact reverse to find our z-score.
We know the percentage we are trying to find, the top 10% of students, corresponds to 0.9. As such, we first need to find the value 0.9 in standard normal distribution table. When looking at the table, you may notice that the closest value to 0.9 is 0.8997. If we take the 0.8997 value as our starting point and then follow this row across to the left, we are presented with the first part of the z-score. You will notice that the value on the y-axis for 0.8997 is 1.2. We now need to do the same for the x-axis, using the 0.8997 value as our starting point and following the column up. This time, the value on the x-axis for 0.8997 is 0.08. This forms the second part of the z-score. Putting these two values together, the z-score for 0.8997 is 1.28 (i.e., 1.2 + 0.08 = 1.28).
There is only one problem with this z-score: it is based on a value of 0.8997 rather than the 0.9 value we are interested in. This is one of the difficulties of referring to the standard normal distribution table, because it cannot give every possible z-score value (that would require a quite enormous table!). Therefore, you can either take the closest two values to your desired value of 0.9, namely 0.8997 and 0.9015, which reflect the z-scores of 1.28 and 1.29, and then interpolate the exact value of "z" for 0.9, or you can use a z-score calculator. If we use a z-score calculator, our value of 0.9 corresponds with a z-score of 1.282. In other words, P ( z > 1.282 ) = 0.1.


Now that we have the key information (that is, the mean score, µ, the standard deviation, s , and z-score, z), we can answer our question directly, namely: What mark would a student have to achieve to be in the top 10% of the class and qualify for the advanced English Literature class? First, let us reiterate the facts:
                Score      Mean      Standard Deviation      z-score
                (X)           µ              s                     z
                ?             60            15                     1.282
To find out the relevant score, we apply the following formula:

X = µ + (z × s) = 60 + (1.282 × 15) = 79.23

Therefore, students that scored above 79.23 marks out of 100 came in the top 10% of the English Literature class, qualifying for the advanced English Literature class as a result.
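The same cut-off can be computed directly (again assuming scipy), which is essentially what the z-score calculator does:

    from scipy.stats import norm

    z_cutoff = norm.ppf(0.90)           # the z-score with 90% of the distribution below it
    print(round(z_cutoff, 3))           # 1.282

    mark = 60 + z_cutoff * 15           # X = mu + z * s
    print(round(mark, 2))               # 79.22 (79.23 if z is first rounded to 1.282, as above)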

Hope you find it important and relevant!!

Statistical Analysis in Cryptography

In today’s world the need to protect vocal and written communication between individuals,
institutions, entities and commercial agencies is ever present and growing. Digital communication has, in part, been integrated into our social life. For many, the day begins with the perusal of e-mail and the tedious task of eliminating spam and other messages we do not consider worthy of our attention. We turn to the internet to read newspaper articles, to see what’s on at the cinema, to check flight arrivals, the telephone book, the state of our checking account and stock holdings, to send and receive money transfers, to shop online, for students’ research and for many other reasons. But the digital society must adequately protect communication from intruders, whether persons or institutions which attack our privacy. Cryptography (from the Greek kryptós, hidden), the study and creation of secret writing systems in numbers or codes, is essential to the development of digital communication which is truly private in the sense that it cannot be read by anyone to whom it is not addressed. Cryptography seeks to study and create systems for ciphering and to verify and authenticate the integrity of data. One must distinguish cryptography from cryptanalysis, the study of methods an “enemy” might use to read the messages of others. Together, cryptography and cryptanalysis make up cryptology.

Until the 1950s cryptography was essentially used only for military and diplomatic communication. The decryption of German messages by the English and of Japanese messages by the Americans played a very important role in the outcome of the Second World War. The great mathematician Alan Turing made an essential contribution to the war effort with his decryption of the famous Enigma machine which was considered absolutely secure by the Germans. It was the Poles, however,
who had laid the basis for finding its weak link. Cryptography also played a vital role in the Pacific at the battle of Midway. Regarding Italy, the naval battles of Punta Stilo and of Capo Matapan were strongly influenced by the interception and decryption of messages.

There are four disciplines which have important roles in cryptography:
1. Linguistics, in particular Statistical Linguistics
2. Statistics, in particular the Theory of the Tests for the Analysis of Randomness
and of Primality and Data Mining
3. Mathematics, in particular Discrete Mathematics
4. The Theory of Information

The technique of Data Mining seems to be of most use in the analysis of the great volumes of data which are exchanged on a daily basis, such as satellite data. Technical developments are largely inter-disciplinary. This suggests that new applications will be found which will, in turn, lead to new queries and problems for the scholars of Number Theory, Modular Arithmetic, Polynomial Algebra, Information Theory and Statistics to apply to cryptography.

Until the 1950s the decryption of messages was based exclusively on statistical methods and specific techniques of cryptography. In substance, the working instruments of cryptography, both for the planning of coding systems and for reading messages which the sender intended to remain secret, were statistical methods applied to linguistics. The key to decoding systems using poly-alphabetic substitution and simple and double transposition has always been the analysis of the statistical distribution of graphemes (letters, figures, punctuation marks, etc.). Mathematics was not fundamental to the work of the cryptanalyst.
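The classic statistical attack on a simple substitution cipher starts from exactly this grapheme distribution. A toy sketch in Python (the ciphertext below is just a Caesar-shifted English pangram, invented for illustration):

    from collections import Counter

    ciphertext = "WKH TXLFN EURZQ IRA MXPSV RYHU WKH ODCB GRJ"   # toy Caesar-shifted text

    counts = Counter(ch for ch in ciphertext if ch.isalpha())
    for letter, n in counts.most_common(5):
        print(letter, n)
    # In a long English ciphertext produced by simple substitution, the most frequent
    # symbols tend to stand for E, T, A, O, ... which is the starting point of the attack.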

Today, with the advent of data processing technology, the coding of messages is done by coding machines. The structure of reference is the algebra of Galois fields (GF(q)). The search for prime numbers, and in particular tests of primality, is of notable interest to modern cryptology.

Descriptive Statistics


Descriptive statistics are used to describe the basic features of the data in a study. They provide simple summaries about the sample and the measures. Together with simple graphics analysis, they form the basis of virtually every quantitative analysis of data.

Descriptive statistics are typically distinguished from inferential statistics. With descriptive statistics you are simply describing what is or what the data shows. With inferential statistics, you are trying to reach conclusions that extend beyond the immediate data alone. For instance, we use inferential statistics to try to infer from the sample data what the population might think. Or, we use inferential statistics to make judgments of the probability that an observed difference between groups is a dependable one or one that might have happened by chance in this study. Thus, we use inferential statistics to make inferences from our data to more general conditions; we use descriptive statistics simply to describe what's going on in our data.

Descriptive Statistics are used to present quantitative descriptions in a manageable form. In a research study we may have lots of measures. Or we may measure a large number of people on any measure. Descriptive statistics help us to simplify large amounts of data in a sensible way. Each descriptive statistic reduces lots of data into a simpler summary. For instance, consider a simple number used to summarize how well a batter is performing in baseball, the batting average. This single number is simply the number of hits divided by the number of times at bat (reported to three significant digits). A batter who is hitting .333 is getting a hit one time in every three at bats. One batting .250 is hitting one time in four. The single number describes a large number of discrete events. Or, consider the scourge of many students, the Grade Point Average (GPA). This single number describes the general performance of a student across a potentially wide range of course experiences.
Every time you try to describe a large set of observations with a single indicator you run the risk of distorting the original data or losing important detail. The batting average doesn't tell you whether the batter is hitting home runs or singles. It doesn't tell whether she's been in a slump or on a streak. The GPA doesn't tell you whether the student was in difficult courses or easy ones, or whether they were courses in their major field or in other disciplines. Even given these limitations, descriptive statistics provide a powerful summary that may enable comparisons across people or other units.

Univariate Analysis

Univariate analysis involves the examination across cases of one variable at a time. There are three major characteristics of a single variable that we tend to look at:
  • the distribution
  • the central tendency
  • the dispersion
In most situations, we would describe all three of these characteristics for each of the variables in our study.

The Distribution. The distribution is a summary of the frequency of individual values or ranges of values for a variable. The simplest distribution would list every value of a variable and the number of persons who had each value. For instance, a typical way to describe the distribution of college students is by year in college, listing the number or percent of students at each of the four years. Or, we describe gender by listing the number or percent of males and females. In these cases, the variable has few enough values that we can list each one and summarize how many sample cases had the value. But what do we do for a variable like income or GPA? With these variables there can be a large number of possible values, with relatively few people having each one. In this case, we group the raw scores into categories according to ranges of values. For instance, we might look at GPA according to the letter grade ranges. Or, we might group income into four or five ranges of income values.

Table 1. Frequency distribution table.
One of the most common ways to describe a single variable is with a frequency distribution. Depending on the particular variable, all of the data values may be represented, or you may group the values into categories first (e.g., with age, price, or temperature variables, it would usually not be sensible to determine the frequencies for each value; rather, the values are grouped into ranges and the frequencies determined). Frequency distributions can be depicted in two ways, as a table or as a graph. Table 1 shows an age frequency distribution with five categories of age ranges defined. The same frequency distribution can be depicted in a graph as shown in Figure 2. This type of graph is often referred to as a histogram or bar chart.

Figure 2. Frequency distribution bar chart.
Distributions may also be displayed using percentages. For example, you could use percentages to describe the:
  • percentage of people in different income levels
  • percentage of people in different age ranges
  • percentage of people in different ranges of standardized test scores
Central Tendency. The central tendency of a distribution is an estimate of the "center" of a distribution of values. There are three major types of estimates of central tendency:
  • Mean
  • Median
  • Mode
The Mean or average is probably the most commonly used method of describing central tendency. To compute the mean all you do is add up all the values and divide by the number of values. For example, the mean or average quiz score is determined by summing all the scores and dividing by the number of students taking the exam. For example, consider the test score values:
15, 20, 21, 20, 36, 15, 25, 15
The sum of these 8 values is 167, so the mean is 167/8 = 20.875.
The Median is the score found at the exact middle of the set of values. One way to compute the median is to list all scores in numerical order, and then locate the score in the center of the sample. For example, if there are 500 scores in the list, the median is the average of scores #250 and #251. If we order the 8 scores shown above, we would get:
15,15,15,20,20,21,25,36
 
There are 8 scores and score #4 and #5 represent the halfway point. Since both of these scores are 20, the median is 20. If the two middle scores had different values, you would have to interpolate to determine the median.

The Mode is the most frequently occurring value in the set of scores. To determine the mode, you might again order the scores as shown above, and then count each one. The most frequently occurring value is the mode. In our example, the value 15 occurs three times and is the mode. In some distributions there is more than one modal value. For instance, in a bimodal distribution there are two values that occur most frequently.
Notice that for the same set of 8 scores we got three different values -- 20.875, 20, and 15 -- for the mean, median and mode respectively. If the distribution is truly normal (i.e., bell-shaped), the mean, median and mode are all equal to each other.

Dispersion. Dispersion refers to the spread of the values around the central tendency. There are two common measures of dispersion, the range and the standard deviation. The range is simply the highest value minus the lowest value. In our example distribution, the high value is 36 and the low is 15, so the range is 36 - 15 = 21.

The Standard Deviation is a more accurate and detailed estimate of dispersion, because an outlier can greatly exaggerate the range (as was true in this example, where the single outlier value of 36 stands apart from the rest of the values). The standard deviation shows the relation that the set of scores has to the mean of the sample. Again, let's take the set of scores:
15,20,21,20,36,15,25,15
To compute the standard deviation, we first find the distance between each value and the mean. We know from above that the mean is 20.875. So, the differences from the mean are:
15 - 20.875 = -5.875
20 - 20.875 = -0.875
21 - 20.875 = +0.125
20 - 20.875 = -0.875
36 - 20.875 = 15.125
15 - 20.875 = -5.875
25 - 20.875 = +4.125
15 - 20.875 = -5.875
Notice that values that are below the mean have negative discrepancies and values above it have positive ones. Next, we square each discrepancy:
-5.875 * -5.875 = 34.515625
-0.875 * -0.875 = 0.765625
+0.125 * +0.125 = 0.015625
-0.875 * -0.875 = 0.765625
15.125 * 15.125 = 228.765625
-5.875 * -5.875 = 34.515625
+4.125 * +4.125 = 17.015625
-5.875 * -5.875 = 34.515625
Now, we take these "squares" and sum them to get the Sum of Squares (SS) value. Here, the sum is 350.875. Next, we divide this sum by the number of scores minus 1. Here, the result is 350.875 / 7 = 50.125. This value is known as the variance. To get the standard deviation, we take the square root of the variance (remember that we squared the deviations earlier). This would be SQRT(50.125) = 7.079901129253.
Although this computation may seem convoluted, it's actually quite simple. To see this, consider the formula for the standard deviation:

s = √( Σ(X − X̄)² / (n − 1) )

In the top part of the ratio, the numerator, we see that each score has the mean subtracted from it, the difference is squared, and the squares are summed. In the bottom part, we take the number of scores minus 1. The ratio is the variance and the square root is the standard deviation. In English, we can describe the standard deviation as:
the square root of the sum of the squared deviations from the mean divided by the number of scores minus one
Although we can calculate these univariate statistics by hand, it gets quite tedious when you have more than a few values and variables. Every statistics program is capable of calculating them easily for you. For instance, I put the eight scores into SPSS and got the following table as a result:
N                   8
Mean                20.8750
Median              20.0000
Mode                15.00
Std. Deviation      7.0799
Variance            50.1250
Range               21.00
which confirms the calculations I did by hand above.
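For readers without SPSS, the same numbers fall out of Python's built-in statistics module:

    import statistics

    scores = [15, 20, 21, 20, 36, 15, 25, 15]

    print(statistics.mean(scores))              # 20.875
    print(statistics.median(scores))            # 20.0
    print(statistics.mode(scores))              # 15
    print(statistics.variance(scores))          # 50.125 (sample variance, n - 1 in the denominator)
    print(round(statistics.stdev(scores), 4))   # 7.0799
    print(max(scores) - min(scores))            # 21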
The standard deviation allows us to reach some conclusions about specific scores in our distribution. Assuming that the distribution of scores is normal or bell-shaped (or close to it!), the following conclusions can be reached:
  • approximately 68% of the scores in the sample fall within one standard deviation of the mean
  • approximately 95% of the scores in the sample fall within two standard deviations of the mean
  • approximately 99% of the scores in the sample fall within three standard deviations of the mean
For instance, since the mean in our example is 20.875 and the standard deviation is 7.0799, we can from the above statement estimate that approximately 95% of the scores will fall in the range of 20.875-(2*7.0799) to 20.875+(2*7.0799), or between 6.7152 and 35.0348. This kind of information is a critical stepping stone to enabling us to compare the performance of an individual on one variable with their performance on another, even when the variables are measured on entirely different scales.