Linear Discriminant Analysis (LDA) was originally developed by R.A. Fisher to classify subjects into one of two clearly defined groups. It was later extended to classify subjects into more than two groups. LDA finds the linear combinations of the original variables that provide the best possible separation between the groups; in other words, it focuses on maximizing the separability among known categories. A problem arises when two features are not sufficient to capture most of the variation in the data. In PCA, we solve this problem by reducing the dimensionality and keeping the directions with the most variation. LDA is similar to PCA, but it looks for the directions that maximize the separation between the known groups. PCA is unsupervised, whereas LDA is supervised.
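The idea of "best possible separation" can be written down as Fisher's criterion. The notation below is not from the original text and is only an illustrative sketch: $w$ is the vector of coefficients of one discriminant function, $\mu_k$ and $n_k$ are the mean and size of group $k$, $\mu$ is the overall mean, and $S_B$ and $S_W$ are the between-group and within-group scatter matrices.

```latex
J(w) = \frac{w^{\top} S_B \, w}{w^{\top} S_W \, w},
\qquad
S_B = \sum_{k} n_k (\mu_k - \mu)(\mu_k - \mu)^{\top},
\qquad
S_W = \sum_{k} \sum_{i \in k} (x_i - \mu_k)(x_i - \mu_k)^{\top}
```

A direction $w$ with a large $J(w)$ pushes the group means far apart relative to the spread inside each group, which is what the histograms of the discriminant scores show visually.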
data(iris)
library(psych)
pairs.panels(iris[1:4],
             gap = 0,
             bg = c("red", "green", "blue")[iris$Species],
             pch = 21)
We are using the iris dataset, which has 4 numerical variables and 1 factor (Species) with 3 levels, as described above. The numerical variables have different ranges, so it is good practice to normalize the data. The graph above shows a scatterplot for each pair of variables, with the correlation coefficients in the upper triangle. We can see that Sepal.Length and Petal.Length separate the three Species well. In the other cases there is some overlap and no clear separation between the three Species.
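The models in this walkthrough are fitted on the raw measurements. As a minimal sketch of what normalizing could look like, the base R function scale() standardizes each numeric column; the name iris_scaled is only an illustrative choice and is not used elsewhere in the text.

```r
data(iris)

# Standardize the four numeric columns:
# subtract the column mean and divide by the column standard deviation
iris_scaled <- iris
iris_scaled[1:4] <- as.data.frame(scale(iris[1:4]))

# After scaling, every numeric column has mean ~0 and standard deviation 1
round(colMeans(iris_scaled[1:4]), 10)
sapply(iris_scaled[1:4], sd)
```

The Species column is left untouched, so iris_scaled could be passed to the same partitioning and modeling steps in place of iris.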
# Data Partitioning
set.seed(123)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.6, 0.4))
training <- iris[ind == 1, ]
testing <- iris[ind == 2, ]

# Linear Discriminant Analysis
library(MASS)
linear <- lda(Species ~ ., data = training)
linear
Call:
lda(Species ~ ., data = training)

Prior probabilities of groups:
    setosa versicolor  virginica
 0.3370787  0.3370787  0.3258427

Group means:
           Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa         4.946667    3.380000     1.443333    0.250000
versicolor     5.943333    2.803333     4.240000    1.316667
virginica      6.527586    2.920690     5.489655    2.048276

Coefficients of linear discriminants:
                    LD1         LD2
Sepal.Length  0.3629008  0.05215114
Sepal.Width   2.2276982  1.47580354
Petal.Length -1.7854533 -1.60918547
Petal.Width  -3.9745504  4.10534268

Proportion of trace:
   LD1    LD2
0.9932 0.0068
From the result above we have the coefficients of linear discriminants for each of the four variables. The first discriminant function, LD1, is a linear combination of the four variables: (0.3629008 x Sepal.Length) + (2.2276982 x Sepal.Width) + (-1.7854533 x Petal.Length) + (-3.9745504 x Petal.Width). Note that the discriminant functions are scaled. We also have the proportion of trace: the percentage of separation achieved by the first discriminant function LD1 is 99.32%, which leaves only 0.68% for LD2.
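To make the linear combination concrete, the following sketch recomputes the LD1 scores by hand and compares them with what predict() returns. It repeats the partitioning step so that it runs on its own; the names X, raw_ld1, ld1, offset and prop_trace are illustrative and not part of the original code. It assumes, as the MASS documentation describes, that the fitted object stores the coefficients in scaling and the singular values in svd.

```r
library(MASS)

# Same partition and model as in the text
data(iris)
set.seed(123)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.6, 0.4))
training <- iris[ind == 1, ]
linear <- lda(Species ~ ., data = training)

# Raw linear combination: each observation times the LD1 coefficients
X <- as.matrix(training[, 1:4])
raw_ld1 <- as.vector(X %*% linear$scaling[, "LD1"])

# Scores returned by predict(); these are computed on centered data
ld1 <- predict(linear, training)$x[, "LD1"]

# The two differ only by a constant shift, identical for every observation
offset <- raw_ld1 - ld1
range(offset)

# Proportion of trace, derived from the singular values stored in the model
prop_trace <- linear$svd^2 / sum(linear$svd^2)
prop_trace
```

Because the shift is the same for all observations, it does not change how well the groups are separated along LD1; it only moves the origin of the axis.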
Now we can create a Stacked Histogram of Discriminant Function values.
# Histogram
p <- predict(linear, training)
ldahist(data = p$x[, 1], g = training$Species)  # p$x[, 1] gives the LD1 scores
The graph above shows the histogram for LD1. We can see that the separation between setosa and the other two Species is quite large, with no overlap. By contrast, there is a certain amount of overlap between versicolor and virginica. We already said that the percentage of separation achieved by LD1 is 99.32%, which is consistent with the clear separation we see in the histogram above. Now we can do the same for LD2.
# Histogram
p <- predict(linear, training)
ldahist(data = p$x[, 2], g = training$Species)  # p$x[, 2] gives the LD2 scores
As we can see from the histogram above, LD2 shows a lot of overlap between the groups, which is not great. Now we can create the Bi-Plot.
# Bi-Plot
library(ggord)
ggord(linear, training$Species, ylim = c(-10, 10))
In the Bi-Plot above, the x-axis is LD1, which separates the three Species quite well, although there is some overlap between versicolor (in green) and virginica (in blue). We can also see that Sepal.Width and Sepal.Length both point in the positive direction, while Petal.Width and Petal.Length point in the opposite direction.
Now we can build the Partition Plot.
# Partition Plot with Linear Discriminant Analysis Model
library(klaR)
partimat(Species ~ ., data = training, method = "lda")
The Partition Plot above shows the classification of each observation in the training dataset based on the Linear Discriminant Analysis Model, for every combination of two variables. In the bottom-right graph, we can see that setosa is quite far away from the other two Species, and that there is some overlap between versicolor and virginica. The graph above is for the linear discriminant; we can also use a Quadratic Discriminant Analysis Model.
# Partition Plot with Quadratic Discriminant Analysis Model
partimat(Species ~ ., data = training, method = "qda")
# Confusion Matrix and Accuracy
p1 <- predict(linear, testing)$class
tab1 <- table(Predicted = p1, Actual = testing$Species)
tab1
            Actual
Predicted    setosa versicolor virginica
  setosa         20          0         0
  versicolor      0         19         1
  virginica       0          1        20
accuracy1 <- sum(diag(tab1)) / sum(tab1)
accuracy1
# Quadratic Discriminant Analysis
quadratic <- qda(Species ~ ., data = training)
p2 <- predict(quadratic, testing)$class
tab2 <- table(Predicted = p2, Actual = testing$Species)
tab2
            Actual
Predicted    setosa versicolor virginica
  setosa         20          0         0
  versicolor      0         16         2
  virginica       0          4        19
accuracy2 <- sum(diag(tab2)) / sum(tab2)
accuracy2
In the Partition Plot for the quadratic model, the boundaries that discriminate between Species are now curves. From the accuracy estimated on the testing data, we can see that it is higher with the Linear Discriminant Analysis Model (96.72% vs. 90.16%). This is also confirmed by comparing the confusion matrix of the linear discriminant (tab1) with the confusion matrix of the quadratic discriminant (tab2): the quadratic model misclassifies 6 observations, the linear model only 2.
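The accuracy figures can be checked by hand from the two confusion matrices. The sketch below rebuilds them from the counts reported above instead of refitting the models; the helper names accuracy and recall are illustrative, and the per-class recall is an extra summary that the original analysis does not report.

```r
# Confusion matrices as reported in the text (rows = Predicted, columns = Actual)
species <- c("setosa", "versicolor", "virginica")
tab1 <- matrix(c(20,  0,  0,
                  0, 19,  1,
                  0,  1, 20),
               nrow = 3, byrow = TRUE,
               dimnames = list(Predicted = species, Actual = species))
tab2 <- matrix(c(20,  0,  0,
                  0, 16,  2,
                  0,  4, 19),
               nrow = 3, byrow = TRUE,
               dimnames = list(Predicted = species, Actual = species))

# Accuracy: correctly classified observations (the diagonal) over all observations
accuracy <- function(tab) sum(diag(tab)) / sum(tab)
accuracy(tab1)  # 59 / 61, about 0.9672
accuracy(tab2)  # 55 / 61, about 0.9016

# Per-class recall: share of each actual Species that was predicted correctly
recall <- function(tab) diag(tab) / colSums(tab)
recall(tab1)
recall(tab2)
```

Both models classify every setosa correctly; the difference in accuracy comes entirely from versicolor and virginica, the two Species that overlap in the plots.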