Naive Bayes Classification

Naive Bayes is an effective and commonly used machine learning classifier. It is a probabilistic classifier that makes predictions using the Maximum A Posteriori (MAP) decision rule in a Bayesian setting. It can also be represented as a very simple Bayesian network. Naive Bayes classifiers have been especially popular for text classification and are a traditional solution for problems such as spam detection.

An intuitive explanation of Maximum A Posteriori (MAP) estimation is to think of probabilities as degrees of belief. For example, how likely we are to vote for a candidate depends on our prior belief. We can then revise our position based on the evidence. Our final decision is the posterior belief, which is what remains after we have sifted through the evidence.
MAP simply picks the maximum posterior belief: after going through all the debates, what is your most likely decision?
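In classifier terms, for an observation with features x1, ..., xn, the Naive Bayes MAP rule chooses the class c that maximizes the posterior p(c|x1,...,xn), which, under the naive assumption that the features are conditionally independent given the class, is proportional to **p(c) * p(x1|c) * p(x2|c) * ... * p(xn|c)**.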

We use Naive Bayes in an example to predict, based on some features (e.g. the rank of the school the student comes from), whether a student is admitted or rejected.

library(naivebayes)
library(dplyr)
library(ggplot2)
library(psych)
data <- read.csv("C:/07 - R Website/dataset/ML/binary.csv")
# admit (0/1) and rank (1-4) are categorical, so convert them to factors
data$rank <- as.factor(data$rank)
data$admit <- as.factor(data$admit)

# Visualization
pairs.panels(data[-1])

data %>%
  ggplot(aes(x=admit, y=gre, fill=admit)) + 
  geom_boxplot()

data %>%
  ggplot(aes(x=gre, fill=admit)) + 
  geom_density(alpha=0.8, color='black')

In order to develop the model, we have to make sure that the independent variables are not highly correlated.
We can see from the scatterplot matrix above that the only numerical variables are gre and gpa, and they are not strongly correlated (r = 0.38). Moreover, looking at the boxplot of gre as a function of admit, there is substantial overlap between the two levels of admit. From the density plot of gre by admit, we can see that admitted students (admit=1) tend to have higher gre scores compared to students who were not admitted (admit=0), but there is still a significant amount of overlap between the two distributions.
The same holds for gpa (not shown here).
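As a quick numerical check on that correlation, we can compute it directly (a one-line sketch using base R):

cor(data$gre, data$gpa)   # Pearson correlation between the two numerical predictors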

# Data Preparation
set.seed(1234)
# Randomly assign each row to the training set (~80%) or the test set (~20%)
ind <- sample(2, nrow(data), replace = TRUE, prob = c(0.8, 0.2))
train <- data[ind == 1, ]
test <- data[ind == 2, ]

Now, by Bayes' theorem, the probability that a student is admitted given that they come from a rank-1 school, p(admit=1|rank=1), is equal to: **p(admit=1) * p(rank=1|admit=1) / p(rank=1)**
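As a sanity check, this posterior can be computed by hand from the empirical frequencies in the training set (a minimal sketch; the variable names below are only for illustration):

# Bayes' theorem applied to the training data
# (admit and rank are factors, so we compare against the level "1")
p_admit1 <- mean(train$admit == "1")                                  # p(admit=1)
p_rank1_given_admit1 <- mean(train$rank[train$admit == "1"] == "1")   # p(rank=1 | admit=1)
p_rank1 <- mean(train$rank == "1")                                    # p(rank=1)
p_admit1 * p_rank1_given_admit1 / p_rank1                             # p(admit=1 | rank=1)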

# Naive Bayes Model
model <- naive_bayes(admit ~ ., data = train)
model
================================ Naive Bayes ================================= 
Call: 
naive_bayes.formula(formula = admit ~ ., data = train)

A priori probabilities: 

        0         1 
0.6861538 0.3138462 

Tables: 
      
gre           0        1
  mean 578.6547 622.9412
  sd   116.3250 110.9240

      
gpa            0         1
  mean 3.3552466 3.5336275
  sd   0.3714542 0.3457057

    
rank          0          1
   1 0.10313901 0.24509804
   2 0.36771300 0.42156863
   3 0.33183857 0.24509804
   4 0.19730942 0.08823529
plot(model)

From the result above, the a priori probability of being admitted is about 0.31: only 31% of the students in the training set were admitted to the program. Moreover, for the numerical variables (gre, gpa) we get the class-conditional mean and standard deviation, and for the categorical variable (rank) we get class-conditional probabilities. For example, p(rank=1|admit=0)=0.103 and p(rank=1|admit=1)=0.245, as shown in the tables above.
We also get three plots of the model: density curves for the numerical variables and bar charts for the categorical variable.
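The numbers in the printed tables can also be reproduced directly from the training data with base R, as a quick sanity check (a minimal sketch using only the train data frame defined earlier):

prop.table(table(train$admit))                           # a priori probabilities p(admit)
tapply(train$gre, train$admit, mean)                     # class-conditional means of gre
tapply(train$gre, train$admit, sd)                       # class-conditional sds of gre
prop.table(table(train$rank, train$admit), margin = 2)   # p(rank | admit), one column per class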

# Prediction
p <- predict(model, newdata = train, type = "prob")
head(cbind(p, train))
          0         1 admit gre  gpa rank
1 0.8449088 0.1550912     0 380 3.61    3
2 0.6214983 0.3785017     1 660 3.67    3
3 0.2082304 0.7917696     1 800 4.00    1
4 0.8501030 0.1498970     1 640 3.19    4
6 0.6917580 0.3082420     1 760 3.00    2
7 0.6720365 0.3279635     1 560 2.98    1
# Confusion Matrix
p1 <- predict(model, newdata = train)
(tab1 <- table(p1, train$admit))
   
p1    0   1
  0 196  69
  1  27  33
# Misclassification
1 - sum(diag(tab1)) / sum(tab1)
[1] 0.2953846

From the table above, we can see that the first student has a probability of 0.84 of not being admitted. In fact, admit=0 for this student, who has a low gre of 380 and comes from a lower-ranked school (rank=3). From the confusion matrix, we calculated a misclassification rate of about 29.5% on the training data.
In order to reduce the misclassification rate we can use kernel density estimation. Kernel-based densities may perform better when the numerical variables are not normally distributed.

# Naive Bayes Model with kernel density estimation
model <- naive_bayes(admit ~ ., data = train, usekernel = TRUE)
p2 <- predict(model, newdata = train)
tab2 <- table(p2, train$admit)        # confusion matrix
1 - sum(diag(tab2)) / sum(tab2)       # misclassification rate
[1] 0.2738462

As we can see above, introducing the kernel-based densities reduces the training misclassification rate (from about 29.5% to about 27.4%).
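The misclassification rates above are all computed on the training data. As a natural follow-up (not shown in the results above, so take it as a sketch), the same evaluation can be repeated on the held-out test set created earlier:

# Evaluate the kernel model on the test set
p_test <- predict(model, newdata = test)
tab_test <- table(p_test, test$admit)        # confusion matrix on the test data
1 - sum(diag(tab_test)) / sum(tab_test)      # test misclassification rate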