Distinguishing Benign and Malignant Tumors via an ANN

We try to recognize breast cancer using a multi-hidden-layer artificial neural network built with the H2O package. We use the Wisconsin Breast Cancer Dataset, a collection of Dr. Wolberg's real clinical cases. There are no images; instead we recognize malignant tumors based on 10 biomedical attributes. We have a total of 699 patients divided into two classes: malignant and benign. From the H2O output below, we can see that it recognized 4 cores.

library(mlbench)
library(h2o)
h2o.init(nthreads = -1) # initializing h2o
 Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         2 hours 17 minutes 
    H2O cluster timezone:       Europe/Berlin 
    H2O data parsing timezone:  UTC 
    H2O cluster version:        3.20.0.8 
    H2O cluster version age:    4 months and 14 days !!! 
    H2O cluster name:           H2O_started_from_R_perlatoa_vvr054 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   2.49 GB 
    H2O cluster total cores:    4 
    H2O cluster allowed cores:  4 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    H2O API Extensions:         AutoML, Algos, Core V3, Core V4 
    R Version:                  R version 3.5.1 (2018-07-02) 

The table below shows the crucial biomedical features involved in cancer, such as cell size and shape. In the last column we have the outcome (malignant vs. benign).

library(knitr)
library(kableExtra)
library(formattable)

data("BreastCancer")
dt <- as.data.frame(BreastCancer)
dt <- dt[1:10, ] # show only the first 10 patients

kable(dt) %>%
  kable_styling(bootstrap_options = "responsive", full_width = T, position = "center", font_size = 16)
Id Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size Bare.nuclei Bl.cromatin Normal.nucleoli Mitoses Class
1000025 5 1 1 1 2 1 3 1 1 benign
1002945 5 4 4 5 7 10 3 2 1 benign
1015425 3 1 1 1 2 2 3 1 1 benign
1016277 6 8 8 1 3 4 3 7 1 benign
1017023 4 1 1 3 2 1 3 1 1 benign
1017122 8 10 10 8 7 10 9 7 1 malignant
1018099 1 1 1 1 2 10 3 1 1 benign
1018561 2 1 2 1 2 1 3 1 1 benign
1033078 2 1 1 1 2 1 1 1 5 benign
1033078 4 2 1 1 2 1 2 1 1 benign
data <- BreastCancer[, -1] # remove ID
data[, c(1:ncol(data))] <- sapply(data[, c(1:ncol(data))], as.numeric) # interpret each feature as numeric
data[, 'Class'] <- as.factor(data[, 'Class']) # interpret dependent variable as factor
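Later we will ask H2O to balance the classes, so it is worth checking first how unbalanced they actually are. A minimal sketch, using the standard `mlbench` copy of the dataset:

```r
library(mlbench)

data("BreastCancer")
# class distribution of the full 699-patient dataset
table(BreastCancer$Class)
#    benign malignant
#       458       241
```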

# split the dataset into three parts and convert them to the H2O format
splitSample <- sample(1:3, size = nrow(data), prob = c(0.6, 0.2, 0.2), replace = TRUE) # 60/20/20 random split
train_h2o <- as.h2o(data[splitSample==1,])

  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
val_h2o <- as.h2o(data[splitSample==2,])

  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
test_h2o <- as.h2o(data[splitSample==3,])

  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
# print dimensions
dim(train_h2o)
[1] 425  10
dim(val_h2o)
[1] 150  10
dim(test_h2o)
[1] 124  10

As we can see from the result above, we have 425 (about 60%) observations for training, and around 20% each for validation (150) and test (124). Now we can train our model using the deep learning function offered by the H2O package.

model <-
  h2o.deeplearning(x = 1:9, # column numbers of the predictors
                   y = 10, # column number of the dependent variable
                   # data in H2O format
                   training_frame = train_h2o,
                   activation = "TanhWithDropout", # Tanh activation with dropout
                   input_dropout_ratio = 0.2, # fraction of input features dropped
                   balance_classes = TRUE, # oversample the minority class if benign/malignant counts are unbalanced
                   hidden = c(10,10), # two hidden layers of 10 units each
                   hidden_dropout_ratios = c(0.3, 0.3), # dropout probability for each hidden layer
                   epochs = 10, # maximum number of epochs
                   seed = 0)

  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
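Besides confusion matrices, H2O can report threshold-independent metrics for a trained model. A short sketch, assuming the `model` and `val_h2o` objects created above:

```r
# threshold-independent performance on the validation frame
perf <- h2o.performance(model, newdata = val_h2o)
h2o.auc(perf)     # area under the ROC curve
h2o.logloss(perf) # cross-entropy loss
```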

Now, let's see the confusion matrix for the training and validation sets.

# training confusion matrix
h2o.confusionMatrix(model)
Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.247389388195815:
         1   2    Error     Rate
1      266   8 0.029197   =8/274
2        5 267 0.018382   =5/272
Totals 271 275 0.023810  =13/546
# validation confusion matrix
h2o.confusionMatrix(model, val_h2o)
Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.113811786899162:
        1  2    Error    Rate
1      97  5 0.049020  =5/102
2       1 47 0.020833   =1/48
Totals 98 52 0.040000  =6/150

For the training set we reach an accuracy of about 98% (Error = 0.024); only 13 samples are misclassified. For the validation set the error is also low (Error = 0.04), with just 6 misclassified samples. If we want to see the accuracy on out-of-sample data, we can use the test set.

# test confusion matrix
h2o.confusionMatrix(model, test_h2o)
Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.211758576604076:
        1  2    Error    Rate
1      79  3 0.036585   =3/82
2       1 41 0.023810   =1/42
Totals 80 44 0.032258  =4/124
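To see the individual predictions rather than aggregate counts, we can score the test frame directly. A sketch assuming the `model` and `test_h2o` objects from above:

```r
# per-patient predictions on the held-out test set
pred <- h2o.predict(model, test_h2o)
# predicted class plus the per-class probabilities
# (probability columns are named after the class levels, here "1" and "2")
head(as.data.frame(pred))
```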

We also achieve excellent accuracy on the test set (Error = 0.032).
Our model has very good generalization capability.
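If we want to reuse the network later without retraining, the trained model can be persisted to disk and reloaded. A minimal sketch (the directory name is illustrative):

```r
# save the trained model to disk and reload it later
path <- h2o.saveModel(model, path = "h2o_models", force = TRUE)
restored <- h2o.loadModel(path)
```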