Introduction

In this project, I will use ensemble methods together with K-fold cross-validation to classify diabetes. The project uses bagging, boosting, and 10-fold cross-validation to explore the relationship between diabetes and its related attributes.

Loading libraries and data
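The loading chunk itself is not shown in this rendering; based on the functions used throughout, it presumably looks like the following sketch (package names inferred from the later code):

```r
# Packages used in this analysis (inferred from the code below)
library(mlbench)   # PimaIndiansDiabetes / PimaIndiansDiabetes2 data sets
library(mice)      # imputation of missing values
library(ipred)     # bagging
library(adabag)    # bagging/boosting with cross-validation
library(caret)     # confusionMatrix()
library(dplyr)     # filter(), bind_cols()
library(ggplot2)   # scatter plots
library(GGally)    # ggpairs()
```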

1. Exploratory Data Analysis

First, I want to read the metadata and detailed description provided with the data package.

?PimaIndiansDiabetes

"The data set PimaIndiansDiabetes2 contains a corrected version of the original data set. While the UCI repository index claims that there are no missing values, closer inspection of the data shows several physical impossibilities, e.g., blood pressure or body mass index of 0." (Newman et al., 1998)

data(PimaIndiansDiabetes)
str(PimaIndiansDiabetes)
## 'data.frame':    768 obs. of  9 variables:
##  $ pregnant: num  6 1 8 1 0 5 3 10 2 8 ...
##  $ glucose : num  148 85 183 89 137 116 78 115 197 125 ...
##  $ pressure: num  72 66 64 66 40 74 50 0 70 96 ...
##  $ triceps : num  35 29 0 23 35 0 32 0 45 0 ...
##  $ insulin : num  0 0 0 94 168 0 88 0 543 0 ...
##  $ mass    : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ pedigree: num  0.627 0.351 0.672 0.167 2.288 ...
##  $ age     : num  50 31 32 21 33 30 26 29 53 54 ...
##  $ diabetes: Factor w/ 2 levels "neg","pos": 2 1 2 1 2 1 2 1 2 2 ...

We have 768 observations and 9 variables; 8 variables are numeric and 1 is a factor.

Let’s take a quick look at the completeness of PimaIndiansDiabetes2 to examine how many missing values are there in the original dataset:

data(PimaIndiansDiabetes2)
summary(PimaIndiansDiabetes2)
##     pregnant         glucose         pressure         triceps     
##  Min.   : 0.000   Min.   : 44.0   Min.   : 24.00   Min.   : 7.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 64.00   1st Qu.:22.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :29.00  
##  Mean   : 3.845   Mean   :121.7   Mean   : 72.41   Mean   :29.15  
##  3rd Qu.: 6.000   3rd Qu.:141.0   3rd Qu.: 80.00   3rd Qu.:36.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##                   NA's   :5       NA's   :35       NA's   :227    
##     insulin            mass          pedigree           age       
##  Min.   : 14.00   Min.   :18.20   Min.   :0.0780   Min.   :21.00  
##  1st Qu.: 76.25   1st Qu.:27.50   1st Qu.:0.2437   1st Qu.:24.00  
##  Median :125.00   Median :32.30   Median :0.3725   Median :29.00  
##  Mean   :155.55   Mean   :32.46   Mean   :0.4719   Mean   :33.24  
##  3rd Qu.:190.00   3rd Qu.:36.60   3rd Qu.:0.6262   3rd Qu.:41.00  
##  Max.   :846.00   Max.   :67.10   Max.   :2.4200   Max.   :81.00  
##  NA's   :374      NA's   :11                                      
##  diabetes 
##  neg:500  
##  pos:268  
##           
##           
##           
##           
## 

The summary reveals that the variables triceps and insulin are missing many values: triceps is missing 227/768 (29% of observations) and insulin is missing 374/768 (49%). These counts are too large to simply ignore or replace with 0s.

I will use the package ‘mice’ to learn more about the missing values and to impute them. To choose an appropriate method in the mice package, I will follow the tutorial provided here: https://www.analyticsvidhya.com/blog/2016/03/tutorial-powerful-packages-imputing-missing-values/

library(mice)
md.pattern(PimaIndiansDiabetes2)

##     pregnant pedigree age diabetes glucose mass pressure triceps insulin
## 392        1        1   1        1       1    1        1       1       1
## 140        1        1   1        1       1    1        1       1       0
## 192        1        1   1        1       1    1        1       0       0
## 2          1        1   1        1       1    1        0       1       0
## 26         1        1   1        1       1    1        0       0       0
## 1          1        1   1        1       1    0        1       1       1
## 1          1        1   1        1       1    0        1       1       0
## 2          1        1   1        1       1    0        1       0       0
## 7          1        1   1        1       1    0        0       0       0
## 1          1        1   1        1       0    1        1       1       1
## 4          1        1   1        1       0    1        1       1       0
##            0        0   0        0       5   11       35     227     374
##        
## 392   0
## 140   1
## 192   2
## 2     2
## 26    3
## 1     1
## 1     2
## 2     3
## 7     4
## 1     1
## 4     2
##     652

There are 392 complete observations, 140 observations with 1 missing value, 192 with 2 missing values, and so on. Most missing values are concentrated in the insulin variable. I will pay close attention to this variable after imputation to make sure the imputed values make sense.

In this implementation, I will remove the categorical variable diabetes for 2 reasons:

  1. I want to use PMM (Predictive Mean Matching), a method that works for numeric variables only.

  2. If I use diabetes to predict the missing values of a variable, the imputed values could become biased when we later try to predict diabetes.

imputed_Data <- mice(PimaIndiansDiabetes2[-9], m = 1, method = 'pmm', seed = 500)
## 
##  iter imp variable
##   1   1  glucose  pressure  triceps  insulin  mass
##   2   1  glucose  pressure  triceps  insulin  mass
##   3   1  glucose  pressure  triceps  insulin  mass
##   4   1  glucose  pressure  triceps  insulin  mass
##   5   1  glucose  pressure  triceps  insulin  mass
summary(imputed_Data)
## Class: mids
## Number of multiple imputations:  1 
## Imputation methods:
## pregnant  glucose pressure  triceps  insulin     mass pedigree      age 
##       ""    "pmm"    "pmm"    "pmm"    "pmm"    "pmm"       ""       "" 
## PredictorMatrix:
##          pregnant glucose pressure triceps insulin mass pedigree age
## pregnant        0       1        1       1       1    1        1   1
## glucose         1       0        1       1       1    1        1   1
## pressure        1       1        0       1       1    1        1   1
## triceps         1       1        1       0       1    1        1   1
## insulin         1       1        1       1       0    1        1   1
## mass            1       1        1       1       1    0        1   1

We can use the imputation to complete our dataset:

# get the completed data set (m = 1, so there is only one imputation)
completeData <- complete(imputed_Data, 1)

Density plot is very useful to double check the distribution of the imputed values:

densityplot(imputed_Data)

The observed values are shown in blue and the imputed values in red. We can see that the distribution of the imputed data is very similar to that of the observed data, especially for the variables we care about most: triceps and insulin.

We can verify that the mean, median and IQR remain roughly the same before and after imputation.

summary(PimaIndiansDiabetes2)
##     pregnant         glucose         pressure         triceps     
##  Min.   : 0.000   Min.   : 44.0   Min.   : 24.00   Min.   : 7.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 64.00   1st Qu.:22.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :29.00  
##  Mean   : 3.845   Mean   :121.7   Mean   : 72.41   Mean   :29.15  
##  3rd Qu.: 6.000   3rd Qu.:141.0   3rd Qu.: 80.00   3rd Qu.:36.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##                   NA's   :5       NA's   :35       NA's   :227    
##     insulin            mass          pedigree           age       
##  Min.   : 14.00   Min.   :18.20   Min.   :0.0780   Min.   :21.00  
##  1st Qu.: 76.25   1st Qu.:27.50   1st Qu.:0.2437   1st Qu.:24.00  
##  Median :125.00   Median :32.30   Median :0.3725   Median :29.00  
##  Mean   :155.55   Mean   :32.46   Mean   :0.4719   Mean   :33.24  
##  3rd Qu.:190.00   3rd Qu.:36.60   3rd Qu.:0.6262   3rd Qu.:41.00  
##  Max.   :846.00   Max.   :67.10   Max.   :2.4200   Max.   :81.00  
##  NA's   :374      NA's   :11                                      
##  diabetes 
##  neg:500  
##  pos:268  
##           
##           
##           
##           
## 
summary(completeData)
##     pregnant         glucose         pressure         triceps     
##  Min.   : 0.000   Min.   : 44.0   Min.   : 24.00   Min.   : 7.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 64.00   1st Qu.:21.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :28.00  
##  Mean   : 3.845   Mean   :121.6   Mean   : 72.54   Mean   :28.71  
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:36.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     insulin            mass          pedigree           age       
##  Min.   : 14.00   Min.   :18.20   Min.   :0.0780   Min.   :21.00  
##  1st Qu.: 74.75   1st Qu.:27.48   1st Qu.:0.2437   1st Qu.:24.00  
##  Median :123.50   Median :32.30   Median :0.3725   Median :29.00  
##  Mean   :152.59   Mean   :32.45   Mean   :0.4719   Mean   :33.24  
##  3rd Qu.:184.25   3rd Qu.:36.60   3rd Qu.:0.6262   3rd Qu.:41.00  
##  Max.   :846.00   Max.   :67.10   Max.   :2.4200   Max.   :81.00

Now that we have a complete dataset, we can begin to perform exploratory data analysis:

# return the diabetes column to our dataset
completeData$diabetes = PimaIndiansDiabetes2$diabetes

head(completeData) 

What is the correlation between the numerical variables?

We can determine this by calculating the correlation matrix for all numerical variables:

cor(completeData[-9])
##             pregnant   glucose    pressure   triceps    insulin       mass
## pregnant  1.00000000 0.1304515  0.21853162 0.1084576 0.04596043 0.01446624
## glucose   0.13045154 1.0000000  0.22572746 0.2344698 0.57468776 0.23077943
## pressure  0.21853162 0.2257275  1.00000000 0.2288275 0.11261243 0.29123085
## triceps   0.10845765 0.2344698  0.22882746 1.0000000 0.23701208 0.65974454
## insulin   0.04596043 0.5746878  0.11261243 0.2370121 1.00000000 0.23934754
## mass      0.01446624 0.2307794  0.29123085 0.6597445 0.23934754 1.00000000
## pedigree -0.03352267 0.1399242 -0.01071528 0.1274370 0.14554639 0.15627218
## age       0.54434123 0.2692574  0.32443199 0.1559934 0.20068816 0.02586208
##             pedigree        age
## pregnant -0.03352267 0.54434123
## glucose   0.13992422 0.26925742
## pressure -0.01071528 0.32443199
## triceps   0.12743699 0.15599340
## insulin   0.14554639 0.20068816
## mass      0.15627218 0.02586208
## pedigree  1.00000000 0.03356131
## age       0.03356131 1.00000000

We can see moderately strong correlations between insulin and glucose (cor = 0.57), mass and triceps (cor = 0.66), and age and pregnant (cor = 0.54). Let’s make some scatter plots to visualize these relationships:

# `insulin` and `glucose`
ggplot(completeData, aes(x = insulin, y = glucose)) + geom_point() + geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

The relationship between insulin and glucose may be better described as non-linear; the blue loess curve is an attempt to capture this non-linear relationship.

# `age` and `pregnant`
ggplot(completeData, aes(x = age, y = pregnant)) + geom_point() + geom_smooth(method = "lm")

There is a relatively weak linear relationship between pregnant and age.

# `mass` and `triceps`
ggplot(completeData, aes(x = mass, y = triceps)) + geom_point() + geom_smooth(method = "lm")

We can observe a strong linear relationship between triceps and mass. However, there is an outlier that could exert high leverage on the fitted relationship. We should take a quick look at this outlier using the filter() command:

completeData %>% filter(triceps > 80)

We can see that this outlier has relatively high glucose and is diabetes-positive, so it is plausible that this person would also have a thick triceps skin fold. Still, given how extreme the value is, we should check this case against the original source to ensure data accuracy.

We can add the diabetes label to the scatter plot of triceps and mass to see the association:

ggplot(completeData, aes(x = mass, y = triceps, col = diabetes)) + geom_point()

We can observe that individuals diagnosed with diabetes tend to have higher triceps and mass measurements. However, there is considerable overlap between the “neg” and “pos” groups, so we would need much more information to predict diabetes with reasonable accuracy.

Before running any machine learning algorithm on our problem, we first need to check the data distributions. I will use the function ggpairs() from the GGally package to automate this process:
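The plotting chunk itself does not appear in this rendering; a minimal sketch, assuming the GGally package is installed and the imputed completeData from above:

```r
library(GGally)
library(ggplot2)

# Pairwise plots: each variable's distribution on the diagonal,
# scatter plots and correlations off the diagonal, coloured by diabetes status
ggpairs(completeData, aes(colour = diabetes, alpha = 0.5))
```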

Observations about the distributions of data:

  1. The numerical variables’ distributions are either approximately normal or right-skewed.

  2. Diabetes has a class-imbalance issue: there are far more “neg” instances than “pos”.

2. The bagging method

Create a training and a test dataset with an 80/20 split:

set.seed(42)
train_index = sample(c(TRUE,FALSE), size = nrow(completeData), prob = c(0.8,0.2), replace = TRUE)
diabetes_train = completeData[train_index,]
diabetes_test = completeData[!train_index,]

# double check the number
nrow(diabetes_test)
## [1] 153
nrow(diabetes_train)
## [1] 615

I will use the ipred package to create the ensemble. In this method, decision trees serve as the base learners; trees are unstable learners whose models tend to change substantially when the input data changes slightly (Lantz, 2015).
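The training chunk is not shown above; a minimal sketch of how the `mybag` model and `bagging_prediction` vector used below could be produced with ipred:

```r
library(ipred)
set.seed(42)

# Train a bagged ensemble of decision trees (25 bootstrap trees by default)
mybag <- ipred::bagging(diabetes ~ ., data = diabetes_train, nbagg = 25)

# Predict class labels for the held-out test set
bagging_prediction <- predict(mybag, newdata = diabetes_test)
```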

After training and fitting the data, we can display the results using a confusion matrix:

confusionMatrix(table(bagging_prediction, diabetes_test$diabetes), positive = "pos")
## Confusion Matrix and Statistics
## 
##                   
## bagging_prediction neg pos
##                neg  80  21
##                pos  18  34
##                                          
##                Accuracy : 0.7451         
##                  95% CI : (0.6684, 0.812)
##     No Information Rate : 0.6405         
##     P-Value [Acc > NIR] : 0.003794       
##                                          
##                   Kappa : 0.4398         
##                                          
##  Mcnemar's Test P-Value : 0.748774       
##                                          
##             Sensitivity : 0.6182         
##             Specificity : 0.8163         
##          Pos Pred Value : 0.6538         
##          Neg Pred Value : 0.7921         
##              Prevalence : 0.3595         
##          Detection Rate : 0.2222         
##    Detection Prevalence : 0.3399         
##       Balanced Accuracy : 0.7173         
##                                          
##        'Positive' Class : pos            
## 

Average error from the predictions on the training data:

mean(predict(mybag) != diabetes_train$diabetes)
## [1] 0.2764228

3. Perform 10-fold cross-validation using bagging

Building on the results above, we can apply 10-fold cross-validation to the bagged trees to see whether the results stay consistent.

set.seed(42)

diabetes_baggingcv = adabag::bagging.cv(diabetes ~ ., v = 10, data = diabetes_train, mfinal = 10)
confusionMatrix(diabetes_baggingcv$confusion, positive = "pos")
## Confusion Matrix and Statistics
## 
##                Observed Class
## Predicted Class neg pos
##             neg 351 107
##             pos  51 106
##                                           
##                Accuracy : 0.7431          
##                  95% CI : (0.7066, 0.7772)
##     No Information Rate : 0.6537          
##     P-Value [Acc > NIR] : 1.141e-06       
##                                           
##                   Kappa : 0.3952          
##                                           
##  Mcnemar's Test P-Value : 1.211e-05       
##                                           
##             Sensitivity : 0.4977          
##             Specificity : 0.8731          
##          Pos Pred Value : 0.6752          
##          Neg Pred Value : 0.7664          
##              Prevalence : 0.3463          
##          Detection Rate : 0.1724          
##    Detection Prevalence : 0.2553          
##       Balanced Accuracy : 0.6854          
##                                           
##        'Positive' Class : pos             
## 

The accuracy of the bagging-with-cross-validation (CV) model is lower than that of the plain bagging model, suggesting the single train/test estimate was somewhat optimistic. The accuracy of the bagging CV model is the average of the accuracies across the folds, and therefore gives a better estimate of the model’s accuracy on new data.

Average error from the predicted results for 10-fold CV:

diabetes_baggingcv$error
## [1] 0.2569106

Note that this cross-validated error is computed differently from the training-data error above, so the two numbers are not directly comparable; the CV estimate averages over held-out folds and is generally the more reliable of the two.

4. The boosting method

I will use package adabag to perform the boosting method:

library(adabag)
set.seed(42)
# train the model using adaboost
myboosting = adabag::boosting(diabetes ~ ., data = diabetes_train)

# predict on the test dataset
boosting_prediction = predict.boosting(myboosting, diabetes_test)

# results
confusionMatrix(table(boosting_prediction$class, diabetes_test$diabetes), positive = "pos")
## Confusion Matrix and Statistics
## 
##      
##       neg pos
##   neg  81  20
##   pos  17  35
##                                           
##                Accuracy : 0.7582          
##                  95% CI : (0.6824, 0.8237)
##     No Information Rate : 0.6405          
##     P-Value [Acc > NIR] : 0.00123         
##                                           
##                   Kappa : 0.4685          
##                                           
##  Mcnemar's Test P-Value : 0.74231         
##                                           
##             Sensitivity : 0.6364          
##             Specificity : 0.8265          
##          Pos Pred Value : 0.6731          
##          Neg Pred Value : 0.8020          
##              Prevalence : 0.3595          
##          Detection Rate : 0.2288          
##    Detection Prevalence : 0.3399          
##       Balanced Accuracy : 0.7314          
##                                           
##        'Positive' Class : pos             
## 

Average error from the predicted results

boosting_prediction$error
## [1] 0.2418301

5. Perform 10-fold cross-validation with the boosting method

set.seed(42)
boosting_cv = boosting.cv(diabetes ~ ., data = completeData)
## i:  1 Wed Feb 19 23:20:13 2020 
## i:  2 Wed Feb 19 23:20:52 2020 
## i:  3 Wed Feb 19 23:21:22 2020 
## i:  4 Wed Feb 19 23:21:58 2020 
## i:  5 Wed Feb 19 23:22:33 2020 
## i:  6 Wed Feb 19 23:23:04 2020 
## i:  7 Wed Feb 19 23:23:35 2020 
## i:  8 Wed Feb 19 23:24:08 2020 
## i:  9 Wed Feb 19 23:24:39 2020 
## i:  10 Wed Feb 19 23:25:09 2020
confusionMatrix(boosting_cv$confusion, positive = "pos")
## Confusion Matrix and Statistics
## 
##                Observed Class
## Predicted Class neg pos
##             neg 402 108
##             pos  98 160
##                                           
##                Accuracy : 0.7318          
##                  95% CI : (0.6989, 0.7628)
##     No Information Rate : 0.651           
##     P-Value [Acc > NIR] : 1.015e-06       
##                                           
##                   Kappa : 0.4045          
##                                           
##  Mcnemar's Test P-Value : 0.5306          
##                                           
##             Sensitivity : 0.5970          
##             Specificity : 0.8040          
##          Pos Pred Value : 0.6202          
##          Neg Pred Value : 0.7882          
##              Prevalence : 0.3490          
##          Detection Rate : 0.2083          
##    Detection Prevalence : 0.3359          
##       Balanced Accuracy : 0.7005          
##                                           
##        'Positive' Class : pos             
## 

Average error from the predicted results for 10-fold cross-validation:

boosting_cv$error
## [1] 0.2682292

We can observe the same phenomenon as with the bagging method: accuracy decreases and error increases when we apply 10-fold cross-validation to boosting. Without cross-validation, we are often overconfident about how well our model will predict on a new dataset.

6. Compare all the results

Let’s create a table that summarizes the results and visualize the findings:

ensemble_summary = bind_cols(accuracy = c(0.7516, 0.7398, 0.7582, 0.7435), 
                             error = c(0.2780488, 0.2796748, 0.2418301, 0.2565104),
                             method = c("Bagging", "Bagging with 10 folds CV", "Boosting", "Boosting with 10 folds CV"))
ensemble_summary
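One way to visualize the table (a sketch using ggplot2; the chart style is my choice, not part of the original output):

```r
library(ggplot2)

# Bar chart comparing accuracy across the four ensemble set-ups
ggplot(ensemble_summary, aes(x = reorder(method, accuracy), y = accuracy)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Accuracy by ensemble method", x = NULL, y = "Accuracy")
```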

From the results table, for this particular dataset, I observed:

  1. Boosting methods have higher accuracy and lower error compared to bagging.
  2. The 10-fold cross-validation methods yield lower accuracy and higher error. However, as discussed above, cross-validated accuracy/error estimates are more reliable and objective, and generalize better to new datasets the model has never seen.

If I were to select one best method, I would select “Boosting with 10-fold CV” because of its accuracy and the assurance that the model has not overfit to the training dataset.

7. Compare these two techniques.

Perhaps we can compare the two ensemble techniques, bagging and boosting, by comparing the feature importance computed by each.

mybag_adabag = adabag::bagging(diabetes ~ ., data = diabetes_train) # re-run bagging with adabag because ipred does not compute feature importance
barplot(mybag_adabag$importance, main = "Bagging Relative Feature Importance")

barplot(myboosting$importance, main = "Boosting Relative Feature Importance")

We can see that the feature glucose is disproportionately important in the bagging model, while importance in the boosting model is spread more evenly across features. This suggests that bagging here tends to concentrate on a single predictor. For a complex problem such as predicting diabetes, no single predictor should dominate the others, so boosting may be the better option to reduce this bias in the model.

References

Lantz, Brett (2015). Machine Learning with R: Discover how to build machine learning algorithms, prepare data, and dig deep into data prediction techniques with R. http://lumen.regis.edu/record=b1734246~S3

Newman, D.J. & Hettich, S. & Blake, C.L. & Merz, C.J. (1998). UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science.