In this project, I will use ensemble methods together with k-fold cross-validation to classify diabetes. The project applies bagging, boosting, and 10-fold cross-validation to explore the relationship between diabetes and its related attributes.
First, I want to read the metadata and detailed description provided with the data package.
?PimaIndiansDiabetes
"The data set PimaIndiansDiabetes2 contains a corrected version of the original data set. While the UCI repository index claims that there are no missing values, closer inspection of the data shows several physical impossibilities, e.g., blood pressure or body mass index of 0" (Newman et al., 1998).
library(mlbench) # the PimaIndiansDiabetes datasets live in the mlbench package
data(PimaIndiansDiabetes)
str(PimaIndiansDiabetes)
## 'data.frame': 768 obs. of 9 variables:
## $ pregnant: num 6 1 8 1 0 5 3 10 2 8 ...
## $ glucose : num 148 85 183 89 137 116 78 115 197 125 ...
## $ pressure: num 72 66 64 66 40 74 50 0 70 96 ...
## $ triceps : num 35 29 0 23 35 0 32 0 45 0 ...
## $ insulin : num 0 0 0 94 168 0 88 0 543 0 ...
## $ mass : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
## $ pedigree: num 0.627 0.351 0.672 0.167 2.288 ...
## $ age : num 50 31 32 21 33 30 26 29 53 54 ...
## $ diabetes: Factor w/ 2 levels "neg","pos": 2 1 2 1 2 1 2 1 2 2 ...
We have 768 observations and 9 variables; 8 variables are numeric and 1 is a factor.
Let’s take a quick look at the completeness of PimaIndiansDiabetes2 to examine how many missing values there are in the corrected dataset:
data(PimaIndiansDiabetes2)
summary(PimaIndiansDiabetes2)
## pregnant glucose pressure triceps
## Min. : 0.000 Min. : 44.0 Min. : 24.00 Min. : 7.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 64.00 1st Qu.:22.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :29.00
## Mean : 3.845 Mean :121.7 Mean : 72.41 Mean :29.15
## 3rd Qu.: 6.000 3rd Qu.:141.0 3rd Qu.: 80.00 3rd Qu.:36.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## NA's :5 NA's :35 NA's :227
## insulin mass pedigree age
## Min. : 14.00 Min. :18.20 Min. :0.0780 Min. :21.00
## 1st Qu.: 76.25 1st Qu.:27.50 1st Qu.:0.2437 1st Qu.:24.00
## Median :125.00 Median :32.30 Median :0.3725 Median :29.00
## Mean :155.55 Mean :32.46 Mean :0.4719 Mean :33.24
## 3rd Qu.:190.00 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.00 Max. :67.10 Max. :2.4200 Max. :81.00
## NA's :374 NA's :11
## diabetes
## neg:500
## pos:268
##
##
##
##
##
The summary reveals that the variables triceps and insulin are missing too many values: triceps is missing 227/768 values (about 30% of observations) and insulin is missing 374/768 (about 49%). These counts are far too large to ignore or to simply replace with 0s.
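The same counts can be obtained directly from the data; a minimal sketch, assuming the mlbench package is installed:

```r
library(mlbench)

# load the corrected dataset, where impossible zeros are coded as NA
data(PimaIndiansDiabetes2)

# number of missing values per variable
na_counts <- colSums(is.na(PimaIndiansDiabetes2))
na_counts

# proportion of observations missing, rounded to two decimals
round(na_counts / nrow(PimaIndiansDiabetes2), 2)
```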
I will use the mice package to learn more about the missing values and to impute them. To choose an appropriate method in mice, I will follow the tutorial provided at: https://www.analyticsvidhya.com/blog/2016/03/tutorial-powerful-packages-imputing-missing-values/
library(mice)
md.pattern(PimaIndiansDiabetes2)
## pregnant pedigree age diabetes glucose mass pressure triceps insulin
## 392 1 1 1 1 1 1 1 1 1
## 140 1 1 1 1 1 1 1 1 0
## 192 1 1 1 1 1 1 1 0 0
## 2 1 1 1 1 1 1 0 1 0
## 26 1 1 1 1 1 1 0 0 0
## 1 1 1 1 1 1 0 1 1 1
## 1 1 1 1 1 1 0 1 1 0
## 2 1 1 1 1 1 0 1 0 0
## 7 1 1 1 1 1 0 0 0 0
## 1 1 1 1 1 0 1 1 1 1
## 4 1 1 1 1 0 1 1 1 0
## 0 0 0 0 5 11 35 227 374
##
## 392 0
## 140 1
## 192 2
## 2 2
## 26 3
## 1 1
## 1 2
## 2 3
## 7 4
## 1 1
## 4 2
## 652
There are 392 complete observations, 140 observations with 1 value missing, 192 with 2 missing, and so on. Most missing values are concentrated in the insulin variable, so I will pay close attention to it after imputation to make sure the replacement values make sense.
In this implementation, I will remove the categorical variable diabetes for 2 reasons:
I want to use PMM (Predictive Mean Matching), a method that works for numeric variables only.
If I used diabetes to predict the missing values of a variable, the imputed values could become biased when we later try to predict diabetes.
imputed_Data <- mice(PimaIndiansDiabetes2[-9], m = 1, method = 'pmm', seed = 500)
##
## iter imp variable
## 1 1 glucose pressure triceps insulin mass
## 2 1 glucose pressure triceps insulin mass
## 3 1 glucose pressure triceps insulin mass
## 4 1 glucose pressure triceps insulin mass
## 5 1 glucose pressure triceps insulin mass
summary(imputed_Data)
## Class: mids
## Number of multiple imputations: 1
## Imputation methods:
## pregnant glucose pressure triceps insulin mass pedigree age
## "" "pmm" "pmm" "pmm" "pmm" "pmm" "" ""
## PredictorMatrix:
## pregnant glucose pressure triceps insulin mass pedigree age
## pregnant 0 1 1 1 1 1 1 1
## glucose 1 0 1 1 1 1 1 1
## pressure 1 1 0 1 1 1 1 1
## triceps 1 1 1 0 1 1 1 1
## insulin 1 1 1 1 0 1 1 1
## mass 1 1 1 1 1 0 1 1
We can use the imputation to complete our dataset:
# extract the completed dataset (only one imputation was generated, since m = 1)
completeData <- complete(imputed_Data, 1)
A density plot is very useful for double-checking the distribution of the imputed values:
densityplot(imputed_Data)
The complete observations are shown in blue and the imputed values in red. The distribution of the imputed data is very similar to that of the observed data, especially for the variables we care about most: triceps and insulin.
We can also verify that the mean, median, and IQR remain roughly the same before and after imputation.
summary(PimaIndiansDiabetes2)
## pregnant glucose pressure triceps
## Min. : 0.000 Min. : 44.0 Min. : 24.00 Min. : 7.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 64.00 1st Qu.:22.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :29.00
## Mean : 3.845 Mean :121.7 Mean : 72.41 Mean :29.15
## 3rd Qu.: 6.000 3rd Qu.:141.0 3rd Qu.: 80.00 3rd Qu.:36.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## NA's :5 NA's :35 NA's :227
## insulin mass pedigree age
## Min. : 14.00 Min. :18.20 Min. :0.0780 Min. :21.00
## 1st Qu.: 76.25 1st Qu.:27.50 1st Qu.:0.2437 1st Qu.:24.00
## Median :125.00 Median :32.30 Median :0.3725 Median :29.00
## Mean :155.55 Mean :32.46 Mean :0.4719 Mean :33.24
## 3rd Qu.:190.00 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.00 Max. :67.10 Max. :2.4200 Max. :81.00
## NA's :374 NA's :11
## diabetes
## neg:500
## pos:268
##
##
##
##
##
summary(completeData)
## pregnant glucose pressure triceps
## Min. : 0.000 Min. : 44.0 Min. : 24.00 Min. : 7.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 64.00 1st Qu.:21.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :28.00
## Mean : 3.845 Mean :121.6 Mean : 72.54 Mean :28.71
## 3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:36.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## insulin mass pedigree age
## Min. : 14.00 Min. :18.20 Min. :0.0780 Min. :21.00
## 1st Qu.: 74.75 1st Qu.:27.48 1st Qu.:0.2437 1st Qu.:24.00
## Median :123.50 Median :32.30 Median :0.3725 Median :29.00
## Mean :152.59 Mean :32.45 Mean :0.4719 Mean :33.24
## 3rd Qu.:184.25 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.00 Max. :67.10 Max. :2.4200 Max. :81.00
Now that we have a complete dataset, we can begin to perform exploratory data analysis:
# return the diabetes column to our dataset
completeData$diabetes = PimaIndiansDiabetes2$diabetes
head(completeData)
We can examine how the numerical variables relate to one another by calculating the correlation matrix:
cor(completeData[-9])
## pregnant glucose pressure triceps insulin mass
## pregnant 1.00000000 0.1304515 0.21853162 0.1084576 0.04596043 0.01446624
## glucose 0.13045154 1.0000000 0.22572746 0.2344698 0.57468776 0.23077943
## pressure 0.21853162 0.2257275 1.00000000 0.2288275 0.11261243 0.29123085
## triceps 0.10845765 0.2344698 0.22882746 1.0000000 0.23701208 0.65974454
## insulin 0.04596043 0.5746878 0.11261243 0.2370121 1.00000000 0.23934754
## mass 0.01446624 0.2307794 0.29123085 0.6597445 0.23934754 1.00000000
## pedigree -0.03352267 0.1399242 -0.01071528 0.1274370 0.14554639 0.15627218
## age 0.54434123 0.2692574 0.32443199 0.1559934 0.20068816 0.02586208
## pedigree age
## pregnant -0.03352267 0.54434123
## glucose 0.13992422 0.26925742
## pressure -0.01071528 0.32443199
## triceps 0.12743699 0.15599340
## insulin 0.14554639 0.20068816
## mass 0.15627218 0.02586208
## pedigree 1.00000000 0.03356131
## age 0.03356131 1.00000000
We can see moderately strong correlations between insulin and glucose (cor ≈ 0.57), mass and triceps (cor ≈ 0.66), and age and pregnant (cor ≈ 0.54). Let’s make some scatter plots to visualize these relationships:
library(ggplot2)
# `insulin` and `glucose`
ggplot(completeData, aes(x = insulin, y = glucose)) + geom_point() + geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
The relationship between insulin and glucose is perhaps better described as non-linear; the blue loess curve is an attempt to capture it.
# `age` and `pregnant`
ggplot(completeData, aes(x = age, y = pregnant)) + geom_point() + geom_smooth(method = "lm")
There is a relatively weak linear relationship between pregnant and age.
# `mass` and `triceps`
ggplot(completeData, aes(x = mass, y = triceps)) + geom_point() + geom_smooth(method = "lm")
We can observe a strong linear relationship between triceps and mass. However, there is an outlier that could exert high leverage on the fit. Let's take a quick look at it using the filter() command:
completeData %>% filter(triceps > 80)
We can see that this outlier has relatively high glucose and is diabetes-positive, so it is plausible that this person would also have a thick triceps skin fold. Still, given how extreme the value is, we should check this specific case with the data source to ensure accuracy.
We can add the diabetes label to the scatter plot of triceps and mass to see the association:
ggplot(completeData, aes(x = mass, y = triceps, col = diabetes)) + geom_point()
Individuals diagnosed with diabetes tend to have higher triceps and mass measurements. However, there is considerable overlap between the "neg" and "pos" groups, so we would need much more information to predict diabetes with reasonable accuracy.
Before running any machine learning algorithm on our problem, we first need to check the data distributions. I will use the function ggpairs() from the GGally package to automate this process:
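A minimal sketch of the call (the original chunk is not shown above; coloring by diabetes is my assumption):

```r
library(GGally)
library(ggplot2)

# pairwise scatter plots, correlations, and per-variable density plots,
# colored by diabetes status
ggpairs(completeData, aes(col = diabetes, alpha = 0.5))
```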
Observations about the distributions of the data:
The numerical variables are either approximately normally distributed or right-skewed.
diabetes has a class-imbalance issue: the number of "neg" instances far exceeds the number of "pos" instances.
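The imbalance is easy to quantify with a quick sketch:

```r
# class counts and proportions for the outcome variable
table(completeData$diabetes)
round(prop.table(table(completeData$diabetes)), 2)  # roughly 65% neg vs 35% pos
```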
Create a training and a test dataset with a 80/20 split:
set.seed(42)
train_index = sample(c(TRUE,FALSE), size = nrow(completeData), prob = c(0.8,0.2), replace = TRUE)
diabetes_train = completeData[train_index,]
diabetes_test = completeData[!train_index,]
# double check the number
nrow(diabetes_test)
## [1] 153
nrow(diabetes_train)
## [1] 615
I will use the ipred package to create the ensemble. In this method, decision trees are used as the base learners; trees are unstable learners, meaning the fitted model tends to change substantially when the input data changes slightly (Lantz, 2015).
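The training chunk itself is not shown above; a sketch of what it likely looks like (the object names mybag and bagging_prediction match those used in the results below, and caret is assumed to be loaded for confusionMatrix()):

```r
library(ipred)
library(caret)

set.seed(42)
# fit a bagged ensemble of decision trees on the training set
mybag <- ipred::bagging(diabetes ~ ., data = diabetes_train)

# predict class labels on the held-out test set
bagging_prediction <- predict(mybag, newdata = diabetes_test)
```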
After training and fitting the data, we can display the results using a confusion matrix:
confusionMatrix(table(bagging_prediction, diabetes_test$diabetes), positive = "pos")
## Confusion Matrix and Statistics
##
##
## bagging_prediction neg pos
## neg 80 21
## pos 18 34
##
## Accuracy : 0.7451
## 95% CI : (0.6684, 0.812)
## No Information Rate : 0.6405
## P-Value [Acc > NIR] : 0.003794
##
## Kappa : 0.4398
##
## Mcnemar's Test P-Value : 0.748774
##
## Sensitivity : 0.6182
## Specificity : 0.8163
## Pos Pred Value : 0.6538
## Neg Pred Value : 0.7921
## Prevalence : 0.3595
## Detection Rate : 0.2222
## Detection Prevalence : 0.3399
## Balanced Accuracy : 0.7173
##
## 'Positive' Class : pos
##
Average error from predicted results
mean(predict(mybag) != diabetes_train$diabetes)
## [1] 0.2764228
From the results above, we can now apply bagged trees with 10-fold cross-validation to see whether the results stay consistent.
set.seed(42)
diabetes_baggingcv = adabag::bagging.cv(diabetes ~ ., v = 10, data = diabetes_train, mfinal = 10)
confusionMatrix(diabetes_baggingcv$confusion, positive = "pos")
## Confusion Matrix and Statistics
##
## Observed Class
## Predicted Class neg pos
## neg 351 107
## pos 51 106
##
## Accuracy : 0.7431
## 95% CI : (0.7066, 0.7772)
## No Information Rate : 0.6537
## P-Value [Acc > NIR] : 1.141e-06
##
## Kappa : 0.3952
##
## Mcnemar's Test P-Value : 1.211e-05
##
## Sensitivity : 0.4977
## Specificity : 0.8731
## Pos Pred Value : 0.6752
## Neg Pred Value : 0.7664
## Prevalence : 0.3463
## Detection Rate : 0.1724
## Detection Prevalence : 0.2553
## Balanced Accuracy : 0.6854
##
## 'Positive' Class : pos
##
The accuracy of the bagging model with cross-validation (CV) is slightly lower than that of the plain bagging model, suggesting the plain bagging estimate was somewhat optimistic. Because the CV accuracy is the average accuracy across folds, it gives a better estimate of how the model will perform on new data.
Average error from the predicted results for 10-fold CV:
diabetes_baggingcv$error
## [1] 0.2569106
Note that the cross-validated error estimate (0.257) is in the same range as the error computed from the bagging model's predictions on the training data (0.276); both suggest the model misclassifies roughly a quarter of cases.
I will use the adabag package to perform the boosting method:
library(adabag)
set.seed(42)
# train the model using adaboost
myboosting = adabag::boosting(diabetes ~ ., data = diabetes_train)
# predict on the test dataset
boosting_prediction = predict.boosting(myboosting, diabetes_test)
# results
confusionMatrix(table(boosting_prediction$class, diabetes_test$diabetes), positive = "pos")
## Confusion Matrix and Statistics
##
##
## neg pos
## neg 81 20
## pos 17 35
##
## Accuracy : 0.7582
## 95% CI : (0.6824, 0.8237)
## No Information Rate : 0.6405
## P-Value [Acc > NIR] : 0.00123
##
## Kappa : 0.4685
##
## Mcnemar's Test P-Value : 0.74231
##
## Sensitivity : 0.6364
## Specificity : 0.8265
## Pos Pred Value : 0.6731
## Neg Pred Value : 0.8020
## Prevalence : 0.3595
## Detection Rate : 0.2288
## Detection Prevalence : 0.3399
## Balanced Accuracy : 0.7314
##
## 'Positive' Class : pos
##
Average error from the predicted results
boosting_prediction$error
## [1] 0.2418301
set.seed(42)
boosting_cv = boosting.cv(diabetes ~ ., data = completeData) # note: this runs CV over the full dataset rather than the training split
## i: 1 Wed Feb 19 23:20:13 2020
## i: 2 Wed Feb 19 23:20:52 2020
## i: 3 Wed Feb 19 23:21:22 2020
## i: 4 Wed Feb 19 23:21:58 2020
## i: 5 Wed Feb 19 23:22:33 2020
## i: 6 Wed Feb 19 23:23:04 2020
## i: 7 Wed Feb 19 23:23:35 2020
## i: 8 Wed Feb 19 23:24:08 2020
## i: 9 Wed Feb 19 23:24:39 2020
## i: 10 Wed Feb 19 23:25:09 2020
confusionMatrix(boosting_cv$confusion, positive = "pos")
## Confusion Matrix and Statistics
##
## Observed Class
## Predicted Class neg pos
## neg 402 108
## pos 98 160
##
## Accuracy : 0.7318
## 95% CI : (0.6989, 0.7628)
## No Information Rate : 0.651
## P-Value [Acc > NIR] : 1.015e-06
##
## Kappa : 0.4045
##
## Mcnemar's Test P-Value : 0.5306
##
## Sensitivity : 0.5970
## Specificity : 0.8040
## Pos Pred Value : 0.6202
## Neg Pred Value : 0.7882
## Prevalence : 0.3490
## Detection Rate : 0.2083
## Detection Prevalence : 0.3359
## Balanced Accuracy : 0.7005
##
## 'Positive' Class : pos
##
Average error from the predicted results for 10-fold cross-validation:
boosting_cv$error
## [1] 0.2682292
We observe the same phenomenon as with the bagging method: accuracy decreases and error increases when we apply 10-fold cross-validation to boosting. Without cross-validation, we are often overconfident about how our model will perform on a new dataset.
Let’s create a table that summarizes the results and visualize the findings:
ensemble_summary = bind_cols(accuracy = c(0.7516, 0.7398, 0.7582, 0.7435),
error = c(0.2780488, 0.2796748, 0.2418301, 0.2565104),
method = c("Bagging", "Bagging with 10 folds CV", "Boosting", "Boosting with 10 folds CV"))
ensemble_summary
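One way to visualize the summary table (a sketch; the original plot code is not shown):

```r
library(ggplot2)

# bar chart comparing accuracy across the four ensemble setups
ggplot(ensemble_summary, aes(x = method, y = accuracy)) +
  geom_col(fill = "steelblue") +
  coord_flip() +  # horizontal bars keep the long method names readable
  labs(title = "Ensemble method comparison", x = NULL, y = "Accuracy")
```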
From the results table, for this particular dataset: if I were to select one best method, I would select "Boosting with 10-fold CV" because of its accuracy and the assurance that the model has not overfit the training dataset.
Perhaps we can compare the two ensemble techniques, bagging and boosting, by comparing the feature importance reported by each.
mybag_adabag = adabag::bagging(diabetes ~ ., data = diabetes_train) # re-run bagging with adabag, since ipred does not report feature importance
barplot(mybag_adabag$importance, main = "Bagging Relative Feature Importance")
barplot(myboosting$importance, main = "Boosting Relative Feature Importance")
We can see that feature glucose
is disproportionately significant in the Bagging method, while the importance of features in Boosting method is more evenly spread. This result suggests that Bagging has a tendency to bias toward one predictor. I conclude that for a complex problem such as predicting diabetes, a single predictor should not reign over other predictors, therefore, boosting would be a better option to reduce bias in my model.
References
Lantz, Brett (2015). Machine Learning with R: Discover How to Build Machine Learning Algorithms, Prepare Data, and Dig Deep into Data Prediction Techniques with R. http://lumen.regis.edu/record=b1734246~S3.
Newman, D.J., Hettich, S., Blake, C.L. & Merz, C.J. (1998). UCI Repository of Machine Learning Databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science.