Main data
We have included the steps used to process the main data from the CDC into the clean data set described in the Cleaned Data section. There is no need to run these steps unless you want to add or change variables.
library(haven)   # read_xpt()
library(dplyr)   # select(), rename(), mutate(), %>%
library(readr)   # write_csv()

# Read the raw 2020 file downloaded from the CDC
Data_2020 <- read_xpt("LLCP2020.XPT")

# Categorical variables to convert to factors (_BMI5 is left numeric)
cols <- c("CVDINFR4", "CVDCRHD4", "CVDSTRK3", "_STATE", "SEXVAR", "_IMPRACE",
          "_AGE_G", "GENHLTH", "_PHYS14D", "_MENT14D", "_SMOKER3", "_RFDRHV7",
          "ASTHMA3", "CHCKDNY2", "CHCOCNCR", "CHCSCNCR", "DIABETE4", "CHCCOPD2",
          "EXERANY2", "MARITAL", "INCOME2")
Data_2020[cols] <- lapply(Data_2020[cols], factor)

# Keep the variables of interest, give them readable names, and rescale BMI
# (the division by 100 treats _BMI5 as stored with two implied decimal places)
tidy_Data2020 <- Data_2020 %>%
  dplyr::select("CVDINFR4", "CVDCRHD4", "CVDSTRK3", "_STATE", "SEXVAR",
                "_IMPRACE", "_AGE_G", "_BMI5", "GENHLTH", "_PHYS14D",
                "_MENT14D", "_SMOKER3", "_RFDRHV7", "ASTHMA3", "CHCKDNY2",
                "CHCOCNCR", "CHCSCNCR", "DIABETE4", "CHCCOPD2", "EXERANY2",
                "MARITAL", "INCOME2") %>%
  rename(Heart_Attack = "CVDINFR4", Coronary_Heart_Disease = "CVDCRHD4",
         Stroke = "CVDSTRK3", SEX = "SEXVAR", State = "_STATE",
         Race = "_IMPRACE", Age = "_AGE_G", BMI = "_BMI5",
         General_Health = "GENHLTH", Physical_Health = "_PHYS14D",
         Mental_Health = "_MENT14D", Smoking = "_SMOKER3",
         Drinking = "_RFDRHV7", Asthma = "ASTHMA3",
         Kidney_Disease = "CHCKDNY2", Any_Cancer = "CHCOCNCR",
         Skin_Cancer = "CHCSCNCR", Diabetic = "DIABETE4", COPD = "CHCCOPD2",
         Any_Excercise = "EXERANY2", Marital_Status = "MARITAL",
         Income_Level = "INCOME2") %>%
  mutate(BMI = BMI / 100)

# Write the renamed columns, in this order, to a CSV file
new_cols <- c("Heart_Attack", "Coronary_Heart_Disease", "Stroke", "SEX",
              "State", "Race", "Age", "BMI", "General_Health",
              "Physical_Health", "Mental_Health", "Smoking", "Drinking",
              "Asthma", "Kidney_Disease", "Any_Cancer", "Skin_Cancer",
              "Diabetic", "COPD", "Any_Excercise", "Marital_Status",
              "Income_Level")
write_csv(tidy_Data2020[new_cols], "tidy_Data2020.csv")
# Cleaned Data
Warning in recode.numeric(.x, !!!values, .default = .default, .missing
= .missing): NAs introduced by coercion
After re-leveling the data so that each numeric code is replaced by the value it represents in the code book from the CDC website, we removed the variables with missing or inconclusive data.
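The re-leveling step can be sketched on a toy column as follows. This is a minimal illustration, not the code used for the report. The code-to-label mapping (1 = Yes, 2 = No, 7 = Don't know, 9 = Refused) is an assumption about the CDC code book, and the data frame `raw` is hypothetical. Dropping the inconclusive records is one plausible reading of the cleaning step, consistent with the zero counts for "Refused" and "Don't know" in the summary further down.

```r
# Toy stand-in for one raw column; the real data has many more rows.
raw <- data.frame(CVDINFR4 = c(1, 2, 2, 7, 9, 1, NA))

# Assumed code book mapping: 1 = Yes, 2 = No, 7 = Don't know, 9 = Refused.
raw$Heart_Attack <- factor(
  raw$CVDINFR4,
  levels = c(1, 2, 7, 9),
  labels = c("Yes", "No", "Don't know", "Refused")
)

# Keep only conclusive answers; rows with NA are dropped as well.
clean <- raw[raw$Heart_Attack %in% c("Yes", "No"), , drop = FALSE]

# table() still lists the emptied levels, now with a count of zero.
table(clean$Heart_Attack)
```

Calling `droplevels(clean)` afterwards would remove the empty levels if they are not wanted in later output.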
Visualize the data
We visualized the data with bar charts, using geom_bar from the ggplot2 package. We used position = "fill" to scale each bar to proportions, so that levels with very different numbers of observations can still be compared.
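A proportion-scaled bar chart of this kind can be sketched as below. The data frame `toy` and its values are hypothetical, and the choice of Stroke against Heart_Attack is only an example of one predictor plotted against the target.

```r
library(ggplot2)

# Hypothetical toy data; in the report the columns come from the cleaned data.
toy <- data.frame(
  Stroke = factor(c("Yes", "Yes", "Yes", "No", "No", "No", "No", "No")),
  Heart_Attack = factor(c("Yes", "Yes", "No", "No", "No", "No", "No", "Yes"))
)

# position = "fill" stacks the bars and rescales each one to height 1,
# so the y axis reads as a proportion rather than a count.
p <- ggplot(toy, aes(x = Stroke, fill = Heart_Attack)) +
  geom_bar(position = "fill") +
  labs(y = "Proportion")
p
```

With the default `position = "stack"`, the much larger "No" groups in the real data would dominate the chart and hide differences in proportion.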
Looking at these graphs, we noticed that Coronary_Heart_Disease matches Heart_Attack exactly (the summary below shows identical counts for both), so we will remove this variable from the dataset moving forward. Some interesting associations we noticed with the target variable are:
+ Those who have had a stroke are more likely to have had a heart attack
+ Males were more likely to have a heart attack
+ Being White or Native American is associated with a higher likelihood of a heart attack
+ Most heart attacks occur in those over the age of 45
+ Those with a higher income level had fewer occurrences of a heart attack
+ State does not provide valuable information, but it increases the computation required, so we will remove it moving forward
Summary
 Heart_Attack  Coronary_Heart_Disease  Stroke        SEX
 Yes: 15491    Yes       : 15491       Yes: 10048    Male  :133118
 No :259897    No        :259897       No :265340    Female:142270
               Refused   :     0
               Don't know:     0

 State               Race                     Age
 Minnesota : 11369   White          :211278   18 to 24   :15743
 Nebraska  : 11235   Black          : 19923   25 to 34   :31724
 Ohio      :  9759   Asian          :  6864   35 to 44   :38164
 New York  :  9270   Native American:  4590   45 to 54   :43663
 Maryland  :  9203   Hispanic       : 23318   55 to 64   :55779
 Washington:  9100   Other race     :  9415   65 or older:90315
 (Other)   :215452

 BMI             General_Health    Physical_Health
 Min.   :12.02   Excellent:58079   0 days not good.  :194702
 1st Qu.:24.14   Very good:99048   1-13 days not good: 51850
 Median :27.41   Good     :79636   14+ days not good : 28836
 Mean   :28.46   Fair     :29197
 3rd Qu.:31.62   Poor     : 9428
 Max.   :94.66

 Mental_Health               Smoking                  Drinking      Asthma
 0 Days not good   :174921   Smokes Everyday: 28134   No :255792    Yes: 36968
 1-13 Days not good: 66794   Smokes Somedays: 10572   Yes: 19596    No :238420
 14+ Days not good : 33673   Former Smoker  : 76253
                             Never Smoked   :160429

 Kidney_Disease  Any_Cancer   Skin_Cancer  Diabetic                 COPD
 Yes: 10090      Yes: 24659   Yes: 25170   Yes            : 34948   Yes: 20751
 No :265298      No :250729   No :250218   Yes,Pregnant   :  2256   No :254637
                                           No             :232557
                                           No,pre-diabetic:  5627

 Any_Excercise  Marital_Status            Income_Level
 Yes:215088     Married         :147537   > $75,000:105448
 No : 60300     Divorced        : 36779   < $75,000: 45751
                Widowed         : 26761   < $50,000: 37540
                Separated       :  5312   < $35,000: 26214
                Never Married   : 47968   < $25,000: 22674
                Unmarried Couple: 11031   < $20,000: 16951
                                          (Other)  : 20810
From the summary, one can see the levels and occurrence for each variable and also note that BMI is the only numerical variable.
We keep two data sets here because Multiple Correspondence Analysis (MCA) is designed for categorical variables, so we removed BMI from the copy used for MCA.
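Creating the categorical-only copy can be sketched in base R as below. The miniature `model_data` and the name `mca_data` are hypothetical. The sketch drops every numeric column rather than naming BMI explicitly, which gives the same result here because BMI is the only numeric variable.

```r
# Hypothetical miniature of the modelling data: two factors and numeric BMI.
model_data <- data.frame(
  Heart_Attack = factor(c("Yes", "No", "No")),
  SEX = factor(c("Male", "Female", "Male")),
  BMI = c(29.6, 24.1, 31.2)
)

# Keep only the non-numeric (categorical) columns for MCA.
is_num <- vapply(model_data, is.numeric, logical(1))
mca_data <- model_data[, !is_num, drop = FALSE]

names(mca_data)
```

The result could then be passed to an MCA routine, for instance `FactoMineR::MCA(mca_data, graph = FALSE)`; which package the report used is not stated, so that call is only an example.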
View possible relationships
#### Multiple Correspondence Analysis
Warning: ggrepel: 23 unlabeled data points (too many overlaps). Consider
increasing max.overlaps
The scree plot shows that the first three components explain most of the variation in the data, which makes this multiple correspondence analysis a relevant one. The biplot shows the global pattern of the data: similar and dissimilar variables are distributed fairly evenly. When we color the individuals in the biplot by the target variable, the two groups overlap and sit close to the center. We take this to suggest that a normal distribution curve can be fitted to these values, and that models which benefit from normality would perform better on prediction.
Applying several Models
Create A Data Partition
We will start applying prediction models but first we need to partition the data into a train set and a test set.
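The outputs below imply a split of roughly 60% training and 40% test (165,234 training rows against 110,154 test rows) with the class proportions preserved in both parts. A stratified split of that kind can be sketched in base R as follows. The toy data, the seed, and the sampling code are assumptions for illustration; the report may have used a helper such as `caret::createDataPartition` instead.

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible

# Hypothetical toy data with an imbalanced outcome (10% "Yes").
toy <- data.frame(
  Heart_Attack = factor(rep(c("Yes", "No"), times = c(100, 900)),
                        levels = c("Yes", "No")),
  BMI = rnorm(1000, mean = 28, sd = 6)
)

# Stratified 60/40 split: sample 60% of the row indices within each class.
idx_by_class <- split(seq_len(nrow(toy)), toy$Heart_Attack)
train_idx <- unlist(lapply(idx_by_class, function(i) {
  sample(i, size = round(0.6 * length(i)))
}))

train_set <- toy[train_idx, ]
test_set  <- toy[-train_idx, ]

# Both parts keep the original class proportions.
prop.table(table(train_set$Heart_Attack))
prop.table(table(test_set$Heart_Attack))
```

Sampling within each class matters here because heart attacks are rare; a purely random split could leave the two sets with noticeably different shares of "Yes".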
### Logistic regression
Call:
glm(formula = Heart_Attack ~ ., family = binomial, data = train_set)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.7189 0.1013 0.1946 0.3406 1.9850
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.738237 0.254035 14.715 < 2e-16 ***
StrokeNo 0.803174 0.035902 22.372 < 2e-16 ***
SEXFemale 0.672365 0.024966 26.931 < 2e-16 ***
RaceBlack 0.397894 0.051043 7.795 6.43e-15 ***
RaceAsian 0.472789 0.114711 4.122 3.76e-05 ***
RaceNative American 0.088935 0.086994 1.022 0.306633
RaceHispanic 0.191840 0.053088 3.614 0.000302 ***
RaceOther race 0.104972 0.068378 1.535 0.124740
Age25 to 34 -0.674460 0.241744 -2.790 0.005271 **
Age35 to 44 -1.149853 0.230794 -4.982 6.29e-07 ***
Age45 to 54 -1.935581 0.224652 -8.616 < 2e-16 ***
Age55 to 64 -2.548593 0.222761 -11.441 < 2e-16 ***
Age65 or older -3.147361 0.222554 -14.142 < 2e-16 ***
BMI -0.006424 0.001849 -3.473 0.000514 ***
General_HealthVery good -0.588992 0.055645 -10.585 < 2e-16 ***
General_HealthGood -1.212904 0.054449 -22.276 < 2e-16 ***
General_HealthFair -1.668887 0.059548 -28.026 < 2e-16 ***
General_HealthPoor -2.044334 0.069912 -29.241 < 2e-16 ***
Physical_Health1-13 days not good -0.129326 0.031022 -4.169 3.06e-05 ***
Physical_Health14+ days not good -0.157897 0.036753 -4.296 1.74e-05 ***
Mental_Health1-13 Days not good -0.004806 0.030902 -0.156 0.876402
Mental_Health14+ Days not good -0.092151 0.036785 -2.505 0.012240 *
SmokingSmokes Somedays -0.065034 0.068572 -0.948 0.342922
SmokingFormer Smoker -0.170148 0.040754 -4.175 2.98e-05 ***
SmokingNever Smoked 0.050066 0.041680 1.201 0.229675
DrinkingYes 0.328041 0.058205 5.636 1.74e-08 ***
AsthmaNo 0.132616 0.032543 4.075 4.60e-05 ***
Kidney_DiseaseNo 0.558186 0.037899 14.728 < 2e-16 ***
Any_CancerNo 0.039166 0.031280 1.252 0.210523
Skin_CancerNo 0.235491 0.031084 7.576 3.56e-14 ***
DiabeticYes,Pregnant 0.585918 0.196359 2.984 0.002846 **
DiabeticNo 0.463120 0.027102 17.088 < 2e-16 ***
DiabeticNo,pre-diabetic 0.297373 0.070212 4.235 2.28e-05 ***
COPDNo 0.542675 0.031836 17.046 < 2e-16 ***
Any_ExcerciseNo 0.035452 0.026459 1.340 0.180284
Marital_StatusDivorced 0.090296 0.035045 2.577 0.009978 **
Marital_StatusWidowed -0.140205 0.033892 -4.137 3.52e-05 ***
Marital_StatusSeparated 0.069900 0.086876 0.805 0.421057
Marital_StatusNever Married 0.197540 0.046834 4.218 2.47e-05 ***
Marital_StatusUnmarried Couple 0.122729 0.089223 1.376 0.168967
Income_Level< $15,000 -0.012113 0.069279 -0.175 0.861199
Income_Level< $20,000 0.057649 0.066201 0.871 0.383851
Income_Level< $25,000 0.076649 0.064748 1.184 0.236494
Income_Level< $35,000 0.118714 0.064984 1.827 0.067727 .
Income_Level< $50,000 0.160248 0.064157 2.498 0.012498 *
Income_Level< $75,000 0.113646 0.064663 1.758 0.078829 .
Income_Level> $75,000 0.156166 0.063517 2.459 0.013946 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 71557 on 165233 degrees of freedom
Residual deviance: 56123 on 165187 degrees of freedom
AIC: 56217
Number of Fisher Scoring iterations: 8
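A note on reading the signs of these coefficients: when the response of a binomial `glm` is a factor, R treats the first level as failure and the other level as success. Assuming the levels of Heart_Attack are ordered "Yes", "No" (the order shown in the summary), the model estimates the probability of "No", so a positive coefficient points to lower odds of a heart attack. The toy data below is hypothetical and only illustrates this behavior.

```r
# Hypothetical toy data: the outcome factor has levels ordered "Yes", "No".
lv <- c("Yes", "No")
toy <- data.frame(
  Heart_Attack = factor(c("Yes", "Yes", "Yes", "No", "No",
                          "No", "No", "No", "Yes", "No"), levels = lv),
  Stroke = factor(c("Yes", "Yes", "No", "No", "No",
                    "No", "No", "Yes", "Yes", "No"), levels = lv)
)

fit <- glm(Heart_Attack ~ Stroke, family = binomial, data = toy)

# Fitted values are P(Heart_Attack == "No"), because "No" is the second level.
new_obs <- data.frame(Stroke = factor(lv, levels = lv))
p_no <- predict(fit, newdata = new_obs, type = "response")

# Observed share of "No" within each Stroke group, for comparison.
obs_no <- tapply(toy$Heart_Attack == "No", toy$Stroke, mean)

rbind(fitted = unname(p_no), observed = obs_no)
coef(fit)
```

In the toy fit the StrokeNo coefficient is positive because people without a stroke are more often in the "No" group, which matches the direction of StrokeNo in the output above.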
Setting levels: control = Yes, case = No
Setting direction: controls < cases
Call:
roc.default(response = test_set$Heart_Attack, predictor = phat.log, plot = TRUE)
Data: phat.log in 6196 controls (test_set$Heart_Attack Yes) < 103958 cases (test_set$Heart_Attack No).
Area under the curve: 0.8474
Confusion Matrix and Statistics
Reference
Prediction Yes No
Yes 306 263
No 5890 103695
Accuracy : 0.9441
95% CI : (0.9428, 0.9455)
No Information Rate : 0.9438
P-Value [Acc > NIR] : 0.2896
Kappa : 0.0818
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.049387
Specificity : 0.997470
Pos Pred Value : 0.537786
Neg Pred Value : 0.946252
Prevalence : 0.056249
Detection Rate : 0.002778
Detection Prevalence : 0.005165
Balanced Accuracy : 0.523428
'Positive' Class : Yes
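The headline statistics in this block follow directly from the four counts in the confusion matrix. The sketch below recomputes them in base R from the counts printed above; the variable names are illustrative.

```r
# Counts copied from the logistic regression confusion matrix above
# (rows = prediction, columns = reference; "Yes" is the positive class).
cm <- matrix(c(306, 5890, 263, 103695), nrow = 2,
             dimnames = list(Prediction = c("Yes", "No"),
                             Reference  = c("Yes", "No")))

tp <- cm["Yes", "Yes"]; fn <- cm["No", "Yes"]
fp <- cm["Yes", "No"];  tn <- cm["No", "No"]

accuracy    <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)              # share of actual "Yes" that were caught
specificity <- tn / (tn + fp)              # share of actual "No" that were caught
nir         <- max(colSums(cm)) / sum(cm)  # always predicting the majority class
balanced    <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, nir = nir, sensitivity = sensitivity,
        specificity = specificity, balanced = balanced), 4)
```

Because about 94% of respondents are in the "No" class, a model can reach an accuracy close to 0.944 while detecting very few heart attacks, which is why sensitivity and balanced accuracy are worth reading alongside accuracy.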
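The logistic model used all the variables in the model data and returned an accuracy of 0.9441, which is barely above the no-information rate of 0.9438; its sensitivity for the Yes class was only 0.049. Most terms were highly significant (p < 0.001). However, some terms were not significant at the 0.05 level, including most income levels (all except < $50,000 and > $75,000), Any_Excercise No, Any_Cancer No, Mental_Health 1-13 Days not good, and Marital_Status Separated.
### LDA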
Call:
lda(Heart_Attack ~ ., data = train_set)
Prior probabilities of groups:
Yes No
0.05625356 0.94374644
Group means:
StrokeNo SEXFemale RaceBlack RaceAsian RaceNative American RaceHispanic
Yes 0.8433566 0.3941904 0.05454545 0.009252286 0.01861216 0.05131791
No 0.9699049 0.5228455 0.07305421 0.026324396 0.01641026 0.08667492
RaceOther race Age25 to 34 Age35 to 44 Age45 to 54 Age55 to 64
Yes 0.02958580 0.0105433 0.02399139 0.07315761 0.2125874
No 0.03438524 0.1220349 0.14542866 0.16212109 0.2027203
Age65 or older BMI General_HealthVery good General_HealthGood
Yes 0.6774610 29.64988 0.1850457 0.3461001
No 0.3074087 28.37857 0.3717928 0.2851307
General_HealthFair General_HealthPoor Physical_Health1-13 days not good
Yes 0.2678860 0.15556751 0.2119419
No 0.0961594 0.02692078 0.1863164
Physical_Health14+ days not good Mental_Health1-13 Days not good
Yes 0.28251748 0.1922539
No 0.09407525 0.2446405
Mental_Health14+ Days not good SmokingSmokes Somedays SmokingFormer Smoker
Yes 0.1607316 0.03873050 0.4350726
No 0.1195211 0.03879722 0.2681048
SmokingNever Smoked DrinkingYes AsthmaNo Kidney_DiseaseNo Any_CancerNo
Yes 0.4135557 0.03862292 0.817106 0.8607854 0.8146315
No 0.5918083 0.07192556 0.869962 0.9695650 0.9162044
Skin_CancerNo DiabeticYes,Pregnant DiabeticNo DiabeticNo,pre-diabetic
Yes 0.8040882 0.003119957 0.6246369 0.02947821
No 0.9149283 0.008586691 0.8580086 0.02005271
COPDNo Any_ExcerciseNo Marital_StatusDivorced Marital_StatusWidowed
Yes 0.7453470 0.3595481 0.1675094 0.20667025
No 0.9351926 0.2093639 0.1314745 0.09107407
Marital_StatusSeparated Marital_StatusNever Married
Yes 0.01925767 0.07412587
No 0.01943067 0.17971130
Marital_StatusUnmarried Couple Income_Level< $15,000 Income_Level< $20,000
Yes 0.01635288 0.07401829 0.09628833
No 0.04089420 0.03724533 0.06006836
Income_Level< $25,000 Income_Level< $35,000 Income_Level< $50,000
Yes 0.11672942 0.11974180 0.1439484
No 0.08116635 0.09338908 0.1351233
Income_Level< $75,000 Income_Level> $75,000
Yes 0.1539537 0.2458311
No 0.1667703 0.3912556
Coefficients of linear discriminants:
LD1
StrokeNo 1.450785679
SEXFemale 0.428159045
RaceBlack 0.222440227
RaceAsian 0.144471937
RaceNative American 0.096099225
RaceHispanic 0.098807684
RaceOther race 0.085148702
Age25 to 34 0.044387831
Age35 to 44 0.079735009
Age45 to 54 0.018523525
Age55 to 64 -0.209055219
Age65 or older -0.753632924
BMI -0.000627774
General_HealthVery good -0.060541257
General_HealthGood -0.370651764
General_HealthFair -0.966832195
General_HealthPoor -2.000994249
Physical_Health1-13 days not good -0.065477755
Physical_Health14+ days not good -0.150798924
Mental_Health1-13 Days not good -0.021443591
Mental_Health14+ Days not good 0.024097085
SmokingSmokes Somedays -0.067474066
SmokingFormer Smoker -0.246407969
SmokingNever Smoked -0.071629269
DrinkingYes 0.137439620
AsthmaNo 0.023404655
Kidney_DiseaseNo 1.027376398
Any_CancerNo 0.050632946
Skin_CancerNo 0.309820136
DiabeticYes,Pregnant 0.598968311
DiabeticNo 0.629479955
DiabeticNo,pre-diabetic 0.525068691
COPDNo 0.936155411
Any_ExcerciseNo 0.003049723
Marital_StatusDivorced 0.069060813
Marital_StatusWidowed -0.132766173
Marital_StatusSeparated 0.082973601
Marital_StatusNever Married 0.110556467
Marital_StatusUnmarried Couple 0.093517760
Income_Level< $15,000 -0.100698305
Income_Level< $20,000 -0.059827497
Income_Level< $25,000 -0.059937190
Income_Level< $35,000 -0.027820238
Income_Level< $50,000 -0.010705997
Income_Level< $75,000 -0.049091771
Income_Level> $75,000 -0.041036644
Setting levels: control = Yes, case = No
Setting direction: controls > cases
Call:
roc.default(response = test_set$Heart_Attack, predictor = yhat.lda$posterior[, 1], plot = TRUE)
Data: yhat.lda$posterior[, 1] in 6196 controls (test_set$Heart_Attack Yes) > 103958 cases (test_set$Heart_Attack No).
Area under the curve: 0.84
Confusion Matrix and Statistics
Reference
Prediction Yes No
Yes 1368 2515
No 4828 101443
Accuracy : 0.9333
95% CI : (0.9318, 0.9348)
No Information Rate : 0.9438
P-Value [Acc > NIR] : 1
Kappa : 0.2384
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.22079
Specificity : 0.97581
Pos Pred Value : 0.35230
Neg Pred Value : 0.95457
Prevalence : 0.05625
Detection Rate : 0.01242
Detection Prevalence : 0.03525
Balanced Accuracy : 0.59830
'Positive' Class : Yes
The LDA model used all the variables from the model data and returned an accuracy of 0.9333.
### QDA
Call:
qda(Heart_Attack ~ ., data = train_set)
Prior probabilities of groups:
Yes No
0.05625356 0.94374644
Group means:
StrokeNo SEXFemale RaceBlack RaceAsian RaceNative American RaceHispanic
Yes 0.8433566 0.3941904 0.05454545 0.009252286 0.01861216 0.05131791
No 0.9699049 0.5228455 0.07305421 0.026324396 0.01641026 0.08667492
RaceOther race Age25 to 34 Age35 to 44 Age45 to 54 Age55 to 64
Yes 0.02958580 0.0105433 0.02399139 0.07315761 0.2125874
No 0.03438524 0.1220349 0.14542866 0.16212109 0.2027203
Age65 or older BMI General_HealthVery good General_HealthGood
Yes 0.6774610 29.64988 0.1850457 0.3461001
No 0.3074087 28.37857 0.3717928 0.2851307
General_HealthFair General_HealthPoor Physical_Health1-13 days not good
Yes 0.2678860 0.15556751 0.2119419
No 0.0961594 0.02692078 0.1863164
Physical_Health14+ days not good Mental_Health1-13 Days not good
Yes 0.28251748 0.1922539
No 0.09407525 0.2446405
Mental_Health14+ Days not good SmokingSmokes Somedays SmokingFormer Smoker
Yes 0.1607316 0.03873050 0.4350726
No 0.1195211 0.03879722 0.2681048
SmokingNever Smoked DrinkingYes AsthmaNo Kidney_DiseaseNo Any_CancerNo
Yes 0.4135557 0.03862292 0.817106 0.8607854 0.8146315
No 0.5918083 0.07192556 0.869962 0.9695650 0.9162044
Skin_CancerNo DiabeticYes,Pregnant DiabeticNo DiabeticNo,pre-diabetic
Yes 0.8040882 0.003119957 0.6246369 0.02947821
No 0.9149283 0.008586691 0.8580086 0.02005271
COPDNo Any_ExcerciseNo Marital_StatusDivorced Marital_StatusWidowed
Yes 0.7453470 0.3595481 0.1675094 0.20667025
No 0.9351926 0.2093639 0.1314745 0.09107407
Marital_StatusSeparated Marital_StatusNever Married
Yes 0.01925767 0.07412587
No 0.01943067 0.17971130
Marital_StatusUnmarried Couple Income_Level< $15,000 Income_Level< $20,000
Yes 0.01635288 0.07401829 0.09628833
No 0.04089420 0.03724533 0.06006836
Income_Level< $25,000 Income_Level< $35,000 Income_Level< $50,000
Yes 0.11672942 0.11974180 0.1439484
No 0.08116635 0.09338908 0.1351233
Income_Level< $75,000 Income_Level> $75,000
Yes 0.1539537 0.2458311
No 0.1667703 0.3912556
Setting levels: control = Yes, case = No
Setting direction: controls > cases
Call:
roc.default(response = test_set$Heart_Attack, predictor = yhat.qda$posterior[, 1], plot = TRUE)
Data: yhat.qda$posterior[, 1] in 6196 controls (test_set$Heart_Attack Yes) > 103958 cases (test_set$Heart_Attack No).
Area under the curve: 0.8187
Confusion Matrix and Statistics
Reference
Prediction Yes No
Yes 4650 27227
No 1546 76731
Accuracy : 0.7388
95% CI : (0.7362, 0.7414)
No Information Rate : 0.9438
P-Value [Acc > NIR] : 1
Kappa : 0.1657
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.75048
Specificity : 0.73810
Pos Pred Value : 0.14587
Neg Pred Value : 0.98025
Prevalence : 0.05625
Detection Rate : 0.04221
Detection Prevalence : 0.28939
Balanced Accuracy : 0.74429
'Positive' Class : Yes
Similarly, the QDA model used all the variables in model_data. However, it returned a lower accuracy of 0.7388.
Naive Bayes for Classification
Naive Bayes Classifier for Discrete Predictors
Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)
A-priori probabilities:
Y
Yes No
0.05625356 0.94374644
Conditional probabilities:
Stroke
Y Yes No
Yes 0.15668029 0.84331971
No 0.03009811 0.96990189
SEX
Y Male Female
Yes 0.6057982 0.3942018
No 0.4771547 0.5228453
Race
Y White Black Asian Native American Hispanic
Yes 0.836470209 0.054581630 0.009303076 0.018659927 0.051355130
No 0.763139501 0.073056008 0.026327096 0.016413154 0.086676457
Race
Y Other race
Yes 0.029630028
No 0.034387785
Age
Y 18 to 24 25 to 34 35 to 44 45 to 54 55 to 64 65 or older
Yes 0.002312325 0.010593676 0.024037427 0.073187782 0.212572596 0.677296193
No 0.060288441 0.122035757 0.145429070 0.162121173 0.202719601 0.307405959
BMI
Y [,1] [,2]
Yes 29.64988 6.643464
No 28.37857 6.338997
General_Health
Y Excellent Very good Good Fair Poor
Yes 0.04544232 0.18504974 0.34606077 0.26786771 0.15557946
No 0.21999596 0.37179006 0.28512936 0.09616106 0.02692356
Physical_Health
Y 0 days not good. 1-13 days not good 14+ days not good
Yes 0.50551283 0.21196149 0.28252568
No 0.71960459 0.18631786 0.09407755
Mental_Health
Y 0 Days not good 1-13 Days not good 14+ Days not good
Yes 0.6469639 0.1922767 0.1607594
No 0.6358355 0.2446414 0.1195232
Smoking
Y Smokes Everyday Smokes Somedays Former Smoker Never Smoked
Yes 0.11267075 0.03877595 0.43503281 0.41352049
No 0.10129151 0.03879993 0.26810460 0.59180395
Drinking
Y No Yes
Yes 0.96132745 0.03867255
No 0.92807169 0.07192831
Asthma
Y Yes No
Yes 0.1829281 0.8170719
No 0.1300404 0.8699596
Kidney_Disease
Y Yes No
Yes 0.13925344 0.86074656
No 0.03043799 0.96956201
Any_Cancer
Y Yes No
Yes 0.18540232 0.81459768
No 0.08379826 0.91620174
Skin_Cancer
Y Yes No
Yes 0.19594449 0.80405551
No 0.08507439 0.91492561
Diabetic
Y Yes Yes,Pregnant No No,pre-diabetic
Yes 0.342744971 0.003173067 0.624556308 0.029525653
No 0.113353768 0.008589787 0.858000782 0.020055662
COPD
Y Yes No
Yes 0.25467943 0.74532057
No 0.06481018 0.93518982
Any_Excercise
Y Yes No
Yes 0.6404367 0.3595633
No 0.7906342 0.2093658
Marital_Status
Y Married Divorced Widowed Separated Never Married
Yes 0.51597118 0.16750914 0.20665735 0.01930523 0.07415573
No 0.53740814 0.13147516 0.09107553 0.01943351 0.17971105
Marital_Status
Y Unmarried Couple
Yes 0.01640138
No 0.04089662
Income_Level
Y < $10,000 < $15,000 < $20,000 < $25,000 < $35,000 < $50,000
Yes 0.04952145 0.07404022 0.09630068 0.11673298 0.11974406 0.14394021
No 0.03498394 0.03724758 0.06007003 0.08116748 0.09338989 0.13512309
Income_Level
Y < $75,000 > $75,000
Yes 0.15394128 0.24577912
No 0.16676927 0.39124873
Setting levels: control = Yes, case = No
Setting direction: controls > cases
Call:
roc.default(response = test_set$Heart_Attack, predictor = phat.naive[, 1], plot = TRUE)
Data: phat.naive[, 1] in 6196 controls (test_set$Heart_Attack Yes) > 103958 cases (test_set$Heart_Attack No).
Area under the curve: 0.8324
Confusion Matrix and Statistics
Reference
Prediction Yes No
Yes 2431 7318
No 3765 96640
Accuracy : 0.8994
95% CI : (0.8976, 0.9012)
No Information Rate : 0.9438
P-Value [Acc > NIR] : 1
Kappa : 0.2536
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.39235
Specificity : 0.92961
Pos Pred Value : 0.24936
Neg Pred Value : 0.96250
Prevalence : 0.05625
Detection Rate : 0.02207
Detection Prevalence : 0.08850
Balanced Accuracy : 0.66098
'Positive' Class : Yes
The Naive Bayes model is 89.9% accurate. One of the interesting things to note here is that, by looking at the conditional probabilities for the factors that logistic regression identified as insignificant, we can see how the levels of those factors are distributed within each class. This would be helpful in creating an overall model.
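To show how the printed probabilities turn into a prediction, the sketch below applies Bayes' rule with a single predictor, using the a-priori probabilities and the Stroke table from the output above. It is a simplified worked example: the fitted classifier multiplies in the conditional probabilities of every predictor, not just one.

```r
# Values copied from the Naive Bayes output above.
prior    <- c(Yes = 0.05625356, No = 0.94374644)  # a-priori class probabilities
p_stroke <- c(Yes = 0.15668029, No = 0.03009811)  # P(Stroke = Yes | class)

# Bayes' rule with one predictor: the posterior is prior times likelihood,
# normalised so that the two classes sum to 1.
unnormalised <- prior * p_stroke
posterior <- unnormalised / sum(unnormalised)
posterior
```

Knowing only that a respondent has had a stroke raises the estimated probability of a heart attack from about 0.06 to roughly 0.24, which is still below 0.5, so this predictor alone would not flip the predicted class.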
Boosted Trees for Classification
Setting levels: control = Yes, case = No
Setting direction: controls > cases
Call:
roc.default(response = test_set$Heart_Attack, predictor = phat.boost.class, plot = TRUE)
Data: phat.boost.class in 6196 controls (test_set$Heart_Attack Yes) > 103958 cases (test_set$Heart_Attack No).
Area under the curve: 0.85
Confusion Matrix and Statistics
Reference
Prediction Yes No
Yes 230 187
No 5966 103771
Accuracy : 0.9441
95% CI : (0.9428, 0.9455)
No Information Rate : 0.9438
P-Value [Acc > NIR] : 0.2896
Kappa : 0.0629
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.037121
Specificity : 0.998201
Pos Pred Value : 0.551559
Neg Pred Value : 0.945634
Prevalence : 0.056249
Detection Rate : 0.002088
Detection Prevalence : 0.003786
Balanced Accuracy : 0.517661
'Positive' Class : Yes
This model also uses all the variables in model_data, and its relative importance graph is helpful to analyze. It identifies General health, COPD, Stroke, Diabetes and Kidney Disease as the top 5 predictors. The accuracy for this model is 0.944.
### Random Forest
Confusion Matrix and Statistics
Reference
Prediction Yes No
Yes 179 159
No 6017 103799
Accuracy : 0.9439
95% CI : (0.9426, 0.9453)
No Information Rate : 0.9438
P-Value [Acc > NIR] : 0.4001
Kappa : 0.0493
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.028890
Specificity : 0.998471
Pos Pred Value : 0.529586
Neg Pred Value : 0.945208
Prevalence : 0.056249
Detection Rate : 0.001625
Detection Prevalence : 0.003068
Balanced Accuracy : 0.513680
'Positive' Class : Yes
This model has one of the highest accuracies at 0.9439 and identifies sex, general_health, mental_health, BMI and income level as the important variables.
# Conclusion
Among the models fitted on all the variables, the best accuracy came from Boosted trees and Logistic regression (both 0.9441), followed by Random forest (0.9439), Linear Discriminant Analysis (0.9333), Naive Bayes (0.8994) and Quadratic Discriminant Analysis (0.7388). Based on these values, we suggest that the best model for predicting Heart_Attack is a combination of Boosted trees, Random forest and Logistic regression.
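The comparison can be reproduced from the test-set confusion matrices printed in each model section. The sketch below collects those counts in a data frame and recomputes accuracy, sensitivity and balanced accuracy; the data frame and column names are illustrative.

```r
# Test-set confusion matrix counts copied from the outputs above.
# tp and fn refer to the "Yes" (heart attack) class.
results <- data.frame(
  model = c("Logistic regression", "LDA", "QDA", "Naive Bayes",
            "Boosted trees", "Random forest"),
  tp = c(306, 1368, 4650, 2431, 230, 179),
  fp = c(263, 2515, 27227, 7318, 187, 159),
  fn = c(5890, 4828, 1546, 3765, 5966, 6017),
  tn = c(103695, 101443, 76731, 96640, 103771, 103799)
)

results$n           <- with(results, tp + fp + fn + tn)
results$accuracy    <- with(results, (tp + tn) / n)
results$sensitivity <- with(results, tp / (tp + fn))
results$specificity <- with(results, tn / (tn + fp))
results$balanced    <- with(results, (sensitivity + specificity) / 2)

# Rank by accuracy, the criterion used in the conclusion.
ranked <- results[order(-results$accuracy),
                  c("model", "accuracy", "sensitivity", "balanced")]
print(ranked, digits = 4, row.names = FALSE)
```

Ranking by sensitivity instead gives a different order: QDA has the lowest accuracy but detects the largest share of actual heart attacks (sensitivity about 0.75), so the preferred model depends on which kind of error matters more.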