Introduction

Inspiration

This project focuses on modeling and predicting heart disease. Although it is an unfamiliar area for one of us, the topic interests both of us. Heart disease is a leading cause of death worldwide. According to the CDC, one person dies every 36 seconds in the United States from cardiovascular disease (Retrieved 21 April 2022, from https://www.cdc.gov/heartdisease/facts.htm). One member of the group, Brook Tilahun, is a Neuroscience student and researcher with some knowledge of this topic, which provided useful guidance in navigating this project.

Data Access

We used the 2020 BRFSS survey data from the CDC website (Retrieved 21 April 2022, from https://www.cdc.gov/brfss/annual_data/annual_2020.html), which is freely available to every user. The data we use after removing the unnecessary variables is also available at the GitHub link listed in this project. The data set is reliable and extensive, containing 401,958 rows and 279 columns. The data was collected through annual telephone surveys. One possible source of bias is sampling that over-represents some parts of the population. Most of the models we use apply weights to account for this, so this effect should be negligible.

Tidying the Data

The dataset included many variables that are not relevant to this analysis, so we filtered it to keep only key indicators of health and socioeconomic status. We kept the following variables:

  • Heart Attack (Target variable)
  • Coronary Heart Disease
  • Stroke
  • Sex
  • State
  • Race
  • Age
  • BMI
  • Smoking
  • Drinking
  • Asthma
  • Kidney disease
  • Cancer (Any cancer)
  • Skin cancer
  • COPD
  • General Health
  • Physical health
  • Mental Health
  • Diabetic
  • Exercise (Any Exercise)
  • Marital status (Marital_Status)
  • Income (Income_Level)

Hypothesis

Is a heart attack really caused by factor X? (X is any factor that people commonly blame for heart attacks.)

In this project we set out to investigate the rumors and common knowledge about what causes heart attacks. By modeling and predicting with different methods, we aim to come up with a list of variables, single or combined, and models that best predict a heart attack. The data gathered from the CDC provides several variables that help us test our hypothesis.

Data Exploration

After cleaning the data, the data looks like the following:

Dataset View

Main data

We have included the steps used to process the main data from the CDC into the clean data shown in the Cleaned Data section. There is no need to run this unless you want to add or change variables.

library(haven)   # read_xpt()
library(dplyr)   # select(), rename(), mutate(), %>%
library(readr)   # write_csv()

# Read the raw 2020 survey file downloaded from the CDC
Data_2020 <- read_xpt("LLCP2020.XPT")

# Categorical variables to convert to factors (_BMI5 stays numeric)
cols <- c("CVDINFR4", "CVDCRHD4", "CVDSTRK3", "_STATE", "SEXVAR", "_IMPRACE",
          "_AGE_G", "GENHLTH", "_PHYS14D", "_MENT14D", "_SMOKER3", "_RFDRHV7",
          "ASTHMA3", "CHCKDNY2", "CHCOCNCR", "CHCSCNCR", "DIABETE4", "CHCCOPD2",
          "EXERANY2", "MARITAL", "INCOME2")
Data_2020[cols] <- lapply(Data_2020[cols], factor)

# Keep the variables of interest, give them readable names, and rescale BMI
tidy_Data2020 <- Data_2020 %>%
  dplyr::select(all_of(c(cols, "_BMI5"))) %>%
  rename(Heart_Attack = "CVDINFR4", Coronary_Heart_Disease = "CVDCRHD4",
         Stroke = "CVDSTRK3", SEX = "SEXVAR", State = "_STATE",
         Race = "_IMPRACE", Age = "_AGE_G", BMI = "_BMI5",
         General_Health = "GENHLTH", Physical_Health = "_PHYS14D",
         Mental_Health = "_MENT14D", Smoking = "_SMOKER3",
         Drinking = "_RFDRHV7", Asthma = "ASTHMA3",
         Kidney_Disease = "CHCKDNY2", Any_Cancer = "CHCOCNCR",
         Skin_Cancer = "CHCSCNCR", Diabetic = "DIABETE4", COPD = "CHCCOPD2",
         Any_Excercise = "EXERANY2", Marital_Status = "MARITAL",
         Income_Level = "INCOME2") %>%
  mutate(BMI = BMI / 100)  # _BMI5 is stored with two implied decimal places

# Column order used in the written file
new_cols <- c("Heart_Attack", "Coronary_Heart_Disease", "Stroke", "SEX",
              "State", "Race", "Age", "BMI", "General_Health",
              "Physical_Health", "Mental_Health", "Smoking", "Drinking",
              "Asthma", "Kidney_Disease", "Any_Cancer", "Skin_Cancer",
              "Diabetic", "COPD", "Any_Excercise", "Marital_Status",
              "Income_Level")

write_csv(tidy_Data2020[new_cols], "tidy_Data2020.csv")

Cleaned Data

Warning in recode.numeric(.x, !!!values, .default = .default, .missing
= .missing): NAs introduced by coercion

After re-leveling the data so that each numeric code is replaced by the value it represents in the CDC code book, we removed the observations with missing or inconclusive responses. This left 275,388 observations, as the summary below shows.
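The re-leveling step can be sketched in base R as follows. This is a minimal illustration, not the report's own cleaning code: the data frame `toy` and its values are made up, and the codes 1 = Yes, 2 = No, 7 = Don't know, 9 = Refused are the ones the CDC code book lists for CVDINFR4.

```r
# Toy responses using the survey's numeric codes (made-up values)
toy <- data.frame(CVDINFR4 = c(1, 2, 2, 7, 9, 2, NA))

# Map codes to labels; 7 and 9 are "Don't know" and "Refused"
toy$Heart_Attack <- factor(toy$CVDINFR4,
                           levels = c(1, 2, 7, 9),
                           labels = c("Yes", "No", "Don't know", "Refused"))

# Drop rows that are missing or inconclusive, then drop the unused levels
clean <- toy[!is.na(toy$Heart_Attack) &
               toy$Heart_Attack %in% c("Yes", "No"), , drop = FALSE]
clean$Heart_Attack <- droplevels(clean$Heart_Attack)

table(clean$Heart_Attack)
```

Dropping the unused levels afterwards keeps empty categories such as "Refused" out of later summaries and model matrices.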

Visualize the data

We visualized the data with geom_bar from the ggplot2 package. We used position = "fill" to scale each bar to proportions, which keeps groups comparable when one level of a variable has many more observations than another.
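As a rough sketch of what a filled bar chart displays, the within-group proportions can be computed directly with prop.table. The data frame `toy` below is made up for illustration, and the plotting call is only one plausible form of the chart, drawn only if ggplot2 is installed.

```r
# Toy data (made-up values) with a rare outcome and unequal group sizes
toy <- data.frame(
  Stroke       = factor(c(rep("Yes", 10), rep("No", 90)),
                        levels = c("Yes", "No")),
  Heart_Attack = factor(c(rep("Yes", 4), rep("No", 6),
                          rep("Yes", 5), rep("No", 85)),
                        levels = c("Yes", "No"))
)

# Within-group proportions: each row sums to 1, like each filled bar
props <- prop.table(table(toy$Stroke, toy$Heart_Attack), margin = 1)
print(round(props, 3))

# The corresponding chart, drawn only if ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(toy, ggplot2::aes(x = Stroke, fill = Heart_Attack)) +
    ggplot2::geom_bar(position = "fill") +
    ggplot2::labs(y = "Proportion")
  print(p)
}
```

Without the fill scaling, the much larger "No" stroke group would dominate the chart and hide the difference in heart attack rates between the two groups.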

Looking at these graphs, we noticed that Coronary_Heart_Disease matches Heart_Attack exactly in this data (the summary below shows identical counts), so we remove this variable from the dataset moving forward. Some interesting associations with the target variable are:

  • Those who had a stroke were more likely to have had a heart attack
  • Males were more likely to have had a heart attack
  • Being White or Native American was associated with a higher likelihood of a heart attack
  • Most heart attacks occurred in those over the age of 45
  • Those with a higher income level had a lower occurrence of heart attacks

State does not provide valuable information and it increases the computation required, so we also remove it moving forward.

Summary

 Heart_Attack Coronary_Heart_Disease Stroke           SEX        
 Yes: 15491   Yes       : 15491      Yes: 10048   Male  :133118  
 No :259897   No        :259897      No :265340   Female:142270  
              Refused   :     0                                  
              Don't know:     0                                  
                                                                 
                                                                 
                                                                 
        State                     Race                 Age       
 Minnesota : 11369   White          :211278   18 to 24   :15743  
 Nebraska  : 11235   Black          : 19923   25 to 34   :31724  
 Ohio      :  9759   Asian          :  6864   35 to 44   :38164  
 New York  :  9270   Native American:  4590   45 to 54   :43663  
 Maryland  :  9203   Hispanic       : 23318   55 to 64   :55779  
 Washington:  9100   Other race     :  9415   65 or older:90315  
 (Other)   :215452                                               
      BMI          General_Health            Physical_Health  
 Min.   :12.02   Excellent:58079   0 days not good.  :194702  
 1st Qu.:24.14   Very good:99048   1-13 days not good: 51850  
 Median :27.41   Good     :79636   14+ days not good : 28836  
 Mean   :28.46   Fair     :29197                              
 3rd Qu.:31.62   Poor     : 9428                              
 Max.   :94.66                                                
                                                              
            Mental_Health               Smoking       Drinking     Asthma      
 0 Days not good   :174921   Smokes Everyday: 28134   No :255792   Yes: 36968  
 1-13 Days not good: 66794   Smokes Somedays: 10572   Yes: 19596   No :238420  
 14+ Days not good : 33673   Former Smoker  : 76253                            
                             Never Smoked   :160429                            
                                                                               
                                                                               
                                                                               
 Kidney_Disease Any_Cancer   Skin_Cancer             Diabetic       COPD       
 Yes: 10090     Yes: 24659   Yes: 25170   Yes            : 34948   Yes: 20751  
 No :265298     No :250729   No :250218   Yes,Pregnant   :  2256   No :254637  
                                          No             :232557               
                                          No,pre-diabetic:  5627               
                                                                               
                                                                               
                                                                               
 Any_Excercise          Marital_Status      Income_Level   
 Yes:215088    Married         :147537   > $75,000:105448  
 No : 60300    Divorced        : 36779   < $75,000: 45751  
               Widowed         : 26761   < $50,000: 37540  
               Separated       :  5312   < $35,000: 26214  
               Never Married   : 47968   < $25,000: 22674  
               Unmarried Couple: 11031   < $20,000: 16951  
                                         (Other)  : 20810  

From the summary, one can see the levels and occurrence for each variable and also note that BMI is the only numerical variable.

We keep two data sets here because Multiple Correspondence Analysis (MCA) works on categorical variables only, so we had to remove BMI for that analysis.
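A minimal sketch of building the categorical-only data set is shown below. The toy `model_data` values and the name `mca_data` are illustrative assumptions. The resulting data frame could then be passed to an MCA routine such as FactoMineR::MCA(), if that is the package in use.

```r
# Toy stand-in for the cleaned data (made-up values)
model_data <- data.frame(
  Heart_Attack = factor(c("Yes", "No", "No", "No", "Yes", "No")),
  Stroke       = factor(c("No", "No", "Yes", "No", "Yes", "No")),
  SEX          = factor(c("Male", "Female", "Female", "Male", "Male", "Female")),
  BMI          = c(31.2, 24.5, 27.9, 22.1, 35.0, 26.3)
)

# Keep only the factor columns, which drops the numeric BMI
is_categorical <- sapply(model_data, is.factor)
mca_data <- model_data[, is_categorical, drop = FALSE]

str(mca_data)
```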

View possible relationships

Multiple Correspondence Analysis

The scree plot shows that the first three dimensions account for most of the explained variation, which makes this multiple correspondence analysis a relevant one. The biplot shows the global pattern of the data: similar and dissimilar variable categories are spread fairly evenly. When we color the individuals in the biplot by the target variable, the two groups overlap and sit close to the center. We take this as a rough indication that the data may be approximately normally distributed, so models that rely on that assumption may perform better on prediction.

Applying several Models

Create A Data Partition

We will start applying prediction models but first we need to partition the data into a train set and a test set.
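The row counts in the outputs that follow (165,234 training rows and 110,154 test rows) are consistent with a stratified 60/40 split. The base-R sketch below shows one way to make such a split. The 0.6 proportion, the seed, and the toy data are assumptions, and the report may instead have used a helper such as caret::createDataPartition.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Toy stand-in for the cleaned data: 6% "Yes" (made-up values)
model_data <- data.frame(
  Heart_Attack = factor(c(rep("Yes", 60), rep("No", 940)),
                        levels = c("Yes", "No")),
  BMI          = round(rnorm(1000, mean = 28, sd = 6), 2)
)

# Sample 60% of the row indices within each outcome class
idx_by_class <- split(seq_len(nrow(model_data)), model_data$Heart_Attack)
train_idx <- unlist(lapply(idx_by_class, function(i) {
  sample(i, size = round(0.6 * length(i)))
}))

train_set <- model_data[train_idx, ]
test_set  <- model_data[-train_idx, ]

# Both sets keep the same share of heart attacks
prop.table(table(train_set$Heart_Attack))
prop.table(table(test_set$Heart_Attack))
```

Sampling within each class matters here because the outcome is rare: a plain random split could leave the two sets with noticeably different heart attack rates.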

Logistic Regression


Call:
glm(formula = Heart_Attack ~ ., family = binomial, data = train_set)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.7189   0.1013   0.1946   0.3406   1.9850  

Coefficients:
                                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)                        3.738237   0.254035  14.715  < 2e-16 ***
StrokeNo                           0.803174   0.035902  22.372  < 2e-16 ***
SEXFemale                          0.672365   0.024966  26.931  < 2e-16 ***
RaceBlack                          0.397894   0.051043   7.795 6.43e-15 ***
RaceAsian                          0.472789   0.114711   4.122 3.76e-05 ***
RaceNative American                0.088935   0.086994   1.022 0.306633    
RaceHispanic                       0.191840   0.053088   3.614 0.000302 ***
RaceOther race                     0.104972   0.068378   1.535 0.124740    
Age25 to 34                       -0.674460   0.241744  -2.790 0.005271 ** 
Age35 to 44                       -1.149853   0.230794  -4.982 6.29e-07 ***
Age45 to 54                       -1.935581   0.224652  -8.616  < 2e-16 ***
Age55 to 64                       -2.548593   0.222761 -11.441  < 2e-16 ***
Age65 or older                    -3.147361   0.222554 -14.142  < 2e-16 ***
BMI                               -0.006424   0.001849  -3.473 0.000514 ***
General_HealthVery good           -0.588992   0.055645 -10.585  < 2e-16 ***
General_HealthGood                -1.212904   0.054449 -22.276  < 2e-16 ***
General_HealthFair                -1.668887   0.059548 -28.026  < 2e-16 ***
General_HealthPoor                -2.044334   0.069912 -29.241  < 2e-16 ***
Physical_Health1-13 days not good -0.129326   0.031022  -4.169 3.06e-05 ***
Physical_Health14+ days not good  -0.157897   0.036753  -4.296 1.74e-05 ***
Mental_Health1-13 Days not good   -0.004806   0.030902  -0.156 0.876402    
Mental_Health14+ Days not good    -0.092151   0.036785  -2.505 0.012240 *  
SmokingSmokes Somedays            -0.065034   0.068572  -0.948 0.342922    
SmokingFormer Smoker              -0.170148   0.040754  -4.175 2.98e-05 ***
SmokingNever Smoked                0.050066   0.041680   1.201 0.229675    
DrinkingYes                        0.328041   0.058205   5.636 1.74e-08 ***
AsthmaNo                           0.132616   0.032543   4.075 4.60e-05 ***
Kidney_DiseaseNo                   0.558186   0.037899  14.728  < 2e-16 ***
Any_CancerNo                       0.039166   0.031280   1.252 0.210523    
Skin_CancerNo                      0.235491   0.031084   7.576 3.56e-14 ***
DiabeticYes,Pregnant               0.585918   0.196359   2.984 0.002846 ** 
DiabeticNo                         0.463120   0.027102  17.088  < 2e-16 ***
DiabeticNo,pre-diabetic            0.297373   0.070212   4.235 2.28e-05 ***
COPDNo                             0.542675   0.031836  17.046  < 2e-16 ***
Any_ExcerciseNo                    0.035452   0.026459   1.340 0.180284    
Marital_StatusDivorced             0.090296   0.035045   2.577 0.009978 ** 
Marital_StatusWidowed             -0.140205   0.033892  -4.137 3.52e-05 ***
Marital_StatusSeparated            0.069900   0.086876   0.805 0.421057    
Marital_StatusNever Married        0.197540   0.046834   4.218 2.47e-05 ***
Marital_StatusUnmarried Couple     0.122729   0.089223   1.376 0.168967    
Income_Level< $15,000             -0.012113   0.069279  -0.175 0.861199    
Income_Level< $20,000              0.057649   0.066201   0.871 0.383851    
Income_Level< $25,000              0.076649   0.064748   1.184 0.236494    
Income_Level< $35,000              0.118714   0.064984   1.827 0.067727 .  
Income_Level< $50,000              0.160248   0.064157   2.498 0.012498 *  
Income_Level< $75,000              0.113646   0.064663   1.758 0.078829 .  
Income_Level> $75,000              0.156166   0.063517   2.459 0.013946 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 71557  on 165233  degrees of freedom
Residual deviance: 56123  on 165187  degrees of freedom
AIC: 56217

Number of Fisher Scoring iterations: 8
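A note on reading the coefficients above: when the response is a factor, glm() with family = binomial treats the first level as failure and models the probability of the other level. Heart_Attack has levels Yes then No, so a positive estimate raises the odds of not having had a heart attack. The short worked example below exponentiates a few of the reported estimates; the object names are illustrative.

```r
# Estimates copied from the coefficient table above
est <- c(StrokeNo             = 0.803174,
         SEXFemale            = 0.672365,
         `Age65 or older`     = -3.147361,
         General_HealthPoor   = -2.044334)

# Odds ratios for the modeled outcome, Heart_Attack = "No"
or_no <- exp(est)

# Inverting the sign gives odds ratios for Heart_Attack = "Yes"
or_yes <- exp(-est)

round(cbind(or_no, or_yes), 2)
```

Read this way, the negative estimates for the older age groups and for poorer general health correspond to higher odds of a heart attack relative to the reference levels.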
Setting levels: control = Yes, case = No
Setting direction: controls < cases


Call:
roc.default(response = test_set$Heart_Attack, predictor = phat.log,     plot = TRUE)

Data: phat.log in 6196 controls (test_set$Heart_Attack Yes) < 103958 cases (test_set$Heart_Attack No).
Area under the curve: 0.8474
Confusion Matrix and Statistics

          Reference
Prediction    Yes     No
       Yes    306    263
       No    5890 103695
                                          
               Accuracy : 0.9441          
                 95% CI : (0.9428, 0.9455)
    No Information Rate : 0.9438          
    P-Value [Acc > NIR] : 0.2896          
                                          
                  Kappa : 0.0818          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.049387        
            Specificity : 0.997470        
         Pos Pred Value : 0.537786        
         Neg Pred Value : 0.946252        
             Prevalence : 0.056249        
         Detection Rate : 0.002778        
   Detection Prevalence : 0.005165        
      Balanced Accuracy : 0.523428        
                                          
       'Positive' Class : Yes             
                                          

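The logistic model used all the variables in the model data and returned an accuracy of 0.9441, which is only marginally above the no-information rate of 0.9438. Its sensitivity was 0.049, so it identified about 5% of the actual heart attacks. Most coefficients were highly significant (p < 0.001). However, some were not significant at the 0.05 level, including most Income_Level categories (only < $50,000 and > $75,000 reached p < 0.05), Any_Cancer No, Any_Excercise No, Marital_Status Separated and Unmarried Couple, Race Native American and Other race, and Smoking Smokes Somedays and Never Smoked.

LDA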

Call:
lda(Heart_Attack ~ ., data = train_set)

Prior probabilities of groups:
       Yes         No 
0.05625356 0.94374644 

Group means:
     StrokeNo SEXFemale  RaceBlack   RaceAsian RaceNative American RaceHispanic
Yes 0.8433566 0.3941904 0.05454545 0.009252286          0.01861216   0.05131791
No  0.9699049 0.5228455 0.07305421 0.026324396          0.01641026   0.08667492
    RaceOther race Age25 to 34 Age35 to 44 Age45 to 54 Age55 to 64
Yes     0.02958580   0.0105433  0.02399139  0.07315761   0.2125874
No      0.03438524   0.1220349  0.14542866  0.16212109   0.2027203
    Age65 or older      BMI General_HealthVery good General_HealthGood
Yes      0.6774610 29.64988               0.1850457          0.3461001
No       0.3074087 28.37857               0.3717928          0.2851307
    General_HealthFair General_HealthPoor Physical_Health1-13 days not good
Yes          0.2678860         0.15556751                         0.2119419
No           0.0961594         0.02692078                         0.1863164
    Physical_Health14+ days not good Mental_Health1-13 Days not good
Yes                       0.28251748                       0.1922539
No                        0.09407525                       0.2446405
    Mental_Health14+ Days not good SmokingSmokes Somedays SmokingFormer Smoker
Yes                      0.1607316             0.03873050            0.4350726
No                       0.1195211             0.03879722            0.2681048
    SmokingNever Smoked DrinkingYes AsthmaNo Kidney_DiseaseNo Any_CancerNo
Yes           0.4135557  0.03862292 0.817106        0.8607854    0.8146315
No            0.5918083  0.07192556 0.869962        0.9695650    0.9162044
    Skin_CancerNo DiabeticYes,Pregnant DiabeticNo DiabeticNo,pre-diabetic
Yes     0.8040882          0.003119957  0.6246369              0.02947821
No      0.9149283          0.008586691  0.8580086              0.02005271
       COPDNo Any_ExcerciseNo Marital_StatusDivorced Marital_StatusWidowed
Yes 0.7453470       0.3595481              0.1675094            0.20667025
No  0.9351926       0.2093639              0.1314745            0.09107407
    Marital_StatusSeparated Marital_StatusNever Married
Yes              0.01925767                  0.07412587
No               0.01943067                  0.17971130
    Marital_StatusUnmarried Couple Income_Level< $15,000 Income_Level< $20,000
Yes                     0.01635288            0.07401829            0.09628833
No                      0.04089420            0.03724533            0.06006836
    Income_Level< $25,000 Income_Level< $35,000 Income_Level< $50,000
Yes            0.11672942            0.11974180             0.1439484
No             0.08116635            0.09338908             0.1351233
    Income_Level< $75,000 Income_Level> $75,000
Yes             0.1539537             0.2458311
No              0.1667703             0.3912556

Coefficients of linear discriminants:
                                           LD1
StrokeNo                           1.450785679
SEXFemale                          0.428159045
RaceBlack                          0.222440227
RaceAsian                          0.144471937
RaceNative American                0.096099225
RaceHispanic                       0.098807684
RaceOther race                     0.085148702
Age25 to 34                        0.044387831
Age35 to 44                        0.079735009
Age45 to 54                        0.018523525
Age55 to 64                       -0.209055219
Age65 or older                    -0.753632924
BMI                               -0.000627774
General_HealthVery good           -0.060541257
General_HealthGood                -0.370651764
General_HealthFair                -0.966832195
General_HealthPoor                -2.000994249
Physical_Health1-13 days not good -0.065477755
Physical_Health14+ days not good  -0.150798924
Mental_Health1-13 Days not good   -0.021443591
Mental_Health14+ Days not good     0.024097085
SmokingSmokes Somedays            -0.067474066
SmokingFormer Smoker              -0.246407969
SmokingNever Smoked               -0.071629269
DrinkingYes                        0.137439620
AsthmaNo                           0.023404655
Kidney_DiseaseNo                   1.027376398
Any_CancerNo                       0.050632946
Skin_CancerNo                      0.309820136
DiabeticYes,Pregnant               0.598968311
DiabeticNo                         0.629479955
DiabeticNo,pre-diabetic            0.525068691
COPDNo                             0.936155411
Any_ExcerciseNo                    0.003049723
Marital_StatusDivorced             0.069060813
Marital_StatusWidowed             -0.132766173
Marital_StatusSeparated            0.082973601
Marital_StatusNever Married        0.110556467
Marital_StatusUnmarried Couple     0.093517760
Income_Level< $15,000             -0.100698305
Income_Level< $20,000             -0.059827497
Income_Level< $25,000             -0.059937190
Income_Level< $35,000             -0.027820238
Income_Level< $50,000             -0.010705997
Income_Level< $75,000             -0.049091771
Income_Level> $75,000             -0.041036644
Setting levels: control = Yes, case = No
Setting direction: controls > cases


Call:
roc.default(response = test_set$Heart_Attack, predictor = yhat.lda$posterior[,     1], plot = TRUE)

Data: yhat.lda$posterior[, 1] in 6196 controls (test_set$Heart_Attack Yes) > 103958 cases (test_set$Heart_Attack No).
Area under the curve: 0.84
Confusion Matrix and Statistics

          Reference
Prediction    Yes     No
       Yes   1368   2515
       No    4828 101443
                                          
               Accuracy : 0.9333          
                 95% CI : (0.9318, 0.9348)
    No Information Rate : 0.9438          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.2384          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.22079         
            Specificity : 0.97581         
         Pos Pred Value : 0.35230         
         Neg Pred Value : 0.95457         
             Prevalence : 0.05625         
         Detection Rate : 0.01242         
   Detection Prevalence : 0.03525         
      Balanced Accuracy : 0.59830         
                                          
       'Positive' Class : Yes             
                                          

The LDA model used all the variables from the model data and returned an accuracy of 0.9333, with a sensitivity of 0.221.

QDA

Call:
qda(Heart_Attack ~ ., data = train_set)

Prior probabilities of groups:
       Yes         No 
0.05625356 0.94374644 

Group means:
     StrokeNo SEXFemale  RaceBlack   RaceAsian RaceNative American RaceHispanic
Yes 0.8433566 0.3941904 0.05454545 0.009252286          0.01861216   0.05131791
No  0.9699049 0.5228455 0.07305421 0.026324396          0.01641026   0.08667492
    RaceOther race Age25 to 34 Age35 to 44 Age45 to 54 Age55 to 64
Yes     0.02958580   0.0105433  0.02399139  0.07315761   0.2125874
No      0.03438524   0.1220349  0.14542866  0.16212109   0.2027203
    Age65 or older      BMI General_HealthVery good General_HealthGood
Yes      0.6774610 29.64988               0.1850457          0.3461001
No       0.3074087 28.37857               0.3717928          0.2851307
    General_HealthFair General_HealthPoor Physical_Health1-13 days not good
Yes          0.2678860         0.15556751                         0.2119419
No           0.0961594         0.02692078                         0.1863164
    Physical_Health14+ days not good Mental_Health1-13 Days not good
Yes                       0.28251748                       0.1922539
No                        0.09407525                       0.2446405
    Mental_Health14+ Days not good SmokingSmokes Somedays SmokingFormer Smoker
Yes                      0.1607316             0.03873050            0.4350726
No                       0.1195211             0.03879722            0.2681048
    SmokingNever Smoked DrinkingYes AsthmaNo Kidney_DiseaseNo Any_CancerNo
Yes           0.4135557  0.03862292 0.817106        0.8607854    0.8146315
No            0.5918083  0.07192556 0.869962        0.9695650    0.9162044
    Skin_CancerNo DiabeticYes,Pregnant DiabeticNo DiabeticNo,pre-diabetic
Yes     0.8040882          0.003119957  0.6246369              0.02947821
No      0.9149283          0.008586691  0.8580086              0.02005271
       COPDNo Any_ExcerciseNo Marital_StatusDivorced Marital_StatusWidowed
Yes 0.7453470       0.3595481              0.1675094            0.20667025
No  0.9351926       0.2093639              0.1314745            0.09107407
    Marital_StatusSeparated Marital_StatusNever Married
Yes              0.01925767                  0.07412587
No               0.01943067                  0.17971130
    Marital_StatusUnmarried Couple Income_Level< $15,000 Income_Level< $20,000
Yes                     0.01635288            0.07401829            0.09628833
No                      0.04089420            0.03724533            0.06006836
    Income_Level< $25,000 Income_Level< $35,000 Income_Level< $50,000
Yes            0.11672942            0.11974180             0.1439484
No             0.08116635            0.09338908             0.1351233
    Income_Level< $75,000 Income_Level> $75,000
Yes             0.1539537             0.2458311
No              0.1667703             0.3912556
Setting levels: control = Yes, case = No
Setting direction: controls > cases


Call:
roc.default(response = test_set$Heart_Attack, predictor = yhat.qda$posterior[,     1], plot = TRUE)

Data: yhat.qda$posterior[, 1] in 6196 controls (test_set$Heart_Attack Yes) > 103958 cases (test_set$Heart_Attack No).
Area under the curve: 0.8187
Confusion Matrix and Statistics

          Reference
Prediction   Yes    No
       Yes  4650 27227
       No   1546 76731
                                          
               Accuracy : 0.7388          
                 95% CI : (0.7362, 0.7414)
    No Information Rate : 0.9438          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.1657          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.75048         
            Specificity : 0.73810         
         Pos Pred Value : 0.14587         
         Neg Pred Value : 0.98025         
             Prevalence : 0.05625         
         Detection Rate : 0.04221         
   Detection Prevalence : 0.28939         
      Balanced Accuracy : 0.74429         
                                          
       'Positive' Class : Yes             
                                          

Similarly, the QDA model used all the variables in model_data. However, it returned a lower accuracy of 0.7388, while its sensitivity of 0.750 was the highest of all the models.

Naive Bayes for Classification


Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
       Yes         No 
0.05625356 0.94374644 

Conditional probabilities:
     Stroke
Y            Yes         No
  Yes 0.15668029 0.84331971
  No  0.03009811 0.96990189

     SEX
Y          Male    Female
  Yes 0.6057982 0.3942018
  No  0.4771547 0.5228453

     Race
Y           White       Black       Asian Native American    Hispanic
  Yes 0.836470209 0.054581630 0.009303076     0.018659927 0.051355130
  No  0.763139501 0.073056008 0.026327096     0.016413154 0.086676457
     Race
Y      Other race
  Yes 0.029630028
  No  0.034387785

     Age
Y        18 to 24    25 to 34    35 to 44    45 to 54    55 to 64 65 or older
  Yes 0.002312325 0.010593676 0.024037427 0.073187782 0.212572596 0.677296193
  No  0.060288441 0.122035757 0.145429070 0.162121173 0.202719601 0.307405959

     BMI
Y         [,1]     [,2]
  Yes 29.64988 6.643464
  No  28.37857 6.338997

     General_Health
Y      Excellent  Very good       Good       Fair       Poor
  Yes 0.04544232 0.18504974 0.34606077 0.26786771 0.15557946
  No  0.21999596 0.37179006 0.28512936 0.09616106 0.02692356

     Physical_Health
Y     0 days not good. 1-13 days not good 14+ days not good
  Yes       0.50551283         0.21196149        0.28252568
  No        0.71960459         0.18631786        0.09407755

     Mental_Health
Y     0 Days not good 1-13 Days not good 14+ Days not good
  Yes       0.6469639          0.1922767         0.1607594
  No        0.6358355          0.2446414         0.1195232

     Smoking
Y     Smokes Everyday Smokes Somedays Former Smoker Never Smoked
  Yes      0.11267075      0.03877595    0.43503281   0.41352049
  No       0.10129151      0.03879993    0.26810460   0.59180395

     Drinking
Y             No        Yes
  Yes 0.96132745 0.03867255
  No  0.92807169 0.07192831

     Asthma
Y           Yes        No
  Yes 0.1829281 0.8170719
  No  0.1300404 0.8699596

     Kidney_Disease
Y            Yes         No
  Yes 0.13925344 0.86074656
  No  0.03043799 0.96956201

     Any_Cancer
Y            Yes         No
  Yes 0.18540232 0.81459768
  No  0.08379826 0.91620174

     Skin_Cancer
Y            Yes         No
  Yes 0.19594449 0.80405551
  No  0.08507439 0.91492561

     Diabetic
Y             Yes Yes,Pregnant          No No,pre-diabetic
  Yes 0.342744971  0.003173067 0.624556308     0.029525653
  No  0.113353768  0.008589787 0.858000782     0.020055662

     COPD
Y            Yes         No
  Yes 0.25467943 0.74532057
  No  0.06481018 0.93518982

     Any_Excercise
Y           Yes        No
  Yes 0.6404367 0.3595633
  No  0.7906342 0.2093658

     Marital_Status
Y        Married   Divorced    Widowed  Separated Never Married
  Yes 0.51597118 0.16750914 0.20665735 0.01930523    0.07415573
  No  0.53740814 0.13147516 0.09107553 0.01943351    0.17971105
     Marital_Status
Y     Unmarried Couple
  Yes       0.01640138
  No        0.04089662

     Income_Level
Y      < $10,000  < $15,000  < $20,000  < $25,000  < $35,000  < $50,000
  Yes 0.04952145 0.07404022 0.09630068 0.11673298 0.11974406 0.14394021
  No  0.03498394 0.03724758 0.06007003 0.08116748 0.09338989 0.13512309
     Income_Level
Y      < $75,000  > $75,000
  Yes 0.15394128 0.24577912
  No  0.16676927 0.39124873
Setting levels: control = Yes, case = No
Setting direction: controls > cases


Call:
roc.default(response = test_set$Heart_Attack, predictor = phat.naive[,     1], plot = TRUE)

Data: phat.naive[, 1] in 6196 controls (test_set$Heart_Attack Yes) > 103958 cases (test_set$Heart_Attack No).
Area under the curve: 0.8324
Confusion Matrix and Statistics

          Reference
Prediction   Yes    No
       Yes  2431  7318
       No   3765 96640
                                          
               Accuracy : 0.8994          
                 95% CI : (0.8976, 0.9012)
    No Information Rate : 0.9438          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.2536          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.39235         
            Specificity : 0.92961         
         Pos Pred Value : 0.24936         
         Neg Pred Value : 0.96250         
             Prevalence : 0.05625         
         Detection Rate : 0.02207         
   Detection Prevalence : 0.08850         
      Balanced Accuracy : 0.66098         
                                          
       'Positive' Class : Yes             
                                          

The Naive Bayes model has an accuracy of 0.8994 (89.9%). One interesting thing to note is that the conditional probabilities for the factors that logistic regression identified as not significant show how the levels of those factors are distributed within each outcome class. This would be helpful in creating an overall model.
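To show how the classifier combines the probabilities printed above, here is a small worked example for a person who reports both a stroke and COPD. It uses only two predictors for illustration, whereas the fitted model multiplies the conditional probabilities of every predictor. The object names are illustrative.

```r
# Values copied from the Naive Bayes output above
prior  <- c(Yes = 0.05625356, No = 0.94374644)  # a-priori class probabilities
stroke <- c(Yes = 0.15668029, No = 0.03009811)  # P(Stroke = Yes | class)
copd   <- c(Yes = 0.25467943, No = 0.06481018)  # P(COPD = Yes | class)

# Unnormalized class scores: prior times the product of the conditionals
score <- prior * stroke * copd

# Normalize so the two posterior probabilities sum to 1
posterior <- score / sum(score)
round(posterior, 3)
```

Even though only about 5.6% of respondents report a heart attack, these two conditions together move the posterior for "Yes" above one half in this two-predictor illustration.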

Boosted Trees for Classification

Setting levels: control = Yes, case = No
Setting direction: controls > cases


Call:
roc.default(response = test_set$Heart_Attack, predictor = phat.boost.class,     plot = TRUE)

Data: phat.boost.class in 6196 controls (test_set$Heart_Attack Yes) > 103958 cases (test_set$Heart_Attack No).
Area under the curve: 0.85
Confusion Matrix and Statistics

          Reference
Prediction    Yes     No
       Yes    230    187
       No    5966 103771
                                          
               Accuracy : 0.9441          
                 95% CI : (0.9428, 0.9455)
    No Information Rate : 0.9438          
    P-Value [Acc > NIR] : 0.2896          
                                          
                  Kappa : 0.0629          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.037121        
            Specificity : 0.998201        
         Pos Pred Value : 0.551559        
         Neg Pred Value : 0.945634        
             Prevalence : 0.056249        
         Detection Rate : 0.002088        
   Detection Prevalence : 0.003786        
      Balanced Accuracy : 0.517661        
                                          
       'Positive' Class : Yes             
                                          

This model also uses all the variables in model_data and is helpful for analyzing the relative importance graph. It identifies General Health, COPD, Stroke, Diabetes and Kidney Disease as the top 5 most important predictors. The accuracy for this model is 0.9441.

Random Forest

Confusion Matrix and Statistics

          Reference
Prediction    Yes     No
       Yes    179    159
       No    6017 103799
                                          
               Accuracy : 0.9439          
                 95% CI : (0.9426, 0.9453)
    No Information Rate : 0.9438          
    P-Value [Acc > NIR] : 0.4001          
                                          
                  Kappa : 0.0493          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.028890        
            Specificity : 0.998471        
         Pos Pred Value : 0.529586        
         Neg Pred Value : 0.945208        
             Prevalence : 0.056249        
         Detection Rate : 0.001625        
   Detection Prevalence : 0.003068        
      Balanced Accuracy : 0.513680        
                                          
       'Positive' Class : Yes             
                                          

This model has one of the highest accuracies at 0.9439 and identifies sex, general_health, mental_health, BMI and income level as the important variables.

Conclusion

Across the models fitted on all the variables, the best accuracy came from Boosted trees and Logistic regression (both 0.9441), followed by Random forest (0.9439), Linear Discriminant Analysis (0.9333), Naive Bayes (0.8994) and Quadratic Discriminant Analysis (0.7388). Based on these values, we suggest that the best model for predicting Heart_Attack is a combination of Boosted trees, Random forest and Logistic regression.
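The accuracy, sensitivity and specificity figures in the confusion matrices above can be recomputed from their four cell counts, which makes the models easier to compare side by side. The sketch below copies the counts from the reported matrices, with "Yes" as the positive class. The data frame `cm` and its column names are illustrative.

```r
# Counts copied from the confusion matrices above.
# tp = predicted Yes, actual Yes; fp = predicted Yes, actual No
# fn = predicted No,  actual Yes; tn = predicted No,  actual No
cm <- data.frame(
  model = c("Logistic regression", "LDA", "QDA",
            "Naive Bayes", "Boosted trees", "Random forest"),
  tp = c(306, 1368, 4650, 2431, 230, 179),
  fp = c(263, 2515, 27227, 7318, 187, 159),
  fn = c(5890, 4828, 1546, 3765, 5966, 6017),
  tn = c(103695, 101443, 76731, 96640, 103771, 103799)
)

cm$accuracy    <- (cm$tp + cm$tn) / (cm$tp + cm$fp + cm$fn + cm$tn)
cm$sensitivity <- cm$tp / (cm$tp + cm$fn)  # share of heart attacks detected
cm$specificity <- cm$tn / (cm$tn + cm$fp)
cm$balanced    <- (cm$sensitivity + cm$specificity) / 2

# Sort by accuracy, highest first, and print the rounded metrics
cm <- cm[order(-cm$accuracy), ]
metrics <- c("accuracy", "sensitivity", "specificity", "balanced")
print(cbind(cm["model"], round(cm[metrics], 4)), row.names = FALSE)
```

Because only about 5.6% of the test set is "Yes", the no-information rate is 0.9438, so it is worth reading sensitivity and balanced accuracy alongside accuracy when weighing the models.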