This is the third file of a four-part data analysis exercise using the dataset from McKay et al. (2020), found here. This file covers data analysis and model fitting.

Load Data/Packages

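# load necessary packages
library(tidymodels)   # linear_reg(), logistic_reg(), tidy(), glance(), %>%
library(performance)  # compare_performance()
# load and view data
flu_data <- readRDS(here::here("fluanalysis", "data", "flu_data_clean.RDS"))
glimpse(flu_data)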
Rows: 730
Columns: 32
$ SwollenLymphNodes <fct> Yes, Yes, Yes, Yes, Yes, No, No, No, Yes, No, Yes, Y~
$ ChestCongestion   <fct> No, Yes, Yes, Yes, No, No, No, Yes, Yes, Yes, Yes, Y~
$ ChillsSweats      <fct> No, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, Yes, ~
$ NasalCongestion   <fct> No, Yes, Yes, Yes, No, No, No, Yes, Yes, Yes, Yes, Y~
$ CoughYN           <fct> Yes, Yes, No, Yes, No, Yes, Yes, Yes, Yes, Yes, No, ~
$ Sneeze            <fct> No, No, Yes, Yes, No, Yes, No, Yes, No, No, No, No, ~
$ Fatigue           <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Ye~
$ SubjectiveFever   <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, Yes~
$ Headache          <fct> Yes, Yes, Yes, Yes, Yes, Yes, No, Yes, Yes, Yes, Yes~
$ Weakness          <fct> Mild, Severe, Severe, Severe, Moderate, Moderate, Mi~
$ WeaknessYN        <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Ye~
$ CoughIntensity    <fct> Severe, Severe, Mild, Moderate, None, Moderate, Seve~
$ CoughYN2          <fct> Yes, Yes, Yes, Yes, No, Yes, Yes, Yes, Yes, Yes, Yes~
$ Myalgia           <fct> Mild, Severe, Severe, Severe, Mild, Moderate, Mild, ~
$ MyalgiaYN         <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Ye~
$ RunnyNose         <fct> No, No, Yes, Yes, No, No, Yes, Yes, Yes, Yes, No, No~
$ AbPain            <fct> No, No, Yes, No, No, No, No, No, No, No, Yes, Yes, N~
$ ChestPain         <fct> No, No, Yes, No, No, Yes, Yes, No, No, No, No, Yes, ~
$ Diarrhea          <fct> No, No, No, No, No, Yes, No, No, No, No, No, No, No,~
$ EyePn             <fct> No, No, No, No, Yes, No, No, No, No, No, Yes, No, Ye~
$ Insomnia          <fct> No, No, Yes, Yes, Yes, No, No, Yes, Yes, Yes, Yes, Y~
$ ItchyEye          <fct> No, No, No, No, No, No, No, No, No, No, No, No, Yes,~
$ Nausea            <fct> No, No, Yes, Yes, Yes, Yes, No, No, Yes, Yes, Yes, Y~
$ EarPn             <fct> No, Yes, No, Yes, No, No, No, No, No, No, No, Yes, Y~
$ Hearing           <fct> No, Yes, No, No, No, No, No, No, No, No, No, No, No,~
$ Pharyngitis       <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, No, No, Yes, ~
$ Breathless        <fct> No, No, Yes, No, No, Yes, No, No, No, Yes, No, Yes, ~
$ ToothPn           <fct> No, No, Yes, No, No, No, No, No, Yes, No, No, Yes, N~
$ Vision            <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, ~
$ Vomit             <fct> No, No, No, No, No, No, Yes, No, No, No, Yes, Yes, N~
$ Wheeze            <fct> No, No, No, Yes, No, Yes, No, No, No, No, No, Yes, N~
$ BodyTemp          <dbl> 98.3, 100.4, 100.8, 98.8, 100.5, 98.4, 102.5, 98.4, ~

Model Fitting

Body Temperature

First, I want to fit a linear regression model using our first outcome of interest, body temperature, and a single predictor variable, runny nose.

flu_lm1 <- linear_reg() %>%
  set_engine("lm") %>%
  fit(BodyTemp ~ RunnyNose, data = flu_data)
kableExtra::kable(tidy(flu_lm1)) %>%
  kableExtra::kable_classic(full_width = FALSE, font = "Arial")
term estimate std.error statistic p.value
(Intercept) 99.1431280 0.0819076 1210.425819 0.00000
RunnyNoseYes -0.2926463 0.0971409 -3.012595 0.00268
glance(flu_lm1)
# A tibble: 1 x 12
  r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
1    0.0123        0.0110  1.19      9.08 0.00268     1 -1162. 2329. 2343.
# i 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

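With an R-squared of about 0.012 and a residual standard error (sigma) of 1.19, this model explains very little of the variation in body temperature, even though the runny nose coefficient itself is statistically significant (p = 0.003). Next, I want to put all of the other variables into a global model and see whether it explains the data better than the first model.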
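To make the single-predictor fit concrete, here is a short worked reading of the coefficient table above. It only rearranges the reported estimates; the sigma-to-RMSE step assumes the usual definitions (residual sum of squares divided by n - 2 for sigma and by n for RMSE), with n = 730 taken from the row count of the data.

```latex
\[
\widehat{\text{BodyTemp}} = 99.143 - 0.293 \cdot \mathbf{1}[\text{RunnyNose} = \text{Yes}]
\]
\[
\text{RunnyNose} = \text{No}: \; 99.143
\qquad
\text{RunnyNose} = \text{Yes}: \; 99.143 - 0.293 = 98.850
\]
\[
\hat{\sigma} = \sqrt{\frac{\text{RSS}}{n - 2}},
\qquad
\text{RMSE} = \sqrt{\frac{\text{RSS}}{n}}
= \hat{\sigma}\,\sqrt{\frac{n - 2}{n}}
\approx 1.19 \times \sqrt{\frac{728}{730}} \approx 1.19
\]
```

In other words, the model predicts one mean temperature per group, about 0.29 degrees apart, while the typical residual is about 1.19 degrees, which is why the R-squared is so small.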

flu_lm2 <- linear_reg() %>%
  set_engine("lm") %>%
  fit(BodyTemp ~ . , data = flu_data)
kableExtra::kable(tidy(flu_lm2)) %>%
  kableExtra::kable_classic(full_width = FALSE, font = "Arial")
term estimate std.error statistic p.value
(Intercept) 97.9252434 0.3038043 322.3300394 0.0000000
SwollenLymphNodesYes -0.1653017 0.0919592 -1.7975544 0.0726816
ChestCongestionYes 0.0873264 0.0975461 0.8952326 0.3709727
ChillsSweatsYes 0.2012657 0.1273016 1.5810148 0.1143296
NasalCongestionYes -0.2157711 0.1137978 -1.8960920 0.0583624
CoughYNYes 0.3138934 0.2407384 1.3038777 0.1927070
SneezeYes -0.3619237 0.0982987 -3.6818764 0.0002495
FatigueYes 0.2647620 0.1605576 1.6490163 0.0995962
SubjectiveFeverYes 0.4368372 0.1033982 4.2248066 0.0000271
HeadacheYes 0.0114533 0.1254052 0.0913306 0.9272562
WeaknessMild 0.0182293 0.1891690 0.0963651 0.9232584
WeaknessModerate 0.0989441 0.1978635 0.5000625 0.6171894
WeaknessSevere 0.3734354 0.2307665 1.6182393 0.1060648
WeaknessYNYes NA NA NA NA
CoughIntensityMild 0.0848811 0.2798780 0.3032791 0.7617680
CoughIntensityModerate -0.0613837 0.3019966 -0.2032595 0.8389917
CoughIntensitySevere -0.0372720 0.3140127 -0.1186957 0.9055507
MyalgiaMild 0.1642421 0.1604984 1.0233257 0.3065099
MyalgiaModerate -0.0240642 0.1678337 -0.1433810 0.8860309
MyalgiaSevere -0.1292631 0.2078542 -0.6218934 0.5342159
MyalgiaYNYes NA NA NA NA
RunnyNoseYes -0.0804851 0.1085262 -0.7416190 0.4585687
AbPainYes 0.0315744 0.1402357 0.2251524 0.8219269
ChestPainYes 0.1050706 0.1069797 0.9821547 0.3263654
DiarrheaYes -0.1568064 0.1295451 -1.2104390 0.2265220
EyePnYes 0.1315436 0.1297573 1.0137665 0.3110470
InsomniaYes -0.0068237 0.0907966 -0.0751542 0.9401137
ItchyEyeYes -0.0080161 0.1101909 -0.0727470 0.9420284
NauseaYes -0.0340655 0.1020492 -0.3338147 0.7386200
EarPnYes 0.0937899 0.1138747 0.8236242 0.4104357
HearingYes 0.2322025 0.2220428 1.0457557 0.2960374
PharyngitisYes 0.3175805 0.1213416 2.6172439 0.0090571
BreathlessYes 0.0905257 0.0998375 0.9067309 0.3648634
ToothPnYes -0.0228762 0.1137504 -0.2011084 0.8406726
VisionYes -0.2746251 0.2776810 -0.9889948 0.3230099
VomitYes 0.1652722 0.1514323 1.0913937 0.2754779
WheezeYes -0.0466649 0.1070356 -0.4359755 0.6629899
glance(flu_lm2)
# A tibble: 1 x 12
  r.squared adj.r.squared sigma statistic      p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>        <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.129        0.0860  1.14      3.02 0.0000000420    34 -1116. 2304. 2469.
# i 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
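The NA rows for WeaknessYNYes and MyalgiaYNYes in the table above are what lm reports when a predictor is an exact linear combination of other predictors. A likely cause here is that each yes/no variable is fully determined by its severity counterpart (Weakness, Myalgia), though that is an assumption about how the data were coded. The sketch below reproduces the behavior on hypothetical toy data; the names severity, any_symptom, temp, and toy_fit are made up for illustration and are not part of the flu dataset.

```r
# Hypothetical toy data (not the flu dataset): a severity factor and a
# yes/no factor that is fully determined by it.
severity <- factor(rep(c("None", "Mild", "Severe"), each = 20),
                   levels = c("None", "Mild", "Severe"))
any_symptom <- factor(ifelse(severity == "None", "No", "Yes"),
                      levels = c("No", "Yes"))

set.seed(1)
temp <- 98 + 0.3 * (severity == "Severe") + rnorm(60, sd = 0.5)

toy_fit <- lm(temp ~ severity + any_symptom)

# The yes/no dummy equals severityMild + severitySevere, so it carries no
# extra information and lm() returns NA for its coefficient.
coef(toy_fit)
aliased <- names(coef(toy_fit))[is.na(coef(toy_fit))]
aliased
```

If the same holds in the flu data, dropping the redundant yes/no columns before fitting would remove the NA rows without changing the fitted values.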

Let’s compare these models side-by-side.

compare_performance(flu_lm1, flu_lm2)
# Comparison of Model Performance Indices

Name    | Model |  AIC (weights) | AICc (weights) |  BIC (weights) |    R2 | R2 (adj.) |  RMSE | Sigma
flu_lm1 |   _lm | 2329.3 (<.001) | 2329.4 (<.001) | 2343.1 (>.999) | 0.012 |     0.011 | 1.188 | 1.190
flu_lm2 |   _lm | 2303.8 (>.999) | 2307.7 (>.999) | 2469.2 (<.001) | 0.129 |     0.086 | 1.116 | 1.144

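Even though it is still not a great fit to the data (R-squared = 0.129, adjusted R-squared = 0.086), the global model explains the data better than the first model based on R-squared, RMSE (1.116 vs. 1.188), and AIC (2303.8 vs. 2329.3). BIC, which penalizes additional parameters more heavily, favors the single-predictor model (2343.1 vs. 2469.2).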
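AIC and BIC can rank the two models differently because they penalize parameters differently. Here is a minimal sketch of that arithmetic, using only numbers from the output above. The parameter counts in k are my own tally (estimated coefficients plus the residual variance, which is how R counts parameters for lm), so treat them as an assumption.

```r
n <- 730                             # observations (Rows in the data)
k <- c(flu_lm1 = 3, flu_lm2 = 36)    # estimated coefficients + residual variance

# AIC = 2 * k - 2 * logLik and BIC = log(n) * k - 2 * logLik, so the
# log-likelihood cancels in the difference: BIC - AIC = k * (log(n) - 2).
bic_minus_aic <- k * (log(n) - 2)

# Gaps computed from the (rounded) values in the comparison table.
reported_gap <- c(flu_lm1 = 2343.1 - 2329.3, flu_lm2 = 2469.2 - 2303.8)

round(rbind(bic_minus_aic, reported_gap), 1)
```

With n = 730, each extra parameter costs about 6.6 BIC points but only 2 AIC points, so the 33 additional parameters in the global model are penalized much more heavily under BIC.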


Nausea

Now, I want to fit a logistic regression model using our second outcome of interest, nausea, and a single predictor variable, runny nose.

flu_glm1 <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(Nausea ~ RunnyNose, data = flu_data)
kableExtra::kable(tidy(flu_glm1)) %>%
  kableExtra::kable_classic(full_width = FALSE, font = "Arial")
term estimate std.error statistic p.value
(Intercept) -0.6578078 0.1452003 -4.5303474 0.0000059
RunnyNoseYes 0.0501828 0.1718249 0.2920578 0.7702424
glance(flu_glm1)
# A tibble: 1 x 8
  null.deviance df.null logLik   AIC   BIC deviance df.residual  nobs
          <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
1          945.     729  -472.  949.  958.     945.         728   730
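The logistic coefficients above are on the log-odds scale, which is hard to read directly. As a rough sketch, they can be converted to predicted probabilities and an odds ratio with base R. This assumes that "Yes" is the modeled level of Nausea (the second factor level); the variable names b0, b1, p_no, and p_yes are only for illustration.

```r
b0 <- -0.6578078   # (Intercept) from tidy(flu_glm1)
b1 <-  0.0501828   # RunnyNoseYes from tidy(flu_glm1)

p_no  <- plogis(b0)        # predicted P(Nausea = Yes) without a runny nose
p_yes <- plogis(b0 + b1)   # predicted P(Nausea = Yes) with a runny nose
odds_ratio <- exp(b1)      # odds of nausea, runny nose vs. no runny nose

round(c(p_no = p_no, p_yes = p_yes, odds_ratio = odds_ratio), 3)
```

The two predicted probabilities differ by only about one percentage point and the odds ratio is close to 1, which matches the large p-value for RunnyNoseYes.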

Same as before, I want to fit a global model using nausea and all of the other variables in the dataset and compare it to the first model.

flu_glm2 <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(Nausea ~ . , data = flu_data)
kableExtra::kable(tidy(flu_glm2)) %>%
  kableExtra::kable_classic(full_width = FALSE, font = "Arial")
term estimate std.error statistic p.value
(Intercept) 0.2228705 7.8274090 0.0284731 0.9772848
SwollenLymphNodesYes -0.2510826 0.1960289 -1.2808444 0.2002483
ChestCongestionYes 0.2755537 0.2126616 1.2957376 0.1950659
ChillsSweatsYes 0.2740967 0.2878278 0.9522940 0.3409479
NasalCongestionYes 0.4258174 0.2545609 1.6727521 0.0943761
CoughYNYes -0.1404234 0.5187985 -0.2706705 0.7866445
SneezeYes 0.1767237 0.2103493 0.8401441 0.4008276
FatigueYes 0.2290618 0.3718816 0.6159535 0.5379252
SubjectiveFeverYes 0.2777411 0.2253628 1.2324177 0.2177931
HeadacheYes 0.3312592 0.2848965 1.1627353 0.2449369
WeaknessMild -0.1216059 0.4468864 -0.2721182 0.7855312
WeaknessModerate 0.3108487 0.4544826 0.6839618 0.4939993
WeaknessSevere 0.8231869 0.5104242 1.6127505 0.1067987
WeaknessYNYes NA NA NA NA
CoughIntensityMild -0.2207938 0.5843673 -0.3778339 0.7055540
CoughIntensityModerate -0.3626784 0.6313701 -0.5744307 0.5656764
CoughIntensitySevere -0.9505442 0.6581423 -1.4442837 0.1486592
MyalgiaMild -0.0041462 0.3680943 -0.0112640 0.9910128
MyalgiaModerate 0.2047429 0.3732305 0.5485697 0.5833008
MyalgiaSevere 0.1207580 0.4449265 0.2714112 0.7860748
MyalgiaYNYes NA NA NA NA
RunnyNoseYes 0.0453236 0.2326446 0.1948189 0.8455347
AbPainYes 0.9393036 0.2814632 3.3372164 0.0008462
ChestPainYes 0.0707772 0.2278583 0.3106192 0.7560901
DiarrheaYes 1.0639338 0.2587051 4.1125354 0.0000391
EyePnYes -0.3419910 0.2777198 -1.2314249 0.2181640
InsomniaYes 0.0841752 0.1929851 0.4361747 0.6627100
ItchyEyeYes -0.0633645 0.2325014 -0.2725337 0.7852117
EarPnYes -0.1817189 0.2392074 -0.7596707 0.4474514
HearingYes 0.3230517 0.4524022 0.7140808 0.4751772
PharyngitisYes 0.2753644 0.2660588 1.0349756 0.3006803
BreathlessYes 0.5268015 0.2085792 2.5256661 0.0115479
ToothPnYes 0.4806486 0.2294739 2.0945674 0.0362095
VisionYes 0.1254977 0.5411141 0.2319247 0.8165965
VomitYes 2.4584655 0.3486077 7.0522411 0.0000000
WheezeYes -0.3044348 0.2340838 -1.3005376 0.1934168
BodyTemp -0.0312460 0.0798381 -0.3913668 0.6955261
glance(flu_glm2)
# A tibble: 1 x 8
  null.deviance df.null logLik   AIC   BIC deviance df.residual  nobs
          <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
1          945.     729  -376.  821.  982.     751.         695   730

Let’s see which model is a better fit.

compare_performance(flu_glm1, flu_glm2)
Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
prediction from a rank-deficient fit may be misleading

# Comparison of Model Performance Indices

Name     | Model | AIC (weights) | AICc (weights) | BIC (weights) | Tjur's R2 |  RMSE | Sigma | Log_loss | Score_log | Score_spherical |   PCP
flu_glm1 |  _glm | 948.6 (<.001) |  948.6 (<.001) | 957.8 (>.999) | 1.169e-04 | 0.477 | 1.139 |    0.647 |  -107.871 |           0.012 | 0.545
flu_glm2 |  _glm | 821.5 (>.999) |  825.1 (>.999) | 982.2 (<.001) |     0.247 | 0.414 | 1.040 |    0.515 |      -Inf |           0.002 | 0.658

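Once again, the global model explains the data better than the single-predictor model: AIC drops from 948.6 to 821.5, Tjur's R2 rises from about 0.0001 to 0.247, RMSE falls from 0.477 to 0.414, and the proportion of correct predictions (PCP) rises from 0.545 to 0.658. As with the linear models, BIC favors the simpler model (957.8 vs. 982.2). The rank-deficient warnings are consistent with the NA coefficients in flu_glm2 (WeaknessYNYes and MyalgiaYNYes), which could not be estimated.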
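Tjur's R2 and RMSE are less familiar for logistic models than R-squared is for linear ones, so here is a small sketch of how both can be computed by hand. It uses R's built-in mtcars data rather than the flu dataset, and the model am ~ wt is purely illustrative. Computing RMSE on the probability scale is my reading of what compare_performance reports, not something confirmed from its source.

```r
# Illustration on the built-in mtcars data, not the flu dataset.
toy_glm <- glm(am ~ wt, data = mtcars, family = binomial)
p_hat <- fitted(toy_glm)   # predicted P(am = 1) for each car

# Tjur's R2: mean predicted probability among observed 1s
# minus mean predicted probability among observed 0s.
tjur_r2 <- mean(p_hat[mtcars$am == 1]) - mean(p_hat[mtcars$am == 0])

# RMSE on the probability scale, treating the outcome as 0/1.
rmse <- sqrt(mean((mtcars$am - p_hat)^2))

c(tjur_r2 = tjur_r2, rmse = rmse)
```

A Tjur's R2 near 0, as in flu_glm1, means the model assigns almost the same predicted probability to people with and without nausea; the value of 0.247 for flu_glm2 means the two groups are separated by about 25 percentage points on average.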