Setting Up
#Library packages
library(rmdformats)
library(dplyr)
library(tidyverse)
library(GGally)
library(leaps)
library(sjPlot)
library(ggplot2)
library(corrplot)
library(RColorBrewer)
library(car)
library(ggsci)
library(jtools)
library(ppcor)
library(rmarkdown)
#Set working directory
setwd("C:/Users/joshu/Desktop/COLLEGE_GENERAL/ASU_GENERAL/ASU_Spring_2021/CDE498")
#Read in data set
covid <- read.csv("state_covid.csv", head=T)
#Renaming and reformatting data
covid$poverty100 <- covid$poverty * 100
covid$college100 <- covid$college_grad_rate * 100
covid$democrat <- covid$Democrat_shr_2016
covid$republican <- covid$Republican_shr_2016
covid$covid <- covid$COVID_cases_per_100k
Question 1
Variables
Predictor variables: college100, republican, region_short Outcome variable: covid
college100: continuous predictor variable – I believe COVID cases will be negatively correlated with college graduation rates because college-educated individuals tend to get more exposure and training in scientific fields than individuals that are not college educated. This may lead individuals to have more faith and be more up-to-date on scientific findings related to the spread and severity of the disease around the U.S.
republican: continuous predictor variable – I believe states in which a larger proportion of citizens are Republican will have higher rates of COVID since they appear to have less strict COVID-prevention policies like mask mandates and lock downs.
region_short: categorical predictor variable – I believe states in the south and the west are less likely to implement restrictions regarding COVID-19, leading to more cases in these regions.
Summary Statistics
## [1] 35.9683
## [1] 7.52454
## [1] 48.4902
## [1] 11.8918
## [1] 0.1764706
## [1] 0.254902
## [1] 0.254902
## [1] 0.3137255
## [1] 2216.451
## [1] 856.6188
Correlations
variables <- subset(covid, select = c(covid, college100, republican))
cormat <- cor(variables, use = "complete.obs")
print(cormat)
## covid college100 republican
## covid 1.0000000 -0.2829080 0.3216431
## college100 -0.2829080 1.0000000 -0.7060591
## republican 0.3216431 -0.7060591 1.0000000
Question 2
#Scatter plot using R
plot(covid$republican, covid$covid,
main = "COVID-19 Rate by Percent Republican",
xlab = "Republican %",
ylab = "COVID-19 Cases per 100k",
col = "purple")
abline(lm(covid$covid ~ covid$republican))
#Scatter plot using ggplot2
plot2 <- ggplot(data=covid, aes(x=republican, y=covid)) +
geom_point(aes(fill=region_short), size = 6, pch = 21) +
xlab("Republican Voting Share") +
ylab("COVID-19 Cases per 100K Citizens") +
ggtitle("COVID-19 Rate by Republican Voting Share") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5, size = 22, face = "bold"),
axis.text = element_text(size = 12),
axis.title = element_text(size = 12),
legend.title = element_text(size = 12, face = "bold"),
legend.position = "bottom") +
scale_fill_simpsons(name = "Regions") +
geom_text(label = covid$STATE_NAME, size = 3) +
geom_smooth(formula = y ~ x, method = "lm", se = F, color = "black")
print(plot2)
Question 3
Simple Linear Regression
I expect that Republican voting share will have a positive association with COVID cases per 100K citizens. Additionally, I expect that college graduation rates will have a negative association with COVID cases per 100K citizens.
##OLS Regression (one predictor or IV or X)
slm1 <- lm(covid$covid ~ covid$republican)
summary(slm1)
##
## Call:
## lm(formula = covid$covid ~ covid$republican)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1707.6 -677.1 206.9 583.1 1319.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1092.964 486.209 2.248 0.0291 *
## covid$republican 23.169 9.744 2.378 0.0214 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 819.3 on 49 degrees of freedom
## Multiple R-squared: 0.1035, Adjusted R-squared: 0.08516
## F-statistic: 5.654 on 1 and 49 DF, p-value: 0.02136
## MODEL INFO:
## Observations: 51
## Dependent Variable: covid$covid
## Type: OLS linear regression
##
## MODEL FIT:
## F(1,49) = 5.65, p = 0.02
## R² = 0.10
## Adj. R² = 0.09
##
## Standard errors: OLS
## ---------------------------------------------------------
## Est. S.E. t val. p
## ---------------------- --------- -------- -------- ------
## (Intercept) 1092.96 486.21 2.25 0.03
## covid$republican 23.17 9.74 2.38 0.02
## ---------------------------------------------------------
##
## Call:
## lm(formula = covid$covid ~ covid$college100)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1756.9 -631.8 169.3 563.9 1295.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3374.89 572.98 5.890 3.45e-07 ***
## covid$college100 -32.21 15.60 -2.065 0.0443 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 830 on 49 degrees of freedom
## Multiple R-squared: 0.08004, Adjusted R-squared: 0.06126
## F-statistic: 4.263 on 1 and 49 DF, p-value: 0.04426
## MODEL INFO:
## Observations: 51
## Dependent Variable: covid$covid
## Type: OLS linear regression
##
## MODEL FIT:
## F(1,49) = 4.26, p = 0.04
## R² = 0.08
## Adj. R² = 0.06
##
## Standard errors: OLS
## ---------------------------------------------------------
## Est. S.E. t val. p
## ---------------------- --------- -------- -------- ------
## (Intercept) 3374.89 572.98 5.89 0.00
## covid$college100 -32.21 15.60 -2.06 0.04
## ---------------------------------------------------------
A one-unit increase in Republican voting share is associated with an additional 23.17 COVID cases per 100K citizens. The p-value of .02136 indicates that this model is statistically significant. The r-squared value of .1035 indicates that Republican voting share explains roughly 10% of the variability in COVID cases per 100K citizens.
A one-unit increase in college graduation rate is associated with an 32.21 fewer COVID cases per 100K citizens. The p-value of .0443 indicates that this model is statistically significant. The r-squared value of .08004 indicates that college graduation rates explain roughly 8% of the variability in COVID cases per 100K citizens.
Multiple Linear Regression
covid$region_short <- as.factor(covid$region_short)
mlm1 <- lm(covid$covid ~ covid$republican +
covid$college100 + relevel(covid$region_short, ref = "South"))
summary(mlm1)
##
## Call:
## lm(formula = covid$covid ~ covid$republican + covid$college100 +
## relevel(covid$region_short, ref = "South"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1867.4 -480.3 48.4 615.4 1280.6
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 3583.3682 1442.9084
## covid$republican -0.9241 14.0848
## covid$college100 -26.5993 22.7122
## relevel(covid$region_short, ref = "South")Midwest -9.9284 284.8462
## relevel(covid$region_short, ref = "South")Northeast -883.8299 335.6613
## relevel(covid$region_short, ref = "South")West -809.2942 322.1588
## t value Pr(>|t|)
## (Intercept) 2.483 0.0168 *
## covid$republican -0.066 0.9480
## covid$college100 -1.171 0.2477
## relevel(covid$region_short, ref = "South")Midwest -0.035 0.9723
## relevel(covid$region_short, ref = "South")Northeast -2.633 0.0116 *
## relevel(covid$region_short, ref = "South")West -2.512 0.0157 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 749.2 on 45 degrees of freedom
## Multiple R-squared: 0.3116, Adjusted R-squared: 0.2351
## F-statistic: 4.073 on 5 and 45 DF, p-value: 0.003903
## MODEL INFO:
## Observations: 51
## Dependent Variable: covid$covid
## Type: OLS linear regression
##
## MODEL FIT:
## F(5,45) = 4.07, p = 0.00
## R² = 0.31
## Adj. R² = 0.24
##
## Standard errors: OLS
## -------------------------------------------------------------------
## Est. S.E. t val. p
## ------------------------------- --------- --------- -------- ------
## (Intercept) 3583.37 1442.91 2.48 0.02
## covid$republican -0.92 14.08 -0.07 0.95
## covid$college100 -26.60 22.71 -1.17 0.25
## relevel(covid$region_short, -9.93 284.85 -0.03 0.97
## ref = "South")Midwest
## relevel(covid$region_short, -883.83 335.66 -2.63 0.01
## ref = "South")Northeast
## relevel(covid$region_short, -809.29 322.16 -2.51 0.02
## ref = "South")West
## -------------------------------------------------------------------
The p-value of .003903 indicates that this model is statistically significant. The adjusted r-squared value of .2351 indicates that the model explains roughly 23% of the variability in COVID cases per 100K citizens. The effect size for Republican voting share changed when other variables were included in the model. Additionally, the effect size for college graduation rates changed when other variables were included in the model.
Controlling for college graduation rate, a one-point increase in Republican voting share is associated with .9241 fewer COVID cases per 100K citizens for states in the South.
Controlling for Republican voting share, a one-point increase in college graduation rates is associated with 26.5993 fewer COVID cases per 100K citizens for states in the South.
Compared to states in the South, states in the Midwest had 9.9284 fewer COVID cases per 100K citizens when controlling for college graduation rates.
Compared to states in the South, states in the Northeast had 883.8299 fewer COVID cases per 100K citizens when controlling for college graduation rates.
Compared to states in the South, states in the West had 809.2942 fewer COVID cases per 100K citizens when controlling for college graduation rates.
Question 4
The South was shown to have higher rates of COVID per 100K citizens compared to any other region when controlling for other variables. Additionally, states with higher college graduation rates saw fewer COVID cases per 100K citizens. Furthermore, states with a higher rate of Republican voters saw higher rates of COVID per 100K citizens in general, but the South saw lower rates of COVID associated with a higher Republican voting share.