```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
####Analytics Class####
## R programming skillset 1 _ essential pipeline of linear regression analysis
```{r}
# Step 1 -> Read data into R workstation
#### Read data from a CSV file
#### Example: Alzheimer's Disease
# filename
setwd("C:/Users/shuai/Downloads")
AD <- read.csv(file = "AD2.csv", header = TRUE)
```
```{r}
# Step 2 -> Data preprocessing
# Create your X matrix (predictors) and Y vector (outcome variable)
X <- AD[,1:15]
Y <- AD$MMSCORE
data <- data.frame(X,Y)
names(data)[16] <- c("MMSCORE")
# Create a training data (half the original data size)
train.ix <- sample(nrow(data),floor( nrow(data)/2) )
data.train <- data[train.ix,]
# Create a testing data (half the original data size)
data.test <- data[-train.ix,]
```
```{r}
# Step 3 -> Use lm() function to build a full model with all predictors
lm.AD <- lm(MMSCORE ~ ., data = data.train)
summary(lm.AD)
# Important knowledge point: 1) The use of lm() function, including the correct specification of the regression formula (e.g., MMSCORE ~ AGE + PTGENDER + PTEDUCAT + AGE*PTEDUCAT), arguments used in the lm() (help(lm)); 2) interpretation of the results, including the significant predictors, p-values, t-tests, and R-squares, etc,
```
```{r}
# Step 4 -> use step() to automatically delete all the insignificant variables
# Automatic model selection
lm.AD.reduced <- step(lm.AD, direction="backward", test="F") # direction="backward" means we start with the full model, then sequentially remove insignificant variables. There are other options, including direction="forward" or direction="both" - see help(step).
anova(lm.AD.reduced,lm.AD) # anova(model1,model2) is to compare if the two models, model1 and model2, are statistically different. If not statistically different, we prefer the simpler model
```
```{r}
# Step 5 -> Conduct diagnostics of the model
# install.packages("ggfortify")
require("ggfortify") # ggfortify is the package to do model diagnosis
autoplot(lm.AD.reduced, which = 1:6, ncol = 3, label.size = 3)
# This R function generates six figures:
# 1) the left figure in the first row, which is the scatterplot of the residuals versus fitted values of the outcome variable, it is supposed to show purely random distributions of the data points. In other words, any pattern that shows non-random characteristics, such as the curved relationship between the residuals and fitted values, and the unusual parallel lines of the data points, indicates deviance from the assumptions such as independence of the observations and constancy/homoscedasticity of the variance of the errors.
# 2) The Q-Q plot, as the middle figure in the first row, shows violation of the normality assumption of the error term. And some particularly violating data points such as the data points 282 and 256 are labelled.
# 3) The right figure in the first row, is a re-presentation of the figure in 1). As diagnostic figures are opportunistic efforts to identify problems (like doctors to see patients), seeing no problem usually doesn't mean really no problem. Thus, it is helpful to present the same information in multiple scales.
# 4) The Cook's distance shown in the left figure in the second row, shows the influential data points that have larger than average influence on the parameter estimation. The Cook's distance of a data point is built on the idea of how much change will be induced on the estimated parameters if the data point is deleted.
# 5) The leverage of a data point, on the other hand, shows the influence of the data point in another way. Mathematically, the leverage of a data point is (???y ^_i)/(???y_i ), reflecting how sensitive the prediction on the data point by the model is decided by the observed outcome value y_i. In other words, what data point will result in high leverage value? For data points that are surrounded by many close-by data points, their leverages won't be large, since the impact of removal of them will be compensated by other similar data points in the nearby. Thus, we could infer that the data points that sparsely occupy their neighbor areas will have large leverages. These data points could either be outliers that severely derivate from the linear trend represented by the majority of the data points, or could be valuable data points that align with the linear trend but lack neighbor data points, and thus, changes on their observations will generate a large impact on the predictions on the data points nearby their locations. Thus, it is important to note that, a data point that is influential doesn't necessary imply that it is bad. It only suggests that some more in-depth examination of the data point is needed.
# 6) Again, the last figure, as the 3) one, is to re-present the information in 4) and 5)
```
```{r}
# Step 6 -> Predict using your linear regession model
pred.lm <- predict(lm.AD.reduced, data.test) # a few comments: 1) predict() is a function that you can find in many R pacakges. R package developers tend to agree to write such a function predict(obj, data), where obj is the obj of the created model by that package, and data is the data points you want to predict on. 2) While in many cases this is not needed, sometimes you do need to specify the argument "type". Here, type="class" means that, you specify this is a classification problem.
cor(pred.lm, data.test$MMSCORE) #For regression model, you can use correlation to measure how close your predictions with the true outcome values of the data points
```