# Textbook

The textbook, "Analytics of Small Data: A Mode of Thinking", can be downloaded here (PDF).

The book is named *analytics of small data* for a reason. The name does not mean that the methods introduced in the book apply only to small datasets. Rather, the book's approach is to introduce analytics methods through exemplary datasets that are as small as possible, small enough to grasp with perception or intuition. We then illustrate what questions we can ask and what types of models we can build from these small datasets. In this way, we hope to connect perceivable intuition with abstract formulations.
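
As a minimal sketch of this idea, the snippet below fits a straight line by least squares to five made-up points, few enough that every intermediate sum can be checked by hand. The data values and the function names `fit_line` and `r_squared` are assumptions for illustration only. The course materials themselves use R; plain Python is used here just to keep the example dependency-free.

```python
# Hypothetical five-point dataset, small enough to verify by hand.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]


def fit_line(x, y):
    """Closed-form least squares estimates for y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def r_squared(x, y, b0, b1):
    """Share of the variance in y explained by the fitted line."""
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


b0, b1 = fit_line(x, y)
r2 = r_squared(x, y, b0, b1)
print(f"intercept={b0:.3f}, slope={b1:.3f}, R-squared={r2:.4f}")
```

With only five points, the slope can be read off as the ratio of two short sums, which is the kind of link between intuition and formula the book aims for.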

# Topics in a nutshell

#### Data models – regression-based techniques:

- Chapter 2: Linear regression, least squares estimation, hypothesis testing, why the normal distribution, its connection with experimental design, R-squared
- Chapter 3: Logistic regression, generalized least squares estimation, iteratively reweighted least squares (IRLS) algorithm, approximated hypothesis testing, ranking as a linear regression
- Chapter 4: Bootstrap, data resampling, nonparametric hypothesis testing, nonparametric confidence interval estimation
- Chapter 5: Overfitting and underfitting, limitation of R-squared, training dataset and testing dataset, random sampling, K-fold cross-validation, the confusion matrix, false positive and false negative, and the Receiver Operating Characteristic (ROC) curve
- Chapter 6: Residual analysis, normal Q-Q plot, Cook’s distance, leverage, multicollinearity, subset selection, heterogeneity, clustering, Gaussian mixture model (GMM), and the Expectation-Maximization (EM) algorithm
- Chapter 7: Support Vector Machine (SVM), generalizing versus memorizing data, maximum margin, support vectors, model complexity and regularization, primal-dual formulation, quadratic programming, KKT conditions, kernel trick, kernel machines, SVM as a neural network model
- Chapter 8: LASSO, sparse learning, L1-norm and L2-norm regularization, Ridge regression, feature selection, shooting algorithm, Principal Component Analysis (PCA), eigenvalue decomposition, scree plot
- Chapter 9: Kernel regression as a generalization of the linear regression model, kernel functions, local smoother regression model, k-nearest neighbor (KNN) regression model, conditional variance regression model, heteroscedasticity, weighted least squares estimation, model extension and stacking
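
To illustrate one of the listed topics, nonparametric confidence interval estimation by bootstrap (Chapter 4), here is a minimal sketch of a percentile bootstrap interval for a sample mean. The six data values, the seed, the number of resamples, and the function name `bootstrap_mean_ci` are all assumptions chosen for illustration. The course code is in R, so this Python version is a sketch of the idea, not the book's implementation.

```python
import random

# Hypothetical six-point sample.
data = [4.1, 5.0, 5.6, 6.2, 7.3, 8.0]


def bootstrap_mean_ci(data, n_resamples=2000, alpha=0.05, seed=1):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    means = []
    for _ in range(n_resamples):
        # Resample n observations with replacement from the original data.
        resample = rng.choices(data, k=n)
        means.append(sum(resample) / n)
    means.sort()
    # Take the empirical alpha/2 and 1 - alpha/2 quantiles of the means.
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


sample_mean = sum(data) / len(data)
lo, hi = bootstrap_mean_ci(data)
print(f"sample mean={sample_mean:.3f}, 95% bootstrap CI=({lo:.3f}, {hi:.3f})")
```

The interval comes entirely from resampling the observed data, so no normality assumption is needed.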

#### Algorithmic models – tree-based techniques:

- Chapter 2: Decision tree, entropy gain, node splitting, pre- and post-pruning, empirical error, generalization error, pessimistic error by binomial approximation, greedy recursive splitting
- Chapter 4: Random forest, Gini index, weak classifiers, the probabilistic mechanism behind why random forests work
- Chapter 5: Out-of-bag (OOB) error in random forest
- Chapter 6: Importance score, partial dependence plot, residual analysis
- Chapter 7: Ensemble learning, AdaBoost, sampling with (or without) replacement
- Chapter 8: Importance score in random forest, regularized random forests (RRF), guided regularized random forests (GRRF)
- Chapter 9: System monitoring reformulated as classification, real-time contrasts method (RTC), design of monitoring statistics, sliding window, anomaly detection, false alarm
- Chapter 10: Integration of tree models, feature selection, and regression models in inTrees, random forest as a rule generator, rule extraction, pruning, selection, and summarization, confidence and support of rules, variable interactions, rule-based prediction
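
As a small illustration of the entropy gain used for node splitting (Chapter 2 above), the sketch below scores two candidate splits of a six-observation node. The labels, the two splits, and the function names `entropy` and `entropy_gain` are hypothetical. It is written in Python for self-containment, whereas the course code is in R.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def entropy_gain(parent, left, right):
    """Reduction in entropy from splitting `parent` into `left` and `right`."""
    n = len(parent)
    # Weight each child's entropy by its share of the parent's observations.
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children


# Hypothetical node with three observations of each class.
parent = ["yes", "yes", "yes", "no", "no", "no"]

# Candidate split A separates the classes perfectly.
gain_a = entropy_gain(parent, ["yes", "yes", "yes"], ["no", "no", "no"])

# Candidate split B leaves both children mixed.
gain_b = entropy_gain(parent, ["yes", "yes", "no"], ["yes", "no", "no"])

print(f"gain of split A = {gain_a:.4f}")
print(f"gain of split B = {gain_b:.4f}")
```

A greedy tree builder would pick split A, since it removes all of the parent node's uncertainty while split B removes very little.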

# Course notes, data, and code

Lecture | Content | Slides | Essential R Pipelines | Advanced R (optional for class) |
---|---|---|---|---|
01 | Introduction | r code | | |
02 | Linear regression | r code | r code | |
03 | Decision tree | r code | r code | |
04 | Logistic regression | r code | r code | |
05 | Bootstrap | r code | r code | |
06 | Random forest | r code | r code | |
07 | Cross-validation | r code (CV) r code (ROC) | r code | |
08 | Out-of-bag (OOB) errors | r code | | |
09 | Residuals analysis | r code | r code | |
10 | Clustering | r code | r code | |
11 | LASSO | r code | r code | |
12 | Variable importance in tree models | | | |
13 | Principal component analysis (PCA) | r code | r code | |
14 | Support vector machine (SVM) | r code | r code | |
15 | Kernel regression | r code | r code | |
16 | KNN regression | r code | | |
17 | AdaBoost and ensemble learning | r code | r code | |
18 | inTrees | r code | r code | |