Textbook
An online version of the textbook, "Data Analytics: A Small Data Approach", is available.
Topics in a nutshell
Data models – regression-based techniques:
- Chapter 2: Linear regression, least squares estimation, hypothesis testing, why the normal distribution is used, its connection with experimental design, R-squared (see the sketch after this list).
- Chapter 3: Logistic regression, generalized least squares estimation, the iteratively reweighted least squares (IRLS) algorithm, approximate hypothesis testing, ranking as a linear regression problem
- Chapter 4: Bootstrap, data resampling, nonparametric hypothesis testing, nonparametric confidence interval estimation
- Chapter 5: Overfitting and underfitting, limitations of R-squared, training and testing datasets, random sampling, K-fold cross-validation, the confusion matrix, false positives and false negatives, and the Receiver Operating Characteristic (ROC) curve
- Chapter 6: Residual analysis, normal Q-Q plot, Cook’s distance, leverage, multicollinearity, subset selection, heterogeneity, clustering, the Gaussian mixture model (GMM), and the Expectation-Maximization (EM) algorithm
- Chapter 7: Support Vector Machine (SVM), generalizing versus memorizing data, maximum margin, support vectors, model complexity and regularization, the primal-dual formulation, quadratic programming, the KKT conditions, the kernel trick, kernel machines, SVM as a neural network model
- Chapter 8: LASSO, sparse learning, L1-norm and L2-norm regularization, Ridge regression, feature selection, shooting algorithm, Principal Component Analysis (PCA), eigenvalue decomposition, scree plot
- Chapter 9: Kernel regression as a generalization of the linear regression model, kernel functions, the local smoother regression model, the K-nearest-neighbor (KNN) regression model, the conditional variance regression model, heteroscedasticity, weighted least squares estimation, model extension and stacking
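For a concrete sense of what these regression-based topics look like in R, here is a minimal sketch of a Chapter 2 / Chapter 5 style pipeline on the built-in mtcars dataset (listed under Datasets below): least squares fitting with lm(), coefficient hypothesis tests and R-squared from summary(), and a simple train/test split to estimate out-of-sample error. The formula mpg ~ wt + hp and the 70/30 split are illustrative assumptions, not the textbook's own example.

```r
# Minimal regression pipeline sketch on the built-in mtcars dataset.
# The predictors (wt, hp) and the 70/30 split are illustrative choices.
data(mtcars)

set.seed(1)                                    # reproducible split
train_id <- sample(seq_len(nrow(mtcars)), size = round(0.7 * nrow(mtcars)))
train <- mtcars[train_id, ]
test  <- mtcars[-train_id, ]

# Least squares estimation and hypothesis testing (Chapter 2)
fit <- lm(mpg ~ wt + hp, data = train)
summary(fit)                                   # t-tests, p-values, R-squared

# Training error versus testing error (Chapter 5)
mse_train <- mean(residuals(fit)^2)
mse_test  <- mean((test$mpg - predict(fit, newdata = test))^2)
c(train = mse_train, test = mse_test)
```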
Algorithmic models – tree-based techniques:
- Chapter 2: Decision tree, entropy gain, node splitting, pre- and post-pruning, empirical error, generalization error, pessimistic error by binomial approximation, greedy recursive splitting (see the sketch after this list).
- Chapter 4: Random forest, the Gini index, weak classifiers, the probabilistic mechanism behind why random forests work
- Chapter 5: Out-of-bag (OOB) error in random forest
- Chapter 6: Importance score, partial dependency plot, residual analysis
- Chapter 7: Ensemble learning, Adaboost, sampling with (or without) replacement
- Chapter 8: Importance score in random forest, regularized random forests (RRF), guided regularized random forests (GRRF)
- Chapter 9: System monitoring reformulated as classification, real-time contrasts method (RTC), design of monitoring statistics, sliding window, anomaly detection, false alarm
- Chapter 10: Integration of tree models, feature selection, and regression models in inTrees; random forest as a rule generator; rule extraction, pruning, selection, and summarization; confidence and support of rules; variable interactions; rule-based prediction
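Similarly, the sketch below gives a minimal tree-based counterpart: a single decision tree and a random forest classifying transmission type in mtcars, with the out-of-bag error and variable importance scores reported. It assumes the rpart and randomForest packages are installed; the formula am ~ . and the tuning values are illustrative choices, not the course's own pipelines (those are linked in the table below).

```r
# Minimal tree-based pipeline sketch; assumes the rpart and randomForest
# packages are installed. Dataset and formula are illustrative choices.
library(rpart)
library(randomForest)

data(mtcars)
mtcars$am <- factor(mtcars$am)                 # automatic (0) vs. manual (1)

# Decision tree grown by greedy recursive splitting (Chapter 2)
tree <- rpart(am ~ ., data = mtcars, method = "class")
print(tree)

# Random forest: OOB error estimate and importance scores (Chapters 4, 5, 8)
set.seed(1)
rf <- randomForest(am ~ ., data = mtcars, ntree = 500, importance = TRUE)
rf$err.rate[rf$ntree, "OOB"]                   # out-of-bag classification error
importance(rf)                                 # per-variable importance scores
```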
Course notes, data, and code
Lecture | Content | Slides | Essential R Pipelines | Advanced R (optional for class) |
---|---|---|---|---|
01 | Introduction | | r code | |
02 | Linear regression | | r code | r code |
03 | Decision tree | | r code | r code |
04 | Logistic regression | | r code | r code |
05 | Bootstrap | | r code | r code |
06 | Random forest | | r code | r code |
07 | Cross-validation (see the sketch below the table) | | r code (CV), r code (ROC) | r code |
08 | Out-of-bag (OOB) errors | | r code | |
09 | Residual analysis | | r code | r code |
10 | Clustering | | r code | r code |
11 | LASSO | | r code | r code |
12 | Variable importance in tree models | | | |
13 | Principal component analysis (PCA) | | r code | r code |
14 | Support vector machine (SVM) | | r code | r code |
15 | Kernel regression | | r code | r code |
16 | KNN regression | | r code | |
17 | AdaBoost and ensemble learning | | r code | r code |
18 | inTrees | | r code | r code |

Datasets: mtcars, AD, AD2, AD_hd, KR, Dropout
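As an example of the kind of pipeline the "Essential R Pipelines" column contains, the sketch below runs K-fold cross-validation of a logistic regression classifier on mtcars and tabulates the resulting confusion matrix (Lectures 04 and 07). The choice of K = 5, the predictors, and the 0.5 threshold are illustrative assumptions, not the course's posted code.

```r
# Minimal K-fold cross-validation sketch for a logistic regression
# classifier on mtcars; K, predictors, and threshold are illustrative.
data(mtcars)
mtcars$am <- factor(mtcars$am)

set.seed(1)
K <- 5
folds <- sample(rep(1:K, length.out = nrow(mtcars)))   # random fold assignment

pred_prob <- numeric(nrow(mtcars))
for (k in 1:K) {
  fit <- glm(am ~ wt + hp, data = mtcars[folds != k, ], family = binomial)
  pred_prob[folds == k] <- predict(fit, newdata = mtcars[folds == k, ],
                                   type = "response")
}

# Confusion matrix at a 0.5 probability threshold (Lecture 07)
pred_class <- ifelse(pred_prob > 0.5, "1", "0")
table(predicted = pred_class, actual = mtcars$am)
```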