Textbook

The textbook, "Analytics of Small Data: A Mode of Thinking", can be downloaded here (PDF).

The book is titled analytics of small data for a reason. This does not mean that the methods it introduces apply only to small datasets. Rather, the book introduces analytics methods through example datasets that are as small as possible, small enough to grasp with the perception and intuition readily available to us. It then illustrates what questions we can ask and what types of models we can build from these small datasets. In this way, we hope to connect intuition with abstract formulations.



Syllabus

The syllabus can be found here (PDF).





Topics in a nutshell

Data models – regression based techniques:

  • Chapter 2: Linear regression, least-squares estimation, hypothesis testing, why the normal distribution, its connection with experimental design, R-squared
  • Chapter 3: Logistic regression, generalized least-squares estimation, iteratively reweighted least squares (IRLS) algorithm, approximate hypothesis testing, ranking as a linear regression
  • Chapter 4: Bootstrap, data resampling, nonparametric hypothesis testing, nonparametric confidence interval estimation
  • Chapter 5: Overfitting and underfitting, limitations of R-squared, training dataset and testing dataset, random sampling, K-fold cross-validation, the confusion matrix, false positives and false negatives, and the Receiver Operating Characteristic (ROC) curve
  • Chapter 6: Residual analysis, normal Q-Q plot, Cook’s distance, leverage, multicollinearity, subset selection, heterogeneity, clustering, Gaussian mixture model (GMM), and the Expectation-Maximization (EM) algorithm
  • Chapter 7: Support Vector Machine (SVM), generalizing from data versus memorizing data, maximum margin, support vectors, model complexity and regularization, primal-dual formulation, quadratic programming, KKT conditions, kernel trick, kernel machines, SVM as a neural network model
  • Chapter 8: LASSO, sparse learning, L1-norm and L2-norm regularization, Ridge regression, feature selection, shooting algorithm, Principal Component Analysis (PCA), eigenvalue decomposition, scree plot
  • Chapter 9: Kernel regression as a generalization of the linear regression model, kernel functions, local smoother regression model, k-nearest neighbor regression model, conditional variance regression model, heteroscedasticity, weighted least-squares estimation, model extension and stacking
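
As a small illustration of the Chapter 2 topics (least-squares estimation and R-squared), the following sketch fits a straight line to a five-point dataset using the closed-form formulas. It is written in Python for illustration only; the course's own pipelines are in R. The data values and the function names `fit_simple_linear_regression` and `r_squared` are assumptions made up for this example and do not come from the textbook.

```python
# Made-up five-point dataset, small enough to check every step by hand.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]


def fit_simple_linear_regression(x, y):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope = S_xy / S_xx, the closed-form least-squares solution.
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean
    return intercept, slope


def r_squared(x, y, intercept, slope):
    """Return 1 - SS_res / SS_tot for the fitted line."""
    y_mean = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


intercept, slope = fit_simple_linear_regression(x, y)
r2 = r_squared(x, y, intercept, slope)
print(round(intercept, 3), round(slope, 3), round(r2, 3))
```

With only five points, the sums above can be verified on paper: the fitted slope is about 0.6, the intercept about 2.2, and R-squared about 0.6.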

Algorithmic models – tree based techniques:

  • Chapter 2: Decision tree, entropy gain, node splitting, pre- and post-pruning, empirical error, generalization error, pessimistic error by binomial approximation, greedy recursive splitting
  • Chapter 4: Random forest, Gini index, weak classifiers, probabilistic explanation of why random forests work
  • Chapter 5: Out-of-bag (OOB) error in random forest
  • Chapter 6: Importance score, partial dependence plot, residual analysis
  • Chapter 7: Ensemble learning, AdaBoost, sampling with (or without) replacement
  • Chapter 8: Importance score in random forest, regularized random forests (RRF), guided regularized random forests (GRRF)
  • Chapter 9: System monitoring reformulated as classification, real-time contrasts method (RTC), design of monitoring statistics, sliding window, anomaly detection, false alarm
  • Chapter 10: Integration of tree models, feature selection, and regression models in inTrees, random forest as a rule generator, rule extraction, pruning, selection, and summarization, confidence and support of rules, variable interactions, rule-based prediction
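
To make the entropy gain used for node splitting in decision trees concrete, here is a minimal sketch that computes it for one candidate split of an eight-observation dataset. It is written in Python for illustration only; the course's own pipelines are in R. The labels, the split, and the function names `entropy` and `entropy_gain` are assumptions for this example, not the textbook's code.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def entropy_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


# Made-up node with 4 "yes" and 4 "no" labels, and one candidate split.
parent = ["yes"] * 4 + ["no"] * 4
left = ["yes", "yes", "yes", "no"]
right = ["yes", "no", "no", "no"]

gain = entropy_gain(parent, [left, right])
print(round(entropy(parent), 4), round(gain, 4))
```

A greedy splitting procedure would evaluate this quantity for every candidate split and keep the one with the largest gain. Here the parent entropy is 1 bit and the candidate split recovers roughly 0.19 bits, whereas a split that separated the two classes perfectly would recover the full 1 bit.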


Course notes, data, and code

Lecture | Content | Slides | Essential R Pipelines | Advanced R (optional for class)
01 | Introduction | pdf | r code |
02 | Linear regression | pdf | r code | r code
03 | Decision tree | pdf | r code | r code
04 | Logistic regression | pdf | r code | r code
05 | Bootstrap | pdf | r code | r code
06 | Random forest | pdf | r code | r code
07 | Cross-validation | pdf | r code (CV), r code (ROC) | r code
08 | Out-of-bag (OOB) errors | pdf | r code |
09 | Residual analysis | pdf | r code | r code
10 | Clustering | pdf | r code | r code
11 | LASSO | pdf | r code | r code
12 | Variable importance in tree models | pdf | |
13 | Principal component analysis (PCA) | pdf | r code | r code
14 | Support vector machine (SVM) | pdf | r code | r code
15 | Kernel regression | pdf | r code | r code
16 | KNN regression | pdf | r code |
17 | AdaBoost and ensemble learning | pdf | r code | r code
18 | inTrees | pdf | r code | r code


Learn more about R

A nice 76-page R tutorial can be found here (PDF).