Textbook

An online version of the textbook, "Data Analytics: A Small Data Approach", can be accessed here. You can buy the book here or through Amazon.com.



Syllabus

The syllabus can be found here (PDF).





Topics in a nutshell

Data models – regression-based techniques:

  • Chapter 2: Linear regression, least squares estimation, hypothesis testing, why the normal distribution is used, its connection with experimental design, R-squared
  • Chapter 3: Logistic regression, generalized least squares estimation, iteratively reweighted least squares (IRLS) algorithm, approximate hypothesis testing, ranking as a linear regression
  • Chapter 4: Bootstrap, data resampling, nonparametric hypothesis testing, nonparametric confidence interval estimation
  • Chapter 5: Overfitting and underfitting, limitations of R-squared, training and testing datasets, random sampling, K-fold cross-validation, the confusion matrix, false positives and false negatives, and the receiver operating characteristic (ROC) curve
  • Chapter 6: Residual analysis, normal Q-Q plot, Cook’s distance, leverage, multicollinearity, subset selection, heterogeneity, clustering, Gaussian mixture model (GMM), and the Expectation-Maximization (EM) algorithm
  • Chapter 7: Support Vector Machine (SVM), generalizing from data versus memorizing data, maximum margin, support vectors, model complexity and regularization, primal-dual formulation, quadratic programming, KKT conditions, kernel trick, kernel machines, SVM as a neural network model
  • Chapter 8: LASSO, sparse learning, L1-norm and L2-norm regularization, Ridge regression, feature selection, shooting algorithm, Principal Component Analysis (PCA), eigenvalue decomposition, scree plot
  • Chapter 9: Kernel regression as a generalization of the linear regression model, kernel functions, local smoother regression model, k-nearest neighbor (KNN) regression model, conditional variance regression model, heteroscedasticity, weighted least squares estimation, model extension and stacking

Algorithmic models – tree-based techniques:

  • Chapter 2: Decision tree, entropy gain, node splitting, pre- and post-pruning, empirical error, generalization error, pessimistic error by binomial approximation, greedy recursive splitting
  • Chapter 4: Random forest, Gini index, weak classifiers, the probabilistic mechanism behind why random forests work
  • Chapter 5: Out-of-bag (OOB) error in random forest
  • Chapter 6: Importance score, partial dependence plot, residual analysis
  • Chapter 7: Ensemble learning, AdaBoost, sampling with (or without) replacement
  • Chapter 8: Importance score in random forest, regularized random forests (RRF), guided regularized random forests (GRRF)
  • Chapter 9: System monitoring reformulated as classification, real-time contrasts method (RTC), design of monitoring statistics, sliding window, anomaly detection, false alarm
  • Chapter 10: Integration of tree models, feature selection, and regression models in inTrees, random forest as a rule generator, rule extraction, pruning, selection, and summarization, confidence and support of rules, variable interactions, rule-based prediction


Course notes, data, and code

Lecture | Content | Slides | Essential R Pipelines | Advanced R (optional for class)
01 | Introduction | pdf | r code |
02 | Linear regression | pdf | r code | r code
03 | Decision tree | pdf | r code | r code
04 | Logistic regression | pdf | r code | r code
05 | Bootstrap | pdf | r code | r code
06 | Random forest | pdf | r code | r code
07 | Cross-validation | pdf | r code (CV), r code (ROC) | r code
08 | Out-of-bag (OOB) errors | pdf | r code |
09 | Residual analysis | pdf | r code | r code
10 | Clustering | pdf | r code | r code
11 | LASSO | pdf | r code | r code
12 | Variable importance in tree models | pdf | |
13 | Principal component analysis (PCA) | pdf | r code | r code
14 | Support vector machine (SVM) | pdf | r code | r code
15 | Kernel regression | pdf | r code | r code
16 | KNN regression | pdf | r code |
17 | AdaBoost and ensemble learning | pdf | r code | r code
18 | inTrees | pdf | r code | r code


Learn more about R

A nice 76-page R tutorial can be found here (PDF).