Textbook
An online version of the textbook, "Data Analytics: A Small Data Approach", is available.
Topics in a nutshell
Data models – regression-based techniques:
- Chapter 2: Linear regression, least squares estimation, hypothesis testing, why the normal distribution is used, its connection with experimental design, R-squared (see the sketch after this list).
- Chapter 3: Logistic regression, generalized least squares estimation, the iteratively reweighted least squares (IRLS) algorithm, approximate hypothesis testing, ranking as a linear regression problem
- Chapter 4: Bootstrap, data resampling, nonparametric hypothesis testing, nonparametric confidence interval estimation
- Chapter 5: Overfitting and underfitting, limitations of R-squared, training and testing datasets, random sampling, K-fold cross-validation, the confusion matrix, false positives and false negatives, and the Receiver Operating Characteristic (ROC) curve
- Chapter 6: Residual analysis, normal Q-Q plot, Cook’s distance, leverage, multicollinearity, subset selection, heterogeneity, clustering, the Gaussian mixture model (GMM), and the Expectation-Maximization (EM) algorithm
- Chapter 7: Support Vector Machine (SVM), generalizing versus memorizing data, maximum margin, support vectors, model complexity and regularization, the primal-dual formulation, quadratic programming, the KKT conditions, the kernel trick, kernel machines, SVM as a neural network model
- Chapter 8: LASSO, sparse learning, L1-norm and L2-norm regularization, Ridge regression, feature selection, shooting algorithm, Principal Component Analysis (PCA), eigenvalue decomposition, scree plot
- Chapter 9: Kernel regression as a generalization of the linear regression model, kernel functions, the local smoother regression model, the K-nearest-neighbor (KNN) regression model, the conditional variance regression model, heteroscedasticity, weighted least squares estimation, model extension and stacking
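For a concrete sense of what these regression-based topics look like in R, here is a minimal sketch of a Chapter 2 / Chapter 5 style pipeline on the built-in mtcars dataset (listed under Datasets below): least squares fitting with lm(), coefficient hypothesis tests and R-squared from summary(), and a simple train/test split to estimate out-of-sample error. The formula mpg ~ wt + hp and the 70/30 split are illustrative assumptions, not the textbook's own example.

```r
# Minimal regression pipeline sketch on the built-in mtcars dataset.
# The predictors (wt, hp) and the 70/30 split are illustrative choices.
data(mtcars)

set.seed(1)                                    # reproducible split
train_id <- sample(seq_len(nrow(mtcars)), size = round(0.7 * nrow(mtcars)))
train <- mtcars[train_id, ]
test  <- mtcars[-train_id, ]

# Least squares estimation and hypothesis testing (Chapter 2)
fit <- lm(mpg ~ wt + hp, data = train)
summary(fit)                                   # t-tests, p-values, R-squared

# Training error versus testing error (Chapter 5)
mse_train <- mean(residuals(fit)^2)
mse_test  <- mean((test$mpg - predict(fit, newdata = test))^2)
c(train = mse_train, test = mse_test)
```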
Algorithmic models – tree-based techniques:
- Chapter 2: Decision tree, entropy gain, node splitting, pre- and post-pruning, empirical error, generalization error, pessimistic error by binomial approximation, greedy recursive splitting (see the sketch after this list).
- Chapter 4: Random forest, the Gini index, weak classifiers, the probabilistic mechanism behind why random forests work
- Chapter 5: Out-of-bag (OOB) error in random forest
- Chapter 6: Importance score, partial dependency plot, residual analysis
- Chapter 7: Ensemble learning, Adaboost, sampling with (or without) replacement
- Chapter 8: Importance score in random forest, regularized random forests (RRF), guided regularized random forests (GRRF)
- Chapter 9: System monitoring reformulated as classification, real-time contrasts method (RTC), design of monitoring statistics, sliding window, anomaly detection, false alarm
- Chapter 10: Integration of tree models, feature selection, and regression models in inTrees; random forest as a rule generator; rule extraction, pruning, selection, and summarization; confidence and support of rules; variable interactions; rule-based prediction
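Similarly, the sketch below gives a minimal tree-based counterpart: a single decision tree and a random forest classifying transmission type in mtcars, with the out-of-bag error and variable importance scores reported. It assumes the rpart and randomForest packages are installed; the formula am ~ . and the tuning values are illustrative choices, not the course's own pipelines (those are linked in the table below).

```r
# Minimal tree-based pipeline sketch; assumes the rpart and randomForest
# packages are installed. Dataset and formula are illustrative choices.
library(rpart)
library(randomForest)

data(mtcars)
mtcars$am <- factor(mtcars$am)                 # automatic (0) vs. manual (1)

# Decision tree grown by greedy recursive splitting (Chapter 2)
tree <- rpart(am ~ ., data = mtcars, method = "class")
print(tree)

# Random forest: OOB error estimate and importance scores (Chapters 4, 5, 8)
set.seed(1)
rf <- randomForest(am ~ ., data = mtcars, ntree = 500, importance = TRUE)
rf$err.rate[rf$ntree, "OOB"]                   # out-of-bag classification error
importance(rf)                                 # per-variable importance scores
```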
Course notes, data, and code
Lecture | Content | Slides | Essential R Pipelines | Advanced R (optional for class) |
---|---|---|---|---|
01 | Introduction | | r code | |
02 | Linear regression | | r code | r code |
03 | Decision tree | | r code | r code |
04 | Logistic regression | | r code | r code |
05 | Bootstrap | | r code | r code |
06 | Random forest | | r code | r code |
07 | Cross-validation (see the sketch below the table) | | r code (CV), r code (ROC) | r code |
08 | Out-of-bag (OOB) errors | | r code | |
09 | Residual analysis | | r code | r code |
10 | Clustering | | r code | r code |
11 | LASSO | | r code | r code |
12 | Variable importance in tree models | | | |
13 | Principal component analysis (PCA) | | r code | r code |
14 | Support vector machine (SVM) | | r code | r code |
15 | Kernel regression | | r code | r code |
16 | KNN regression | | r code | |
17 | AdaBoost and ensemble learning | | r code | r code |
18 | inTrees | | r code | r code |

Datasets: mtcars, AD, AD2, AD_hd, KR, Dropout
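As an example of the kind of pipeline the "Essential R Pipelines" column contains, the sketch below runs K-fold cross-validation of a logistic regression classifier on mtcars and tabulates the resulting confusion matrix (Lectures 04 and 07). The choice of K = 5, the predictors, and the 0.5 threshold are illustrative assumptions, not the course's posted code.

```r
# Minimal K-fold cross-validation sketch for a logistic regression
# classifier on mtcars; K, predictors, and threshold are illustrative.
data(mtcars)
mtcars$am <- factor(mtcars$am)

set.seed(1)
K <- 5
folds <- sample(rep(1:K, length.out = nrow(mtcars)))   # random fold assignment

pred_prob <- numeric(nrow(mtcars))
for (k in 1:K) {
  fit <- glm(am ~ wt + hp, data = mtcars[folds != k, ], family = binomial)
  pred_prob[folds == k] <- predict(fit, newdata = mtcars[folds == k, ],
                                   type = "response")
}

# Confusion matrix at a 0.5 probability threshold (Lecture 07)
pred_class <- ifelse(pred_prob > 0.5, "1", "0")
table(predicted = pred_class, actual = mtcars$am)
```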