OS4118 Statistical and Machine Learning

This course introduces students to the art and science of statistical and machine learning to find patterns in large and "Big" data. The focus is on the strengths and weaknesses of learning techniques and their implementation. Fundamental ideas common to learning methods are covered, and supervised and unsupervised techniques are introduced. These techniques include re-sampling methods, advanced clustering and visualization, tree-based ensembles, stochastic gradient boosting, deep neural networks, auto-encoding and other dimension reduction techniques, and applications to natural language processing. The software package R and high-performance parallel or distributed computing will be used to demonstrate these techniques.

Prerequisite

OA4106 or consent of instructor

Lecture Hours

Lab Hours

Course Learning Outcomes

At the end of this course, students will:

· Recognize the different types of problems faced by analysts of large data sets and the pros and cons of the statistical and machine learning tools used to address them.

The data treated will be high-dimensional and of mixed type, including images and text. In this field, tools, algorithms, their implementations, their popularity, and the underlying software and hardware all change quickly. By gaining an understanding of the fundamental ideas common to the core statistical and machine learning tools, and by being exposed to a few special topics that use extensions of these core tools in unexpected ways, the student will be able to:

· Understand new cutting-edge applications based on these techniques and combinations of them.

· Apply the techniques taught in this course and others in their own work environments, drawing on hands-on experience with large data sets and scalable open-source software that can be used on a variety of platforms (e.g., Hadoop-style clusters).