The ability to process and extract insightful information from large amounts of data has become a desired, if not necessary, skill in almost every field of industry and science. Among other benefits, such information can provide useful knowledge, support decision-making, uncover hidden trends, and enable deeper understanding of observed phenomena. This course covered some of the main problems and challenges encountered in data analysis and applications, and provided fundamental tools and techniques for solving them. We discussed popular algorithms for data organization and visualization, such as principal component analysis (PCA) and multidimensional scaling (MDS). Students became familiar with a variety of machine learning and data mining approaches. These included both supervised approaches, such as classification (e.g., with decision trees, Bayesian classifiers, and SVMs), and unsupervised ones, such as clustering (e.g., with k-means, density estimators, and linkage-based agglomeration).

The lectures and class discussions were accompanied by homework exercises that combined theoretical questions, which emphasized understanding of the underlying data mining principles, with programming tasks (e.g., in MATLAB and/or Python) that demonstrated practical implementations of the techniques studied. Grades in this course were based on these exercises, a project, and an exam.

The course assumed basic prior knowledge of probability, linear algebra, data structures, algorithms, and programming.

- Introduction to Data Mining, by Pang-Ning Tan, Michael Steinbach, Vipin Kumar, 2005.
- Data Mining: Concepts and Techniques, 3rd Ed., by Jiawei Han, Micheline Kamber, Jian Pei, 2011.

- Introduction to data mining tasks
- Data exploration and visualization
- Distances and similarities
- Data preprocessing
- Dimensionality reduction
- Principal component analysis
- Multidimensional scaling

- Classification
- Decision trees & random forests
- Bayesian classification
- Support vector machines

- Clustering
- Partitional: k-means, k-modes, LDF (shake & bake), k-medoids, & PAM
- Density-based: DBSCAN and density-based clustering

- Hierarchical clustering
- Bisecting k-means
- Agglomerative clustering
- Large-scale methods: BIRCH, CURE, & Chameleon

- Nonlinear dimensionality reduction
- Isomap
- Diffusion maps

- Topic 01 - Introduction
- Topic 02 - Data Exploration & Visualization
- Topic 03 - Distances & Similarities
- Topic 04 - Preprocessing & Dimensionality Reduction
- Topic 05 - Classification & Decision Trees
- Topic 06 - Bayesian Classification
- Topic 07 - Support Vector Machines
- Topic 08 - Clustering
- Topic 09 - Hierarchical Clustering
- Topic 10 - Diffusion Maps

- Tutorial by Lawrence Saul (IPAM, 2005) on spectral methods for dimensionality reduction: