Data Mining: The Textbook, Springer, May 2015

Charu C. Aggarwal.

PDF Download Link (Free for computers connected to subscribing institutions only)

Buy hard-cover or PDF (PDF has embedded links for navigation on e-readers)

Buy low-cost paperback edition (Instructions for computers connected to subscribing institutions only)

The emergence of data science as a discipline requires the development of a book that goes beyond the traditional focus of books on fundamental data mining problems. More emphasis needs to be placed on the advanced data types such as text, time series, discrete sequences, spatial data, graph data, and social networks. This comprehensive data mining book explores the different aspects of data mining, starting from the fundamentals, and subsequently explores the complex data types and their applications. Therefore, this book may be used for both introductory and advanced data mining courses. The chapters of this book fall into one of three categories:

The fundamental chapters: Data mining has four main problems, which correspond to clustering, classification, association pattern mining, and outlier analysis. These chapters comprehensively discuss a wide variety of methods for these problems.

Domain chapters: These chapters discuss the specific methods used for different domains of data such as text data, time-series data, sequence data, graph data, and spatial data.

Application chapters: These chapters study important applications such as stream mining, Web mining, ranking, recommendations, social networks, and privacy preservation. The domain chapters also have an applied flavor.

The book carefully balances mathematical details and intuition. It contains the necessary mathematical details for professors and researchers, but it is presented in a simple and intuitive style to improve accessibility for students and industrial practitioners. Numerous illustrations, examples, and exercises are included with an emphasis on semantically interpretable examples.

In general, for electronic versions, I highly recommend buying the PDF directly from Springer rather than Amazon's Kindle edition. The PDF has embedded links that allow navigation on an e-reader, and it takes up about 18 MB on your device. In addition, a single PDF can be used on any device or computer. Since the PDFs are produced entirely by Springer (unlike the Kindle edition, where Amazon plays a role in the conversion), the look and feel is fully controlled by the author and publisher. This makes the PDF version of better quality than the Amazon Kindle edition.

The solution manual for the book is available here from Springer. There is a link for the solution manual on this page. If you are an instructor, then you can obtain a copy. Please do not ask me directly for a copy of the solution manual. It can only be distributed by Springer.

Chapter 1: An Introduction to Data Mining

Data Sets from UCI Machine Learning Repository

Open Source Data Mining Software (WEKA Workbench)

Apache Mahout Machine Learning Library

Spider Machine Learning Library (MATLAB)

KDnuggets: Resources in Data Mining

Data Preparator from Bobbie Stewart

Scikit-learn Data Preparation (Python)

Scikit-learn Dimensionality Reduction (Python)

PCA, SVD, and eigen-decomposition implementation by redSVD

Various forms of matrix decomposition implementations including SVD by ALGLIB

Haar Wavelet Implementation by Tom Gibara

Multidimensional Scaling (MDS) implementation from University of Konstanz

ISOMAP from Stanford University

Weka Matrix class for Matrix Operations such as SVD

ISOMAP from Stanford University

Dynamic Time Warping by D. Ellis

Longest Common Subsequence at Rosetta Code

SPMF Frequent Pattern Mining Implementations

Open Source Data Mining Software (WEKA Workbench)

FIMI Implementations on Frequent Pattern Mining

Implementations by Christian Borgelt

Scikit-learn Data Clustering (Python)

Open Source Data Mining Software (WEKA Workbench)

Apache Mahout Machine Learning Library (Clustering)

Spider Machine Learning Library (MATLAB)

Open source clustering software

Nonnegative matrix factorization in Python

OpenSubspace for high-dimensional clustering

Scikit-learn Outlier Detection (Python)

Open Source Data Mining Software (WEKA Workbench)

Scikit-learn Data Classification and Regression (Python)

Open Source Data Mining Software (WEKA Workbench)

Apache Mahout Machine Learning Library (Classification)

Spider Machine Learning Library (MATLAB)

LibSVM for Support Vector Machines

ENTOOL for Ensemble Learning and Classification

R-archive network with lots of classification and ensemble methods

IBM InfoSphere Streams Platform

Reservoir Sampling Implementation

Sketch and Lossy counting implementations

MADlib sketch and Flajolet-Martin implementation

MOA Toolkit for Massive Online Analytics

Gensim topic modeling in Python

SVMperf Software for scalable text classification

Multinomial Bayes model for classification

Time-series forecasting in R from CRAN

Gait-CAD MATLAB toolbox for clustering, classification, and regression

Cronos open source time-series package

UCR time-series clustering and classification page

GSP Sequential Pattern Mining from Weka

SPMF Sequential Pattern Mining Implementations

Open Source Data Mining Software (WEKA Workbench)

Sequence mining for computational biology

Bunch of spatial and trajectory software

Ullmann algorithm for subgraph isomorphism

Frequent subgraph mining toolbox

Cheminformatics toolbox for kernels

SNAP: Stanford Network Analysis Project

PageRank implementation for very large graphs

Apache Mahout Recommender Systems

Large scale collaborative filtering

Scikit recommender systems in Python

SNAP: Stanford Network Analysis Project

Weka Package for Semisupervised Learning and Collective Classification

NetKit-SRL for Collective Classification

Link Prediction Method (LPmade)

Open source implementation of several anonymization algorithms