Data Mining: The Textbook, Springer, May 2015
Charu C. Aggarwal.
PDF Download Link (Free for computers connected to subscribing institutions only)
Buy hard-cover or PDF (PDF has embedded links for navigation on e-readers)
Buy low-cost paperback edition (Instructions for computers connected to subscribing institutions only)
The emergence of data science as a discipline requires the development of a book that goes beyond the traditional focus of books on fundamental data mining problems. More emphasis needs to be placed on the advanced data types such as text, time series, discrete sequences, spatial data, graph data, and social networks. This comprehensive data mining book explores the different aspects of data mining, starting from the fundamentals, and subsequently explores the complex data types and their applications. Therefore, this book may be used for both introductory and advanced data mining courses. The chapters of this book fall into one of three categories:
The fundamental chapters: Data mining has four main problems, which correspond to clustering, classification, association pattern mining, and outlier analysis. These chapters comprehensively discuss a wide variety of methods for these problems.
Domain chapters: These chapters discuss the specific methods used for different domains of data such as text data, time-series data, sequence data, graph data, and spatial data.
Application chapters: These chapters study important applications such as stream mining, Web mining, ranking, recommendations, social networks, and privacy preservation. The domain chapters also have an applied flavor.
The book carefully balances mathematical details and intuition. It contains the necessary mathematical details for professors and researchers, but it is presented in a simple and intuitive style to improve accessibility for students and industrial practitioners. Numerous illustrations, examples, and exercises are included with an emphasis on semantically interpretable examples.
In general, for electronic versions, I highly recommend buying the PDF directly from Springer rather than Amazon's Kindle edition. The PDF has embedded links that allow navigation on an e-reader, and it takes about 18 MB on your device. In addition, a single PDF can be used on any device or computer. Since the PDFs are produced entirely by Springer (unlike the Amazon Kindle edition, where Amazon plays a role in the conversion), the look and feel is fully controlled by the author and publisher. This makes the PDF version of better quality than the Amazon Kindle edition.
The solution manual for the book is available here from Springer. There is a link for the solution manual on this page. If you are an instructor, then you can obtain a copy. Please do not ask me directly for a copy of the solution manual. It can only be distributed by Springer.
Chapter 1: An Introduction to Data Mining
Data Sets from UCI Machine Learning Repository
Open Source Data Mining Software (WEKA Workbench)
Apache Mahout Machine Learning Library
Spider Machine Learning Library (MATLAB)
KDnuggets: Resources in Data Mining
Data Preparator from Bobbie Stewart
Scikit-learn Data Preparation (Python)
Scikit-learn Dimensionality Reduction (Python)
PCA, SVD, and eigen-decomposition implementation by redSVD
Various forms of matrix decomposition implementations including SVD by ALGLIB
Haar Wavelet Implementation by Tom Gibara
Multidimensional Scaling (MDS) implementation from University of Konstanz
ISOMAP from Stanford University
Weka Matrix class for Matrix Operations such as SVD
ISOMAP from Stanford University
Dynamic Time Warping by D. Ellis
Longest Common Subsequence at Rosetta Code
SPMF Frequent Pattern Mining Implementations
Open Source Data Mining Software (WEKA Workbench)
FIMI Implementations on Frequent Pattern Mining
Implementations by Christian Borgelt
Scikit-learn Data Clustering (Python)
Open Source Data Mining Software (WEKA Workbench)
Apache Mahout Machine Learning Library (Clustering)
Spider Machine Learning Library (MATLAB)
Open source clustering software
Nonnegative matrix factorization in Python
OpenSubspace for high-dimensional clustering
Scikit-learn Outlier Detection (Python)
Open Source Data Mining Software (WEKA Workbench)
Scikit-learn Data Classification and Regression (Python)
Open Source Data Mining Software (WEKA Workbench)
Apache Mahout Machine Learning Library (Classification)
Spider Machine Learning Library (MATLAB)
LibSVM for Support Vector Machines
ENTOOL for Ensemble Learning and Classification
R-archive network with lots of classification and ensemble methods
IBM Infosphere Streams Platform
Reservoir Sampling Implementation
Sketch and Lossy counting implementations
MADlib sketch and Flajolet-Martin implementation
MOA Toolkit for Massive Online Analytics
Gensim topic modeling in Python
SVMperf Software for scalable text classification
Multinomial Bayes model for classification
Time-series forecasting in R from CRAN
Gait-CAD MATLAB toolbox for clustering, classification, and regression
Cronos open source time-series package
UCR time-series clustering and classification page
GSP Sequential Pattern Mining from Weka
SPMF Sequential Pattern Mining Implementations
Open Source Data Mining Software (WEKA Workbench)
Sequence mining for computational biology
A collection of spatial and trajectory software
Ullman algorithm for subgraph isomorphism
Frequent subgraph mining toolbox
Cheminformatics toolbox for kernels
SNAP: Stanford Network Analysis Project
PageRank implementation for very large graphs
Apache Mahout Recommender Systems
Large scale collaborative filtering
Scikit recommender systems in Python
SNAP: Stanford Network Analysis Project
Weka Package for Semisupervised Learning and Collective Classification
NetKit-SRL for Collective Classification
Link Prediction Method (Lpmade)
Open source implementation of several anonymization algorithms