Charu Aggarwal is a Research Staff member at the IBM T. J. Watson Research
Center in Yorktown Heights, New York. He completed his B.S. from IIT Kanpur in 1993 and his Ph.D. from
MIT in 1996. He has since worked in the field of
performance analysis, databases, and data mining.
He has published over 120 papers in refereed conferences
and journals, and has been granted over 45 patents.
Because of the commercial value of the above-mentioned patents,
he has received several invention achievement awards and has thrice been designated a Master Inventor at IBM.
He is a recipient of an IBM Corporate
Award (2003) for his work on bio-terrorist threat detection in data streams, a recipient of the IBM Outstanding Innovation Award (2008) for his scientific contributions to privacy technology, and a recipient of an IBM Research Division Award (2008) for his scientific contributions to data stream
research.
He has served on the program committees of most major database/data mining conferences, and served as program vice-chairs of the SIAM Conference on Data
Mining , 2007, the IEEE ICDM Conference, 2007, the WWW Conference 2009, and the IEEE ICDM Conference, 2009. He served as an associate
editor of the IEEE Transactions on Knowledge and Data Engineering Journal from 2004 to 2008. He is an action editor of the Data Mining and Knowledge Discovery Journal , an associate editor of the
ACM SIGKDD Explorations, and an associate editor of the Knowledge and Information Systems Journal.
He is a senior member of the IEEE, and a life-member of the ACM.
NEW: UNCERTAIN DATA BOOK :
Managing and Mining Uncertain Data (Springer) Ed. Charu Aggarwal, February 2009.
-- Comprehensive survey driven book on Uncertain Data
with chapters contributed by prominent researchers in the field.
Privacy-Preserving Data Mining: Models and Algorithms (Springer) Ed. Charu Aggarwal, Philip S. Yu, July 2008.
-- Comprehensive survey driven book on Privacy-Preserving Data Mining Research
with chapters contributed by prominent researchers in the field.
Data Streams: Models and Algorithms (Springer) Ed. Charu Aggarwal, January 2007.
-- Comprehensive survey driven book on Data Stream Research
with chapters contributed by prominent researchers in the field.
Table of Contents
My list of granted patents with full text is available from the patent office .
A searchable link (by patent number) from the US patent office can be used to access the full text
of the patents.
IBM T. J. Watson Research Center, 19 Skyline Drive, Office 4S-D40 Hawthorne, NY 10532
Email: charu (at) us (dot) ibm (dot) com
In case you have sent me an email at my earlier address with
domain name watson.ibm.com, it is likely that I have not received it.
My research has covered a wide variety of data mining related topics such
as clustering, association rule mining, text search,
web crawling, and nearest neighbor indexing. A summary discussion of some of my more interesting research results is provided below:
Uncertain Data Mining
In many real applications, the records are not specified precisely, and may have errors built into them. In such cases,
one may have to work with the errors in order to obtain more accurate data mining results. For example, if an attribute has a larger error for a classification problem, you probably want to use another attribute with lower uncertainty, even though it may not have a very high predictive power. A general
framework for density based mining of uncertain data mining problems has been proposed
in a recent paper.
High Dimensional Data Mining:
The problem of high dimensional
data mining has been widely studied by the research community and has
continued to be an intriguing problem over many years. Two important
issues often arise in many high dimensional data mining applications:
(1) Performance issues because of data sparsity (2) Qualitative
Issues because of the lack of contrast in distance calculations.
For example, if you uniformly distribute a small number of points in high dimensional
space, it can be shown that the distance of each of these points
to any randomly chosen target point is expected to be approximately the same.
If you add correlations and skew to the data, the sparsity level reduces,
but not by a lot in many real settings.
This means that the concept of spatial locality
(as understood in the traditional definitions of distance) does not exist for
high dimensional data. If such is the case, then how should nearest neighbors
be defined? How should clusters be defined? How should outliers be defined?
These high dimensional issues have been considered open questions by data
mining practitioners and researchers for
many years. Furthermore, methods such as indexing are dependent
upon meaningful spatial locality. When such spatial locality
does not exist, the performance of index structures degrades rapidly.
This has been experienced by anyone who has tried to use a multi-dimensional
index implementation (eg. R-Tree)
over high dimensional data.
In my research,
I have proposed that it is pointless to design more efficient solutions
for existing data mining applications since their end result is often
not very meaningful anyway. In many cases,
an alternative approach is to re-design the definition of
distance based applications.
This solves the dual problems of performance and meaningfulness simultaneously.
For example, the clustering problem is re-defined as
projected clustering , a form of which is also known
as local dimensionality reduction . The nearest neighbor search problem is re-defined
as projected nearest neighbor search , and
outlier search as projected outlier search.
You can read a summary paper on the topic here.
Another focus has been on re-designing distance functions in a domain-specific
way in order to construct solutions which are sensitive to the needs
of a particular data type.
Privacy Preserving Data Mining:
The ubiquity and availability
of personal information about users with many corporations has lead to
concerns about the privacy aspects of data mining. Recent
threats of restrictive legislation have raised unnecessary concerns
about the future of the data mining field. An interesting area of research
is to work with less information so as to preserve the
privacy of individual users. I have designed a framework for
quantification of privacy preserving data mining algorithms . I have also proposed
methods for
condensation based privacy preserving data mining ,
which serves as a pseudo-data based variant to $k$-anonymity. Finally,
I have shown that many traditional methods for privacy such as k-anonymity (paper link) and randomization (paper link) are susceptible to the curse of dimensionality. Methods for privacy-preserving data mining have also been extended to other domains such as text and binary data , and strings . Some recent results have also been used for leveraging association rule hiding methods against adversarial data mining .
While computer science has
come a long way in being able to perform a wide
variety of automated tasks,
there are limits to
the abilities of a computer in some fields such
as data mining.
This is because many data mining algorithms do not have precise objective
functions which can describe an application-driven task. Rather, the objective functions
are artificially constructed in order to simulate our understanding of it. In
many cases, the end results turn out to be surprising and quite unexpected.
This is because an automated algorithm or mathematically defined
objective function cannot simulate the
intuition of the human mind. I have developed human-computer
cooperative algorithms for a variety of data mining
problems such as clustering , classification , and
nearest neighbor search.
These methods combine the raw computational power of a
computer with the intuition of the human effectively.
The primary approach is to use visually driven methods to guide and
support the user in making intelligent choices towards finding
a solution for the data mining task at hand. A summary paper on
the topic may be found here.
Web Crawling for Resource Discovery:
The web is a vast repository
of documents and knowledge, waiting to be mined
for a variety of purposes. The simplest application
would be a search engine (such as Google or Yahoo!)
in which crawlers regularly
collect pages off the web and index them in order to facilitate
user-centered keyword search. But what if you want an updated resource of web
pages belonging to a particular topic? What if your query cannot be defined
by the simple keyword interface of search engines?
Recent work on web resource discovery has proposed
methods for focussed crawling of topic specific documents. While focussed crawling is an interesting
approach, it depends upon a fixed heuristic based on linkage structure.
It is much more adaptive to
use learning algorithms for focussed crawling and
resource discovery . I have developed alternatives to
focussed crawling algorithms which
are intelligent and can dynamically learn the most appropriate web sites
to crawl during the resource discovery process. This includes
linkage based learning methods for focussed crawling or
learning user browsing behavior for resource discovery .
Long Pattern Mining:
I have developed a number of tree projection
algorithms for mining association patterns from transaction data.
A depth first version of the tree projection
algorithm can mine very long patterns in dense data bases. This is particularly useful
for mining biological or DNA data sets. I have also
developed
methods for online association rule mining and have
critiqued
the standard association rule framework.
Data Mining for Electronic Commerce:
The ability to
perform automated storage of
web based transactions has resulted in vast repositories
of useful data which can be mined for commercial purposes.
Companies such as Amazon.com regularly use your buying or browsing behavior
in order to personalize their web site and portal presentations. Their aim
is to maximize their revenue by presenting you advertisements and products
which you are likely to be interested in.
This has given rise to the exciting field of data mining
for electronic commerce. How do we use the previous customer browsing behavior
to predict future behavior in a statistically robust way?
I have developed a number of methods
to perform WWW based advertising, Portal Personalization, and
collaborative filtering .
Indexing Mobile Objects and other Specialized Data Domains:
While indexing quantitative
data has been widely studied, the problem of indexing many non-conventional
domains require
specialized dimensionality reduction and indexing techniques.
Another line of research has concentrated on index friendly data representations of text
and index-friendly distance functions. Such methods improve the
effectiveness and efficiency of similarity search simultaneously.
I have also
obtained some recent results on indexing
mobile objects in nonlinear trajectories. These
are the first results in indexing mobile objects
following a non-linear trajectory. All earlier papers
have only discussed the problem of indexing moving
objects in linear trajectories.