Charu Aggarwal, Research Staff Member, IBM

Biography

Charu Aggarwal is a Research Staff member at the IBM T. J. Watson Research Center in Yorktown Heights, New York. He completed his B.S. from IIT Kanpur in 1993 and his Ph.D. from MIT in 1996. He has since worked in the field of performance analysis, databases, and data mining. He has published over 120 papers in refereed conferences and journals, and has been granted over 45 patents. Because of the commercial value of the above-mentioned patents, he has received several invention achievement awards and has thrice been designated a Master Inventor at IBM. He is a recipient of an IBM Corporate Award (2003) for his work on bio-terrorist threat detection in data streams, a recipient of the IBM Outstanding Innovation Award (2008) for his scientific contributions to privacy technology, and a recipient of an IBM Research Division Award (2008) for his scientific contributions to data stream research. He has served on the program committees of most major database/data mining conferences, and served as program vice-chairs of the SIAM Conference on Data Mining , 2007, the IEEE ICDM Conference, 2007, the WWW Conference 2009, and the IEEE ICDM Conference, 2009. He served as an associate editor of the IEEE Transactions on Knowledge and Data Engineering Journal from 2004 to 2008. He is an action editor of the Data Mining and Knowledge Discovery Journal , an associate editor of the ACM SIGKDD Explorations, and an associate editor of the Knowledge and Information Systems Journal. He is a senior member of the IEEE, and a life-member of the ACM.

NEW: UNCERTAIN DATA BOOK :

Managing and Mining Uncertain Data (Springer) Ed. Charu Aggarwal, February 2009. -- Comprehensive survey driven book on Uncertain Data with chapters contributed by prominent researchers in the field.

Table of Contents and introductory survey chapters


NEW: PRIVACY-PRESERVING DATA MINING BOOK:

Privacy-Preserving Data Mining: Models and Algorithms (Springer) Ed. Charu Aggarwal, Philip S. Yu, July 2008. -- Comprehensive survey driven book on Privacy-Preserving Data Mining Research with chapters contributed by prominent researchers in the field.

Table of Contents and introductory survey chapters

Book Cover


DATA STREAM BOOK:

Data Streams: Models and Algorithms (Springer) Ed. Charu Aggarwal, January 2007. -- Comprehensive survey driven book on Data Stream Research with chapters contributed by prominent researchers in the field. Table of Contents

Survey Chapter on Synopsis Construction in Data Streams

Available at Springer , Amazon.com , and Barnes and Noble .


My podcast on data streams from IBM Research

You can download the postscript/PDF files of my frequently accessed papers from my publication page. A more comprehensive list of publications is available from the DBLP database maintained by Michael Ley.

My citations can be accessed from this link to google scholar search. According to the Harzing's Publish or Perish Software my h-index is 36.

My list of granted patents with full text is available from the patent office . A searchable link (by patent number) from the US patent office can be used to access the full text of the patents.

Here is a list of my academic honors and professional activities.

A slightly outdated resume may be found here .

Contact Information: Charu Aggarwal

IBM T. J. Watson Research Center, 19 Skyline Drive, Office 4S-D40 Hawthorne, NY 10532

Email: charu (at) us (dot) ibm (dot) com

In case you have sent me an email at my earlier address with domain name watson.ibm.com, it is likely that I have not received it.


My research has covered a wide variety of data mining related topics such as clustering, association rule mining, text search, web crawling, and nearest neighbor indexing. A summary discussion of some of my more interesting research results is provided below:

Uncertain Data Mining

In many real applications, the records are not specified precisely, and may have errors built into them. In such cases, one may have to work with the errors in order to obtain more accurate data mining results. For example, if an attribute has a larger error for a classification problem, you probably want to use another attribute with lower uncertainty, even though it may not have a very high predictive power. A general framework for density based mining of uncertain data mining problems has been proposed in a recent paper.

High Dimensional Data Mining:

The problem of high dimensional data mining has been widely studied by the research community and has continued to be an intriguing problem over many years. Two important issues often arise in many high dimensional data mining applications: (1) Performance issues because of data sparsity (2) Qualitative Issues because of the lack of contrast in distance calculations. For example, if you uniformly distribute a small number of points in high dimensional space, it can be shown that the distance of each of these points to any randomly chosen target point is expected to be approximately the same. If you add correlations and skew to the data, the sparsity level reduces, but not by a lot in many real settings. This means that the concept of spatial locality (as understood in the traditional definitions of distance) does not exist for high dimensional data. If such is the case, then how should nearest neighbors be defined? How should clusters be defined? How should outliers be defined? These high dimensional issues have been considered open questions by data mining practitioners and researchers for many years. Furthermore, methods such as indexing are dependent upon meaningful spatial locality. When such spatial locality does not exist, the performance of index structures degrades rapidly. This has been experienced by anyone who has tried to use a multi-dimensional index implementation (eg. R-Tree) over high dimensional data. In my research, I have proposed that it is pointless to design more efficient solutions for existing data mining applications since their end result is often not very meaningful anyway. In many cases, an alternative approach is to re-design the definition of distance based applications. This solves the dual problems of performance and meaningfulness simultaneously. For example, the clustering problem is re-defined as projected clustering , a form of which is also known as local dimensionality reduction . The nearest neighbor search problem is re-defined as projected nearest neighbor search , and outlier search as projected outlier search. You can read a summary paper on the topic here. Another focus has been on re-designing distance functions in a domain-specific way in order to construct solutions which are sensitive to the needs of a particular data type.

Privacy Preserving Data Mining:

The ubiquity and availability of personal information about users with many corporations has lead to concerns about the privacy aspects of data mining. Recent threats of restrictive legislation have raised unnecessary concerns about the future of the data mining field. An interesting area of research is to work with less information so as to preserve the privacy of individual users. I have designed a framework for quantification of privacy preserving data mining algorithms . I have also proposed methods for condensation based privacy preserving data mining , which serves as a pseudo-data based variant to $k$-anonymity. Finally, I have shown that many traditional methods for privacy such as k-anonymity (paper link) and randomization (paper link) are susceptible to the curse of dimensionality. Methods for privacy-preserving data mining have also been extended to other domains such as text and binary data , and strings . Some recent results have also been used for leveraging association rule hiding methods against adversarial data mining .

Data Stream Mining:

Recent advances in hardware technology have allowed us to store and record data at rates which were unimaginable a few years ago. For example, even a simple transaction such as using the credit card or the phone results in automated data storage. In many cases such as the traffic streams on the internet, one cannot even store the data, but can have small segments of the stream available for processing for a limited amount of time. In such cases, all data mining algorithms must satisfy a one-pass constraint. While the one-pass constraint is necessary, it does not fully capture many other unique aspects of data stream mining. For example, data streams show temporal locality in their distribution behavior. This temporal locality can be leveraged in order to improve the quality of the results. I have worked on a number of problems in this area such as change detection in data streams , clustering evolving quantitative data streams , clustering text and categorical data streams , clustering high dimensional streams , on demand classification of data streams , futuristic query processing of streams , abnormality detection in data streams , and biased reservoir sampling in data streams . A book on data streams with surveys is described here , and a survey chapter on synopsis construction in data streams may be found here .

Interactive Data Mining:

While computer science has come a long way in being able to perform a wide variety of automated tasks, there are limits to the abilities of a computer in some fields such as data mining. This is because many data mining algorithms do not have precise objective functions which can describe an application-driven task. Rather, the objective functions are artificially constructed in order to simulate our understanding of it. In many cases, the end results turn out to be surprising and quite unexpected. This is because an automated algorithm or mathematically defined objective function cannot simulate the intuition of the human mind. I have developed human-computer cooperative algorithms for a variety of data mining problems such as clustering , classification , and nearest neighbor search. These methods combine the raw computational power of a computer with the intuition of the human effectively. The primary approach is to use visually driven methods to guide and support the user in making intelligent choices towards finding a solution for the data mining task at hand. A summary paper on the topic may be found here.

Web Crawling for Resource Discovery:

The web is a vast repository of documents and knowledge, waiting to be mined for a variety of purposes. The simplest application would be a search engine (such as Google or Yahoo!) in which crawlers regularly collect pages off the web and index them in order to facilitate user-centered keyword search. But what if you want an updated resource of web pages belonging to a particular topic? What if your query cannot be defined by the simple keyword interface of search engines? Recent work on web resource discovery has proposed methods for focussed crawling of topic specific documents. While focussed crawling is an interesting approach, it depends upon a fixed heuristic based on linkage structure. It is much more adaptive to use learning algorithms for focussed crawling and resource discovery . I have developed alternatives to focussed crawling algorithms which are intelligent and can dynamically learn the most appropriate web sites to crawl during the resource discovery process. This includes linkage based learning methods for focussed crawling or learning user browsing behavior for resource discovery .

Long Pattern Mining:

I have developed a number of tree projection algorithms for mining association patterns from transaction data. A depth first version of the tree projection algorithm can mine very long patterns in dense data bases. This is particularly useful for mining biological or DNA data sets. I have also developed methods for online association rule mining and have critiqued the standard association rule framework.

Data Mining for Electronic Commerce:

The ability to perform automated storage of web based transactions has resulted in vast repositories of useful data which can be mined for commercial purposes. Companies such as Amazon.com regularly use your buying or browsing behavior in order to personalize their web site and portal presentations. Their aim is to maximize their revenue by presenting you advertisements and products which you are likely to be interested in. This has given rise to the exciting field of data mining for electronic commerce. How do we use the previous customer browsing behavior to predict future behavior in a statistically robust way? I have developed a number of methods to perform WWW based advertising, Portal Personalization, and collaborative filtering .

Indexing Mobile Objects and other Specialized Data Domains:

While indexing quantitative data has been widely studied, the problem of indexing many non-conventional domains require specialized dimensionality reduction and indexing techniques. Another line of research has concentrated on index friendly data representations of text and index-friendly distance functions. Such methods improve the effectiveness and efficiency of similarity search simultaneously. I have also obtained some recent results on indexing mobile objects in nonlinear trajectories. These are the first results in indexing mobile objects following a non-linear trajectory. All earlier papers have only discussed the problem of indexing moving objects in linear trajectories.

Mining Specialized Data:

One of the interesting characteristics of the data mining area is that the field evolves very fast, since new data domains crop up frequently. This is because of new data collection and extraction technologies. For example, string mining has recently received a boost because of the ability to collect and extract genome data. Similarly XML processing has spawned the new area of structural mining. In some cases such as text mining, old domains become re-energized because of the effect of new enabling technologies such as the web. For each domain, the nature of the problems are quite different and require different algorithmic techniques. I have developed a number of algorithms for string classification , XML classification , indexing categorical data or market basket data, and flexible methods for similarity search in biological data . I have also developed some partially supervised methods for text classification .
Download Link for PS/PDF files of frequently accessed papers