Faculty Candidate Seminar
Data mining large databases: scalability considerations
President, DMX Group
Data mining methods have grown in importance as data sets have grown larger and more numerous. Many of the fundamental problems in performing data mining tasks rely on statistical estimation and modeling. However, many of the computational advances in statistical analysis methods paid little attention to the problems of scaling to massive data sets (in contrast to much of the work on the database systems side, where scalability is a central theme). In this talk, we present several algorithms and considerations in the area of scaling data mining algorithms to large databases. The approaches fundamentally rely on decomposing algorithms into basic components that lend themselves more easily to scaling to large data. It turns out that most popular algorithms can be decomposed into components that need to be close to the data and others that can operate over reduced forms or sufficient statistics of the data. The key to a good decomposition is to keep the components that need to "touch the data" simple and fast. In addition, it is important to consider the number of times an algorithm requires a scan of the data. After covering a couple of illustrative examples of scaling algorithms to large databases, we consider the converse approach: can we use fundamental notions in data mining to help solve classical database problems such as indexing high-dimensional data and estimating query selectivity? The theme here is that database considerations are important in data mining, while statistical and data mining considerations play an important role in database systems. We wrap up the discussion of databases with brief coverage of some work on integrating data mining into a major commercial database system (Microsoft SQL Server). We conclude the talk with a summary of the numerous remaining technical challenges facing the field of data mining.
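As a rough illustration of the decomposition described in the abstract, the following minimal sketch separates a component that scans the data once from a component that works only on sufficient statistics. It is not taken from the talk: the per-class Gaussian model, the (label, value) row layout, and the function names scan_chunk, merge, and fit are all assumptions chosen for brevity.

```python
from collections import defaultdict

# Hypothetical row layout: (class label, numeric attribute value).

def scan_chunk(rows):
    """Data-side component: a single pass that only accumulates
    count, sum, and sum of squares per class label."""
    stats = defaultdict(lambda: [0, 0.0, 0.0])
    for label, x in rows:
        s = stats[label]
        s[0] += 1
        s[1] += x
        s[2] += x * x
    return dict(stats)

def merge(a, b):
    """Combine sufficient statistics from two chunks (or two scans)."""
    out = {k: list(v) for k, v in a.items()}
    for k, v in b.items():
        if k in out:
            out[k] = [p + q for p, q in zip(out[k], v)]
        else:
            out[k] = list(v)
    return out

def fit(stats):
    """Model-side component: estimates a per-class mean and (population)
    variance from the statistics alone, never touching the rows."""
    model = {}
    for label, (n, s, ss) in stats.items():
        mean = s / n
        var = max(ss / n - mean * mean, 0.0)  # clamp rounding error
        model[label] = (mean, var)
    return model

# Toy data standing in for chunks read from a large table.
chunks = [
    [("a", 1.0), ("a", 3.0), ("b", 10.0)],
    [("a", 5.0), ("b", 14.0)],
]

total = {}
for chunk in chunks:  # each row is read exactly once
    total = merge(total, scan_chunk(chunk))

model = fit(total)
```

In this sketch the only code that touches rows is a loop of additions, and the whole model is built from one scan; the sum-of-squares formula is used for simplicity and can lose precision on real data.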
Usama Fayyad is Founder & CEO of DMX Group, a data mining services firm that delivers advanced predictive and strategic data applications to some of the world's largest organizations. Prior to DMX Group, Dr. Fayyad co-founded digiMine, Inc. (now Revenue Science, Inc.) in early 2000 and served as its President and CEO until 2003. digiMine is a venture-funded company focused on hosted data mining and business intelligence solutions that are delivered on demand. Dr. Fayyad grew the company to over 100 employees and raised $45 million of capital from top venture and institutional investors. Prior to digiMine, Dr. Fayyad founded and led Microsoft Research's Data Mining & Exploration Group. At Microsoft, he led a research program focused on scalable algorithms for mining massive databases, and he also led the development of data mining components within Microsoft products, including SQL Server 2000. From 1989 to 1995, Dr. Fayyad was at NASA's Jet Propulsion Laboratory (JPL), California Institute of Technology, where he founded and grew a multimillion-dollar advanced research program to develop data mining systems for the analysis of large scientific databases.