Clustering Categorical Data
The problem of clustering becomes more challenging when the data is categorical,
that is, when there is no inherent distance measure between data values. LIMBO is
a scalable hierarchical categorical clustering algorithm that builds on the Information
Bottleneck (IB) framework for quantifying the relevant information preserved when clustering.
We use the IB framework to define a distance measure for categorical tuples. LIMBO
handles large data sets by producing a memory-bounded summary model for the data.
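As a rough illustration of an IB-style distance, the sketch below computes the information lost when two clusters are merged, using the standard agglomerative IB result that this loss equals the combined cluster mass times the Jensen-Shannon divergence of the two conditional distributions. The function names, the toy attribute values, and the choice to represent a tuple as a uniform distribution over its attribute values are assumptions made for this example, not a description of the LIMBO implementation.

```python
import math

def tuple_to_distribution(values, vocabulary):
    """Represent a categorical tuple as a uniform distribution over its values.

    Assumption for this sketch: each of the m values in the tuple gets
    probability 1/m, and every other value in the vocabulary gets 0.
    """
    present = set(values)
    unknown = present - set(vocabulary)
    if not present or unknown:
        raise ValueError("tuple must be non-empty and use known values")
    weight = 1.0 / len(present)
    return [weight if v in present else 0.0 for v in vocabulary]

def kl_divergence(p, q):
    """KL divergence D(p || q) in bits; terms with p[i] == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def merge_cost(mass1, dist1, mass2, dist2):
    """Information loss of merging two clusters, plus the merged distribution.

    mass1, mass2: cluster probabilities p(c1), p(c2), both assumed > 0.
    dist1, dist2: conditional distributions over attribute values.
    """
    total = mass1 + mass2
    w1, w2 = mass1 / total, mass2 / total
    merged = [w1 * a + w2 * b for a, b in zip(dist1, dist2)]
    # Weighted Jensen-Shannon divergence between the two distributions.
    js = w1 * kl_divergence(dist1, merged) + w2 * kl_divergence(dist2, merged)
    return total * js, merged

# Toy usage with hypothetical attribute values.
vocab = ["genre=drama", "genre=comedy", "year=1990", "year=2000"]
t1 = tuple_to_distribution(["genre=drama", "year=1990"], vocab)
t2 = tuple_to_distribution(["genre=drama", "year=2000"], vocab)
cost, merged = merge_cost(0.5, t1, 0.5, t2)
```

Tuples that share more attribute values yield a smaller merge cost, so the cost behaves like a distance even though the values themselves have no inherent ordering or metric.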
Identifying Structure in Large Data Sets
In data design, the information content (or redundancy) of the data is
measured with respect to a prescribed model for the data. We consider the problem
of data redesign in an environment where the prescribed model is unknown or incomplete,
and propose a set of LIMBO-based techniques for finding structural clues in a data
instance that may contain errors, missing values, and duplicate records.
Clustering Software Data
Most algorithms in the software clustering literature use structural
information to decompose large software systems. Other approaches, such as
using file names or ownership information, have also demonstrated merit.
However, there is no intuitive way to combine the information obtained from
these two types of techniques. Using LIMBO, we combine structural and
non-structural information in an integrated fashion.
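One way to picture such a combination is to describe each file by a single set of features that mixes both kinds of information, and then treat the file as a probability distribution over that shared feature vocabulary. The sketch below does this under stated assumptions: the feature prefixes, the file names, the owner field, and the uniform weighting are all hypothetical choices for illustration, not the encoding used by the authors.

```python
def file_features(dependencies, attributes):
    """Merge structural and non-structural information into one feature set.

    dependencies: names of files this file depends on (structural).
    attributes: dict of non-structural facts, e.g. {"owner": "alice"}.
    """
    structural = {f"dep:{d}" for d in dependencies}
    non_structural = {f"{key}:{value}" for key, value in attributes.items()}
    return structural | non_structural

def build_vocabulary(feature_sets):
    """Sorted list of every feature seen in any file."""
    return sorted(set().union(*feature_sets))

def to_distribution(features, vocabulary):
    """Uniform distribution over a file's features (an assumed weighting)."""
    if not features:
        raise ValueError("a file needs at least one feature")
    weight = 1.0 / len(features)
    return [weight if f in features else 0.0 for f in vocabulary]

# Hypothetical files in a small system.
files = {
    "parser.c": file_features(["util.c", "lexer.c"], {"owner": "alice"}),
    "lexer.c": file_features(["util.c"], {"owner": "alice"}),
    "gui.c": file_features(["widgets.c"], {"owner": "bob"}),
}
vocabulary = build_vocabulary(files.values())
vectors = {name: to_distribution(fs, vocabulary) for name, fs in files.items()}
```

Because every file ends up as a distribution over the same vocabulary, a single information-theoretic distance can compare files without treating dependency features and ownership features separately.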