LIMBO: scaLable InforMation BOttleneck

University of Toronto
Department of Computer Science



Clustering Categorical Data

Clustering becomes more challenging when the data is categorical, that is, when there is no inherent distance measure between data values. LIMBO is a scalable hierarchical clustering algorithm for categorical data. It builds on the Information Bottleneck (IB) framework, which quantifies the relevant information preserved when clustering. We use the IB framework to define a distance measure for categorical tuples. LIMBO handles large data sets by producing a memory-bounded summary model of the data.
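To make the idea of an IB-derived distance concrete, the following is a minimal sketch, not the LIMBO implementation. It assumes each tuple is represented as a uniform distribution over its (attribute, value) pairs, and it measures the distance between two clusters as the mutual information lost by merging them: the combined cluster weight times the weighted Jensen-Shannon divergence of their distributions, as in agglomerative IB. The function names `tuple_distribution`, `kl_divergence`, and `information_loss` are hypothetical and chosen only for illustration.

```python
import math


def tuple_distribution(tup):
    """Represent a categorical tuple as a uniform distribution over its
    (attribute index, value) pairs. This representation is an assumption."""
    pairs = [(i, value) for i, value in enumerate(tup)]
    weight = 1.0 / len(pairs)
    return {pair: weight for pair in pairs}


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits.
    Every key with positive mass in p must also appear in q."""
    return sum(pv * math.log2(pv / q[key]) for key, pv in p.items() if pv > 0)


def information_loss(weight1, dist1, weight2, dist2):
    """Mutual information lost when two clusters are merged:
    (weight1 + weight2) times the weighted Jensen-Shannon divergence."""
    total = weight1 + weight2
    w1, w2 = weight1 / total, weight2 / total
    keys = set(dist1) | set(dist2)
    # Distribution of the merged cluster: a weighted mixture of the two.
    merged = {k: w1 * dist1.get(k, 0.0) + w2 * dist2.get(k, 0.0) for k in keys}
    js = w1 * kl_divergence(dist1, merged) + w2 * kl_divergence(dist2, merged)
    return total * js


# Three tuples over the hypothetical attributes (colour, size),
# each carrying weight 1/3 as a singleton cluster.
t1 = tuple_distribution(("red", "small"))
t2 = tuple_distribution(("red", "large"))
t3 = tuple_distribution(("blue", "large"))
w = 1.0 / 3.0

loss_shared = information_loss(w, t1, w, t2)    # tuples share one value
loss_disjoint = information_loss(w, t1, w, t3)  # tuples share no value
```

In this sketch, tuples that share attribute values lose less information when merged than tuples that share none, so no hand-crafted distance between values such as "red" and "blue" is needed.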

Identifying Structure in Large Data Sets

When doing data design, the information content (or redundancy) of the data is measured with respect to a prescribed model for the data. We consider the problem of data redesign in an environment where the prescribed model is unknown or incomplete. We propose a set of LIMBO-based techniques for finding structural clues in an instance of data, which may contain errors, missing values, and duplicate records.

Clustering Software Data

Most algorithms in the software clustering literature use structural information to decompose large software systems. Other approaches, such as using file names or ownership information, have also demonstrated merit. However, there is no intuitive way to combine information obtained from these two types of techniques. Using LIMBO, we combine structural and non-structural information in an integrated fashion.
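One way to picture the integration is to place both kinds of information in a single feature distribution per software artifact, which an information-theoretic distance can then compare. The sketch below is an illustration under that assumption, not the published method. The function `artifact_distribution`, the feature tags, and the sample file names and owner are all hypothetical.

```python
from collections import Counter


def artifact_distribution(structural_deps, attributes):
    """Combine structural features (artifacts this one depends on) and
    non-structural features (for example owner or directory) into one
    normalized distribution over tagged features."""
    counts = Counter(("uses", dep) for dep in structural_deps)
    counts.update((kind, value) for kind, value in attributes.items())
    total = sum(counts.values())
    if total == 0:
        raise ValueError("artifact has no features")
    return {feature: n / total for feature, n in counts.items()}


# Hypothetical source file with two dependencies and one ownership attribute.
parser = artifact_distribution(["lexer.c", "util.c"], {"owner": "alice"})
```

Because structural and non-structural features share one probability space here, neither kind needs its own distance function or a hand-tuned weight to be compared with the other.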