## TR #406: A fast greedy pairwise distance clustering algorithm and its
use in discovering thematic structures in large data sets

**
Matthew Brand
** November 1996
Submitted to:

Journal of Artificial Intelligence Research (JAIR)

Finding cluster structure in data is a common problem in linguistics,
molecular biology, and psychometrics. Often these datasets are
characterized only by pairwise dissimilarities between items, i.e.,
there are no point coordinates in a metric space. This paper presents
a fast clustering algorithm that combines greedy merge operations with
cluster refinement in an expectation-maximization framework. The
algorithm is distinguished from previous approaches in that it does
not depend on a combinatorically expensive metric space embedding
and/or user hints as to the dimensionality, number, and variance of
the clusters. As the algorithm converges it is possible to estimate
the data's actual dimensionality from the discovered cluster
structure. We show two applications of the algorithm: Discovering
protein families given sequence alignments and extracting thematic
content from narratives using rough similarity data from an electronic
dictionary.

See TR 411 for a detailed description of
an application to visualizing the thematic structure of textual stories.