Traditional clustering methods partition objects into mutually exclusive clusters. In many applications, however, it is more realistic to let objects belong to multiple, overlapping clusters. Yet when the data are available only in the form of pairwise distances, there exists no probabilistic clustering model that allows overlapping clustering of objects. Examples of such pairwise distance data are genomic string alignments, protein contact maps, and pairwise patient similarities. The general strategy for clustering distance data is to first embed the distances into a (lower-dimensional) Euclidean space and then cluster the embedded points. The embedding step introduces unnecessary noise and bias. It would therefore be advantageous to have a model that clusters distance data directly.
In this paper, we address this problem and introduce a Probabilistic model for Overlapping Clustering on Distance data (POCD), which models overlapping clusters when objects are available only as pairwise distances.
In POCD, an object can belong to one or more clusters at the same time. Since POCD is a probabilistic model, its output consists of samples from a distribution over partitions. Further, by using an Indian Buffet Process (IBP) prior, neither the number of clusters nor the number of overlapping clusters needs to be fixed in advance. We demonstrate the benefits of working with distances directly and the utility of POCD on both simulated and real-world distance data of neonatal patients and HIV-1 protease inhibitor contact maps.
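To illustrate why an IBP prior removes the need to fix the number of clusters, the following is a minimal sketch of drawing a binary object-by-cluster membership matrix from the standard sequential IBP construction. It covers the prior only, not the POCD likelihood or inference scheme, and the names `sample_ibp`, `sample_poisson`, and `alpha` are hypothetical choices made for this example.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw one Poisson(rate) variate with Knuth's multiplication method."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_ibp(num_objects, alpha, seed=0):
    """Draw a binary membership matrix Z from an IBP(alpha) prior.

    Z[i][k] == 1 means object i belongs to cluster k. A row may contain
    several ones (overlapping membership), and the number of columns is
    random rather than fixed in advance.
    """
    rng = random.Random(seed)
    rows = []    # one membership list per object
    counts = []  # counts[k] = number of objects so far in cluster k
    for i in range(1, num_objects + 1):
        # Join each existing cluster k with probability counts[k] / i.
        row = [1 if rng.random() < m / i else 0 for m in counts]
        # Open Poisson(alpha / i) brand-new clusters for this object.
        new = sample_poisson(alpha / i, rng)
        row.extend([1] * new)
        # zip stops at the old clusters; new clusters start with count 1.
        counts = [m + z for m, z in zip(counts, row)] + [1] * new
        rows.append(row)
    num_clusters = len(counts)
    # Pad earlier rows with zeros so that Z is rectangular.
    return [row + [0] * (num_clusters - len(row)) for row in rows]


if __name__ == "__main__":
    Z = sample_ibp(num_objects=8, alpha=2.0, seed=1)
    for row in Z:
        print(row)
```

Each run with a different seed yields a different number of columns, and larger values of `alpha` tend to produce more clusters. In a full model such draws would be combined with a likelihood on the pairwise distances, which this sketch does not attempt.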
(This is joint work with Severin Kasser and Sven Wellmann, both of the Division of Neonatology, University Children’s Hospital Basel (UKBB), Basel, Switzerland, and Julia E. Vogt of the Department of Mathematics and Computer Science, University of Basel, Switzerland, and the Swiss Institute of Bioinformatics (SIB), Basel, Switzerland.)