Wesleyan University                                                                            The Honors College

 

 

 

 

 

 

 

 

 

 

 

 

Mining Clusters for Knowledge:

Finding Algorithm-Independent Groups in Microarray Data

 

by

 

Joshua E. Blumenstock

Class of 2003

 

 

 

 

 

 

 

 

 

 

A thesis submitted to the

faculty of Wesleyan University

in partial fulfillment of the requirements for the

Degree of Bachelor of Arts

with Departmental Honors in Physics

and the Computer Science Program

 

 

Middletown, Connecticut                                                                                  April, 2003

 


 

Mining Clusters for Knowledge:

Finding Algorithm-Independent Groups in Microarray Data

 

by

 

Joshua E. Blumenstock

Class of 2003

 

 

Abstract

 

Using DNA microarray technology, it is now possible to measure the expression levels of tens of thousands of genes.  Statistical analysis of these expression levels provides insight into the function of genes and their biological pathways, as well as information about the genomic underpinnings of many common diseases.  Cluster analysis is a form of unsupervised learning commonly used to analyze microarray data, and there are several different types of cluster analysis to choose from.  It is widely acknowledged that the different types of cluster analysis can produce vastly inconsistent results, yet there is no known way to deal with these inconsistencies.  In this thesis, I present a novel approach to the cluster analysis of microarray data.  The proposed methodology combines and distills the information generated by different types of cluster analysis, and produces a representative clustering structure.  Several new statistics are developed to identify dominant clusters and assess consistency across clustering algorithms.  Using real data from leukemia patients, the proposed methodology is shown to outperform the naïve choice of a single algorithm.

Table of Contents

Acknowledgements. v

1     Introduction.. 1

1.1    Abstract 1

1.2    Organization of Thesis. 2

2     Background on Microarray Technology.. 7

2.1    Genetic Background. 7

2.2    Microarrays and Expression Levels. 8

2.3    Quantitative Aspects of Microarray Data. 9

3     Cluster Analysis of Microarray Experiments. 12

3.1    Introduction to Cluster Analysis. 12

3.2    Clustering Algorithms. 15

3.2.1    Hierarchical Algorithms – Agglomerative Hierarchical Clustering. 16

3.2.2    Partitioning Algorithms - K-Means. 18

3.3    Major Sources of Variability in Clustering. 20

3.3.1    Choice of Algorithm.. 20

3.3.2    Choice of Parameters. 20

3.3.3    Choice of Input Data. 21

3.4    Summary. 22

4     Methods. 24

4.1    Motivation & Prior Work. 24

4.2    Overview of Methods. 28

4.3    Producing A Large Collection of Clusters: ‘Shotgun’ Clustering. 31

4.4    Measuring Global Consistency of the Collection: ... 33

4.5    Identifying Prevalent Clusters: ..... 34

4.6    Clustering the Clusters: ‘Condensation’ Clustering. 36

4.6.1    Clustering Clusters. 37

4.6.2    Merging Clusters to Form Multisets of Samples. 38

4.6.3    Clustering Multisets. 39

4.6.4    Distance Between Multisets. 40

4.6.5    Condensation Algorithm.. 42

5     Results. 44

5.1    Data. 44

5.2    Shotgun Clustering. 46

5.3    Diagnostic Tests. 47

5.3.1    Applications of *... 47

5.3.2    Distribution of Prevalent Clusters. 52

5.4    Biologically-Relevant Results. 53

5.4.1    Single Clustering Configurations Miss the ‘Actual’ Partition. 54

5.4.2    Mining Clusters for Information With Prevalence ..... 56

5.4.3    Mining for Patterns Using Condensation Clustering. 59

5.5    Summary of Results. 63

6     Conclusions. 67

6.1    Conclusions. 67

6.2    Directions for Future Research. 68

References. 70

Appendix A:  Sources of Variability and Indeterminacy in the Clustering Process. 76

A.1    Choice of Algorithm.. 77

A.2    Choice of Parameters. 81

A.2.1    Distance Metric. 81

A.2.2    Linkage Function (hierarchical) 85

A.2.3    Choice of k-value, and the placement of the seeds (for k-means) 89

A.3    Choice of Input Data. 90

Appendix B:  Implementation.. 94

B.1    Overview.. 94

B.2    Database Schema and Code. 96

B.3    Shotgun Stage. 99

B.4    Consistency and Prevalence. 101

B.5    Condensation Stage. 102

B.6    Results & Visualization. 104

 

 


Acknowledgements

 

If anyone is to be acknowledged first, it must be my parents.  I thank them for giving me love and getting me to a place from which I can publicly acknowledge them. 

 

With respect to this thesis, I owe an enormous debt to my advisors Michael Rice and Rick Jensen.  Dr. Rice provided tremendous feedback and direction, especially in the last few weeks as the thesis went from a mess of code to a document with paragraphs and page numbers.  Dr. Jensen has been an inspiration for many years now, and was always ready to help with an inexhaustible supply of ideas and insight. 

 

I also thank Dr. Adam Fieldsteel for helping me wade through the depths of probability and the murky waters of statistics, Dr. Michael Keane for much-needed discussion and criticism, and Dr. Danny Krizanc for letting me scribble all over his chalkboard. 

 

Lastly, thanks to Dan, Alex, Shana and the other cohorts in PacLab, mainly just for being idiots with me at 7am.  To Loveland and everyone else who asked me to explain this thesis, thank you, without that continual grounding these pages would surely not make sense.

 

 


1         Introduction

 

 

1.1  Abstract

 

 

Using DNA microarray technology, it is now possible to measure the expression levels of tens of thousands of genes.  Statistical analysis of these expression levels provides insight into the function of genes and their biological pathways, as well as information about the genomic underpinnings of many common diseases.  Cluster analysis is a form of unsupervised learning commonly used to analyze microarray data, and there are several different types of cluster analysis to choose from.  It is widely acknowledged that the different types of cluster analysis can produce vastly inconsistent results, yet there is no known way to deal with these inconsistencies.  In this thesis, I present a novel approach to the cluster analysis of microarray data.  The proposed methodology combines and distills the information generated by different types of cluster analysis, and produces a representative clustering structure.  Several new statistics are developed to identify dominant clusters and assess consistency across clustering algorithms.  Using real data from leukemia patients, the proposed methodology is shown to outperform the naïve choice of a single algorithm.

 


1.2      Organization of Thesis

 

Section 2:      Background on Microarray Technology

 

With DNA microarray technology we can now simultaneously measure the expression levels of tens of thousands of genes.  Each microarray corresponds to a specified set of experimental conditions (e.g. different patients or different cell lines), and the thousands of measurements on a microarray give an expression profile for the conditions.  By comparing expression profiles across experiments, it is often possible to understand the relationship between gene expression and external factors.  Hundreds of such experiments have uncovered biologically-relevant patterns in gene expression.  Gene expression information has been used, for example, to classify tissue types [8, 49] or differentiate between different forms of cancer [5, 26, 42, 80, 93, 101].  Additionally, such expression information often reveals information about the biological underpinnings of different conditions [4, 8, 51, 64, 76].  Section 2 presents this technology and some of the more relevant quantitative aspects of the data.

 

Section 3:      Cluster Analysis of Microarray Experiments

           

Finding biologically-relevant patterns in the enormous quantity of data, which represents millions of measurements in large experiments, requires large-scale data mining.  This thesis is concerned with cluster analysis, a statistical tool widely used in fields ranging from economics and marketing [13, 45] to archaeology and signal processing[32].  Cluster analysis is commonly used in bioinformatics to discover groups of similar genes, tissues and patients.  However, many different algorithms for cluster analysis exist, and a standard technique has yet to emerge.  In section 3, I present the algorithms most commonly used in microarray experiments, and describe major differences between algorithms.  I also mention factors that lead to indeterminacy in the clustering process.  An extensive discussion of these factors can be found in Appendix A.

 

Section 4: Methods

 

The methodology developed in section 4 is designed to overcome the indeterminacy in the clustering process.  It is assumed that the reader is familiar with the extent and sources of such variability.  If this is not the case the author recommends reading Appendix A prior to reading section 4.

The approach taken here is to use the clusters generated by many algorithms to infer biological information.  The central question addressed by this thesis can thus be stated as follows: How can the results of different forms of cluster analysis be synthesized to produce biologically-relevant information? 

The first step is to use multiple clustering configurations to produce a large database of clusters.  This collection of clusters is a robust source of information; the clusters reflect natural groupings in the data.  However, a methodology does not currently exist to interpret this collection.  In section 4, three mathematical techniques are developed to conduct exploratory data analysis on the collection of clusters.  Just as existing statistical methods are used to mine ‘normal’ expression data for patterns, these tools can be used to mine the clusters for patterns. By combining the results of many forms of cluster analysis, then identifying consistent and dominant patterns, I hope to extract meaningful information from the underlying biological data.

 

Section 5: Results

 

The methodology developed in section 4 represents a different approach to cluster analysis than is commonly used; for this reason many of the results described in section 5 are diagnostic in nature.  Using random and real data, it is demonstrated that each tool discovers information consistent with known properties of both datasets.  Being thus validated, the tools are used to infer new information about the general consistency and prevalence of patterns across different clustering algorithms.

In particular, one of the tools (the prevalence statistic) is able to identify patterns in leukemia microarray data that correspond to clinical differences between patients.  Using this unsupervised technique, a single microarray is identified as being vastly different from the other microarrays in the experiment, and four groups of patients are discovered.  The singled-out microarray, using widely-accepted standards, is found to have been the only defective microarray in the experiment.  Checking the clinical information on the groups of patients reveals that they almost exactly match known subtypes of acute leukemia. 

 

Section 6:      Conclusions

 

In section 6, the ideas of the thesis are presented within the context of the field of bioinformatics.  I conclude that much insight can be gained by mining clusters for knowledge, and give specific suggestion for how to further extend the methods of this thesis.

 

Appendix A:  Sources of Variability and Indeterminacy in the Clustering Process

 

Referring to the current state of microarray analysis, D’Haeseleer recently observed, “the field is in dire need of a comparison study of the main combinations for some of the standard applications.”[24]  The methodology developed in this thesis is designed to overcome the inherent uncertainty involved in clustering.  It assumes such variability exists, and is predicated on the assumption that there is no single best technique, that different techniques are informative in different ways.  In Appendix A I substantiate both of these assumptions by demonstrating that the clusters produced by any algorithm are extremely sensitive to:

A.1             The choice of algorithm

A.2             The parameters passed to the algorithm

A.3             The input data used as a basis for clustering.

 

Appendix B:  Implementation

 

Computational concerns can impede effective cluster analysis of microarray data[10, 57].  The methodology presented in this thesis presents a significant computational challenge.  Appendix B gives an outline of the author’s implementation of all statistical techniques and data storage schema.

To allow for the use of the “brute force” iteration over multiple configurations of multiple clustering algorithms, data at every stage of clustering is retained in a database.  This frees memory for fast paging and optimizing algorithm performance, while allowing for hundreds of concurrent experiments.  Statistical programming is done in Matlab, a mathematical programming language optimized for fast manipulation of arrays and matrices.  As the complete project consists of over 3000 lines of code, only crucial sections are included in Appendix B.  If code for a particular algorithm or statistic is included, it will be marked with a *code in the text.


2             Background on Microarray Technology

 

Recent years have seen the complete sequencing of the human genome.  Many other organisms’ genomes have been sequenced as well; examples include fruitflies, yeast, and many bacteria.  From sequence information, we can learn much of the structure of genes and DNA.  However, sequence analysis alone cannot tell us what genes are or how they are used.  To reveal more of the functional properties of genes, DNA microarrays measure genome-wide expression.  Here, I provide an abbreviated overview of the genetic underpinnings of this technology, and a quick introduction to the technology itself.

 

2.1  Genetic Background

 

Humans have tens of thousands of genes; taken together, these constitute the human genome.  Each gene is a unique subsection of the genome and consists of a sequence of a few thousand nucleotides.  A nucleotide is a special type of molecule that contains four nitrogen bases (some combination of adenine, guanine, cytosine and thymine).  From an informatic perspective, a nucleotide can be seen as a member of the four-letter alphabet {A,G,C,T}; a gene can be regarded as a special sequence of these nucleotides (e.g. …CCTATAGCAACG…).

 Genes are important because they code for amino acids, which in turn form proteins – the basic elements of every cell.  Genes code for amino acids via a two-step process of transcription and translation.  In transcription, the cell produces a piece of mRNA that is a complementary copy of the gene.  Each section of DNA uniquely corresponds to one section of mRNA, and vice versa.  In translation, amino acids are produced directly from the mRNA.[52]

Of course, this is a gross oversimplification of the process, as there are many other factors and molecules involved in transcription and translation.  Nonetheless, the important point is that mRNA is a crucial medium that enables the production of amino acids (and therefore proteins) from DNA.  As Lander summarized, “The mRNA levels sensitively reflect the state of the cell, perhaps uniquely defining cell types, stages, and responses.”[64]

 

2.2  Microarrays and Expression Levels

 

Microarrays measure the presence of mRNA.  The mRNA can be extracted from cells, tissues, etc.  By analyzing extracted mRNA, one obtains a quantitative assessment of the genetic activity of the location from which the mRNA was extracted.  Microarrays derive an expression level for each gene – a scalar value corresponding to the amount of mRNA which in turn corresponds to the gene in question.  High expression levels indicate a high amount of genetic activity, whereas low expression levels indicate inactivity.

There are three predominant types of microarray technology: high-density oligonucleotide arrays, cDNA microarrays, and SAGE (serial analysis of gene expression).  Each technology measures levels of gene expression.  Here, I focus on high-density oligonucleotide arrays, though in principle the analysis applies to all three.

Oligonucleotide arrays (Figure 2.1) consist of a high-density grid of a few hundred thousand oligonucleotides.  Each oligonucleotide, a manufactured sequence of twenty-five bases (also {A,G,C,T}), uniquely corresponds to a specific gene via the same complementarity that relates mRNA to DNA.  Using light-directed, solid-phase combinatorial chemical synthesis, these oligonucleotides are spotted onto a glass slide.  When immersed in mRNA, the mRNA hybridizes to its oligonucleotide match on the chip. The amount of hybridization is measured by flourescently staining the array, and subsequently scanning the array to measure the intensity of fluorescence.  Thus, on a single high-density array, it is possible to simultaneously obtain expression levels for thousands of individual genes.[20, 67].

 

a)                                                                     b)

Figure 2.1 Oligonucleotide arrays.  a) An Affymetrix® GeneChip. b) Scanned image of probe array.

 

2.3  Quantitative Aspects of Microarray Data

 

Once the microarray has been processed, the user is left with a scalar expression level for each gene.  Each expression level spans three to four orders of magnitude.  In the Affymetrix® arrays referred to in this paper, there are roughly 10,000 such gene expression levels.  The distribution of expression levels on each microarray is approximately lognormal (Figure 2.2a) [48, 59], normalized to a mean level specified by the user.  Taken together, the 10,000 expression levels comprise a complete expression profile for the sample mRNA.

However, the measurements are not perfect[62].  For instance, though chip technology is improving, background noise and chip defects still contaminate microarray data [34, 50].  Then too, the same mRNA, when washed on two identical chips, does not produce identical expression profiles; a complete model for this error was recently derived by Jensen et al [59].  Aside from chip errors, a common source of error is in the extracted mRNA itself.  For example, Ben-Dor et al.[8] recently noted that in the colon cancer data used by Alon et al.[6], the normal colon biopsy also included smooth muscle tissue from the colon walls.  This caused the muscle-related genes to be disproportionately expressed in the normal cells, when compared to the cancerous cells.

Such unwanted variance often leads researchers to use only those genes with high expression levels, standard deviations, and variances – such considerations are discussed in Appendix A, section 3.

 

Figure 2.2 Quantitative Aspects of Microarray Data.  Distribution of logarithm of the expression levels for an Affymetrix® human microarray.


 

3             Cluster Analysis of Microarray Experiments

Each microarray contains a complete expression profile for a specific sample of mRNA – one scalar value for each of the (roughly) 10,000 genes on the array.  The challenge, then, is to develop technologies and an interpretive framework to make sense of this large quantity of data.  Cluster analysis is a form of multivariate analysis frequently used in the analysis of microarray data.  In this section I describe cluster analysis, drawing particular attention to sources of variability and uncertainty in clustering.  The inherent uncertainty of cluster analysis is used to motivate the consensus methodology described in section 5.

 

3.1  Introduction to Cluster Analysis

 

            Machine learning can be broken into two paradigms: supervised learning and unsupervised learning.  Cluster analysis is a form of unsupervised learning.  In supervised learning certain features of the data are known a priori, and this knowledge is used to guide the analysis.  Supervised situations often involve discriminant analysis, group classification and class prediction.  In unsupervised learning, by contrast, all knowledge and structure must be ‘discovered’ in the data.  Unsupervised situations typically involve class discovery and subtype identification.  Though cluster analysis is often augmented with supervised forms of analysis (e.g. in data preprocessing and gene selection)[9, 40, 82], as a computational tool it is inherently unsupervised.  The methodology and techniques presented in this paper are designed for completely unsupervised situations.

The purpose of cluster analysis is to assign objects to clusters such that objects within a cluster are highly similar, whereas objects in different clusters are highly dissimilar.  The resulting clusters partition the original objects into non-overlapping sets, such that each of the original objects is a member of exactly one cluster and no cluster contains multiple instances of the same object.  Cluster analysis is commonly applied to data mining in a wide variety of fields, including but not limited to information retrieval, signal processing, marketing, socioeconomic research, and classification of single malt whiskeys [97].  In the analysis of microarray data, cluster analysis is one of the most commonly-used analytic tools.  In this field, clustering has become so prevalent that Goldstein recently remarked:  “It is now commonplace for researchers to perform a hierarchical clustering of microarray data to identify patterns in the clustering.  In many instances, cluster analysis is the primary technique of data analysis, regardless of the specific questions of interest.”[39]

In microarray experiments, clustering typically is used for one of the following purposes:  (1) To identify samples with similar expression profiles ([4, 6, 40, 42] Schummer 1999), (2)To identify genes with similar expression across samples [10, 22, 31, 92] Chu 1998, Iyer 1999), or (3) Some combination of the two [6, 37, 85, 87].  For purposes of clarity, in this paper I deal primarily with (1), though all of the techniques in principle apply to the others as well.

When using cluster analysis to identify samples with similar expression levels, each sample (or equivalently, each microarray) is represented as a vector of expression levels.  Thus, given m samples with n expression levels, the data can be represented as m vectors in n-dimensional space.  Computationally, input data is stored as an m x n matrix X of expression data, where each row corresponds to a sample and each column represents a gene.  The goal of clustering is to group similar samples into clusters based on expression levels across n dimensions.  Note that m and n are general features of cluster analysis corresponding to points and dimensions - for the sake of clarity I will henceforth refer to points as samples and dimensions as genes.

Ideally, each cluster will contain samples that are similar to each other and dissimilar to samples in other clusters (Figure 3.1).  In rough terms, good clusters will be crisp and compact, and geometrically separated from other clusters.  Cluster validity analysis, briefly discussed in section 4.1, formalizes these ideas.

 

Figure 3.1: Visualization of clustering process.  a) Input data set consists of 50 points (samples) in 2-dimensional space (m=50, n=2).  b) Cluster analysis reveals 3 natural clusters in the dataset.

 

            The next section presents two clustering algorithms; the notation in Table 3.1 will be used there and throughout this thesis.

 


 

Notation

Description

m

number of samples/patients/vectors

n

number of genes/dimensions

xi

the ith sample,

X

the set of all samples {x1, x2, …, xm};

equivalently, the m x n expression matrix

xi,g

the expression level of the gth gene of ith sample,

the average expression level of xi:

d(a,b)

The distance between (dissimilarity of) vectors a and b

Cp

the pth cluster of samples

Cp,i

the expression vector of the ith sample of pth cluster

D(Cp,Cq)

The distance between (dissimilarity of) Cp and Cq

the centroid of Cp: 

Table 3.1: Notation used in this thesis

 

3.2  Clustering Algorithms

 

A clustering algorithm is an algorithm by which cluster analysis is accomplished.  Though many different clustering algorithms exist, the forms most commonly applied to microarray data [78] are hierarchical algorithms and partitioning algorithms.  There are subclasses of both of these types, but the most representative forms, agglomerative hierarchical and k-means, are presented below.  For a discussion of other clustering algorithms, including grid-based, density-based and model-based algorithms, the reader is referred to [57], [12] and [99], or Appendix A for a short discussion.

 

3.2.1       Hierarchical Algorithms – Agglomerative Hierarchical Clustering

 

Hierarchical algorithms are the most common type of clustering algorithm used in microarray analysis.  Seminal experiments using this type of clustering include [31], [26] and [76].  These algorithms generate a tree-like clustering hierarchy of samples.  Unlike many forms of cluster analysis (including the partitioning algorithms discussed below), hierarchical algorithms are deterministic, though the solution is potentially non-unique[73].  Each sample on the tree (referred to as a dendrogram) is a leaf; the length of the branch connecting samples corresponds to the distance between the two (Figure 3.2).  The basic algorithm requires O(m2log m) time and O(m2) space[63], and consists of three steps:

 

(1)        Calculate dissimilarity matrix based on pairwise dissimilarities

 

                                                                       

 

There are thus  values, as the major diagonal is filled with zeros, since d(xi, xi) = 0 for each , and the values below the major diagonal are redundant.  The sample-sample distance d is calculated based on a distance function (e.g. Euclidean distance, Minkowski metric) or other dissimilarity functions (e.g. Pearson correlation, cosine distance).

 

(2)        Iteratively merge most similar clusters.  Initially, there are m singleton clusters, such that Cp = {xi} for each .  The two most similar clusters Cp and Cq, satisfying D(p,q)=min(D), are then merged to form a new cluster joined by a branch of length D(p,q).  After this first join D must be updated to account for the fact that two clusters have been merged (i.e. there are now fewer than m clusters, and at least one cluster is no longer a singleton cluster).  D is updated using a linkage function D that measures the distance between clusters (e.g. nearest neighbor distance or centroid-centroid distance).  The entire process is iterated m-1 times until all clusters are linked and a complete dendrogram is generated.

 

(3)        Divide dendrogram into distinct clusters.  The last phase of hierarchical clustering is to divide the dendrogram into distinct clusters.  Typically, a breakpoint b is chosen that represents a maximum distance permitted to exist between clusters.  For instance, in Figure 3.2, the breakpoint b = 2.5 identifies the following three clusters {1,4}, {2,5} and {3}, while the value b=1.7 produces four clusters {1,4}, {2}, {5} and {3}.  Note that b can be chosen to produce any user-defined number of clusters k, as long as .  Choosing b = 2.5 in Figure 3.2 is equivalent to choosing k=3.


 

Figure 3.2: Simple hierarchical clustering dendrogram.  Samples 1 and 4 are closely related, as are samples 2 and 5.  The centroid of cluster {1,4} is quite dissimilar to that of cluster {2,5,3}. The breakpoint b = 2.5 divides the dendrogram into three clusters: C1={1,4}, C2={2,5}, C3={3}.

 

3.2.2       Partitioning Algorithms - K-Means

 

K-means clustering is used nearly as often as hierarchical clustering (see, for example, [86, 90, 100]).  As in hierarchical clustering, partitioning algorithms divide data into groups; however, partitioning algorithms are more direct.  Rather than producing a dendrogram that must later be cut at a breakpoint, -means immediately divides the data into k subsets (clusters), and then updates the clusters until they are ‘good,’ as defined by equation below.

More precisely, each cluster Cp , , has a centroid  that is the n-dimensional mean of all samples in Cp

                                                                                                  

The k-means algorithm attempts to minimize

                                                                                         

where c(xi) is the centroid of the cluster containing the ith sample and d(xi,c(xi)) is the distance between the ith sample and the centroid.  In other words, k-means searches for an assignment of samples to clusters that minimizes the total sum of sample-to-centroid distances, summed over all m samples (or equivalently, over all k clusters).  The algorithm proceeds as follows:

 

(1)        Choose k seed points.  These points are typically selected randomly from the entire set of samples, although other techniques exist[1].

(2)        Assign samples to clusters.  Each sample xi is assigned to the nearest cluster c(xi), as defined by a distance metric d (typically one of the Minkowski metrics).  Initially, the k seed points are used as the centroids.

(3)        Recalculate all cluster centroids. At each iteration, the cluster memberships will change, so it is necessary to recompute the centroids based on equation .

(4)        Repeat steps (2) and (2) until convergence, or up to a maximum number of iterations.  Convergence is achieved when samples cease to be reassigned, or when dtot ceases to decrease.

 

Eventually, the k-means algorithm will converge to a local minimum of dtot, which, in general, is not a global minimum.  The problem of computing the global minimum solution is NP‑hard[60] and can only be achieved through exhaustive repetition of the algorithm.  However, a single run of the algorithm can be optimized to run in time proportional to O(m), though most implementations depend on k and other factors.[21]

 

3.3  Major Sources of Variability in Clustering

 

A more complete discussion of these and other sources of variability is contained in Appendix A.  Here, I briefly summarize the most important factors.

 

3.3.1       Choice of Algorithm

 

Whereas hierarchical algorithms create a clustering hierarchy in which samples are related in a tree-like structure, partitioning algorithms divide the data into absolute subsets that are completely unrelated.  Hierarchical algorithms proceed deterministically, often yielding clusters of strange shapes and sizes; k-means bounces around between local minima, generally producing “spherical” clusters of similar sizes.  Though both eventually partition the set of samples, the logic behind each partitioning is fundamentally different.  The reader is referred to section A.1 for a more thorough comparison of these algorithms.

 

3.3.2       Choice of Parameters

 

Given a clustering algorithm, there are still many decisions to make.  For instance, in k-means clustering it is necessary to specify a distance function d.  Figure 3.1 shows a picture of 2-dimensional Euclidean space, but k‑means typically is run in many-dimensional spaces that aren’t necessarily Euclidean.  The choice of d determines the structure of the metric space for the original m samples (see Figure A.2).  In hierarchical clustering, d is similarly used to generate the dissimilarity matrix D.  Additionally, hierarchical clustering utilizes a linkage function D to measure distance between clusters.  Just as d determines the metric space of samples, D determines the metric space of clusters.  Other than the distance metric, major parameters include the value and placement of the k seeds (k-means), and the placement of the breakpoint b (hierarchical).[1]  Various distance metrics, k‑values, and other parameters are addressed in more detail in section A.2, and will not be discussed any further in the text, though section 4 assumes familiarity with the concepts.

 

3.3.3       Choice of Input Data

 

Getz [37] recently observed, “the main difficulty is that each biological process on which we wish to focus may involve a relatively small subset of the genes; the large majority of those present on the microarray constitute a noisy background that may mask the effect of the small subset.”  Each sample’s expression profile contains tens of thousands of genes that could potentially be used as the basis for the expression vector in clustering; section A.3 gives strong biological, statistical, and computational reasons not to use the entire set of n genes.  These reasons, which are echoed throughout the literature, motivate the use of nused genes, where nused<n.  For this reason, a large number of experiments use filtering criteria to select the ‘most relevant’ genes.  However, it is unclear how to choose these genes, and even how many to choose.  To the author’s knowledge, no work has been done to systematically examine the effects of using a particular subset of genes as the basis for each of the nused-dimensional vectors (cf. [47]; the dearth of literature on this subject has also been noted in [39]).  As can be imagined, the definition of ‘most relevant’ varies from source to source.  Some authors use high variances [84], others use high average expression across samples [15, 76] or genes that satisfy maximum/minimum expression criteria[6], or n-fold change from median [76], etc.

 

3.4  Summary

 

Cluster analysis is commonly used in microarray experiments, but there is a large amount of uncertainty in the clustering process.  Though it has repeatedly been demonstrated that a large portion of clusters are preserved across clustering methods [65, 69, 70, 75], it is also widely acknowledged that different methods produce different clusters.  As has been demonstrated here (in Appendix A) and in the literature, the major sources of variability are: (1) the choice of clustering algorithm [7, 39, 90], (2) the parameters to the algorithm [18, 19, 46]; and (3) the subset of input data [37, 39, 47].  Each unique combination of these three factors produces a potentially unique set of clusters.  The ‘unique combination’ will henceforth be referred to as a clustering configuration.

 

Definition 3.1: Clustering Configuration

A unique combination of:

    (1)    Clustering algorithm

    (2)    Algorithm parameters

    (3)    Collection of genes used

 

The set of clusters produced by a particular clustering configuration is a partition of the original m samples – each sample appears in exactly one cluster.

 

Definition 3.2: Partition

A set of clusters  is a partition of X if and only if  and .

 


4