Mining Clusters for Knowledge:
Finding Algorithm-Independent Groups in Microarray Data
by
Joshua E. Blumenstock
Class of 2003
A thesis submitted to the
faculty of
in partial fulfillment of the requirements for the
Degree of Bachelor of Arts
with Departmental Honors in Physics
and the Computer Science Program
Mining Clusters for Knowledge:
Finding Algorithm-Independent Groups in Microarray Data
by
Joshua E. Blumenstock
Class of 2003
Abstract
Using DNA microarray technology, it is now possible to measure the expression levels of tens of thousands of genes. Statistical analysis of these expression levels provides insight into the function of genes and their biological pathways, as well as information about the genomic underpinnings of many common diseases. Cluster analysis is a form of unsupervised learning commonly used to analyze microarray data, and there are several different types of cluster analysis to choose from. It is widely acknowledged that the different types of cluster analysis can produce vastly inconsistent results, yet there is no known way to deal with these inconsistencies. In this thesis, I present a novel approach to the cluster analysis of microarray data. The proposed methodology combines and distills the information generated by different types of cluster analysis, and produces a representative clustering structure. Several new statistics are developed to identify dominant clusters and assess consistency across clustering algorithms. Using real data from leukemia patients, the proposed methodology is shown to outperform the naïve choice of a single algorithm.
2 Background
on Microarray Technology
2.2 Microarrays
and Expression Levels
2.3 Quantitative
Aspects of Microarray Data
3 Cluster
Analysis of Microarray Experiments
3.1 Introduction
to Cluster Analysis
3.2.1 Hierarchical Algorithms – Agglomerative
Hierarchical Clustering
3.2.2 Partitioning Algorithms - K-Means
3.3 Major
Sources of Variability in Clustering
4.3 Producing
A Large Collection of Clusters: ‘Shotgun’ Clustering
4.4 Measuring
Global Consistency of the Collection: ![]()
4.5 Identifying
Prevalent Clusters: ![]()
4.6 Clustering
the Clusters: ‘Condensation’ Clustering
4.6.2 Merging
Clusters to Form Multisets of Samples.
4.6.4 Distance
Between Multisets
5.3.2 Distribution
of Prevalent Clusters
5.4 Biologically-Relevant
Results
5.4.1 Single
Clustering Configurations Miss the ‘Actual’ Partition
5.4.2 Mining
Clusters for Information With Prevalence ![]()
5.4.3 Mining for
Patterns Using Condensation Clustering.
6.2 Directions
for Future Research
Appendix A: Sources of Variability and Indeterminacy in
the Clustering Process
A.2.2 Linkage
Function (hierarchical)
A.2.3 Choice of
k-value, and the placement of the seeds (for k-means)
B.4 Consistency
and Prevalence
If anyone is to be acknowledged first, it must be my parents. I thank them for giving me love and getting me to a place from which I can publicly acknowledge them.
With respect to this thesis, I owe an enormous debt to my advisors Michael Rice and Rick Jensen. Dr. Rice provided tremendous feedback and direction, especially in the last few weeks as the thesis went from a mess of code to a document with paragraphs and page numbers. Dr. Jensen has been an inspiration for many years now, and was always ready to help with an inexhaustible supply of ideas and insight.
I also thank Dr. Adam Fieldsteel for helping me wade through the depths of probability and the murky waters of statistics, Dr. Michael Keane for much-needed discussion and criticism, and Dr. Danny Krizanc for letting me scribble all over his chalkboard.
Lastly, thanks
to Dan, Alex, Shana and the other cohorts in PacLab, mainly just for being
idiots with me at
Using DNA microarray technology, it is now possible to measure the expression levels of tens of thousands of genes. Statistical analysis of these expression levels provides insight into the function of genes and their biological pathways, as well as information about the genomic underpinnings of many common diseases. Cluster analysis is a form of unsupervised learning commonly used to analyze microarray data, and there are several different types of cluster analysis to choose from. It is widely acknowledged that the different types of cluster analysis can produce vastly inconsistent results, yet there is no known way to deal with these inconsistencies. In this thesis, I present a novel approach to the cluster analysis of microarray data. The proposed methodology combines and distills the information generated by different types of cluster analysis, and produces a representative clustering structure. Several new statistics are developed to identify dominant clusters and assess consistency across clustering algorithms. Using real data from leukemia patients, the proposed methodology is shown to outperform the naïve choice of a single algorithm.
With DNA microarray technology we can now simultaneously measure the expression levels of tens of thousands of genes. Each microarray corresponds to a specified set of experimental conditions (e.g. different patients or different cell lines), and the thousands of measurements on a microarray give an expression profile for the conditions. By comparing expression profiles across experiments, it is often possible to understand the relationship between gene expression and external factors. Hundreds of such experiments have uncovered biologically-relevant patterns in gene expression. Gene expression information has been used, for example, to classify tissue types [8, 49] or differentiate between different forms of cancer [5, 26, 42, 80, 93, 101]. Additionally, such expression information often reveals information about the biological underpinnings of different conditions [4, 8, 51, 64, 76]. Section 2 presents this technology and some of the more relevant quantitative aspects of the data.
Finding biologically-relevant patterns in the enormous quantity of data, which represents millions of measurements in large experiments, requires large-scale data mining. This thesis is concerned with cluster analysis, a statistical tool widely used in fields ranging from economics and marketing [13, 45] to archaeology and signal processing[32]. Cluster analysis is commonly used in bioinformatics to discover groups of similar genes, tissues and patients. However, many different algorithms for cluster analysis exist, and a standard technique has yet to emerge. In section 3, I present the algorithms most commonly used in microarray experiments, and describe major differences between algorithms. I also mention factors that lead to indeterminacy in the clustering process. An extensive discussion of these factors can be found in Appendix A.
The methodology developed in section 4 is designed to overcome the indeterminacy in the clustering process. It is assumed that the reader is familiar with the extent and sources of such variability. If this is not the case the author recommends reading Appendix A prior to reading section 4.
The approach taken here is to use the clusters generated by many algorithms to infer biological information. The central question addressed by this thesis can thus be stated as follows: How can the results of different forms of cluster analysis be synthesized to produce biologically-relevant information?
The first step is to use multiple clustering configurations to produce a large database of clusters. This collection of clusters is a robust source of information; the clusters reflect natural groupings in the data. However, a methodology does not currently exist to interpret this collection. In section 4, three mathematical techniques are developed to conduct exploratory data analysis on the collection of clusters. Just as existing statistical methods are used to mine ‘normal’ expression data for patterns, these tools can be used to mine the clusters for patterns. By combining the results of many forms of cluster analysis, then identifying consistent and dominant patterns, I hope to extract meaningful information from the underlying biological data.
The methodology developed in section 4 represents a different approach to cluster analysis than is commonly used; for this reason many of the results described in section 5 are diagnostic in nature. Using random and real data, it is demonstrated that each tool discovers information consistent with known properties of both datasets. Being thus validated, the tools are used to infer new information about the general consistency and prevalence of patterns across different clustering algorithms.
In particular, one of the tools (the prevalence statistic) is able to identify patterns in leukemia microarray data that correspond to clinical differences between patients. Using this unsupervised technique, a single microarray is identified as being vastly different from the other microarrays in the experiment, and four groups of patients are discovered. The singled-out microarray, using widely-accepted standards, is found to have been the only defective microarray in the experiment. Checking the clinical information on the groups of patients reveals that they almost exactly match known subtypes of acute leukemia.
In section 6, the ideas of the thesis are presented within the context of the field of bioinformatics. I conclude that much insight can be gained by mining clusters for knowledge, and give specific suggestion for how to further extend the methods of this thesis.
Referring to the current state of microarray analysis, D’Haeseleer recently observed, “the field is in dire need of a comparison study of the main combinations for some of the standard applications.”[24] The methodology developed in this thesis is designed to overcome the inherent uncertainty involved in clustering. It assumes such variability exists, and is predicated on the assumption that there is no single best technique, that different techniques are informative in different ways. In Appendix A I substantiate both of these assumptions by demonstrating that the clusters produced by any algorithm are extremely sensitive to:
A.1 The choice of algorithm
A.2 The parameters passed to the algorithm
A.3 The input data used as a basis for clustering.
Computational concerns can impede effective cluster analysis of microarray data[10, 57]. The methodology presented in this thesis presents a significant computational challenge. Appendix B gives an outline of the author’s implementation of all statistical techniques and data storage schema.
To allow for the use of the “brute force” iteration over multiple configurations of multiple clustering algorithms, data at every stage of clustering is retained in a database. This frees memory for fast paging and optimizing algorithm performance, while allowing for hundreds of concurrent experiments. Statistical programming is done in Matlab, a mathematical programming language optimized for fast manipulation of arrays and matrices. As the complete project consists of over 3000 lines of code, only crucial sections are included in Appendix B. If code for a particular algorithm or statistic is included, it will be marked with a *code in the text.
Recent years have seen the complete sequencing of the human genome. Many other organisms’ genomes have been sequenced as well; examples include fruitflies, yeast, and many bacteria. From sequence information, we can learn much of the structure of genes and DNA. However, sequence analysis alone cannot tell us what genes are or how they are used. To reveal more of the functional properties of genes, DNA microarrays measure genome-wide expression. Here, I provide an abbreviated overview of the genetic underpinnings of this technology, and a quick introduction to the technology itself.
Humans have tens of thousands of genes; taken together, these constitute the human genome. Each gene is a unique subsection of the genome and consists of a sequence of a few thousand nucleotides. A nucleotide is a special type of molecule that contains four nitrogen bases (some combination of adenine, guanine, cytosine and thymine). From an informatic perspective, a nucleotide can be seen as a member of the four-letter alphabet {A,G,C,T}; a gene can be regarded as a special sequence of these nucleotides (e.g. …CCTATAGCAACG…).
Genes are important because they code for amino acids, which in turn form proteins – the basic elements of every cell. Genes code for amino acids via a two-step process of transcription and translation. In transcription, the cell produces a piece of mRNA that is a complementary copy of the gene. Each section of DNA uniquely corresponds to one section of mRNA, and vice versa. In translation, amino acids are produced directly from the mRNA.[52]
Of course, this is a gross oversimplification of the process, as there are many other factors and molecules involved in transcription and translation. Nonetheless, the important point is that mRNA is a crucial medium that enables the production of amino acids (and therefore proteins) from DNA. As Lander summarized, “The mRNA levels sensitively reflect the state of the cell, perhaps uniquely defining cell types, stages, and responses.”[64]
Microarrays measure the presence of mRNA. The mRNA can be extracted from cells, tissues, etc. By analyzing extracted mRNA, one obtains a quantitative assessment of the genetic activity of the location from which the mRNA was extracted. Microarrays derive an expression level for each gene – a scalar value corresponding to the amount of mRNA which in turn corresponds to the gene in question. High expression levels indicate a high amount of genetic activity, whereas low expression levels indicate inactivity.
There are three predominant types of microarray technology: high-density oligonucleotide arrays, cDNA microarrays, and SAGE (serial analysis of gene expression). Each technology measures levels of gene expression. Here, I focus on high-density oligonucleotide arrays, though in principle the analysis applies to all three.
Oligonucleotide arrays (Figure 2.1) consist of a high-density grid of a few hundred thousand oligonucleotides. Each oligonucleotide, a manufactured sequence of twenty-five bases (also {A,G,C,T}), uniquely corresponds to a specific gene via the same complementarity that relates mRNA to DNA. Using light-directed, solid-phase combinatorial chemical synthesis, these oligonucleotides are spotted onto a glass slide. When immersed in mRNA, the mRNA hybridizes to its oligonucleotide match on the chip. The amount of hybridization is measured by flourescently staining the array, and subsequently scanning the array to measure the intensity of fluorescence. Thus, on a single high-density array, it is possible to simultaneously obtain expression levels for thousands of individual genes.[20, 67].
a) b)

Figure 2.1 Oligonucleotide arrays. a) An
Affymetrix® GeneChip. b) Scanned image of probe
array.
Once the microarray has been processed, the user is left with a scalar expression level for each gene. Each expression level spans three to four orders of magnitude. In the Affymetrix® arrays referred to in this paper, there are roughly 10,000 such gene expression levels. The distribution of expression levels on each microarray is approximately lognormal (Figure 2.2a) [48, 59], normalized to a mean level specified by the user. Taken together, the 10,000 expression levels comprise a complete expression profile for the sample mRNA.
However, the measurements are not perfect[62]. For instance, though chip technology is improving, background noise and chip defects still contaminate microarray data [34, 50]. Then too, the same mRNA, when washed on two identical chips, does not produce identical expression profiles; a complete model for this error was recently derived by Jensen et al [59]. Aside from chip errors, a common source of error is in the extracted mRNA itself. For example, Ben-Dor et al.[8] recently noted that in the colon cancer data used by Alon et al.[6], the normal colon biopsy also included smooth muscle tissue from the colon walls. This caused the muscle-related genes to be disproportionately expressed in the normal cells, when compared to the cancerous cells.
Such unwanted variance often leads researchers to use only those genes with high expression levels, standard deviations, and variances – such considerations are discussed in Appendix A, section 3.

Figure 2.2 Quantitative Aspects of
Microarray Data. Distribution of logarithm of the
expression levels for an Affymetrix® human microarray.
Each microarray contains a complete expression profile for a specific sample of mRNA – one scalar value for each of the (roughly) 10,000 genes on the array. The challenge, then, is to develop technologies and an interpretive framework to make sense of this large quantity of data. Cluster analysis is a form of multivariate analysis frequently used in the analysis of microarray data. In this section I describe cluster analysis, drawing particular attention to sources of variability and uncertainty in clustering. The inherent uncertainty of cluster analysis is used to motivate the consensus methodology described in section 5.
Machine learning can be broken into two paradigms: supervised learning and unsupervised learning. Cluster analysis is a form of unsupervised learning. In supervised learning certain features of the data are known a priori, and this knowledge is used to guide the analysis. Supervised situations often involve discriminant analysis, group classification and class prediction. In unsupervised learning, by contrast, all knowledge and structure must be ‘discovered’ in the data. Unsupervised situations typically involve class discovery and subtype identification. Though cluster analysis is often augmented with supervised forms of analysis (e.g. in data preprocessing and gene selection)[9, 40, 82], as a computational tool it is inherently unsupervised. The methodology and techniques presented in this paper are designed for completely unsupervised situations.
The purpose of cluster analysis is to assign objects to clusters such that objects within a cluster are highly similar, whereas objects in different clusters are highly dissimilar. The resulting clusters partition the original objects into non-overlapping sets, such that each of the original objects is a member of exactly one cluster and no cluster contains multiple instances of the same object. Cluster analysis is commonly applied to data mining in a wide variety of fields, including but not limited to information retrieval, signal processing, marketing, socioeconomic research, and classification of single malt whiskeys [97]. In the analysis of microarray data, cluster analysis is one of the most commonly-used analytic tools. In this field, clustering has become so prevalent that Goldstein recently remarked: “It is now commonplace for researchers to perform a hierarchical clustering of microarray data to identify patterns in the clustering. In many instances, cluster analysis is the primary technique of data analysis, regardless of the specific questions of interest.”[39]
In microarray experiments, clustering typically is used for one of the following purposes: (1) To identify samples with similar expression profiles ([4, 6, 40, 42] Schummer 1999), (2)To identify genes with similar expression across samples [10, 22, 31, 92] Chu 1998, Iyer 1999), or (3) Some combination of the two [6, 37, 85, 87]. For purposes of clarity, in this paper I deal primarily with (1), though all of the techniques in principle apply to the others as well.
When using cluster analysis to identify samples with similar expression levels, each sample (or equivalently, each microarray) is represented as a vector of expression levels. Thus, given m samples with n expression levels, the data can be represented as m vectors in n-dimensional space. Computationally, input data is stored as an m x n matrix X of expression data, where each row corresponds to a sample and each column represents a gene. The goal of clustering is to group similar samples into clusters based on expression levels across n dimensions. Note that m and n are general features of cluster analysis corresponding to points and dimensions - for the sake of clarity I will henceforth refer to points as samples and dimensions as genes.
Ideally, each cluster will contain samples that are similar to each other and dissimilar to samples in other clusters (Figure 3.1). In rough terms, good clusters will be crisp and compact, and geometrically separated from other clusters. Cluster validity analysis, briefly discussed in section 4.1, formalizes these ideas.

Figure 3.1: Visualization of clustering process. a) Input data set consists of 50 points (samples) in 2-dimensional space (m=50, n=2). b) Cluster analysis reveals 3 natural clusters in the dataset.
The next section presents two clustering algorithms; the notation in Table 3.1 will be used there and throughout this thesis.
|
Notation |
Description |
|
m |
number of samples/patients/vectors |
|
n |
number of genes/dimensions |
|
xi |
the ith sample, |
|
X |
the set of all samples {x1, x2, …, xm}; equivalently, the m x n expression matrix |
|
xi,g |
the expression level of the gth gene of ith
sample, |
|
|
the average expression level of xi: |
|
d(a,b) |
The distance between (dissimilarity of) vectors a and b |
|
Cp |
the pth cluster of samples |
|
Cp,i |
the expression vector of the ith sample of pth cluster |
|
D(Cp,Cq) |
The distance between (dissimilarity of) Cp and Cq |
|
|
the centroid of Cp: |
Table
3.1: Notation used in this
thesis
A clustering algorithm is an algorithm by which cluster analysis is accomplished. Though many different clustering algorithms exist, the forms most commonly applied to microarray data [78] are hierarchical algorithms and partitioning algorithms. There are subclasses of both of these types, but the most representative forms, agglomerative hierarchical and k-means, are presented below. For a discussion of other clustering algorithms, including grid-based, density-based and model-based algorithms, the reader is referred to [57], [12] and [99], or Appendix A for a short discussion.
Hierarchical algorithms are the most common type of clustering algorithm used in microarray analysis. Seminal experiments using this type of clustering include [31], [26] and [76]. These algorithms generate a tree-like clustering hierarchy of samples. Unlike many forms of cluster analysis (including the partitioning algorithms discussed below), hierarchical algorithms are deterministic, though the solution is potentially non-unique[73]. Each sample on the tree (referred to as a dendrogram) is a leaf; the length of the branch connecting samples corresponds to the distance between the two (Figure 3.2). The basic algorithm requires O(m2log m) time and O(m2) space[63], and consists of three steps:
(1) Calculate dissimilarity matrix based on pairwise dissimilarities
There are thus
values, as the major
diagonal is filled with zeros, since d(xi,
xi) = 0 for each
, and the values below the major diagonal are redundant. The sample-sample distance d is calculated based on a distance
function (e.g. Euclidean distance, Minkowski metric) or other dissimilarity
functions (e.g. Pearson correlation, cosine distance).
(2)
Iteratively merge
most similar clusters. Initially,
there are m singleton clusters, such
that Cp = {xi} for each
. The two most similar
clusters Cp and Cq, satisfying D(p,q)=min(D), are then
merged to form a new cluster joined by a branch of length D(p,q). After this first
join D must be updated to account
for the fact that two clusters have been merged (i.e. there are now fewer than m clusters, and at least one cluster is
no longer a singleton cluster). D is updated using a linkage function D that measures the distance between clusters (e.g. nearest
neighbor distance or centroid-centroid distance). The entire process is iterated m-1 times until all clusters are linked
and a complete dendrogram is generated.
(3)
Divide dendrogram
into distinct clusters. The last
phase of hierarchical clustering is to divide the dendrogram into distinct
clusters. Typically, a breakpoint b is chosen that represents a maximum
distance permitted to exist between clusters.
For instance, in Figure
3.2, the breakpoint b
= 2.5 identifies the following three clusters {1,4}, {2,5} and {3}, while the
value b=1.7 produces four clusters {1,4}, {2}, {5} and {3}. Note that b
can be chosen to produce any user-defined number of clusters k, as long as
. Choosing b = 2.5 in Figure 3.2 is equivalent to choosing k=3.

Figure 3.2: Simple hierarchical clustering dendrogram. Samples 1 and 4 are closely related, as are samples 2 and 5. The centroid of cluster {1,4} is quite dissimilar to that of cluster {2,5,3}. The breakpoint b = 2.5 divides the dendrogram into three clusters: C1={1,4}, C2={2,5}, C3={3}.
K-means clustering is used nearly as often as hierarchical clustering (see, for example, [86, 90, 100]). As in hierarchical clustering, partitioning algorithms divide data into groups; however, partitioning algorithms are more direct. Rather than producing a dendrogram that must later be cut at a breakpoint, k-means immediately divides the data into k subsets (clusters), and then updates the clusters until they are ‘good,’ as defined by equation below.
More
precisely, each cluster Cp ,
, has a centroid
that is the n-dimensional mean of all samples in Cp:
The k-means algorithm attempts to minimize
where c(xi) is the centroid of the cluster containing the ith sample and d(xi,c(xi)) is the distance between the ith sample and the centroid. In other words, k-means searches for an assignment of samples to clusters that minimizes the total sum of sample-to-centroid distances, summed over all m samples (or equivalently, over all k clusters). The algorithm proceeds as follows:
(1) Choose k seed points. These points are typically selected randomly from the entire set of samples, although other techniques exist[1].
(2) Assign samples to clusters. Each sample xi is assigned to the nearest cluster c(xi), as defined by a distance metric d (typically one of the Minkowski metrics). Initially, the k seed points are used as the centroids.
(3) Recalculate all cluster centroids. At each iteration, the cluster memberships will change, so it is necessary to recompute the centroids based on equation .
(4) Repeat steps (2) and (2) until convergence, or up to a maximum number of iterations. Convergence is achieved when samples cease to be reassigned, or when dtot ceases to decrease.
Eventually, the k-means algorithm will converge to a local minimum of dtot, which, in general, is not a global minimum. The problem of computing the global minimum solution is NP‑hard[60] and can only be achieved through exhaustive repetition of the algorithm. However, a single run of the algorithm can be optimized to run in time proportional to O(m), though most implementations depend on k and other factors.[21]
A
more complete discussion of these and other sources of variability is contained
in Appendix A. Here, I briefly summarize the most important
factors.
Whereas hierarchical algorithms create a clustering hierarchy in which samples are related in a tree-like structure, partitioning algorithms divide the data into absolute subsets that are completely unrelated. Hierarchical algorithms proceed deterministically, often yielding clusters of strange shapes and sizes; k-means bounces around between local minima, generally producing “spherical” clusters of similar sizes. Though both eventually partition the set of samples, the logic behind each partitioning is fundamentally different. The reader is referred to section A.1 for a more thorough comparison of these algorithms.
Given a clustering algorithm, there are still many decisions to make. For instance, in k-means clustering it is necessary to specify a distance function d. Figure 3.1 shows a picture of 2-dimensional Euclidean space, but k‑means typically is run in many-dimensional spaces that aren’t necessarily Euclidean. The choice of d determines the structure of the metric space for the original m samples (see Figure A.2). In hierarchical clustering, d is similarly used to generate the dissimilarity matrix D. Additionally, hierarchical clustering utilizes a linkage function D to measure distance between clusters. Just as d determines the metric space of samples, D determines the metric space of clusters. Other than the distance metric, major parameters include the value and placement of the k seeds (k-means), and the placement of the breakpoint b (hierarchical).[1] Various distance metrics, k‑values, and other parameters are addressed in more detail in section A.2, and will not be discussed any further in the text, though section 4 assumes familiarity with the concepts.
Getz [37] recently observed, “the main difficulty is that each biological process on which we wish to focus may involve a relatively small subset of the genes; the large majority of those present on the microarray constitute a noisy background that may mask the effect of the small subset.” Each sample’s expression profile contains tens of thousands of genes that could potentially be used as the basis for the expression vector in clustering; section A.3 gives strong biological, statistical, and computational reasons not to use the entire set of n genes. These reasons, which are echoed throughout the literature, motivate the use of nused genes, where nused<n. For this reason, a large number of experiments use filtering criteria to select the ‘most relevant’ genes. However, it is unclear how to choose these genes, and even how many to choose. To the author’s knowledge, no work has been done to systematically examine the effects of using a particular subset of genes as the basis for each of the nused-dimensional vectors (cf. [47]; the dearth of literature on this subject has also been noted in [39]). As can be imagined, the definition of ‘most relevant’ varies from source to source. Some authors use high variances [84], others use high average expression across samples [15, 76] or genes that satisfy maximum/minimum expression criteria[6], or n-fold change from median [76], etc.
Cluster
analysis is commonly used in microarray experiments, but there is a large
amount of uncertainty in the clustering process. Though it has repeatedly been demonstrated
that a large portion of clusters are preserved across clustering methods [65, 69, 70, 75], it is also widely
acknowledged that different methods produce different clusters. As has been demonstrated here (in Appendix A)
and in the literature, the major sources of variability are: (1) the
choice of clustering algorithm [7, 39, 90], (2) the parameters to
the algorithm [18, 19, 46]; and (3) the subset of
input data [37, 39, 47]. Each unique combination of these three
factors produces a potentially unique set of clusters. The ‘unique combination’ will henceforth be
referred to as a clustering configuration.
|
Definition 3.1: Clustering Configuration |
|
A unique combination of: (1) Clustering algorithm (2) Algorithm parameters (3) Collection of genes used |
The set of clusters produced by a particular clustering configuration is a partition of the original m samples – each sample appears in exactly one cluster.
Definition 3.2: Partition |
|
A set of clusters |