User Tools

Site Tools


mkatari-bioinformatics-august-2013-clustering

This is an old revision of the document!


Back to Manny's Bioinformatics Workshop HOME

Clustering rna-seq data

continuation from DESeq

Get the significant genes

sigGenes = resSig$id

Get the normalized values for the significant genes

sigGenes.normalized = normalized[sigGenes,]

Correlate the normalized values

sigGenes.normalized.cor = cor(t(sigGenes.normalized), 
                              method="pearson")

Convert correlation into distance

sigGenes.normalized.dist = as.dist(1-sigGenes.normalized.cor)

Cluster the genes based on their distance

sigGenes.normalized.hclust = hclust(sigGenes.normalized.dist, 
                                    method="average")

We expect two different outcomes here so we can simply tell it to use a cutoff that will result in two groups

sigGenes.hclust.k2<-cutree(sigGenes.normalized.hclust, k=2)

Now to get all the genes that are in cluster 2 simply type.

hclust.k2.cluster2=names(which(sigGenes.hclust.k2==2))

Now we can create a new matrix/data frame with just these genes. This new matrix can be used to plot a heatmap to make it easier to see a expression profile of the cluster (see below).

hclust.k2.cluster2.normalized = sigGenes.normalized[hclust.k2.cluster2,]

Load silhouette library

library(cluster)

Calculate silhouette values

sigGenes.hclust.k2.sil<-silhouette(sigGenes.hclust.k2, 
                                   sigGenes.normalized.dist)

Plot the silhouette

plot(sigGenes.hclust.k2.sil)

K-means

The K-means method uses euclidean distance to measure distance. Since in biology we are more interested in gene expression profiles instead of magnitude of expression levels, let's scale our data so that the mean of the expression values is 0 and the expression values will be the standard deviations away from the mean.

sigGenesMean = rowMeans(sigGenes.normalized)
sigGenesSD = apply(sigGenes.normalized, 1, sd)

Heatmap

library(gplots)

These functions will make it easy for us to specify how we want the clustering to be performed in the heatmap function

hclust2 <- function(x, method="average", ...) {
  hclust(x, method=method, ...)
}

dist2 <- function(x, ...) {
  as.dist(1-cor(t(x), method="pearson"))  			
}

Create heatmap. We can save it to a pdf file. Note that sigGenes.normalized is just a matrix. Here we can provide any matrix of values, for example hclust.k2.cluster2.normalized which is the expression values of genes in cluster 2 (see above)

pdf("heatmap.pdf")
heatmap.2(sigGenes.normalized, 
          col=redgreen(75),
          hclustfun=hclust2,
          distfun=dist2,
          scale="row", 
          cexCol=0.6, 
          Colv=TRUE,
          sepcolor="black",
          dendrogram="both",
          key=TRUE, 
          symkey=FALSE, 
          density.info="none", 
          trace="none", 
          cexRow=0.4)
dev.off()
mkatari-bioinformatics-august-2013-clustering.1418308919.txt.gz · Last modified: 2014/12/11 14:41 by mkatari