[[mkatari-bioinformatics-august-2013|Back to Manny's Bioinformatics Workshop HOME]] ====== Clustering rna-seq data ====== continuation from [[mkatari-bioinformatics-august-2013-deseq|DESeq]] [[https://drive.google.com/file/d/0B172nc4dAaaORnh3MkZqUE9PVjA/view?usp=sharing|resSig.txt]] [[https://drive.google.com/file/d/0B172nc4dAaaONXFIX2YxeDRCbkE/view?usp=sharing|normalized.txt]] In case you didn't get DESeq to work download and load the files above


resSig = read.table("resSig.txt", header=T)
normalized = read.table("normalized.txt", header=T, row.names=1)

Get the significant genes


sigGenes = resSig$id

Get the normalized values for the significant genes


sigGenes.normalized = normalized[as.character(sigGenes),]

Correlate the normalized values


sigGenes.normalized.cor = cor(t(sigGenes.normalized), 
                              method="pearson")

Convert correlation into distance


sigGenes.normalized.dist = as.dist(1-sigGenes.normalized.cor)

Cluster the genes based on their distance


sigGenes.normalized.hclust = hclust(sigGenes.normalized.dist, 
                                    method="average")

We expect two different outcomes here so we can simply tell it to use a cutoff that will result in two groups


sigGenes.hclust.k2<-cutree(sigGenes.normalized.hclust, k=2)

Now to get all the genes that are in cluster 2 simply type.


hclust.k2.cluster2=names(which(sigGenes.hclust.k2==2))

Now we can create a new matrix/data frame with just these genes. This new matrix can be used to plot a heatmap to make it easier to see a expression profile of the cluster (see below).


hclust.k2.cluster2.normalized = sigGenes.normalized[hclust.k2.cluster2,]

Load silhouette library


library(cluster)

Calculate silhouette values


sigGenes.hclust.k2.sil<-silhouette(sigGenes.hclust.k2, 
                                   sigGenes.normalized.dist)

Plot the silhouette


plot(sigGenes.hclust.k2.sil)

====== K-means ====== The K-means method uses euclidean distance to measure distance. Since in biology we are more interested in gene expression profiles instead of magnitude of expression levels, let's scale our data so that the mean of the expression values is 0 and the expression values will be the standard deviations away from the mean.


# this function takes a vector of gene expression values.
scaleData <- function(x) {
  x = as.numeric(x)
  meanx = mean(x)
  sdx = sd(x)
  y = (x-meanx)/sdx
  return(y)
}

we need to transpose it because apply function returns the genes as different columns.


scaledSigGenes = t(apply(sigGenes.normalized, 1, scaleData))
colnames(scaledSigGenes)=colnames(sigGenes.normalized)

now to run k-means, in this case we are starting with 2 cluster.


SigGenes.kmeans.2 = kmeans(scaledSigGenes, 2, nstart=25)

To obtain the measure of how well the clustering has performed, we can look at the sum of squares between members of the outside group and sum of squares total. Higher the better.


SigGenes.kmeans.2$betweenss/SigGenes.kmeans.2$totss

In order to determine the ideal number of k, we can try many different K's and look to see how well they performed.


getBestK <- function(x) {
  kmeans_ss=numeric()
  kmeans_ss[1]=0
  
  for (i in 2:20) {
     kmeans_tmp=kmeans(x, i, nstart=25)
     #alternate way of looking at proportion of ss that is provided by between groups.
     #kmeans_ss[i] = kmeans_tmp$betweenss/kmeans_tmp$totss   

     #using silhouette width to evaluate clusters.
     kmeans_sil= (kmeans_tmp$betweenss-kmeans_tmp$withinss)/max(kmeans_tmp$betweenss, kmeans_tmp$withinss) 
     kmeans_ss[i] = mean(kmeans_sil)


  }
  return(kmeans_ss)
}

kmeans_ss=getBestK(scaledSigGenes)
plot(kmeans_ss)

To get the genes in the different clusters


SigGenes.kmeans.2.group1 = names(which(SigGenes.kmeans.2$cluster==1))
SigGenes.kmeans.2.group2 = names(which(SigGenes.kmeans.2$cluster==2))

The code below plots k-means clustering results. You simply have to provide the k-means output and the labels.


plotClusterCenters<-function(kmeansres, 
                             myxlab="Treatment", 
                             myylab="Expression",
                             mymain="K-means Clusters") {
  
  mycolors=c("blue","red","green","orange","pink","black")
  centersdim = dim(kmeansres$centers)
  plot(kmeansres$centers[1,], 
       type="b", 
       col=mycolors[1], 
       xlab=myxlab,
       ylab=myylab,
       main=mymain,
       ylim=c(round(min(kmeansres$centers)),
                               round(max(kmeansres$centers))),
                               xaxt="n")
  
  axis(1, at=c(1:centersdim[2]), labels=names(kmeansres$centers[1,]))
  
  for (i in 2:centersdim[1]) {
    lines(kmeansres$centers[i,], type="b", col=mycolors[i])
  }
  
}


plotClusterCenters(SigGenes.kmeans.2)

====== Heatmap ======


install.packages("gplots")
library(gplots)

These functions will make it easy for us to specify how we want the clustering to be performed in the heatmap function


hclust2 <- function(x, method="average", ...) {
  hclust(x, method=method, ...)
}

dist2 <- function(x, ...) {
  as.dist(1-cor(t(x), method="pearson"))  			
}

Create heatmap. We can save it to a pdf file. Note that sigGenes.normalized is just a matrix. Here we can provide any matrix of values, for example hclust.k2.cluster2.normalized which is the expression values of genes in cluster 2 (see above)


pdf("heatmap.pdf")
heatmap.2(as.matrix(sigGenes.normalized), 
          col=redgreen(75),
          hclustfun=hclust2,
          distfun=dist2,
          scale="row", 
          cexCol=0.6, 
          Colv=TRUE,
          sepcolor="black",
          dendrogram="both",
          key=TRUE, 
          symkey=FALSE, 
          density.info="none", 
          trace="none", 
          cexRow=0.4)
dev.off()