--- mkatari-bioinformatics-august-2013-clustering [2014/12/11 14:16] – mkatari
+++ mkatari-bioinformatics-august-2013-clustering [2015/06/17 13:26] (current) – mkatari
@@ Line 4: / Line 4: @@
 ====== Clustering rna-seq data ======
 continuation from [[mkatari-bioinformatics-august-2013-deseq|DESeq]]
+[[https://drive.google.com/file/d/0B172nc4dAaaORnh3MkZqUE9PVjA/view?usp=sharing|resSig.txt]]
+[[https://drive.google.com/file/d/0B172nc4dAaaONXFIX2YxeDRCbkE/view?usp=sharing|normalized.txt]]
+If you could not get DESeq to work, download the two files above and load them:
+<code>
+resSig = read.table("resSig.txt", header=T)
+normalized = read.table("normalized.txt", header=T, row.names=1)
+</code>
 Get the significant genes
@@ Line 12: / Line 23: @@
 Get the normalized values for the significant genes
 <code>
-sigGenes.normalized = normalized[sigGenes,]
+sigGenes.normalized = normalized[as.character(sigGenes),]
 </code>
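+A plausible reason for wrapping the IDs in as.character() (an assumption; the page does not say so) is that sigGenes may be a factor, and R indexes by a factor's integer codes rather than by its labels. A small sketch with made-up gene names:
+<code>
+# toy data frame; gene names and values are invented for illustration
+toy = data.frame(s1=c(10, 20, 30), row.names=c("geneA", "geneB", "geneC"))
+# levels are geneA, geneC, so the integer codes of this factor are 2, 1
+ids = factor(c("geneC", "geneA"))
+# indexing with the factor uses the codes, i.e. rows 2 and 1
+byFactor = rownames(toy[ids, , drop=FALSE])
+# indexing with character strings looks the rows up by name
+byName = rownames(toy[as.character(ids), , drop=FALSE])
+</code>
+Here byFactor picks rows by position, while byName picks the rows that were actually requested.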
@@ Line 67: / Line 78: @@
 ====== K-means ======
+K-means measures similarity with Euclidean distance. Because in biology we are usually more interested in the shape of a gene's expression profile than in the magnitude of its expression, we first scale each gene so that its mean expression is 0 and each value is expressed as the number of standard deviations away from that mean.
+<code>
+# this function takes a vector of expression values for one gene and returns its z-scores.
+scaleData <- function(x) {
+  x = as.numeric(x)
+  meanx = mean(x)
+  sdx = sd(x)
+  y = (x-meanx)/sdx
+  return(y)
+}
+</code>
+We need to transpose the result because apply() returns each gene as a column rather than a row.
+<code>
+scaledSigGenes = t(apply(sigGenes.normalized, 1, scaleData))
+colnames(scaledSigGenes)=colnames(sigGenes.normalized)
+</code>
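+As a cross-check, base R's scale() centers and scales the columns of a matrix, so transposing before and after should give the same per-gene z-scores. A minimal sketch on a made-up matrix (toyExpr and the inline z-score function are illustrative stand-ins for the data and scaleData above):
+<code>
+# toy matrix: 3 genes (rows) by 4 samples (columns), invented values
+toyExpr = matrix(c(1, 2, 3, 4,
+                   10, 30, 20, 40,
+                   5, 5, 6, 8),
+                 nrow=3, byrow=TRUE,
+                 dimnames=list(c("g1", "g2", "g3"), c("s1", "s2", "s3", "s4")))
+# row-wise z-scores with apply(), as in the text
+viaApply = t(apply(toyExpr, 1, function(x) (x - mean(x)) / sd(x)))
+# scale() works column-wise, hence the two transposes
+viaScale = t(scale(t(toyExpr)))
+# largest absolute difference between the two results
+maxDiff = max(abs(viaApply - viaScale))
+</code>
+Either way, a gene whose values are identical in every sample has a standard deviation of 0 and produces NaN, so such genes should be removed before clustering.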
+Now run k-means. In this case we start with 2 clusters.
+<code>
+SigGenes.kmeans.2 = kmeans(scaledSigGenes, 2, nstart=25)
+</code>
+To measure how well the clustering performed, we can look at the ratio of the between-cluster sum of squares to the total sum of squares. The higher the ratio, the better the clusters are separated.
+<code>
+SigGenes.kmeans.2$betweenss/SigGenes.kmeans.2$totss
+</code>
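+The ratio makes sense because k-means splits the total sum of squares into a within-cluster part and a between-cluster part. A small sketch on simulated data (toyData is made up for illustration):
+<code>
+set.seed(1)
+# two well-separated groups of 20 points in 3 dimensions
+toyData = rbind(matrix(rnorm(60, mean=0), ncol=3),
+                matrix(rnorm(60, mean=5), ncol=3))
+toyKm = kmeans(toyData, 2, nstart=25)
+# check the decomposition: totss = tot.withinss + betweenss
+ssGap = abs(toyKm$totss - (toyKm$tot.withinss + toyKm$betweenss))
+ssRatio = toyKm$betweenss / toyKm$totss
+</code>
+ssGap should be zero up to rounding, and ssRatio lies between 0 and 1. Adding clusters tends to push the ratio up, so it is most informative when comparing runs with the same k.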
+To determine a good value of k, we can try many different values and see how well each one performs.
+<code>
+getBestK <- function(x) {
+  kmeans_ss=numeric()
+  kmeans_ss[1]=0
+  for (i in 2:20) {
+     kmeans_tmp=kmeans(x, i, nstart=25)
+     #alternative: the proportion of the total ss that is explained by the between-cluster ss.
+     #kmeans_ss[i] = kmeans_tmp$betweenss/kmeans_tmp$totss
+     #silhouette-like score built from the sums of squares; note this is not the true per-gene silhouette width.
+     kmeans_sil= (kmeans_tmp$betweenss-kmeans_tmp$withinss)/max(kmeans_tmp$betweenss, kmeans_tmp$withinss)
+     kmeans_ss[i] = mean(kmeans_sil)
+  }
+  return(kmeans_ss)
+}
+kmeans_ss=getBestK(scaledSigGenes)
+plot(kmeans_ss)
+</code>
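+The score in getBestK is built from sums of squares. For comparison, here is a sketch of the usual average silhouette width using silhouette() from the cluster package (one of R's recommended packages, assumed to be installed). The function name getSilhouetteByK and the toy data are made up for illustration:
+<code>
+library(cluster)
+getSilhouetteByK <- function(x, ks=2:6) {
+  d = dist(x)                        # Euclidean distances between rows
+  sapply(ks, function(k) {
+    km = kmeans(x, k, nstart=25)
+    sil = silhouette(km$cluster, d)  # one silhouette width per row
+    mean(sil[, "sil_width"])         # average width for this k
+  })
+}
+set.seed(1)
+# simulated data: two groups of 20 points in 3 dimensions
+toyData = rbind(matrix(rnorm(60, mean=0), ncol=3),
+                matrix(rnorm(60, mean=5), ncol=3))
+toySil = getSilhouetteByK(toyData)
+names(toySil) = 2:6
+</code>
+The k with the largest average width is a reasonable candidate. Silhouette widths range from -1 to 1.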
+To get the genes in the different clusters
+<code>
+SigGenes.kmeans.2.group1 = names(which(SigGenes.kmeans.2$cluster==1))
+SigGenes.kmeans.2.group2 = names(which(SigGenes.kmeans.2$cluster==2))
+</code>
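+With more than two clusters it gets tedious to write one line per group. A sketch using split(), which returns a list with one vector of gene names per cluster (the toy data and gene names are invented here):
+<code>
+set.seed(1)
+# simulated data: 20 "genes" measured in 3 samples
+toyData = rbind(matrix(rnorm(30, mean=0), ncol=3),
+                matrix(rnorm(30, mean=5), ncol=3))
+rownames(toyData) = paste0("gene", 1:20)
+toyKm = kmeans(toyData, 3, nstart=25)
+# list of gene names, one element per cluster number
+genesByCluster = split(names(toyKm$cluster), toyKm$cluster)
+</code>
+genesByCluster[["1"]] then holds the genes in cluster 1, and so on for the other clusters.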
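+The function below plots the center of each k-means cluster across the samples. You only have to provide the k-means output and, optionally, the axis labels and title. Only six colors are defined, so clusters beyond the sixth have no color assigned.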
+<code>
+plotClusterCenters<-function(kmeansres,
+                             myxlab="Treatment",
+                             myylab="Expression",
+                             mymain="K-means Clusters") {
+  mycolors=c("blue","red","green","orange","pink","black")
+  centersdim = dim(kmeansres$centers)
+  plot(kmeansres$centers[1,],
+       type="b",
+       col=mycolors[1],
+       xlab=myxlab,
+       ylab=myylab,
+       main=mymain,
+       ylim=c(floor(min(kmeansres$centers)),
+              ceiling(max(kmeansres$centers))),
+       xaxt="n")
+  axis(1, at=c(1:centersdim[2]), labels=names(kmeansres$centers[1,]))
+  for (i in 2:centersdim[1]) {
+    lines(kmeansres$centers[i,], type="b", col=mycolors[i])
+  }
+}
+plotClusterCenters(SigGenes.kmeans.2)
+</code>
@@ Line 73: / Line 179: @@
 <code>
+install.packages("gplots")
 library(gplots)
 </code>
@@ Line 92: / Line 199: @@
 <code>
 pdf("heatmap.pdf")
-heatmap.2(sigGenes.normalized,
+heatmap.2(as.matrix(sigGenes.normalized),
           col=redgreen(75),
           hclustfun=hclust2,