Differences

This shows you the differences between two versions of the page.

--- mkatari-bioinformatics-august-2013-clustering [2014/12/11 15:24] – [K-means] mkatari
+++ mkatari-bioinformatics-august-2013-clustering [2014/12/15 11:58] – mkatari
@@ Line 70: / Line 70: @@
 <code>
-# this function takes an vector to be calculated.
+# this function takes a vector of gene expression values.
 scaleData <- function(x) {
   x = as.numeric(x)
@@ Line 78: / Line 78: @@
   return(y)
 }
+</code>
-#we need to transpose it because apply function returns the genes as different columns.
+We need to transpose the result because the apply function returns the genes as columns rather than rows.
+<code>
 scaledSigGenes = t(apply(sigGenes.normalized, 1, scaleData))
 colnames(scaledSigGenes)=colnames(sigGenes.normalized)
+</code>
-#now to run k-means, in this case we are starting with 2 cluster.
+Now run k-means. In this case we start with 2 clusters.
-#just like for heirarchical clustering, we have to first transpose the data so compare genes.
+<code>
 SigGenes.kmeans.2 = kmeans(t(scaledSigGenes), 2)
+</code>
-#a plot of the groups
+To measure how well the clustering has performed, we can look at the ratio of the between-cluster sum of squares to the total sum of squares. The higher the ratio, the better.
-plot(SigGenes.kmeans.2$centers[1,], SigGenes.kmeans.2$centers[2,])
-# a measure of how well the clustering has performed
+<code>
-# it is the sum of squares between members of the outside group and sum of squares total
-# higher the better.
 SigGenes.kmeans.2$betweenss/SigGenes.kmeans.2$totss
+</code>
-#to get the genes in the different clusters
+To determine the ideal value of k, we can try many different values of k and see how well each one performs.
+<code>
+getBestK <- function(x) {
+  kmeans_ss=numeric()
+  kmeans_ss[1]=0
+  for (i in 2:20) {
+     kmeans_tmp=kmeans(x, i, nstart=25)
+     kmeans_ss[i] = kmeans_tmp$betweenss/kmeans_tmp$totss
+  }
+  return(kmeans_ss)
+}
+kmeans_ss=getBestK(scaledSigGenes)
+plot(kmeans_ss)
+</code>
+To get the genes in the different clusters:
+<code>
 SigGenes.kmeans.2.group1 = names(which(SigGenes.kmeans.2$cluster==1))
 SigGenes.kmeans.2.group2 = names(which(SigGenes.kmeans.2$cluster==2))
+</code>
 </code>
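
The idea of trying several values of k and comparing the betweenss/totss ratio can be sketched on synthetic data. This is a minimal illustration, not the page's own code: `toyData` is a made-up stand-in for `scaledSigGenes`, and `ssRatioByK` is a hypothetical helper name. The key point is that the loop variable is passed to `kmeans` as the number of centers, so each entry reflects a different k.

```r
# Synthetic stand-in for scaledSigGenes (assumption, values are made up):
# 60 "genes" (rows) x 6 "samples" (columns) drawn from three separated groups.
set.seed(42)
toyData <- rbind(
  matrix(rnorm(20 * 6, mean = 0), ncol = 6),
  matrix(rnorm(20 * 6, mean = 5), ncol = 6),
  matrix(rnorm(20 * 6, mean = 10), ncol = 6)
)
rownames(toyData) <- paste0("gene", seq_len(nrow(toyData)))

# Hypothetical helper: for each k in ks, run k-means and record betweenss / totss.
ssRatioByK <- function(x, ks = 2:10, nstart = 25) {
  ratios <- sapply(ks, function(k) {
    fit <- kmeans(x, centers = k, nstart = nstart)
    fit$betweenss / fit$totss
  })
  names(ratios) <- ks
  ratios
}

toyRatios <- ssRatioByK(toyData)

# Plot the ratio against k; the point where the curve flattens suggests a reasonable k.
plot(as.integer(names(toyRatios)), toyRatios, type = "b",
     xlab = "k", ylab = "betweenss / totss")
```

Because the ratio tends to keep rising as k grows, the highest value alone is not a good criterion. Looking for the k after which the curve levels off is a common heuristic, and with the three synthetic groups above the jump from k = 2 to k = 3 should be the largest.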