Differences

This shows you the differences between two versions of the page.

--- mkatari-bioinformatics-august-2013-clustering [2013/10/11 15:01] – mkatari
+++ mkatari-bioinformatics-august-2013-clustering [2014/12/11 15:11] – [K-means] mkatari
@@ Line 65: / Line 65: @@
 plot(sigGenes.hclust.k2.sil)
 </code>
+====== K-means ======
+The K-means method uses Euclidean distance to measure how dissimilar two genes are. Since in biology we are usually more interested in the shape of a gene's expression profile than in the magnitude of its expression, let's scale the data gene by gene so that each gene has a mean expression of 0 and each value is expressed as the number of standard deviations away from that mean.
+<code>
+# this function takes a numeric vector and returns its z-scores (mean 0, standard deviation 1).
+scaleData <- function(x) {
+  x = as.numeric(x)
+  meanx = mean(x)
+  sdx = sd(x)
+  y = (x-meanx)/sdx
+  return(y)
+}
+#we need to transpose the result because apply() returns each gene as a column.
+scaledSigGenes = t(apply(sigGenes.normalized, 1, scaleData))
+colnames(scaledSigGenes)=colnames(sigGenes.normalized)
+#now to run k-means
+SigGenes.kmeans.2 = kmeans(scaledSigGenes, 2)
+# a measure of how well the clustering has performed:
+# the ratio of the between-cluster sum of squares to the total sum of squares.
+# for a given number of clusters, the higher the better.
+SigGenes.kmeans.2$betweenss/SigGenes.kmeans.2$totss
+#to get the genes in the different clusters
+SigGenes.kmeans.2.group1 = names(which(SigGenes.kmeans.2$cluster==1))
+SigGenes.kmeans.2.group2 = names(which(SigGenes.kmeans.2$cluster==2))
+</code>
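+To see why scaling makes K-means group genes by profile rather than by magnitude, here is a minimal sketch on made-up data. The names toyGenes, toyScaled and toyKmeans, the seed, and nstart=10 are illustrative assumptions and are not part of the analysis above. After each gene is z-scored, the squared Euclidean distance between two genes equals 2(n-1)(1-r), where r is their Pearson correlation and n is the number of samples, so genes with the same shape end up close together whatever their expression level.
+<code>
+set.seed(42)  # arbitrary seed so the toy example is repeatable
+
+# toy data: genes 1-3 rise across 5 samples, genes 4-6 fall,
+# at very different magnitudes (x1, x10, x100), plus a little noise
+rising  <- c(1, 2, 3, 4, 5)
+falling <- rev(rising)
+toyGenes <- rbind(rising, rising * 10, rising * 100,
+                  falling, falling * 10, falling * 100)
+toyGenes <- toyGenes + rnorm(length(toyGenes), sd = 0.05)
+rownames(toyGenes) <- paste0("gene", 1:6)
+colnames(toyGenes) <- paste0("sample", 1:5)
+
+# z-score each gene (row), the same idea as scaleData above
+toyScaled <- t(apply(toyGenes, 1, function(x) (x - mean(x)) / sd(x)))
+
+# for z-scored rows: squared euclidean distance = 2 * (n - 1) * (1 - pearson r)
+n  <- ncol(toyScaled)
+d2 <- as.matrix(dist(toyScaled))^2
+r  <- cor(t(toyGenes))
+# should agree up to floating-point error
+all.equal(d2, 2 * (n - 1) * (1 - r), check.attributes = FALSE)
+
+# several random starts make the result less dependent on the initial centers
+toyKmeans <- kmeans(toyScaled, centers = 2, nstart = 10)
+toyKmeans$cluster
+toyKmeans$betweenss / toyKmeans$totss
+</code>
+Because kmeans starts from randomly chosen centers, cluster labels and the sum-of-squares ratio can change from run to run. Fixing a seed and using several starts, as in the sketch, is one way to make the result repeatable.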
 ====== Heatmap ======