Differences

This shows you the differences between two versions of the page.

--- mkatari-bioinformatics-august-2013-clustering [2013/08/29 09:29] – created mkatari
+++ mkatari-bioinformatics-august-2013-clustering [2014/12/11 15:24] – [K-means] mkatari
@@ Line 2: / Line 2: @@
-Clustering rna-seq data, continuation from [[mkatari-bioinformatics-august-2013-deseq|DESeq]]
+====== Clustering RNA-seq data ======
+Continuation from [[mkatari-bioinformatics-august-2013-deseq|DESeq]]
 Get the significant genes
@@ Line 35: / Line 36: @@
 <code>
 sigGenes.hclust.k2<-cutree(sigGenes.normalized.hclust, k=2)
+</code>
+To get the names of all the genes in cluster 2, type:
+<code>
+hclust.k2.cluster2=names(which(sigGenes.hclust.k2==2))
+</code>
+Now we can create a new matrix/data frame with just these genes. This new matrix can be used to plot a heatmap, which makes it easier to see the expression profile of the cluster (see below).
+<code>
+hclust.k2.cluster2.normalized = sigGenes.normalized[hclust.k2.cluster2,]
 </code>
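The cut, select, and subset steps can be tried on a small made-up matrix first. This is only a sketch: the toy matrix, its gene and sample names, and the use of Euclidean distance with average linkage are assumptions for illustration, not the settings used for sigGenes.normalized.

```r
# Hypothetical toy data: 6 genes (rows) x 4 samples (columns)
set.seed(1)
toy <- matrix(rnorm(24), nrow = 6,
              dimnames = list(paste0("gene", 1:6), paste0("sample", 1:4)))

# Cluster the genes (rows) and cut the tree into 2 groups
toy.hclust <- hclust(dist(toy), method = "average")
toy.k2 <- cutree(toy.hclust, k = 2)

# Names of the genes assigned to cluster 2
toy.cluster2 <- names(which(toy.k2 == 2))

# Subset the matrix; drop = FALSE keeps a matrix even if only one gene is selected
toy.cluster2.values <- toy[toy.cluster2, , drop = FALSE]
```

The drop = FALSE argument matters only when a cluster contains a single gene, in which case plain row indexing would return a vector instead of a matrix.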
@@ Line 44: / Line 57: @@
 Calculate silhouette values
 <code>
-sigGenes.hclust.k2.sil<-silhouette(sigGenes.hclust.k2, sigGenes.normalized.dist)
+sigGenes.hclust.k2.sil<-silhouette(sigGenes.hclust.k2,
+                                   sigGenes.normalized.dist)
 </code>
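As a self-contained illustration of the silhouette step, the sketch below runs it on made-up data. It assumes the cluster package (which provides silhouette) is installed; the toy matrix, the variable names, and the choice of Euclidean distance are hypothetical and may differ from how sigGenes.normalized.dist was built.

```r
library(cluster)  # provides silhouette()

# Hypothetical toy data: 10 genes (rows) x 4 samples (columns)
set.seed(1)
toy <- matrix(rnorm(40), nrow = 10,
              dimnames = list(paste0("gene", 1:10), paste0("sample", 1:4)))

# Distance between genes, then hierarchical clustering cut into 2 groups
toy.dist <- dist(toy)
toy.k2 <- cutree(hclust(toy.dist, method = "average"), k = 2)

# One silhouette width per gene, each between -1 and 1
toy.sil <- silhouette(toy.k2, toy.dist)

# Average silhouette width over all genes; values closer to 1 mean tighter clusters
toy.avg.width <- summary(toy.sil)$avg.width
```

Comparing the average width for several values of k is one common way to choose the number of clusters.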
@@ Line 52: / Line 66: @@
 </code>
-Heatmap
+====== K-means ======
+K-means measures similarity with Euclidean distance. Since in biology we are usually more interested in the shape of a gene's expression profile than in the magnitude of its expression, let's scale the data so that each gene's expression values have a mean of 0 and are expressed as the number of standard deviations away from that gene's mean.
+<code>
+# this function takes a numeric vector and returns its z-scores (mean 0, standard deviation 1).
+scaleData <- function(x) {
+  x = as.numeric(x)
+  meanx = mean(x)
+  sdx = sd(x)
+  y = (x-meanx)/sdx
+  return(y)
+}
+#apply() returns one column per gene, so we transpose the result to get the genes back in rows.
+scaledSigGenes = t(apply(sigGenes.normalized, 1, scaleData))
+colnames(scaledSigGenes)=colnames(sigGenes.normalized)
+#now to run k-means, in this case we are starting with 2 clusters.
+#kmeans clusters the rows of its input; scaledSigGenes already has one gene per row, so no transpose is needed to compare genes.
+SigGenes.kmeans.2 = kmeans(scaledSigGenes, 2)
+#plot the center of cluster 1 against the center of cluster 2
+plot(SigGenes.kmeans.2$centers[1,], SigGenes.kmeans.2$centers[2,])
+# a measure of how well the clustering has performed:
+# the ratio of the between-cluster sum of squares to the total sum of squares.
+# the higher the better.
+SigGenes.kmeans.2$betweenss/SigGenes.kmeans.2$totss
+#to get the genes in the different clusters
+SigGenes.kmeans.2.group1 = names(which(SigGenes.kmeans.2$cluster==1))
+SigGenes.kmeans.2.group2 = names(which(SigGenes.kmeans.2$cluster==2))
+</code>
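K-means starts from randomly chosen centers, so repeated runs can give different clusters. The sketch below shows one way to make a run reproducible and to compare several values of k using the same betweenss/totss ratio. The toy matrix and all variable names are hypothetical, and the seed and nstart = 25 are arbitrary choices; t(scale(t(x))) is used as a built-in way to compute row-wise z-scores.

```r
set.seed(42)  # fix the random starting centers so the run is reproducible

# Hypothetical toy data: 20 genes (rows) x 6 samples (columns).
# Genes 1-10 are high in samples 4-6, genes 11-20 are high in samples 1-3.
up   <- matrix(rnorm(60, mean = rep(c(0, 0, 0, 3, 3, 3), each = 10)), nrow = 10)
down <- matrix(rnorm(60, mean = rep(c(3, 3, 3, 0, 0, 0), each = 10)), nrow = 10)
toy  <- rbind(up, down)
rownames(toy) <- paste0("gene", 1:20)
colnames(toy) <- paste0("sample", 1:6)

# Row-wise z-scores: scale() works on columns, so transpose before and after
toy.scaled <- t(scale(t(toy)))

# kmeans groups the rows (genes); nstart = 25 tries 25 random starts and keeps the best
toy.kmeans.2 <- kmeans(toy.scaled, centers = 2, nstart = 25)
toy.ratio.2 <- toy.kmeans.2$betweenss / toy.kmeans.2$totss

# The same ratio for k = 2 to 5, to see where adding clusters stops helping
toy.ratios <- sapply(2:5, function(k) {
  fit <- kmeans(toy.scaled, centers = k, nstart = 25)
  fit$betweenss / fit$totss
})
names(toy.ratios) <- paste0("k", 2:5)
```

The ratio tends to increase as k grows, so a point where it levels off is usually more informative than its maximum.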
+====== Heatmap ======
 <code>
@@ Line 60: / Line 112: @@
 These functions will make it easy for us to specify how we want the clustering to be performed in the heatmap function
-</code>
+<code>
 hclust2 <- function(x, method="average", ...) {
   hclust(x, method=method, ...)
@@ Line 70: / Line 122: @@
 </code>
-Create heatmap. We can save it to a pdf file
+Create heatmap. We can save it to a pdf file. Note that sigGenes.normalized is just a matrix, so we can provide any matrix of values here, for example hclust.k2.cluster2.normalized, which holds the expression values of the genes in cluster 2 (see above).
 <code>
@@ Line 90: / Line 142: @@
 dev.off()
 </code>