mkatari-bioinformatics-august-2013-clustering

Differences

This shows you the differences between two versions of the page.

mkatari-bioinformatics-august-2013-clustering [2014/12/11 15:24] – [K-means] mkatarimkatari-bioinformatics-august-2013-clustering [2014/12/15 11:58] mkatari
Line 70: Line 70:
  
 <code> <code>
-# this function takes an vector to be calculated.+# this function takes a vector of gene expression values.
 scaleData <- function(x) { scaleData <- function(x) {
   x = as.numeric(x)   x = as.numeric(x)
Line 78: Line 78:
   return(y)   return(y)
 } }
 +</code>
  
-#we need to transpose it because apply function returns the genes as different columns.+We need to transpose the result because apply() returns each gene as a separate column.
 + 
 +<code>
 scaledSigGenes = t(apply(sigGenes.normalized, 1, scaleData)) scaledSigGenes = t(apply(sigGenes.normalized, 1, scaleData))
 colnames(scaledSigGenes)=colnames(sigGenes.normalized) colnames(scaledSigGenes)=colnames(sigGenes.normalized)
 +</code>
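The need for the transpose can be seen on a small made-up matrix. The sketch below is only an illustration: `toy` and `centerRow` are hypothetical names, and `centerRow` is a simple stand-in that subtracts the row mean, not the page's `scaleData`.

```r
# Toy expression matrix: 3 genes (rows) x 4 samples (columns).
# All names and values here are made up for illustration.
toy <- matrix(c(1, 2, 3, 4,
                2, 4, 6, 8,
                5, 5, 6, 6),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("geneA", "geneB", "geneC"),
                              c("s1", "s2", "s3", "s4")))

# Hypothetical stand-in for scaleData: centre each gene on its mean.
centerRow <- function(x) {
  x <- as.numeric(x)
  x - mean(x)
}

raw <- apply(toy, 1, centerRow)   # one column per gene: 4 x 3
fixed <- t(raw)                   # back to one row per gene: 3 x 4
colnames(fixed) <- colnames(toy)  # as.numeric() dropped the sample names

dim(raw)
dim(fixed)
```

When the function passed to `apply` returns a vector, each result is stored as a column, so applying over rows flips the orientation. The call to `t()` restores it.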
  
-#now to run k-means, in this case we are starting with 2 cluster+Now we run k-means; in this case we start with 2 clusters.
-#just like for heirarchical clustering, we have to first transpose the data so compare genes.+
  
 +<code>
 SigGenes.kmeans.2 = kmeans(t(scaledSigGenes), 2) SigGenes.kmeans.2 = kmeans(t(scaledSigGenes), 2)
 +</code>
  
-#a plot of the groups +To measure how well the clustering has performed, we can look at the ratio of the between-cluster sum of squares to the total sum of squares. The higher the ratio, the better.
-plot(SigGenes.kmeans.2$centers[1,], SigGenes.kmeans.2$centers[2,])+
  
-# a measure of how well the clustering has performed +<code>
-# it is the sum of squares between members of the outside group and sum of squares total +
-# higher the better.+
 SigGenes.kmeans.2$betweenss/SigGenes.kmeans.2$totss SigGenes.kmeans.2$betweenss/SigGenes.kmeans.2$totss
 +</code>
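As a rough illustration of what this ratio measures, here is a minimal sketch on made-up data. The matrix `toy` and the seed are assumptions for the example and are not part of the tutorial's data set.

```r
set.seed(1)  # k-means starts from random centres, so fix the seed

# Two well-separated made-up groups of 10 points in 2 dimensions.
toy <- rbind(matrix(rnorm(20, mean = 0, sd = 0.5), ncol = 2),
             matrix(rnorm(20, mean = 5, sd = 0.5), ncol = 2))

fit <- kmeans(toy, centers = 2, nstart = 10)

# totss = tot.withinss + betweenss, so the ratio lies between 0 and 1.
ratio <- fit$betweenss / fit$totss
ratio
```

A ratio close to 1 means most of the total variation is explained by differences between clusters, not by spread within them.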
  
-#to get the genes in the different clusters+To determine the ideal value of k, we can try many different values of k and see how well each one performs.
 + 
 +<code> 
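 +getBestK <- function(x) {
 +  # ratio of between-cluster to total sum of squares, indexed by k
 +  kmeans_ss=numeric()
 +  kmeans_ss[1]=0   # a single cluster has no between-cluster variation
 +  
 +  for (i in 2:20) {
 +     # cluster into i groups, keeping the best of 25 random starts
 +     kmeans_tmp=kmeans(x, i, nstart=25)
 +     kmeans_ss[i] = kmeans_tmp$betweenss/kmeans_tmp$totss
 +  }
 +  return(kmeans_ss)
 +}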
 + 
 +kmeans_ss=getBestK(scaledSigGenes) 
 +plot(kmeans_ss) 
 + 
 +</code> 
 +To get the genes in the different clusters:
 +<code>
 SigGenes.kmeans.2.group1 = names(which(SigGenes.kmeans.2$cluster==1)) SigGenes.kmeans.2.group1 = names(which(SigGenes.kmeans.2$cluster==1))
 SigGenes.kmeans.2.group2 = names(which(SigGenes.kmeans.2$cluster==2)) SigGenes.kmeans.2.group2 = names(which(SigGenes.kmeans.2$cluster==2))
 +</code>
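With more than two clusters, writing one `which()` line per group becomes repetitive. A possible alternative, sketched here on made-up data with hypothetical row labels, is to split the names by cluster assignment in one step.

```r
set.seed(1)

# Made-up data: 20 labelled rows forming two separated groups.
toy <- rbind(matrix(rnorm(20, mean = 0, sd = 0.5), ncol = 2),
             matrix(rnorm(20, mean = 5, sd = 0.5), ncol = 2))
rownames(toy) <- paste0("item", 1:20)   # hypothetical labels

fit <- kmeans(toy, centers = 2, nstart = 10)

# One character vector of row names per cluster, in a named list.
groups <- split(names(fit$cluster), fit$cluster)
lengths(groups)
```

Here `groups[["1"]]` holds the same names that `names(which(fit$cluster==1))` would return.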
 +
  
 </code> </code>
mkatari-bioinformatics-august-2013-clustering.txt · Last modified: 2015/06/17 13:26 by mkatari