User Tools

Site Tools


mkatari-bioinformatics-august-2013-deseq

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Next revision
Previous revision
mkatari-bioinformatics-august-2013-deseq [2013/08/23 14:18] – created mkatarimkatari-bioinformatics-august-2013-deseq [2015/08/21 14:13] (current) mkatari
Line 1: Line 1:
-[[mkatari-bioinformatics-august-2013|Manny's Bioinformatics Workshop]]+[[mkatari-bioinformatics-august-2013|Back to Manny's Bioinformatics Workshop HOME]]
  
-Here we will discuss how to create an R script that can be executed on HPC. Majority of the script is the same except for the first few commands that read the arguments from command line.+Here we will discuss how to create an R script (DESeq.R) that can be executed on HPC. This script has been adapted from the DESeq manual [[ http://www.bioconductor.org/packages/release/bioc/vignettes/DESeq/inst/doc/DESeq.pdf|DESeq manual]]  except paths to the files are replaced with variables and the plots are being saved as pdf documents in an output directory.
  
 If you are going to run DESeq in R on your desktop you will have to make sure DESeq is already installed. If you are going to run DESeq in R on your desktop you will have to make sure DESeq is already installed.
- In order to install DESeq type the following:+In order to install DESeq type the following:
  
 <code> <code>
Line 11: Line 11:
 </code> </code>
  
-However for the script that is available on HPC the script will automatically find DESeq. R uses variables to store locations where it should look for packages. Here we can simply add a path to where a specific module is located. This will prevent the need for others to have to install the module themselves. This is done int he following lines of the code+Regardless of whether you are running the code locally or on the server the script must load the library
  
 <code> <code>
-mannyLibPaths = '/home/mkatari/R/x86_64-unknown-linux-gnu-library/3.0' +library(DESeq)
-.libPaths(new='/home/mkatari/R/x86_64-unknown-linux-gnu-library/3.0')+
 </code> </code>
  
-Regardless of whether you are running the code locally or on the server the script must load the library+An example of the count data file is provided [[https://docs.google.com/file/d/0B172nc4dAaaOMG44Zk1BT2NFdkU/edit?usp=sharing|here]]
  
 +First we will load the count data file.
 <code> <code>
-library(DESeq)+counts = read.table("NextGenRaw.txt", header=T, row.names=1)
 </code> </code>
  
-Now for our script we will use a command that allows the R to read in arguments from command line automaticallyThis will be helpful when we are using an R script in an analysis pipelineThe code that reads arguments from the command line are: +Then we will load the experimental designAn example is provided [[https://docs.google.com/file/d/0B172nc4dAaaOaE5fTVVhUHJKazg/edit?usp=sharing|here]]:
 <code> <code>
-userargs commandArgs(TRUE) +expdesign read.table("expdesign.txt")
-pathToCountsData = userargs[1] +
-pathToExpDesign = userargs[2] +
-pathToOutput = userargs[3]+
 </code> </code>
  
- +The counts that were loaded as a data.frame are now used to create a new type of objectcount data set 
-The input for DESeq is a matrix/data.frame containing read counts. An example is provided [[https://docs.google.com/file/d/0B172nc4dAaaOMG44Zk1BT2NFdkU/edit?usp=sharing|here]] +<code>
- +
-#counts = read.table("NextGenRaw.txt", header=T, row.names=1) +
-counts = read.table(pathToCountsData, header=T, row.names=1) +
- +
-#This is simply meta-data to store information about the samples. +
-#expdesign = data.frame( +
-#  row.names=colnames(counts), +
-#  condition=c("untreated","untreated","treated","treated"), +
-#  libType=c("single-end","single-end","single-end","single-end"+
-#) +
-expdesign = read.table(pathToExpDesign) +
- +
-#The counts that were loaded as a data.frame are now used to create +
-#a new type of object-> count data set+
 cds = newCountDataSet(counts, expdesign$condition) cds = newCountDataSet(counts, expdesign$condition)
 +</code>
  
-#Now we can perform operations on the dataset and save the results in +Now we can perform operations on the dataset and save the results in the same object.
-#the same object.+
  
-#first lets estimate the size factor based on the number of aligned reads +First lets estimate the size factor based on the number of aligned reads from each sample. 
-#from each sample.+<code>
 cds = estimateSizeFactors(cds) cds = estimateSizeFactors(cds)
 +</code>
  
-#to see the size factors:+To see the size factors: 
 +<code>
 sizeFactors(cds) sizeFactors(cds)
 +</code>
  
-#To perform a normalization you can simply use this command. +To perform a normalization you can simply use this command. Note that the normalized values will not be used for identifying differentially expressed genes but we can use for some downstream analysis. 
-#Note that the normalized values will not be used for identifying  +<code>
-#differentially expressed genes+
 normalized=counts( cds, normalized=TRUE )  normalized=counts( cds, normalized=TRUE ) 
 +</code>
  
-#An important part of DESeq is to estimate dispersion. This is simply +An important part of DESeq is to estimate dispersion. This is simply a form of variance for the genes. 
-#a form of variance for the genes.+<code> 
 +# if you have replicates do the following:
 cds = estimateDispersions( cds ) cds = estimateDispersions( cds )
 +### HOWEVER, If you have NO replicates, then try this
 +cds = estimateDispersions( cds, method="blind" , sharingMode = "fit-only" )
 +</code>
  
-#To visualize the disperson graph +To visualize the disperson graph 
-pdf("Dispersion.pdf")+<code> 
 +dispersionFile = "Dispersion.pdf" 
 +pdf(dispersionFile)
 plotDispEsts( cds ) plotDispEsts( cds )
 dev.off() dev.off()
 +</code>
  
 #To see the dispersion values which will be used for the final test #To see the dispersion values which will be used for the final test
 +<code>
 head( fData(cds) ) head( fData(cds) )
 +</code>
  
-#Finally to perform the negative binomial test on the dataset to identify +Finally to perform the negative binomial test on the dataset to identify differentially expressed genes. 
-#differentially expressed genes.+<code>
 res = nbinomTest( cds, "untreated", "treated" ) res = nbinomTest( cds, "untreated", "treated" )
 +</code>
  
-#An MA plot allows us to see the fold change vs level of expression. +An MA plot allows us to see the fold change vs level of expression. In the plot, the red points are for genes that have FDR of 10%.
-#In the plot, the red points are for genes that have FDR of 10%.+
  
-pdf("MAplot.pdf")+<code> 
 +maFile = "MAplot.pdf" 
 +pdf(maFile)
 plotMA(res) plotMA(res)
 dev.off() dev.off()
 +</code>
  
-#To get the genes that have FDR of 10% +#To get the genes that have FDR of 10% and save it in the output directory. 
-write.table(resSig,  +<code> 
-            pathToOutput+resSigind = res[ which(res$padj < 0.1 & res$log2FoldChange > 1), ] 
 +resSigrep = res[ which(res$padj < 0.1 & res$log2FoldChange < -1), ] 
 + 
 +indoutfile = "Deseq.indresults.txt"  
 +repoutfile = "Deseq.represults.txt"  
 + 
 +write.table(resSigind,  
 +            indoutfile
             sep="\t",              sep="\t", 
             col.names=T,              col.names=T, 
Line 97: Line 101:
             quote=F)             quote=F)
  
-#DESeq manual: http://www.bioconductor.org/packages/release/bioc/vignettes/DESeq/inst/doc/DESeq.pdf+write.table(resSigrep,  
 +            repoutfile,  
 +            sep="\t",  
 +            col.names=T,  
 +            row.names=F, 
 +            quote=F) 
 +</code> 
 + 
mkatari-bioinformatics-august-2013-deseq.1377267488.txt.gz · Last modified: 2013/08/23 14:18 by mkatari