User Tools

Site Tools


mkatari-bioinformatics-august-2013-deseq

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revisionPrevious revision
Next revision
Previous revision
Last revisionBoth sides next revision
mkatari-bioinformatics-august-2013-deseq [2013/08/23 14:41] mkatarimkatari-bioinformatics-august-2013-deseq [2015/08/21 13:53] mkatari
Line 1: Line 1:
 [[mkatari-bioinformatics-august-2013|Back to Manny's Bioinformatics Workshop HOME]] [[mkatari-bioinformatics-august-2013|Back to Manny's Bioinformatics Workshop HOME]]
  
-Here we will discuss how to create an R script that can be executed on HPC. Majority of the script is the same except for the first few commands that read the arguments from command line.+Here we will discuss how to create an R script (DESeq.R) that can be executed on HPC. This script has been adapted from the DESeq manual [[ http://www.bioconductor.org/packages/release/bioc/vignettes/DESeq/inst/doc/DESeq.pdf|DESeq manual]]  except paths to the files are replaced with variables and the plots are being saved as pdf documents in an output directory.
  
 If you are going to run DESeq in R on your desktop you will have to make sure DESeq is already installed. If you are going to run DESeq in R on your desktop you will have to make sure DESeq is already installed.
- In order to install DESeq type the following:+In order to install DESeq type the following:
  
 <code> <code>
 source("http://bioconductor.org/biocLite.R") source("http://bioconductor.org/biocLite.R")
 biocLite("DESeq") biocLite("DESeq")
-</code> 
- 
-However for the script that is available on HPC the script will automatically find DESeq. R uses variables to store locations where it should look for packages. Here we can simply add a path to where a specific module is located. This will prevent the need for others to have to install the module themselves. This is done int he following lines of the code 
- 
-<code> 
-mannyLibPaths = '/home/mkatari/R/x86_64-unknown-linux-gnu-library/3.0' 
-.libPaths(new='/home/mkatari/R/x86_64-unknown-linux-gnu-library/3.0') 
 </code> </code>
  
Line 24: Line 17:
 </code> </code>
  
-Now for our script we will use a command that allows the R to read in arguments from command line automaticallyThis will be helpful when we are using an R script in an analysis pipelineThe code that reads arguments from the command line are:+An example of the count data file is provided [[https://docs.google.com/file/d/0B172nc4dAaaOMG44Zk1BT2NFdkU/edit?usp=sharing|here]]
  
 +First we will load the count data file.
 <code> <code>
-userargs commandArgs(TRUE) +counts read.table("NextGenRaw.txt", header=T, row.names=1)
-pathToCountsData userargs[1+
-pathToExpDesign = userargs[2] +
-pathToOutput = userargs[3]+
 </code> </code>
  
- +Then we will load the experimental design. An example is provided [[https://docs.google.com/file/d/0B172nc4dAaaOaE5fTVVhUHJKazg/edit?usp=sharing|here]]:
-The input for DESeq is a matrix/data.frame containing read counts. An example is provided [[https://docs.google.com/file/d/0B172nc4dAaaOMG44Zk1BT2NFdkU/edit?usp=sharing|here]] +
- +
-You have to first load the file into your workspace. +
- +
-If you are running it locally+
 <code> <code>
-counts = read.table("NextGenRaw.txt", header=T, row.names=1)+expdesign = read.table("expdesign.txt")
 </code> </code>
-If you are writing a script 
-counts = read.table(pathToCountsData, header=T, row.names=1) 
  
-#This is simply meta-data to store information about the samples. +The counts that were loaded as a data.frame are now used to create a new type of objectcount data set 
-#expdesign = data.frame( +<code>
-#  row.names=colnames(counts), +
-#  condition=c("untreated","untreated","treated","treated"), +
-#  libType=c("single-end","single-end","single-end","single-end"+
-#) +
-expdesign = read.table(pathToExpDesign) +
- +
-#The counts that were loaded as a data.frame are now used to create +
-#a new type of object-> count data set+
 cds = newCountDataSet(counts, expdesign$condition) cds = newCountDataSet(counts, expdesign$condition)
 +</code>
  
-#Now we can perform operations on the dataset and save the results in +Now we can perform operations on the dataset and save the results in the same object.
-#the same object.+
  
-#first lets estimate the size factor based on the number of aligned reads +First lets estimate the size factor based on the number of aligned reads from each sample. 
-#from each sample.+<code>
 cds = estimateSizeFactors(cds) cds = estimateSizeFactors(cds)
 +</code>
  
-#to see the size factors:+To see the size factors: 
 +<code>
 sizeFactors(cds) sizeFactors(cds)
 +</code>
  
-#To perform a normalization you can simply use this command. +To perform a normalization you can simply use this command. Note that the normalized values will not be used for identifying differentially expressed genes but we can use for some downstream analysis. 
-#Note that the normalized values will not be used for identifying  +<code>
-#differentially expressed genes+
 normalized=counts( cds, normalized=TRUE )  normalized=counts( cds, normalized=TRUE ) 
 +</code>
  
-#An important part of DESeq is to estimate dispersion. This is simply +An important part of DESeq is to estimate dispersion. This is simply a form of variance for the genes. 
-#a form of variance for the genes.+<code> 
 +# if you have replicates do the following:
 cds = estimateDispersions( cds ) cds = estimateDispersions( cds )
 +### HOWEVER, If you have NO replicates, then try this
 +cds = estimateDispersions( cds, method="blind" )
 +</code>
  
-#To visualize the disperson graph +To visualize the disperson graph 
-pdf("Dispersion.pdf")+<code> 
 +dispersionFile = "Dispersion.pdf" 
 +pdf(dispersionFile)
 plotDispEsts( cds ) plotDispEsts( cds )
 dev.off() dev.off()
 +</code>
  
 #To see the dispersion values which will be used for the final test #To see the dispersion values which will be used for the final test
 +<code>
 head( fData(cds) ) head( fData(cds) )
 +</code>
  
-#Finally to perform the negative binomial test on the dataset to identify +Finally to perform the negative binomial test on the dataset to identify differentially expressed genes. 
-#differentially expressed genes.+<code>
 res = nbinomTest( cds, "untreated", "treated" ) res = nbinomTest( cds, "untreated", "treated" )
 +</code>
  
-#An MA plot allows us to see the fold change vs level of expression. +An MA plot allows us to see the fold change vs level of expression. In the plot, the red points are for genes that have FDR of 10%.
-#In the plot, the red points are for genes that have FDR of 10%.+
  
-pdf("MAplot.pdf")+<code> 
 +maFile = "MAplot.pdf" 
 +pdf(maFile)
 plotMA(res) plotMA(res)
 dev.off() dev.off()
 +</code>
  
-#To get the genes that have FDR of 10% +#To get the genes that have FDR of 10% and save it in the output directory. 
-write.table(resSig,  +<code> 
-            pathToOutput+resSigind = res[ which(res$padj < 0.1 & res$log2FoldChange > 1), ] 
 +resSigrep = res[ which(res$padj < 0.1 & res$log2FoldChange < -1), ] 
 + 
 +indoutfile = "Deseq.indresults.txt"  
 +repoutfile = "Deseq.represults.txt"  
 + 
 +write.table(resSigind,  
 +            indoutfile
             sep="\t",              sep="\t", 
             col.names=T,              col.names=T, 
Line 103: Line 101:
             quote=F)             quote=F)
  
-#DESeq manual: http://www.bioconductor.org/packages/release/bioc/vignettes/DESeq/inst/doc/DESeq.pdf+write.table(resSigrep,  
 +            repoutfile,  
 +            sep="\t",  
 +            col.names=T,  
 +            row.names=F, 
 +            quote=F) 
 +</code> 
 + 
mkatari-bioinformatics-august-2013-deseq.txt · Last modified: 2015/08/21 14:13 by mkatari