--- mkatari-bioinformatics-august-2013-deseq [2013/08/23 15:23] – mkatari
+++ mkatari-bioinformatics-august-2013-deseq [2013/08/23 16:20] – mkatari
@@ Line 1: / Line 1: @@
 [[mkatari-bioinformatics-august-2013|Back to Manny's Bioinformatics Workshop HOME]]
-Here we will discuss how to create an R script (DESeq.R) that can be executed on HPC. Majority of the script is the same as if you were running it interactively except paths to the files are replaced with variables.
+Here we will discuss how to create an R script (DESeq.R) that can be executed on HPC. This script has been adapted from the [[http://www.bioconductor.org/packages/release/bioc/vignettes/DESeq/inst/doc/DESeq.pdf|DESeq manual]], except that paths to the files are replaced with variables and the plots are saved as PDF documents in an output directory.
 If you are going to run DESeq in R on your desktop you will have to make sure DESeq is already installed.
@@ Line 11: / Line 11: @@
 </code>
-However to make the script easy to run for anyone on the server, we will tell the R script where exactly to look for DESeq. R uses a variable (.libPaths) to store locations where it should look for packages. We will simply add the path to this variable. This way the person running the script does not need to have DESeq installed in their local R libraries. The other option is to tell the system administrator to add the packages. This is done in the following lines of the code
+Once we have installed the DESeq package in our local R library, we need to tell our script to look in that directory, so that anyone who runs the script will find DESeq. R uses a function (.libPaths) to get and set the locations where it looks for packages. By default R looks in the global R library and the user's local library (if one exists). We will simply add our R library path to these locations. (The other option is to ask the system administrator to add the packages to the global library path.) Adding the path is done in the following lines of code:
 <code>
@@ Line 24: / Line 24: @@
 </code>
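As a rough sketch of the .libPaths step described above, the path can be prepended to the search locations as follows. The directory name `sharedLib` is a placeholder, not the path used on the workshop server, and the existence check is an added precaution rather than part of the original script.

```r
# Hypothetical directory holding the DESeq installation; replace it with
# the real library path on your server.
sharedLib <- "/path/to/shared/R/library"

if (dir.exists(sharedLib)) {
  # Put the shared library first and keep the default locations after it.
  .libPaths(c(sharedLib, .libPaths()))
} else {
  message("Library directory not found, using defaults: ", sharedLib)
}

# Show where R will now look for packages.
print(.libPaths())

# Load DESeq only if it can be found in one of those locations.
if (requireNamespace("DESeq", quietly = TRUE)) {
  library(DESeq)
}
```

The check matters because .libPaths drops directories that do not exist without raising an error, so a mistyped path would otherwise go unnoticed until library(DESeq) fails.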
-Now for our script we will use a function (commandArgs) that allows the R to read in arguments from command line automatically. We will use a command Rscript to run one of our R scripts (DESeq.R). This will be helpful when we are using an R script in an analysis pipeline. The code that reads arguments from the command line are:
+Now in our script we will use a function (commandArgs) that allows us to read in arguments from the command line automatically. To run our script, the user simply calls Rscript followed by the script name (DESeq.R) and the arguments. The code reads in all the words that follow the script name, one word at a time, and saves them as a character vector:
 <code>
@@ Line 33: / Line 33: @@
 </code>
-All the arguments provided in command line will be saved as a character vector in userargs. The value TRUE in the commandArgs argument make sure only the trailing arguments are saved (which is what we will be providing. If the value is FALSE you will see additional R arguments when the command Rscript is executed.
+Here we are saving all the words as a character vector called userargs. The value TRUE passed to commandArgs (its trailingOnly argument) makes sure only the trailing arguments are saved. If the value is FALSE, userargs will also contain the arguments R itself receives when Rscript is executed. Note that the order of the arguments is important: first the path to the count data file, then the path to the file containing the experimental design, and finally the path to the directory in which to save the results (the directory path must end with a trailing /).
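The argument handling described above might look roughly like the sketch below. The helper `parseArgs`, the usage message, and the file name expdesign.txt are illustrative assumptions. The variable names pathToCountsData, pathToExpDesign and pathToOutputDir are the ones used later in the script. Appending a missing trailing / is an added convenience, not something the original script is known to do.

```r
# Hypothetical helper: turn the trailing command-line words into named paths.
parseArgs <- function(userargs) {
  if (length(userargs) != 3) {
    stop("Usage: Rscript DESeq.R <counts file> <design file> <output dir/>")
  }
  outputDir <- userargs[3]
  # The script builds file names with paste(..., sep=""), so make sure
  # the directory ends with a slash.
  if (!grepl("/$", outputDir)) {
    outputDir <- paste0(outputDir, "/")
  }
  list(
    pathToCountsData = userargs[1],
    pathToExpDesign  = userargs[2],
    pathToOutputDir  = outputDir
  )
}

# In DESeq.R the input would be commandArgs(trailingOnly = TRUE);
# a fixed vector is used here so the example runs on its own.
paths <- parseArgs(c("NextGenRaw.txt", "expdesign.txt", "results"))
print(paths)
```

With this in place the script would be started as `Rscript DESeq.R NextGenRaw.txt expdesign.txt results/`.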
-The input for DESeq is a matrix/data.frame containing read counts. An example is provided [[https://docs.google.com/file/d/0B172nc4dAaaOMG44Zk1BT2NFdkU/edit?usp=sharing|here]]
+An example of the count data file is provided [[https://docs.google.com/file/d/0B172nc4dAaaOMG44Zk1BT2NFdkU/edit?usp=sharing|here]]
-You have to first load the file into your workspace.
+First we will load the count data file.
-If you are running it locally
-<code>
-counts = read.table("NextGenRaw.txt", header=T, row.names=1)
-</code>
-If you are writing a script
 <code>
 counts = read.table(pathToCountsData, header=T, row.names=1)
 </code>
-#This is simply meta-data to store information about the samples.
+Then we will load the experimental design. An example is provided [[https://docs.google.com/file/d/0B172nc4dAaaOaE5fTVVhUHJKazg/edit?usp=sharing|here]]:
-#expdesign = data.frame(
+<code>
-#  row.names=colnames(counts),
-#  condition=c("untreated","untreated","treated","treated"),
-#  libType=c("single-end","single-end","single-end","single-end")
-#)
 expdesign = read.table(pathToExpDesign)
+</code>
-#The counts that were loaded as a data.frame are now used to create
+The counts that were loaded as a data.frame are now used to create a new type of object, a count data set:
-#a new type of object-> count data set
+<code>
 cds = newCountDataSet(counts, expdesign$condition)
+</code>
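Before building the count data set it can help to confirm that the design table lines up with the count table. The following is a minimal sketch on made-up data. The toy counts and gene names are invented, while the design mirrors the in-line expdesign definition from the earlier revision of this page.

```r
# Toy stand-in for the table read from pathToCountsData (genes x samples).
counts <- data.frame(
  untreated1 = c(10, 0, 35),
  untreated2 = c(12, 1, 40),
  treated1   = c(30, 0, 33),
  treated2   = c(28, 2, 38),
  row.names  = c("geneA", "geneB", "geneC")
)

# Toy stand-in for the table read from pathToExpDesign (one row per sample).
expdesign <- data.frame(
  row.names = colnames(counts),
  condition = c("untreated", "untreated", "treated", "treated")
)

# Every column of counts needs a matching row, in the same order.
designMatchesCounts <- identical(rownames(expdesign), colnames(counts))

# Both condition labels used later in nbinomTest must be present.
conditionsPresent <- all(c("untreated", "treated") %in% expdesign$condition)

if (!designMatchesCounts || !conditionsPresent) {
  stop("Experimental design does not match the count table")
}
```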
-#Now we can perform operations on the dataset and save the results in
+Now we can perform operations on the dataset and save the results in the same object.
-#the same object.
-#first lets estimate the size factor based on the number of aligned reads
+First let's estimate the size factor based on the number of aligned reads from each sample.
-#from each sample.
+<code>
 cds = estimateSizeFactors(cds)
+</code>
-#to see the size factors:
+To see the size factors:
+<code>
 sizeFactors(cds)
+</code>
-#To perform a normalization you can simply use this command.
+To perform a normalization you can simply use this command. Note that the normalized values will not be used for identifying differentially expressed genes, but we can use them for some downstream analysis.
-#Note that the normalized values will not be used for identifying
+<code>
-#differentially expressed genes
 normalized=counts( cds, normalized=TRUE )
+</code>
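To make the size-factor and normalization steps above more concrete, here is a base-R sketch of the median-of-ratios idea that estimateSizeFactors is based on, applied to a made-up count matrix. It illustrates the calculation only and is not a replacement for the DESeq functions. The names `toyCounts`, `sizeFactorsByHand` and `normalizedByHand` are invented for this example.

```r
# Made-up counts: sample B was sequenced exactly twice as deeply as sample A.
toyCounts <- matrix(
  c(10, 20,
    100, 200,
    30, 60,
    50, 100),
  ncol = 2, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), c("A", "B"))
)

# Log of the geometric mean of each gene across samples.
logGeoMeans <- rowMeans(log(toyCounts))

# For each sample, take the median ratio of its counts to the geometric means,
# skipping genes whose geometric mean is zero (log is -Inf).
sizeFactorsByHand <- apply(toyCounts, 2, function(sampleCounts) {
  keep <- is.finite(logGeoMeans)
  exp(median((log(sampleCounts) - logGeoMeans)[keep]))
})
print(sizeFactorsByHand)

# Normalized counts: divide every column by its size factor.
normalizedByHand <- t(t(toyCounts) / sizeFactorsByHand)
print(normalizedByHand)
```

Because sample B is just a doubled copy of sample A, the two columns of the normalized matrix come out identical, which is what the normalization is meant to achieve.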
-#An important part of DESeq is to estimate dispersion. This is simply
+An important part of DESeq is to estimate dispersion. This is simply a form of variance for the genes.
-#a form of variance for the genes.
+<code>
 cds = estimateDispersions( cds )
+</code>
-#To visualize the disperson graph
+To visualize the dispersion graph:
-pdf("Dispersion.pdf")
+<code>
+dispersionFile = paste(pathToOutputDir, "Dispersion.pdf", sep="")
+pdf(dispersionFile)
 plotDispEsts( cds )
 dev.off()
+</code>
 #To see the dispersion values which will be used for the final test
+<code>
 head( fData(cds) )
+</code>
-#Finally to perform the negative binomial test on the dataset to identify
+Finally, we perform the negative binomial test on the dataset to identify differentially expressed genes.
-#differentially expressed genes.
+<code>
 res = nbinomTest( cds, "untreated", "treated" )
+</code>
-#An MA plot allows us to see the fold change vs level of expression.
+An MA plot allows us to see the fold change vs. the level of expression. In the plot, the red points are genes that are significant at an FDR of 10%.
-#In the plot, the red points are for genes that have FDR of 10%.
+<code>
 pdf("MAplot.pdf")
 plotMA(res)
 dev.off()
+</code>
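The MA plot above is written to the current working directory, whereas the dispersion plot goes to the output directory. One possible way to send every plot to the output directory, sketched here with a hypothetical helper `outputFile` and a temporary directory standing in for pathToOutputDir, is to build the file name with file.path, which also removes the need for the trailing /.

```r
# Hypothetical helper: join an output directory and a file name,
# whether or not the directory was given with a trailing slash.
outputFile <- function(outputDir, fileName) {
  file.path(sub("/+$", "", outputDir), fileName)
}

# A temporary directory stands in for the user-supplied pathToOutputDir.
pathToOutputDir <- tempdir()
maFile <- outputFile(pathToOutputDir, "MAplot.pdf")

# A placeholder plot is drawn here; in DESeq.R this would be plotMA(res).
pdf(maFile)
plot(1:10, main = "placeholder for plotMA(res)")
dev.off()

print(maFile)
```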
 #To get the genes that have FDR of 10%
@@ Line 106: / Line 108: @@
             quote=F)
-#DESeq manual: http://www.bioconductor.org/packages/release/bioc/vignettes/DESeq/inst/doc/DESeq.pdf
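The step that selects genes at an FDR of 10% is not shown in full in this diff. As an illustration only, the sketch below filters a made-up result table, assuming it has pval and padj columns like the data frame returned by nbinomTest. The toy values, the object `resSig` and the output file name are invented for the example.

```r
# Made-up stand-in for the data frame returned by nbinomTest.
res <- data.frame(
  id = paste0("gene", 1:5),
  log2FoldChange = c(2.1, -0.2, 1.5, 0.1, -3.0),
  pval = c(0.001, 0.8, 0.03, 0.5, 0.0002),
  stringsAsFactors = FALSE
)

# Benjamini-Hochberg adjustment gives the FDR-controlled p-values.
res$padj <- p.adjust(res$pval, method = "BH")

# Keep genes with an adjusted p-value below 0.1, dropping missing values.
resSig <- res[!is.na(res$padj) & res$padj < 0.1, ]

# Most significant genes first.
resSig <- resSig[order(resSig$pval), ]
print(resSig)

# Write the table as tab-separated text without quotes.
sigFile <- file.path(tempdir(), "significant_genes.txt")
write.table(resSig, file = sigFile, sep = "\t", quote = FALSE, row.names = FALSE)
```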