Differences

This shows you the differences between two versions of the page.

--- mkatari-bioinformatics-august-2013-deseq [2013/08/23 14:41] – mkatari
+++ mkatari-bioinformatics-august-2013-deseq [2015/08/21 14:13] (current) – mkatari
@@ Line 1: / Line 1: @@
 [[mkatari-bioinformatics-august-2013|Back to Manny's Bioinformatics Workshop HOME]]
-Here we will discuss how to create an R script that can be executed on HPC. Majority of the script is the same except for the first few commands that read the arguments from command line.
+Here we will discuss how to create an R script (DESeq.R) that can be executed on HPC. This script has been adapted from the DESeq manual [[ http://www.bioconductor.org/packages/release/bioc/vignettes/DESeq/inst/doc/DESeq.pdf|DESeq manual]]  except paths to the files are replaced with variables and the plots are being saved as pdf documents in an output directory.
 If you are going to run DESeq in R on your desktop you will have to make sure DESeq is already installed.
- In order to install DESeq type the following:
+In order to install DESeq type the following:
 <code>
 source("http://bioconductor.org/biocLite.R")
 biocLite("DESeq")
-</code>
-However for the script that is available on HPC the script will automatically find DESeq. R uses variables to store locations where it should look for packages. Here we can simply add a path to where a specific module is located. This will prevent the need for others to have to install the module themselves. This is done int he following lines of the code
-<code>
-mannyLibPaths = '/home/mkatari/R/x86_64-unknown-linux-gnu-library/3.0'
-.libPaths(new='/home/mkatari/R/x86_64-unknown-linux-gnu-library/3.0')
 </code>
@@ Line 24: / Line 17: @@
 </code>
-Now for our script we will use a command that allows the R to read in arguments from command line automatically. This will be helpful when we are using an R script in an analysis pipeline. The code that reads arguments from the command line are:
+An example of the count data file is provided [[https://docs.google.com/file/d/0B172nc4dAaaOMG44Zk1BT2NFdkU/edit?usp=sharing|here]]
+First we will load the count data file.
 <code>
-userargs = commandArgs(TRUE)
+counts = read.table("NextGenRaw.txt", header=T, row.names=1)
-pathToCountsData = userargs[1]
-pathToExpDesign = userargs[2]
-pathToOutput = userargs[3]
 </code>
+Then we will load the experimental design. An example is provided [[https://docs.google.com/file/d/0B172nc4dAaaOaE5fTVVhUHJKazg/edit?usp=sharing|here]]:
-The input for DESeq is a matrix/data.frame containing read counts. An example is provided [[https://docs.google.com/file/d/0B172nc4dAaaOMG44Zk1BT2NFdkU/edit?usp=sharing|here]]
-You have to first load the file into your workspace.
-If you are running it locally
 <code>
-counts = read.table("NextGenRaw.txt", header=T, row.names=1)
+expdesign = read.table("expdesign.txt")
 </code>
-If you are writing a script
-counts = read.table(pathToCountsData, header=T, row.names=1)
-#This is simply meta-data to store information about the samples.
+The counts that were loaded as a data.frame are now used to create a new type of object: count data set
-#expdesign = data.frame(
+<code>
-#  row.names=colnames(counts),
-#  condition=c("untreated","untreated","treated","treated"),
-#  libType=c("single-end","single-end","single-end","single-end")
-#)
-expdesign = read.table(pathToExpDesign)
-#The counts that were loaded as a data.frame are now used to create
-#a new type of object-> count data set
 cds = newCountDataSet(counts, expdesign$condition)
+</code>
-#Now we can perform operations on the dataset and save the results in
+Now we can perform operations on the dataset and save the results in the same object.
-#the same object.
-#first lets estimate the size factor based on the number of aligned reads
+First lets estimate the size factor based on the number of aligned reads from each sample.
-#from each sample.
+<code>
 cds = estimateSizeFactors(cds)
+</code>
-#to see the size factors:
+To see the size factors:
+<code>
 sizeFactors(cds)
+</code>
-#To perform a normalization you can simply use this command.
+To perform a normalization you can simply use this command. Note that the normalized values will not be used for identifying differentially expressed genes but we can use for some downstream analysis.
-#Note that the normalized values will not be used for identifying
+<code>
-#differentially expressed genes
 normalized=counts( cds, normalized=TRUE )
+</code>
-#An important part of DESeq is to estimate dispersion. This is simply
+An important part of DESeq is to estimate dispersion. This is simply a form of variance for the genes.
-#a form of variance for the genes.
+<code>
+# if you have replicates do the following:
 cds = estimateDispersions( cds )
+### HOWEVER, If you have NO replicates, then try this
+cds = estimateDispersions( cds, method="blind" , sharingMode = "fit-only" )
+</code>
-#To visualize the disperson graph
+To visualize the disperson graph
-pdf("Dispersion.pdf")
+<code>
+dispersionFile = "Dispersion.pdf"
+pdf(dispersionFile)
 plotDispEsts( cds )
 dev.off()
+</code>
 #To see the dispersion values which will be used for the final test
+<code>
 head( fData(cds) )
+</code>
-#Finally to perform the negative binomial test on the dataset to identify
+Finally to perform the negative binomial test on the dataset to identify differentially expressed genes.
-#differentially expressed genes.
+<code>
 res = nbinomTest( cds, "untreated", "treated" )
+</code>
-#An MA plot allows us to see the fold change vs level of expression.
+An MA plot allows us to see the fold change vs level of expression. In the plot, the red points are for genes that have FDR of 10%.
-#In the plot, the red points are for genes that have FDR of 10%.
-pdf("MAplot.pdf")
+<code>
+maFile = "MAplot.pdf"
+pdf(maFile)
 plotMA(res)
 dev.off()
+</code>
-#To get the genes that have FDR of 10%
+#To get the genes that have FDR of 10% and save it in the output directory.
-write.table(resSig,
+<code>
-            pathToOutput,
+resSigind = res[ which(res$padj < 0.1 & res$log2FoldChange > 1), ]
+resSigrep = res[ which(res$padj < 0.1 & res$log2FoldChange < -1), ]
+indoutfile = "Deseq.indresults.txt"
+repoutfile = "Deseq.represults.txt"
+write.table(resSigind,
+            indoutfile,
             sep="\t",
             col.names=T,
@@ Line 103: / Line 101: @@
             quote=F)
-#DESeq manual: http://www.bioconductor.org/packages/release/bioc/vignettes/DESeq/inst/doc/DESeq.pdf
+write.table(resSigrep,
+            repoutfile,
+            sep="\t",
+            col.names=T,
+            row.names=F,
+            quote=F)
+</code>