[[mkatari-bioinformatics-august-2013|Back to Manny's Bioinformatics Workshop HOME]]
Here we will discuss how to create an R script (DESeq.R) that can be executed on HPC. This script has been adapted from the DESeq manual [[ http://www.bioconductor.org/packages/release/bioc/vignettes/DESeq/inst/doc/DESeq.pdf|DESeq manual]] except paths to the files are replaced with variables and the plots are being saved as pdf documents in an output directory.
If you are going to run DESeq in R on your desktop you will have to make sure DESeq is already installed.
In order to install DESeq type the following:
source("http://bioconductor.org/biocLite.R")
biocLite("DESeq")
Regardless of whether you are running the code locally or on the server the script must load the library
library(DESeq)
An example of the count data file is provided [[https://docs.google.com/file/d/0B172nc4dAaaOMG44Zk1BT2NFdkU/edit?usp=sharing|here]]
First we will load the count data file.
counts = read.table("NextGenRaw.txt", header=T, row.names=1)
Then we will load the experimental design. An example is provided [[https://docs.google.com/file/d/0B172nc4dAaaOaE5fTVVhUHJKazg/edit?usp=sharing|here]]:
expdesign = read.table("expdesign.txt")
The counts that were loaded as a data.frame are now used to create a new type of object: count data set
cds = newCountDataSet(counts, expdesign$condition)
Now we can perform operations on the dataset and save the results in the same object.
First lets estimate the size factor based on the number of aligned reads from each sample.
cds = estimateSizeFactors(cds)
To see the size factors:
sizeFactors(cds)
To perform a normalization you can simply use this command. Note that the normalized values will not be used for identifying differentially expressed genes but we can use for some downstream analysis.
normalized=counts( cds, normalized=TRUE )
An important part of DESeq is to estimate dispersion. This is simply a form of variance for the genes.
# if you have replicates do the following:
cds = estimateDispersions( cds )
### HOWEVER, If you have NO replicates, then try this
cds = estimateDispersions( cds, method="blind" , sharingMode = "fit-only" )
To visualize the disperson graph
dispersionFile = "Dispersion.pdf"
pdf(dispersionFile)
plotDispEsts( cds )
dev.off()
#To see the dispersion values which will be used for the final test
head( fData(cds) )
Finally to perform the negative binomial test on the dataset to identify differentially expressed genes.
res = nbinomTest( cds, "untreated", "treated" )
An MA plot allows us to see the fold change vs level of expression. In the plot, the red points are for genes that have FDR of 10%.
maFile = "MAplot.pdf"
pdf(maFile)
plotMA(res)
dev.off()
#To get the genes that have FDR of 10% and save it in the output directory.
resSigind = res[ which(res$padj < 0.1 & res$log2FoldChange > 1), ]
resSigrep = res[ which(res$padj < 0.1 & res$log2FoldChange < -1), ]
indoutfile = "Deseq.indresults.txt"
repoutfile = "Deseq.represults.txt"
write.table(resSigind,
indoutfile,
sep="\t",
col.names=T,
row.names=F,
quote=F)
write.table(resSigrep,
repoutfile,
sep="\t",
col.names=T,
row.names=F,
quote=F)