User Tools

Site Tools


mkatari-bioinformatics-august-2013-deseq

This is an old revision of the document!


Back to Manny's Bioinformatics Workshop HOME

Here we will discuss how to create an R script (DESeq.R) that can be executed on HPC. Majority of the script is the same as if you were running it interactively except paths to the files are replaced with variables.

If you are going to run DESeq in R on your desktop you will have to make sure DESeq is already installed. In order to install DESeq type the following:

source("http://bioconductor.org/biocLite.R")
biocLite("DESeq")

However to make the script easy to run for anyone on the server, we will tell the R script where exactly to look for DESeq. R uses a variable (.libPaths) to store locations where it should look for packages. We will simply add the path to this variable. This way the person running the script does not need to have DESeq installed in their local R libraries. The other option is to tell the system administrator to add the packages. This is done in the following lines of the code

mannyLibPaths = '/home/mkatari/R/x86_64-unknown-linux-gnu-library/3.0'
.libPaths(new='/home/mkatari/R/x86_64-unknown-linux-gnu-library/3.0')

Regardless of whether you are running the code locally or on the server the script must load the library

library(DESeq)

Now for our script we will use a function (commandArgs) that allows the R to read in arguments from command line automatically. We will use a command Rscript to run one of our R scripts (DESeq.R). This will be helpful when we are using an R script in an analysis pipeline. The code that reads arguments from the command line are:

userargs = commandArgs(TRUE)
pathToCountsData = userargs[1]
pathToExpDesign = userargs[2]
pathToOutput = userargs[3]

All the arguments provided in command line will be saved as a character vector in userargs. The value TRUE in the commandArgs argument make sure only the trailing arguments are saved (which is what we will be providing. If the value is FALSE you will see additional R arguments when the command Rscript is executed.

The input for DESeq is a matrix/data.frame containing read counts. An example is provided here

You have to first load the file into your workspace.

If you are running it locally

counts = read.table("NextGenRaw.txt", header=T, row.names=1)

If you are writing a script

counts = read.table(pathToCountsData, header=T, row.names=1)

#This is simply meta-data to store information about the samples. #expdesign = data.frame( # row.names=colnames(counts), # condition=c("untreated","untreated","treated","treated"), # libType=c("single-end","single-end","single-end","single-end") #) expdesign = read.table(pathToExpDesign)

#The counts that were loaded as a data.frame are now used to create #a new type of object→ count data set cds = newCountDataSet(counts, expdesign$condition)

#Now we can perform operations on the dataset and save the results in #the same object.

#first lets estimate the size factor based on the number of aligned reads #from each sample. cds = estimateSizeFactors(cds)

#to see the size factors: sizeFactors(cds)

#To perform a normalization you can simply use this command. #Note that the normalized values will not be used for identifying #differentially expressed genes normalized=counts( cds, normalized=TRUE )

#An important part of DESeq is to estimate dispersion. This is simply #a form of variance for the genes. cds = estimateDispersions( cds )

#To visualize the disperson graph pdf("Dispersion.pdf") plotDispEsts( cds ) dev.off()

#To see the dispersion values which will be used for the final test head( fData(cds) )

#Finally to perform the negative binomial test on the dataset to identify #differentially expressed genes. res = nbinomTest( cds, "untreated", "treated" )

#An MA plot allows us to see the fold change vs level of expression. #In the plot, the red points are for genes that have FDR of 10%.

pdf("MAplot.pdf") plotMA(res) dev.off()

#To get the genes that have FDR of 10% write.table(resSig,

          pathToOutput, 
          sep="\t", 
          col.names=T, 
          row.names=F,
          quote=F)

#DESeq manual: http://www.bioconductor.org/packages/release/bioc/vignettes/DESeq/inst/doc/DESeq.pdf

mkatari-bioinformatics-august-2013-deseq.1377271438.txt.gz · Last modified: 2013/08/23 15:23 by mkatari