Pathogen Profiling Pipeline

The Pathogen Profiling Pipeline project aims to develop a metagenomics procedure independent of laboratory cultivation and a flexible bioinformatics pipeline for the rapid identification and analysis of pathogens in samples containing complex mixtures of host and microbial nucleic acids. Sequence reads derived from next generation high throughput DNA sequencing (Roche GS FLX pyrosequencing) technology are passed through customizable metagenomic analysis pipelines, which subsequently filter and report on the best taxonomic hits.

Generated raw sequences may be filtered to exclude host and normal flora, thereby facilitating pathogen searching within the large and cumbersome pyrosequencing data sets while retaining specificity. Researchers can upload any available biological databases for their analysis, resulting in potentially limitless configurations for data analysis pipelines. For maximum throughput within time constraints (such as in emergency response situations), the application may run on a high performance parallel computing cluster to distribute the processor-intensive analyses.

The Pathogen Profiling Pipeline facilitates the analysis, reporting, and data management aspects of large-scale pathogen discovery projects aimed at quickly identifying candidate etiological agents in complex nucleic acid mixtures. This has strengthened outbreak preparedness by increasing the capacity for early recognition and containment of pathogens.

Developed by:

Tom Matthews and Gary Van Domselaar

National Microbiology Laboratory, Public Health Agency of Canada, 820 Elgin St., Winnipeg, MB, Canada R3E 3R2

t.c.matthews@gmail.com, gary.vandomselaar@gmail.com

Installation

After a fresh installation of Rocks 5.2 on the HPC cluster, download PPP: http://www.corefacility.ca/ppp

From the README in ppp.tar.gz:

SOFTWARE REQUIREMENTS

    Compute cluster:
        - BLAST
        - BioPerl -- 1.5 or newer
        - DRMAA compliant scheduler -- Sun Grid Engine suggested
    Web server:
        - Apache2
        - Mod-Perl
        - BioPerl -- 1.5 or newer
        - Graphviz
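
A quick way to see which of the command-line requirements are already on a node's PATH is sketched below. The function name missing_commands is hypothetical and not part of PPP. blastall and formatdb are the BLAST binaries that conf/local.conf points to later on this page, and dot is the Graphviz layout program. The check only tells you whether a command can be found, not whether its version is new enough.

```shell
#!/bin/sh
# Print the name of every command given as an argument that is NOT on PATH.
# missing_commands is a hypothetical helper, not part of PPP.
missing_commands() {
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "$cmd"
        fi
    done
}

# Example: check the tools this page relies on (prints nothing if all are found).
missing_commands blastall formatdb dot perl
```

Anything it prints still needs to be installed, or lives outside PATH (as the BLAST binaries under /opt/Bio/ncbi/bin do on this cluster).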

On the head node

The installation and configuration of the head node should have taken care of the Apache2, mod_perl, and BioPerl requirements. See the upgrading_rocks page if you haven't satisfied those yet.

We need Graphviz. Rocks installs a copy, but it is located in the Rocks-specific directories, so install another copy system-wide by following the instructions on the Graphviz website for CentOS/Red Hat Enterprise Linux: http://www.graphviz.org/Download_linux_rhel.php

  1. Download graphviz-rhel.repo and copy it to /etc/yum.repos.d/
  2. yum install 'graphviz*'

Configuring perl modules

PPP's web interface needs XML::Simple, which is in yum:

# yum install perl-XML-Simple

PPP's DRMAA scheduler

DRMAA is an API for job scheduling. Sun Grid Engine is DRMAA compliant, but it needs the help of a perl module. We will compile from source because we need to tell it where to look to find the C headers for SGE's drmaa support.

  1. Read the README :)
  2. Prepare the environment for compiling the perl module:
$ source /opt/gridengine/default/common/settings.sh
$ export LD_LIBRARY_PATH=$SGE_ROOT/lib/`$SGE_ROOT/util/arch`
$ ln -s $SGE_ROOT/include/drmaa.h

  3. Build and install the perl module:

perl Makefile.PL
make
make test
sudo make install
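
If the build fails, it may help to confirm that the DRMAA header and library are where the steps above expect them. This is a minimal sketch: check_drmaa_files is a hypothetical helper, and the library file name libdrmaa.so is an assumption that may differ on your Grid Engine build.

```shell
#!/bin/sh
# check_drmaa_files ROOT ARCH
# Succeeds only if the DRMAA header and shared library are found under ROOT.
# Hypothetical helper, not part of SGE or PPP.
check_drmaa_files() {
    root=$1
    arch=$2
    status=0
    if [ ! -f "$root/include/drmaa.h" ]; then
        echo "missing: $root/include/drmaa.h" >&2
        status=1
    fi
    # The file name libdrmaa.so is an assumption; adjust it if your
    # Grid Engine build names the library differently.
    if [ ! -f "$root/lib/$arch/libdrmaa.so" ]; then
        echo "missing: $root/lib/$arch/libdrmaa.so" >&2
        status=1
    fi
    return $status
}

# Usage on the head node, after sourcing settings.sh:
#   check_drmaa_files "$SGE_ROOT" "$("$SGE_ROOT/util/arch")"
```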

Install PPP

PPP's perl scripts need to be accessible to all nodes, so change directory to somewhere accessible to the nodes:

# cd /mnt/export3

Unzip PPP:

# tar -zxf ~alan/src/ppp.tar.gz

Rename it so it's less confusing:

# mv ppp PathogenPP

Now read the README and follow the installation instructions in INSTALL.PDF… In a nutshell:

# cd PathogenPP/ppp-backend
# mkdir db scratch data

Edit the config file (conf/local.conf) to reflect the locations of software in your installation, most importantly:

#blast_loc
/opt/Bio/ncbi/bin/blastall

#formatdb_loc
/opt/Bio/ncbi/bin/formatdb

#bp_index_loc
/usr/bin/bp_index.pl

#rootPath
/mnt/export3/PathogenPP/ppp-backend/
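
Wrong paths in this file are a common cause of failing jobs (see Troubleshooting), so a quick existence check can save some time. The sketch below assumes the layout shown above, where each "#key" line is followed by a line holding a path. check_conf_paths is a hypothetical helper, and it will give false alarms if your local.conf also holds values that are not paths.

```shell
#!/bin/sh
# check_conf_paths FILE
# Prints every path listed in FILE that does not exist on this machine.
# Assumes the layout shown above: a "#key" line followed by a path line.
check_conf_paths() {
    conf=$1
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            ''|'#'*) continue ;;   # skip blank lines and "#key" lines
        esac
        [ -e "$line" ] || echo "not found: $line"
    done <"$conf"
}

# Usage, from the ppp-backend directory:
#   check_conf_paths conf/local.conf
```

Run it on a compute node as well as the head node, since the jobs need the same paths there.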

Run bin/customjob.pl. If there are no errors, save its output to the conf/jobs.xml file:

# perl bin/customjob.pl > conf/jobs.xml

Download and extract taxonomy databases from NCBI:

$ cd taxon
$ wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
$ tar zxf taxdump.tar.gz
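
Before formatting, it is worth confirming that the extraction really left nodes.dmp and names.dmp in the taxon folder, since those are the two files the formatting script reads. A minimal sketch (have_taxdump is a hypothetical helper name):

```shell
#!/bin/sh
# have_taxdump DIR
# Succeeds if DIR contains non-empty copies of the two flat files
# that the formatting step reads.
have_taxdump() {
    [ -s "$1/nodes.dmp" ] && [ -s "$1/names.dmp" ]
}

# Usage, from the ppp-backend directory:
#   have_taxdump taxon || echo "taxdump not extracted into taxon/"
```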

From the taxon folder, format the taxonomy databases using the taxonformat.pl script from Tom Matthews (PPP developer):

#!/usr/bin/perl

use Bio::DB::Taxonomy;
use Bio::Taxonomy::Taxon;
use FindBin;
use strict;

my $taxondir = $FindBin::Bin;

if(!-d $taxondir)
{
	die "$taxondir is not a directory";
}

if(!-e "$taxondir/nodes")
{
	print "It doesn't look like $taxondir is formatted.  Formatting $taxondir\n";
}

my $db = Bio::DB::Taxonomy->new(-source => 'flatfile',
	-directory => $taxondir,
	-nodesfile => "$taxondir/nodes.dmp",
	-namesfile => "$taxondir/names.dmp");

if(-e "$taxondir/nodes" && -e "$taxondir/id2names" && -e "$taxondir/names2id" && -e "$taxondir/parents")
{
	print "Success!  Taxonomy directory $taxondir appears to be properly formatted.\n";
}
else
{
	print "There may be an error.  Check to ensure that $taxondir has the extracted taxonomy database and try again\n";
}

Copy the ppp-web folder to Apache's document root:

# cp -R ppp-web/ /var/www/html/

Create a link to the ppp-backend directory in ppp-web:

# ln -s /mnt/export3/PathogenPP/ppp-backend /var/www/html/ppp-web/ppp

Edit Apache's config file to load ppp as perl scripts. Create a new file /etc/httpd/conf.d/ppp-web.conf:

<IfModule mod_perl.c>
        <Directory "/var/www/html/ppp-web">
                AllowOverride None
                Order allow,deny
                allow from all
                AddHandler perl-script cgi-script .cgi .pl
                Options None
        </Directory>

        <Directory "/var/www/html/ppp-web/cgi-bin">
                AllowOverride None
                Options +ExecCGI -MultiViews +SymLinksIfOwnerMatch
                Order allow,deny
                Allow from all
                SetHandler perl-script
                PerlResponseHandler ModPerl::Registry
        </Directory>
</IfModule>

Restart Apache:

# apachectl graceful

Change the permissions on everything so that Apache's user can read/write:

# chown -R root:apache *
# chmod -R g+w *
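
To double-check the result, the following sketch lists anything that still lacks the group write bit. not_group_writable is a hypothetical helper; it only looks at the permission bits and does not verify that the group is actually apache.

```shell
#!/bin/sh
# not_group_writable DIR
# Lists files and directories under DIR that lack the group write bit.
not_group_writable() {
    find "$1" ! -perm -020
}

# Usage (prints nothing if everything is group-writable):
#   not_group_writable /mnt/export3/PathogenPP/ppp-backend
```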

Test! http://hpc.ilri.cgiar.org/ppp-web/

If you get errors, check the Apache error_log. :)

If it worked, go ahead and start the job manager:

# cd bin
# perl drmaamanager.pl 
> Job manager initilized...

Now PPP's web interface should indicate that there is a job server running (green circle!)

To start the job manager and send it to the background:

# cd /mnt/export3/PathogenPP/ppp-backend/bin/
# nohup perl drmaamanager.pl &

nohup tells the program to ignore hangup signals, such as the parent shell disconnecting. This will allow it to keep running in the background without a controlling terminal.
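
If you also want a log file and an easy way to tell later whether the job manager is still alive, one possible variation is to redirect its output and record its process ID. This is only a sketch: start_detached, is_running, and the log and PID file names are hypothetical and not something PPP provides.

```shell
#!/bin/sh
# start_detached LOG PIDFILE COMMAND [ARGS...]
# Runs COMMAND under nohup in the background, sends its output to LOG,
# and writes its process ID to PIDFILE.
start_detached() {
    log_file=$1
    pid_file=$2
    shift 2
    nohup "$@" >"$log_file" 2>&1 &
    echo $! >"$pid_file"
}

# is_running PIDFILE
# Succeeds if the process recorded in PIDFILE still exists.
is_running() {
    [ -f "$1" ] && kill -0 "$(cat "$1")" 2>/dev/null
}

# Usage, from ppp-backend/bin (file names are examples only):
#   start_detached manager.log manager.pid perl drmaamanager.pl
#   is_running manager.pid && echo "job manager is up"
```

Note that a PID can be reused by another process after a reboot, so treat a stale PID file with some suspicion.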


Managing the pipeline

Manually adding databases

  • Add the fasta files to your 'db' directory.
  • Format the database with the formatdb utility included with BLAST. The command "formatdb --help" will provide you with the appropriate arguments, but here's an example:

formatdb -i viral.fna -p F

  • If you would like a BioPerl index, you can also make it manually. Running "bp_index.pl" with no arguments will provide you with a perldoc page for the script, but again here's an example:

bp_index.pl -dir <FULL PATH TO DB DIR> viral.fna.idx viral.fna
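
To confirm that formatdb did its job, you can look for the index files it writes next to the fasta file. For a nucleotide database (-p F) these normally end in .nhr, .nin, and .nsq. The sketch below checks for those three; nucleotide_db_formatted is a hypothetical helper, and very large databases may be split into numbered volumes, which this simple check does not cover.

```shell
#!/bin/sh
# nucleotide_db_formatted FASTA
# Succeeds if the three index files formatdb writes for a nucleotide
# database (-p F) sit next to FASTA.
nucleotide_db_formatted() {
    for ext in nhr nin nsq; do
        [ -f "$1.$ext" ] || return 1
    done
}

# Usage, from the db directory:
#   nucleotide_db_formatted viral.fna || echo "viral.fna is not formatted yet"
```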

Adding Input Files

From the Administration page, click the Upload Files button. From here adding input files is very similar to adding databases.

Again, note that you can manually add the files to your data directory from the command line if you wish. Since they don't need to be formatted, they will be ready to use as soon as they are placed in the appropriate directory.

Troubleshooting

CHECK THIS FIRST - If a problem occurred with the entry point script it may have locked the job cache. Ensure "ppp.pl" is not running on the web server or head cluster node, then remove the "ppp-backend/cache/jobcache.lock" file if it exists. This may resolve all kinds of problems.
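
One rough hint that a lock is left over from a crashed run is its age. The sketch below reports whether the lock file is older than a given number of minutes. lock_older_than is a hypothetical helper, the 30-minute threshold in the usage line is an arbitrary assumption, and it relies on the -mmin option of GNU or BSD find. It does not replace checking that "ppp.pl" is not running before you delete anything.

```shell
#!/bin/sh
# lock_older_than LOCKFILE MINUTES
# Succeeds if LOCKFILE exists and was last modified more than MINUTES ago.
lock_older_than() {
    [ -e "$1" ] || return 1
    [ -n "$(find "$1" -mmin +"$2")" ]
}

# Usage, from the directory that contains ppp-backend:
#   lock_older_than ppp-backend/cache/jobcache.lock 30 && echo "lock looks stale"
```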

Job manager error message: Could not contact DRM system - Your scheduler is not started. If using SGE, you need to start "sge_execd" on all execution hosts and "sge_qmaster" and "sge_schedd" on your submit host (head node).

Web front not displaying or trying to download pages - The apache2 configuration isn't properly set up. Check that the "ppp-web" apache2 configuration file is in apache2's "sites-available" folder and linked in "sites-enabled". (On the Rocks/CentOS setup described above, the equivalent file is /etc/httpd/conf.d/ppp-web.conf.) Also ensure ModPerl (libapache2-mod-perl2) is installed. Finally, restart apache2 (apache2ctl restart, or apachectl graceful on the setup above).

Submitted jobs are not picked up by job manager - Check first that the job cache files are being created in "ppp-backend/cache". They will have the form "##.exec". If the files don't exist, it is probably a permissions problem. Ensure the apache2 web user has read/write access to the "ppp-backend/bin" and "ppp-backend/cache" folders.
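
A quick way to see whether job files are arriving is to count the ".exec" files in the cache folder right after submitting a job from the web interface. A minimal sketch (count_exec_files is a hypothetical helper):

```shell
#!/bin/sh
# count_exec_files CACHE_DIR
# Prints how many "*.exec" job files are directly inside CACHE_DIR.
count_exec_files() {
    count=0
    for f in "$1"/*.exec; do
        # When nothing matches, the pattern stays literal and -e fails.
        [ -e "$f" ] && count=$((count + 1))
    done
    echo "$count"
}

# Usage, from the directory that contains ppp-backend:
#   count_exec_files ppp-backend/cache
```

If the count stays at 0 after a submission, look at the permissions on the cache folder first.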

Filtering jobs not producing results or immediately failing - Your paths may be set up wrong in the local configuration file. Look at "ppp-backend/conf/local.conf" and ensure all paths are set up correctly. Also, Sun Grid Engine may be failing. Check your gridengine/default/qmaster/messages file for a diagnosis of failing jobs.

Jobs appear to be running but producing no results - Again probably a permissions problem. Your web server is writing the job information, but the user running the execution jobs may not have read/write permissions to the scratch folders.

ppp.txt · Last modified: 2010/09/09 18:58 by evilliers