seqclean:seqclean

Differences

This shows you the differences between two versions of the page.

seqclean:seqclean [2010/06/23 15:41] – created 172.26.15.75
seqclean:seqclean [2010/06/23 16:07] 172.26.15.75
Line 13: Line 13:
     during the sequencing process (vectors, adapters)     during the sequencing process (vectors, adapters)
   * strong matches with other contaminants or unwanted sequences   * strong matches with other contaminants or unwanted sequences
-    (mitochondrial, ribosomal, bacterial, other species than the  +    (mitochondrial, ribosomal, bacterial, other species than the target organism etc.)
-    target organism etc.)+
          
 The user is expected to provide the contaminant databases,  The user is expected to provide the contaminant databases, 
 they are not included in this package they are not included in this package
 +
 +====Usage and methods====
 +----
 +A short usage message is displayed when the seqclean script is launched without
 +any parameters.
 +
 +The seqclean script takes an input sequence file (fasta formatted) as the only 
 +required parameter:
 +
 +seqclean your_est_file
 +
 +seqclean creates two output files of interest: 
 + 1. the filtered FASTA file (your_est_file.clean for the example above) 
 +    containing only valid (non-trashed) and trimmed ("clear range") sequences
 + 2. a "cleaning report" (your_est_file.cln) providing details about 
 +   sequence trimming and trashing (coordinates, reasons for trashing, 
 +   contaminant names etc. - see below for a detailed description).
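
The naming of the two output files can be sketched as follows. This is a minimal illustration, assuming the suffixes are simply appended to the input file name as in the example above; the helper name `seqclean_output_names` is hypothetical and not part of seqclean.

```python
def seqclean_output_names(input_fasta: str) -> dict:
    """Names of the two output files of interest for a given input file name.

    Assumption: ".clean" and ".cln" are appended to the input file name
    exactly as given on the command line.
    """
    return {
        "clean": input_fasta + ".clean",  # filtered, trimmed FASTA
        "report": input_fasta + ".cln",   # cleaning report
    }


names = seqclean_output_names("your_est_file")
print(names["clean"], names["report"])
```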
 +
 +However, the simple usage example above does not search any contaminant 
 +databases (none are specified). It performs only the basic analysis: 
 +removing the polyA/polyT tail, possibly clipping low-quality ends (ends 
 +rich in undetermined bases), and trashing sequences that are too short 
 +(shorter than 100 bases, or the -l parameter value) or that appear to be 
 +mostly low-complexity sequence.
 +
 +As suggested in the "Introduction", the contaminant databases provided 
 +by the user can be considered to be of two types:
 +  1. vector/adapter databases, which can determine the trimming 
 +    of the analyzed sequences even when only very short terminal matches 
 +    (down to 12 base pairs) are found. These database files should 
 +    be provided with the -v option (vector detection)
 +  2. extensive contaminant databases: the alignments between these 
 +    contaminants and the analyzed sequences are only considered if 
 +    they are longer than 60 base pairs with at least 94% identity; these 
 +    are provided with the -s option (screening for contamination)
 +
 +In both cases the analyzed sequences are searched against the provided 
 +files and the overlaps are analyzed. The contaminant databases should 
 +all be formatted as required for blastall (using NCBI's formatdb program).
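
The acceptance rule for the second database type (the -s option) can be written as a small predicate. This is only a sketch of the two thresholds stated in the list above (longer than 60 base pairs, at least 94% identity); the function name and parameters are hypothetical, and seqclean's actual implementation may apply further checks.

```python
def screening_hit_counts(aln_len: int, identity: float,
                         min_len: int = 60, min_identity: float = 94.0) -> bool:
    """Return True if a contaminant alignment passes the stated -s thresholds.

    aln_len  -- alignment length in base pairs
    identity -- percent identity of the alignment (0-100)
    """
    # "longer than 60 base pairs" is read as strictly greater than 60;
    # "at least 94% identity" is read as greater than or equal to 94.
    return aln_len > min_len and identity >= min_identity


print(screening_hit_counts(150, 97.5))
print(screening_hit_counts(60, 99.0))
```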
 +
 +In the first case (vector/linker scan), overlaps are only considered 
 +if they are above 92% identity, have only very short gaps, and are 
 +located within 30% of the sequence length from either end. Also, the 
 +shorter an overlap is, the closer to either end of the analyzed sequence 
 +it must be in order to be considered for trimming of the target sequence. 
 +Multiple vector/adapter databases can be provided with the -v option, separated 
 +by commas (do not use spaces around the commas). Example:
 +
 +seqclean your_est_file -v /usr/db/UniVec,/usr/db/adaptors,/usr/db/linkers
 +
 +In this example three database files are checked for short terminal 
 +matches with the analyzed sequences from "your_est_file".
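
The vector/linker overlap criteria can be sketched as a filter on a single hit. This is an illustration under stated assumptions, not seqclean's code: coordinates are taken to be 1-based and inclusive, "30% distance from either end" is interpreted as the gap between the hit and the nearest sequence end being at most 30% of the sequence length, and the gap-size check and the length-dependent tightening near the ends are omitted because the text gives no numbers for them. All names are hypothetical.

```python
def vector_hit_is_candidate(hit_start: int, hit_end: int, seq_len: int,
                            identity: float, min_len: int = 12,
                            min_identity: float = 92.0,
                            end_fraction: float = 0.30) -> bool:
    """Return True if a vector/adapter hit passes the stated basic criteria.

    hit_start, hit_end -- 1-based inclusive hit coordinates on the analyzed sequence
    seq_len            -- length of the analyzed sequence
    identity           -- percent identity of the overlap (0-100)
    """
    hit_len = hit_end - hit_start + 1
    # Terminal matches down to 12 bp are usable; identity must be above 92%.
    if hit_len < min_len or identity <= min_identity:
        return False
    window = end_fraction * seq_len
    near_5prime = (hit_start - 1) <= window      # bases before the hit
    near_3prime = (seq_len - hit_end) <= window  # bases after the hit
    return near_5prime or near_3prime


print(vector_hit_is_candidate(1, 20, 500, 95.0))
print(vector_hit_is_candidate(240, 260, 500, 95.0))
```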
 +
 +The -s option (case 2 above) works in a similar way, in that more than one 
 +file can be provided, but in that case only larger, statistically more 
 +significant hits are considered. Example:
 +
 +seqclean your_est_file -v /usr/db/UniVec,/usr/db/linkers \
 + -s /usr/db/ecoli_genome,/usr/db/mito_ribo_seqs
 +
 +In both cases, the contaminant database files should be provided with 
 +their full path unless they can be found in the current working directory.
 +The searches against "-v" files are performed using blastall (blastn) with 
 +low stringency, while for "-s" provided files, megablast is used, for 
 +very fast screening. By default, the "smart" low-complexity filter is used 
 +during both types of searches (the -F "m D" option of blastall/megablast).
 +However, in some cases, short vector/adapter terminal overlaps might 
 +be expected in regions of low complexity, so the dust filter can be 
 +disabled completely for any database file given with the "-v" option 
 +by appending the "^" character to the end of the file name:
 +
 +seqclean your_est_file -v /usr/db/adapters^,/usr/db/UniVec,/usr/db/linkers^ \
 + -s /usr/db/ecoli_genome,/usr/db/mito_ribo_seqs
 +
 +In the example above, the "dust" filter is totally disabled for blastn 
 +searches against /usr/db/adapters and /usr/db/linkers, while for the 
 +remaining -v file (/usr/db/UniVec) it still works in "smart" mode as 
 +mentioned above.
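
When seqclean is driven from a script, the -v and -s arguments can be assembled programmatically. The helper below is a hypothetical sketch (the function `build_seqclean_cmd` and its parameters are not part of seqclean); it only uses the options described above: comma-separated lists without spaces, and a trailing "^" on -v files for which the dust filter should be disabled.

```python
def build_seqclean_cmd(input_fasta, vector_dbs=(), screen_dbs=(), no_dust=()):
    """Build an argument list for a seqclean run.

    vector_dbs -- database files for the -v option (vector/adapter detection)
    screen_dbs -- database files for the -s option (contaminant screening)
    no_dust    -- subset of vector_dbs that should get the "^" suffix
    """
    cmd = ["seqclean", input_fasta]
    if vector_dbs:
        # Comma-separated, no spaces; "^" disables the dust filter for that file.
        marked = [db + "^" if db in no_dust else db for db in vector_dbs]
        cmd += ["-v", ",".join(marked)]
    if screen_dbs:
        cmd += ["-s", ",".join(screen_dbs)]
    return cmd


cmd = build_seqclean_cmd(
    "your_est_file",
    vector_dbs=["/usr/db/adapters", "/usr/db/UniVec", "/usr/db/linkers"],
    screen_dbs=["/usr/db/ecoli_genome", "/usr/db/mito_ribo_seqs"],
    no_dust=["/usr/db/adapters", "/usr/db/linkers"],
)
print(" ".join(cmd))
```

Passing the result as a list to `subprocess.run(cmd)` avoids a shell, so the "^" character reaches seqclean unchanged regardless of the shell's quoting rules.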
 +
 +
 +
 +
seqclean/seqclean.txt · Last modified: 2010/06/23 16:25 by 172.26.15.75