Differences

--- seqclean:seqclean [2010/06/23 15:41] – created 172.26.15.75
+++ seqclean:seqclean [2010/06/23 16:07] – 172.26.15.75
@@ Line 13: / Line 13: @@
     during the sequencing process (vectors, adapters)
   * strong matches with other contaminants or unwanted sequences
-    (mitochondrial, ribosomal, bacterial, other species than the
+    (mitochondrial, ribosomal, bacterial, other species than the target organism etc.)
-    target organism etc.)
 The user is expected to provide the contaminant databases,
 they are not included in this package
+====Usage and methods====
+----
+A short usage message is displayed when the seqclean script is launched without
+any parameters.
+The seqclean script takes an input sequence file (fasta formatted) as the only
+required parameter:
+seqclean your_est_file
+seqclean creates two output files of interest:
+. the filtered FASTA file (your_est_file.clean for the example above)
+    containing only valid (non-trashed) and trimmed ("clear range") sequences
+. a "cleaning report" (your_est_file.cln) providing details about
+   sequence trimming and trashing (coordinates, reasons for trashing,
+   contaminant names etc. - see below for a detailed description).
+However, the simple usage example above does not search any contaminant
+databases (none are specified). It performs only a basic analysis: removing
+the polyA/polyT tail, possibly clipping low-quality ends (ends rich in
+undetermined bases), and trashing sequences that are too short (shorter
+than 100 or the -l parameter value) or that appear to be mostly
+low-complexity sequence.
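The basic analysis described above can be illustrated with a minimal Python sketch. It is not seqclean's actual code: the function names, the exact-match polyA rule and the `min_run` threshold are assumptions made for illustration, and only the default minimum length of 100 comes from the text. Low-quality end clipping and the low-complexity check are not modeled.

```python
import re

def trim_poly_a_tail(seq, min_run=10):
    """Remove a trailing run of at least min_run 'A' bases.

    min_run is a hypothetical threshold chosen for this sketch.
    """
    match = re.search(r"A{%d,}$" % min_run, seq.upper())
    return seq[:match.start()] if match else seq

def basic_clean(seq, min_length=100):
    """Return the trimmed sequence, or None if it would be trashed as too short."""
    trimmed = trim_poly_a_tail(seq)
    if len(trimmed) < min_length:
        return None  # shorter than the minimum length (100 by default)
    return trimmed
```

A sequence of 120 bases followed by a 15-base polyA tail would be returned without the tail, while a 40-base sequence with the same tail would be trashed.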
+As suggested in the "Introduction", the contaminant databases provided
+by the user can be considered to be of two types:
+. vector/adapter databases, which can determine the trimming
+    of the analyzed sequences even when only very short terminal matches
+    (down to 12 base pairs) are found. These database files should
+    be provided with the -v option (vector detection)
+. extensive contaminant databases: the alignments between these
+    contaminants and the analyzed sequences are only considered if
+    they are longer than 60 base pairs with at least 94% identity; these
+    are provided with the -s option (screening for contamination)
+In both cases the analyzed sequences are searched against the provided
+files and the resulting overlaps are analyzed. The contaminant databases
+should all be formatted as required for blastall (using NCBI's formatdb program).
+In the first case (vector/linker scan), the overlaps are only considered
+if they are above 92% identity, have only very short gaps, and are
+located within 30% of the sequence length from either end. Also, the shorter
+these overlaps are, the closer to either end of the analyzed sequence they
+must be in order to be considered for trimming of the target sequence.
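The two sets of criteria can be sketched as simple filters in Python. This is an illustration only, not seqclean's implementation: the `Hit` type and the function names are hypothetical, and the "30% distance" rule is interpreted here as the hit's nearest edge lying within 30% of the sequence length from an end. The gap-size check and the rule tying overlap length to distance from the end are not quantified in the text, so they are left out.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    start: int       # 1-based start of the overlap on the analyzed sequence
    end: int         # 1-based inclusive end of the overlap
    identity: float  # percent identity of the alignment

def passes_vector_scan(hit, seq_len):
    """Vector/adapter case (-v): short terminal matches are allowed."""
    overlap_len = hit.end - hit.start + 1
    if overlap_len < 12 or hit.identity <= 92.0:
        return False
    margin = 0.30 * seq_len  # assumed reading of the "30% distance" rule
    near_start = (hit.start - 1) <= margin
    near_end = (seq_len - hit.end) <= margin
    return near_start or near_end

def passes_contaminant_screen(hit):
    """Contaminant case (-s): longer than 60 bp with at least 94% identity."""
    return (hit.end - hit.start + 1) > 60 and hit.identity >= 94.0
```

For a 500-base sequence, a perfect 15-base match at positions 1 to 15 passes the vector scan, while the same match placed in the middle of the sequence does not.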
+Multiple vector/adapter databases can be provided with the -v option, separated
+by commas (do not use spaces around the commas). Example:
+seqclean your_est_file -v /usr/db/UniVec,/usr/db/adaptors,/usr/db/linkers
+In this example three database files are checked for short terminal
+matches with the analyzed sequences from "your_est_file".
+The -s option (case 2 above) works in a similar way, as more than one file can
+be provided, but in that case only larger, statistically more significant
+hits are considered. Example:
+seqclean your_est_file -v /usr/db/UniVec,/usr/db/linkers \
+ -s /usr/db/ecoli_genome,/usr/db/mito_ribo_seqs
+In both cases, the contaminant database files should be provided with
+their full path unless they can be found in the current working directory.
+The searches against "-v" files are performed using blastall (blastn) with
+low stringency, while for files provided with "-s", megablast is used for
+very fast screening. By default, the "smart" low-complexity filter is used
+during both types of searches (the -F "m D" option of blastall/megablast).
+However, in some cases, short vector/adaptor terminal overlaps might
+be expected in regions of low-complexity, so the dust filter can be
+disabled completely for any database file given with the "-v" option,
+by appending the "^" character to the end of the file name:
+seqclean your_est_file -v /usr/db/adapters^,/usr/db/UniVec,/usr/db/linkers^ \
+ -s /usr/db/ecoli_genome,/usr/db/mito_ribo_seqs
+In the example above, the "dust" filter is completely disabled for blastn
+searches against /usr/db/adapters and /usr/db/linkers, while for the
+remaining "-v" file (/usr/db/UniVec) it still works in "smart" mode as
+mentioned above.
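When seqclean is driven from a script, the comma-separated lists and the "^" suffix can be assembled programmatically. The Python sketch below is one way to do that; the helper name and its parameters are hypothetical, and only the -v and -s options and the "^" convention described above are used.

```python
def build_seqclean_command(est_file, vector_dbs=(), screen_dbs=(), no_dust=()):
    """Build a seqclean argument list.

    vector_dbs: files for -v; screen_dbs: files for -s;
    no_dust: the -v files that should get the "^" suffix (dust filter off).
    """
    cmd = ["seqclean", est_file]
    if vector_dbs:
        names = [db + "^" if db in no_dust else db for db in vector_dbs]
        cmd += ["-v", ",".join(names)]  # commas only, no surrounding spaces
    if screen_dbs:
        cmd += ["-s", ",".join(screen_dbs)]
    return cmd
```

Passing the resulting list to `subprocess.run(cmd, check=True)` avoids going through a shell, so the "^" character needs no quoting.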