seqclean:seqclean
Differences
This shows you the differences between two versions of the page.
seqclean:seqclean [2010/06/23 15:41] – created 172.26.15.75 | seqclean:seqclean [2010/06/23 16:07] – 172.26.15.75 | ||
---|---|---|---|
Line 13: | Line 13: | ||
during the sequencing process (vectors, adapters) | during the sequencing process (vectors, adapters) | ||
* strong matches with other contaminants or unwanted sequences | * strong matches with other contaminants or unwanted sequences | ||
- | (mitochondrial, | + | (mitochondrial, |
- | | + | |
| | ||
The user is expected to provide the contaminant databases, | The user is expected to provide the contaminant databases, | ||
they are not included in this package | they are not included in this package | ||
+ | |||
+ | ====Usage and methods==== | ||
+ | ---- | ||
+ | A short usage message is displayed when the seqclean script is launched without
+ | any parameters. | ||
+ | |||
+ | The seqclean script takes an input sequence file (fasta formatted) as the only | ||
+ | required parameter: | ||
+ | |||
+ | seqclean your_est_file | ||
+ | |||
+ | seqclean creates two output files of interest: | ||
+ | 1. the filtered FASTA file (your_est_file.clean for the example above) | ||
+ | containing only valid (non-trashed) and trimmed (" | ||
+ | 2. a " | ||
+ | | ||
+ | | ||
+ | |||
+ | However, the simple usage example above will not perform any searches
+ | against contaminant databases (as none are specified). It will
+ | only provide basic analysis: removing the polyA/polyT tail, possibly
+ | clipping low-quality ends (the ends rich in undetermined bases),
+ | and trashing the sequences which are too short (shorter than 100 or
+ | the -l parameter value) or which appear to be mostly low-complexity
+ | sequence.
+ | |||
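The length cutoff described above can be illustrated with a minimal Python sketch. This is not seqclean's own code: the function names and the record layout (name, sequence pairs) are assumptions made for illustration, and the sketch assumes the sequences have already been trimmed. Only the default of 100 and the fact that -l overrides it come from the text.

```python
# Illustrative sketch only; seqclean itself is not implemented this way.
DEFAULT_MIN_LENGTH = 100  # default cutoff from the text; the -l parameter overrides it


def is_too_short(sequence: str, min_length: int = DEFAULT_MIN_LENGTH) -> bool:
    """Return True when an (already trimmed) sequence is shorter than the cutoff."""
    return len(sequence) < min_length


def split_by_length(records, min_length: int = DEFAULT_MIN_LENGTH):
    """Split (name, sequence) pairs into kept and trashed name lists."""
    kept, trashed = [], []
    for name, sequence in records:
        if is_too_short(sequence, min_length):
            trashed.append(name)
        else:
            kept.append(name)
    return kept, trashed


if __name__ == "__main__":
    # Hypothetical records: one of 150 bases, one of 40 bases.
    example = [("est_1", "ACGT" * 37 + "AC"), ("est_2", "ACGT" * 10)]
    print(split_by_length(example))
```

A sequence of exactly 100 bases is kept under this reading, since the text only trashes sequences "shorter than 100".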
+ | As suggested in the " | ||
+ | by the user can be considered to be of two types: | ||
+ | 1. vector/ | ||
+ | of the analyzed sequences even when only very short terminal matches | ||
+ | (down to 12 base pairs) are found. These database files should | ||
+ | be provided with the -v option (vector detection) | ||
+ | 2. extensive contaminants databases: the alignments between these | ||
+ | contaminants and the analyzed sequences are only considered if | ||
+ | they are longer than 60 base pairs with at least 94% identity; these | ||
+ | are provided with the -s option (screening for contamination) | ||
+ | |||
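The acceptance rule for the second database type (the -s option) can be written as a small predicate. The sketch below is an assumption-laden illustration in Python, not seqclean's implementation: the function and constant names are hypothetical, and only the two thresholds (longer than 60 base pairs, at least 94% identity) are taken from the text.

```python
# Illustrative sketch only; names are hypothetical, thresholds are from the text.
MIN_HIT_LENGTH = 60       # alignments must be longer than this many base pairs
MIN_HIT_IDENTITY = 94.0   # alignments must have at least this percent identity


def is_contaminant_hit(alignment_length: int, percent_identity: float) -> bool:
    """Return True when an alignment would count as a contaminant match."""
    return (alignment_length > MIN_HIT_LENGTH
            and percent_identity >= MIN_HIT_IDENTITY)


if __name__ == "__main__":
    print(is_contaminant_hit(61, 94.0))   # long enough and identical enough
    print(is_contaminant_hit(60, 99.0))   # not longer than 60 base pairs
    print(is_contaminant_hit(200, 93.9))  # identity below 94%
```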
+ | In both cases the analyzed sequences will be searched against the provided | ||
+ | files and the overlaps are analyzed. The contaminant databases should be | ||
+ | all formatted as required for blastall (using NCBI's formatdb program). | ||
+ | |||
+ | In the first case (vector/ | ||
+ | if they are above 92% identity, have only very short gaps, and are
+ | located within 30% of the sequence length from either end. Also, the shorter
+ | these overlaps are, the closer they must be to either end of the analyzed
+ | sequence in order to be considered for trimming of the target sequence.
+ | Multiple vector/ | ||
+ | by comma (do not use spaces around the comma). Example: | ||
+ | |||
+ | seqclean your_est_file -v / | ||
+ | |||
+ | In this example three database files are checked for short terminal | ||
+ | matches with the analyzed sequences from " | ||
+ | |||
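When seqclean is driven from a script, the comma-separated -v argument can be assembled programmatically. The following Python sketch is one possible way to do this; the helper name and the database paths are hypothetical placeholders, and the only facts taken from the text are the command name, the input file as first argument, the -v option, and the rule that files are joined by commas without spaces.

```python
# Illustrative sketch only; the helper and the paths below are hypothetical.
def build_vector_command(est_file: str, vector_dbs) -> list:
    """Build an argument list for seqclean with a comma-joined -v value."""
    for path in vector_dbs:
        # A comma or whitespace inside a path would break the comma-separated list.
        if "," in path or any(ch.isspace() for ch in path):
            raise ValueError(f"unsupported character in database path: {path!r}")
    command = ["seqclean", est_file]
    if vector_dbs:
        command += ["-v", ",".join(vector_dbs)]
    return command


if __name__ == "__main__":
    # Placeholder paths, not real database locations.
    dbs = ["/path/to/vectors_a", "/path/to/vectors_b", "/path/to/adapters"]
    print(" ".join(build_vector_command("your_est_file", dbs)))
```

The returned list can be passed to `subprocess.run` as is, which avoids shell quoting issues.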
+ | The -s option (case 2. above) works in a similar way, as more than one file can
+ | be provided, but in that case only larger, statistically more significant | ||
+ | hits are considered. Example: | ||
+ | |||
+ | seqclean your_est_file -v / | ||
+ | -s / | ||
+ | |||
+ | In both cases, the contaminant database files should be provided with | ||
+ | their full path unless they can be found in the current working directory. | ||
+ | The searches against " | ||
+ | low stringency, while for " | ||
+ | very fast screening. By default, the " | ||
+ | during both types of searches (the -F "m D" option of blastall/ |
+ | However, in some cases, short vector/ | ||
+ | be expected in regions of low-complexity, | ||
+ | disabled completely for any database file given at the " | ||
+ | by appending the " | ||
+ | |||
+ | seqclean your_est_file -v / | ||
+ | -s / | ||
+ | |||
+ | In the example above, the " | ||
+ | searches against / | ||
+ | other files (/ | ||
+ | mode as mentioned above. | ||
+ | |||
+ | |||
+ | |||
+ |
seqclean/seqclean.txt · Last modified: 2010/06/23 16:25 by 172.26.15.75