seqclean:seqclean
Differences
This shows you the differences between two versions of the page.
| Next revision | Previous revision | ||
| seqclean:seqclean [2010/06/23 15:41] – created 172.26.15.75 | seqclean:seqclean [2010/06/23 16:25] (current) – 172.26.15.75 | ||
|---|---|---|---|
| Line 13: | Line 13: | ||
| during the sequencing process (vectors, adapters) | during the sequencing process (vectors, adapters) | ||
| * strong matches with other contaminants or unwanted sequences | * strong matches with other contaminants or unwanted sequences | ||
| - | (mitochondrial, | + | (mitochondrial, |
| - | | + | |
| | | ||
| The user is expected to provide the contaminant databases, | The user is expected to provide the contaminant databases, | ||
| they are not included in this package | they are not included in this package | ||
| + | ====Usage and methods==== | ||
| + | ---- | ||
| + | A short usage message is displayed when the seqclean script is launched without | ||
| + | any parameters. | ||
| + | |||
| + | The seqclean script takes an input sequence file (fasta formatted) as the only | ||
| + | required parameter: | ||
| + | |||
| + | seqclean your_est_file | ||
| + | |||
| + | seqclean creates two output files of interest: | ||
| + | 1. the filtered FASTA file (your_est_file.clean for the example above) | ||
| + | containing only valid (non-trashed) and trimmed (" | ||
| + | 2. a " | ||
| + | | ||
| + | | ||
| + | |||
| + | However, the simple usage example above will not perform any searches | ||
| + | against contaminant databases (as there are none specified) but it will | ||
| + | only provide basic analysis, removing the polyA/polyT tail, possibly | ||
| + | clipping low-quality ends (the ends rich in undetermined bases) | ||
| + | and trashing the ones which are too short (shorter than 100 or | ||
| + | the -l parameter value) or which appear to be mostly low-complexity | ||
| + | sequence. | ||
| + | |||
| + | As suggested in the " | ||
| + | by the user can be considered to be of two types: | ||
| + | 1. vector/ | ||
| + | of the analyzed sequences even when only very short terminal matches | ||
| + | (down to 12 base pairs) are found. These database files should | ||
| + | be provided with the -v option (vector detection) | ||
| + | 2. extensive contaminants databases: the alignments between these | ||
| + | contaminants and the analyzed sequences are only considered if | ||
| + | they are longer than 60 base pairs with at least 94% identity; these | ||
| + | are provided with the -s option (screening for contamination) | ||
| + | |||
| + | In both cases the analyzed sequences will be searched against the provided | ||
| + | files and the overlaps are analyzed. The contaminant databases should be | ||
| + | all formatted as required for blastall (using NCBI's formatdb program). | ||
| + | |||
| + | In the first case (vector/ | ||
| + | if they are above 92% identity, they have very short gaps and they are | ||
| + | located in the 30% distance from either end. Also, the shorter these | ||
| + | overlaps are, the closer to either end of the analyzed sequence they | ||
| + | should be, in order to be considered for trimming of the target sequence. | ||
| + | Multiple vector/ | ||
| + | by comma (do not use spaces around the comma). Example: | ||
| + | |||
| + | seqclean your_est_file -v / | ||
| + | |||
| + | In this example three database files are checked for short terminal | ||
| + | matches with the analyzed sequences from " | ||
| + | |||
| + | The -s option (case 2 above) works in a similar way, as more than one file can | ||
| + | be provided, but in that case only larger, statistically more significant | ||
| + | hits are considered. Example: | ||
| + | |||
| + | seqclean your_est_file -v / | ||
| + | -s / | ||
| + | |||
| + | In both cases, the contaminant database files should be provided with | ||
| + | their full path unless they can be found in the current working directory. | ||
| + | The searches against " | ||
| + | low stringency, while for " | ||
| + | very fast screening. By default, the " | ||
| + | during both types of searches (the -F "m D" option of blastall/ | ||
| + | However, in some cases, short vector/ | ||
| + | be expected in regions of low-complexity, | ||
| + | disabled completely for any database file given at the " | ||
| + | by appending the " | ||
| + | |||
| + | seqclean your_est_file -v / | ||
| + | -s / | ||
| + | |||
| + | In the example above, the " | ||
| + | searches against / | ||
| + | other files (/ | ||
| + | mode as mentioned above. | ||
| + | |||
| + | The cleaning scripts keep track of iterative trimming of the input | ||
| + | sequences through multiple matches with various contaminants, | ||
| + | if that's the case. The 5' end (end5) coordinate of each input sequence | ||
| + | is initially set to 1, and the 3' end (end3) coordinate is considered | ||
| + | to be the length of the initial sequence. During the above-mentioned | ||
| + | trimming procedures, end5 can be increased and/or end3 can be decreased. | ||
| + | The final end5..end3 interval, of length end3-end5+1, is the "clear range" of the | ||
| + | sequence after going through the cleaning procedure. Whether or not trimming | ||
| + | was applied, if the "clear range" length is shorter than a minimum | ||
| + | value (default 100 nt, can be set with the -l option), the sequence will | ||
| + | be considered invalid and it will be trashed. Also, at the end of | ||
| + | the cleaning procedure, the percentage of undetermined bases from the | ||
| + | clear range is computed and the sequence is also trashed if this | ||
| + | percentage is larger than 3%. | ||
| + | |||
| + | ==== Cleaning report format ==== | ||
| + | ---- | ||
| + | Each line in the cleaning report file (*.cln) has 7 tab-delimited fields | ||
| + | as follows: | ||
| + | |||
| + | 1. the name of the input sequence | ||
| + | 2. the percentage of undetermined bases in the clear range | ||
| + | 3. 5' coordinate after cleaning | ||
| + | 4. 3' coordinate after cleaning | ||
| + | 5. initial length of the sequence | ||
| + | 6. trash code | ||
| + | 7. trimming comments (contaminant names, reasons for trimming/ | ||
| + | |||
| + | The trash code field (6) should be empty if (part of) a sequence is | ||
| + | considered valid - so it can be found in the final filtered file (*.clean) | ||
| + | |||
| + | The trash code field will be set to the file name of the last contaminant | ||
| + | database, if that determined the clear range to fall below the minimum value | ||
| + | (-l parameter, default 100). There are three reserved values of | ||
| + | the trash code: | ||
| + | |||
| + | " | ||
| + | below the minimum accepted length (-l) after polyA | ||
| + | or low quality ends trimming; | ||
| + | " | ||
| + | is greater than 3% in the clear range; | ||
| + | " | ||
| + | is left unmasked by the " | ||
| + | |||
| + | |||
| + | The reasons and the coordinates for trimming are mentioned in the 7th | ||
| + | field. When trimming was due to a contaminant match, the contaminant | ||
| + | name and the overlap coordinates are mentioned. When trimming was due | ||
| + | to polyA tail or low quality ends removal, the " | ||
| + | mentioned along with the trimming coordinates. | ||
| + | |||
| + | Besides the -s and -v parameters mentioned above, here is a brief summary | ||
| + | of the other parameters: | ||
| + | -c : enables parallel processing by specifying the number of local CPUs | ||
| + | to use (for an SMP machine) or a filename containing a list of PVM node | ||
| + | names (one host name per line, per CPU). In the PVM case, if a node | ||
| + | is also an SMP machine and you want to use more than one CPU on that | ||
| + | node, you should list that node name once for each CPU | ||
| + | you want to use on it. If this option is not provided, | ||
| + | only one CPU is used on the local machine. | ||
| + | -n : the input file is not usually processed as one single query file. | ||
| + | Instead, it is sliced up into little parts and each part | ||
| + | is processed separately; this option is useful to tweak when | ||
| + | you also make use of the multi-CPU option (-c), as each slice can | ||
| + | be processed by one CPU. | ||
| + | -l : the minimum accepted length of the clear range in order | ||
| + | to be considered valid. If the length of the clear range falls | ||
| + | below this value, a trash code is assigned to the input sequence | ||
| + | and it will be excluded from the output filtered file (*.clean) | ||
| + | -r : custom name of the cleaning report file (default: add the " | ||
| + | suffix to the input file name) | ||
| + | -o : custom name of the final "clear range" | ||
| + | the valid sequences (default: append " | ||
| + | input file name) | ||
| + | -x : set the minimum percent identity to be considered for an | ||
| + | alignment with a contaminant (default 96) | ||
| + | -y : minimum length of a terminal vector hit to be considered | ||
| + | (>11, default 11) | ||
| + | -N : disable any attempt of trimming of low quality ends (ends rich in | ||
| + | N = undetermined bases) | ||
| + | -M : completely disable trashing of low quality (N-rich) sequences | ||
| + | -A : disable trimming of polyA tails from 3' end or polyT from 5' end of | ||
| + | the input sequences | ||
| + | -L : disable low-complexity analysis and the trashing of input sequences | ||
| + | by this criterion | ||
| + | -I : do not rebuild the .cidx file (if already there) | ||
| + | -m : enable sending of e-mail notification to the mentioned address, at | ||
| + | the end of the cleaning process or in case of error | ||
| + | |||
| + | If after seqclean one needs to trim the corresponding quality values too, | ||
| + | according to the new coordinates or trash codes found by seqclean, the | ||
| + | utility script " | ||
| + | a fasta-like file containing space delimited quality values for each nucleotide of | ||
| + | the original sequences. It should be run after seqclean, as it parses the | ||
| + | trimming (" | ||
| + | and applies them to the quality records. | ||
| + | |||
| + | |||
| + | |||
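
The cleaning report format and the clear-range rules described above can be sketched in Python. This is only an illustration, not seqclean's own code: the names `ClnRecord`, `parse_cln_line` and `passes_filters` are hypothetical, the sample record is made up, and the sketch assumes that each report line carries exactly 7 tab-separated fields (possibly padded with spaces) and that the percentage field parses as a plain number. The defaults of 100 nt and 3% are the ones stated in the text.

```python
from dataclasses import dataclass


@dataclass
class ClnRecord:
    """One line of a *.cln cleaning report (7 tab-delimited fields)."""
    name: str        # 1. name of the input sequence
    pct_n: float     # 2. percentage of undetermined bases in the clear range
    end5: int        # 3. 5' coordinate after cleaning
    end3: int        # 4. 3' coordinate after cleaning
    length: int      # 5. initial length of the sequence
    trash_code: str  # 6. trash code (empty if the sequence is valid)
    comments: str    # 7. trimming comments

    @property
    def clear_len(self) -> int:
        # Length of the clear range, as defined in the text: end3-end5+1
        return self.end3 - self.end5 + 1

    @property
    def is_valid(self) -> bool:
        # An empty trash code means the sequence is kept in the *.clean file
        return self.trash_code == ""


def parse_cln_line(line: str) -> ClnRecord:
    """Split one report line on tabs; assumes exactly 7 fields."""
    fields = [f.strip() for f in line.rstrip("\n").split("\t")]
    if len(fields) != 7:
        raise ValueError(f"expected 7 tab-delimited fields, got {len(fields)}")
    name, pct_n, end5, end3, length, trash_code, comments = fields
    return ClnRecord(name, float(pct_n), int(end5), int(end3),
                     int(length), trash_code, comments)


def passes_filters(end5: int, end3: int, pct_n: float,
                   min_len: int = 100, max_pct_n: float = 3.0) -> bool:
    """Restate the two final checks from the text: the clear range must not
    be shorter than min_len, and the percentage of undetermined bases must
    not be larger than max_pct_n."""
    return (end3 - end5 + 1) >= min_len and pct_n <= max_pct_n


if __name__ == "__main__":
    # Hypothetical record: 500 nt read trimmed to positions 21..480
    rec = parse_cln_line("EST001\t0.50\t21\t480\t500\t\t\n")
    print(rec.name, rec.clear_len, rec.is_valid)
```

Keeping the parsing separate from the length and N-content checks makes it possible to re-evaluate a report against a different minimum length without rerunning the searches, although the trash codes already written by seqclean remain the authoritative result.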
