seqclean:seqclean
Differences
seqclean:seqclean [2010/06/23 15:42] – 172.26.15.75 | seqclean:seqclean [2010/06/23 16:25] (current) – 172.26.15.75 | ||
---|---|---|---|
Line 17: | Line 17: | ||
The user is expected to provide the contaminant databases, | The user is expected to provide the contaminant databases, | ||
they are not included in this package | they are not included in this package | ||
+ | ====Usage and methods==== | ||
+ | ---- | ||
+ | A short usage message is displayed when the seqclean script is launched without | ||
+ | any parameters. | ||
+ | |||
+ | The seqclean script takes an input sequence file (FASTA-formatted) as the only | ||
+ | required parameter: | ||
+ | |||
+ | seqclean your_est_file | ||
+ | |||
+ | seqclean creates two output files of interest: | ||
+ | 1. the filtered FASTA file (your_est_file.clean for the example above) | ||
+ | containing only valid (non-trashed) and trimmed (" | ||
+ | 2. a " | ||
+ | | ||
+ | | ||
+ | |||
+ | However, the simple usage example above does not search against any | ||
+ | contaminant databases, as none are specified. It performs only basic | ||
+ | analysis: removing the polyA/polyT tail, possibly clipping low-quality | ||
+ | ends (ends rich in undetermined bases), and trashing sequences that are | ||
+ | too short (shorter than 100 nt or the -l parameter value) or that appear | ||
+ | to be mostly low-complexity sequence. | ||
+ | |||
+ | As suggested in the " | ||
+ | by the user can be considered to be of two types: | ||
+ | 1. vector/ | ||
+ | of the analyzed sequences even when only very short terminal matches | ||
+ | (down to 12 base pairs) are found. These database files should | ||
+ | be provided with the -v option (vector detection) | ||
+ | 2. extensive contaminants databases: the alignments between these | ||
+ | contaminants and the analyzed sequences are only considered if | ||
+ | they are longer than 60 base pairs with at least 94% identity; these | ||
+ | are provided with the -s option (screening for contamination) | ||
+ | |||
+ | In both cases the analyzed sequences are searched against the provided | ||
+ | files and the resulting overlaps are evaluated. All contaminant databases | ||
+ | should be formatted as required for blastall (using NCBI's formatdb program). | ||
+ | |||
+ | In the first case (vector/ | ||
+ | if they are above 92% identity, have very short gaps, and are | ||
+ | located within 30% of either end. Also, the shorter an | ||
+ | overlap is, the closer to either end of the analyzed sequence it | ||
+ | must be in order to be considered for trimming of the target sequence. | ||
+ | Multiple vector/ | ||
+ | by comma (do not use spaces around the comma). Example: | ||
+ | |||
+ | seqclean your_est_file -v / | ||
+ | |||
+ | In this example three database files are checked for short terminal | ||
+ | matches with the analyzed sequences from " | ||
+ | |||
+ | The -s option (case 2 above) works in a similar way, as more than one file can | ||
+ | be provided, but in that case only larger, statistically more significant | ||
+ | hits are considered. Example: | ||
+ | |||
+ | seqclean your_est_file -v / | ||
+ | -s / | ||
+ | |||
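The comma-separated form of the -v and -s arguments shown above can be assembled programmatically. The following is a minimal sketch, not part of seqclean: the helper name `build_seqclean_args` and the database paths in the usage note are hypothetical, and only the options described in this page (-v, -s, -l) are used. It only builds the argument list and does not run anything.

```python
def build_seqclean_args(est_file, vector_dbs=(), screen_dbs=(), min_len=None):
    """Build a seqclean argument list (hypothetical helper; nothing is executed)."""

    def join_dbs(paths):
        # seqclean expects a comma-separated list with no spaces around the commas,
        # so reject paths that would make the list ambiguous.
        for path in paths:
            if "," in path or any(ch.isspace() for ch in path):
                raise ValueError(f"path not usable in a comma list: {path!r}")
        return ",".join(paths)

    args = ["seqclean", est_file]
    if vector_dbs:
        args += ["-v", join_dbs(vector_dbs)]   # short terminal matches (vector detection)
    if screen_dbs:
        args += ["-s", join_dbs(screen_dbs)]   # longer, more significant hits (screening)
    if min_len is not None:
        args += ["-l", str(min_len)]           # minimum clear range length
    return args
```

For example, `build_seqclean_args("your_est_file", ["/db/vec1", "/db/vec2"], ["/db/contam"])` yields a list that could be passed to `subprocess.run`; the `/db/...` paths are placeholders, not files shipped with the package.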
+ | In both cases, the contaminant database files should be provided with | ||
+ | their full path unless they can be found in the current working directory. | ||
+ | The searches against " | ||
+ | low stringency, while for " | ||
+ | very fast screening. By default, the " | ||
+ | during both types of searches (the -F "m D" option of blastall/ | ||
+ | However, in some cases, short vector/ | ||
+ | be expected in regions of low-complexity, | ||
+ | disabled completely for any database file given at the " | ||
+ | by appending the " | ||
+ | |||
+ | seqclean your_est_file -v / | ||
+ | -s / | ||
+ | |||
+ | In the example above, the " | ||
+ | searches against / | ||
+ | other files (/ | ||
+ | mode as mentioned above. | ||
+ | |||
+ | The cleaning scripts keep track of iterative trimming of the input | ||
+ | sequences through multiple matches with various contaminants, | ||
+ | if that's the case. The 5' end (end5) coordinate of each input sequence | ||
+ | is initially set to 1, and the 3' end (end3) coordinate is initially set | ||
+ | to the length of the input sequence. During the above-mentioned | ||
+ | trimming procedures, end5 can be increased and/or end3 can be decreased. | ||
+ | The final end5..end3 interval (end3-end5+1 bases long) is considered to be the | ||
+ | "clear range" of the sequence after the cleaning procedure. Whether or not | ||
+ | trimming was applied, if the "clear range" length is shorter than a minimum | ||
+ | value (default 100 nt, can be set by the -l option), the sequence is | ||
+ | considered invalid and is trashed. Also, at the end of | ||
+ | the cleaning procedure, the percentage of undetermined bases in the | ||
+ | clear range is computed, and the sequence is also trashed if this | ||
+ | percentage is larger than 3%. | ||
+ | |||
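The clear range rules in the paragraph above can be illustrated with a short sketch. This is not seqclean's own code: the function name and the returned labels are made up for illustration (they are not seqclean's trash codes), and "undetermined bases" is assumed to mean the letter N.

```python
def clear_range_status(seq, end5=1, end3=None, min_len=100, max_n_pct=3.0):
    """Classify a sequence from its 1-based, inclusive clear range [end5, end3]."""
    if end3 is None:
        end3 = len(seq)              # end3 starts out as the sequence length
    length = end3 - end5 + 1         # clear range length
    if length <= 0 or length < min_len:
        return "too short"
    clear = seq[end5 - 1:end3]       # convert 1-based inclusive to a Python slice
    n_count = sum(1 for base in clear if base in "Nn")
    if 100.0 * n_count / length > max_n_pct:
        return "too many undetermined bases"
    return "valid"
```

With the default thresholds, a 150-base sequence trimmed to end5=60 has a 91-base clear range and would be rejected as too short.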
+ | ==== Cleaning report format ==== | ||
+ | ---- | ||
+ | Each line in the cleaning report file (*.cln) has 7 tab-delimited fields | ||
+ | as follows: | ||
+ | |||
+ | 1. the name of the input sequence | ||
+ | 2. the percentage of undetermined bases in the clear range | ||
+ | 3. 5' coordinate after cleaning | ||
+ | 4. 3' coordinate after cleaning | ||
+ | 5. initial length of the sequence | ||
+ | 6. trash code | ||
+ | 7. trimming comments (contaminant names, reasons for trimming/ | ||
+ | |||
+ | The trash code field (6) should be empty if (part of) a sequence is | ||
+ | considered valid, so that it can be found in the final filtered file (*.clean). | ||
+ | |||
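A cleaning report line with the seven tab-delimited fields listed above could be read as follows. This is a minimal sketch under assumptions: the record and function names are hypothetical, the percentage field is assumed to be a plain number, and the sample line in the usage note is invented rather than taken from a real report.

```python
from typing import NamedTuple


class ClnRecord(NamedTuple):
    name: str          # 1. name of the input sequence
    pct_n: float       # 2. percentage of undetermined bases in the clear range
    end5: int          # 3. 5' coordinate after cleaning
    end3: int          # 4. 3' coordinate after cleaning
    initial_len: int   # 5. initial length of the sequence
    trash_code: str    # 6. trash code (empty if the sequence is valid)
    comments: str      # 7. trimming comments


def parse_cln_line(line):
    """Parse one line of a *.cln report into a ClnRecord."""
    fields = line.rstrip("\r\n").split("\t")
    fields += [""] * (7 - len(fields))   # tolerate missing trailing empty fields
    name, pct_n, end5, end3, init_len, trash, comments = (f.strip() for f in fields[:7])
    return ClnRecord(name, float(pct_n), int(end5), int(end3), int(init_len), trash, comments)


def is_valid(record):
    """A sequence is kept in the filtered file when its trash code is empty."""
    return record.trash_code == ""
```

For instance, `parse_cln_line("seq1\t0.50\t12\t480\t500\t\t")` gives a record with `end5 == 12`, `end3 == 480` and an empty trash code, so `is_valid` returns `True` for it.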
+ | The trash code field is set to the file name of the last contaminant | ||
+ | database if a match against it caused the clear range to fall below the minimum value | ||
+ | (-l parameter, default 100). There are three reserved values of | ||
+ | the trash code: | ||
+ | |||
+ | " | ||
+ | below the minimum accepted length (-l) after polyA | ||
+ | or low quality ends trimming; | ||
+ | " | ||
+ | is greater than 3% in the clear range; | ||
+ | " | ||
+ | is left unmasked by the " | ||
+ | |||
+ | |||
+ | The reasons and the coordinates for trimming are mentioned in the 7th | ||
+ | field. When trimming was due to a contaminant match, the contaminant | ||
+ | name and the overlap coordinates are mentioned. When trimming was due | ||
+ | to polyA tail or low quality ends removal, the " | ||
+ | mentioned along with the trimming coordinates. | ||
+ | |||
+ | Besides the -s and -v parameters mentioned above, here is a brief summary | ||
+ | of the other parameters: | ||
+ | -c : enables parallel processing by specifying the number of local CPUs | ||
+ | to use (for an SMP machine) or a filename containing a list of PVM node | ||
+ | names (one host name per line, per CPU). In the PVM case, if a node | ||
+ | is also an SMP machine and you want to use more than one CPU on that | ||
+ | node, list that node name once for each CPU you want to use | ||
+ | on that node. If this option is not provided, | ||
+ | only one CPU is used on the local machine. | ||
+ | -n : the input file is not usually processed as one single query file. | ||
+ | Instead, it is sliced up into little parts and each part | ||
+ | is processed separately; this option is useful to tweak when | ||
+ | you also make use of the multi-CPU option (-c), as each slice can | ||
+ | be processed by one CPU. | ||
+ | -l : the minimum length the clear range must have for the sequence | ||
+ | to be considered valid. If the length of the clear range falls | ||
+ | below this value, a trash code is assigned to the input sequence | ||
+ | and it will be excluded from the output filtered file (*.clean) | ||
+ | -r : custom name of the cleaning report file (default: add the " | ||
+ | suffix to the input file name) | ||
+ | -o : custom name of the final "clear range" | ||
+ | the valid sequences (default: append " | ||
+ | input file name) | ||
+ | -x : set the minimum percent identity to be considered for an | ||
+ | alignment with a contaminant (default 96) | ||
+ | -y : minimum length of a terminal vector hit to be considered | ||
+ | (>11, default 11) | ||
+ | -N : disable any attempt of trimming of low quality ends (ends rich in | ||
+ | N = undetermined bases) | ||
+ | -M : completely disable trashing of low quality (N-rich) sequences | ||
+ | -A : disable trimming of polyA tails from 3' end or polyT from 5' end of | ||
+ | the input sequences | ||
+ | -L : disable low-complexity analysis and the trashing of input sequences | ||
+ | by this criterion | ||
+ | -I : do not rebuild the .cidx file (if already there) | ||
+ | -m : enable sending of e-mail notification to the mentioned address, at | ||
+ | the end of the cleaning process or in case of error | ||
+ | |||
+ | If after seqclean one needs to trim the corresponding quality values too, | ||
+ | according to the new coordinates or trash codes found by seqclean, the | ||
+ | utility script " | ||
+ | a FASTA-like file containing space-delimited quality values for each nucleotide of | ||
+ | the original sequences. It should be run after seqclean, as it parses the | ||
+ | trimming (" | ||
+ | and applies them to the quality records. | ||
+ | |||
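Applying a clear range to a quality record amounts to slicing the quality values with the same 1-based coordinates. The sketch below only illustrates that idea and is not the utility script mentioned above; the function name is hypothetical, and the quality values are assumed to be already parsed into a list with one entry per nucleotide.

```python
def trim_quals(quals, end5, end3, trash_code=""):
    """Return the quality values inside the clear range, or None if trashed.

    end5 and end3 are the 1-based, inclusive coordinates from the cleaning report.
    """
    if trash_code:
        return None                  # trashed sequences have no valid clear range
    return quals[end5 - 1:end3]      # same slice as for the nucleotide sequence
```

A record of ten quality values with a clear range of 3 to 8 keeps six values, matching the length end3-end5+1.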
+ | |||
+ |
seqclean/seqclean.txt · Last modified: 2010/06/23 16:25 by 172.26.15.75