seqclean:seqclean
Differences
This shows you the differences between two versions of the page.
seqclean:seqclean [2010/06/23 16:07] – 172.26.15.75 | seqclean:seqclean [2010/06/23 16:25] (current) – 172.26.15.75 | ||
---|---|---|---|
Line 17: | Line 17: | ||
The user is expected to provide the contaminant databases, | The user is expected to provide the contaminant databases, | ||
they are not included in this package | they are not included in this package | ||
- | |||
====Usage and methods==== | ====Usage and methods==== | ||
---- | ---- | ||
Line 97: | Line 96: | ||
mode as mentioned above. | mode as mentioned above. | ||
+ | The cleaning scripts keep track of iterative trimming of the input | ||
+ | sequences through multiple matches with various contaminants, | ||
+ | if that's the case. The 5' end (end5) coordinate of each input sequence | ||
+ | is initially set to 1, and the 3' end (end3) coordinate is initially set | ||
+ | to the length of the input sequence. During the above-mentioned | ||
+ | trimming procedures, end5 can be increased and/or end3 can be decreased. | ||
+ | The final end5..end3 range, of length end3-end5+1, is considered to be the | ||
+ | "clear range" of the sequence after the cleaning procedure. Whether or not | ||
+ | trimming was applied, if the "clear range" length is shorter than a minimum | ||
+ | value (default 100 nt, set with the -l option), the sequence is | ||
+ | considered invalid and is trashed. Also, at the end of | ||
+ | the cleaning procedure, the percentage of undetermined bases in the | ||
+ | clear range is computed, and the sequence is also trashed if this | ||
+ | percentage is larger than 3%. | ||
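The two validity checks described above can be sketched in a few lines of Python. This is only an illustration of the stated rules, not seqclean's own code: the function name, its parameters, and the returned reason strings are hypothetical, and undetermined bases are assumed to be written as N.

```python
def check_clear_range(seq, end5=1, end3=None, min_len=100, max_n_pct=3.0):
    """Apply the two trashing rules to a 1-based, inclusive clear range.

    Returns (is_valid, reason). The reason strings are illustrative labels,
    not the trash codes that seqclean writes to its report.
    """
    if end3 is None:
        end3 = len(seq)  # end3 starts out as the sequence length

    clear_len = end3 - end5 + 1
    if clear_len < min_len:
        return False, "too short"

    # Slice the clear range (convert 1-based inclusive to Python slicing)
    clear = seq[end5 - 1:end3].upper()
    n_pct = 100.0 * clear.count("N") / clear_len
    if n_pct > max_n_pct:
        return False, "too many undetermined bases"

    return True, ""
```

For example, a 150 nt sequence whose end5 was pushed to 60 has a clear range of 91 nt and fails the default length check.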
+ | |||
+ | ==== Cleaning report format ==== | ||
+ | ---- | ||
+ | Each line in the cleaning report file (*.cln) has 7 tab-delimited fields | ||
+ | as follows: | ||
+ | |||
+ | 1. the name of the input sequence | ||
+ | 2. the percentage of undetermined bases in the clear range | ||
+ | 3. 5' coordinate after cleaning | ||
+ | 4. 3' coordinate after cleaning | ||
+ | 5. initial length of the sequence | ||
+ | 6. trash code | ||
+ | 7. trimming comments (contaminant names, reasons for trimming/ | ||
+ | |||
+ | The trash code field (6) is empty if the sequence, or part of it, is | ||
+ | considered valid, in which case it appears in the final filtered file (*.clean). | ||
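A cleaning report line with the seven fields listed above could be read as follows. This is a minimal sketch: the record type, the function names, and the sample line are made up for illustration, and it assumes that a line with fewer than seven fields simply has its trailing empty fields missing.

```python
from typing import NamedTuple


class ClnRecord(NamedTuple):
    name: str             # 1. name of the input sequence
    n_percent: float      # 2. percentage of undetermined bases in the clear range
    end5: int             # 3. 5' coordinate after cleaning
    end3: int             # 4. 3' coordinate after cleaning
    initial_length: int   # 5. initial length of the sequence
    trash_code: str       # 6. trash code (empty when the sequence is valid)
    comments: str         # 7. trimming comments


def parse_cln_line(line):
    """Split one tab-delimited report line into a ClnRecord."""
    fields = [f.strip() for f in line.rstrip("\n").split("\t")]
    if len(fields) > 7:
        raise ValueError("expected at most 7 tab-delimited fields")
    # Assumption: missing trailing fields are treated as empty
    fields += [""] * (7 - len(fields))
    return ClnRecord(fields[0], float(fields[1]), int(fields[2]),
                     int(fields[3]), int(fields[4]), fields[5], fields[6])


def is_valid(record):
    """A record is valid when its trash code field is empty."""
    return record.trash_code == ""
```

Filtering a report for the sequences that made it into the *.clean file then comes down to keeping the records for which `is_valid` returns true.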
+ | |||
+ | The trash code field is set to the file name of the last contaminant | ||
+ | database if trimming against that database caused the clear range to fall | ||
+ | below the minimum length (-l parameter, default 100). There are three | ||
+ | reserved values of the trash code: | ||
+ | |||
+ | " | ||
+ | below the minimum accepted length (-l) after polyA | ||
+ | or low quality ends trimming; | ||
+ | " | ||
+ | is greater than 3% in the clear range; | ||
+ | " | ||
+ | is left unmasked by the " | ||
+ | |||
+ | |||
+ | The reasons and the coordinates for trimming are mentioned in the 7th | ||
+ | field. When trimming was due to a contaminant match, the contaminant | ||
+ | name and the overlap coordinates are mentioned. When trimming was due | ||
+ | to polyA tail or low quality ends removal, the " | ||
+ | mentioned along with the trimming coordinates. | ||
+ | |||
+ | Besides the -s and -v parameters mentioned above, here is a brief summary | ||
+ | of the other parameters: | ||
+ | -c : enables parallel processing by specifying the number of local CPUs | ||
+ | to use (for an SMP machine) or a filename containing a list of PVM node | ||
+ | names (one host name per line, per CPU). In the PVM case, if a node | ||
+ | is also an SMP machine and you want to use more than one CPU on that | ||
+ | node, list that node name once for each CPU | ||
+ | you want to use on it. If this option is not provided, | ||
+ | only one CPU is used on the local machine. | ||
+ | -n : the input file is not usually processed as one single query file. | ||
+ | Instead, it is sliced up into little parts and each part | ||
+ | is processed separately; this option is useful to tweak when | ||
+ | you also make use of the multi-CPU option (-c), as each slice can | ||
+ | be processed by one CPU. | ||
+ | -l : the minimum accepted length of the clear range for a sequence | ||
+ | to be considered valid. If the length of the clear range falls | ||
+ | below this value, a trash code is assigned to the input sequence | ||
+ | and it will be excluded from the output filtered file (*.clean) | ||
+ | -r : custom name of the cleaning report file (default: add the " | ||
+ | suffix to the input file name) | ||
+ | -o : custom name of the final "clear range" | ||
+ | the valid sequences (default: append " | ||
+ | input file name) | ||
+ | -x : set the minimum percent identity to be considered for an | ||
+ | alignment with a contaminant (default 96) | ||
+ | -y : minimum length of a terminal vector hit to be considered | ||
+ | (>11, default 11) | ||
+ | -N : disable trimming of low quality ends (ends rich in | ||
+ | N = undetermined bases) | ||
+ | -M : completely disable trashing of low quality (N-rich) sequences | ||
+ | -A : disable trimming of polyA tails from 3' end or polyT from 5' end of | ||
+ | the input sequences | ||
+ | -L : disable low-complexity analysis and the trashing of input sequences | ||
+ | by this criterion | ||
+ | -I : do not rebuild the .cidx file (if already there) | ||
+ | -m : send an e-mail notification to the given address at | ||
+ | the end of the cleaning process or in case of error | ||
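For the PVM case of the -c option, the node list file repeats a host name once per CPU to be used on that host. The helper below is a hypothetical convenience for producing such a list, not part of seqclean; the host names in the usage note are placeholders.

```python
def pvm_node_lines(cpus_per_host):
    """Expand {host: cpu_count} into one host name per line, per CPU."""
    lines = []
    for host, cpu_count in cpus_per_host.items():
        # A node with several CPUs is listed once for each CPU to use
        lines.extend([host] * cpu_count)
    return lines
```

Calling `pvm_node_lines({"nodeA": 2, "nodeB": 1})` gives `nodeA`, `nodeA`, `nodeB`, which can be written one per line to the file passed to -c.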
+ | |||
+ | If after seqclean one needs to trim the corresponding quality values too, | ||
+ | according to the new coordinates or trash codes found by seqclean, the | ||
+ | utility script " | ||
+ | a fasta-like file containing space delimited quality values for each nucleotide of | ||
+ | the original sequences. It should be run after seqclean, as it parses the | ||
+ | trimming (" | ||
+ | and applies them to the quality records. | ||
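The idea of applying clear range coordinates to a quality record can be illustrated as follows. This is a sketch of the principle only, not the utility script's implementation: the function name is hypothetical, and it assumes one space-delimited quality value per nucleotide with 1-based, inclusive coordinates.

```python
def trim_quality(qual_values, end5, end3):
    """Keep only the quality values that fall inside the clear range."""
    values = qual_values.split()
    # Convert the 1-based inclusive range to a Python slice
    return " ".join(values[end5 - 1:end3])
```

A sequence with a trash code would have its quality record dropped entirely rather than trimmed.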
seqclean/seqclean.1277309254.txt.gz · Last modified: 2010/06/23 16:07 by 172.26.15.75