seqclean:seqclean

Differences

This shows you the differences between two versions of the page.

seqclean:seqclean [2010/06/23 16:07] 172.26.15.75seqclean:seqclean [2010/06/23 16:25] (current) 172.26.15.75
Line 17: Line 17:
 The user is expected to provide the contaminant databases,
 they are not included in this package
- 
 ====Usage and methods====
 ----
Line 97: Line 96:
 mode as mentioned above.
  
 +The cleaning scripts keep track of the iterative trimming of each input
 +sequence, which may result from multiple matches with various contaminants.
 +The 5' end (end5) coordinate of each input sequence is initially set to 1,
 +and the 3' end (end3) coordinate is initially set to the length of the
 +sequence. During the trimming procedures mentioned above, end5 can be
 +increased and/or end3 can be decreased. The final end5..end3 interval is
 +considered to be the "clear range" of the sequence after the cleaning
 +procedure; its length is end3-end5+1. Whether or not trimming was applied,
 +if the "clear range" length is shorter than a minimum value (default 100 nt,
 +can be set with the -l option), the sequence is considered invalid and is
 +trashed. Also, at the end of the cleaning procedure, the percentage of
 +undetermined bases in the clear range is computed, and the sequence is
 +also trashed if this percentage is larger than 3%.
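
The bookkeeping described above can be sketched in a few lines of Python. This is only an illustration of the end5/end3 rules and the two validity checks, not the code used by the cleaning scripts; the names `ClearRange`, `percent_undetermined` and `is_valid` are made up for this example, and undetermined bases are assumed to be written as N.

```python
from dataclasses import dataclass

MIN_CLEAR_LEN = 100   # default of the -l option
MAX_N_PERCENT = 3.0   # undetermined-base threshold


@dataclass
class ClearRange:
    """Hypothetical record; coordinates are 1-based and inclusive."""
    seq_len: int
    end5: int = 1
    end3: int = 0  # 0 means "not set yet"; replaced by seq_len below

    def __post_init__(self):
        if self.end3 == 0:
            self.end3 = self.seq_len

    def trim(self, new_end5=None, new_end3=None):
        # end5 can only increase, end3 can only decrease
        if new_end5 is not None:
            self.end5 = max(self.end5, new_end5)
        if new_end3 is not None:
            self.end3 = min(self.end3, new_end3)

    @property
    def length(self):
        # clear range length is end3 - end5 + 1 (never negative)
        return max(0, self.end3 - self.end5 + 1)


def percent_undetermined(seq, cr):
    """Percentage of N bases inside the clear range of seq."""
    clear = seq[cr.end5 - 1:cr.end3]
    if not clear:
        return 0.0
    return 100.0 * clear.upper().count("N") / len(clear)


def is_valid(seq, cr, min_len=MIN_CLEAR_LEN, max_n=MAX_N_PERCENT):
    """True if the sequence would survive both checks described above."""
    return cr.length >= min_len and percent_undetermined(seq, cr) <= max_n
```

Successive calls to `trim()` model successive contaminant or trimpoly hits; the validity check is applied once, on the final coordinates.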
 +
 +==== Cleaning report format ====
 +----
 +Each line in the cleaning report file (*.cln) has 7 tab-delimited fields 
 +as follows:
 +
 +1. the name of the input sequence
 +2. the percentage of undetermined bases in the clear range 
 +3. 5' coordinate after cleaning
 +4. 3' coordinate after cleaning
 +5. initial length of the sequence
 +6. trash code
 +7. trimming comments (contaminant names, reasons for trimming/trashing)
 +
 +The trash code field (6) is empty if the sequence, or part of it, is
 +considered valid; such a sequence appears in the final filtered file (*.clean).
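
For downstream scripts, a report line can be split into the seven fields listed above. The following is a minimal sketch, assuming the numeric fields are always present and well-formed; `ClnRecord`, `parse_cln_line` and `is_kept` are hypothetical names, not part of the seqclean package.

```python
from typing import NamedTuple


class ClnRecord(NamedTuple):
    name: str          # field 1: input sequence name
    percent_n: float   # field 2: % undetermined bases in the clear range
    end5: int          # field 3: 5' coordinate after cleaning
    end3: int          # field 4: 3' coordinate after cleaning
    initial_len: int   # field 5: initial sequence length
    trash_code: str    # field 6: empty if the sequence is valid
    comments: str      # field 7: trimming comments


def parse_cln_line(line):
    """Split one tab-delimited *.cln line into a ClnRecord."""
    fields = [f.strip() for f in line.rstrip("\n").split("\t")]
    # tolerate lines whose trailing empty fields were dropped
    fields += [""] * (7 - len(fields))
    name, pct, end5, end3, length, trash, comments = fields[:7]
    return ClnRecord(name, float(pct), int(end5), int(end3),
                     int(length), trash, comments)


def is_kept(rec):
    """A record with an empty trash code ends up in the *.clean file."""
    return rec.trash_code == ""
```

Iterating over the report file and filtering with `is_kept()` gives the names of the sequences that should be present in the filtered output.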
 +
 +The trash code field is set to the file name of the last contaminant
 +database if matches against that database caused the clear range to fall
 +below the minimum value (-l parameter, default 100). There are three
 +reserved values of the trash code:
 +
 +  "shortq" - assigned when the sequence length decreases 
 +             below the minimum accepted length (-l) after polyA            
 +             or low quality ends trimming;
 +"low_qual" - assigned when the percentage of undetermined bases
 +             is greater than 3% in the clear range;
 +    "dust" - assigned when less than 40nt of the sequence 
 +             is left unmasked by the "dust" low-complexity filter;
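
A script reading the report may want to turn the trash code into a readable reason. The sketch below follows the rules above: empty means valid, the three reserved values have fixed meanings, and anything else is taken to be a contaminant database file name. The function name and the database name used in the usage note are illustrative only.

```python
# The three reserved trash codes and their meaning, as described above
RESERVED_TRASH_CODES = {
    "shortq": "too short after polyA or low quality ends trimming",
    "low_qual": "more than 3% undetermined bases in the clear range",
    "dust": "less than 40nt left unmasked by the dust filter",
}


def describe_trash_code(code):
    """Return a short human-readable reason for a trash code value."""
    code = code.strip()
    if not code:
        return "valid"
    if code in RESERVED_TRASH_CODES:
        return RESERVED_TRASH_CODES[code]
    # any other value is assumed to be a contaminant database file name
    return "clear range too short after contaminant hits from " + code
```

For example, `describe_trash_code("my_vectors")` (a made-up database name) reports that contaminant hits from that database shortened the clear range below the minimum.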
 +
 +
 +The reasons and the coordinates for trimming are mentioned in the 7th
 +field. When trimming was due to a contaminant match, the contaminant 
 +name and the overlap coordinates are mentioned. When trimming was due 
 +to polyA tail or low quality ends removal, the "trimpoly" program name is 
 +mentioned along with the trimming coordinates.
 +
 +Besides the -s and -v parameters mentioned above, here is a brief summary
 +of the other parameters:
 +-c  : enables parallel processing by specifying the number of local CPUs
 +      to use (for an SMP machine) or a filename containing a list of PVM node
 +      names (one host name per line, per CPU). In the PVM case, if a node
 +      is also an SMP machine and you want to use more than one CPU on that
 +      node, list that node name once for each CPU you want to use on it.
 +      If this option is not provided, only one CPU is used on the local
 +      machine.
 +-n  : the input file is usually not processed as a single query file.
 +      Instead, it is sliced into smaller parts and each part is
 +      processed separately; this option is useful to tune when you
 +      also use the multi-CPU option (-c), as each slice can be
 +      processed by one CPU.
 +-l  : the minimum length the clear range must have for the sequence
 +      to be considered valid. If the length of the clear range falls
 +      below this value, a trash code is assigned to the input sequence
 +      and it will be excluded from the output filtered file (*.clean)
 +-r  : custom name of the cleaning report file (default: add the ".cln"
 +      suffix to the input file name)
 +-o  : custom name of the final "clear range"-only FASTA file containing only 
 +      the valid sequences (default: append ".clean" suffix to the 
 +      input file name)
 +-x  : set the minimum percent identity to be considered for an 
 +      alignment with a contaminant (default 96)
 +-y  : minimum length of a terminal vector hit to be considered
 +      (>11, default 11)           
 +-N  : disable any attempt of trimming of low quality ends (ends rich in 
 +      N = undetermined bases)
 +-M  : completely disable trashing of low quality (N-rich) sequences     
 +-A  : disable trimming of polyA tails from 3' end or polyT from 5' end of
 +      the input sequences
 +-L  : disable low-complexity analysis and the trashing of input sequences 
 +      by this criterion
 +-I  : do not rebuild the .cidx file (if already there)
 +-m  : enable sending of an e-mail notification to the given address, at
 +      the end of the cleaning process or in case of error
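
As an illustration of the node list expected by -c in the PVM case (one host name per line, repeated once per CPU), the following sketch builds the lines of such a file from a host-to-CPU-count mapping. The function name, host names and file name are made up for this example.

```python
def pvm_node_lines(cpus_per_node):
    """Return the node list lines: each host name repeated once per CPU."""
    lines = []
    for host, cpus in cpus_per_node.items():
        if cpus < 1:
            raise ValueError("CPU count for %s must be at least 1" % host)
        # an SMP node using several CPUs is listed several times
        lines.extend([host] * cpus)
    return lines


def write_pvm_node_file(path, cpus_per_node):
    """Write the node list to path, one host name per line."""
    with open(path, "w") as fh:
        for line in pvm_node_lines(cpus_per_node):
            fh.write(line + "\n")
```

Calling `write_pvm_node_file("pvm_nodes.txt", {"nodeA": 2, "nodeB": 1})` would produce a three-line file (nodeA, nodeA, nodeB) whose name could then be passed to -c.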
 +
 +If, after running seqclean, the corresponding quality values also need to
 +be trimmed according to the new coordinates or trash codes found by seqclean,
 +the utility script "cln2qual" is included for this purpose (see its usage
 +message). It expects a FASTA-like file containing space-delimited quality
 +values for each nucleotide of the original sequences. It should be run after
 +seqclean, as it parses the trimming ("clear range") coordinates and trash
 +codes from the cleaning report and applies them to the quality records.
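
The idea of applying the clear range to a quality record can be sketched as follows. This is not the cln2qual script itself, only an illustration under the assumption that trashed sequences are dropped and that coordinates are 1-based and inclusive; the function names are hypothetical.

```python
def parse_quality_values(text):
    """Parse space-delimited quality values (possibly on several lines)."""
    return [int(q) for q in text.split()]


def trim_quality(quals, end5, end3, trash_code=""):
    """Return the quality values of the clear range, or None if trashed."""
    if trash_code:
        # assumption: trashed sequences get no quality record at all
        return None
    # convert the 1-based inclusive range to a Python slice
    return quals[end5 - 1:end3]
```

With `end5=3` and `end3=5`, a record holding the values `10 12 30 35 40 8` is reduced to `30 35 40`, matching the trimmed nucleotide sequence position by position.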
  
  
  
seqclean/seqclean.1277309254.txt.gz · Last modified: 2010/06/23 16:07 by 172.26.15.75