Differences

This shows you the differences between two versions of the page.

--- seqclean:seqclean [2010/06/23 16:07] – 172.26.15.75
+++ seqclean:seqclean [2010/06/23 16:25] (current) – 172.26.15.75
@@ Line 17: / Line 17: @@
 The user is expected to provide the contaminant databases,
 they are not included in this package
 ====Usage and methods====
 ----
@@ Line 97: / Line 96: @@
 mode as mentioned above.
+The cleaning scripts keep track of the iterative trimming of the input
+sequences when a sequence matches more than one contaminant.
+The 5' end (end5) coordinate of each input sequence
+is initially set to 1, and the 3' end (end3) coordinate is initially set
+to the length of the input sequence. During the above-mentioned
+trimming procedures, end5 can be increased and/or end3 can be decreased.
+The final end5..end3 range (of length end3-end5+1) is considered to be the "clear range" of the
+sequence after going through the cleaning procedure. Whether or not trimming
+was applied, if the "clear range" length is shorter than a minimum
+value (default 100nt, can be set by the -l option), the sequence is
+considered invalid and is trashed. Also, at the end of
+the cleaning procedure, the percentage of undetermined bases from the
+clear range is computed and the sequence is also trashed if this
+percentage is larger than 3%.
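The clear-range bookkeeping described above can be sketched as follows. This is a minimal illustration and not seqclean's actual code: the `ClearRange` class and its method names are hypothetical, coordinates are assumed to be 1-based and inclusive, and undetermined bases are assumed to be `N` characters. The 100 nt and 3% thresholds are the ones stated in the text.

```python
from dataclasses import dataclass

MIN_CLEAR_LEN = 100   # default of the -l option, as stated in the text
MAX_N_PERCENT = 3.0   # undetermined-base threshold stated in the text


@dataclass
class ClearRange:
    """Hypothetical helper: tracks 1-based, inclusive end5/end3 coordinates."""
    seq: str
    end5: int = 1
    end3: int = 0  # 0 means "not set yet"; replaced by len(seq) below

    def __post_init__(self):
        if self.end3 == 0:
            self.end3 = len(self.seq)

    def trim(self, new_end5=None, new_end3=None):
        # Trimming can only shrink the range: end5 grows, end3 shrinks.
        if new_end5 is not None:
            self.end5 = max(self.end5, new_end5)
        if new_end3 is not None:
            self.end3 = min(self.end3, new_end3)

    def length(self):
        return max(0, self.end3 - self.end5 + 1)

    def n_percent(self):
        # Percentage of undetermined bases (assumed to be 'N') in the clear range.
        clear = self.seq[self.end5 - 1:self.end3]
        if not clear:
            return 0.0
        return 100.0 * clear.upper().count("N") / len(clear)

    def is_valid(self, min_len=MIN_CLEAR_LEN, max_n=MAX_N_PERCENT):
        # Trashed if shorter than the minimum or if N content exceeds the limit.
        return self.length() >= min_len and self.n_percent() <= max_n
```

For example, a 150 nt sequence trimmed to positions 21..140 keeps a 120 nt clear range and stays valid, while a further trim to end3 = 100 leaves 80 nt and would be trashed.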
+==== Cleaning report format ====
+----
+Each line in the cleaning report file (*.cln) has 7 tab-delimited fields
+as follows:
+. the name of the input sequence
+. the percentage of undetermined bases in the clear range
+. 5' coordinate after cleaning
+. 3' coordinate after cleaning
+. initial length of the sequence
+. trash code
+. trimming comments (contaminant names, reasons for trimming/trashing)
+The trash code field (6) should be empty if (part of) a sequence is
+considered valid, so that it can be found in the final filtered file (*.clean).
+The trash code field will be set to the file name of the last contaminant
+database, if trimming against it caused the clear range to fall below the minimum value
+(-l parameter, default 100). There are three reserved values of
+the trash code:
+  "shortq" - assigned when the sequence length decreases
+             below the minimum accepted length (-l) after polyA
+             or low quality ends trimming;
+"low_qual" - assigned when the percentage of undetermined bases
+             is greater than 3% in the clear range;
+    "dust" - assigned when less than 40nt of the sequence
+             is left unmasked by the "dust" low-complexity filter;
+The reasons and the coordinates for trimming are mentioned in the 7th
+field. When trimming was due to a contaminant match, the contaminant
+name and the overlap coordinates are mentioned. When trimming was due
+to polyA tail or low quality ends removal, the "trimpoly" program name is
+mentioned along with the trimming coordinates.
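A reader for the report format described above might look like the sketch below. It is an illustration under assumptions rather than part of the seqclean package: `ClnRecord` and `parse_cln_line` are hypothetical names, each line is assumed to carry exactly 7 tab-separated fields in the order listed, and surrounding whitespace in a field is assumed to be insignificant.

```python
from typing import NamedTuple

# The three reserved trash codes listed in the text.
RESERVED_TRASH_CODES = {"shortq", "low_qual", "dust"}


class ClnRecord(NamedTuple):
    """Hypothetical container for one line of a *.cln report."""
    name: str
    n_percent: float      # % undetermined bases in the clear range
    end5: int             # 5' coordinate after cleaning
    end3: int             # 3' coordinate after cleaning
    initial_length: int
    trash_code: str       # empty when the sequence is kept
    comments: str

    @property
    def is_valid(self):
        # An empty trash code means the sequence appears in the *.clean file.
        return self.trash_code == ""


def parse_cln_line(line):
    """Split one report line into its 7 tab-delimited fields."""
    fields = [f.strip() for f in line.rstrip("\n").split("\t")]
    if len(fields) != 7:
        raise ValueError(f"expected 7 tab-delimited fields, got {len(fields)}")
    name, n_pct, end5, end3, length, trash, comments = fields
    return ClnRecord(name, float(n_pct), int(end5), int(end3),
                     int(length), trash, comments)
```

A trash code that is neither empty nor one of the reserved values would, per the description above, be the file name of a contaminant database.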
+Besides the -s and -v parameters mentioned above, here is a brief summary
+of the other parameters:
+-c  : enables parallel processing by specifying the number of local CPUs
+      to use (for a SMP machine) or a filename containing a list of PVM node
+      names (one host name per line, per CPU). In the PVM case, if a node
+      is also a SMP machine and you want to use more than one CPU on that
+      node, you should list that same node name once for each CPU
+      you want to use on that node. If this option is not provided,
+      only one CPU is used on the local machine.
+-n  : the input file is not usually processed as one single query file.
+      Instead, it is sliced up into little parts and each part
+      is processed separately; this option is useful to tweak when
+      you also make use of the multi-CPU option (-c), as each slice can
+      be processed by one CPU.
+-l  : the minimum accepted length of the clear range in order
+      to be considered valid. If the length of the clear range falls
+      below this value, a trash code is assigned to the input sequence
+      and it will be excluded from the output filtered file (*.clean).
+-r  : custom name of the cleaning report file (default: add the ".cln"
+      suffix to the input file name)
+-o  : custom name of the final "clear range"-only FASTA file containing only
+      the valid sequences (default: append ".clean" suffix to the
+      input file name)
+-x  : set the minimum percent identity to be considered for an
+      alignment with a contaminant (default 96)
+-y  : minimum length of a terminal vector hit to be considered
+      (>11, default 11)
+-N  : disable any attempt of trimming of low quality ends (ends rich in
+      N = undetermined bases)
+-M  : completely disable trashing of low quality (N-rich) sequences
+-A  : disable trimming of polyA tails from 3' end or polyT from 5' end of
+      the input sequences
+-L  : disable low-complexity analysis and the trashing of input sequences
+      by this criterion
+-I  : do not rebuild the .cidx file (if already there)
+-m  : enable sending of e-mail notification to the mentioned address, at
+      the end of the cleaning process or in case of error
+If the corresponding quality values also need to be trimmed after seqclean,
+according to the new coordinates or trash codes found by seqclean, the
+utility script "cln2qual" is included (see its usage message). It expects
+a FASTA-like file containing space-delimited quality values for each nucleotide of
+the original sequences. It should be run after seqclean, as it parses the
+trimming ("clear range") coordinates and trash codes from the cleaning report
+and applies them to the quality records.
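The idea of applying a clear range to a quality record can be illustrated with the short sketch below. It is not the cln2qual implementation: `trim_quality_record` is a hypothetical function, the quality values are assumed to be integers given as one space-delimited string per sequence, and the coordinates are assumed to be 1-based and inclusive, as in the cleaning report.

```python
def trim_quality_record(qual_text, end5, end3, trash_code=""):
    """Return the quality values inside the clear range, or None if trashed.

    qual_text  : space-delimited quality values, one per nucleotide
    end5, end3 : 1-based, inclusive clear-range coordinates from the report
    trash_code : a non-empty value means the sequence was discarded
    """
    if trash_code:
        return None
    quals = [int(q) for q in qual_text.split()]
    # Convert the 1-based inclusive range to a Python slice.
    return quals[end5 - 1:end3]
```

With a clear range of 2..4, the record `10 20 30 40 50` would be reduced to the three middle values, while any record whose report line carries a trash code would be dropped entirely.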