BecA Bioinformatics

Quick Links

Useful Resources

Data analysis using CLC main workbench

The CLC Main Work bench is software package that supports analysis of your sequence data. The data analysis shown is of sequenced data generated by Sanger sequencing.
The functionalities of CLC Main Workbench are used for DNA, RNA, and protein sequence data analysis, such as gene expression analysis, primer design, molecular cloning, phylogenetic analyses, and sequence data management, amongst a wide variety of other features – all wrapped in an intuitive graphical user-interface.CLC Main Workbench is available on Windows, Mac OS X, and Linux platforms.

LINK: Download CLC Main workbench here.

Definitions
Sequence assembly: aligning and merging fragments of a much longer DNA sequence in order to reconstruct the original sequence.
Assembly to a reference sequence: assembly of the sequenced fragments using a similar sequence (a reference sequence) as a guide.
De novo assembly: assembly using only your sequenced fragments. This can be more difficult if the region of overlap is very small.
Conflict: a nucleotide position in the assembled sequence overlap which has different bases, or where there is a gap on one sequence and a base at the corresponding position on the other sequence.
DNA Contig: Contiguous sequence of DNA created by assembling overlapping DNA sequence reads. It consists of a multiple sequence alignment of sequence reads plus a consensus sequence.
Consensus sequence: A sequence of nucleotides [or amino acids] in common between regions of homology in related DNA [or protein] sequences.

Getting started

Download the workbench from the link provided above
Open the CLC Main Workbench software

Navigation area

View area

Navigation area is where all the data imported into CLC is kept.
View area is the main window to the right, where data is viewed and manipulated.

Creating a folder, it is best to organize data in the navigation area in folders.

To create a folder go to File | New | Folder

click on the new folder icon on the tool bar

Importing data: Allows you to bring sequenced data into CLC from where it is stored on your computer. You can import many file types but in this tutorial we will look at importing trace files for further analysis.
- >Go to File | import or click the import icon on the tool bar.
- Navigate to where your sequences are stored on your computer
- Select the file format to import.
- Select all the sequences you want to import (to import them concurrently) and click next.
- Select the folder to save the sequences to in CLC- in this case highlight My_folder and click finish
- The data is imported into CLC if you expand My_folder in the navigation area you should see your sequences
- Double click the sequence in the Navigation area to view it Figure 3.

Assembling sequences: trimming, conflicts, consensus

Trimming the sequences: The first step when analyzing the sequencing reads is to trim them. Trimming eliminates reads of poor quality (and in the case of cloned inserts removal of vector contamination).
- To trim sequences go to Toolbox menu Sequencing Data Analysis | Trim Sequences
- Select all the sequences to be trimmed on the left panel and use the arrow to move them to the right panel, click next.
- Specify the trimming parameters the limit on trim quality scores can be increased or reduced to make trimming more or less stringent respectively
- If the sequences were cloned products select the vector trimming options.
- Click next and choose to save the results.
Fig 4

When the trimming is done the parts of the sequences that are trimmed are not actually removed but trim annotations are saved to the sequences. These annotated sections are ignored in further analysis.

Fig 5
Assemble sequences: In most cases forward and reverse primers are used, hence on sequencing you end up with a forward and reverse sequence. Assembling the two sequences aligns the sequences where they overlap to get a contiguous sequence called a contig.

Assembly can be done to a reference sequence
- Comparison with a similar sequence obtained from a sequence database Or
De novo assembly: common in the cases that you do not have a reference sequence to the sequenced data.

Go to

Toolbox | Sequencing Data Analysis | Assemble sequences

Click next, this dialogue box allows one to define assembly parameters see Figure 7 below

Conflict can be voted on (A, C, G, T), Assigned degenerate bases or assigned an N for ambiguous base when conflicts arise in the overlapping regions; where the forward and reverse strand have different bases at the same position. This can be resolved at a later stage.
Output options, select create full contigs, including trace data. Click next.
The next dialogue box allows you to set trimming parameters if you had not trimmed the sequences prior to this, click next
Select the save option to save your assembled contig.

Edit | rename

Finding and editing conflicts: Conflict may arise during sequence assemble, as a result of a mismatch in the overlapping regions of the forward and reverse sequences.

find conflict

Resolved conflict”.

Quality of reads on both strands (take data from the most consistent sequence)
Which of the peaks is of better quality
In some cases two bases may be picked on either sequences because it is genuinely a SNP position. Judgment should be based on quality of reads but also background knowledge on the sequences been analyzed.
Degenerate codes – are used in the case that you cannot resolve conflicts.

Saving the consensus sequence: When done editing the contig, you can extract the consensus sequence to be used for further analysis.
Right click on the name “consensus” Open copy of sequence | Save This will make it possible to use the sequence for further analysis.
It is recommended that you create a separate folder to save the consensus sequences to avoid mix up with trace files or contigs.
Exporting sequences: You can export your consensus sequence out of CLC and save it in another location on you computer.
Highlight the sequence and go to File | Export or click the export icon on the task bar
This brings a dialogue box that allows you to select the file format you want to save your sequence in and the location in your computer to save the sequence file.
You can select to export your sequence in the FASTA (.fa / .fsa / .fasta) format as it is common import for many sequence analysis programs.

Sequence alignment using CLC Main workbench

CLC Main workbench can align two or more sequences to find regions of similarity.
To align sequences you need to use the consensus sequences or input sequences in fasta format:

From the toolbox menu select Alignment and Trees | Create Alignments
Set parameters for the alignment algorithm

You can choose to increase the gap settings to make the alignment more stringent.
Select either fast (less accurate) or slow (very accurate) alignment
You can choose to open the alignment and view it or save it.
Double click the alignment to view it.

Building Phylogenetic trees in CLC Main Workbench

It is possible to infer evolutionary relationships for aligned sequences. In CLC this can be done either by distance based methods or by statistical methods (Maximum likelihood approach)

From the toolbox menu select Alignment and Trees | Create Trees

Alignment and Trees | Maximum Likelihood Phylogeny

This options open up a dialogue box that requires you to select the alignment on which your tree will be built. Select the alignment on the left panel and move it to the right panel and click next.
The set parameters dialogue box allows one to set the tree parameters.
- For the distance based methods the tree building parameters include:
  - Selecting the algorithm. Either UPGMA or Neighbor Joining.
  - Bootstrap analysis is also done to evaluate the reliability of the inferred trees. The wizard allows one to adjust the number of replicates of bootstrap analysis.
  - Click finish to create the tree.
- For the Maximum Likelihood tree, a number of parameters are set,